ESSnet on Data Integration

Final Workshop, Madrid 2011


Table of Contents

Introduction .......................................... 2
Background ............................................ 2
Workshop .............................................. 3
Section I – Record linkage ............................ 5
Section II – Statistical matching ..................... 58
Section III – Data Integration in practice ............ 88


INTRODUCTION

BACKGROUND

The project "ESSnet on Data Integration" (ESSnet DI), carried out by the NSIs of Italy, the Netherlands, Norway, Poland, Spain and Switzerland, aims at promoting knowledge and practical application of sound statistical methods for the joint use of existing data sources in the production of official statistics, and at disseminating this knowledge within the ESS. The project continues the efforts already made in the "ESSnet on Statistical Methodologies: area integration of surveys and administrative data" (ESSnet on ISAD).

Cooperation on data integration issues at the European level is extremely useful. Data integration is seldom the task of a centralised NSI office, and knowledge is usually sparse within each NSI, whose production units may use different practices, quality evaluations and technical/software tools. At the European level these differences are even larger.

The project focuses on statistical methods of data integration. In the European panorama, the development of data integration methods follows two broad lines:

1. The necessity to jointly exploit two or more data sources that can be linked with reliable unit identifiers led to a set of methods known as micro integration, whose aim is to ensure that the integrated data set is of good quality (fulfilment of edit checks, timeliness, representativeness, …);
2. The absence of good unit identifiers (for privacy reasons, lack of quality, or the use of sample surveys only) led some countries to apply and develop alternative data integration approaches that make explicit use of statistical procedures for detecting the records to be linked in different data sets: record linkage when the data sets to be linked observe the same set of units, and statistical matching when the two data sets do not have any units in common (e.g. when both are sample surveys).

In order to promote knowledge and application of data integration methods in the ESS, the actions performed during this ESSnet addressed the following issues:

1. To develop and organise common knowledge. This was achieved through papers updating existing state-of-the-art documents (such as those developed during the ESSnet on ISAD) and through bibliographies and repositories of papers.
2. To develop methods in specific sub-domains. This task addressed issues that have not yet been solved; the result consists of 7 research papers reporting improvements in micro integration, record linkage and statistical matching.
3. To provide users with tools. Some software tools for data integration (Relais for record linkage and StatMatch for statistical matching) were already available in one member state (Italy). These tools are written with open source software (mainly R) and are freely available. In the present project they were improved with new functionalities, and their manuals were updated with examples and vignettes for practical use. Case studies were also produced, reporting the steps performed in practice when dealing with a problem of micro integration, record linkage or statistical matching.
4. To foster knowledge transfer. In addition to the case studies, knowledge transfer was ensured by three on-the-job training courses on record linkage and statistical matching, one project course on data integration, one final workshop on data integration, and participation at relevant workshops and conferences.


The project output has been uploaded to the ESSnet portal: http://www.essnet-portal.eu/di/data-integration

The ESSnet DI has strongly contributed, through all its outputs, to the spread of knowledge of statistical methods for data integration in the ESS, as well as to the development of new methodologies and the assessment of a clear framework for micro integration. Nevertheless, much work remains to be done, and the fruitful cooperation with the project partners and academic institutions should be maintained.

WORKSHOP

The ESSnet on Data Integration workshop (Madrid, 24-25 November 2011) was organized in one introductory session and six specialized sessions. The introductory session was devoted to the ESSnet results. The specialized sessions covered specific data integration areas:

1. Record Linkage
2. Statistical Matching
3. Micro Integration Processing
4. Practical experience on Data Integration and related domains
5. Register based statistics
6. Integration of administrative data and surveys

These proceedings contain only the papers of the specialized sessions, with the exclusion of session 3, which refers explicitly to ESSnet experiences already collected in the ESSnet documentation on WP1, WP2 and WP4 on micro integration. The papers in this volume are organized in three chapters:

Section I Record linkage
1) Cleaning and using administrative lists: Methods and fast computational algorithms for record linkage and modeling/editing/imputation (William E. Winkler)
2) Hierarchical Bayesian Record Linkage (Brunero Liseo and Andrea Tancredi)
3) Applications of record linkage to population statistics in the UK (Dick Heasman)
4) Integrating registers: Italian business register and patenting enterprises (Daniela Ichim, Giulio Perani and Giovanni Seri)
5) Linking information to the ABS Census of Population and Housing in 2011 (Graeme Thompson)

Section II Statistical matching
6) Measuring uncertainty in statistical matching for discrete distributions (Pier Luigi Conti and Daniela Marella)
7) Statistical matching: a case study on EU-SILC and LFS (Aura Leulescu, Mihaela Agafitei and Jean-Louis Mercy)
8) Data Integration Application with Coarsened Exact Matching (Mariana Kotzeva and Roumen Vesselinov)


Section III Data integration in practice
9) Data integration and small domain estimation in Poland – experiences and problems (Elżbieta Gołata)
10) Quality Assessment of register-based Statistics - Preliminary Results for the Austrian Census 2011 (Manuela Lenk)
11) The integration of the Spanish Labour Force Survey with the administrative source of persons with disabilities (Amelia Fresneda)
12) Comparative analysis of different income components between the administrative records and the Living Conditions Survey (José María Méndez)
13) Administrative data as input and auxiliary variables to estimate background data on enterprises in the CVT survey 2011 (Eva Maria Asamer)
14) Transforming administrative data to statistical data using ETL tools –extraction, transformation, loading– (Paulina Kobus and Pawel Murawski)
15) Case study: Job Churn Explorer project at CSO, Ireland (John Dunne)
16) The system of short term business statistics on Labour in Italy. The challenges of data integration (Ciro Baldi, Diego Bellisai, Francesca Ceccato, Silvia Pacini, Laura Serbassi, Marina Sorrentino, Donatella Tuzi)
17) Obtaining statistical information in sampling surveys from administrative sources: Case study of Spanish Labour Force Survey (LFS) variable "wages from the main job" (Honorio Bueno and Javier Orche)

Mauro Scanu
ESSnet on Data Integration coordinator
Istat, via Depretis 77, Roma
[email protected]

Acknowledgements

First of all, I would like to thank the invited speakers (Pier Luigi Conti, Elzbieta Golata, Manuela Lenk, Brunero Liseo, William E. Winkler) and all the workshop speakers for their high-level presentations. Special thanks are due to the workshop host, INE, for their excellent hospitality, and to Eurostat for the constant support and guidance. My personal thanks go to the ESSnet partners: their professionalism and co-operative spirit made both the workshop and the ESSnet outputs a success. These are their names in alphabetical order: Adam Ambroziak, Bart Bakker, Cristina Casciano, Nicoletta Cibella, Paolo Consolini, Marcello D’Orazio, Marco Di Zio, Gervasio-Luís Fernández Trasobares, Marco Fortini, Johan Fosen, Dehnel Grażyna, Miguel Guigó Pérez, Francisco Hernandez Jimenez, Daniela Ichim, Tomasz Józefowski, Daniel Kilchmann, Tomasz Klimanek, Paul Knottnerus, Jacek Kowalewski, Ewa Kowalka, Léander Kuijvenhoven, Frank Linder, Andrzej Młodak, Nino Mushkudiani, Filippo Oropallo, Artur Owczarkowski, Jeroen Pannekoek, Jan Paradysz, Laura Peci, Francesca Romana Pogelli, Monica Scannapieco, Eric Schulte Nordholt, Jean-Pierre Renfer, Wojciech Roszka, Pietrzak Beata Rynarzewska, Giovanni Seri, Marcin Szymkowiak, Tiziana Tuoto, Luca Valentino, Arnout van Delden, Dominique van Roon, Magdalena Zakrzewska, Li-Chun Zhang.


Section I – Record linkage

Cleaning and using administrative lists: Methods and fast computational algorithms for record linkage and modeling/editing/imputation

William E. Winkler
U.S. Bureau of the Census, [email protected]

Abstract - Administrative lists offer great opportunity for analyses that provide quantities for policy decisions. This is particularly true when groups of administrative lists are combined with survey and other data. To produce accurate analyses, data need to be cleaned and corrected according to valid subject matter rules. This paper describes methods and associated computational algorithms that, while often being easier to apply, are sometimes 40-100 times as fast as classical methods. This means that moderate-size administrative files can be cleaned (via modeling/edit/imputation) to eliminate contradictory or missing quantitative data to yield valid joint distributions, unduplicated within files, and matched and merged across files in a matter of weeks or months.

Keywords: quality, merging, computational algorithms

1. Introduction

Well collected and processed administrative data can be of great use for providing enhanced aggregates and microdata for analytic purposes. In this paper, we assume that data are collected and processed in a manner that minimizes error. We describe three methods for processing data. The first are modeling/edit/imputation methods for filling in missing data and ‘correcting’ erroneous or contradictory data. The second are record linkage (entity resolution) methods for matching files using common quasi-identifiers such as name, address, date-of-birth, and other characteristics. The third are methods for adjusting analyses of merged files for linkage error. The modeling/edit/imputation methods are based on the theoretical model and suggested algorithms of Fellegi and Holt (1976, hereafter FH). Versions of generalized software for editing and certain types of imputation have been in use in a few statistical agencies for more than ten years. What is new is a rigorous method of theoretically connecting editing with modern imputation such as given in Little and Rubin (2002). Winkler (2003) introduced the theory for discrete data and provided extremely fast computational algorithms (2008, 2010b) in highly automated, parameter-driven software. The set-covering algorithms (Winkler 1997) for enumerating all implicit edits are 100 times as fast as those of IBM based on the ideas of Garfinkel, Kunnathur, and Liepins (1986). The modeling, imputation, and imputation-variance algorithms are on the order of 100 times as fast as those in commercial or experimental university software. The record linkage algorithms (Yancey and Winkler 2005-2009, Winkler, Yancey, and Porter 2010) are 40+ times as fast as recent parallel software from Stanford and Penn State (Kawai et al. 2006, Kim and Lee 2007) and 500+ times as fast as software used in some government agencies (e.g., Wright 2010). The analysis-adjustment methods for merged files are still quite preliminary (Scheuren and Winkler 1993, 1997; Lahiri and Larsen 2005, Chambers 2009, Tancredi and Liseo 2011), with the main difficulties being seen as properly creating an overall model of the record linkage process and having suitable generalized methods for adjusting analyses for error. The methods of Chambers (2009) appear to show great promise in drastically simplified record linkage situations and simple simulations but may not extend to the more general and far more realistic situations of Lahiri and Larsen (2005). At issue in all of the work are methods for estimating suitable probabilities of matching for all pairs (typically without training data). Lahiri and Larsen (2005) and Chambers (2009) assume that extremely large resources and time may be

available for follow-up on an exceptionally large number of pairs to determine matching probabilities. Tancredi and Liseo give the most general methods, but the computational methods for the MCMC algorithms in their Bayesian analysis are presently only suitable for very small situations. Scheuren and Winkler (1993) made simplifications in the adjustment procedures because they were able to make use of methods due to Belin and Rubin (1995) for estimating match probabilities. A more general method for estimating match probabilities (Winkler 2006) mimics ideas from semi-supervised learning (see e.g. Larsen and Rubin 2001, Winkler 2002, Nigam et al. 2000) but also does not use training data. Although there is bias in the Scheuren-Winkler adjustment, in a number of empirical situations the bias counter-balances against other biases so that the overall procedure is relatively unbiased.

A conceptual picture would link records in file A = (a1, …, an, x1, …, xk) with records in file B = (b1, …, bm, x1, …, xk) using common identifying information (x1, …, xk) to produce the merged file A × B = (a1, …, an, b1, …, bm) for analyses. The variables x1, …, xk are quasi-identifiers such as names, addresses, dates-of-birth, and even fields such as income (when processed and compared in a suitable manner). An individual quasi-identifier will not uniquely identify the correspondence between pairs of records associated with the same entity; sometimes combinations of the quasi-identifiers may. Survey files routinely require cleanup via edit/imputation, and administrative files may also require similar cleanup. If there are errors in the linkage, then a completely erroneous (b1, …, bm) may be linked with a given (a1, …, an), and the joint distribution of (a1, …, an, b1, …, bm) in A × B may be very seriously compromised. If there is inadequate cleanup (i.e., ineffective edit/imputation) of A = (a1, …, an, x1, …, xk) and B = (b1, …, bm, x1, …, xk), then analyses may have other serious errors in addition to those due to the linkage errors.

The purpose of the paper is to describe the available newer theoretical ideas and new computational algorithms. If we have several administrative lists, each with 100 million to one billion records, then the clean-up, merging, and analyses might be performed in 3-4 months with this software that is 40-100 times as fast. Without the faster software, the problem of extensive cleanup, merging, and analysis of sets of large administrative lists is computationally intractable. In the next three sections, we provide background and insight into modeling/edit/imputation, record linkage, and adjustment of analyses for linkage error.

2. Modeling/edit/imputation

In this section we provide background on classical edit/imputation that uses hot-deck and describe how hot-deck was assumed to work by practitioners. As far as we know, there has never been a rigorous development that justifies some of the assumed properties of hot-deck. We also provide background on methods of creating loglinear models (Bishop, Fienberg and Holland 1975) that are straightforward to apply to general discrete data, on general methods of imputation and editing for missing data under linear constraints that extend the basic methods and are also straightforward to apply, and an elementary review of the EM algorithm. The application of the general methods and software is straightforward and can be done without any modifications that are specific to a particular data file or analytic use.

2.1 Classical data collection, edit rules, and hot-deck imputation

The intent of classical data collection and clean-up was to provide a data file that was free of logical errors and missing data. For a statistical agency, a survey form might be filled out by an interviewer during a face-to-face interview with the respondent. The ‘experienced’ interviewer would often be able to ‘correct’ contradictory data or ‘replace’ missing data during the interview. At a later time analysts might make further ‘corrections’ prior to the data being placed in computer files. The purpose was to produce a ‘complete’ (i.e., no missing values) data file that had no contradictory values in some variables. The final ‘cleaned’ file would be suitable for various statistical analyses. In particular, the statistical file would

allow determination of the proportion of specific values of the multiple variables (i.e., joint inclusion probabilities).

Naïvely, dealing with edits is straightforward. If a child of less than sixteen years old is given a marital status of ‘married’, then either the age associated with the child might be changed (i.e., to older than 16) or the marital status might be changed to ‘single’. The difficulty consistently arose that, as a (computerized) record r0 was changed to a different record r1 by changing values in fields in which edits failed, the new record r1 would fail other edits that the original record r0 had not failed. Fellegi and Holt (1976) were the first to provide an overall model to assure that a changed record r1 would not fail edits. Their theory required the computation of all implicit edits that could be logically derived from an originally specified set of ‘explicit’ edits. If the implicit edits were available, then it was always possible to change an edit-failing record r0 to an edit-passing record r1. The availability of ‘implicit’ edits makes it quite straightforward and fast to determine the minimum number of fields to change in an edit-failing record r0 to obtain an edit-passing record r1 (Barcaroli and Venturi 1997). Further, Fellegi and Holt indicated how hot-deck might be used to provide the values for filling in missing values or replacing contradictory values. As shown in Winkler (2008b), hot-deck is not generally suitable for filling in missing values in a manner that yields records that satisfy edits and preserve joint distributions. Indeed, the imputation methods in use at a variety of statistical agencies, and those that are currently being investigated, do not assure that aggregates of records satisfy joint distributions and that individual records satisfy edits.

The early set-covering algorithms necessary for the computation of ‘implicit’ edits required extremely large amounts of computer time (Garfinkel, Kunnathur, and Liepins 1986). A later algorithm (Winkler 1997), while as much as 100 times as fast, is not completely theoretically valid but works in most situations where skip patterns are not present in the survey form (see also Winkler and Chen 2002). Due to hardware-speed increases, the latter algorithm should work well in most day-to-day survey situations. Both Winkler (1997) and Boskovitz (2008) provided counterexamples to Theorem 1 in Garfinkel et al. (1986), which gave a method for greatly simplifying the set-covering algorithms for implicit-edit generation. Boskovitz (2008) provided a complete theoretical development (including data with skip patterns); however, software based on her algorithms has not yet been written and will likely be 10 times as slow due to the significantly greater amount of information that must be accounted for at different levels of the computational algorithms.

The intent of filling in missing or contradictory values in an edit-failing record r0 is to obtain a record r1 that can be used in computing the joint probabilities in a principled manner. The difficulty that had been observed by many individuals is that a well-implemented hot-deck does not preserve joint probabilities. Rao (1997) provided a theoretical characterization of why hot-deck fails even in two-dimensional situations. The failure occurs even in ‘nice’ situations where individuals had previously assumed that hot-deck would work well.
In a real-world survey situation, subject matter ‘experts’ may develop hundreds or thousands of if-then-else rules that are used for the editing and hot-deck imputation. Because it is exceptionally difficult to develop the logic for such rules, most edit/imputation systems do not assure that records satisfy edits or preserve joint inclusion probabilities. Further, such systems are exceptionally difficult to implement because of (1) logic errors in specifications, (2) errors in computer code, and (3) no effective modeling of hot-deck matching rules. As demonstrated by Winkler (2008b), it is effectively impossible, with the methods (classical if-then-else and hot-deck) that many agencies use, to develop edit/imputation systems that preserve joint probabilities or that create records that satisfy edit restraints. This is true even in situations where Fellegi-Holt methods are used for the editing and hot-deck is used for the imputation. An edit/imputation system that effectively uses the edit ideas of Fellegi and Holt (1976) and modern imputation ideas (such as in Little and Rubin 2002) has distinct advantages. First, it is far easier to implement (as demonstrated in Winkler 2008b, also 2010d): edit rules are in easily modified tables, and the logical consistency of the entire system is tested automatically according to the mathematics of the Fellegi-Holt model and additional requirements on the preservation of joint inclusion probabilities (Winkler 2003). Second, the optimization that determines the minimum number of fields to change or replace in an edit-failing record is in a fixed mathematical routine that does not need to change. Third,

imputation is determined from a model (limiting distribution). Most modeling is very straightforward: it is based on variants of loglinear modeling and extensions of missing-data methods that are contained in easily applied, extremely fast computational algorithms (Winkler 2006, 2008b; also 2010a). The methods create records that always satisfy edits and preserve joint inclusion probabilities.
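To make the error-localisation step concrete, the sketch below (a simplified, hypothetical illustration, not the generalized software discussed above) encodes a few explicit edits as failure predicates and searches for a minimum-size set of fields that, once blanked for re-imputation, leaves no explicit edit failing. The field names and edit rules are invented for the example; a production Fellegi-Holt system would also generate and use the implicit edits.

```python
from itertools import combinations

# Hypothetical explicit edits: each predicate returns True when a record
# FAILS the edit (i.e., the value combination is not allowed).
EDITS = [
    lambda r: r["age"] < 16 and r["marital_status"] == "married",
    lambda r: r["age"] < 14 and r["employment"] == "employed",
    lambda r: r["marital_status"] == "married" and r["relationship"] == "child",
]

def failed_edits(record, blanked=frozenset()):
    """Return the edits that still fail when the fields in `blanked` are
    treated as missing (an edit involving a blanked field cannot fail)."""
    failing = []
    for edit in EDITS:
        try:
            probe = {k: (None if k in blanked else v) for k, v in record.items()}
            if edit(probe):
                failing.append(edit)
        except TypeError:
            # An ordering comparison hit a blanked (None) field: skip this edit.
            continue
    return failing

def minimal_fields_to_change(record):
    """Error localisation: smallest set of fields to blank (and later
    re-impute) so that no explicit edit fails."""
    fields = list(record)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            if not failed_edits(record, frozenset(subset)):
                return set(subset)
    return set(fields)

record = {"age": 12, "marital_status": "married", "employment": "none",
          "relationship": "child"}
print(minimal_fields_to_change(record))   # e.g. {'marital_status'}
```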

2.2 How classical hot-deck is assumed to work

In this subsection we provide an explanation of some of the (possibly) subtle issues that significantly degrade the overall analytic characteristics of realistic data files (8 or more variables) that are subjected to well-implemented hot-deck. The reason that the issues may be subtle is that in many situations with hot-deck, the probabilistic model is not written down and the effects of the statistical evaluations (say logistic or ordinary regression) on hot-deck collapsing rules for matching are not evaluated. We will describe why it is effectively impossible in many practical survey situations to do the empirical testing and develop the program logic necessary for a well-implemented hot-deck. Prior to this we provide some notation and background that will allow us to describe why hot-deck breaks down in terms of the basic modeling frameworks of Little and Rubin (2002) and Winkler (2003).

We assume X = (Xi) = (xij), 1 ≤ i ≤ N, 1 ≤ j ≤ M, is a representation of the survey data with N rows (records) and M columns (variables). Record xi has values xij, 1 ≤ j ≤ M. The j-th variable Xj takes values xjk, 1 ≤ k ≤ nj. The total number of patterns is npat = n1 × … × nM. In most realistic survey situations (8 or more variables), the number of possible patterns npat is far greater than N (i.e., N << npat). Under classical hot-deck assumptions (that are essentially universally used in statistical agencies), the typical assumption is that we will be able to match a record r0 = (x01, x02, …, x0M) having missing values of certain variables against a large number of donor records that have no missing variables and that agree with record r0 on the non-missing values. If record r0 has eight variables with the last three variables having missing values, then the intent of hot-deck (after it is implemented over an entire file) is to create a set of records that preserve the original probability structure of a hypothetical file X having no missing values. We start with record r0 = (x01, x02, …, x05, b, b, b), where b represents a missing value for x06, x07, and x08. Under the hot-deck assumptions, our matching would effectively draw from the distribution P(X6, X7, X8 | X1=x01, …, X5=x05). In practice with real-world data, we typically have zero donors (rather than the exceptionally large number that would be needed to preserve joint distributions). Statistical agencies typically use ad hoc collapsing in which they attempt to match on a subset of the values x01, x02, …, x05. For instance, there may be a matching hierarchy in which the first match attempt is on x01, x02, x03. If a donor record is not found, matching may be done on x01 and x02. If no donor is found, then matching might be done on only x01, where it might be possible to always find a donor. If we are able to match on x01, x02 and x03, we obtain a record rd = (xd1, …, xd8) that yields a hot-deck completed record r0c = (x01, …, x05, xd6, xd7, xd8). There is no assurance that the substituted values will preserve joint distributions or create a record that satisfies edits. Indeed, elementary empirical work with exceptionally simple simulated data (that should preserve joint distributions under the hot-deck assumption) also demonstrates that joint distributions are not preserved. Although the elementary work uses data situations that are much nicer than many real-world situations, it still fails to yield hot-deck imputations that preserve joint distributions. To preserve joint distributions, it might be necessary to create some type of basic model for collapsing. A simplistic approach might be to use logistic regression to find which subsets of x01, …, x05 are the best predictors of the remaining variables and choose the collapsing hierarchy based on a very large set of logistic regressions. Even after such work (which is very specific to an individual data set), it is not clear why the joint distributions would be preserved. It would be much better to have a general modeling framework (possibly an extension of Little and Rubin (2002), chapter 13) and software that would work for arbitrary discrete data under mild assumptions. One mild assumption is the missing-at-random assumption (Little and Rubin 2002), which is effectively the hot-deck assumption in a framework in which it is possible to preserve joint inclusion probabilities.
An effective model might be a multinomial (or a multinomial with a weak Dirichlet prior) in which all non-structural-zero cells are given non-zero (but possibly very close to zero) values. In this situation the (pi), 1 ≤ i ≤ npat, are the probabilities of the multinomial over the individual cells, and we have a suitable probability structure. With this extended hot-deck (effectively Little-Rubin ideas), we match against cells that agree with the non-missing part of a record r0 and choose one cell (donor pattern or record) with probability proportional to the cell probability.
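The sketch below (hypothetical records and variable names) makes the collapsing mechanism described above explicit: the recipient is matched on (x1, x2, x3), then on (x1, x2), then on x1 alone, and the missing items are copied from a randomly selected donor in the first non-empty cell. It illustrates the mechanism only; as argued above, nothing in it guarantees that edits are satisfied or that joint distributions are preserved.

```python
import random

# Hypothetical completed donor records (no missing values).
donors = [
    {"x1": "A", "x2": 1, "x3": "u", "x6": 0, "x7": 1, "x8": 0},
    {"x1": "A", "x2": 1, "x3": "v", "x6": 1, "x7": 0, "x8": 1},
    {"x1": "A", "x2": 2, "x3": "u", "x6": 0, "x7": 0, "x8": 1},
    {"x1": "B", "x2": 1, "x3": "u", "x6": 1, "x7": 1, "x8": 1},
]

# Collapsing hierarchy: try the full set of matching variables first,
# then progressively coarser matches until at least one donor is found.
HIERARCHY = [("x1", "x2", "x3"), ("x1", "x2"), ("x1",)]

def hot_deck(recipient, missing, rng=random):
    for keys in HIERARCHY:
        pool = [d for d in donors if all(d[k] == recipient[k] for k in keys)]
        if pool:
            donor = rng.choice(pool)           # random draw within the cell
            completed = dict(recipient)
            completed.update({v: donor[v] for v in missing})
            return completed, keys
    raise ValueError("no donor found even at the coarsest level")

recipient = {"x1": "A", "x2": 2, "x3": "w"}     # x6, x7, x8 are missing
completed, level = hot_deck(recipient, missing=("x6", "x7", "x8"))
print(level)      # ('x1', 'x2'): the full match on x1, x2, x3 found no donor
print(completed)
```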

2.3 New Computational Algorithms for Modeling and Imputation

The generalized software (Winkler 2010b) incorporates ideas from statistical matching software (Winkler 2006c), whose results can be compared to those of D’Orazio et al. (2006), and from earlier discrete-data editing software (Winkler 2008b) that could also be used for synthetic-data generation (Winkler 2010a). The basic methods are closely related to ideas suggested in Little and Rubin (2002, Chapter 13) in that they assume a missing-at-random assumption that can be slightly weakened in some situations (Winkler 2008b, 2010a). The original theory for the computational algorithms (Winkler 1993) uses convex constraints (Winkler 1990) to produce an EMH algorithm that generalizes the MCECM algorithm of Meng and Rubin (1993). The EMH algorithm was first applied to record linkage (Winkler 1993) and used by D’Orazio, Di Zio, and Scanu (2006) in statistical matching. The current algorithms do the EM fitting as in Little and Rubin (2002) but with computational enhancements that scale subtotals exceedingly rapidly and with only moderate use of memory. The computational speed is approximately 50 seconds for a contingency table of 600,000 cells and approximately 1000 minutes for a table of 0.5 billion cells (each with epsilon 10^-12 and 200 iterations). In the larger applications, 16 GB of memory are required. The key to the speed is the combination of effective indexing of cells and suitable data structures for retrieval of information so that each of the respective margins of the M-step of the EM fitting is computed rapidly. Certain convex constraints can be incorporated in addition to the standard linear constraints of classic loglinear EM fitting. In statistical matching, Winkler (2006c) was able to incorporate closed-form constraints of the form P(X1 = x11) > P(X1 = x12) with the same data as D’Orazio et al. (2006), who needed a much slower iterative fitting algorithm for the same data and constraints. The variable X1 took four values, and the constraint is that the margin of X1 for one value is restricted to be greater than the margin for another value. For general edit/imputation, Winkler (2008b) was able to put marginal constraints on one variable to assure that the resultant microdata files and associated margins corresponded much more closely to observed margins from an auxiliary data source. For instance, one variable could be an income range, and the produced microdata did not yield population proportions that corresponded closely to published IRS data until appropriate convex constraints were additionally applied. Winkler (2010a) used convex constraints to place upper and lower bounds on cell probabilities to assure that any synthetic data generated from the models would have reduced or eliminated re-identification risk while still preserving the main analytic properties of the original confidential data. A nontrivially modified version of the indexing algorithms allows near-instantaneous location of the cells in the contingency table that match a record having missing data. An additional algorithm nearly instantaneously constructs an array that allows binary search to locate the cell for the imputation (for the two algorithms: total < 1.0 millisecond cpu time).
For instance, if a record has 12 variables and 5 have missing values, we might need to delineate all 100,000+ cells in a contingency table with 0.5 million or 0.5 billion cells and then draw a cell (donor) with probability proportional to size (pps) to impute the missing values in the record. This type of imputation assures that the resultant ‘corrected’ microdata have joint distributions that are consistent with the model. A naively written SAS search and pps-sample procedure might require as much as a minute of cpu time for each record being imputed. For imputation-variance estimation, other closely related algorithms allow direct variance estimation from the model. This is in contrast to after-the-fact variance approximations using linearization, jackknife or bootstrap. These latter three methods were developed for after-the-fact variance estimation (typically with possibly poorly implemented hot-deck imputation) and are unable to account effectively for the bias of hot-deck or for the lack of a model with hot-deck. Most of the methods for after-the-fact imputation-variance estimation have only been developed for one-variable situations that do not account for the

multivariate characteristics of the data, and they assume that hot-deck matching (when naively applied) is straightforward when in practice it almost never is.
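As a minimal illustration of the model-based alternative (not Winkler's production software), the sketch below assumes a contingency table of cell probabilities has already been fitted, restricts it to the cells that agree with the non-missing part of a record, and draws the imputed values with probability proportional to the restricted cell probabilities, so that imputations are consistent with the fitted joint distribution. The variables and probabilities are invented for the example.

```python
import random

# Assumed (already fitted) joint probabilities over three discrete variables.
# In practice these would come from loglinear/EM fitting under edit constraints.
cell_probs = {
    ("A", 0, "low"):  0.20, ("A", 0, "high"): 0.10,
    ("A", 1, "low"):  0.15, ("A", 1, "high"): 0.05,
    ("B", 0, "low"):  0.10, ("B", 0, "high"): 0.15,
    ("B", 1, "low"):  0.05, ("B", 1, "high"): 0.20,
}
VARS = ("v1", "v2", "v3")

def impute_from_model(record, rng=random):
    """Draw the missing values from P(missing | observed) implied by cell_probs."""
    # Cells consistent with the observed (non-missing) part of the record.
    consistent = [
        cell for cell in cell_probs
        if all(record.get(v) in (None, cell[i]) for i, v in enumerate(VARS))
    ]
    weights = [cell_probs[c] for c in consistent]
    chosen = rng.choices(consistent, weights=weights, k=1)[0]   # pps draw
    return dict(zip(VARS, chosen))

print(impute_from_model({"v1": "B", "v2": None, "v3": "high"}))
# draws v2 = 0 with probability 0.15/0.35 and v2 = 1 with probability 0.20/0.35
```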

3. Record Linkage

Fellegi and Sunter (1969) provided a formal mathematical model for ideas that had been introduced by Newcombe et al. (1959, 1962). They introduced many ways of estimating key parameters without training data. To begin, some notation is needed. Two files A and B are matched. The idea is to classify pairs in the product space A × B from two files A and B into M, the set of true matches, and U, the set of true nonmatches. Fellegi and Sunter, making rigorous the concepts introduced by Newcombe (1959), considered ratios of probabilities of the form:

R = P( γ ∈ Γ | M) / P( γ ∈ Γ | U) (1) where γ is an arbitrary agreement pattern in a comparison space Γ. For instance, Γ might consist of eight patterns representing simple agreement or not on the largest name component, street name, and street number. Alternatively, each γ ∈ Γ might additionally account for the relative frequency with which specific values of name components such as "Smith" and "Zabrinsky” occur. Then P(agree “Smith” | M) < P(agree last name | M) < P(agree “Zabrinsky” | M) which typically gives a less frequently occurring name like “Zabrinsky” more distinguishing power than a more frequently occurring name like “Smith” (Fellegi and Sunter 1969, Winkler 1995). Somewhat different, much smaller, adjustments for relative frequency are given for the probability of agreement on a specific name given U. The probabilities in (1) can also be adjusted for partial agreement on two strings because of typographical error (which can approach 50% with scanned data (Winkler 2004)) and for certain dependencies between agreements among sets of fields (Larsen and Rubin 2001, Winkler 2002). The ratio R or any monotonely increasing function of it such as the natural log is referred to as a matching weight (or score).

The decision rule is given by:

If R > Tμ, then designate pair as a match.

If Tλ ≤ R ≤ Tμ, then designate pair as a possible match and hold for clerical review. (2)

If R < Tλ, then designate pair as a nonmatch.

The cutoff thresholds Tμ and Tλ are determined by a priori error bounds on false matches and false nonmatches. Rule (2) agrees with intuition. If γ ∈ Γ consists primarily of agreements, then it is intuitive that γ ∈ Γ would be more likely to occur among matches than nonmatches and ratio (1) would be large. On the other hand, if γ ∈ Γ consists primarily of disagreements, then ratio (1) would be small. Rule (2) partitions the set Γ into three disjoint subregions. The region Tλ ≤ R ≤ Tμ is referred to as the no-decision region or clerical review region. In some situations, resources are available to review pairs clerically. Fellegi and Sunter (1969, Theorem 1) proved the optimality of the classification rule given by (2). Their proof is very general in the sense that it holds for any representations γ ∈ Γ over the set of pairs in the product space A × B from two files. As they observed, the quality of the results from classification rule (2) depends on the accuracy of the estimates of P(γ ∈ Γ | M) and P(γ ∈ Γ | U).
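The sketch below illustrates rule (2) with made-up m- and u-probabilities for three comparison fields, assuming conditional independence so that the log likelihood ratio is a sum of per-field weights. Real systems estimate the parameters (e.g. via EM), use value-specific frequencies as discussed above, and allow partial agreement via string comparators; the thresholds here are arbitrary.

```python
import math

# Illustrative m = P(agree | M) and u = P(agree | U) for three comparison fields.
PARAMS = {
    "surname":      {"m": 0.95, "u": 0.01},
    "street_name":  {"m": 0.90, "u": 0.05},
    "house_number": {"m": 0.85, "u": 0.02},
}
T_UPPER = 6.0    # log-scale cutoffs chosen from a priori error bounds
T_LOWER = 0.0

def match_weight(gamma):
    """Sum of log2 likelihood ratios over fields, assuming independence.

    gamma maps each field name to True (agree) or False (disagree)."""
    w = 0.0
    for field, agree in gamma.items():
        m, u = PARAMS[field]["m"], PARAMS[field]["u"]
        w += math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))
    return w

def classify(gamma):
    w = match_weight(gamma)
    if w > T_UPPER:
        return w, "match"
    if w >= T_LOWER:
        return w, "clerical review"
    return w, "nonmatch"

print(classify({"surname": True, "street_name": True, "house_number": False}))
print(classify({"surname": False, "street_name": True, "house_number": False}))
```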

Figure 1 provides an illustration of the curves of log frequency versus log weight for matches and nonmatches, respectively. The two vertical lines represent the lower and upper cutoff thresholds Tλ and Tμ, respectively. The x-axis is the log of the likelihood ratio R given by (1). The y-axis is the log of the frequency counts of the pairs associated with the given likelihood ratio. The plot uses pairs of records from a contiguous geographic region that was matched in the 1990 Decennial Census. The clerical review region between the two cutoffs primarily consists of pairs within the same household that are missing both first name and age (the only two fields that distinguish individuals within a household).

In many situations with administrative lists, we need to process an enormous number of pairs. For instance, in the Decennial Census, we process 10^17 pairs (300 million × 300 million). The way that we reduce computation is with blocking. Blocking consists of only considering pairs that agree on characteristics such as a Census block code plus the first character of the surname. If we use multiple blocking passes, then we may additionally consider pairs that only agree on telephone number, street address, or the first few characters of the first name plus the first few characters of the surname. In traditional record linkage, two files are sorted according to a blocking criterion, matched, processed, and then (possibly) successive residual files are processed according to subsequent blocking criteria. With a large billion-record file, each sort could require 12+ hours. BigMatch technology (see e.g. Yancey 2007; Winkler, Yancey and Porter 2010) solves this issue by embedding the smaller file in memory, creating indices for each blocking criterion (in memory), and running through the larger file once. As each record from the larger file is read in, it is processed against each of the blocking criteria, and separate scores associated with each pair, along with other information, are output. BigMatch is 50 times as fast as recent parallel software from Stanford (Kawai et al. 2006) and 40 times as fast as parallel software from Penn State (Kim and Lee 2007). In production matching during the 2010 Decennial Census, BigMatch did detailed computation on 10^12 pairs among the 10^17 pairs in 30 hours using 40 cpus of an SGI Linux machine. In equivalently large situations with slower software, a project might require 80 machines and a whole crew of programmers to split up the files and slowly put together all the matches coherently over 20 weeks. There would be substantial opportunity for error as the programmers broke the files into much smaller subsets, moved subsets to different machines, and then attempted to move (possibly hundreds of) outputs back to other machines.
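The toy sketch below mimics the multi-pass blocking strategy in miniature (it is not the BigMatch code): the smaller file is indexed in memory once per blocking criterion, and the larger file is streamed through a single time, with each incoming record looked up against every index to emit candidate pairs. Field names and blocking keys are hypothetical.

```python
from collections import defaultdict

# Hypothetical blocking criteria: functions mapping a record to a blocking key.
BLOCKING = [
    lambda r: (r["block_code"], r["surname"][:1]),        # pass 1
    lambda r: (r["phone"],),                              # pass 2
    lambda r: (r["first_name"][:3], r["surname"][:3]),    # pass 3
]

def build_indices(small_file):
    """One index per blocking criterion, all kept in memory."""
    indices = [defaultdict(list) for _ in BLOCKING]
    for rec in small_file:
        for idx, key_fn in zip(indices, BLOCKING):
            idx[key_fn(rec)].append(rec)
    return indices

def candidate_pairs(large_file, indices):
    """Stream the larger file once; emit (large_rec, small_rec, pass_no) candidates."""
    for rec in large_file:              # could be read lazily from disk
        for pass_no, (idx, key_fn) in enumerate(zip(indices, BLOCKING), start=1):
            for small_rec in idx.get(key_fn(rec), []):
                yield rec, small_rec, pass_no

small = [{"block_code": "1010", "surname": "Smith", "phone": "555-0101",
          "first_name": "Ann"}]
large = [{"block_code": "1010", "surname": "Smyth", "phone": "555-0101",
          "first_name": "Anne"}]
for large_rec, small_rec, pass_no in candidate_pairs(large, build_indices(small)):
    print(pass_no, large_rec["surname"], "->", small_rec["surname"])
```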

4. Analysis Adjustment in Merged Files having Linkage Error

In this section, we describe research into methods for adjusting statistical analyses for linkage error. Unlike the much more mature methods of the previous two sections, these methods still present substantial research problems. Scheuren and Winkler (1993) extended methods of Neter, Maynes, and Ramanathan (1965) to more realistic record linkage situations in the simple analysis of a regression of the form y = βx, where y is taken from one file A and x is taken from another file B. Because the notation of Lahiri and Larsen (2005) is more useful in describing extensions and limitations, we use their notation. Consider the regression model for y = (y1, …, yn)':

yi = xi'β + εi,   i = 1, …, n                                            (3)

where xi = (xi1, …, xip)' is a column vector of p known covariates, β = (β1, …, βp)', E(εi) = 0, var(εi) = σ², and cov(εi, εj) = 0 for i ≠ j, i, j = 1, …, n. Scheuren and Winkler (1993) considered the following model for z = (z1, …, zn) given y:

zi = yi with probability qii, and zi = yj with probability qij for j ≠ i,   i, j = 1, …, n   (4)

where ∑j=1,…,n qij = 1 for i = 1, …, n. Define qi = (qi1, …, qin)', i = 1, …, n, and Q = (q1, …, qn)'. The naïve least squares estimator of β, which ignores mismatch errors, is given by

β̂N = (X'X)⁻¹ X'z,

where X = (x1, …, xn)' is an n × p matrix. Under the model described by (3) and (4),

E(zi) = wi'β,

where wi = X'qi = ∑j=1,…,n qij xj, i = 1, …, n, is a p × 1 column vector. The bias of the naïve estimator β̂N is given by

bias(β̂N) = E(β̂N − β) = [(X'X)⁻¹ X'W − I]β = [(X'X)⁻¹ X'QX − I]β.          (5)

If an estimator B̂ of B is available, where B = (B1, …, Bn)' and Bi = (qii − 1)yi + ∑j≠i qij yj, the Scheuren-Winkler estimator is given by

β̂SW = β̂N − (X'X)⁻¹ X'B̂.                                                  (6)

If qij1 and qij2 denote the first and second highest elements of the vector qi, and zj1 and zj2 denote the corresponding elements of the vector z, then a truncated estimator of Bi is given by

Bi^TR = (qij1 − 1) zj1 + qij2 zj2.                                         (7)

Scheuren and Winkler (1993) used estimates of qij1 and qij2 based on software/methods from Belin and Rubin (1995). Lahiri and Larsen improve on the estimator (7) (sometimes significantly) by using the unbiased estimator

β̂U = (W'W)⁻¹ W'z.                                                         (8)

The issues are whether it is possible to obtain reasonable estimates of qi, or whether the crude approximation given by (7) is suitable in a number of situations. Under a significantly simplified record linkage model in which the off-diagonal probabilities qij, i ≠ j, are all equal, Chambers (2009) provides an estimator of approximately the following form

β̂ = (W' Covz⁻¹ W)⁻¹ W' Covz⁻¹ z                                           (9)

that has lower bias than the estimator of Lahiri and Larsen. The matrix Covz is the variance-covariance matrix associated with z. The estimator in (9) is the best linear unbiased estimator, using standard methods that improve over the unbiased estimator (8). Chambers further provides an iterative method for obtaining an empirical BLUE using the observed data. The issue with Chambers' estimator is whether the drastically simplified record linkage model is a suitable approximation of the realistic model used by Lahiri and Larsen. The issue with both the models of Chambers (2009) and Lahiri and Larsen (2005) is that they need both a method of estimating qij for all i, j over all pairs of records and a method of designating which of the qij is associated with the true match. Scheuren and Winkler (1993) provided a much more ad hoc adjustment with the somewhat crude estimates of the qij obtained from the model of Belin and Rubin (1995). Lahiri and Larsen demonstrated that the Scheuren-Winkler procedure was inferior for adjustment purposes when the true qij were known. Winkler and Scheuren (1991), however, were able to determine that their adjustment worked well in a very large number of empirical scenarios (several hundred). Further, Winkler (2006) provided a 'generalization' of the Belin-Rubin estimation procedure that provides somewhat more accurate estimates of the qij and holds in a moderately larger number of situations.
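To make the estimators above concrete, the following sketch simulates data from models (3)-(4) with a known, exchangeable matrix Q and compares the naïve estimator with the Lahiri-Larsen estimator β̂U = (W'W)⁻¹W'z, where W = QX. It is only an illustration of the algebra; in practice the qij must themselves be estimated, which is the hard problem discussed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Simple linkage-error model: each record is correctly linked with prob 0.85,
# otherwise linked to a random other record (rows of Q sum to 1).
q_correct = 0.85
Q = np.full((n, n), (1 - q_correct) / (n - 1))
np.fill_diagonal(Q, q_correct)

# Simulate the observed z: z_i = y_j with probability q_ij.
links = np.array([rng.choice(n, p=Q[i]) for i in range(n)])
z = y[links]

beta_naive = np.linalg.lstsq(X, z, rcond=None)[0]     # ignores mismatch errors
W = Q @ X
beta_LL = np.linalg.lstsq(W, z, rcond=None)[0]        # Lahiri-Larsen (W'W)^-1 W'z

print("naive:        ", beta_naive)   # slope attenuated toward zero
print("Lahiri-Larsen:", beta_LL)      # approximately unbiased for (1, 2)
```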

5. Concluding Remarks

This paper describes methods of modeling/edit/imputation and record linkage that are reasonably mature in terms of improving the quality of administrative data and that have been greatly enhanced by breakthroughs in computational speed. Newer methods for adjusting statistical analyses for linkage error (Lahiri and Larsen, 2005; Chambers 2009) are very much in their preliminary stages and need substantial additional research. A very new method due to Tancredi and Liseo (2011) shows great potential both theoretically and methodologically but must be extended to more practical computational situations.

1/ This report is released to inform interested parties of (ongoing) research and to encourage discussion (of work in progress). Any views expressed on (statistical, methodological, technical, or operational) issues are those of the author(s) and not necessarily those of the U.S. Census Bureau.

References

Barcaroli, G., and Venturi, M. (1997), "DAISY (Design, Analysis and Imputation System): Structure, Methodology, and First Applications," in (J. Kovar and L. Granquist, eds.) Statistical Data Editing, Volume II, U.N. Economic Commission for Europe, 40-51.
Belin, T. R., and Rubin, D. B. (1995), "A Method for Calibrating False-Match Rates in Record Linkage," Journal of the American Statistical Association, 90, 694-707.
Boskovitz, A. (2008), "Data Editing and Logic: The covering set methods from the perspective of logic," CS Ph.D. dissertation, Australian National University, http://thesis.anu.edu.au/public/adt-ANU20080314.163155/index.html (see also http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.72.4753 and http://www.zevep.com/php/artikeldetail_x.php?typ=oai&id=1096278).
Chambers, R. (2009), "Regression Analysis of Probability-Linked Data," Statisphere, Volume 4, http://www.statisphere.govt.nz/official-statistics-research/series/vol-4.htm.
D'Orazio, M., Di Zio, M., and Scanu, M. (2006), "Statistical Matching for Categorical Data: Displaying Uncertainty and Using Logical Constraints," Journal of Official Statistics, 22 (1), 137-157.
Fellegi, I. P., and Holt, D. (1976), "A Systematic Approach to Automatic Edit and Imputation," Journal of the American Statistical Association, 71, 17-35.
Fellegi, I. P., and Sunter, A. B. (1969), "A Theory for Record Linkage," Journal of the American Statistical Association, 64, 1183-1210.
Garfinkel, R. S., Kunnathur, A. S., and Liepins, G. E. (1986), "Optimal Imputation of Erroneous Data: Categorical Data, General Edits," Operations Research, 34, 744-751.

Herzog, T. N., Scheuren, F., and Winkler, W. E. (2007), Data Quality and Record Linkage Techniques, New York, N.Y.: Springer.
Herzog, T. N., Scheuren, F., and Winkler, W. E. (2010), "Record Linkage," in (D. W. Scott, Y. Said, and E. Wegman, eds.) Wiley Interdisciplinary Reviews: Computational Statistics, New York, N.Y.: Wiley, 2 (5), September/October, 535-543 (presented in the session "Best from Wiley Interdisciplinary Reviews" at the 2011 Interface Conference at the SAS Institute in North Carolina).
Kawai, H., Garcia-Molina, H., Benjelloun, O., Menestrina, D., Whang, E., and Gong, H. (2006), "P-Swoosh: Parallel Algorithm for Generic Entity Resolution," Stanford University CS technical report (available at http://ilpubs.stanford.edu:8090/784/1/2006-19.pdf).
Kim, H.-S., and Lee, D. (2007), "Parallel Linkage," Conference on Information and Knowledge Management '07.
Lahiri, P. A., and Larsen, M. D. (2005), "Regression Analysis with Linked Data," Journal of the American Statistical Association, 100, 222-230.
Larsen, M. D., and Rubin, D. B. (2001), "Iterative Automated Record Linkage Using Mixture Models," Journal of the American Statistical Association, 96, 32-41.
Meng, X.-L., and Rubin, D. B. (1993), "Maximum Likelihood via the ECM Algorithm: A General Framework," Biometrika, 80, 267-278.
Neter, J., Maynes, E. S., and Ramanathan, R. (1965), "The Effect of Mismatching on the Measurement of Response Errors," Journal of the American Statistical Association, 60, 1005-1027.
Newcombe, H. B., Kennedy, J. M., Axford, S. J., and James, A. P. (1959), "Automatic Linkage of Vital Records," Science, 130, 954-959.
Newcombe, H. B., and Kennedy, J. M. (1962), "Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information," Communications of the Association for Computing Machinery, 5, 563-567.
Nigam, K., McCallum, A. K., Thrun, S., and Mitchell, T. (2000), "Text Classification from Labeled and Unlabeled Documents using EM," Machine Learning, 39, 103-134.
Rao, J. N. K. (1997), "Developments in Sample Survey Theory: An Appraisal," The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 25 (1), 1-21.
Scheuren, F., and Winkler, W. E. (1993), "Regression Analysis of Data Files that are Computer Matched," Survey Methodology, 19, 39-58, also at http://www.fcsm.gov/working-papers/scheuren_part1.pdf.
Scheuren, F., and Winkler, W. E. (1997), "Regression Analysis of Data Files that are Computer Matched, II," Survey Methodology, 23, 157-165, http://www.fcsm.gov/working-papers/scheuren_part2.pdf.
Tancredi, A., and Liseo, B. (2011), "A Hierarchical Bayesian Approach to Record Linkage and Population Size Problems," Ann. Appl. Stat., 5 (2B), 1553-1585.
Winkler, W. E. (1990), "On Dykstra's Iterative Fitting Procedure," Annals of Probability, 18, 1410-1415.
Winkler, W. E. (1991), "Error Model for Computer Linked Files," Proceedings of the Section on Survey Research Methods, American Statistical Association, 472-477 (http://www.amstat.org/sections/srms/Proceedings/papers/1991_079.pdf).
Winkler, W. E. (1993), "Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association, 274-279, also http://www.census.gov/srd/papers/pdf/rr93-12.pdf.
Winkler, W. E. (1995), "Matching and Record Linkage," in (B. G. Cox, D. A. Binder, B. N. Chinnappa, A. Christianson, M. A. Colledge, and P. S. Kott, eds.) Business Survey Methods, New York: J. Wiley, 355-384 (also at http://www.fcsm.gov/working-papers/wwinkler.pdf).
Winkler, W. E. (1997), "Set-Covering and Editing Discrete Data," Proceedings of the Section on Survey Research Methods, American Statistical Association, 564-569 (also available at http://www.census.gov/srd/papers/pdf/rr9801.pdf).
Winkler, W. E. (1999), "Issues with Linking Files and Performing Analyses on the Merged Files," Proceedings of the Sections on Government Statistics and Social Statistics, American Statistical Association, 262-265.
Winkler, W. E. (2004), "Approximate String Comparator Search Strategies for Very Large Administrative Lists," Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM (also http://www.census.gov/srd/papers/pdf/rrs2005-02.pdf).
Winkler, W. E. (2006a), "Overview of Record Linkage and Current Research Directions," U.S. Bureau of the Census, Statistical Research Division Report, http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf.
Winkler, W. E. (2006b), "Automatic Estimation of Record Linkage False Match Rates," Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM, also at http://www.census.gov/srd/papers/pdf/rrs2007-05.pdf.
Winkler, W. E. (2006c), "Statistical Matching Software for Discrete Data," computer software and documentation.

Winkler, W. E. (2008a), "Data Quality in Data Warehouses," in (J. Wang, ed.) Encyclopedia of Data Warehousing and Data Mining (2nd Edition).
Winkler, W. E. (2008b), "General Methods and Algorithms for Imputing Discrete Data under a Variety of Constraints," http://www.census.gov/srd/papers/pdf/rrs2008-08.pdf.
Winkler, W. E. (2010a), "General Discrete-data Modeling Methods for Creating Synthetic Data with Reduced Re-identification Risk that Preserve Analytic Properties," http://www.census.gov/srd/papers/pdf/rrs2010-02.pdf.
Winkler, W. E. (2010b), "Generalized Modeling/Edit/Imputation Software for Discrete Data," computer software and documentation.
Winkler, W. E. (2010c), "Record Linkage," course notes from a short course at the Institute of Education, University of London, September 2010.
Winkler, W. E. (2010d), "Cleaning Administrative Data: Improving Quality using Edit and Imputation," course notes from a short course at the Institute of Education, University of London, September 2010.
Winkler, W. E., and Chen, B.-C. (2002), "Extending the Fellegi-Holt Model of Statistical Data Editing," available at http://www.census.gov/srd/papers/pdf/rrs2002-02.pdf.
Winkler, W. E., and Scheuren, F. (1991), "How Computer Matching Error Affects Regression Analysis: Exploratory and Confirmatory Report," unpublished technical report.
Winkler, W. E., Yancey, W. E., and Porter, E. H. (2010), "Fast Record Linkage of Very Large Files in Support of Decennial and Administrative Records Projects," Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM (also http://www.amstat.org/sections/srms/Proceedings/y2010/Files/307067_57754.pdf).
Wright, J. (2010), "Linking Census Records to Death Registrations," Australian Bureau of Statistics Report 1351.0.55.030 (http://www.ausstats.abs.gov.au/Ausstats/subscriber.nsf/0/45CA062EC234F1C0CA2576E20021FEFB/$File/1351055030_mar%202010.pdf).
Yancey, W. E. (2007), "BigMatch: A Program for Extracting Probable Matches from Large Files," Statistical Research Division Research Report, http://www.census.gov/srd/papers/pdf/RRC2007-01.pdf.

Some advances on Bayesian record linkage and inference for linked data

Brunero Liseo and Andrea Tancredi
MEMOTEF, Sapienza Università di Roma
Viale del Castro Laurenziano 9, Roma 00161
{brunero.liseo, andrea.tancredi}@uniroma1.it

Abstract: In this paper we review some recent advances in Bayesian methodology for performing record linkage and for making inference using the resulting matched units. In particular, we frame the record linkage issue as a formal inferential problem and adapt standard model selection techniques to this context. Although the methodology is quite general, we focus on the simple multiple regression set-up for expository convenience.

Keywords: Bayesian computational methods, capture-recapture, model selection.

1. Bayesian use of linked data

In general, from a statistical methodology perspective, the merging of two (or more) data files can be important for two different and complementary reasons:
• per se, i.e. to obtain a larger and integrated reference data set;
• to perform a subsequent statistical analysis based on the additional information which cannot be extracted from either one of the two single data files.
The first situation needs no further comment: a new data set is created and appropriate statistical analyses will be performed on it. However, the statistical theory behind the two situations must be different, and we will comment on this problem later. The second situation is more interesting from both a practical and a theoretical perspective. Let us consider a toy example to fix the ideas. Suppose we have two computer files, say A and B, whose records respectively relate to units (e.g. individuals, firms, etc.) of partially overlapping populations PA and PB. The two files consist of several fields, or variables, either quantitative or qualitative. For example, in a file of individuals, fields can be "surname", "age", "sex", etc. The goal of a record linkage procedure is to detect all the pairs of units (a,b), with a in PA and b in PB, such that a and b actually refer to the same unit. Suppose that the observed variables in A are denoted by (Z, W1, W2, …, Wh) and the observed variables in B are (X, W1, W2, …, Wh). Then one might be interested in performing a linear regression analysis (or any other more complex association model) between Z and X, restricted to those pairs of records which are declared matches after a record linkage analysis based on the variables Wi. The intrinsic difficulties which are present in such a simple problem are well documented and discussed in Scheuren and Winkler (1997) and Lahiri and Larsen (2005). In statistical practice it is quite common that the linker (the researcher who matches the two files) and the analyst (the statistician doing the subsequent analysis) are two different persons working separately. However, we agree with Scheuren and Winkler (1997), who say "… it is important to conceptualize the linkage and analysis steps as part of a single statistical system and to devise appropriate strategies accordingly."

In a more general framework, suppose one has (Z1, Z2, …, Zk, W1, W2, …, Wh) observed on nA units in file A and (X1, X2, …, Xp, W1, W2, …, Wh) observed on nB units in file B. Our general goal can be stated as follows:
• use the key variables W1, W2, …, Wh to infer the true matches between A and B;
• perform a statistical analysis based on the variables Z and X restricted to those records which have been declared matches.

To perform this double task, we argue that a fully Bayesian analysis allows for an integrated use of the information, which improves the linkage step and allows us to account for the matching uncertainty in the estimation of the regression coefficients. The main point to stress is that in our approach all the uncertainty about the matching process is automatically accounted for in the subsequent inferential steps. This approach is based on the Bayesian model for record linkage described in Fortini et al. (2001) and improved in Tancredi and Liseo (2011). We present the general theory and illustrate its performance via simple examples. In Section 2 we briefly recall the Bayesian approach to record linkage proposed by Fortini et al. (2001), to which we refer for details. Section 3 generalizes the method to include the inferential part. Section 4 concentrates on the special case of regression analysis, the only situation which has already been considered in the literature: see Scheuren and Winkler (1993) and Lahiri and Larsen (2005) for an historical account and a more detailed illustration. Section 5 discusses the record linkage problem as one of model selection.

2. Bayesian Record Linkage

In Fortini et al. (2001) a general technique to perform a record linkage analysis is proposed. Starting from a set of key variables W1, W2, …, Wh, observed on two different sets of units, the method defines, as the main parameter of interest, the matching matrix C, of size nA × nB, whose generic element cab is either 0 or 1 according to whether records a and b refer to the same unit. The parameter of interest C must satisfy some obvious constraints: we assume there are no duplicates in either PA or PB, which implies that the row and column sums of C will be either 0 or 1. In classical statistical inference the matrix C would be defined as a latent unobserved structure. The statistical model is based on a multinomial likelihood function where all the comparisons between key variables among units are measured on a 0/1 scale. As in the mixture model proposed by Jaro (1995), a central role is played by the parameter vectors m and u, both of length 2^h, with

m_i = P(Y_ab = y_i | c_ab = 1),   u_i = P(Y_ab = y_i | c_ab = 0),

for i = 1, …, 2^h, where Y_ab represents the h-dimensional vector of comparisons between units a ∈ A and b ∈ B. In the vast majority of applications comparisons are made on a 0/1 scale, with Y_ab being a vector of 0's and 1's according to whether the corresponding key variable agrees between the two records or not. This approach implies an obvious loss of information; recently, Tancredi and Liseo (2011) have proposed a different approach which is based on the actual observed values of the key variables. Then, independently of the way in which comparisons are performed, a Bayesian approach to record linkage goes through the generation of a Markov chain Monte Carlo sample

from the posterior distribution of the matrix-valued parameter C. See Fortini et al. (2001) and Tancredi and Liseo (2011) for a discussion about the appropriate choices for the prior distribution on C and on the other parameters of the model, mainly m and u.
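As a purely illustrative aside (not part of the method of Fortini et al.), the one-to-one constraint on C described above can be checked directly. The following Python sketch, applied to a hypothetical 3 × 4 candidate matrix, simply verifies that C contains only 0's and 1's and that every row and column sum is at most one.

    import numpy as np

    def is_valid_matching(C):
        """Check that a 0/1 matrix C of size nA x nB encodes a (possibly partial)
        one-to-one matching, i.e. there are no duplicates within either file."""
        C = np.asarray(C)
        binary = np.all((C == 0) | (C == 1))
        rows_ok = np.all(C.sum(axis=1) <= 1)   # each record of A linked to at most one record of B
        cols_ok = np.all(C.sum(axis=0) <= 1)   # each record of B linked to at most one record of A
        return bool(binary and rows_ok and cols_ok)

    # hypothetical candidate: pairs (1, 2) and (3, 1) are declared matches
    C = [[0, 1, 0, 0],
         [0, 0, 0, 0],
         [1, 0, 0, 0]]
    print(is_valid_matching(C))   # True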

3. Inference with linked data

In this section we illustrate how to construct and calibrate a statistical model based on a data set which is the output of a record linkage procedure. As we already stressed, the final output provided by the procedure described in the previous section is a simulated sample from the (approximate) joint posterior distribution of the parameters, say (C, m, u, ξ), where ξ includes all the other parameters in the model. This sample can be used according to two different strategies.
A) Compute a "point" estimate of the matrix C and then use this estimate to establish which pairs are passed to the second stage of the statistical analysis. In this case, the second step is performed with a fixed number of units (the declared matches). It must be noticed that, given the particular structure of the parameter matrix C, no obvious point estimates are available. The posterior mean of C is useless, since each single entry c_ab must be estimated as either 0 or 1. The posterior median is difficult to define as well, and the most natural candidate, the maximum a posteriori (MAP) estimate, typically suffers from sensitivity problems (to the prior and to Monte Carlo variability); this last issue is particularly crucial in official statistics, where inferential results must be used for making decisions. For a discussion of these issues see Tancredi et al. (2005) and, for related problems in a different scenario, Green and Mardia (2006).
B) Alternatively, one can transfer the "global" uncertainty about C (and the other parameters), expressed by their joint posterior distribution, to the second-step statistical analysis.

We believe that this latter approach is more sensible in the way it deals with uncertainty. Among other things, it avoids over-estimating the precision measures attached to the output of the second-step analysis. The most obvious way to implement approach B) is to perform the second-step analysis at the same time as the record linkage analysis, that is, to include the second-step analysis in the MCMC procedure. This causes a feed-back propagation of information between the record linkage parameters and the more specific quantities of interest. Here we illustrate these ideas in a very general setting; in the next section we consider the regression example in detail.

Let D = (Y, Z, X) denote the entire set of available data where, as in the Introduction, Y_ab represents the vector of comparisons of the variables which are present in both files (or the vector of actual values of the key variables, when these are observed), Z_a is the value of the covariate Z observed on individual a ∈ A and X_b is the value of the covariate X observed on individual b ∈ B. The statistical model can then be written as

p(y, z, x | C, m, u, θ, ξ)

where (C, m, u, ξ) are the record linkage parameters and θ is the parameter vector related to the joint distribution of (X, Z). The above formula can always be re-expressed as

p(y | C, m, u, θ, ξ) p(z, x | y, C, m, u, θ, ξ)

It is then reasonable to assume that, given C, the comparison vector Y does not depend on θ; also, given C, the distribution of (X, Z) should depend neither on the comparison vector Y nor on the parameters related to those comparisons. It follows that the model can be simplified into the following general expression:

p(y | C, m, u) p(z, x | C, θ)     (1)

The first factor is related to the record linkage step; the second factor refers to the second-step analysis and must be specified according to the particular statistical analysis of interest. The presence of C in both factors allows for the feed-back phenomenon mentioned before. Approaches A) and B) can be re-phrased using (1). In approach A) the first factor of the model is used to obtain an estimate Ĉ of C; Ĉ is then plugged into the second factor and a standard statistical analysis is performed to estimate θ. In approach B) the two factors are considered together within the MCMC algorithm, thus providing a sample from the joint posterior distribution of all the parameters. In this case the Markov chain which produces the posterior sample allows for an information feedback between C and θ.

There is actually a third possible approach, which we call approach C): one can run the MCMC algorithm on the first factor only and, at each step t = 1, …, T of the algorithm, perform the statistical analysis expressed by the second factor of the model with the record linkage parameters fixed at their current values, say C(t), the value of the Markov chain for the parameter C at time t. This way, one obtains an estimate θ̂(t) of θ at each step of the MCMC algorithm and then summarizes the resulting set of estimates. In the next section we illustrate the three approaches in the familiar setting of simple linear regression. We anticipate that approach A) seems to fail to account for the uncertainty in the first step of the process and, consequently, tends to produce a false impression of accuracy in the second-step inferences. In general, we consider approach B) the most appropriate in terms of the use of the statistical information provided by the data. However, approach C) can be particularly useful if the set of linked data must be used more than once, for different purposes. In fact, while in approach B) information flows back and forth between C and θ, in approach C) the information goes one way, from C to θ, and the record linkage step is not influenced by the information provided by (X, Z).

4. Multiple linear regression

Consider again the toy example in the Introduction and assume that our object of interest is the linear relation between X and Z, say

Z = Xβ + ε

with ε being a vector of i.i.d. standard normal random variables and β the vector of regression coefficients. One should notice that the lengths of the vectors Z and ε are not fixed in advance, since they depend on the number of matched units. Here we describe how to implement the three different approaches discussed in Section 3. In the following we assume that our statistical model can be simplified according to (1). First we give a brief account of the method proposed by Lahiri and Larsen (2005), which generalizes the pioneering approach developed in Scheuren and Winkler (1997). They assume that the two data sets consist of the same number of units, say n = nA = nB; this assumption is quite restrictive in practice. With respect to model (1), consider the matrix P whose generic element p_ab denotes the probability that the a-th unit of file A coincides with the b-th unit of file B, and assume that the main goal is the estimation of the regression parameters β_1, β_2, …, β_p. Since the information about the true links is missing, it is useful to introduce new variables (V_1, V_2, …, V_n), where each V_a takes one of the values of the response variable observed on the n units, the value of unit b being taken with probability p_ab. In our notation, their approach corresponds to introducing, for each unit a in A, a latent vector S_a = (S_a1, S_a2, …, S_an) which consists of exactly one 1 and n−1 zeros; the 1 identifies the record of file B corresponding to unit a. Moreover, S_1, S_2, …, S_n are assumed to be mutually independent, each with a multinomial distribution with parameters (1, p_a), where p_a = (p_a1, p_a2, …, p_an). Then it is easy to see that

E(Z_a | S_1, S_2, …, S_n) = Σ_b S_ab X_b' β

and, by the law of iterated expectations, in matrix form

E(Z) = PXβ

This produces an unbiased estimator of β, namely

β̂ = (X'P'PX)^(-1) X'P'Z

In other words, in order to account for the uncertainty about the matching, Lahiri and Larsen (2005) propose the use of a weighted combination of covariates, where the weights are estimated in the linkage step. They also provide an estimate of the variance of the estimator via a parametric bootstrap approximation.
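To fix ideas, the following toy computation (purely illustrative: the design matrix, the linkage probabilities in P and the responses are all invented, and n = nA = nB as in their setting) reproduces the mechanics of the estimator (X'P'PX)^(-1) X'P'Z.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix observed in file B
    beta_true = np.array([1.0, 2.0])

    # P[a, b] = estimated probability that record a of file A matches record b of file B;
    # a nearly diagonal matrix mimics a good but imperfect linkage
    P = np.full((n, n), 0.02)
    np.fill_diagonal(P, 0.92)
    P = P / P.sum(axis=1, keepdims=True)

    Z = X @ beta_true + rng.normal(scale=0.1, size=n)        # responses in file A (true links on the diagonal)

    W = P @ X                                                # weighted combination of covariates
    beta_hat = np.linalg.solve(W.T @ W, W.T @ Z)             # equals (X'P'PX)^(-1) X'P'Z
    print(beta_hat)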

In general, linkage errors may weaken a linear regression analysis in several different ways:
a) if one fails to detect a match, the standard errors of the ML estimates increase;
b) if a false match is introduced in the analysis then, on average, one introduces a bias which shrinks the ML estimates of the regression coefficients toward zero. The same problem is likely to affect the posterior distribution of the regression coefficients in a Bayesian analysis.

Here we try to go beyond the limitation of equal sample sizes for the two files and notice that, for a given matching matrix C, the correctly linked regression model can be written as

C'Z = C'CXβ + C'ε     (2)

with the convention that the rows of the above equation corresponding to zero components of the vector C'Z are eliminated. From this perspective it is clear that the introduction of the matrix C allows for a direct generalization of the Lahiri-Larsen method to the more general case of different sample sizes. We now discuss the three different strategies illustrated in the previous section, with particular emphasis on the multiple regression framework.

Method A.

I. Use any record linkage procedure to establish which pairs of records are true matches.
II. Use the subset of matched pairs to perform a linear regression analysis and provide an estimate of θ via ordinary least squares, maximum likelihood or Bayesian methods.

This methodology corresponds to selecting a point estimate of C and using it in the above regression expression. All the uncertainty about the matching procedure is clearly lost and not transferred to the regression analysis.

Method B.

I. Set up an MCMC algorithm for model (1), that is, at each iteration t = 1, …, T:
II. draw C(t) from its full conditional distribution;
III. draw (m(t), u(t), ξ(t)) from their full conditional distribution;
IV. draw θ(t) from its full conditional distribution.

From steps B-II and B-III one can notice that the marginal posterior distribution of C is potentially influenced by the information on θ. In this case the posterior distribution of θ accounts for the uncertainty related to the linking procedure in a coherent way. From a theoretical perspective, this is the coherent way to proceed: all the relations among variables and parameters are potentially considered and uncertainty is accounted for correctly.

Method C.

I. Set up an MCMC algorithm restricted to the first factor of (1) in order to produce a sample from the joint posterior distribution of (C, m, u). This can be done using, for example, the algorithms illustrated in Fortini et al. (2001) and Tancredi and Liseo (2011).
II. At each iteration t = 1, …, T of the MCMC algorithm, use C(t) to perform a linear regression analysis restricted to those pairs of records (a, b) such that c_ab(t) = 1, and produce a point estimate θ̂(t) of θ, for example the OLS estimate.

III. Use the list of estimates as an approximation of the "predictive distribution" of the estimator used.

In this third approach, setting S = C'C and using the fact that S is idempotent, from (2) one obtains, at each iteration, that the estimate θ̂(t) is equal to

(X'SX)^(-1) X'SC'Z

It follows that, in this approach, the estimation of C is not influenced by the regression part of the model. This method may be safer to use (and preferable) when the main goal of the record linkage step is to create an enriched reference data set to be used repeatedly in the future for different purposes. Under the additional assumption that, given the matching matrix C, the variables used in the regression are unrelated to the key variables used in the record linkage analysis, methods B and C provide similar results.
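The mechanics of method C can be sketched as follows (illustrative only: it assumes that posterior draws C(1), …, C(T) of the matching matrix are already available, e.g. from a record linkage MCMC run, and it uses plain OLS as the point estimate at each iteration).

    import numpy as np

    def method_C_estimates(C_draws, X, Z):
        """For each posterior draw C (an nA x nB 0/1 matrix), fit OLS on the
        currently linked pairs and collect the coefficient estimates."""
        estimates = []
        for C in C_draws:
            a_idx, b_idx = np.nonzero(C)            # linked pairs (a, b) with c_ab = 1
            if len(a_idx) <= X.shape[1]:
                continue                            # too few links to fit the model
            beta_t, *_ = np.linalg.lstsq(X[b_idx], Z[a_idx], rcond=None)
            estimates.append(beta_t)
        return np.array(estimates)

    # the collection of estimates approximates the "predictive distribution" of the
    # estimator, e.g. summarised by its mean and spread:
    #   betas = method_C_estimates(C_draws, X, Z)
    #   betas.mean(axis=0); betas.std(axis=0)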

From a computational perspective, method B is complicated by the fact that the full conditionals of the extra parameters, given the record linkage parameters, must be derived anew for each different statistical model; also, the introduction of new parameters is likely to change the full conditionals of the record linkage parameters, and it might not be simple to adjust the MCMC algorithm accordingly. This is another compelling, although practical, reason for preferring method C.

5. Selection of matches as a model selection problem

In this section we rephrase the record linkage problem as one of variable selection in regression analysis. Suppose there are p potential explanatory variables available for the analysis and the researcher must select the best subset of variables among the 2^p possible choices. Let K_j, j = 1, …, 2^p, denote the generic subset of covariates. In a Bayesian framework, one can usually compute, for each possible K_j, its posterior probability P(K_j | data). Then one can choose either
• the maximum a posteriori (MAP) model, that is, the subset K_j with the highest posterior probability. This choice is optimal under a zero-one loss, although it typically suffers from a robustness problem;
• the median posterior model (MeM), that is, the subset of covariates which includes all the regressors whose marginal posterior probability is higher than 0.5; see Barbieri and Berger (2004) for details. One can show that this choice is optimal for predictive purposes under a large range of reasonable loss functions and it is also more robust than the MAP;
• if prediction is the ultimate goal, one need not necessarily choose a single model: an average prediction can be made using the predictions from each single model weighted by their posterior probabilities. This approach is superior in terms of accounting for uncertainty, since each single "inference" is weighted by its posterior probability. This methodology is generally known as Bayesian Model Averaging.

In record linkage problems, models correspond to specific choices of the set of matches to be selected. Given nA and nB there are

∑_{k=0}^{min(nA, nB)} (nA choose k) (nB choose k) k!

possible models to choose from.
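As a check on this count, it can be computed directly for tiny files; for instance, with nA = nB = 2 there are 1 + 4 + 2 = 7 possible sets of matches. A small Python verification (illustrative only):

    from math import comb, factorial

    def n_matchings(nA, nB):
        """Number of possible sets of matches between files of sizes nA and nB."""
        return sum(comb(nA, k) * comb(nB, k) * factorial(k)
                   for k in range(min(nA, nB) + 1))

    print(n_matchings(2, 2))   # 7
    print(n_matchings(3, 2))   # 1 + 6 + 6 = 13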

From a more theoretical perspective, the correspondence between point estimates of C and models has the sole drawback that in a record linkage problem there is a correct model, whereas this is almost never the case in applied statistics, where models are, at best, more or less reliable approximations to reality and it might be more reasonable to account for "model" uncertainty. We are currently working on this particular perspective.

References

Barbieri, M.M. and Berger, J.O. (2004). Optimal predictive model selection. The Annals of Statistics, 32, pp. 870-897.

Fortini, M., Liseo, B., Nuccitelli, A. and Scanu, M. (2001) On Bayesian Record Linkage. Research in Official Statistics, 4, Vol.1, 185-198.

Green, P.J. and Mardia, K.V. (2006). Bayesian alignment using hierarchical models, with application in protein bioinformatics, Biometrika, 93, pp. 235-254.

Herzog, T. N., Scheuren, F., and Winkler, W.E., (2007), Data Quality and Record Linkage Techniques, New York, N. Y.: Springer.

Herzog, T. N., Scheuren, F., and Winkler, W.E., (2010), Record Linkage, in (D. W. Scott, Y. Said, and E. Wegman, eds.) Wiley Interdisciplinary Reviews: Computational Statistics, New York, N. Y.: Wiley, 2 (5), September/October, 535- 543 .

Jaro, M. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84, pp. 414-420.

Lahiri, P. and Larsen, M. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100, pp. 222-230.

Larsen, M. (2005) Advances in record linkage theory: hierarchical Bayesian record linkage. ASA proceedings

Lindley, D.V. (1977) A problem in forensic science. Biometrika, 64, pp. 207-213.

Liseo, B. and Tancredi, A. (2011). Bayesian estimation of population size via linkage of multivariate normal data sets. Journal of Official Statistics, Vol. 27, No. 3, pp. 491-505.

Scheuren, F. and Winkler, W. E. (1997). Regression analysis of data files that are computer matched, II. Survey Methodology, 23, pp. 157-165.

Tancredi, A., Liseo, B., Guagnano, G. (2005) Inferenza statistica basata su dati prodotti mediante procedure di record linkage. In L'integrazione di dati di fonti diverse: tecniche e applicazioni del Record Linkage e metodi di stima basati sull'uso congiunto di fonti statistiche e amministrative (P.D. Falorsi, A. Pallara, A. Russo (eds.)) Franco Angeli, pp. 41-59.

Tancredi, A. and Liseo, B. (2011). A hierarchical Bayesian approach to record linkage and population size estimation. Annals of Applied Statistics, Vol. 5, No. 2B, pp. 1553-1585.

Applications of record linkage to population statistics in the UK

Dick Heasman, Mark Baillie1, James Danielis, Paula McLeod, Meghan Elkin Office for National Statistics, Fareham, Hampshire, PO15 5RR, [email protected]

Abstract: Population statistics in the UK are benchmarked by a decennial Census, the latest one having been held in March 2011. The design of the process to match the Census to the Census Coverage Survey (CCS) is briefly described. It is possible that traditional census-taking will not have the same role in the future of UK population statistics, the creation of a population spine by integrating administrative data sources being one of the alternative options under investigation. Some of the challenges involved in linking such data sources are outlined. Significant questions include the sensitivity of matching procedures to failures of the conditional independence assumption and how to handle missing values. The use of simulated data to research these and other issues in record linkage is presented.

Keywords: record linkage, administrative data, population statistics, missing values, conditional independence

1. Introduction

Population statistics in the UK are benchmarked by a decennial Census, the latest one having been held in March 2011. To adjust for undercount, the Office for National Statistics (ONS) and its associated statistical offices for the devolved administrations of Scotland and Northern Ireland hold a Census Coverage Survey (CCS) in order to make a final estimate of key population domains using the Dual System Estimator. This process is extremely sensitive to the accuracy of the match between the Census and the CCS. Section 2 reports briefly on how ONS is performing this matching.

It is possible that traditional census-taking will not have the same role in the future of UK population statistics. ONS has established the Beyond 2011 Programme, which aims to investigate: the feasibility of improving population statistics in the UK by making use of integrated data sources to replace or complement existing approaches; and whether alternative data sources can provide the priority statistics on the characteristics of small populations, typically provided by a Census.

The creation of a population spine by integrating administrative data sources is one of the options being investigated by the Beyond 2011 Programme. However, the data sources available are not designed with population statistics in mind, nor do any of them aim to achieve complete population coverage. The challenges involved in linking the data sources are therefore expected to be considerable, and to include factors such as

1 Formerly of the Office for National Statistics

variations in definitions and missing values in data fields, as well as poorly recorded data and unaligned dates of data capture.

Section 3 briefly describes the Beyond 2011 administrative data sources option, early plans for carrying out the record linkage and the problems in terms of data quality that ONS anticipates having to address. Section 4 discusses some of the methodological research work being carried out to provide evidence on the best way to proceed with the linkage. The design of simulated administrative data sources and ongoing research into the sensitivity of matching procedures to failures of the conditional independence assumption, the handling of missing data and optimal matching methods are outlined.

2. Matching in the Context of 2011 Census Coverage Assessment

2.1 The 2011 Census
The constituent nations of the United Kingdom (UK) are England, Wales, Scotland and Northern Ireland. ONS is responsible for the conduct of the 2011 Census in England and Wales. Devolved administrations are responsible for the Census in Scotland and Northern Ireland. All three offices work closely together to deliver a Census for the whole of the UK, which is currently estimated to have 62.3 million usual residents.

England and Wales comprise 376 Local Authorities (LAs). ONS is confident that the two key targets for coverage have been hit, namely that at least 94% of the population have responded to the Census, and that there has been a response rate of over 80% in every LA. In addition, it is estimated that less than 10% of LAs have below a 90% response rate and that Inner London Boroughs (a subset of LAs that had particular coverage problems in 2001) had 5 to 15 percentage points higher response than in 2001.

2.2 The 2011 Census Coverage Survey
Despite these encouraging outcomes to the Census Field Operation, a Census coverage and adjustment strategy is necessary to meet the requirements that the national population estimate is within +/-0.2% of the truth, and that all LA population estimates should be within +/-0.3%, both with a 95% confidence interval.

To achieve this adjustment ONS conducted, as in 2001, the Census Coverage Survey (CCS). It is a survey covering all households and individuals in households in about one per cent of all postcodes (a UK postcode typically consists of between 15 and 25 dwellings). The sampling strategy ensures a sufficient presence in the survey for each LA, but also uses stratification by ‘Hard To Count’ categories, ensuring a greater sample in those areas where Census response is expected to be lowest. ONS is confident that the CCS in 2011 achieved a 90% response rate, a significant achievement for a voluntary survey in a time of falling response rates to ONS social surveys.

Although the CCS fieldwork operation is kept strictly independent of that of the Census, the CCS questionnaire is designed specifically to facilitate the matching of its results to the Census. When this matching has been completed, the true population counts for the key population groups (quinary age band by sex by LA) are estimated using the Dual System Estimator. While this paper is not the place to discuss its workings, it is worth noting that the estimates are sensitive to errors in matching, due to the small size of the

survey sample relative to the population. Sensitivity is even greater in populations with low response in either the Census or the survey.

2.3 Possible Census overcount
As well as the risk of undercount through Census non-response, there is also, for a variety of reasons, the risk that duplicate Census returns will be made for the same individuals. ONS is running a matching exercise to detect duplicates in samples of the Census database, in order to estimate the size of any overcount.

2.4 Matching the 2011 CCS and Census
As noted above, the impact of matching errors, whether false positives or negatives, is unusually high for this particular matching exercise. To prevent such errors from occurring as far as is practicable, a large number of clerical matchers is used. They are deployed at various stages of the procedure: resolving all multiple matches arising from the automated matching; quality checking samples of the matches made by the automated matching (every LA will have a sample checked); resolving the pairs allocated to the clerical review region by the automated matching; and searching the Census database for residual units from the CCS that are not matched.

Clerical matching is aided by the clerical matchers having access, where necessary, to the images of Census returns. Thus all sorts of contextual clues can be taken into account. In total, the clerical matching will be a full-time job for at least 23 people for 9 to 12 months. Not all of this effort will be directed towards matching the CCS and the Census: some will also be expended, for instance, on resolving the clerical review region in the detection of duplicates.

The matching has an overarching hierarchical structure: households are firstly matched, followed by individuals within households, then individuals within the set of unmatched households. Finally, residual units still not matched are submitted for clerical review.

The automated matching has been written in SAS and designed in a modular format using macros for flexibility and ease of development. It is not the intention of this paper to go through each macro or to give an exhaustive account of how it works. Instead, the key choices, from a methodological point of view, made in the design of the matching are now briefly listed and explained.

• The matching exercise is carried out on pre-imputed data, e.g. dates of birth are not donated to records where this information is missing.
• Since name variables are strong personal identifiers, any records with null or blank values in these fields are excluded from the automatic matching and clerical resolution. They are made available for matching in the final clerical review.
• Data are cleaned by removing any white-space characters, transforming all strings to upper case and removing common tokens such as titles. Standardisation is carried out by the use of look-up lists for common abbreviations and acronyms in variables such as address, and by ensuring a standard entry for telephone numbers.
• Data on date of birth are collected as three separate variables and all are used in the matching of individuals.
• In each phase, exact matching (i.e. perfect agreement on all fields) is used before the residues are matched using probabilistic matching.

• The probabilistic method used is that of Fellegi and Sunter. Macros have also been prepared to carry out the method of Copas and Hilton (used in 2001), but it is unlikely that these will be needed.
• The parameters of the probabilistic matching are not determined using training data. The main reason for this is a practical one: data are processed and come into the matching exercise on an LA-by-LA basis, and it will take several months before the data from the last LA to be processed are received. If training data were to be used they would have to come from the ‘early’ LAs, but would not necessarily be typical of data for the rest of the country.
• The choice of matching variables may also vary from one LA to another. The matching system has the functionality to evaluate the discriminative power of each variable on a per-LA basis. If the diagnostics indicate that the default matching variables provide low levels of matches for one LA, a new set of matching and blocking variables may be selected and the matching process rerun.
• Throughout the probabilistic matching process, conditional independence is assumed. Although the matching variables used will not in fact be fully statistically independent, they are chosen to remove redundancy as far as possible.
• Search space reduction is achieved through blocking. Consideration is given to the level of independence between blocking and matching variables. Multiple blocking passes are used to guard against the blocking strategy creating false negatives.
• Macros have been written for many different string comparators. Matchers can experiment with a different choice if the diagnostics suggest that the default is not working well. The default is bigrams, based on the assumption that most textual errors will result from the difficulty of reading Census returns using optical character recognition technology. Each string comparator must be given a threshold, as its output is a decision on whether the strings agree or disagree (a sketch of such a comparator is given after this list).
• If the value of a variable is missing from either dataset, it is not entered for comparison and the variable makes no contribution to the total matching score for the record pair.
• Parameter estimation is achieved using the EM algorithm. Testing has found that starting values of 0.51 for m-probabilities and 0.49 for u-probabilities work perfectly well.
• The clerical review regions are initially set to be wide, but are likely to be narrowed after results of clerical reviews are fed back from the first few LAs.
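As an illustration of the kind of string comparator mentioned in the list above, the following Python sketch implements a bigram (Dice) similarity turned into an agree/disagree decision by a threshold; the threshold value and the example strings are invented and are not those used by ONS.

    def bigrams(s):
        s = s.upper()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    def bigram_agreement(s1, s2, threshold=0.75):
        """Dice similarity on character bigrams, converted to an agree/disagree decision."""
        b1, b2 = bigrams(s1), bigrams(s2)
        if not b1 or not b2:
            return False
        similarity = 2 * len(b1 & b2) / (len(b1) + len(b2))
        return similarity >= threshold

    print(bigram_agreement("JOHNSON", "JOHNSTON"))   # True: a minor spelling variation still agrees
    print(bigram_agreement("SMITH", "JONES"))        # False: clearly different strings disagree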

3. Linking Administrative Data for the Beyond 2011 Project

3.1 Background
The Beyond 2011 Programme is investigating three major options: administrative data based options, census options and survey options. Within the first of these, the record level model is a possible candidate. Section 3.2 outlines the major administrative data sets available to ONS, section 3.3 outlines the record level model and section 3.4 discusses quality issues connected to the data sets available.

3.2 Administrative sources
The two administrative data sources that have the potential to include the national population at all ages are the Patient Register Database (PRD), which holds details on individuals registered with doctors, and the ‘Customer Information System’ (CIS), which holds records of people who have been a client or customer of the tax or benefit administrations, i.e. tax payers and benefit and pensions claimants.

Some other administrative sources provide good coverage of certain sub-populations which may be hard to count on these broad coverage sources. These include:
• The Migrant Worker Scan, which provides information on international migrants to the UK who have registered for and been allocated a National Insurance number.
• The School Census, which collects data on state school pupils in England. It has good coverage of children aged 5-15 in England and collects a broad range of demographic information.
• HESA Student Data, which records students registered at Higher Education institutions who are following a course that will lead to a qualification.
• Birth, marriage and death records and the electoral roll.

3.3 The record level model
Under this model the broad coverage administrative sources, the CIS and PRD, would be linked together at the individual level to produce an initial population spine. This might then be linked to the other datasets to attempt to improve coverage in hard to count groups. Address information in the spine might also be linked to a register of addresses. Rules would need to be developed on what combination of appearances in the data sets would qualify a record to appear in the population spine.

Linking the sources would lead to a degree of over and under estimation of the sizes of the key population groups. A coverage assessment (which is likely to take the form of a survey) would therefore be needed to assess the accuracy of the initial population count. Individuals in the coverage survey would be matched to records arising from the original data linkage process. This would enable ONS to estimate the extent of over and under count in different domains and to estimate weights which could be applied to adjust the initial population estimates to correct for this.

3.4 Quality of the data sources
The primary reason for collecting the data sources described above was not to enhance demographic statistics. Consequently they contain some features likely to present ONS with a challenge when it comes to linking them. An example of this appeared in an exercise where the PRD was matched to School Census data for 5 to 15 year olds. Exact and unique matches by sex, date of birth and postcode alone2 outnumbered the exact matches made when first and last name were added to the matching variables. This was despite the fact that excluding the name fields led to a reduced number of unique matches, as matches involving multiple-birth siblings could not be distinguished as unique. Only when allowance was made for typographical errors and spelling variations in the name matching, through the careful application of a string comparator, did the inclusion of name data lead to a better matching result.

2 Sampling and clerical checking revealed very few of these matches to be false links.

For legal reasons, ONS has yet to gain access to record-level CIS data. The issues that may occur with this source are therefore not fully explored, but there is no particular reason to expect that it will be of better quality than others. In one respect, it is expected to be worse, in that a record will only be updated after some contact between the tax or benefit authorities and the individual. For some, these episodes may be years apart.

In all the data sources, but to varying degrees, ONS expects to find missing values in some variables. For record linkage purposes, missing data are still of some use and clearly of more use than imputed data. ONS is therefore requesting its suppliers not to impute data into these sources. Despite this, with computer collection making it increasingly hard to miss data, cases are ‘forced’ through, for instance where benefit claimants are given the postcode of the Job Centre, or where members of some ethnic groups all have a recorded date of birth of 1 January. To guard against this type of risk, ONS will need to look at histograms of matching variables and consider whether the values recorded at a spike are genuine. Where it appears more likely that they are not, it will be better to recode these values as missing.
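As an illustration of this kind of check (the cut-off, the variable and the recoding rule are invented, and in practice the decision would follow clerical inspection of the histogram), spike values could be flagged as follows:

    from collections import Counter

    def spike_values(values, share=0.10):
        """Return values whose relative frequency exceeds an (illustrative) cut-off;
        these become candidates for inspection and possible recoding to missing."""
        counts = Counter(v for v in values if v not in (None, ""))
        total = sum(counts.values())
        return {v for v, c in counts.items() if c / total >= share}

    # e.g. if a large share of recorded dates of birth were "1901-01-01", the value would be
    # flagged, inspected and, if judged to be a forced default, recoded to missing:
    #   cleaned = [None if d in spike_values(dobs) else d for d in dobs]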

4. Research on Record Linkage Methods

Within ONS there is a Methodology Directorate (MD) to provide the technical foundation for the production and analysis of official statistics, one part of which is currently providing methodological support for the planning of the Beyond 2011 record level model. This work is currently concentrating on the treatment of missing values, the advantages and risks involved with making the conditional independence assumption, evaluating possible different matching methods (e.g. deterministic, probabilistic), and evaluating software packages that might be used to carry out the linkage for the model.

4.1 Missing values
A search of the literature reveals that under standard probabilistic methods, missing values can be catered for by leaving any variable containing a missing value out of the computation of the total weight for the record pair in which the missingness occurs. This approach, recommended in Hand & Yu (2001)3 and used in Tromp et al. (2008), is implemented in the matching of the 2011 Census to the CCS. Under the standard Fellegi-Sunter model of record linkage, which assigns a positive or negative individual weight to a variable according to whether there is agreement or disagreement, this is equivalent to assigning a zero weight to a variable where a missing value occurs in the pair.

Other methods can also cater for missing values. The software FRIL, for instance, employs user-allocated and automatically tuned weights for matching variables. Each of these weights is multiplied by a ‘score’ representing the agreement between the values in the two data sets, the score being a number in the interval [0,1]. Where missing values occur, the score can be set by the user. The default missing value score is 0.5.

3Hand & Yu is not a paper about record linkage per se, but a paper on supervised classification methods. Record linkage is a special case of such methods, which seeks to classify record pairs into two or three classes. The authors claim that this approach to missing values is consistent with making the conditional independence assumption, but not consistent with accounting for dependencies.


4.2 The Conditional Independence Assumption (CIA)
The CIA is that the joint distributions of the agreement statuses of the match variables, both conditional on a record pair being a match and on it being a non-match, are independent. Failures of the CIA among matches can arise, for instance, when one of the sources is collected using optical character recognition on a handwritten return. Since respondents with poor handwriting can give rise to errors in a number of matching variables, agreement/disagreement with another source on these variables is positively correlated.

Failures among non-matches can be more pronounced. In the UK, fashions in the naming of babies change considerably over time. Hence if two different individuals agree on year of birth, the probability that they also agree on first name is increased. Clearly, if sex is used as a matching variable, first name has dependency on it.

The classical record linkage model uses the CIA, as also do methods such as those employed by FRIL. MD has conducted a review of the literature on the use of the CIA in matching, with the particular aim of discovering evidence from empirical studies on the sensitivity of the accuracy of the matching outcome4 to failures in the CIA.

Hand & Yu discuss the independence model in the field of supervised classification methods. They find that this approach, despite being based on a model that is clearly unrealistic in most cases, has a long and successful history. They argue that the main reasons for this are: the simplicity of the model means that it requires fewer parameters to be estimated than alternative methods, resulting in a lower variance for the estimates; that the model may not give accurate probability estimates but these are not needed for classification as all that needs to be preserved is rank order; and that in real problems, variables typically undergo a selection process before being combined to yield a classification, resulting in a tendency towards using only weakly correlated variables.

The review found some papers reporting empirical findings on the performance of the independence model in record linkage. These are marked ^ in the references section. Several of these focus on designing methods to model the dependencies, and it is therefore not surprising that they have found that accounting for dependencies, where they exist, results in improved matching. Some (e.g. Schürle, 2003, using street name, postcode and district in the Berlin telephone directory; Tromp et al., using child's expected date of birth and child's actual date of birth in matching perinatal data) use highly correlated variables in coming to these conclusions. Exceptionally, Sharp (2011) investigates the use of only moderately correlated variables and finds that the independence model yields a better performance than one that accounts for all dependencies.

The review found that the conclusion of Winkler (1999) is still valid; matching quality is improved using dependence methods but it has not been demonstrated that accounting for dependencies is assured to yield appropriately good quality matching in actual record linkage software on a day to day basis.

4 By ‘matching outcome’ we mean the classification of the pairs into matches and non-matches. At the moment, MD is less interested in how sensitive the underlying statistical model is to failures in the CIA.

Tromp et al make an interesting point about dependencies between variables in the non-matches. They note that these usually arise from a latent factor, in their case the timing of the pregnancy. Matching administrative data to provide a population spine will encounter similar, if less severe, situations. For instance, where ethnic minorities exist in the population and choose names from a different name pool to the majority population, dependency between first name and last name will arise.

4.3 Synthetic data
MD has created synthetic data sets for use in record linkage research, training and software evaluation. These were first used for the on-the-job training course provided by the Data Integration ESSnet in January 2011. The data sets created are a ‘truth’ data set and three others meant to simulate the 2011 Census and contemporaneous versions of the PRD and CIS. The truth data set contains approximately 25,000 records of individuals. The other three data sets are large subsets (90 or 95 per cent) of the truth set with errors, some of which are correlated, introduced into the matching variables. The errors include replacing values by blanks, as missing values are known to be a problem.

The truth data set is built up in layers. Initially a district is chosen, then postcodes within it. Next, streets and street numbers are allocated from a set of street names. Each address is then populated with a household, with the number of persons being randomly generated. In most cases all members of the household are allocated the same last name. Finally, first names and dates of birth are allocated to the persons, with controls over the year of birth to make the population profile and the structure of households realistic. A drastically simplified sketch of this layered construction is given below.
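The following Python fragment is purely illustrative: the names, streets, proportions and error rates are invented, and the real data sets are built with far more control over household structure and the distribution of years of birth.

    import random

    random.seed(1)
    STREETS = ["HIGH STREET", "STATION ROAD", "CHURCH LANE"]
    SURNAMES = ["SMITH", "JONES", "TAYLOR", "BROWN", "WILSON"]
    FIRST_NAMES = ["JOHN", "MARY", "DAVID", "SARAH", "EMMA", "JAMES"]

    def make_truth(n_postcodes=50, district="AB1"):
        """Layered 'truth' file: postcode -> address -> household -> persons."""
        people = []
        for p in range(n_postcodes):
            postcode = f"{district} {p:02d}X"
            for house_no in range(1, random.randint(5, 15)):
                street = random.choice(STREETS)
                surname = random.choice(SURNAMES)        # household members usually share a surname
                for _ in range(random.randint(1, 4)):
                    people.append({
                        "postcode": postcode,
                        "address": f"{house_no} {street}",
                        "last_name": surname,
                        "first_name": random.choice(FIRST_NAMES),
                        "dob": f"{random.randint(1930, 2010)}-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}",
                    })
        return people

    def corrupt(truth, keep=0.95, error_rate=0.05):
        """Take a ~95% subset of the truth and introduce blanks/errors into the matching variables."""
        out = []
        for rec in truth:
            if random.random() > keep:
                continue                                 # record missing from this source
            rec = dict(rec)
            if random.random() < error_rate:
                rec["first_name"] = ""                   # simulate a missing value
            if random.random() < error_rate:
                rec["dob"] = rec["dob"][:8] + "01"       # simulate a recording error in the day
            out.append(rec)
        return out

    truth = make_truth()
    synthetic_prd = corrupt(truth)
    synthetic_cis = corrupt(truth)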

4.4 Deterministic methods (rule-based matching)
ONS has a long tradition of employing deterministic methods of matching, most notably in updating the Longitudinal Study (an anonymised research database based on a sample of persons, having one of four specified birthdays, extracted from the Census and updated using birth and death records) from the latest Census. MD prefers the phrase used by the Relais developers: rule-based matching.

A rule-based match consists of a number of sub-rules. A sub-rule states a condition that the record pair must satisfy in order to be classified as a match, and may consist of several conditions that must hold at once; these conditions are separated by an “AND” operator. The different sub-rules are separated by an “OR” operator5. Schemes for rule-based matches can be very complicated and, to aid understanding, are often illustrated by a flow diagram.
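As a minimal illustration (the variables and sub-rules are invented, not those used in the Longitudinal Study), a rule-based match of this kind can be written as:

    def rule_based_match(a, b):
        """Two illustrative sub-rules joined by OR; the conditions within each sub-rule are joined by AND."""
        rule1 = (a["last_name"] == b["last_name"] and a["dob"] == b["dob"]
                 and a["postcode"] == b["postcode"])
        rule2 = (a["last_name"] == b["last_name"] and a["first_name"] == b["first_name"]
                 and a["address"] == b["address"])
        return rule1 or rule2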

MD uses the term ‘score-based matching’ to draw a contrast with rule-based matching. This term is meant to cover any method whereby a weight is allocated to each matching variable; the weight is multiplied by a factor on the interval [0,1] which represents agreement status on that variable for the record pair to provide a score for the variable; and the variable scores are summed to provide a total score for the record pair.
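A minimal sketch of score-based matching in this sense is given below; the weights, comparators and example pair are invented, and the treatment of missing values (a user-chosen score, 0.5 being the FRIL default mentioned in section 4.1) is one possible choice among several.

    def record_pair_score(rec_a, rec_b, weights, scorers, missing_score=0.5):
        """Weighted sum of per-variable agreement scores in [0, 1]."""
        total = 0.0
        for var, weight in weights.items():
            va, vb = rec_a.get(var), rec_b.get(var)
            if va in (None, "") or vb in (None, ""):
                score = missing_score                  # missing value: user-chosen score
            else:
                score = scorers[var](va, vb)           # any comparator returning a value in [0, 1]
            total += weight * score
        return total

    # illustrative use: exact agreement scored 1.0, disagreement 0.0
    exact = lambda a, b: 1.0 if a == b else 0.0
    weights = {"last_name": 5.0, "dob": 4.0, "postcode": 3.0}
    scorers = {v: exact for v in weights}
    score = record_pair_score(
        {"last_name": "SMITH", "dob": "1970-03-14", "postcode": "AB1 2CD"},
        {"last_name": "SMITH", "dob": "", "postcode": "AB1 2CD"},
        weights, scorers)
    # the pair is declared a match if the score exceeds a chosen threshold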

At first sight, there seems to be an obvious principle: for any rule-based match there exists a score-based match that will make the same matches, and if it also makes other matches, these will be as good in quality as those made by the lowest quality sub-rule.

5 If desired, the sub-rules could be structured in such a way that they are mutually exclusive.

But actually this only applies under the CIA. Indeed, it is possible to construct a highly artificial example with dependent variables as a counter-example.

MD has undertaken a piece of empirical work to discover whether the principle applies in realistic data sets. Using the PRD and CIS sets in the synthetic data a rule-based match was devised for matching them, and twice refined to improve its performance in the light of results. The ‘true’ match status can be readily checked in the synthetic data, and performance is measured simply by the number of true matches made minus the number of false matches made.

A score-based match was then derived from the refined rule-based match, allocating weights to the matching variables and setting a threshold in such a way that all the matches made by the rule-based match must be made by the score-based match. Table 1 shows that the extra matches made by the score-based match include more true than false links and that therefore overall performance is improved. While carrying out this work it became apparent that a score-based match with a higher threshold performed better still: although it did not make all the matches made by the rule-based match it made both more true matches and fewer false matches. This is also shown in table 1.

Table 1: Performance of rule-based and score-based matching options in matching synthetic PRD and CIS data sets1

                     Rule-based match   Score-based match2   Score-based match
                                        (threshold = 5)      (threshold = 6)
True matches made    21,298             21,704               21,508
False matches made   145                524                  112
Difference3          21,153             21,180               21,396

1 Total number of true matches = 22,860
2 Formulated to make all the matches made by the rule-based match
3 Simple measure of performance of the matching method

To complete this research work the data sets were made more of a challenge for score- based matching by increasing the dependence between some variables. In matches, this was done by making errors or blanks occur simultaneously in some records for day, month and year of birth. In non-matches, the population was divided into three different artificial ethnicities, distinguished by having distinct sets of names to choose from for first and last name. The first names in the majority ethnicity were further subdivided into three sets and assigned respectively to three different age cohorts of the population. This has given ONS a new and perhaps more useful set of data for record linkage research. The above exercise will now be repeated on the new PRD and CIS data sets.

4.5 Score-based and probabilistic matching
The Fellegi-Sunter method for record linkage is one type of score-based matching, where the weight is the logarithm of the ratio of the m-probability to the u-probability, and the multiplying factors are simply 1 for agreement and 0 for disagreement. Nadeau et al., 2006, describe probabilistic matching in a loose sense as allowing for ways other than probabilities to determine the weight for each variable, and MD's use of the term score-based matching has the same intention.
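A minimal sketch of this scoring (with invented m- and u-probabilities, binary agreement, and missing values contributing zero weight, as described in section 4.1) is:

    from math import log2

    def fs_weight(agree, m, u):
        """Individual Fellegi-Sunter weight for one matching variable."""
        if agree is None:                     # missing value in the pair: no contribution
            return 0.0
        return log2(m / u) if agree else log2((1 - m) / (1 - u))

    def fs_score(agreements, m_probs, u_probs):
        """Total matching score: the sum of the individual weights over the variables."""
        return sum(fs_weight(a, m, u) for a, m, u in zip(agreements, m_probs, u_probs))

    # illustrative m/u probabilities for (last name, date of birth, postcode);
    # agreement pattern for the pair: agree, missing, disagree
    print(fs_score([True, None, False], [0.95, 0.97, 0.90], [0.01, 0.005, 0.02]))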

MD has recently started another piece of research work to compare matching under the Fellegi-Sunter model with the alternative strategies that still use the concept of weights, which are grouped under the description of the ‘allocated weights method’. Part of this will be an information-gathering exercise. Box 1 lists the main perceived advantages and disadvantages of the two methods. For example, the Fellegi-Sunter model, or the software packages currently developed for implementing it, appear to be restricted to handling only binary levels of variable agreement, thus discarding useful information, while the alternative methods appear to be based on no model at all. The research will aim to find out to what extent these perceptions are true, how important the advantages and disadvantages are in practice, and whether there are work-arounds for the disadvantages.

Box 1: Comparison of perceived advantages and disadvantages of two different types of score-based match

Fellegi-Sunter method
Advantages:
1. Gives a framework for estimating the parameters
2. Parameters can be tuned using the EM algorithm
3. Model can be used for error estimation
4. Can cater for missing values when the CIA is made
Disadvantages:
1. Double the number of parameters to estimate in the model makes for more uncertainty
2. Restricted to binary agreement values

Allocated weights method
Advantages:
1. Only half the number of parameters need be estimated, one for each variable instead of two
2. Parameters can be tuned (e.g. in the FRIL software)
3. Can cater for missing values in a way that allows user flexibility
4. Can flexibly cater for partial agreements
Disadvantages:
1. Initial parameter estimation is at best expert opinion and at worst guesswork
2. Lacks an underlying statistical model

Finally, research is planned to see if the two methods converge under certain conditions. A software package that uses the Fellegi-Sunter method and the FRIL software could both be set to match the synthetic PRD and CIS data sets: both would use the same set of matching variables; both would use only binary agreement values; both would use the same blocking strategy; and both would use the EM algorithm to tune their parameters. The two methods would do the same job of classification into matches or non-matches if they imposed the same rank order on the pairs. If they differed, the distribution of the true matches on the ordinal scale could be compared to determine whether either method had a superior performance in this experiment.

References

Fellegi, I. P. & Sunter, A. B. (1969) A Theory for Record Linkage, Journal of the American Statistical Association, 64(328), 1183-1210.
Hand, D. J. & Yu, K. (2001) Idiot's Bayes: Not So Stupid After All?, International Statistical Review, 69(3), 385-398.
Jurczyk, P. (2009) FRIL: Fine-grained Integration and Record Linkage Tool V3.2: Tutorial. Available at http://fril.sourceforge.net/. Copyright: Emory University, Math&CS Department, 2009.
Nadeau, C., Beaudet, M. P. & Marion, J. (2006) Deterministic and Probabilistic Record Linkage, Proceedings of Statistics Canada Symposium 2006: Methodological Issues in Measuring Population Health. Available at: http://www.statcan.gc.ca/pub/11-522-x/2006001/article/10404-eng.pdf.
Office for National Statistics (2010) papers on the Census to CCS matching are available from the presenter on application by email.
Office for National Statistics (2011) Census Roadshows – September 2011. Available on request from [email protected].
Office for National Statistics (2011) Beyond 2011: Administrative data sources and low-level aggregate models for producing population estimates (presented at the Annual Conference of the British Society for Population Studies 2011).
^Schürle, J. (2003) A method for consideration of conditional dependencies in the Fellegi and Sunter model of record linkage, Statistical Papers, 46, 433-449.
^Sharp, S. (2011) The Conditional Independence Assumption in Probabilistic Record Linkage Methods (presented at the sixteenth GSS Methodology Symposium). Edinburgh: National Records Scotland.
^Thibaudeau, Y. (1989) Fitting Log-Linear Models in Computer Matching, Proceedings of the Section on Statistical Computing, American Statistical Association, 283-288.
^Thibaudeau, Y. (1993) The Discrimination Power of Dependency Structures in Record Linkage, Survey Methodology, 19(1), 31-38.
^Tromp, M., Méray, N., Ravelli, A. C., Reitsma, J. B. & Bonsel, G. J. (2008) Ignoring Dependency between Linking Variables and Its Impact on the Outcome of Probabilistic Record Linkage Studies, Journal of the American Medical Informatics Association, 15, 654-660.
^Winkler, W. E. (1989) Methods for Adjusting for Lack of Independence in an Application of the Fellegi-Sunter Model of Record Linkage, Survey Methodology, 15(1), 101-117.

Integrating registers: Italian business register and patenting enterprises

Daniela Ichim, Giulio Perani, Giovanni Seri ISTAT, The Italian National Statistical Institute Via C. Balbo, 16, 00184, Rome, Italy {ichim,perani,seri}@istat.it

Abstract: The paper describes the record linkage scheme followed at the Italian national statistical institute to match micro-data on patent applications from the international database PATSTAT with the data available from the Italian Official Business Register (ASIA). The target data in PATSTAT are the applicants based in Italy registering patent/s in the period 1985-2010. Patent applicants can be ‘individuals’ or ‘establishments’. In this last category we aim at identifying business enterprises that were active (as recorded in ASIA) in the period 1989-2008. The desired output of the linkage process is, for each patenting enterprise, a pair composed of the ‘applicant identification code in PATSTAT’ and the ‘enterprise identification number in ASIA’. The latter allows access to the repositories of official statistical data and, therefore, the linking of economic data to patenting enterprises. Statistical analyses, such as identifying the factors behind patenting propensity or evaluating the impact of patenting on enterprise profitability, can then be performed. On the methodological side, the linkage of patent data has to rely on the applicants' names. Consequently, a great effort has been put into the pre-processing phase to standardise the applicant/enterprise names and to extract the ‘legal form’ from the name string. During the linkage process, two practical problems were faced: the reduced number of comparison variables and the huge size, in terms of number of records, of the Italian Business Register. These issues were addressed within a rule-based deterministic record linkage approach. In this paper, together with the results obtained, we illustrate the main features of the sequential searching and linkage methodology we adopted.

Keywords: patents, business register, deterministic record linkage

1. Introduction

The paper describes the record linkage scheme followed at the Italian national statistical institute (Istat) to match micro-data on patent applications from the PATSTAT database with the data available from the Italian Official Business Register (ASIA), as a preliminary stage of a project aiming, mainly, at monitoring and profiling Italian patenting enterprises. The target data in PATSTAT are the applicants based in Italy registering patent/s in the period 1985-2010. Patent applicants can be ‘individuals’ or ‘establishments’. In this

last category we aim at identifying business enterprises that were active (as recorded in ASIA) in the period 1989-2008. The linkage output would be, for each patenting enterprise, a pair composed of the ‘Applicant Identification Number in PATSTAT’ and the ‘Enterprise Identification Number in ASIA’. The latter allows access to the repositories of official statistical data and, therefore, the linking of Istat economic data to patenting enterprises. For example, factors influencing the patenting propensity of enterprises might be studied, as well as the economic impact of patenting activity. On the methodological side, the linkage of patent data has to rely on the applicants' names. Consequently, a great effort has been put into the pre-processing phase to standardise the applicant/enterprise names and to extract the ‘legal form’ from the name string. During the linkage process, two practical problems were faced: the reduced number of comparison variables and the huge size, in terms of number of records, of ASIA. These issues were addressed within a rule-based deterministic record linkage approach. In this paper, we illustrate the main features of the adopted sequential searching and linkage methodology. The paper is organised as follows. In section 2 a description of the ASIA and PATSTAT databases is provided. In section 3, details on the record linkage methodology as applied to these particular datasets are reported; the emphasis is put on search space reduction methods, due to the small number of comparison variables and the huge amount of data. In section 4, some preliminary results are shown. In the last section, some conclusions and ideas for further improvements are given.

2. Registers: Italian business registers and patenting persons

A patent is an exclusive right granted by an authorized patent office for an invention, which is a product or a process providing a new (technical) solution to a problem. A patent provides protection for the invention to the owner of the patent. The first step in securing a patent is the filing of a patent application, which involves three main actors: the inventor, the owner and the applicant. The EPO database “Worldwide Patent Statistical Database”, called PATSTAT, is probably the most complete and up-to-date database on patents and patent applications. PATSTAT is updated twice a year and contains 20 tables organized as a relational database with more than 70 million records from over 80 countries. In this work only the two tables depicted in Figure 1 are considered. The link between them is given by the unique values of the field Applicant Identification Number, AIN. The AIN also contains the year of registration of the patent. The time period covered by the database is 1985-2010. PATSTAT registers both the inventor and the applicant name; only the latter was used in this work. There is no explicit database field concerning the legal form of the inventor, owner or applicant; the possible legal form has therefore been extracted from those names. For the applicant, PATSTAT also registers its address (street, city, postal code) and its country code. Only applicants based in Italy, i.e. COUNTRY_CODE = “IT”, were selected. In this work, the postal code was used as geographical location, assuming it has the same accuracy as the address. For the patent, PATSTAT registers its IPC (International Patent Classification) and its application and publication numbers. It is worth noting that a patent may have more than one IPC code assigned. It should also be stressed that there is no formal, well-defined relationship between IPC codes and the principal economic activity classification (NACE). Additional details on PATSTAT may be found at www.epo.org.

Figure 1: Database tables used from PATSTAT (COUNTRY_CODE = "IT"). Table (1) Applications: AIN (by year), Publication number, International Patent Classification (IPC). Table (2) Applications: AIN (by year), Publication number, Applicant name, Applicant code, Postal/Zip code, Applicant country.

Applicants may be individuals or establishments. The latter, according to the Frascati Manual (OECD, 2002), can be: business enterprises, public institutions, non-profit institutions, and private or public universities. In this work, the identification of patenting business enterprises is addressed. For enterprises, the Istat business register ASIA is considered. ASIA is developed, updated and maintained through the statistical integration of different administrative sources (Tax Register, Social Security Register, etc.), covering the entire population of enterprises in industry and services. Among the variables included in ASIA are: a) the Enterprise Identification Number, EIN (an Istat internal and unique identification number allowing linkage to any economic information on the same unit collected by Istat); b) Enterprise Name; c) Zip Code; d) NACE code; e) geographical information (address, municipality, province, region); f) Legal form. It has to be observed that the ASIA and PATSTAT variables overlap only on Enterprise Name and Zip Code. Only enterprises active in the period 1998-2008 have been analyzed (the size of ASIA varying from 3.8 to 4.5 million records). Considering that enterprises showing a high research and innovation propensity could have a higher patenting propensity, a preliminary investigation has been conducted on the list frame of the Research and Development survey (a subset of ASIA).

3. Development of a record linkage process

PATSTAT counts 299769 applications based in Italy and identified by an AIN. The number of non-duplicated application numbers reduces to 72037. To each AIN in the PATSTAT database, an applicant name and the Zip Code are assigned. Additional variables may be derived from this information: year of application, year of first/last application by applicant, number of patent applications filed by each applicant, region of residence of the applicant, etc. The variable Applicant Name has been subject to the following standardisation operations:
1. transformation of all letters into upper case;
2. removal of punctuation (accents, symbols and special characters, double spaces, dots);

3. standardisation of known abbreviations (e.g. we found about 150 ways to say "in short");
4. standardisation of the most frequent words using equivalence lists, via a deterministic record linkage procedure in Relais, see Istat (2011):
   a) input files: a file of words with frequencies greater than 1000; a file of words with frequencies greater than 100 but smaller than 1000;
   b) parameters: comparison function = "Edit distance"; threshold = 0.8; greedy algorithm to perform the one-to-one assignment;
   c) output check: the word pairs declared "match" were subject to a clerical review;
   d) output: 122 matched pairs standardized, generally concerning singular/plural forms or Italian/English translations (for example: SERVICES/SERVIZI);
5. removal of duplicated words in the same name;
6. ordering of words in alphabetical order;
7. identification and standardisation of the legal form, stored in a variable called Legal Form. About 80 ways of expressing 6 main standardized legal forms were identified.

A code sketch of the main standardisation steps is given after Table 1. In Table 1, the distribution of the variable Legal Form is shown. Around 40% of the records have no legal form, while the majority (about 56%) is concentrated in the "LTD" categories (SPA and SRL). The same pre-processing was applied to ASIA. The resulting variable is called Standardized Name and has been used as comparison variable together with Zip Code and Legal Form. The PATSTAT data file has been de-duplicated (removing records having simultaneously the same values for the three comparison variables); thus, the number of records reduced from 72037 to 23833.

Table 1: Distribution of Legal Form, PATSTAT database

Legal Form   Frequency       %
(none)            8979   37.67
COOP                63    0.26
SAS                501    2.10
SNC                756    3.17
SPA               6164   25.86
SRL               7370   30.92
Total            23833  100.00
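A minimal R sketch of the kind of name standardisation described in the steps above (upper case, punctuation removal, legal form extraction, duplicate removal, word ordering); the legal-form patterns and the example name are illustrative placeholders, not the lists actually used at Istat.

```r
# Minimal sketch of the name standardisation steps described above.
# The legal-form patterns and the example name are illustrative placeholders.
legal_forms <- c("S ?R ?L" = "SRL", "S ?P ?A" = "SPA",
                 "S ?N ?C" = "SNC", "S ?A ?S" = "SAS")

standardise_name <- function(name) {
  x <- toupper(name)                            # step 1: upper case
  x <- chartr("ÀÈÉÌÒÙ", "AEEIOU", x)            # step 2: drop accents
  x <- gsub("[[:punct:]]", " ", x)              # step 2: drop punctuation
  x <- gsub(" +", " ", trimws(x))               # step 2: collapse spaces
  form <- ""                                    # step 7: extract the legal form
  for (pat in names(legal_forms)) {
    rx <- paste0("\\b", pat, "\\b")
    if (grepl(rx, x)) {
      form <- legal_forms[[pat]]
      x <- gsub(" +", " ", trimws(gsub(rx, " ", x)))
    }
  }
  words <- sort(unique(strsplit(x, " ")[[1]]))  # steps 5-6: de-duplicate and sort words
  list(std_name = paste(words, collapse = " "), legal_form = form)
}

standardise_name("Rossi & Bianchi s.r.l.")
# $std_name: "BIANCHI ROSSI"   $legal_form: "SRL"
```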

The linkage output should be the pair AIN (PATSTAT) – EIN (ASIA). The latter allows the linkage of structural and economic information stemming from Istat official surveys to patenting enterprises. In PATSTAT, the Applicant Name is missing in 40 records, while the Zip Code is missing in about 10% of records. Besides the missing value problem, the variable Zip Code in PATSTAT also presents about 9.4% of values representing the geographical location only at an aggregated level.

3.2 Search space reduction

Due to the size of ASIA, the number of candidate matching pairs is huge and the use of search space reduction techniques has been necessary. In this section details on the search space reduction techniques applied to PATSTAT and ASIA are given. Moreover, a blocking technique by neighbourhoods of words is introduced (a code sketch of this idea, combined with the comparison rule of Section 3.3, is given at the end of Section 3.3). Some classical blocking techniques based on the patent year or on the 2-digit Zip Code were not effective; these are not further detailed here.

PATSTAT was reduced in order to contain only units probably representing enterprises. A list of Italian first names, containing about 1600 records, was used. From the PATSTAT database, we removed those records whose Standardized Name simultaneously satisfies the following conditions: a) it contains an Italian first name; b) it has an empty Legal Form; and c) it does not contain special words indicating a business activity (e.g. enterprise, systems, etc.); about 63 such special words were identified. PATSTAT was then divided in two parts: 7700 records considered non-enterprises and 16132 records declared enterprises. The record linkage process was applied to the latter.

The eleven datasets of ASIA (1998-2008) were prepared in such a way that an active enterprise is included only once in their union. ASIA 2008 was the most complete and updated version. In the union of the different waves of ASIA, except for 885 records (out of more than 7 million), the Zip Code is always registered with 5 digits. Due to the huge computational burden, ASIA 2008 was divided in three parts: a) enterprises with more than 10 employees; b) enterprises with 1-9 employees and non-empty Legal Form; and c) enterprises with less than 1 employee and non-empty Legal Form.

None of the comparison variables was considered reliable enough to be used as a blocking variable. The idea of a neighbourhood of words was then introduced. For a pair of records, it was assumed that a necessary matching condition is that their Standardized Names share at least one exact word; that is, it was assumed that at least one word is registered correctly. Then, for each PATSTAT record, the list of words forming its Standardized Name was found. Next, for each such word, the list of enterprises in ASIA containing it was identified. The union of these lists of enterprises was named the Neighbourhood of the Standardized Name under consideration. If an exact match on Standardized Name exists, it must belong to this Neighbourhood. For each record in PATSTAT, the record linkage procedure was applied using the Neighbourhood as blocking variable. Blocking by Neighbourhood allows us to divide the search space into a huge number of much smaller search spaces. Obviously, the number of search spaces equals the number of records in PATSTAT, and RELAIS can deal with many search spaces in an automatic manner. Each search space has a reduced size: the maximum size of such search spaces equals 15570, a very reasonable size to deal with in record linkage problems. By construction, each Neighbourhood contains at most one correct link. For this reason, and because of the dependency between the Neighbourhood and the Standardized Name variables, this blocking procedure, as defined here, could hardly be used in a probabilistic record linkage (lack of independence). Names whose longest word has fewer than 2 characters were excluded from the search space creation, as they could create huge Neighbourhoods (as very common words can also do). Moreover, it might happen that some Standardized Names have an empty Neighbourhood; this is generally the case for Standardized Names made of a single word (if, for example, such words are registered differently in PATSTAT and ASIA). Of course, neighbourhoods could also be defined by an approximate matching of at least one word (e.g. using a similarity distance instead of exact equality).

3.3 Deterministic record linkage

Even if the Neighbourhood was used as blocking variable, a similarity criterion between Standardized Names was used to give an overall measure of the records' similarity. A compound deterministic rule was used: at least one of the following string comparators should be greater than 0.8: Jaro; Levenshtein; Jaro-Winkler; Dice; 3-grams; equality rule¹.
The selection of the unique links was performed using a greedy solution implemented in RELAIS. Equal weights for all rules were always used. Finally, the pairs declared matches were subject to a clerical review.

¹ Details on the implementation of these comparison functions may be found in the RELAIS manual.
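A minimal R sketch of the 'blocking by neighbourhood of words' idea and of a compound similarity rule of the kind described above; the stringdist package is used as a stand-in for the comparison functions implemented in RELAIS (Jaro-Winkler and Dice are omitted for brevity), and the two small data frames are made-up stand-ins for PATSTAT and ASIA.

```r
# Sketch of 'blocking by neighbourhood of words' plus a compound similarity rule.
# stringdist is a stand-in for the RELAIS comparison functions; data are made up.
library(stringdist)

asia <- data.frame(ein = 1:4,
                   std_name = c("BIANCHI MECCANICA", "BIANCHI ROSSI",
                                "SISTEMI VERDI", "COSTRUZIONI ROSSI"),
                   stringsAsFactors = FALSE)
patstat <- data.frame(ain = c("A1", "A2"),
                      std_name = c("BIANCHI ROSSI", "SISTEMI SOLARI VERDI"),
                      stringsAsFactors = FALSE)

words_of <- function(s) unique(strsplit(s, " ", fixed = TRUE)[[1]])

# Inverted index: word -> ASIA rows whose standardised name contains that word
index <- list()
for (r in seq_len(nrow(asia)))
  for (w in words_of(asia$std_name[r]))
    index[[w]] <- c(index[[w]], r)

neighbourhood <- function(name) {
  cand <- unlist(index[intersect(words_of(name), names(index))])
  sort(unique(cand))
}

# Compound deterministic rule: declare a link if at least one similarity exceeds 0.8
is_link <- function(a, b, threshold = 0.8) {
  sims <- c(stringsim(a, b, method = "jw"),            # Jaro
            stringsim(a, b, method = "lv"),            # normalised Levenshtein
            stringsim(a, b, method = "qgram", q = 3),  # 3-grams
            as.numeric(a == b))                        # equality rule
  any(sims > threshold)
}

for (p in seq_len(nrow(patstat)))
  for (r in neighbourhood(patstat$std_name[p]))
    if (is_link(patstat$std_name[p], asia$std_name[r]))
      cat("AIN", patstat$ain[p], "-> EIN", asia$ein[r], "\n")
# The pair (A1, EIN 2) is declared a link by the equality rule; the remaining
# candidate pairs are kept or discarded according to their similarity scores.
```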

4. Results and exploration of possible analyses

At this stage, the number of "correct" links found is 13526 out of 16132, i.e. 84%. By "correct" link we mean a (non-duplicated) AIN – EIN pair; the pairs declared links have been clerically classified as "correct", "possible links" (possibly subject to a more detailed and sophisticated clerical review) or "false" (discarded). Even if the pairs are non-duplicated, some of them may represent duplications of applicants (more than one applicant may be linked to the same enterprise). This situation may happen when a multi-patenting applicant has been registered with different names in different applications and the standardisation process does not compensate for these differences. To assess the quality of the results, a short experiment has been conducted on a set of 190 codes randomly selected from the Espacenet web database (the AIN field has been used to download patent information from the EPO website²). We found 5 mismatches out of 190 records (about 2.6%). This means that, even when the available standardised information coincides in the two sources, it is not possible to guarantee a 100% exact link, because of very similar (or common) names. Other possible sources of misclassification that should be taken into account when checking the quality of the linkage process are: enterprises belonging to the same enterprise group often register their patents with similar names; and the changes occurring to enterprises during their life (changes of address, legal form, etc.).

Table 2: Distribution of patenting enterprises which are active in 2009.

NACE       1   [2-9]  [10-99]  [100+]   Total      %
28        77     255     1121     427    1880   20.6
25        38     130      493     142     803    8.8
46       155     292      287      34     768    8.4
22        23      76      327     124     550    6.0
...      ...     ...      ...     ...     ...    ...
Total   1251    1866     4202    1786    9105
%      13.74   20.49    46.15   19.62
% Pop  58.44   36.49     4.81    0.26

The structure of the patenting enterprises is the first joint analysis that could be performed once the linkage between patents and enterprises is found. In Table 2 the patenting enterprises still active in 2009 are reported by size class and NACE code. The 9105 patenting enterprises still active in 2009 are distributed over 73 NACE divisions. Only 25 of these divisions show a frequency greater than 100. Only the most frequent

² Ten applicant numbers can be downloaded per trial, up to a maximum of 200 AINs. Moreover, the information provided needs to be processed before use.

NACE divisions are shown in Table 2. We observe (last column of the table) that about 44% of the patenting enterprises are concentrated in NACE divisions 28, 25, 46 and 22, i.e. Manufacture of machinery and equipment n.e.c., Manufacture of fabricated metal products (except machinery and equipment), Wholesale trade (except of motor vehicles and motorcycles) and Manufacture of rubber and plastic products, respectively. From the last two rows of Table 2, we observe that, as expected, more than half of the patenting enterprises (65%) have a size greater than or equal to ten employees (the highest classes considered), while the population of enterprises with more than 10 employees represents only about 5% of the entire population of enterprises.

The second type of analysis is the study of some special subpopulations. As an example, one could analyse the structure of enterprises patenting in the biotech domain. This analysis is possible since information on the structure of enterprises (principal economic activity and/or number of employees) is available in the Istat business register, while information on biotech-related patents may be retrieved from the PATSTAT database. 204 enterprises among the 9105 enterprises active in 2009 applied for a biotech-related patent. These 204 enterprises are distributed over 28 NACE divisions, but two of them cover by themselves more than half of the biotech-patenting enterprises. These two NACE divisions are 21 and 72, i.e. Manufacture of pharmaceuticals, medicinal, chemical and botanical products, and Scientific research and development, respectively. The distribution of biotech-patenting enterprises is shown in Table 3.

Table 3: Distribution of biotech-patenting enterprises which are active in 2009.

NACE      1  [2-9]  [10-99]  [100+]  Total
21        2      1       14      36     53
72        9     20       18       3     50
20        0      6        8       7     21
46        0      4       11       4     19
...     ...    ...      ...     ...    ...
Total    20     44       71      69    204
%       9.8   21.6     34.8    33.8

As previously detailed, the PATSTAT database contains about 300,000 records related to the patent applications filed by Italian applicants. Once the link between enterprises and patent applications is established, it is possible to observe that about 90% of the applications are made by enterprises. The patent classification according to the IPC code is not related to the NACE classification; the IPC classification follows a hierarchical structure which is described at www.epo.org. The IPC-letter distribution of the roughly 300,000 applications of Italian enterprises is shown in Table 4. In Figure 2, the distribution of the IPC codes found by the record linkage process is shown in red, while the original distribution of the same IPC codes is shown in black.


Table 4: Distribution of the IPC of the Italian applications of enterprises.

IPC   Section                                                         %
A     Human Necessities                                              17
B     Performing Operations; Transporting                            24
C     Chemistry; Metallurgy                                          17
D     Textiles; Paper                                                 3
E     Fixed Constructions                                             4
F     Mechanical Engineering; Lighting; Heating; Weapons; Blasting   10
G     Physics                                                        10
H     Electricity                                                    15

Figure 2: Distribution of original IPC codes (black) and linked IPC codes (red)

5. Conclusions and future plans

In this paper we have illustrated the path followed at Istat in designing a linkage strategy to match micro-data on patent applications from PATSTAT with the business register ASIA. The overall aim of this project is to identify the Italian patenting enterprises and to characterise them through their economic information. In PATSTAT, the applicants resident in Italy and registering at least one patent in the period 1985-2010 have been considered. Patent applicants can be 'individuals' or 'establishments'. At this stage, the linkage process aimed at identifying, among the establishments, the business enterprises recorded in ASIA in the period 1989-2009. The overlapping information between the two archives that is reliable as matching variables in the linkage process mainly consists of the 'applicant names' and the 'postal code'. Moreover, the size of the business register ASIA, in terms of number of records, represents a computational problem to be faced. Therefore, a great effort has been put into the pre-processing phase to standardise the applicant/enterprise names, and some 'search space' reduction techniques have been adopted.

Among the latter, the 'blocking by neighbourhood' technique has proved particularly effective. Assuming that, for a given patenting enterprise, at least one word of the 'applicant name' (in PATSTAT) and of the 'enterprise name' (in ASIA) is correctly registered in both archives, the 'neighbourhood' of an applicant name is defined as the set of enterprises whose name contains at least one word equal to a word in the applicant name. The correct link for a given applicant has then been searched for within its neighbourhood. At this development stage, around 84% of the applicants classified as enterprises (13526 out of 16132) have been linked to an enterprise in ASIA. The next step will be to define the 'neighbourhood' on the basis of similarity between words instead of equality, in order to manage typing errors. Some further improvements might be obtained by using the address instead of the Zip Code. In future work, it would be desirable to classify the whole set of patenting establishments as business enterprises, public institutions, non-profit institutions, and private or public universities, according to the Frascati Manual (OECD, 2002). For applicants without a legal form, it is planned to use additional archives (such as the List of enterprise managers or the List of company partners). Finally, a probabilistic approach to the record linkage could be developed by using the R&D survey frame test set.

References

OECD (2002). Frascati Manual 2002: Proposed Standard Practice for Surveys on Research and Experimental Development. OECD, Paris.
Istat (2003). Metodi statistici per il record linkage, Metodi e Norme n. 16, anno 2003, a cura di Mauro Scanu.
Istat (2011). RELAIS – Record linkage at Istat, software and User's guide, available at: http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/

Linking Information to the Australian Bureau of Statistics Census of Population and Housing in 2011

Graeme Thompson Australian Bureau of Statistics, ABS House, 45 Benjamin Way, Belconnen ACT 2617, Australia, [email protected]

Abstract: The Australian Bureau of Statistics will be undertaking a suite of data integration projects linking ABS and non-ABS data to the 2011 ABS Census of Population and Housing. The process of undertaking the integration projects can be mapped to the Generic Statistical Business Process Model (GSBPM) to aid in discussions of developing statistical metadata systems and processes.

Keywords: data integration, business process, census, Australia

Views expressed in this paper are those of the author, and do not necessarily represent those of the Australian Bureau of Statistics. Where quoted, they should be attributed clearly to the author.

1. Aim of the Paper

The aim of this paper is to provide an understanding of how the Census Data Enhancement project will be linking both ABS and non-ABS data to the ABS Census of Population and Housing conducted in 2011, and how the GSBPM might be used for data integration projects.

2. Purpose

The functions of the Australian Bureau of Statistics as specified in the Australian Bureau of Statistics Act 1975 (ComLaw, 1975) include the maximum possible utilisation, for statistical purposes, of information, and means of collection of information, available to official bodies. Aligning with this function the Australian Statistician has set one of his key priorities for the organisation over the last 4 years to be “implementing a safe and effective environment for the use of, and integration of, microdata for statistical and research purposes”.

3. Census Data Enhancement (CDE)

A key project for the ABS is the Census Data Enhancement (CDE) project. This term is used to describe several projects which link ABS and non-ABS data to the ABS Census of Population and Housing.

Commencing with the 2006 Census, the ABS began the CDE project to enhance the value of the Census data by bringing it together with other datasets to leverage more information from the combination of individual datasets than is available from the datasets separately.


There are five major components to the 2011 CDE project:

1. Bringing together 2011 Census data with a small number of predetermined datasets during Census processing using name and address, for quality studies;
2. Bringing together 2011 Census data with a small number of predetermined datasets during Census processing using name and address, for statistical studies;
3. Wave 2 of a 5% Statistical Longitudinal Census Dataset (SLCD);
4. Bringing together the SLCD with other datasets without using name and address for statistical and research purposes; and
5. Bringing together 2011 Census data with other datasets without using name and address after Census processing.

A fundamental aspect of the CDE project is the management of confidentiality and privacy.

The ABS Census of Population and Housing is a cornerstone of official Australian statistics. The co-operation of respondents is critical in ensuring high quality statistical outputs. One measure that encourages respondent participation is the set of specific undertakings that the ABS makes regarding the Census around the destruction of Census forms and the deletion of name and address information once Census processing has been completed. Some of the CDE projects require the use of name and address for linking purposes, so these projects can only be completed while the Census is being processed (from Census night until approximately 15 months later). An undertaking has been given to the Australian public that linked files created using name and address will be deleted once their specified purpose has been met.

The CDE project was first proposed during the 2006 Census cycle. The ABS held extensive consultation (including a Discussion Paper: Enhancing the Population Census: Developing a Longitudinal View, 2006 (ABS, 2006a)) around the scope of the project and commissioned a Privacy Impact Assessment (Waters, 2005) by an independent body. The Australian Statistician then determined the scope of the CDE project for the 2006 Census cycle, and this was published on the ABS website as an Information Paper ABS Cat. No. 2062.0 (ABS, 2006b).

The scope of the CDE project for the 2011 Census cycle is only marginally changed from the 2006 cycle. As such, it was not considered necessary to undertake a new Privacy Impact Assessment, nor to repeat the extensive consultation that preceded the 2006 CDE project. However, the ABS did consult Privacy Commissioners in all jurisdictions, including the Federal Privacy Commissioner. A number of focus groups were held before the 2011 CDE project to assist the ABS in judging community attitudes to data linking, in particular to the ABS conducting linkage projects and specifically to linking data to the ABS Census. It is the Australian Statistician's position that the ABS should proceed with data linkage projects in line with community acceptance of conducting such linkage.

For full details of the 2011 CDE project, see the Information Paper – Census Data Enhancement Project: An Update, October 2010 (ABS, 2010a).

4. The (statistical business) process of data linking

The data linking process can be mapped to the Generic Statistical Business Process Model (GSBPM) as approved by the METIS Steering Group of the United Nations Economic Commission for Europe (UNECE, 2009). In this paper the focus will be on some of the relevant phases of the GSBPM and how they relate to the 2011 CDE project.

Figure 1 Generic Statistical Business Process Model

In the figure above, there are nine phases (Specify Needs through to Evaluate), and each phase has a number of sub-processes. The following sections map the GSBPM to a data linking project using the CDE experience as an example.

5. Phase 1: Specify Needs

1.1 Determine needs for information There are a number of projects within the CDE umbrella, and each of these has been approved based on a recognised need for information. The understanding of needs is based on extensive consultation that the ABS undertakes with stakeholders. Projects will only be undertaken where there is a clear public benefit.

1.2 Consult and confirm needs The ABS has an extensive ongoing consultation process with stakeholders. An important part of the consultation process is a range of user groups convened by the ABS to assist in determining data needs; a list of these groups is available in the ABS Annual Report (ABS, 2011).

1.3 Establish output objectives The output objectives were defined in the Information Paper (ABS, 2010a), including details of retention policies and availability of access for people outside the ABS.

1.4 Identify concepts Concepts are based on the existing metadata available for the source data. Work in this sub-process for CDE is largely around alignment of concepts from the different sources that are to be linked, and updating existing metadata where transformations are applied to data sources; for example, occupation codes may differ depending on the classification used in coding on different data sources.

1.5 Check data availability The basis for the CDE project is Census data. This data can become available (internally in the ABS) progressively as the Census is being processed. A critical component of the Census data necessary for many of the CDE linking projects is the availability of name and address. As discussed earlier name and address data is only available for a limited time.

Access to other data to be linked to the Census data needs to be negotiated with the custodians of the data. These custodians can be a single institution, or distributed across the States and Territories that make up the Australian Commonwealth (e.g. Registrars of Births, Deaths and Marriages).

The Australian Government is building a governance structure for integration of Commonwealth data in a safe and effective environment, see the National Statistical Service website for more details (Cross Portfolio Data Integration Oversight Board, 2011).

1.6 Prepare business case Business cases (and other project management documentation) have been prepared for CDE projects. In 2006 focus groups and the PIA were part of an initial business case for ABS to get involved in data linking using Census data in the first place.

6. Phase 2: Design

2.1 Design outputs A range of possible outputs has been proposed, from specific outputs for particular projects (e.g. adjustment factors for Indigenous life expectancy estimates) to general outputs (e.g. the possibility of unit record files, particularly the 5% SLCD).

2.2 Design variable descriptions Variable descriptions are generally available for the source datasets. Derived variables for use in linking will need descriptions when file standardisation (e.g. ensuring that common variables from the files to be linked are of the same type, i.e. character/numeric) and field standardisation (e.g. ensuring variables to be compared have compatible categories – this includes name standardisation) are undertaken.

2.3 Design data collection methodology Ensure secure methods are in place to acquire data (e.g. ABS have a secure deposit box facility for external agencies to provide data over the internet). It is also necessary to have appropriately secure methods of moving data within the ABS (following the principle of functional separation based on Kelman (Kelman, Bass, & Holman, 2002)).

2.4 Design frame and sample methodology Most CDE projects make use of entire input files. The SLCD will be based on a 5% sample as a privacy preserving mechanism.

2.5 Design statistical processing methodology The CDE project will use probabilistic linking methodology following Fellegi-Sunter (Fellegi & Sunter, 1969). Blocking and linking strategies will be designed for each linkage project. Methods for the calculation of the m and u probabilities will depend on the data sources to be linked. The clerical review strategy is designed based on the method outlined in an ABS methodology paper (Guiver, 2011). Quality gates have been designed to enable quality to be monitored throughout the linking process (for more information about quality gates see ABS, 2010b).
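To make the Fellegi-Sunter machinery mentioned above concrete, a minimal R sketch of agreement/disagreement weights for a few comparison fields follows; the m and u probabilities and the field names are illustrative values only, not ABS parameters.

```r
# Minimal sketch of Fellegi-Sunter agreement/disagreement weights.
# The m and u probabilities below are illustrative values; in practice they are
# estimated from the data sources to be linked (e.g. by EM).
m <- c(name = 0.95, dob = 0.98, postcode = 0.90)   # P(agree | true match)
u <- c(name = 0.01, dob = 0.03, postcode = 0.10)   # P(agree | non-match)

agree_weight    <- log2(m / u)                 # added when a field agrees
disagree_weight <- log2((1 - m) / (1 - u))     # added when a field disagrees

# Composite weight for one candidate pair, given a logical agreement vector
pair_weight <- function(agree) {
  sum(ifelse(agree, agree_weight, disagree_weight))
}

pair_weight(c(name = TRUE, dob = TRUE, postcode = FALSE))
# Pairs above an upper threshold are declared links, pairs below a lower
# threshold non-links; the in-between band goes to clerical review.
```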

2.6 Design production systems and workflow

Oracle data store → SAS (extract / transform / load) → Febrl (data linking) → SAS (extract / transform / load) → Oracle data store

Oracle data store holds the input files (brought in through the data collection methodology). Standardisation (Design variable descriptions) is done in SAS and then files are created for input to Febrl (Freely Extensible Biomedical Record Linkage), open source data linking software – see (Christen, et al., 2005). Febrl does the data linking (including enabling clerical review). Output files (i.e. linking keys which allow the input files to be linked) from Febrl are loaded by SAS back into the Oracle data store.

The linked files can then be extracted from the Oracle data store and analysed or transformed into other output products (e.g. confidentialised unit record files).

7. Phase 3: Build

3.1 Build data collection instrument Existing facilities within the ABS have been adapted for loading of sensitive linking datasets, with functional separation (see sub-process 2.3 above) implemented, as a privacy and security measure.

3.2 Build or enhance process components Existing ABS infrastructure is available with minimal modifications. Standardisation (file and field) processes need to be built in SAS.

Febrl has been significantly enhanced by ABS (to enable multiprocessing, viewing snippets of Census forms for clerical review, clerical review functionality, categorical probability assignment).

Quality gates have been built around each data linking process (including the ability to extract management information at critical points in the process and checklists for running through data linking projects).

3.3 Configure workflows Systems have been built to enable data movements, with management information extraction available at critical points. End to end training materials have been produced to ensure people working on data linking are able to perform efficiently and effectively.

3.4 Test production system Test datasets have been created. A simulated Census dataset has been created based on 2006 Census data with the random addition of name and address for load testing purposes. Smaller datasets have also been created for system testing purposes.

Robust change management procedures need to be put in place to ensure the building of data linking infrastructure as an ongoing process can take place – this includes having test, development and production environments as well as governance in place to ensure build changes are acceptable to stakeholders.

3.5 Test statistical business processes Census Dress Rehearsal (CDR) data is available to test statistical business processes using as close to live data as possible. Other datasets (for linking to the CDR) are also available in many instances (e.g. mortality data for the period after the CDR). CDR data will also be linked to Census data once that becomes available to provide a quality benchmark for CDE projects (especially the SLCD).

3.6 Finalise production systems Governance processes are in place to enable sign-off of infrastructure into a production environment. Internal access arrangements have been formalised to allow appropriate access for those performing the linkage and those doing analysis of linked information.

Internal training materials have been produced covering the full end to end process.

8. Phase 4: Collect (Acquire in data linking terms)

4.1 Select sample In the case of the SLCD this sub-process is where the 5% sample is selected. Other projects do not have a sampling basis.

4.2 Set up collection Prepare for the arrival of data, ensuring appropriate accesses are in place in the computer systems. In the case of CDE this includes internal ABS Census data, as well as external administrative data.

4.3 Run collection Take snapshots of data at points in time from various sources. This includes snapshots of Census data as it is still being processed (meaning that certain variables will not be populated depending on the stage of Census processing when the snapshot is taken).

Extract management information from input files at each snapshot.

4.4 Finalise collection Create linkage files (merge files appropriately, file standardise, field standardise). As part of this step detailed data quality reports are produced for each input dataset.

9. Phase 5: Process

5.1 Integrate data [recursive with respect to the chosen blocking strategy]
• Link files
• Threshold review
• Clerical review
Create and output linking keys so original input files can be linked in future.

Extract management information and ensure quality gates operate appropriately.

5.2 Classify and code In data linking this is the final assignment of link status.

5.3 Review, validate and edit Analyse unlinked records. In some cases (particular population groups for example) all possible links might be reviewed.

5.4 Impute Imputation is not used as part of the CDE project. Imputed records on input files are generally disregarded.

5.5 Derive new variables & statistical units These will usually be available from the input files used for integration, or can be merged from associated output files that are generally created from those files for other purposes.

An issue with data linking arises when the same variable exists on both input files and, for a given link, has different values on each file; a choice (or derivation) then needs to be made to produce a final value. This is the case for the Indigenous Mortality project, where Indigenous status is available on both the mortality records and the Census.
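Purely as an illustration of the kind of derivation involved, a minimal R sketch of a precedence rule for a variable reported on both files; the rule itself (prefer the Census value when it is reported) is a made-up example, not the derivation actually used for the Indigenous Mortality project.

```r
# Illustrative only: resolving a variable reported on both linked files with
# possibly conflicting values. The precedence rule below is a made-up example.
resolve_value <- function(census_value, other_value) {
  ifelse(!is.na(census_value), census_value, other_value)  # prefer Census when reported
}

resolve_value(census_value = c(1, NA, 2), other_value = c(1, 2, 1))
# [1] 1 2 2
```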

5.6 Calculate weights Weights may need to be calculated for the SLCD. During the scoping exercise conducted as part of the 2006 Census cycle, some investigation was done which included the calculation of weights for the Census unit record file to Census dress rehearsal file, a project undertaken to assess the likely quality of census linking across cycles (Bishop, 2009).


5.7 Calculate aggregates Produce an output report using the linked files.

5.8 Finalise data files Ensure link keys are stored appropriately to allow linked files to be created.

10. Phase 6: Analyse

6.1 Prepare draft outputs Create quality declaration documents for linkage files. Quality information will be available for the linkage files based on the management information extracted as part of the quality gate process, as well as compiled information about unlinked records.

Create output files (e.g. adjustment factor analysis files, CURFs, other analysis files). These would be created subject to the functional separation principle, where only the variables required for the analysis being undertaken would be included; see Kelman, Bass, & Holman (2002).

6.2 Validate outputs Check linked data against population estimates to ensure consistency. Linkage rates need to be calculated and assessed (including types of linkage error). Possible future work related to this could be a taxonomy of data linking quality with terminology that assists in understanding the quality of linked datasets – similar to survey quality terminology (e.g. relative standard errors, non-response etc.).

Compare quality measures with the previous (2006) study. Some measures of false match and true non-match rates were calculated for the CDE studies conducted in 2006 and these will be calculated for the 2011 project. Linkage rates for particular projects and sub-populations within those projects will be calculated and compared against the 2006 CDE results (particularly Indigenous mortality linkage rates compared with non-Indigenous).

Confront with external data sources, at an aggregate level. This includes comparing linked results with any existing external information (such as estimated resident population, mortality statistics etc.).

6.3 Scrutinise and explain Check how the linked file reflects initial expectations. This could include linkage rates for particular population groups.

View statistics from the linked file from different perspectives (in particular based on different geographies).

Undertake in depth analysis (e.g. in the case of the Indigenous Mortality study this involves calculating adjustment factors for life expectancy estimates).

6.4 Apply disclosure control Confidentialise unit record files (this includes governance implications such as ensuring appropriate levels of clearance before external release). ABS has sophisticated methods for confidentialising household survey unit record files and most (if not all) of these techniques would be applicable to linked files.

Prepare files for the remote execution environment for microdata (REEM – see ABS, 2010c). The REEM will provide analysis services which will access detailed de-identified microdata, with confidentiality routines built into the generated outputs to ensure that they are confidentialised in line with ABS legislative requirements and can be released.

6.5 Finalise outputs This sub-process is the same as the GSBPM.

11. Phase 7: Disseminate

The sub-processes in this phase (listed below) are the same as for the GSBPM and in the case of the ABS generally apply to already existing corporate infrastructure.

7.1 Update output systems
7.2 Produce dissemination products
7.3 Manage release of dissemination products
7.4 Promote dissemination products
7.5 Manage user support

12. Phase 8: Archive

8.1 Define Archive rules The ABS has detailed data management plans and policies and these will be applied to linked data files with a small number of exceptions. As mentioned above there are a small number of projects linking Census to other files using name and address. These files will be stripped of the name and address fields at the conclusion of Census processing, and the linkage keys (and any linked datasets) will be destroyed once the purpose of the linkage study has been met.

8.2 Manage archive repository The ABS archive repository currently exists and is well developed. Data linking will introduce some minor changes due to the implementation of functional separation (in this case meaning that a ‘librarian’ role will be required to create linked files for ‘analysts’) and some minor adaptations to manage linkage keys.

8.3 Preserve data and associated metadata The ABS already has infrastructure in place to manage this sub-process.

8.4 Dispose of data and associated metadata The ABS already has infrastructure in place to manage this sub-process.

13. Phase 9: Evaluate

9.1 Gather evaluation inputs

The quality gate implementation and extraction of management information will play a key role in gathering evaluation inputs.

9.2 Conduct evaluation This sub-process is the same as the GSBPM.

9.3 Agree action plan This sub-process is the same as the GSBPM.

14. Conclusion

The ABS is building infrastructure to enable data integration projects to be completed successfully. An end-to-end approach is being taken: building infrastructure in areas where it does not exist, modifying existing infrastructure for specific linking purposes, and using existing infrastructure where it needs no modification.

One area where the ABS is building infrastructure is the Specify Needs phase of the GSBPM, where the ABS is collaborating with other Australian Government agencies to build a governance structure for the integration of Commonwealth data in a safe and effective environment. Some aspects of this infrastructure are already in place (e.g. a set of principles to govern the integration of Commonwealth data for statistical and research purposes, and a Cross Portfolio Data Integration Oversight Board chaired by the Australian Statistician). Other aspects are in the process of being built, including a process to accredit agencies as integrating authorities, enabling them to undertake "high risk" projects involving Commonwealth data.

An area where ABS has modified existing data linking infrastructure is the development of the Febrl linking software. Febrl version 0.3 has been significantly enhanced by the ABS to enable multiprocessing, viewing snippets of Census forms for clerical review, clerical review functionality, and categorical probability assignment. The clerical review modifications were offered to the original author of the software, but have not been included in the most recent version (Febrl 0.4).

There are many examples of existing ABS infrastructure meeting the needs of a data linking project. This is especially true in the dissemination phase of the GSBPM, where ABS corporate infrastructure has been used for many years to deliver output, and this infrastructure will work equally well for data linking.

The GSBPM has provided a useful structure for the CDE data linking projects, giving an end-to-end perspective of the process and ensuring that appropriate infrastructure is available for each step. It has also been very useful in planning collaboration across the ABS as many different areas are involved in a data linking project.

15. Future work

The ABS is planning on conducting research during its 2011 CDE projects, with particular emphasis on designing standardisation procedures (sub-process 2.2 Design variable descriptions), designing methods for the calculation of m and u probabilities (sub-process 2.5 Design statistical processing methodology), and exploring data quality (sub-process 6.2 Validate outputs). This research is part of the continuous improvement that underlies activities undertaken at the ABS.

Data linking is a key priority for the ABS, and with developments in Australia to build a safe and effective environment for the integration of Commonwealth data there will be increased data linkage being undertaken to leverage more information by combining individual datasets. Positioning data linking within the GSBPM will allow organisations to agree on standard terminology to aid discussions on developing statistical systems and processes.

References

ABS. (2006a, April). Discussion Paper: Enhancing the Population Census: Developing a Longitudinal View. Retrieved from ABS Website: http://www.abs.gov.au/AUSSTATS/[email protected]/Lookup/2060.0Main+Features12006
ABS. (2006b, June). Census Data Enhancement Project: An Update. Retrieved from ABS Website: http://www.abs.gov.au/AUSSTATS/[email protected]/allprimarymainfeatures/43185D34D6A1FF51CA2577BC0081EAC3?opendocument
ABS. (2010a, October). Census Data Enhancement Project: An Update. Retrieved from ABS website: http://www.abs.gov.au/ausstats/[email protected]/mf/2062.0
ABS. (2010b, December). Quality Management of Statistical Processes Using Quality Gates, Dec 2010. Retrieved from ABS Website: http://www.abs.gov.au/ausstats/[email protected]/mf/1540.0
ABS. (2010c, Sep). 1504.0 - Methodological News, Sep 2010. Retrieved from ABS website: http://www.abs.gov.au/AUSSTATS/[email protected]/Lookup/1504.0Main+Features3Sep+2010
ABS. (2011, Oct). 1001.0 - Australian Bureau of Statistics -- Annual Report, 2010-11. Retrieved from ABS website: http://www.abs.gov.au/ausstats/[email protected]/d36c95a5d2ce6cedca257098008362c8/01776d8d7f87e4e2ca25709900222520!OpenDocument
Bishop, G. (2009, Aug). 1351.0.55.026 - Research Paper: Assessing the Likely Quality of the Statistical Longitudinal Census Dataset, August 2009. Retrieved from ABS website: http://www.abs.gov.au/AUSSTATS/[email protected]/mf/1351.0.55.026
Christen, P., Churches, T., Hegland, M., Taylor, L., Lim, K., Willmore, A., et al. (2005, April). Parallel Large Scale Techniques for High-Performance Record Linkage. Retrieved from ANU Data Mining Group: http://datamining.anu.edu.au/linkage.html
ComLaw. (1975). Australian Bureau of Statistics Act. Retrieved from Australian Government ComLaw: http://www.comlaw.gov.au/ComLaw/Legislation/ActCompilation1.nsf/0/D457D9DA71AE7F49CA25744B001DC54C/$file/AustBurStatAct1975WD02.pdf
Cross Portfolio Data Integration Oversight Board. (2011). Statistical Data Integration involving Commonwealth Data. Retrieved from National Statistical Service: http://www.nss.gov.au/nss/home.nsf/pages/Data+Integration+Landing+Page?OpenDocument
Fellegi, I. P., & Sunter, A. B. (1969). A Theory for Record Linkage. Journal of the American Statistical Association, 1183-1210.
Guiver, T. (2011, May). Research Paper: Sampling-Based Clerical Review Methods in Probabilistic Linking. Retrieved from ABS Website: http://www.abs.gov.au/AUSSTATS/[email protected]/mf/1351.0.55.034
Kelman, C. W., Bass, A. J., & Holman, C. D. (2002). Research use of linked health data - a best practice protocol. Australian and New Zealand Journal of Public Health, 251-255.
UNECE. (2009). Generic Statistical Business Process Model. Retrieved from UNECE website: http://www1.unece.org/stat/platform/display/metis/The+Generic+Statistical+Business+Process+Model
Waters, N. (2005, June). Privacy Impact Assessment. Retrieved from ABS Website: http://www.abs.gov.au/Websitedbs/D3110124.NSF/f5c7b8fb229cf017ca256973001fecec/fa7fd3e58e5cb46bca2571ee00190475!OpenDocument


Section II – Statistical Matching

Measuring uncertainty in statistical matching for discrete distributions

Pier Luigi Conti (1), Daniela Marella (2)
(1) Dipartimento di Scienze Statistiche, Sapienza Università di Roma
(2) Dipartimento di Scienze dell'Educazione, Università "Roma Tre"
e-mail: [email protected]

Abstract: The aim of this paper is to analyze the uncertainty in statistical matching. The notion of uncertainty is first defined, and a measure of uncertainty is then introduced. Moreover, the reduction of uncertainty in the statistical model due to the introduction of logical constraints is studied.

Keywords: Statistical Matching, contingency tables, structural zeroes, non-identifiability, uncertainty.

1. Introduction

Let (X,Y,Z) be a three-dimensional random variable (r.v.), and let A and B be two independent samples of nA and nB i.i.d. records from (X,Y,Z), respectively. Assume that the marginal (bivariate) (X,Y ) is observed in A, and that the marginal (bivariate) (X,Z) is independently observed in B. The main goal of statistical matching, at a macro level, consists in estimating the joint distribution of (X,Y,Z). Such a distribution is not identifiable due to the absence of joint information on Z and Y given X; see D’Orazio et al. (2006b). Generally speaking, two approaches have been considered to ensure the identifiability of the joint distribution of (X,Y,Z):

• techniques based on the conditional independence assumption between Y and Z given X (CIA; see, e.g., Okner, 1972);
• techniques based on external auxiliary information regarding the statistical relationship between Y and Z, e.g. when an additional file C where (X,Y,Z) are jointly observed is available, as in Singh et al. (1993).

Unfortunately, since the CIA is rarely met in practice (see, e.g., Rodgers, 1984, and Sims, 1972), and external auxiliary information is hardly ever available, the sample observations cannot identify the statistical model generating the data. In other words, the sampling mechanism does not allow one to identify the joint distribution of (X,Y,Z), but only a class of possible distributions of (X,Y,Z). Roughly speaking, this produces uncertainty about the actual distribution of (X,Y,Z) within the above mentioned class, even when the marginal distributions of (X,Y) and (X,Z) are known. Of course, what the sampling mechanism is actually unable to identify is the conditional distribution of (Y,Z) given X: this is the actual reason for the lack of identifiability of the distribution of (X,Y,Z). Hence, considering uncertainty about the conditional distribution of (Y,Z) given X is equivalent to considering uncertainty about the distribution of the triple (X,Y,Z).

In our setting, the main task consists in providing a precise definition of uncertainty on the (estimated) model, and in constructing a coherent measure that can reasonably quantify such an uncertainty. We confine ourselves to the case of ordered categorical variables. The case of discrete variables with nominal values is dealt with in D'Orazio et al. (2006a).

2. Uncertainty in statistical matching for ordered categorical variables

Assume that, given a discrete r.v. X with I ordered categories, Y and Z are discrete r.v.s too, with J and K ordered categories, respectively. Without loss of generality, from now on the symbols i = 1,...,I, j = 1,...,J, and k = 1,...,K will denote the (ordered) categories taken by X, Y and Z, respectively. Let γ_{jk|i} be the conditional probability Pr(Y = j, Z = k | X = i), and denote by φ_{j|i} = Pr(Y = j | X = i) and ψ_{k|i} = Pr(Z = k | X = i) the corresponding marginal probabilities of Y and Z (again conditionally on X), respectively. For real numbers a, b, define further the two quantities

U(a, b) = min(a, b),   L(a, b) = max(0, a + b − 1).   (1)

Conditionally on X = i, the distribution functions (d.f.'s) of (Y,Z), Y, and Z are equal to

H_{j,k|i} = Σ_{y=1}^{j} Σ_{z=1}^{k} γ_{yz|i},   j = 1,...,J, k = 1,...,K, i = 1,...,I,

F_{j|i} = Σ_{y=1}^{j} φ_{y|i},   j = 1,...,J, i = 1,...,I,

G_{k|i} = Σ_{z=1}^{k} ψ_{z|i},   k = 1,...,K, i = 1,...,I,

respectively. Using the same arguments as in Conti et al. (2009), the inequalities

L(F_{j|i}, G_{k|i}) ≤ H_{j,k|i} ≤ U(F_{j|i}, G_{k|i})   (2)

hold true. Inequalities (2) imply that

γ⁻_{jk|i} ≤ γ_{jk|i} ≤ γ⁺_{jk|i},   (3)

where

γ⁻_{jk|i} = L(F_{j|i}, G_{k|i}) − L(F_{j−1|i}, G_{k|i}) − L(F_{j|i}, G_{k−1|i}) + L(F_{j−1|i}, G_{k−1|i}),

γ⁺_{jk|i} = U(F_{j|i}, G_{k|i}) − U(F_{j−1|i}, G_{k|i}) − U(F_{j|i}, G_{k−1|i}) + U(F_{j−1|i}, G_{k−1|i}).

Now, it is not difficult to realize that

γ⁻_{jk|i} ≥ L(φ_{j|i}, ψ_{k|i}),   γ⁺_{jk|i} ≤ U(φ_{j|i}, ψ_{k|i}),

so that inequalities (3) are sharper than the elementary Fréchet inequalities applied to the probabilities γ_{jk|i}. The interval [L(F_{j|i}, G_{k|i}), U(F_{j|i}, G_{k|i})] in (2) summarizes the pointwise uncertainty about the statistical model for every triple (i, j, k) of categories. It is intuitive to take the length of such an interval as a pointwise uncertainty measure. Formally,

∆^{jk|i} = U(F_{j|i}, G_{k|i}) − L(F_{j|i}, G_{k|i})   (4)

for each point (i, j, k). The larger the measure ∆^{jk|i}, the more uncertain the statistical model generating the data w.r.t. (i, j, k). Clearly, if the model is identifiable, then the interval reduces to a single point, with length zero, and there is no uncertainty at all. In order to summarize the pointwise differences in (4) into an overall measure of uncertainty we may take the average length

∆ = ∫_{R³} ∆^{jk|i} dT(i, j, k)

where T(i, j, k) is a weight function on R³, i.e. a measure having total mass 1. A "natural" choice consists in taking

dT(i, j, k) = dF(j|i) dG(k|i) dQ(i) = φ_{j|i} ψ_{k|i} ξ_i.

This distribution is "natural" because: i) it is the simplest choice given the available d.f.s F(j|i), G(k|i), Q(i) and makes the integral in ∆ easily computable in many cases; ii) among all the possible associations between Y and Z, we take a neutral position, i.e. we do not give preference to any specific positive or negative association. Hence, a conditional measure of uncertainty is

∆^{x=i} = Σ_{j=1}^{J} Σ_{k=1}^{K} ∆^{jk|i} φ_{j|i} ψ_{k|i}.   (5)

As a matter of fact, by averaging (5) with respect to X, we obtain the overall measure of uncertainty ∆:

∆ = Σ_{i=1}^{I} ∆^{x=i} ξ_i.   (6)

Relationships (5), (6) show that the unconditional uncertainty measure (6) can be expressed as a weighted mean of the conditional uncertainty measures (5). Then, the larger the ∆^{x=i}s, the more uncertain the data generating statistical model.
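To make formulas (4)-(6) concrete, here is a minimal R sketch that, for a single category i of X and made-up conditional marginals, computes the unconstrained bounds and the conditional uncertainty measure ∆^{x=i}.

```r
# Sketch, for a single category i of X, of the unconstrained bounds in (2) and
# of the conditional uncertainty measure (5). phi and psi are made-up values.
phi <- c(0.2, 0.5, 0.3)                    # P(Y = j | X = i), j = 1..3
psi <- c(0.4, 0.4, 0.2)                    # P(Z = k | X = i), k = 1..3

Fj <- cumsum(phi)                          # F_{j|i}
Gk <- cumsum(psi)                          # G_{k|i}

U <- outer(Fj, Gk, pmin)                                   # upper bound U(F, G)
L <- outer(Fj, Gk, function(a, b) pmax(0, a + b - 1))      # lower bound L(F, G)

Delta_jk <- U - L                            # pointwise uncertainty, equation (4)
Delta_i  <- sum(Delta_jk * outer(phi, psi))  # conditional measure, equation (5)
Delta_i
# The overall measure (6) is the xi_i-weighted average of the Delta_i values
# over the categories of X.
```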

3. Reducing uncertainty under constraints

3.1 Structural zeros and regular domains

In several real cases, uncertainty about the joint distribution of Y and Z can be considerably reduced by introducing appropriate logical constraints among the values taken by Y and Z. Precisely, we consider constraints acting as structural zeroes (cf. Agresti, 1990), i.e. constraints that set equal to 0 some of the joint probabilities γ_{jk|i} = Pr(Y = j, Z = k | X = i). Of course, this is equivalent to assuming that the logical constraints "reduce" the support of the joint distribution of Y and Z (given X), which becomes strictly smaller than the Cartesian product of the supports of Y and Z. In the sequel, we will concentrate on structural zeros that reduce the support of Y and Z in a "regular" way, useful to manage uncertainty. To introduce the kind of constraints we will deal with, consider the support of (Y,Z), which is a subset (either proper or improper) of {(j, k); j = 1,...,J; k = 1,...,K}. For each j ∈ {1,...,J}, define the two integers:

k_j^+ = largest integer k such that γ_{jk|i} > 0;
k_j^- = smallest integer k such that γ_{jk|i} > 0.

Of course, there exist integers j_1, j_2 such that k_{j_1}^+ = K and k_{j_2}^- = 1. Analogously, for each k ∈ {1,...,K}, define the two integers:

j_k^+ = largest integer j such that γ_{jk|i} > 0;
j_k^- = smallest integer j such that γ_{jk|i} > 0.

Again, there exist integers k_1, k_2 such that j_{k_1}^+ = J and j_{k_2}^- = 1. The support of (Y,Z) (given X) is Y-regular if, for all j = 1,...,J,

γ_{jk|i} = 0 ∀ k > k_j^+,   γ_{jk|i} = 0 ∀ k < k_j^-.   (7)

Similarly, the support of (Y,Z) (given X) is Z-regular if, for all k = 1,...,K,

γ_{jk|i} = 0 ∀ j > j_k^+,   γ_{jk|i} = 0 ∀ j < j_k^-.   (8)

To visualize the meaning of Y-regularity, consider the piecewise straight line d_y(j) joining the downstairs points (j, k_j^-), j = 1,...,J, and the piecewise straight line u_y(j) joining the upstairs points (j, k_j^+), j = 1,...,J. Y-regularity means that the structural zeroes are exactly the points above u_y(·) and below d_y(·). Similar concepts hold in the case of Z-regularity. Let d_z(k) be the piecewise straight line joining the downstairs points (k, j_k^-), k = 1,...,K, and u_z(k) the piecewise straight line joining the upstairs points (k, j_k^+), k = 1,...,K. Z-regularity means that the structural zeroes are exactly the points above u_z(·) and below d_z(·). The case of a Y-regular support (which is not Z-regular) is illustrated in Figure 1.
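As a small illustration of the definition above, the following R sketch checks Y-regularity of a made-up support matrix: within each row j, the admissible values of k must form one contiguous block [k_j^-, k_j^+].

```r
# Illustration of Y-regularity for a made-up J x K support matrix S, where
# S[j, k] = 1 marks an admissible pair (Y = j, Z = k) and 0 a structural zero.
S <- rbind(c(1, 1, 0, 0),
           c(0, 1, 1, 0),
           c(0, 1, 1, 1))

is_Y_regular <- function(S) {
  all(apply(S, 1, function(row) {
    ks <- which(row > 0)
    # the admissible k's must be a single contiguous block [k_j^-, k_j^+]
    length(ks) > 0 && length(ks) == max(ks) - min(ks) + 1
  }))
}

is_Y_regular(S)   # TRUE: in every row the admissible k's are contiguous
```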

3.2. Constrained lower and upper bounds

The constraints introduced above on the support of the joint distribution of (Y,Z) can be used to improve the lower and upper bounds for H_{jk|i} given in (2). Before attacking the problem in its full generality, let us briefly provide the main idea via a few examples.

Example 1. Let u(j) be a monotone increasing function and suppose that the line d(j) does not exist. As a consequence, structural zeros are all the points above u(·). From the relationship

H_{jk|i} = H_{j k_j|i}   ∀ k > k_j   (9)

Figure 1: Structural zeros in a Y-regular domain.

we obtain

H_{jk|i} ≤ min(F_{j|i}, G_{k|i}),
H_{jk|i} ≤ min(F_{j|i}, G_{k_j|i}),   (10)

and then

H_{jk|i} ≤ min(F_{j|i}, G_{k_j|i})   ∀ k > k_j.   (11)

It is straightforward to prove that the lower bound does not improve. In fact, if k > k_j,

H_{jk|i} ≥ max(0, F_{j|i} + G_{k|i} − 1),
H_{jk|i} ≥ max(0, F_{j|i} + G_{k_j|i} − 1),   (12)

and then

H_{jk|i} ≥ max(0, F_{j|i} + G_{k|i} − 1).   (13)

Example 2. Let d(j) be a monotone decreasing function and suppose that the line u(j) does not exist. As a consequence, structural zeros are all the points below d(·). From the relationship

H_{jk|i} = H_{j k_j|i}   ∀ k < k_j   (14)

we obtain

H_{jk|i} ≥ max(0, F_{j|i} + G_{k|i} − 1),
H_{jk|i} ≥ max(0, F_{j|i} + G_{k_j|i} − 1),   (15)

and then

H_{jk|i} ≥ max(0, F_{j|i} + G_{k_j|i} − 1)   ∀ k < k_j.   (16)

It is straightforward to prove that the upper bound does not improve. In fact, if k < k_j,

H_{jk|i} ≤ min(F_{j|i}, G_{k|i}),
H_{jk|i} ≤ min(F_{j|i}, G_{k_j|i}),   (17)

and then

H_{jk|i} ≤ min(F_{j|i}, G_{k|i}),   k < k_j.   (18)

Let us now turn to the general case of structural zeroes, and show how they can be used to improve the lower and upper bounds for H_{jk|i} in (2). Suppose that the domain of (Y,Z) (given X) is Y-regular, and take a fixed j. From

γ_{jk|i} = 0   ∀ k < k_j^-

it follows that

H_{jk|i} = H_{j−1,k|i}   ∀ k < k_j^-.   (19)

On the opposite side, when k is greater than k_j^+, from

F_{j|i} − H_{jk|i} = F_{j−1|i} − H_{j−1,k|i}   ∀ k > k_j^+

the relationship

H_{jk|i} = F_{j|i} − F_{j−1|i} + H_{j−1,k|i}   ∀ k > k_j^+   (20)

follows. The two relationships (19), (20) can be used to construct bounds for H_{jk|i} better than the unconstrained bounds (2), whenever the support of (Y,Z) given X is Y-regular and/or Z-regular. Let H^-_{jk|i} and H^+_{jk|i} be the lower and upper bounds for H_{jk|i} obtained by using relationships (19), (20). It is not difficult to see that the pair of inequalities

H^+_{jk|i} ≤ U(F_{j|i}, G_{k|i}),   H^-_{jk|i} ≥ L(F_{j|i}, G_{k|i})   ∀ j = 1,...,J, k = 1,...,K

holds, so that H^+_{jk|i}, H^-_{jk|i} improve the unconstrained bounds in (2) whenever the support of (Y,Z) given X is Y-regular and/or Z-regular. Furthermore, when there are no constraints (i.e. when k_j^- = 1, k_j^+ = K, j_k^- = 1, j_k^+ = J), then H^+_{jk|i}, H^-_{jk|i} turn out to be equal to U(F_{j|i}, G_{k|i}) and L(F_{j|i}, G_{k|i}), respectively. Next, it is possible to see that H^+_{jk|i} and H^-_{jk|i} are continuous functions of F_{1|i},...,F_{J|i}, G_{1|i},...,G_{K|i}. In symbols:

H^+_{jk|i} = f^+_{jk|i}(F_{1|i},...,F_{J|i}, G_{1|i},...,G_{K|i}),   (21)

H^-_{jk|i} = f^-_{jk|i}(F_{1|i},...,F_{J|i}, G_{1|i},...,G_{K|i}).   (22)

The structure of H^{y+}_{jk|i}, H^{z+}_{jk|i}, H^{y−}_{jk|i}, H^{z−}_{jk|i} also shows that the functions f^+_{jk|i}, f^-_{jk|i} are piecewise linear, and hence differentiable for all but a finite number of points F_{j|i}, G_{k|i}. More precisely, the "non-differentiability" points are those where two or more elements in the max(·) and/or min(·) terms defining H^{y+}_{jk|i}, H^{z+}_{jk|i}, H^{y−}_{jk|i}, H^{z−}_{jk|i} are equal. Again, the non-differentiability points only depend on the marginal d.f.s F_{j|i}, G_{k|i}, and on the constraints as well.

4. Estimation of the measure(s) of uncertainty

An important feature of the measures of uncertainty introduced so far is that they can be estimated on the basis of sample data. Let $n_{A,i}^x$ ($n_{B,i}^x$) be the number of sample observations in sample A (B) such that $X = i$, and let $n_{A,ij}^{xy}$ ($n_{B,ik}^{xz}$) be the number of observations in sample A (B) such that $X = i$ and $Y = j$ ($X = i$ and $Z = k$), $i = 1, \dots, I$, $j = 1, \dots, J$, $k = 1, \dots, K$. The probabilities $\xi_i$, $\phi_{j|i}$, $\psi_{k|i}$ can then be estimated by the corresponding sample proportions

$$\hat{\xi}_i = \frac{n_{A,i}^x + n_{B,i}^x}{n_A + n_B}, \quad i = 1, \dots, I; \qquad
\hat{\phi}_{j|i} = \frac{n_{A,ij}^{xy}}{n_{A,i}^x}, \quad i = 1, \dots, I,\; j = 1, \dots, J; \qquad
\hat{\psi}_{k|i} = \frac{n_{B,ik}^{xz}}{n_{B,i}^x}, \quad i = 1, \dots, I,\; k = 1, \dots, K.$$

Furthermore, the c.d.f.s $F_{j|i}$, $G_{k|i}$ can be estimated by the corresponding empirical distribution functions (e.d.f.s):
$$\hat{F}_{j|i} = \frac{n_{A,i1}^{xy} + \cdots + n_{A,ij}^{xy}}{n_{A,i}^x}, \quad i = 1, \dots, I,\; j = 1, \dots, J; \qquad
\hat{G}_{k|i} = \frac{n_{B,i1}^{xz} + \cdots + n_{B,ik}^{xz}}{n_{B,i}^x}, \quad i = 1, \dots, I,\; k = 1, \dots, K.$$
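As a purely illustrative aid, a conditional e.d.f. of this kind can be computed from one of the two files along the following lines in R; the data frame and column names (sampleA, sampleB, x, y, z) are hypothetical and not taken from the paper.

```r
# Conditional e.d.f. of Y given X = i from file A (illustrative sketch).
cond_edf <- function(v, x, i) {
  v_i <- v[x == i]                               # observations with X = i
  lev <- sort(unique(v))                         # categories 1, ..., J (or K)
  cumsum(table(factor(v_i, levels = lev))) / length(v_i)
}
# F_hat <- cond_edf(sampleA$y, sampleA$x, i = 1)  # estimates F_{j|i}
# G_hat <- cond_edf(sampleB$z, sampleB$x, i = 1)  # estimates G_{k|i}
```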

As a consequence, the upper and lower bounds $H_{jk|i}^+$, $H_{jk|i}^-$ for $H_{jk|i}$ can be estimated by
$$\hat{H}_{jk|i}^+ = f_{jk|i}^+(\hat{F}_{1|i}, \dots, \hat{F}_{J|i}, \hat{G}_{1|i}, \dots, \hat{G}_{K|i}) \qquad (23)$$
$$\hat{H}_{jk|i}^- = f_{jk|i}^-(\hat{F}_{1|i}, \dots, \hat{F}_{J|i}, \hat{G}_{1|i}, \dots, \hat{G}_{K|i}) \qquad (24)$$
respectively. Hence, the conditional and unconditional measures of uncertainty can be estimated by
$$\hat{\Delta}^{x=i} = \sum_{j=1}^{J} \sum_{k=1}^{K} \left( \hat{H}_{jk|i}^+ - \hat{H}_{jk|i}^- \right) \hat{\phi}_{j|i}\, \hat{\psi}_{k|i}, \qquad (25)$$
$$\hat{\Delta} = \sum_{i=1}^{I} \hat{\Delta}^{x=i}\, \hat{\xi}_i \qquad (26)$$
respectively. The consistency of the estimators (25), (26) is established in Proposition 1.

Proposition 1 Assume that $n_A/(n_A + n_B) \to \alpha$ as $n_A$, $n_B$ go to infinity, with $0 < \alpha < 1$. Then $\hat{\Delta}^{x=i}$, $\hat{\Delta}$ converge almost surely (a.s.) to $\Delta^{x=i}$, $\Delta$, respectively. In symbols:
$$\hat{\Delta}^{x=i} \xrightarrow{\ a.s.\ } \Delta^{x=i} \quad \text{as } n_A \to \infty,\; n_B \to \infty, \quad i = 1, \dots, I;$$
$$\hat{\Delta} \xrightarrow{\ a.s.\ } \Delta \quad \text{as } n_A \to \infty,\; n_B \to \infty.$$

Page 66 of 199 + − In the second place, using the piecewise differentiability of fjk|i, fjk|i, it is not too + − difficult to see that the estimators Hbjk|is, Hbjk|is are jointly asymptotically normally distributed, provided that the “true” Fj|is, Gk|is satisfy the differentiability condition mentioned in the above section. As a consequence, the following proposition holds.

Proposition 2 Assume that nA(/nA + nB) → α as nA, nB go to infinity, with 0 < α < 1, + − and that Fj|is, Gk|is satisfy the differentiability condition for fjk|is, fjk|is. Then

$$\sqrt{\frac{n_{A,i}^x\, n_{B,i}^x}{n_{A,i}^x + n_{B,i}^x}}\; (\hat{\Delta}^{x=i} - \Delta^{x=i})$$
has an asymptotic normal distribution with mean zero and positive variance $\sigma_i^2$ as $n_A$, $n_B$ tend to infinity. Similarly, the variate
$$\sqrt{\frac{n_A n_B}{n_A + n_B}}\; (\hat{\Delta} - \Delta)$$
has an asymptotic normal distribution with mean zero and positive variance $\sigma^2$ as $n_A$, $n_B$ tend to infinity.

The asymptotic variances $\sigma_i^2$, $\sigma^2$ have a complicated form, depending on the "true" $F_{j|i}$s, $G_{k|i}$s. However, they can be consistently estimated by a bootstrap method, which works as follows (a minimal R sketch is given at the end of this section):
1. Generate from the e.d.f. of sample A a bootstrap sample of size $n_A$.
2. Generate from the e.d.f. of sample B a bootstrap sample of size $n_B$.
3. Use the samples generated in steps 1 and 2 to compute the "bootstrap version" $\tilde{\Delta}^{x=i}$ of $\hat{\Delta}^{x=i}$.
Steps 1-3 are repeated $M$ times, so that the $M$ bootstrap values $\tilde{\Delta}_m^{x=i}$, $m = 1, \dots, M$, are obtained. Let $\bar{\Delta}^{x=i}$ be their average and $S_M^{2x}$ their variance:
$$\bar{\Delta}^{x=i} = \frac{1}{M} \sum_{m=1}^{M} \tilde{\Delta}_m^{x=i}, \qquad
S_M^{2x} = \frac{1}{M-1} \sum_{m=1}^{M} \left( \tilde{\Delta}_m^{x=i} - \bar{\Delta}^{x=i} \right)^2.$$
As an estimate of $\sigma_i^2$, we may take
$$\hat{\sigma}_{i,M}^2 = \frac{n_{A,i}^x\, n_{B,i}^x}{n_{A,i}^x + n_{B,i}^x}\, S_M^{2x}. \qquad (27)$$
From (27) it is also easy to construct an estimate of the unconditional variance $\sigma^2$. The above results are useful to construct point and interval estimates of the uncertainty measures $\Delta^{x=i}$, $\Delta$. They are also useful to test the hypothesis that the class of bivariate d.f.s with upper bounds $H_{jk|i}^+$ and lower bounds $H_{jk|i}^-$ is "narrow" when structural zeros are considered.
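The bootstrap just described is straightforward to implement; the R sketch below is one possible layout, in which uncertainty_delta() is a hypothetical user-written function returning the conditional uncertainty measure from two files (it stands in for equations (23)-(25) and is not part of any package), and the column name x is likewise illustrative.

```r
bootstrap_sigma2_i <- function(sampleA, sampleB, i, M = 500) {
  nA <- nrow(sampleA); nB <- nrow(sampleB)
  delta_boot <- numeric(M)
  for (m in seq_len(M)) {
    # steps 1-2: resample each file from its own e.d.f. (i.e. with replacement)
    bootA <- sampleA[sample.int(nA, nA, replace = TRUE), ]
    bootB <- sampleB[sample.int(nB, nB, replace = TRUE), ]
    # step 3: recompute the "bootstrap version" of the conditional uncertainty measure
    delta_boot[m] <- uncertainty_delta(bootA, bootB, i)
  }
  S2M <- var(delta_boot)                        # variance over the M replicates
  nAi <- sum(sampleA$x == i); nBi <- sum(sampleB$x == i)
  (nAi * nBi / (nAi + nBi)) * S2M               # rescaled as in (27)
}
```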

References

Conti, P.L., Marella, D., Scanu, M. (2009) How far from identifiability? A nonparametric approach to uncertainty in statistical matching under logical constraints, Technical Report, 22, DSPSA, Sapienza Università di Roma.
D'Orazio, M., Di Zio, M., Scanu, M. (2006a) Statistical Matching for Categorical Data: Displaying Uncertainty and Using Logical Constraints, Journal of Official Statistics, 22, 1, 137-157.
D'Orazio, M., Di Zio, M., Scanu, M. (2006b) Statistical Matching: Theory and Practice, Wiley, New York.
Okner, B.A. (1972) Constructing a new data base from existing microdata sets: the 1966 merge file, Annals of Economic and Social Measurement, 1, 325-342.
Rodgers, W.L. (1984) An evaluation of statistical matching, Journal of Business and Economic Statistics, 2, 91-102.
Sims, C.A. (1972) Comments on: "Constructing a new data base from existing microdata sets: the 1966 merge file", by B.A. Okner, Annals of Economic and Social Measurements, 1, 343-345.

Statistical matching: a case study on EU-SILC and LFS

Aura Leulescu1, Mihaela Agafitei1 and Jean-Louis Mercy1.

1Eurostat, European Commission, Luxembourg, Luxembourg, L-2920.

Abstract

One of the main actions foreseen by the current process of modernization of social statistics within the ESS is the streamlining of social surveys in order to enable their complementary use. In the frame of these new developments, model based techniques are explored with the aim of meeting new demands through a better exploitation of existing data sources. This paper focuses on the estimation of regional poverty indicators based on the integration of information from two social surveys: SILC and LFS. EU-SILC is the reference source for poverty indicators, but in several countries regional estimates are not of adequate precision due to the small sample size. In practice, this exercise aims to draw on the larger sample size of LFS for providing poverty estimates for areas where SILC, on its own, is not sufficient to provide a valid estimate.

Keywords: statistical matching, social surveys, regional estimates

1. Introduction

1.1. The need for better information at regional level

In the context of demographic and economic problems, policy makers put great emphasis on the development of detailed and reliable indicators on poverty and living conditions that capture regional disparities. In August 2009, the "GDP and beyond" Communication emphasised the importance of key distributional issues, including the "equitable sharing of benefits across regions". In June 2010, Europe 2020 makes explicit the linkage with the cohesion policy and highlights the strong diversity among EU regions (e.g. differences in characteristics, opportunities and needs) and the need for a strong role for regions, cities and local authorities in decision-making. The last cohesion report1 emphasises that a key component of effectiveness for the cohesion policy is the alignment with Europe 2020, with a stronger focus on measurable results per region. Therefore, one critical need for policy makers is the provision of reliable regional measures for poverty indicators to be employed as benchmarks.

EU-SILC (European Union Statistics on Income and Living Conditions) provides the underlying data for the calculation of the headline indicator 'Population at risk of poverty or exclusion' and related indicators relevant to the headline target of reducing poverty of the Europe 2020 strategy. However, EU-SILC currently provides only partial information in terms of regional coverage, due to the relatively small sample size in several countries. There are several countries for which direct regional estimates based on sample data are not of adequate precision due to large variances.

1 http://ec.europa.eu/regional_policy/sources/docoffic/official/reports/cohesion5/index_en.cfm


1.2. A project for combining information from EU-SILC and LFS

The current process of modernization of social statistics within the ESS is focused on a better exploitation of existing data sources for meeting new demands. In the frame of these new developments, model based techniques (such as statistical matching and small area estimation) are explored within Eurostat in relation to specific practical needs in the field of social statistics: e.g. multidimensional measures for quality of life; poverty/health estimates at regional level; joint information on income, consumption and wealth.

Therefore, one specific stream investigated is the use of model-based methods for overcoming the problem of the small sample size for regional poverty indicators. These techniques are essentially based on statistically matching our sample with larger sample/auxiliary information in order to increase the precision of estimates.

This paper presents preliminary results on the estimation of regional poverty indicators based on the integration of EU-SILC with LFS. LFS can potentially be a good complement for this specific purpose as: it is accessible at Eurostat level and it covers all member states; it has an extensive coverage at regional level; it refers to the same population and contains a set of common variables at individual and household level. Practically, the exercise links poverty variables with covariates available in both surveys in order to impute poverty estimates for out-of-sample units (in LFS). The results illustrated in the paper refer to the integration of SILC-LFS data for only one country (Austria 2008). First results show that the integration process often requires specific solutions for different countries (different degrees of harmonization, different models, etc.) and further work will need to explore the extent to which the current methodology can be applied at EU level.

The rest of the article is organized into three sections, following the main steps in the integration process. Section 2 summarises the process of coherence analysis and reconciliation between the two data sources both in terms of concepts and marginal/joint distributions. Section 3 presents the proposed methodology for building 'synthetic poverty estimates' that make use of related data from LFS. Section 4 concludes with a discussion of limitations and further methodological aspects which need to be tackled.

2. Coherence and reconciliation of sources

This first stage focused on assessing the existence of appropriate conditions for matching relative to the two sources involved: they should be independent samples of the same population and have the same unit of analysis; they share a common block of variables which are consistent in terms of definitions, scales, classifications, marginal and joint distributions. (D’Orazio et al, 2006)

In order to enable the integration of two or more datasets several harmonization actions needed to be undertaken so that the variables and their distributions could be made comparable. The harmonisation work required a careful consideration of both survey concepts and survey methods. Moreover, country-specific implementation aspects have to be considered. While efforts for harmonization across countries can foster a common integration approach at EU level, the exercise showed that the reconciliation of sources might require different solutions across countries.


2.1. Reference populations and units of analysis

The reference population in both surveys is the resident population living in private households. The statistical units for which information is provided are individuals and households. Same dwelling, sharing economic resources, common housekeeping and family ties are the main and mostly used criteria to identify a household. However, some methodological differences arise both between surveys and countries in terms of: (a) the application of 'economic interdependence of household members' concept, (b) the length of period of absence and (c) the treatment of specific groups (e.g. students).

For example, in both EU-SILC and LFS the recommended definition of the private household relies on the housekeeping unit concept. However, in the latter both the housekeeping and the dwelling concept are considered acceptable. In the Austrian LFS the household concept used is the dwelling household, while the Austrian SILC uses the housekeeping concept. Further differences emerge for some countries in terms of the population covered and the availability of household level information. Other differences emerge for persons temporarily absent from the household dwelling (six months in SILC and one year in LFS for being excluded as a household member) and for particular groups (e.g. students).

The preliminary data analysis for Austria seems to indicate that these differences do not have significant effects on comparability, and we can therefore consider that the two populations overlap to a very large extent. However, this conclusion is based on data already calibrated at national level and therefore we might underestimate their impact. More in-depth analysis needs to focus on specific aspects (e.g. particular categories, such as students).

2.2. Consistency of definitions and scales of common variables

Both surveys provide individual and household level information. The starting point was the set of core social variables2. Most of them have consistent definitions with some exceptions: e.g. the activity status is optional for UK, DK; we have just the deciles for wage in LFS; for marital status and multiple citizenships there are some small differences in wording/guidelines for implementation; different typologies are applied for household composition variable(s).

In addition to the core social variables, the two sources share additional individual level labour and education variables. Data preparation and harmonization required several actions to enable the joint use and analysis as most variables need to be harmonized in terms of codification, level of aggregation, and/or format. Some variables are similar but cannot be harmonized: e.g. Years of work experience.

Both surveys also provide a great variety of additional information on the size and structure of households, the number of children (dependent and non-dependent), the number of active/inactive individuals and so on. These are particularly relevant in the context of our objective, as the poverty indicator is based on the household disposable income and therefore needs to be linked to household level covariates. Currently household variables

2 http://epp.eurostat.ec.europa.eu/cache/ITY_OFFPUB/KS-RA-07-006/EN/KS-RA-07-006-EN.PDF


are often composed according to different criteria in the two sources and there are no clear standard outputs. Both enhanced harmonization and better documentation of the differences are required to foster the integration of information provided at household level. The existence of harmonized basic information allowed us to reproduce the same household variables in both surveys. These are essentially based on the combination of several socio-economic characteristics of the household members (see Table 2-1 in the annex). Thus, households are described in terms of several dimensions as follows:

• Household types in terms of size and socio-demographic characteristics of members

• Prevalence of employed/ retired/inactive persons,

• Prevalence of highly/low/medium educated,

• Prevalence of people in "high earnings" occupations/sectors.

2.3. Coherence of marginal distributions

Marginal and joint distributions were compared both for the individual and the household level variables. Three different methodologies for the analysis of distributions were explored:

• The first and simplest one is to compute, for each potential common variable, the weighted frequency distributions for each category in the two surveys involved, and to calculate the differences. The maximum value of these differences can be taken as a criterion for comparison. Coherence of the variable in the two surveys will be rejected if this maximum difference is higher than 5 percentage points. Obviously, this is simply a rule of thumb without much theoretical background, and the chosen threshold is arbitrary.

• Another possibility is to quantify the similarity of two distributions, so that we can give a relative measure of the differences in the distributions of the various common variables at different levels (national and regional). We apply the Hellinger distance (HD), which lies between 0 and 1. A value of 0 indicates a perfect similarity between two probability distributions, whereas a value of 1 indicates a total discrepancy. The Hellinger distance between a variable V in the donor data source and the corresponding variable V' in the recipient data source is:

$$HD(V, V') = \sqrt{\frac{1}{2} \sum_{i=1}^{K} \left( \sqrt{p_V(i)} - \sqrt{p_{V'}(i)} \right)^2} = \sqrt{\frac{1}{2} \sum_{i=1}^{K} \left( \sqrt{\frac{n_{Di}}{N_D}} - \sqrt{\frac{n_{Ri}}{N_R}} \right)^2}$$

where K is the total number of cells in the contingency table, $n_{Di}$ is the frequency of cell $i$ in the donor data D, $n_{Ri}$ is the frequency of cell $i$ in the recipient data R, and $N_D$, $N_R$ are the total sizes of the corresponding contingency tables (a minimal R sketch of this computation is given after this list).

• The third group refers to statistical tests for the similarity of distributions (chi-square, Kolmogorov-Smirnov and Wald-Wolfowitz tests). These methods could give a stronger base to the conclusions on similarity/discrepancy between distributions coming from the two sources. However, both surveys have a complex design and this category of tests generally requires information on sampling design variables. LFS doesn't


provide this information at Eurostat level. Further work can investigate their application given the available data.
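As a purely illustrative sketch of the second method, the following R function computes the Hellinger distance between the weighted category shares of a common variable in the two files; the data frame and column names (silc, lfs, EDU, weight) are hypothetical stand-ins, not the actual survey variable names.

```r
hellinger_distance <- function(x_don, w_don, x_rec, w_rec) {
  lev <- union(unique(x_don), unique(x_rec))             # common set of categories
  p_don <- tapply(w_don, factor(x_don, levels = lev), sum)
  p_rec <- tapply(w_rec, factor(x_rec, levels = lev), sum)
  p_don[is.na(p_don)] <- 0; p_rec[is.na(p_rec)] <- 0
  p_don <- p_don / sum(p_don); p_rec <- p_rec / sum(p_rec)  # weighted shares
  sqrt(0.5 * sum((sqrt(p_don) - sqrt(p_rec))^2))
}
# e.g. hellinger_distance(silc$EDU, silc$weight, lfs$EDU, lfs$weight)
```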

Our analysis was based on the first two methods, giving us a combined view on the coherence of distributions. The Hellinger distance metric allows us to provide an easier to read comparative picture of the discrepancies in the data. Figure 2-1 provides an overview for both individual and household level variables. Inconsistencies at individual level translate into difficulties for the related household variables: number retired and number inactive. However, inconsistencies often come from "small cells". By aggregating these categories, the similarity of distributions improves. For instance, if we look at "self declared activity status" (LABOUR), by aggregating domestic tasks and other inactive into a single group, the Hellinger distance decreases from 6.72% to 1.41%. There are also some discrepancies for the number of adults in the household working in low/medium earning occupations. More detailed statistics on the coherence of marginal distributions at regional level are given in the annexes (Table 5-2).

Figure 2-1: Hellinger distance for the common variables, at individual level and at household level.

These results and the relevant inconsistencies need to be interpreted taking into account the weighting procedures applied at national level, and the weighting factors and benchmark files used for calibration, which are often different between sources.

In conclusion, ensuring coherence in terms of statistical output (marginal and joint distributions) needs both in-depth analysis and documentation of concepts and survey methods, as well as further methodological developments. Inconsistencies can emerge due to different concepts, due to operational differences, but also due to different survey


methods to treat missing information, weighting etc. Better coherence is essential for the complementary use of different data sources and it requires systematic checking of main distributions at MS and/or Eurostat level.

3. A method for estimating regional poverty measures

The validity of the exercise depends to a great extent on the selection of the model and on the power of the common variables to behave as good predictors. Our main target variable is the at-risk-of-poverty indicator, which is a binary index based on the relative position of the individual in the income distribution: those below 60% of the median are considered income poor.
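For illustration only, the indicator can be derived from equivalised income along the following lines (an R sketch with hypothetical column names; the official EU-SILC computation has additional details not shown here).

```r
# 0/1 at-risk-of-poverty flag: equivalised income below 60% of the weighted median.
arop_flag <- function(eqv_inc, weight) {
  o <- order(eqv_inc)
  cum_w <- cumsum(weight[o]) / sum(weight)
  med <- eqv_inc[o][which(cum_w >= 0.5)[1]]   # weighted median of equivalised income
  as.integer(eqv_inc < 0.6 * med)
}
```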

Several studies in the field of small area estimation for poverty take income as the target variable and, on the basis of the income estimates, recompute other poverty measures such as the at-risk-of-poverty rate (Molina and Rao, 2010); we therefore decided to focus on modelling the income variable. However, a further issue emerges in relation to the level of analysis. Even if the at-risk-of-poverty index relates to the ranking in the distribution of individuals, its computation is done by assuming perfect intra-household sharing of resources. The household disposable income, equivalised by the household size, is imputed to all individuals in the household no matter their actual contribution to the total resources of the household. Therefore, in the inference process we decided to focus our estimations on the household income. This is the income a household receives from wages and salaries, self-employment, benefits, pensions, plus any other sources of income. The household income is not normally distributed but positively skewed. Therefore, we use in the model the logarithm of the income, so that this skewness is reduced and it can be assumed for the analysis that the transformed variable follows a normal distribution. The whole estimation process is done at household level. The proposed method for providing model based regional estimates followed four main steps:
• fit a model at household level for the logarithm of household income based on EU-SILC;
• multiply impute (L times), on the basis of the model, "real donors" in LFS;
• re-compute the at-risk-of-poverty indicator in LFS for each of the generated L vectors;
• estimate the model-based regional at-risk-of-poverty rate (mean based on the L imputations) and assess quality.

3.1. Model specification

The analysis and techniques carried out aim at identifying the subset of common variables that best explain household disposable income. As several socio-economic factors contributing to poverty levels are at individual level, we needed to translate individual characteristics into household typologies. As the reference person is defined differently in LFS and SILC and we cannot identify the "main income earner", we decided not to use the characteristics of the head of household. We used as predictors mainly the number/prevalence in the household of certain individual characteristics that determine the socio-economic status of the household. For example, based on SILC we classified economic and occupation sectors into low, medium and high earning. We


therefore used as explanatory variables the percentage of household adults working in each of these categories. In the first step, we correlated several variables with household income: the strongest positive correlations are for the number of active people and the number of highly educated people in the household, while the negative ones are for the number of unemployed, living alone, and being single with children. Then we regressed the log of income on a subset of socio-demographic characteristics of the household. A stepwise regression was carried out in order to select the variables that best explain the household income. We tried both alternatives (the number and the prevalence of adults with certain characteristics, e.g. employed or highly educated, in the household) and they seem to give similar results.
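A minimal R sketch of this variable-selection step is shown below; it assumes a household-level SILC extract silc_hh with an income column hh_income and the harmonised household covariates described above (all names are illustrative, not the actual dataset variables).

```r
full_model <- lm(log(hh_income) ~ hh_size + n_dep_child + hh_type +
                   share_unempl + share_empl + share_self_empl + share_inactive +
                   share_high_edu + share_low_edu,
                 data = silc_hh)
# stepwise selection of the covariates that best explain log household income
selected_model <- step(full_model, direction = "both", trace = 0)
summary(selected_model)$r.squared   # explanatory power, to compare with Table 3-1
```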

The model seems to have a reasonable explanatory power. However, the shortcomings of unit-level area models are related to the non-inclusion of location effects. If we ignore the structure and use a single-level model (e.g. individual effects only), our analyses may be flawed because we have ignored the context in which processes may occur. One assumption of the single-level multiple regression model is that the measured individual units are independent, while in reality the individuals in clusters (areas) have similar characteristics. We have missed important area level effects - this problem is often referred to as the atomistic fallacy. For example, this may occur when we consider income as an outcome of interest and look at it with respect to household/individual characteristics. We might find that the association of individual income with the household type depends on the regional economic development.

Table 3-1 - Models: dependent variable = log(household income) - AT (coefficients for MODEL 1 / MODEL 2 / MODEL 3, in this order, where estimated)
Intercept: 9.418*** / 9.25840*** / 9.409***
Household size: 0.409*** / 0.152*** / 0.194***
No dependent children: -0.267*** / -0.036*** / -0.069***
Over 65: 0.018*** / -0.019*** / 0.129***
% female: -0.100***
One adult, <65, male: -0.389** / 0.026**
One adult, <65, female: -0.409*** / -0.006*
One adult, >65, male: -0.243*** / 0.090**
One adult, >65, female: -0.396*** / -0.116***
2 adults, <65: 0.027*** / 0.446***
Single parents: -0.246*** / -0.292***
2 adults, 1 dep. child: 0.036*** / 0.434***
Other hh, dep. children: 0.113** / 0.513***
% unemployed adults: -0.426*** / -0.313*** / -0.388***
% employed adults: 0.246*** / 0.256*** / 0.088***
% self-employed adults: 0.157*** / 0.333*** / 0.0417***
% inactive adults: -0.440*** / -0.407*** / -0.5171***
% retired: 0.063*** / 0.059**
% highly educated: 0.167***
% low educated: -0.171***
% adults - high earning occupations: 0.231***
% adults - low earning occupations: -0.116***
% adults - high earning NACE: 0.084***
% adults - low earning NACE: -0.106***
Manager: 0.150***
R2: 0.45 / 0.51 / 0.57


One possibility to introduce this region-dependency is the stratification of the model. This means that we divide our sample into blocks, run the model and allow imputation just within blocks. Separate imputation allows the effects of covariates to vary between regions. This alternative assumes that our sample is informative at regional level and provides enough information to model income.

Another approach that accounts for spatial correlation is based on the use of hierarchical/nested models that include covariates at two levels. By including both level 1 and level 2 predictors in the model, we can account for both individual characteristics and region characteristics. These account for between-area variation beyond that explained by the variation in unit covariates. These models express relationships among variables within a given level, and specify how variables at one level influence relations occurring at another level. Both random and fixed effects can be used in the same model.
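One possible two-level specification of this kind, sketched with the lme4 package (household within region, with a regional random intercept); the covariate and data frame names are illustrative and not taken from the paper.

```r
library(lme4)
ml_model <- lmer(log(hh_income) ~ hh_size + share_empl + share_high_edu +
                   (1 | region),          # region-level random intercept
                 data = silc_hh)
summary(ml_model)
```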

3.2. Matching with LFS

In our exercise we relied on mixed methods for the multiple imputation, specifically on the "predictive mean matching" method. This enables us to incorporate the robustness of regression-based methods and at the same time to mitigate the typical "regression to the mean" effect inherent in predictions. The imputation is done through the following steps (a minimal R sketch is given after the list):

• Regress income on covariates

• Apply estimated coefficients also in LFS

• Find the shortest distance between estimates in SILC/LFS

• Impute (L times) the real value in LFS
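The sketch below is one way these steps could look in R for a single imputation; refitting on a bootstrap resample of the donor file is used here as a simple stand-in for a proper multiple-imputation draw of the coefficients, and all object and column names (silc_hh, lfs_hh, hh_income) are illustrative.

```r
pmm_impute_once <- function(silc_hh, lfs_hh, formula) {
  boot <- silc_hh[sample.int(nrow(silc_hh), replace = TRUE), ]
  fit  <- lm(formula, data = boot)               # regress log income on covariates
  pred_don <- predict(fit, newdata = silc_hh)    # predicted means for the real donors
  pred_rec <- predict(fit, newdata = lfs_hh)     # apply the estimated coefficients in LFS
  # shortest distance between predicted means -> impute the donor's observed income
  donor <- sapply(pred_rec, function(p) which.min(abs(pred_don - p)))
  silc_hh$hh_income[donor]
}
# L = 100 imputations (run within regions for the stratified variant):
# imputed <- replicate(100, pmm_impute_once(silc_hh, lfs_hh,
#              log(hh_income) ~ hh_size + share_empl + share_high_edu))
```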

We applied the model both globally and by region. In the latter case we allow different effects by region: a person with the same occupation might have a different income according to the specific characteristics of the region. A further step will be to include hierarchical models in the multiple imputation procedure, in line with similar exercises in the small area estimation literature (Elbers et al., 2003, Molina and Rao 2010, Pratesi et al., 2011).

3.3. Quality evaluation

Some basic quality checks were implemented in order to verify that the distributions of the imputed and original variables are consistent. Based on the imputed income in LFS, we re-computed the equivalised income and the at-risk-of-poverty indicator at the individual level. We checked both marginal and joint distributions of the targeted variables, based on different models, stratification options and imputations. Tables 3-1 to 3-3 highlight some of these results.

Table 3-1 - Distribution of household income, SILC/LFS (imputed)
Survey (data source) | N Obs | Variable | Mean | Median
LFS | 3565121 | Household income | 34072.06 | 28629.51
SILC | 3561882 | Household income | 33097.88 | 29103.2



Table 3-2 - Distribution of equivalised income and at-risk-of-poverty for SILC/LFS (imputed) - AT
Survey (data source) | N Obs | Variable | Mean | Median
LFS | 8144008 | Eqvinc | 20757.56 | 18822
LFS | | AROP60 | 0.1227793 | 0
SILC | 8234551 | Eqvinc | 21383.51 | 19010.52
SILC | | AROP60 | 0.1235803 | 0

Table 3-3 - Differences in joint distribution of AROP with common variables (EU-SILC versus LFS): HD of the marginal distribution, and HD of the joint distribution with AROP60 without and with stratification
VARIABLE | HD | Joint with AROP60, without strata | Joint with AROP60, with strata
GENDER | 0.00% | 0.04% | 1.23%
AGE | 1.06% | 1.79% | 1.98%
CTR_B | 1.74% | 3.29% | 3.07%
CTR_C | 1.40% | 2.94% | 2.79%
MAR_STA | 1.74% | 1.83% | 2.40%
CON_UNI | 0.33% | 0.51% | 1.32%
CTR_R | 0.00% | 0.04% | 1.23%
URBAN | 0.53% | 1.38% | 1.53%
LABOUR | 6.72% | 7.28% | 7.30%
LABOUR2 | 1.41% | 2.37% | 2.52%
EMPLOY | 2.71% | 3.19% | 4.13%
OCUP | 2.33% | 3.46% | 4.16%
SECTOR | 0.99% | 2.26% | 2.56%
EDU | 2.76% | 4.77% | 4.43%
NBHOURS | 4.54% | 4.73% | 5.20%
MANAGER | 9.77% | 9.77% | 9.90%

Based on the estimated AROP we computed synthetic estimates at regional level. For each region, we calculate the at-risk-of-poverty rate as the mean over L=100 imputations.

$$\hat{Y}_{reg} = \frac{\sum_{l=1}^{L} \hat{Y}_{reg}^{\,l}}{L}$$
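In R this step could look roughly as follows, where imputed_arop is a hypothetical n x L matrix of imputed 0/1 poverty flags for the LFS individuals and lfs holds illustrative region and weight columns.

```r
arop_by_region <- sapply(seq_len(ncol(imputed_arop)), function(l)
  tapply(imputed_arop[, l] * lfs$weight, lfs$region, sum) /
    tapply(lfs$weight, lfs$region, sum))        # weighted rate per region, imputation l
regional_estimate <- rowMeans(arop_by_region)   # synthetic estimate: mean over L imputations
```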

Below we present some preliminary results comparing the direct and indirect regional estimates for the mean income and the at-risk-of-poverty rate. The results show an artificial reduction of poverty differentials between regions when we apply the same model to the whole sample (Figure 3-1). An important factor is certainly the lack of location effects in the model. In fact, when we apply strata by region, allowing for imputation just within regions, the indirect (model based) estimates follow the same variability patterns as the direct estimates (Figure 3-2). Practically, stratified imputation allows for different coefficients by region in the model, so that, for example, the effect of household type will depend on the specific region. However, for certain regions the discrepancies between SILC and LFS become more pronounced. Further work will need to include hierarchical models that account for both household and area level effects.


Figure 3-1 - Regional AROP - Austria - imputation with NO stratification (by region: SILC direct estimate, LFS-imputed estimate, SILC confidence interval bounds).

Figure 3-2 - Regional AROP - Austria - imputation WITH stratification by region (by region: SILC direct estimate, LFS-imputed estimate, SILC confidence interval bounds).

An exercise was also done to check the value added of the model based regional estimates. We estimate the mean square error (MSE) by the average of the sum of squares of the replicate estimates around their mean:

$$MSE(\hat{Y}_d) = \frac{1}{L} \sum_{l=1}^{L} \left( \hat{Y}_d^{\,l} - \bar{\hat{Y}}_d \right)^2, \qquad \text{where } \bar{\hat{Y}}_d = \frac{1}{L} \sum_{l=1}^{L} \hat{Y}_d^{\,l}$$


The standard deviation over all the replicates is the standard error of the estimation. On the basis of these simulations we can compare the original confidence intervals for the direct estimates with the synthetic intervals computed on the basis of the estimated standard error. Even if these first results indicate an improvement in the 'precision' of estimates we need to interpret them with caution as further work needs to develop the methodology for estimating the MSE based on a larger number of replicates, using bootstrap methods. Moreover, the overlap of intervals is sometimes very small and therefore we need to further investigate the root of these inconsistencies.
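Continuing the earlier sketch, the replicate-based MSE, standard error and a synthetic interval per region could be computed along these lines; arop_by_region is the hypothetical regions x L matrix built above, and the normal-approximation interval is our own illustrative choice rather than the authors' exact procedure.

```r
L   <- ncol(arop_by_region)
est <- rowMeans(arop_by_region)                   # synthetic point estimate per region
mse <- rowSums((arop_by_region - est)^2) / L      # MSE over the L replicates
se  <- sqrt(mse)                                  # standard error of the estimation
synthetic_ci <- cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)
```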

Figure 3-3 – Overlap between intervals for direct estimates (based on SILC data) and indirect estimates (based on 100 imputations)


4. Conclusions and further steps

The application of model based estimates for regional poverty indicators is still a research domain and there are still several open issues. This exercise explored one potential approach for improving the precision of SILC regional poverty estimates. Further work will need to focus on the specification of multilevel models that incorporate location effects.

For quality assessment, further work will draw on the small area estimation literature, which uses methods along the lines of bootstrap and simulation studies for estimating the MSE. This will allow comparing direct and synthetic estimates, in order to assess the potential value added of the model based estimates. In some cases the two estimates are combined based on criteria such as the sample size at regional level.



5. ANNEXES

Table 5-1 - Household dimensions (household derived variables and their description)
HHTYPE - Household type: '01' = 'One adult younger than 65 years - male'; '02' = 'One adult younger than 65 years - female'; '03' = 'One adult older or equal than 65 years - male'; '04' = 'One adult older or equal than 65 years - female'; '06' = '2 adults, both < 65 years'; '07' = '2 adults, at least one 65+ years'; '08' = 'Other, no dependent children'; '09' = 'Single parent, at least 1 dependent child'; '10' = '2 adults, 1 dependent child'; '11' = '2 adults, 2 dependent children'; '12' = '2 adults, 3+ dependent children'; '13' = 'Other households with dependent children'; '16' = 'Other'
HHSIZE - Household size
NBADULT - Number of adults living in a household
NBCHILD - Number of children under 18 living in a household
NBDEPCH - Number of dependent children living in a household
LESS65 - Number of adults aged 65 or less living in a household
OVER65 - Number of adults aged over 65 living in a household
YOUNG_ADULT - Number of young adults (less than 35) living in a household
MIDAGE_ADULT - Number of mid-age adults (35-65) living in a household
OLD_ADULT - Number of elder adults (over 65) living in a household
NB_MALE - Number of male adults living in a household
NB_FEMALE - Number of female adults living in a household
NB_UNEMPL - Number of unemployed adults living in a household
NB_EMPL - Number of employee adults living in a household
NB_SELF - Number of self-employed adults living in a household
NB_RETIRE - Number of retired adults living in a household
NB_INACTIVE - Number of other inactive adults living in a household
NB_HIGHOCUP - Number of adults living in a household and involved in a high-paid occupation
NB_MEDOCUP - Number of adults involved in a medium-paid occupation
NB_LOWOCUP - Number of adults living in a household and involved in a low-paid occupation
NB_HIGHSECT - Number of adults involved in a high-paid sector
NB_MEDSECT - Number of adults living in a household and involved in a medium-paid sector
NB_LOWSECT - Number of adults living in a household and involved in a low-paid sector
NB_HIGHEDU - Number of high-educated adults living in a household
NB_MEDEDU - Number of medium-educated adults living in a household
NB_LOWEDU - Number of low-educated adults living in a household
MANAGER - Number of adults with a managerial position living in a household


Table 5-2 - Marginal distributions for each region (AT)
REGION VARIABLE HD | REGION VARIABLE HD
11 region*URBAN 1.14% | 31 region*NBDEPCH 2.35%
12 region*URBAN 0.71% | 32 region*NBDEPCH 2.20%
13 region*URBAN 0.00% | 33 region*NBDEPCH 1.28%
21 region*URBAN 3.66% | 34 region*NBDEPCH 4.36%
22 region*URBAN 2.11% | 11 region*NB_CHILD15 2.35%
31 region*URBAN 1.83% | 12 region*NB_CHILD15 0.93%
32 region*URBAN 1.97% | 13 region*NB_CHILD15 2.18%
33 region*URBAN 2.71% | 21 region*NB_CHILD15 1.11%
34 region*URBAN 4.97% | 22 region*NB_CHILD15 2.02%
11 region*HHSIZE 7.87% | 31 region*NB_CHILD15 2.38%
12 region*HHSIZE 0.95% | 32 region*NB_CHILD15 2.13%
13 region*HHSIZE 1.49% | 33 region*NB_CHILD15 1.65%
21 region*HHSIZE 3.88% | 34 region*NB_CHILD15 3.07%
22 region*HHSIZE 2.46% | 11 region*NB_UNEMPL 5.92%
31 region*HHSIZE 2.32% | 12 region*NB_UNEMPL 1.45%
32 region*HHSIZE 3.70% | 13 region*NB_UNEMPL 1.39%
33 region*HHSIZE 3.67% | 21 region*NB_UNEMPL 2.65%
34 region*HHSIZE 4.01% | 22 region*NB_UNEMPL 2.34%
11 region*HHTYPE 9.54% | 31 region*NB_UNEMPL 3.05%
12 region*HHTYPE 4.67% | 32 region*NB_UNEMPL 3.06%
13 region*HHTYPE 5.43% | 33 region*NB_UNEMPL 3.27%
21 region*HHTYPE 6.01% | 34 region*NB_UNEMPL 9.42%
22 region*HHTYPE 5.09% | 11 region*NB_EMPL 8.08%
31 region*HHTYPE 3.62% | 12 region*NB_EMPL 3.93%
32 region*HHTYPE 5.75% | 13 region*NB_EMPL 3.87%
33 region*HHTYPE 6.87% | 21 region*NB_EMPL 5.10%
34 region*HHTYPE 8.57% | 22 region*NB_EMPL 6.07%
11 region*NBADULT 6.08% | 31 region*NB_EMPL 3.07%
12 region*NBADULT 3.01% | 32 region*NB_EMPL 6.70%
13 region*NBADULT 2.67% | 33 region*NB_EMPL 6.86%
21 region*NBADULT 4.74% | 34 region*NB_EMPL 3.93%
22 region*NBADULT 1.80% | 11 region*NB_SELF 2.36%
31 region*NBADULT 2.80% | 12 region*NB_SELF 1.23%
32 region*NBADULT 4.59% | 13 region*NB_SELF 2.53%
33 region*NBADULT 7.60% | 21 region*NB_SELF 6.22%
34 region*NBADULT 5.17% | 22 region*NB_SELF 0.23%
11 region*NBCHILD 4.97% | 31 region*NB_SELF 1.78%
12 region*NBCHILD 1.37% | 32 region*NB_SELF 1.22%
13 region*NBCHILD 2.90% | 33 region*NB_SELF 2.25%
21 region*NBCHILD 4.12% | 34 region*NB_SELF 0.96%
22 region*NBCHILD 2.49% | 11 region*NB_RETIRE 4.02%
31 region*NBCHILD 2.69% | 12 region*NB_RETIRE 3.83%
32 region*NBCHILD 2.72% | 13 region*NB_RETIRE 2.56%
33 region*NBCHILD 1.43% | 21 region*NB_RETIRE 3.84%
34 region*NBCHILD 5.59% | 22 region*NB_RETIRE 3.35%
11 region*NBDEPCH 5.22% | 31 region*NB_RETIRE 3.53%
12 region*NBDEPCH 1.21% | 32 region*NB_RETIRE 1.09%
13 region*NBDEPCH 2.23% | 33 region*NB_RETIRE 6.97%
21 region*NBDEPCH 3.46% | 34 region*NB_RETIRE 1.83%
22 region*NBDEPCH 1.72% | 11 region*NB_INACTIVE 3.45%


REGION VARIABLE HD | REGION VARIABLE HD
12 region*NB_INACTIVE 2.01% | 31 region*NB_INACTIVE 2.02%
13 region*NB_INACTIVE 3.28% | 32 region*NB_INACTIVE 6.53%
21 region*NB_INACTIVE 1.92% | 33 region*NB_INACTIVE 1.81%
22 region*NB_INACTIVE 2.93% | 34 region*NB_INACTIVE 4.96%


References

ESSnet on Data Integration materials: http://www.essnet-portal.eu/di/data-integration
Coli, A., Tartamella, F., Sacco, G., Faiella, I., Scanu, M., D'Orazio, M., Di Zio, M., Siciliani, I., Colombini, S. and Masi, A. (2005) La costruzione di un archivio di microdati sulle famiglie italiane ottenuto integrando l'indagine ISTAT sui consumi del
Conti, P.L., Di Zio, M., Marella, D., Scanu, M. (2009) Uncertainty analysis in statistical matching, First Italian Conference on Survey Methodology (ITACOSM09), Siena, 10-12 June 2009.
Conti, P.L., Marella, D., Scanu, M. (2008) Evaluation of matching noise for imputation techniques based on the local linear regression estimator. Computational Statistics and Data Analysis, 53, 354-365.
D'Orazio, M., Di Zio, M., Scanu, M. (2006) Statistical Matching, Theory and Practice. Wiley, Chichester.
D'Orazio, M., Di Zio, M. and Scanu, M. (2006) Statistical matching for categorical data: displaying uncertainty and using logical constraints. Journal of Official Statistics, 22, 137-157.
Elbers, C., Lanjouw, J.O., Lanjouw, P. (2003) Micro-Level Estimation of Poverty and Inequality. Econometrica, 71(1), 355-364.
Gilula, Z., McCulloch, R.E., Rossi, P.E. (2006) A direct approach to data fusion, Journal of Marketing Research, 43, 73-83.
Kadane, J.B. (1978) Some statistical problems in merging data files. In Department of Treasury, Compendium of Tax Research, pp. 159-179. Washington, DC: US Government Printing Office.
Lanjouw, P., Mathernova, K., de Laat, J. World Bank Poverty Maps to Improve Targeting and to Design Better Poverty Reduction and Social Inclusion Policies. Presentation, 24 March 2011.
Marella, D., Scanu, M., Conti, P.L. (2008) On the matching noise of some nonparametric imputation procedures, Statistics and Probability Letters, 78, 1593-1600.
Molina, I., Rao, J.N.K. (2010) Small area estimation of poverty indicators. Canadian Journal of Statistics, 38, 369-385.
Moriarity, C. (2009) Statistical Properties of Statistical Matching, VDM Verlag.
Moriarity, C. and Scheuren, F. (2001) Statistical matching: a paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics, 17, 407-422.
Moriarity, C. and Scheuren, F. (2003) A note on Rubin's statistical matching using file concatenation with adjusted weights and multiple imputation. Journal of Business and Economic Statistics, 21, 65-73.
Office for National Statistics. Small Area Model-Based Income Estimates, 2007/2008. http://neighbourhood.statistics.gov.uk/dissemination/Info.do?page=analysisandguidance/analysisarticles/income-small-area-model-based-estimates-200708.htm
Paass, G. (1986) Statistical match: evaluation of existing procedures and improvements by using additional information. In G.H. Orcutt, J. Merz and H. Quinke (eds) Microanalytic Simulation Models to Support Social and Financial Policy, pp. 401-422. Amsterdam: Elsevier Science.
Pratesi, M., Marchetti, S., Giusti, C., Salvati, N. (2011) Robust Small Area Estimation for Poverty Indicators. Department of Statistics and Mathematics Applied to Economics, University of Pisa, ITACOSM 2011, Presentation, Pisa, 27-29 June 2011.
Raessler, S. (2002) Statistical Matching: A Frequentist Theory, Practical Applications and Alternative Bayesian Approaches. New York: Springer-Verlag.

Raessler, S., Kiesl, H. (2009) How useful are uncertainty bounds? Some recent theory with an application to Rubin's causal model. 57th Session of the International Statistical Institute, Durban (South Africa), 16-22 August 2009.
Rodgers, W.L. (1984) An evaluation of statistical matching. Journal of Business and Economic Statistics, 2, 91-102.
Rubin, D.B. (1974) Characterizing the estimation of parameters in incomplete-data problems. Journal of the American Statistical Association, 69, 467-474.
Rubin, D.B. (1986) Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics, 4, 87-94.
Ruggles, N. (1999) The development of integrated data bases for social, economic and demographic statistics. In N. Ruggles and R. Ruggles (eds) Macro- and Microdata Analyses and Their Integration, pp. 410-478. Cheltenham: Edward Elgar.
Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1990) On methods of statistical matching with and without auxiliary information. Technical Report SSMD-90-016E, Methodology Branch, Statistics Canada.
Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1993) Statistical matching: Use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology, 19, 59-79.

Data Integration Application with Coarsened Exact Matching

Mariana Kotzeva, Bulgarian NSI, University of National and World Economy, [email protected]
Roumen Vesselinov, Bulgarian NSI, Sofia University St. Kliment Ohridski, [email protected]

Abstract: This paper focused on the problem of integrating data from two distinct sources or groups for statistical analysis. The two groups could come, for instance, from a representative sample and a business register, or arise in relation to the non-response bias problem. We investigated the properties of some traditional techniques, such as propensity scores and simple regression, and of a more advanced method, coarsened exact matching. The main finding of the paper was that most methods were comparable in simple cases of bias, but that in more complicated cases of bias the exact matching approach was superior.

Keywords: exact matching, data integration, generalized log-linear models

1. Introduction

The problem of integrating data from two (or more) distinct sources can be addressed by some of the already established statistical methods for weighting, propensity score weighting and stratification (Rosenbaum and Rubin, 1983). On the other hand, more recent methods for exact matching have emerged (Iacus et al, 2011a, 2011b). It was our intention to test and compare the estimation properties of the traditional and of the more recent methods with data for Bulgaria.

2. Data Sources

The data used in this paper were from the Bulgarian register of enterprises for 2008. The variables included were as follows: Type of enterprise: 1 = Sole proprietor, 0 = Limited liability company or Partnership; Foreign ownership: Yes/No; Region: 6 economic regions in Bulgaria; Labour: number of employed persons; Economic sector: 1 = Industry, 2 = Services, 3 = Agriculture; Revenue: in thousand Bulgarian leva, current prices; Investment: spending on capital assets, in thousand Bulgarian leva, current prices, also used as a binary Yes/No investment indicator and as the ratio of investment to revenue (limited to between 0 and 1). Indicator (dummy) variables were created for the categorical variables whenever necessary. Enterprises with no employed persons, no revenue, a ratio of investment to revenue greater than one, or extremely large values of revenue or investment were excluded from the population. A 5% random sample was drawn from the rest of the population. The final sample size was N = 13851. The classical interpretation (Rosenbaum and Rubin, 1983 and Rosenbaum, 2002) focuses the sample selection bias on the imbalance in the covariates between "Treatment" and "Control" groups. In this paper we treated the problem more broadly. Under "sample selection bias" we understood the problem of integrating data from two

sources (sample and register), or addressing the non-response bias (Matsuo et al, 2010). For this purpose we introduced a bias indicator variable (0/1) where 0 was interpreted as the sample data and 1 as the data from the register of enterprises. We worked with two types of bias, "random" and "non-random". For the random bias we generated a random variable that assigned the cases (40% to 60% ratio) to the two groups (e.g. sample and register). For the non-random bias we assigned a value of 1 to all enterprises with only 1 employed person and 0 to the rest.

3. Methodology

Three different types of models were considered. Model 1: regression model with Revenue as dependent variable and Labour as independent; Model 2: logistic regression model with Investment (Y/N) as dependent variable and Labour as independent; Model 3: zero-inflated Poisson (ZIP) model with the ratio of Investment/Revenue dependent on Labour (in thousands). The ZIP model was specifically designed (Long, 1997 and Lambert, 1992) to handle count or rate (as in our case) variables with many zeroes; in our sample 71.3% did not have any investment. This is a type of generalized log-linear model, or a mixture model with two classes: zero and non-zero. Vuong (1989) proposed a test to determine whether the ZIP model is to be preferred to the traditional Poisson model. Four different methods for addressing sample selection bias were implemented in the paper: A: no weighting and no matching; B: propensity score weighting; C: propensity score stratification (5 strata); and D: coarsened exact matching (CEM). The propensity score methods involved first estimating a logistic regression model with the bias (0/1) as dependent variable and region, type, foreign ownership, and economic sector as independent variables. The predicted values of the models were saved as propensity scores (PS). They were used in two ways, as weights (similar to Matsuo et al, 2010) and by creating 5 strata based on the PS quintiles, as suggested by Rosenbaum and Rubin (1983). CEM is a type of exact matching method which reduces the potential differences between the data from the two data sources (sample and register) by grouping or coarsening the data into bins, exactly matching on the coarsened data, and then running the analysis on the matched data. This is a type of monotonic imbalance bounding and it has very attractive statistical properties (Blackwell et al, 2009 and Iacus et al, 2011a, 2011b).
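As an illustration of method B, a propensity score weighting step could be sketched in R as follows; the data frame firms and its column names are hypothetical stand-ins for the register extract described above, and the weighting scheme shown is one common inverse-probability choice rather than the authors' exact implementation.

```r
ps_model <- glm(bias ~ region + type + foreign + sector,
                family = binomial, data = firms)
firms$ps <- fitted(ps_model)                       # estimated propensity scores
# inverse-probability-type weights to balance the two groups (bias = 1 vs 0)
firms$w  <- ifelse(firms$bias == 1, 1 / firms$ps, 1 / (1 - firms$ps))
model1_B <- lm(revenue ~ labour, data = firms, weights = w)   # Model 1, method B
# method C would stratify on quintiles of firms$ps; method D would coarsen and
# exactly match first, e.g. with the cem package of Iacus, King and Porro
```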

4. Results

The analysis was done separately for the random and non-random bias and for the three models using standard methods and the three methods for adjustment of the sample bias.

4.1. Results for Random Bias

This was the case where some of the data were considered as collected by a survey and some as coming from a register, and there was no known pattern or bias related to the source of the data. The results for the random bias estimation are presented in Tables 1, 2 and 3. For the regression model and the logistic regression model, CEM worked as well as the other methods (see Tables 1 and 2 respectively). For the ZIP model (Table 3) the PS stratification did not work well, while the other three methods worked similarly well. The conclusion

was that in the case of random bias the use of CEM did not gain much compared to the PS-based methods. The results were comparable.

Table 1: Random Bias Estimation Results for Model 1.

Method | Regression Coefficient | P-value | 95% CI
A No weighting and no matching | 89.5 | <.001 | 87.6-91.3
B Propensity Score Weighting | 88.7 | <.001 | 85.8-91.5
C Mean Propensity Score, 5 Strata | 97.3 | <.001 | 93.7-100.9
D Coarsened Exact Matching | 91.2 | <.001 | 89.4-93.1

Table 2: Random Bias Estimation Results for Model 2.
Method | Odds Ratio | P-value | 95% CI
A No weighting and no matching | 1.16 | <.001 | 1.15-1.17
B Propensity Score Weighting | 1.16 | <.001 | 1.15-1.18
C Mean Propensity Score, 5 Strata | 1.20 | <.001 | 1.16-1.23
D Coarsened Exact Matching | 1.16 | <.001 | 1.15-1.17

Table 3: Random Bias Estimation Results for Model 3.
Method | Incidence-Rate Ratio | P-value | 95% CI
A No weighting and no matching | 1.69 | 0.009 | 1.14-2.51
B Propensity Score Weighting | 1.70 | 0.096 | 0.91-3.18
C Mean Propensity Score, 5 Strata* | 3.64* | Range too wide | Range too wide
D Coarsened Exact Matching | 1.72 | 0.007 | 1.16-2.54
* Two extreme results excluded.

4.2 Results for Non-Random Bias

This was the case where, for example, some of the data were collected by a survey and some came from a register, and there was a known pattern to where the data came from. As in our experiment, the data for small enterprises (only 1 employed person) came only from the register, while the data for larger enterprises (more than 1 employed) came from the survey. The results for the non-random bias estimation are presented in Tables 4, 5 and 6. For the regression model, CEM showed very different results from the other three methods (see Table 4). The coefficient estimate and its 95% CI were below the range of the other methods. Theoretically the exact matching had some advantages over the PS methods, so we were more inclined to believe the CEM results. So in this case CEM did make a difference.

Table 4: Non-Random Bias Estimation Results for Model 1.
Method | Coefficient | P-value | 95% CI
A No weighting and no matching | 89.5 | <.001 | 87.6-91.3
B Propensity Score Weighting | 80.8 | <.001 | 78.3-83.3
C Mean Propensity Score, 5 Strata | 87.4 | <.001 | 83.6-91.3
D Coarsened Exact Matching | 73.2 | <.001 | 71.7-74.7

Table 5: Non-Random Bias Estimation Results for Model 2.
Method | Odds Ratio | P-value | 95% CI
A No weighting and no matching | 1.16 | <.001 | 1.15-1.17
B Propensity Score Weighting | 1.19 | <.001 | 1.17-1.22
C Mean Propensity Score, 5 Strata | 1.22 | <.001 | 1.18-1.27
D Coarsened Exact Matching | 1.19 | <.001 | 1.18-1.21

Table 6: Non-Random Bias Estimation Results for Model 3.
Method | Incidence-Rate Ratio | P-value | 95% CI
A No weighting and no matching | 1.69 | 0.009 | 1.14-2.51
B Propensity Score Weighting | 1.72 | 0.118 | 0.87-3.41
C Mean Propensity Score, 5 Strata* | 1.37* | Range too wide | Range too wide
D Coarsened Exact Matching | 1.67 | 0.049 | 1.00-2.78
* Three extreme results excluded.

For the logistic regression model (Table 5) and the ZIP model (Table 6) all the methods except the PS stratification gave similar results.

5. Discussion

The results of this study showed that the theoretical advantages of CEM and of the class of exact matching methods were confirmed empirically. CEM performed as well as the PS methods, and in some cases it gave very distinct results. More empirical work is needed, but in our opinion the exact matching methods for the adjustment of sample bias and for data integration deserve the attention of researchers and practitioners.

References

Blackwell, M., Iacus, S., King, G., Porro, G. (2009) CEM: Coarsened exact matching in Stata, The Stata Journal, Number 4, 524-546.
Iacus, S.M., King, G., Porro, G. (2011a) Causal Inference Without Balance Checking: Coarsened Exact Matching, Political Analysis, 2011.
Iacus, S.M., King, G., Porro, G. (2011b) Multivariate Matching Methods That are Monotonic Imbalance Bounding, Journal of the American Statistical Association, 106, 345-361.
Lambert, D. (1992) Zero-inflated Poisson regression models with an application to defects in manufacturing, Technometrics, 34(1), 1-14.
Long, J. (1997) Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Matsuo, H., Loosveldt, G., Billiet, J., Berglund, F., Kleven, O. (2010) Measurement and adjustment of non-response bias based on non-response surveys: the case of Belgium and Norway in the European Social Survey Round 3, Survey Research Methods, 4(3), 165-178.
Rosenbaum, P., Rubin, D. (1983) The central role of the propensity score in observational studies for causal effects, Biometrika, 70(1), 41-55.
Rosenbaum, P. (2002) Observational Studies, 2nd ed. New York: Springer-Verlag.
Vuong, Q. (1989) Likelihood Ratio Tests for model selection and non-nested hypotheses, Econometrica, 57(2), 307-333.


Section III – Data Integration in practice

Data integration and SDE in Poland – experiences and problems

Elzbieta Golata Poznan University of Economics, al. Niepodleglosci 10, 61-875 Poznan, Poland, e-mail: [email protected]

Abstract: The aim of the study is twofold. On the one hand, it presents Polish experiences concerning two of the most important methodological issues of contemporary statistics: data integration (DI) and statistical estimation for small domains (SDE). On the other hand, it attempts to determine the relationship between these two groups of methods. Given the convergence of the objectives of both SDE and DI, that is, striving to increase the efficiency of the use of existing sources of information, a simulation study was conducted. It was aimed at verifying the hypothesis of synergies resulting from the combined application of both groups of methods: SDE and DI.

Keywords: small domain estimation, data integration

1. Aim of the study

The study was aimed at presenting Polish experiences in Small Domain Estimation (SDE) and Data Integration (DI). This goal will be realized in an indirect way. First, some basic remarks concerning both groups of methods will be discussed, pointing out similarities and dissimilarities, especially in such dimensions as purpose, methods and techniques, data sources, evaluation, and other problems and threats that appear in practical application. In general, both groups of methods are used to improve the quality of statistical estimates and to increase their substantive range and precision using all available sources of information. It can be assumed that the combined application of both methods will result in synergy effects on the quality of statistical estimates. Small Domain Estimation comprises techniques aimed at providing estimates for subpopulations (domains) for which the sample size is not large enough to yield direct estimates of adequate precision. Therefore, it is often necessary to use indirect estimates that 'borrow strength' by using values of the variables of interest from related areas (domains) or time periods, and sometimes from both. These values are brought into the estimation process through a model. The availability of good auxiliary data and suitable linking models are crucial to indirect estimates (see Rao 2005). Reviews of small area estimation methods are included, among others, in Ghosh and Rao (1994), Rao (1999, 2003), Pfeffermann (1999) and Skinner (1991). Data Integration can be understood as a set of different techniques aimed at combining information from distinct sources of data which refer to the same target population. Moriarity and Scheuren (2001, p. 407) indicated that practical needs formed the basis for the development of statistical methods for data integration (see also Scheuren 1989). Among the basic studies in this subject, the following should be mentioned: Kadane (2001), Rogers (1984), Winkler (1990, 1994, 1995, 1999, 2001), Herzog T. N., Scheuren


Because of the growing need for complex, multidimensional information for different subsets or domains, in times of crisis and financial constraints, data integration is becoming a major issue. The problem is to use information available from different sources efficiently so as to produce statistics on a given subject while reducing costs and response burden and maintaining quality (Scanu 2010).

Both groups of techniques refer to additional data sources that are specifically exploited. These can be two data sets obtained from independent sample surveys. Another frequently encountered situation is the use of administrative data resources such as registers; in this case data from registers are linked to survey data. Via the data integration process we can extend and enrich the information available from a sample survey with data from administrative registers. In this way we enable 'borrowing strength' from other data sources at the individual level, which, assuming a strong correlation, allows estimating from the sample for domains at a lower aggregation level than the one resulting from the original sample size. This seems to be the most important connection between SDE and DI and the main advantage of the joint implementation of both techniques.

For this reason, an attempt was made to determine the relationship between these two groups of methods. Given the convergence of the objectives of SDE and DI, namely striving to increase the efficiency of the use of existing sources of information, a simulation study was conducted. It was aimed at verifying the hypothesis of synergies in data quality and availability resulting from the combined application of both groups of methods. First, basic characteristics of both groups of methods are presented in the context of Polish experience. Next, two simulation studies, which attempt to apply the indirect estimation methodology to databases resulting from the integration of different sources, are discussed. In the first case these are data from a sample survey and administrative records; the second case study refers to data from two surveys. The procedures used in the simulation studies are discussed in more detail with references to the literature. An empirical assessment of the simulation studies forms the basis for the final conclusions.

2. Data Integration and Small Domain Estimation in Poland

For a long time the need to use alternative sources of information in Polish public statistics was not recognized. An exception may be fields that traditionally made use of administrative resources, such as justice statistics. On the other hand, even in such basic areas as vital statistics the administrative records were not fully accepted: for example, the Central Population Register PESEL was for years not used for constructing population projections. Significant differences were observed in the population structure by age and place of residence according to official statistics estimates based on the census structure and according to the register (see Figure 1). The divergence, measured by the relative difference $W_{L_t/P_t}$ between the number of population estimated by official statistics ($L_t$) and the Population Register ($P_t$), amounted for the city of Poznan at the end of 2000 to even more than 30%:

$$W_{L_t/P_t} = \frac{L_t - P_t}{P_t} \cdot 100 \qquad (1)$$
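As a simple illustration, formula (1) can be evaluated for a few age groups with a couple of lines of R; the input vectors below are hypothetical, not the actual Poznan figures:

```r
# Hypothetical counts by age group: official estimates (L) and register (P)
L <- c(5200, 4800, 6100)   # official statistics estimates
P <- c(4900, 5000, 6300)   # Central Population Register (PESEL) counts

W <- (L - P) / P * 100     # relative difference of formula (1), in percent
round(W, 1)
```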


The three largest relative differences deserve particular attention. The first is almost 8 percentage points of surplus of the population estimates over the register for those at zero years of age (children before their first birthday). As this difference is of the same magnitude for both sexes, it can be assumed that it stems from delays in the birth register. Another characteristic feature is the excess of population estimates for ages 18–25. The reason is probably the recognition by the census of young people (studying or working in Poznan) as permanent residents although they do not have such status, whereas the population register reflects the legal status determined by permanent residence. For people over 25 years, a systematic decrease in the relative differences can be noticed. This may indicate a return of persons to their place of permanent residence, or legalization of their residence because of work or marriage. A significant negative difference between population estimates and the register can also be noticed for the population aged about 85 years and more. This is probably related to the under-coverage of the elderly in the last census. Confirmation of this hypothesis can be found in population tables for subsequent years after the census, in which negative numbers of people aged over 90 would have to appear if deaths by age were accounted for at various levels of spatial aggregation. It follows that the persons who died had not been included in the census (see Multivariate analysis of errors…, 2008, p. 13-14).

Figure 1: Relative differences between population estimates by official statistics (Lt) and Population Register (Pt), city of Poznan, 31.12.2000 Source: Tomasz Józefowski, Beata Rynarzewska-Pietrzak, 2010

Changes in the intensity of use of administrative records took place within the last five years, during preparations for the National Census of Population and Housing, which was conducted from April to June 2011. This census was based on the population register but used data from about 30 other registers. In addition, a survey on a 20% sample allowed collection of detailed information on demographic and social structures as well as economic activity. Among the main Polish experiences in SAE and DI one should mention:


1. EURAREA – Enhancing Small Area Estimation Techniques to Meet European Needs, IST-2000-26290, Poznan University of Economics, 2003–2005
2. ESSnet on Small Area Estimation – SAE 61001.2009.003-2009.859, Statistical Office in Poznan, 2010–2011
3. ESSnet on Data Integration – DI 61001.2009.002-2009.832, Statistical Office in Poznan, 2010–2011
4. Modernisation of European Enterprise and Trade Statistics – MEETS 30121.2009.004-2009.807, Central Statistical Office, 2010–2011
5. Experimental research conducted by the Group for Mathematical and Statistical Methods in the Polish Agriculture Census PSR 2010 and the National Census of Population and Housing NSP 2011:
• Data integration of the Central Population Register PESEL and the Labour Force Survey, July 2009
• Nonparametric matching of datasets from a micro-census and the Labour Force Survey, 2011
• Propensity score matching of the Labour Force Survey and the Polish General Social Survey PGSS to enlarge the information scope of the social data base, May 2011

Both groups of methods, Data Integration and Small Domain Estimation, refer to additional data sources. In SDE, auxiliary data are needed to 'borrow strength'. To meet this requirement, the additional, external data source should be a reliable one. Typically, owing to legally specified rules regulating the organization of registers, administrative records seem to satisfy this requirement. It is also important that in many cases registers provide population data (and, though the population of interest might be defined differently, also population totals). On the other hand, there are some small area estimators that require only domain totals, so in the estimation procedure individual data are not necessary. To sum up, we began by applying small domain estimation methodology with area level models. First we used integrated data from a sample and a register; secondly, a case of integrating two samples was considered. In each of the two cases a simulation study was conducted and the small domain estimators GREG, SYNTHETIC and EBLUP were applied to the integrated data. In the next section, the presentation of experiences in integrating sample data with registers refers to results obtained within the MEETS1 project. In the following section, the study on the integration of two samples was based on pseudo-population data from the Polish micro-census 1995. The process of estimating statistics for small domains applied in both sections relied on findings of the EURAREA2 project. The main task of that project was to popularize indirect estimation methods and to assess their properties with respect to complex sampling designs used in statistical practice. In addition to conducting a detailed analysis of the research problem, the project participants created specialist software designed to implement the estimation techniques developed in the project. The

1 The MEETS project was conducted under Grant Agreement No. 30121.2009.004-2009.807 signed on 31.10.2009 between the European Commission and the Central Statistical Office of Poland, between 01.11.2009 and 28.02.2011. The project was aimed at Modernisation of European Enterprise and Trade Statistics, especially at examining the possibilities of using administrative registers to estimate enterprise indicators. 2 The European project EURAREA IST-2000-26290, Enhancing Small Area Estimation Techniques to Meet European Needs, was part of the Fifth Framework Programme of the European Community for research, technological development and demonstration activities. The project was coordinated by the ONS (Office for National Statistics, UK) with the participation of six countries: the United Kingdom, Finland, Sweden, Italy, Spain and Poland.


software, with associated theoretical and technical documentation, was published on the Eurarea project website3 (Eurarea_Project_Reference_Volume, 2004). Estimation in both sections was conducted using the EBLUPGREG program4. A description of the estimators used in this study, presented in Annex 1, is based on R. Chambers and A. Saei (2003).

3. Empirical evaluation of SDE for linked data - integrating sample data with register - MEETS

One of the goals of the MEETS project was to highlight the possibilities of using administrative resources to estimate enterprise indicators in a twofold way (see Use of Administrative Data for Business Statistics, 2011):
- to increase the estimation precision;
- to increase the information scope by providing estimates that take into account the kind of business activity (PKD classification) at the regional level.

Data Integration

The following administrative systems, constituting potential sources for short-term and annual statistics of small, medium and big enterprises, were identified, described and used as auxiliary data sources in the estimation process:
1) Tax system – information system conducted by the Ministry of Finance, fed with data from tax declarations and statements as well as identification request forms, comprising:
− database on taxpayers of the personal income tax – PIT
− database on taxpayers of the corporate income tax – CIT
− database on taxpayers of the value added tax – VAT
− National Taxable Persons Records – KEP.
2) System of social insurance – information system conducted by the Social Insurance Institution, the so-called Comprehensive IT System of the Social Insurance Institution (KSI ZUS), fed with data from insurance documents concerning contribution payers and the insured: the Central Register of the Insured (CRU) and the Central Register of Contribution Payers (CRPS), comprising:
− register of natural persons (GUSFIZ)
− register of legal persons (GUSPRA).

The primary source of data on companies in Poland is the DG-1 survey carried out by the Central Statistical Office. This survey covers all large companies (more than 50 employees) and a 20% sample of medium-sized enterprises (from 10 to 49 employees). In the research the following data from the DG-1 survey were used:
− the DG-1 database directory – a list of all small, medium and large economic units, used as a frame
− the DG-1 survey for 2008.

3 The Eurarea_Project_Reference_Volume (2004) can be downloaded from http://www.statistics.gov.uk/eurarea. 4 Veijanen A., Djerf K., Sőstra K., Lehtonen R., Nissinen K., 2004, EBLUPGREG.sas, program for small area estimation borrowing Strength Over Time and Space using Unit level model, Statistics Finland, University of Jyväskylä


The data available consisted of over 180 files of different size and structure. For the purposes of the study, December 2008 was treated as the reference period, as for this period most information from the administrative databases was available. To match the records from different datasets, two primary keys were used: the NIP and REGON identification numbers. The purpose of integration was to create a database in which an economic entity would be described by the largest possible number of variables. The DG-1 directory from December 2008 was used as a starting point. This data set was combined with information from the administrative databases and the DG-1 reporting. The main obstacle to matching records were missing identification numbers5.
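A minimal sketch of this deterministic linkage step in base R; the data frames and column names below are hypothetical stand-ins for the DG-1 directory and an administrative extract, both carrying the NIP and REGON keys:

```r
# Hypothetical example data frames; in practice these would come from the
# DG-1 directory and the KEP/CIT/PIT/ZUS extracts
dg1   <- data.frame(REGON = c("123456789", "987654321"),
                    NIP   = c("111-111-11-11", "222-222-22-22"),
                    revenue_dg1 = c(540, 1200))
admin <- data.frame(NIP   = c("111-111-11-11", "333-333-33-33"),
                    REGON = c("123456789", "555555555"),
                    revenue_tax = c(560, 90))

# Deterministic record linkage on the two identification numbers;
# all.x = TRUE keeps unmatched DG-1 records so they can be inspected
linked <- merge(dg1, admin, by = c("NIP", "REGON"), all.x = TRUE)

# Records that could not be matched (missing or inconsistent identifiers)
unmatched <- linked[is.na(linked$revenue_tax), ]
```

Keeping the unmatched DG-1 records is one way the share of unmatched records of the kind reported in Table 1 could be derived.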

Table 1: Results of integrating datasets from statistical reporting and administrative databases

Voivodship | Matched records, all sections: DG-1 directory | Matched records, all sections: DG-1 | Matched records, 4 sections*: DG-1 directory | Matched records, 4 sections*: DG-1 | Percentage of unmatched records | Number of NIP duplicates
Dolnoslaskie | 6044 | 2176 | 4561 | 1601 | 2,7 | 37
Kujawsko-pomorskie | 4018 | 1694 | 3331 | 1392 | 2,2 | 13
Lubelskie | 3040 | 1217 | 2485 | 961 | 1,4 | 2
Lubuskie | 2278 | 944 | 1789 | 733 | 1,4 | 7
Lodzkie | 5666 | 2153 | 4707 | 1744 | 2,1 | 56
Malopolskie | 6844 | 2402 | 5314 | 1860 | 2,6 | 45
Mazowieckie | 15059 | 4783 | 11172 | 3578 | 13,5 | 167
Opolskie | 1912 | 852 | 1519 | 654 | 1,7 | 7
Podkarpackie | 3543 | 1529 | 2925 | 1239 | 1,3 | 16
Podlaskie | 1892 | 774 | 1540 | 614 | 1,9 | 7
Pomorskie | 5220 | 1744 | 3906 | 1347 | 4,2 | 16
Slaskie | 11066 | 3970 | 8728 | 3049 | 2,5 | 47
Swietokrzyskie | 2131 | 902 | 1730 | 687 | 1,8 | 24
Warminsko-mazurskie | 2932 | 1093 | 2159 | 847 | 5,7 | 7
Wielkopolskie | 10553 | 3256 | 8460 | 2724 | 11,2 | 57
Zachodniopomorskie | 3270 | 1209 | 2324 | 911 | 4,7 | 32

Remark: * The study was restricted to the four biggest PKD sections: manufacturing, construction, trade and transport.
Source: Use of Administrative Data for Business Statistics, GUS, US Poznan 2011

In the process of database integration a special MEETS real data set was created. It contained records about economic entities representing the four PKD sections of economic activity (manufacturing, construction, trade, transport), which participated in the DG-1 survey in December 2008 and which were successfully combined with

5 It should be stressed that the REGON number is used as the main identification number for statistical sources, while institutions such as the Ministry of Finance or the Social Insurance Institution rely mostly on the NIP number.


information from the KEP, CIT, PIT and ZUS databases (see Table 1). The database was treated as the population in the simulation study. There were various reasons for multiple matching of NIP numbers. In the case of some enterprises, the ZUS register contained two or more NIP numbers for one REGON number6. The majority of records that could not be matched were those relating to small entities. For example, out of the 1,183 records of the DG-1 directory for the Wielkopolska voivodship that could not be matched with register records, 1,173 were small entities. This indicates that the DG-1 directory is largely out of date with respect to enterprises employing from 10 to 49 persons. In the case of medium and big enterprises, which are all subject to DG-1 reporting, the data are regularly updated. In contrast, only 10% of small enterprises are subject to DG-1 reporting. Consequently, it is impossible to update the DG-1 directory for this section of enterprises7.

A. Scale fitted to units with the highest revenue (limited to PLN 10 000 000); B. Scale not fitted to units with the highest revenue (limited to PLN 10 000)

Figure 2: Relationship between the values of accumulated revenue - from DG-1, PIT or CIT register, all units together 2008 Source: Use of Administrative Data for Business Statistics, GUS, US Poznan 2011

Following the integration of databases it was possible to assess the quality of information provided by the statistical reporting. One noteworthy fact was a considerable number of economic entities with the null value for revenue in the DG-1 survey and positive values of revenue in the PIT and CIT databases (see fig. 2.A and 2.B). Most discrepancies between values in the databases and those in the DG-1 survey could be accounted for by a certain terminological incompatibility between the definition of revenue in each of the data sources. In the DG-1 survey the variable revenue comprises only sales of goods and services produced by the enterprise. Consequently, if an enterprise doesn’t produce anything but acts only as a sales agent, it earns no revenue according to this definition.

6 This situation occurred when the activity of a given enterprise was carried out by more than one person, each identified by a separate NIP number. In the case of a parent business unit and its local units, the first 9 digits of the 14-digit REGON numbers were identical. As the DG-1 directory contains only 9-digit numbers, identifying the parent business unit, data integration resulted in combining information about the parent business unit as well as other related local units present in the databases. 7 Statistical offices have only registration information from the start of economic activity, when the REGON number is assigned. Information about activity closure has only been systematically available since the introduction of new regulations on 31 March 2009.


The scatterplot presenting DG-1 and PIT data (fig. 2.A) seems to centre around the identity line. However, closer analysis reveals that the line is formed largely by relatively numerous units characterized by extreme values of revenue. If these units are omitted by limiting revenue to the level of PLN 10,000, the resulting picture is significantly different (fig. 2.B). In addition to units for which revenue reported in the DG-1 survey coincides with the value reported in tax return forms (y1 = y2), one can see two other patterns. First, there is a large group of units reporting positive revenue in the DG-1 survey while displaying missing or zero values in the tax register (represented by dots lying on the X-axis). This phenomenon can partly be accounted for by the terminological discrepancy between the definition of revenue in the DG-1 survey and in the PIT/CIT tax register. Another, equally large group is made up of units whose revenue reported in tax return forms considerably exceeded the values reported in the DG-1 survey (represented by dots lying above the identity line y1 = y2). It is worth noting that there were virtually no cases of units reporting lower revenue in tax return forms than in the DG-1 survey. In order to estimate selected variables of economic entities, their specific characteristics should be taken into account. One of the major challenges is the non-homogeneous distributions. This refers both to variables estimated on the basis of sample surveys and to those coming from administrative databases, which are used as auxiliary variables in the estimation process (see fig. 3.A and 3.B). The effect of outliers on estimation can be significant, since in such situations estimators do not retain properties such as resistance to bias or efficiency. Outliers, non-typical data or null values are, however, an integral part of each population and cannot be dismissed in the analysis. For this reason, in addition to using the classic approach, work is being done to develop more robust methods8. Among such methods one can mention robust GREG estimation, the model of Chambers, or Winsorised estimation (see R. Chambers, 1996; R. Chambers, H. Falvey, D. Hedlin, P. Kokic, 2001; and Dehnel, 2010).
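As a simple illustration of the robust treatment of outliers mentioned above, revenue could be Winsorised at a chosen cut-off; the sketch below (base R, hypothetical values and cut-off) is only an illustration and not the procedure actually used in the MEETS study:

```r
# One-sided winsorisation of revenue at a hypothetical cut-off:
# values above the cut-off are pulled back to it, limiting the influence
# of extreme units on subsequent estimation
winsorise <- function(y, cutoff) pmin(y, cutoff)

revenue   <- c(120, 450, 980, 15000, 75)                     # hypothetical revenue values
revenue_w <- winsorise(revenue, cutoff = quantile(revenue, 0.95))
```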

A. DG-1 data; B. PIT or CIT register data
Figure 3: Distribution of enterprises by annual revenue, 2008
Source: Use of Administrative Data for Business Statistics, GUS, US Poznan 2011

All variables from the DG-1 survey and the administrative databases were taken into account in the modelling and correlation analysis. Despite certain discrepancies between variable values in the two sources, the correlation was regarded as strong. A simulation study was conducted on 1000 samples drawn from the MEETS real data set according to the

8 Robust estimation methodology, being more complicated and challenging to use, will be dealt with in more detail in further studies.


same sampling design as that used by GUS. For each sample the 'standard'9 SDE estimators GREG, SYNTHETIC and EBLUP were applied to estimate revenue and other economic indicators in the breakdown of PKD sections at the country and at the regional level10.

Estimation of revenue by PKD section

The results of estimating revenue at the level of selected PKD sections are presented in Tables 2–4. Table 2 contains the expected values obtained in the simulation study after 1000 replications. The last column contains the mean revenue within each section in the MEETS real data set; it is used as the benchmark to assess the convergence of the estimates. The actual assessment of estimation precision and bias is possible using the information presented in Tables 3 and 4.

Table 2: The expected value of estimators for revenue, 2008

PKD Section | DIRECT | GREG | SYNTHETIC | EBLUP | Population MEAN
Manufacturing | 54585.85 | 54625.55 | 54768.17 | 54661.80 | 54576.28
Construction | 34855.68 | 34836.24 | 34559.73 | 34703.67 | 34898.88
Trade | 80320.49 | 80244.88 | 79884.69 | 80201.53 | 80280.19
Transport | 63016.47 | 63255.07 | 63625.85 | 63386.54 | 63028.05
Source: Golata (2011)

Table 3: REE of estimators for revenue, 2008 (REE in %)

PKD Section | DIRECT | GREG | SYNTHETIC | EBLUP
Manufacturing | 0.55 | 0.37 | 0.49 | 0.31
Construction | 2.47 | 0.78 | 1.14 | 0.84
Trade | 2.17 | 0.60 | 1.50 | 0.66
Transport | 1.28 | 1.73 | 1.02 | 1.43
Source: Golata (2011)

Table 4: Absolute bias of estimators for revenue, 2008

PKD Section | DIRECT | GREG | SYNTHETIC | EBLUP
Manufacturing | 9.57 | 49.26 | 191.88 | 85.52
Construction | 43.20 | 62.65 | 339.15 | 195.21
Trade | 40.30 | 35.30 | 395.50 | 78.65
Transport | 11.58 | 227.02 | 597.80 | 358.49
Source: Golata (2011)

To assess the composite estimation one can use REE. This measure is based on estimates of MSE, which can be compared with its ‘real’ value, thus accounting for estimation precision and bias. The GREG and EBLUP estimators yielded similar estimates for each of the PKD sections. A significant improvement in estimation

9 The estimators referred to as ‘standard’ in terms of EURAREA project are: direct (Horvitz-Thompson), GREG (Generalised REGression), regression synthetic and EBLUP (Empirical Best Linear Unbiased Predictor) estimators. 10 All programming and estimation work was carried out in the Centre for Small Area Estimation at the Statistical Office in Poznan.


precision was observed. For manufacturing, where the best results were obtained, the REE is at 0.3% of the 'real' value. The bias of the GREG estimator is considerably lower than that of the EBLUP estimator, which often yields better general results owing to its lower variance. In the case of the transport section, however, none of the estimators used produced better results than those obtained by means of direct estimation.
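The precision measures in Tables 3 and 4 can in principle be reproduced from the replicated estimates. Below is a minimal sketch in base R, assuming a matrix `est` of simulated estimates (replications in rows, sections or domains in columns), a vector `true` of 'real' values from the MEETS data set, and one common convention for the REE (root-MSE relative to the 'real' value); the object names and the exact REE convention of the study are assumptions:

```r
# est:  1000 x D matrix of estimates (one row per replication, one column per domain)
# true: length-D vector of 'real' values from the MEETS real data set
emp_bias <- abs(colMeans(est) - true)          # empirical absolute bias
emp_mse  <- colMeans(sweep(est, 2, true)^2)    # empirical MSE against the 'real' value
ree      <- 100 * sqrt(emp_mse) / true         # relative estimation error in %
```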

Estimation of revenue by PKD section and region (64 domains in all)

Owing to limited space, the results are confined to the expected value of revenue for two PKD sections. Additionally, Figures 4 (manufacturing) and 5 (construction) depict differences between the expected values of the estimators and the 'real' values. The resulting discrepancies are obvious, given the nature of the available data and the methods used, but the estimates are largely compatible with the 'real' values.

Figure 4: Expected value of estimators for revenue, manufacturing by voivodship, 2008 (radar chart comparing ŚREDNIA W POPULACJI [population mean], EST. DIRECT, EST. GREG, EST. SYNTHETIC and EST. EBLUP)
Source: Golata (2011)

Figure 5: Expected value of estimators for revenue, construction by voivodship, 2008 (radar chart comparing ŚREDNIA W POPULACJI [population mean], EST. DIRECT, EST. GREG, EST. SYNTHETIC and EST. EBLUP)
Source: Golata (2011)


Table 5: REE of estimators for revenue in the construction section by voivodship, 2008 (REE in %)

Voivodship | DIRECT | GREG | SYNTHETIC | EBLUP
Dolnośląskie | 32,09 | 19,79 | 17,02 | 9,25
Kujawsko-pomorskie | 40,01 | 15,49 | 23,71 | 14,08
Lubelskie | 42,32 | 18,34 | 20,47 | 13,85
Lubuskie | 70,40 | 21,34 | 21,93 | 11,31
Łódzkie | 42,68 | 18,56 | 28,84 | 14,56
Małopolskie | 53,21 | 14,27 | 22,15 | 12,68
Mazowieckie | 54,81 | 20,02 | 13,77 | 9,01
Opolskie | 56,66 | 22,50 | 30,17 | 17,60
Podkarpackie | 39,10 | 18,79 | 39,15 | 23,01
Podlaskie | 58,30 | 73,16 | 22,77 | 19,41
Pomorskie | 91,56 | 19,28 | 24,54 | 18,47
Śląskie | 29,52 | 17,92 | 24,65 | 11,71
Świętokrzyskie | 136,00 | 34,22 | 29,27 | 25,34
Warmińsko-mazurskie | 43,70 | 12,70 | 25,19 | 14,78
Wielkopolskie | 106,50 | 27,77 | 24,94 | 24,76
Zachodniopomorskie | 54,24 | 19,28 | 21,37 | 13,22
Source: Golata (2011)

Measures of precision in tab. 5 show an evident improvement in efficiency due to the use of indirect estimation and auxiliary data from administrative databases.

Synthetic assessment of estimates for all domains by section

When the Relative Estimation Error (REE, see Table 6) is chosen as a measure of precision, accounting for both precision and bias with respect to the 'real' values in the MEETS real dataset, one can observe an interesting tendency. The use of indirect estimation based on auxiliary information from administrative databases contributes significantly to the improvement in estimation precision for such variables as revenue, number of employees and wages. This improvement can be as much as 50% of the REE obtained by applying direct estimation.

Table 6: Mean REE for all domains by section, 2008

VARIABLE | DIRECT | GREG | SYNTHETIC | EBLUP
Mean REE for all domains (%)
Revenue | 1.62 | 0.87 | 1.04 | 0.81
Number of employees | 0.73 | 0.23 | 0.34 | 0.23
Wages | 0.70 | 0.43 | 0.49 | 0.39
Weighted mean REE for all domains (%)
Revenue | 1.30 | 0.57 | 0.90 | 0.55
Number of employees | 0.51 | 0.18 | 0.30 | 0.18
Wages | 0.55 | 0.37 | 0.50 | 0.37
Source: Golata (2011)

Synthetic assessment of estimates for all domains by section and voivodship

When estimation is conducted at a lower level of aggregation, one can generally expect a decrease in estimation precision. That was also the case this time. Values of REE, used as a measure of precision with respect to such variables as revenue, number of


employees and wages, indicate a significant improvement in comparison with direct estimation. The decrease in REE (from 35.5% to 13.6% for wages, or from 24.7% to 6.6% for the number of employees) obtained as a result of using administrative register data is promising.

Table 7: Mean REE for all domains by section and voivodship, 2008

VARIABLE | DIRECT | GREG | SYNTHETIC | EBLUP
Mean REE for all domains (%)
Revenue | 64.25 | 54.63 | 37.14 | 41.87
Number of employees | 24.66 | 12.14 | 6.27 | 6.59
Wages | 35.54 | 25.73 | 14.38 | 13.60
Weighted mean REE for all domains (%)
Revenue | 53.66 | 26.26 | 25.73 | 19.30
Number of employees | 15.64 | 7.50 | 4.37 | 4.50
Wages | 24.89 | 17.50 | 13.00 | 11.35
Source: Golata (2011)

Finally, the use of weights accounting for the significance of large and medium enterprises has an evident effect on the combined assessment of estimation precision.

4. Empirical evaluation of SDE for linked data - integrating two samples - simulation study

The second simulation study referred to the situation when data from two samples were integrated. It was based on a realistic population: a pseudo-population constructed from real data from the Polish micro-census 1995. The pseudo-population, called POLDATA, consists of 2 000 000 individuals aged 15 years or older grouped into 16 strata11. For the purpose of this study, the pseudo-population was restricted to three strata, corresponding to the following three voivodships: dolnoslaskie, kujawsko-pomorskie and wielkopolskie, and thus finally consisted of 374 374 individuals. This pseudo-population was the basis on which the sampling procedure was applied. The study was aimed at the estimation of labour market status for NTS3 units treated as domains. Precisely, the characteristic to be estimated was the employment rate, defined as the percentage of employed in the population aged 15 years and older. Dataset A can therefore be compared to the Labour Force Survey (LFS), which due to its small sample size does not yield estimates for local labour markets (NTS3). Dataset B is much larger in terms of the number of records, but unfortunately does not include all variables important in labour market analysis. The lack of these variables prevents construction of the model which, according to previous experience, could be used to estimate the necessary characteristics. This scarcity can be removed by adding variables observed in dataset A (LFS) to dataset B. The decision as to which file should be the donor or the recipient depends on the character of the study. In one approach, the file with more records is treated as the recipient, to prevent a loss of information (see Raessler, 2002). Other authors have pointed out that duplicating information from a smaller set to a larger one raises the risk of duplication and thus distorts the distribution (see Scanu, 2010). Both situations could be considered. The smaller dataset being the recipient file and the larger the donor,

11 The number of voivodships in Poland.


seems even more realistic in SDE, especially when making use of administrative records. A sample of type A, though small, contains data for many variables and represents relatively comprehensive characteristics of the population of interest. It can be compared with the Labour Force Survey (LFS). Samples in the Polish LFS cover about 0.05% of the population aged 15 years and more. The lowest level of administrative division at which the LFS estimates are available is the voivodship, owing to the representative character of the survey and the sample size. Estimates at lower levels of territorial division, similarly to additional breakdowns at the voivodship level, are affected by too high a random error.

The study was conducted according to the following schema (a minimal sketch of the sampling step is given after the schema):
1. Two types of random samples were drawn from POLDATA in 100 replicates:
   a. Samples of type A were drawn using a two-stage stratified sampling design with proportional allocation12. The strata were defined as voivodships (NTS2), according to the territorial division of the country. The primary stage units were communes (gminas, NTS5) and at the second stage individuals were chosen by simple random sampling without replacement (SRS). The overall sample size equalled about 1%.
   b. Samples of type B were drawn with stratified proportional sampling. As for samples of type A, voivodships were defined as strata and then 5% SRS was implemented.
2. The following variables were considered:
   AREA VARIABLES: (i) NUTS 2 – voivodship – 3 categories; (ii) NUTS 3 – 11 units
   AGE – 3 categories: 0 = less than 30, 1 = 30–44, 2 = 45 and over
   GENDER – 2 categories: 0 = male, 1 = female
   CIVIL STATUS – 3 categories: 0 = divorced or widowed, 1 = married, 2 = single
   PLACE OF RESIDENCE – 3 categories: 0 = rural areas and towns of less than 2 thousand, 1 = town of 2–50 thousand, 2 = town of 50 thousand and over
   EDUCATION LEVEL – 4 categories: 0 = university, 1 = elementary, 2 = vocational, 3 = secondary
   LABOUR MARKET STATUS – 3 categories: 0 = unemployed, 1 = employed, 2 = economically inactive
   a. Samples of type A contained all the variables listed above
   b. Samples of type B missed information about education level
3. Beginning with this step, the following estimation procedures were conducted:
   a. The two random samples A and B were matched. One of the simplest but also most frequently used nonparametric procedures for statistical matching, based on k nearest neighbours13 (kNN), was applied, and the estimation procedure used weights according to Rubin (1986)

12 The sampling procedure was not exactly the same as in the case of the LFS, but it also follows a two-stage household sampling. The sampling scheme of the LFS defines census units, called census clusters in towns and enumeration districts in rural areas, as the primary sampling units subject to first-stage selection. Second-stage sampling units are dwellings. 13 As k = 1, the imputation method reduced to a distance hot deck.


   b. The two random samples A and B were matched using kNN, and the estimation procedure applied special weights calibrated according to the domains defined for estimation
4. To the linked data the EBLUPGREG program was applied, and in each run the following estimates of economic activity for local labour markets (domains defined as NTS3) were obtained:
   a. DIRECT
   b. GREG: (i) upon Sample B with no education; (ii) upon Sample B with education matched and the Rubin weights approach; (iii) upon Sample B with education matched and the calibration weights approach
   c. SYNTHETIC: (i) upon Sample B with no education; (ii) upon Sample B with education matched and the Rubin weights approach; (iii) upon Sample B with education matched and the calibration weights approach
   d. EBLUP: (i) upon Sample B with no education; (ii) upon Sample B with education matched and the Rubin weights approach; (iii) upon Sample B with education matched and the calibration weights approach
5. The estimates obtained in each run were used to provide an empirical evaluation of the estimation precision:
   a. Empirical variance
   b. Empirical bias
   c. Empirical REE
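A minimal sketch of the sampling step of the schema in base R; `poldata` stands for the POLDATA pseudo-population with a `voivodship` column (both names are hypothetical), and for brevity sample A is drawn here as single-stage stratified SRS, whereas the study used a two-stage design:

```r
# Stratified SRS within voivodships; fractions follow the schema
# (about 1% for sample A, 5% for sample B)
draw_stratified <- function(pop, strata, frac) {
  idx <- unlist(lapply(split(seq_len(nrow(pop)), strata), function(i) {
    i[sample.int(length(i), size = max(1, round(frac * length(i))))]
  }))
  pop[idx, ]
}

set.seed(1)
sample_A <- draw_stratified(poldata, poldata$voivodship, 0.01)
sample_B <- draw_stratified(poldata, poldata$voivodship, 0.05)
sample_B$education <- NULL   # sample B lacks the education variable
```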

The integration algorithm

Since both databases were samples, they most probably did not contain data about the same persons, nor did they have a unique linkage key. Consequently, such data sources could not be integrated using the deterministic approach. In order to achieve the desired objective, statistical matching was implemented. The integration algorithm can usually be broken down into 6 basic steps (D'Orazio, Di Zio, Scanu, 2006):
1. Variable harmonisation
2. Selection of matching variables and their standardization or dichotomization
3. Stratification
4. Calculation of distance
5. Selection of the records in the recipient and donor datasets with the least distance
6. Calculation of the estimated value of variables
The harmonization of variables involves adjusting the definitions and classifications used in both 'surveys': dataset A and dataset B. The fact that in the simulation both samples were drawn from the same pseudo-population allowed us to skip the harmonization step, but the importance of these procedures should be stressed. The second stage was selecting the matching variables used to estimate the measure of similarity between records. In our case the following variables were selected: gender, age, marital status and place of residence. As this set of variables includes categorical as well as quantitative variables, their standardization and dichotomization was necessary. The qualitative variables were transformed into binary ones, and the quantitative variable age was categorized and dichotomized as well. The third step was stratification. The strata were created on the basis of two variables: NUTS3 and labour market status. There were eleven NUTS3 subregions in the


population, but due to the small number of units two pairs of them were merged. Altogether 27 strata were created: 9 subregions (NUTS3 regions 3 and 4, and also 41 and 42, were merged) × 3 attributes of the employment status (employed, unemployed, economically inactive). An important reason for stratifying the dataset was to optimize the computing time14. The measure of record similarity used in the integration was the squared Euclidean distance given by the formula:

$$d_{A,B} = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \left( a_{Aik} - a_{Bik} \right)^2 \qquad (2)$$

where:

$a_{Aik}$, $a_{Bik}$ – binary variables created in the process of dichotomization of the qualitative variables (the $i$-th category of the $k$-th variable) for the records compared in files A and B. For a given record in the recipient file, the algorithm searches for the record in the donor file for which the distance measure is the smallest. The choice of the squared Euclidean distance was motivated by the use of the integration algorithm developed by Bacher (2002). The algorithm was modified and adjusted for the purposes of the simulation. The study was performed under the conditional independence assumption (CIA). The integration algorithm yielded a dataset containing 18 715 records (the number of records in Sample B, the larger one) and 7 variables describing the demographic and economic characteristics of the Polish population, as listed above15.
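A minimal sketch of this distance hot deck in base R, under the assumption that the donor file `A` (with education) and the recipient file `B` share the matching variables as factors with identical levels and a `stratum` column encoding the NUTS3 × labour market status strata; the object and variable names are hypothetical, and the sketch ignores the weighting discussed below:

```r
# Distance hot deck (k = 1 nearest neighbour) within strata, following formula (2)
match_vars <- c("gender", "age_cat", "civil_status", "residence")

impute_education <- function(B, A, match_vars) {
  # dichotomize the categorical matching variables into binary indicators
  f  <- as.formula(paste("~", paste(match_vars, collapse = "+")))
  xA <- model.matrix(f, A)
  xB <- model.matrix(f, B)
  B$education <- NA_character_
  for (s in unique(B$stratum)) {
    a <- which(A$stratum == s); b <- which(B$stratum == s)
    if (length(a) == 0) next
    for (j in b) {
      d2 <- colSums((t(xA[a, , drop = FALSE]) - xB[j, ])^2)        # squared Euclidean distance
      B$education[j] <- as.character(A$education[a[which.min(d2)]]) # closest donor record
    }
  }
  B
}
```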

Rubin approach

Survey data used for estimation or in an integration process are generally drawn from the population according to a complex sampling scheme. When this is the case, it is necessary to adjust the sampling weights in the estimation process. There are three different approaches: file concatenation proposed by Rubin (1986), case weight calibration (Renssen, 1998) and empirical likelihood according to Wu (2004). Rubin (1986) suggested combining the two files A and B into a file AB and calculating a new weight wAB for each i-th unit in the new file (with some corrections). If the i-th unit in sample A is not represented in sample B, then its inclusion probability under sampling scheme B equals zero, and in such a case the weight of this unit in the concatenated file AB is simply its weight from sample A, wAi. This means not only that the population of interest is the union A ∪ B, but also that the estimated distributions are those of Y conditional on (wAB; Z) and of Z conditional on (wAB; Y). In our study file A was not concatenated to file B. The integration process joining A and B imputed in B the originally unobserved variables Z, characterizing the level of education, by using the values of X observed in both files. Thus, as suggested by Rubin, the weight of each observation in set B remained unchanged.
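Although file concatenation was not used in this study, the weighting rule described above can be illustrated with a minimal sketch (base R, hypothetical inclusion probabilities; the 'corrections' mentioned by Rubin are omitted):

```r
# File concatenation weights in the spirit of Rubin (1986):
# w_AB = 1 / (pi_A + pi_B), with an inclusion probability of zero for units
# that cannot be selected under the other design
pi_A <- c(0.010, 0.012, 0.000)   # inclusion probabilities under design A (0 = not selectable)
pi_B <- c(0.050, 0.000, 0.048)   # inclusion probabilities under design B

w_AB <- 1 / (pi_A + pi_B)
# For a unit with pi_B = 0 the concatenated weight reduces to 1 / pi_A,
# i.e. its original design weight in sample A, as stated in the text.
```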

Calibration approach

When samples are drawn according to different complex survey designs, it is important to consider the weights in order to preserve the distribution of the variable of interest, especially when the survey is originally planned for the whole population and the estimation is finally conducted for unplanned domains.

14 In spite of dividing the data set into strata, the duration of the integration process amounted to about 6 hours (Intel Core i5 processor, 4 GB RAM). 15 All programming and calculations were made by W. Roszka in the Department of Statistics at the Poznan University of Economics.


The impact of the sampling design on the efficiency of small area estimation is a question difficult to answer due to many optimisation problems. According to Rao (2003), the most important design issues for small domain estimation are: the number of strata, the construction of strata, optimal allocation of the sample, and selection probabilities. This list can be enlarged by the definition of optimisation criteria, the availability of strongly correlated auxiliary information, the choice of estimators, and so on. In practice it is not possible to anticipate and plan for all small areas. As a result, indirect estimators will always be needed, given the growing demand for reliable small area statistics. However, it is important to consider design issues that have an impact on small area estimation, particularly in the context of planning and designing large-scale surveys (Sarndal et al. 1992).

According to Särndal (2007), calibration is a method of estimating parameters for a finite population which applies new 'calibration' weights. The calibration weights need to be close to the original ones and satisfy the so-called calibration equation. Applying calibration weights to estimate parameters of the target variable is especially needed in the case of non-coverage, non-response or other non-sampling errors, in order to provide unbiased estimates16. These weights may also take into account the relation between the target variable and an additional one, to adjust the estimates to the relation observed at the global level. For this reason the GREG estimator is widely used in SDE. Additionally, we proposed to verify the impact of calibration weights taking into account all the matching variables to adjust the estimates for domains.

Suppose that the objective of the study is to estimate the total value of a variable, defined by the formula (see Szymkowiak 2011):

$$Y = \sum_{i=1}^{N} y_i, \qquad (3)$$

where $y_i$ denotes the value of the variable $y$ for the $i$-th unit, $i = 1, \ldots, N$. Let us assume that the whole population $U = \{1, \ldots, N\}$ consists of $N$ elements. From this population we draw, according to a certain sampling scheme, a sample $s \subseteq U$ which consists of $n$ elements. Let $\pi_i = P(i \in s)$ denote the first-order inclusion probability and $d_i = \frac{1}{\pi_i}$ the design weight. The Horvitz-Thompson estimator of the total is given by:

$$\hat{Y}_{HT} = \sum_{i \in s} d_i y_i = \sum_{i=1}^{n} d_i y_i. \qquad (4)$$

A small sample size might cause insufficient representation17 of particular domains in the sample and therefore prevent reliable direct estimates. If the variable $y$ is poorly represented in some domains, the Horvitz-Thompson estimator is characterised by high variance. The proper choice of the distance function is essential for constructing the calibration weights and for the results obtained. In our study the distance function was expressed by a formula which allows finding the calibration weights in an explicit form:

16 The calibration approach as a method of nonresponse treatment is described in detail in Särndal C.-E., Lundström S. (2005) Estimation in Surveys with Nonresponse, John Wiley & Sons, Ltd. 17 In practice it might occur that a domain is not represented in the sample at all. In our simulation study such a situation is not considered.


$$D(\mathbf{w}, \mathbf{d}) = \frac{1}{2} \sum_{i=1}^{m} \frac{(w_i - d_i)^2}{d_i}, \qquad (5)$$

Effective use of the calibration weights $w_i$ depends on the vector of auxiliary information. Let $x_1, \ldots, x_k$ denote the auxiliary variables which will be used in the process of finding the calibration weights. In our simulation study we used calibration weights obtained for each domain using additional information from the pseudo-population. As auxiliary data the following variables were used: gender, KLM, education, age, marital status and labour market status. Let

$$X_j = \sum_{i=1}^{N} x_{ij} \qquad (6)$$

denote the total value of the auxiliary variable $x_j$, $j = 1, \ldots, k$, where $x_{ij}$ is the value of the $j$-th auxiliary variable for the $i$-th unit, and let

$$\mathbf{X} = \left( \sum_{i=1}^{N} x_{i1}, \sum_{i=1}^{N} x_{i2}, \ldots, \sum_{i=1}^{N} x_{ik} \right)^{T} \qquad (7)$$

be the known vector of population totals of the auxiliary variables. The vector of calibration weights $\mathbf{w} = (w_1, \ldots, w_m)^T$ is obtained as the solution of the following minimization problem:

$$\mathbf{w} = \operatorname*{arg\,min}_{\mathbf{v}} D(\mathbf{v}, \mathbf{d}), \qquad (8)$$

subject to the calibration constraints

$$\tilde{\mathbf{X}} = \mathbf{X}, \qquad (9)$$

where

$$\tilde{\mathbf{X}} = \left( \sum_{i=1}^{m} w_i x_{i1}, \sum_{i=1}^{m} w_i x_{i2}, \ldots, \sum_{i=1}^{m} w_i x_{ik} \right)^{T}. \qquad (10)$$

If the matrix $\sum_{i=1}^{m} d_i \mathbf{x}_i \mathbf{x}_i^T$ is nonsingular, then the solution of the minimization problem (8) subject to the calibration constraint (9) is a vector of calibration weights $\mathbf{w} = (w_1, \ldots, w_m)^T$ whose elements are given by the formula:

$$w_i = d_i + d_i \mathbf{x}_i^T \left( \sum_{i=1}^{m} d_i \mathbf{x}_i \mathbf{x}_i^T \right)^{-1} \left( \mathbf{X} - \hat{\mathbf{X}} \right), \qquad (11)$$

where

$$\hat{\mathbf{X}} = \left( \sum_{i=1}^{m} d_i x_{i1}, \sum_{i=1}^{m} d_i x_{i2}, \ldots, \sum_{i=1}^{m} d_i x_{ik} \right)^{T} \qquad (12)$$

and

$$\mathbf{x}_i = (x_{i1}, \ldots, x_{ik})^T \qquad (13)$$

is the vector consisting of the values of all auxiliary variables for the $i$-th respondent, $i = 1, \ldots, m$.
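Formula (11) gives the calibration weights in closed form, so they can be computed directly with a few lines of base R; `d`, `x` and `X_tot` below are assumed objects (the design weights, the m × k matrix of auxiliary variables for the sampled units, and the vector of known population totals):

```r
calibration_weights <- function(d, x, X_tot) {
  X_hat <- as.vector(t(x) %*% d)      # Horvitz-Thompson estimates of the totals, formula (12)
  M     <- t(x * d) %*% x             # sum over i of d_i x_i x_i^T
  d + d * as.vector(x %*% solve(M, X_tot - X_hat))   # formula (11)
}
# Sanity check: the calibrated weights reproduce the known totals, i.e.
# t(x) %*% calibration_weights(d, x, X_tot) equals X_tot (up to rounding).
```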


Assessment of data integration

In the literature there are different approaches to assessing matching quality. Raessler (2002) proposed to assess the two files as well matched if they meet criteria concerning the compliance of distributions and the preservation of relations between variables in the initial and matched files18. In practice it might be difficult, or sometimes even impossible, to verify all those criteria (D'Orazio, 2010). Moreover, statistical inference methods are not always suitable, especially in the case of administrative data.

Table 8: Characteristics of the number of matches

Characteristics of the number of matches (together with non-matched records)
Over all samples | Mean | Std | Median | Mode | Min | Max
MIN | 3,80 | 5,12 | 2 | 0 | 0 | 49
Q1 | 4,48 | 6,12 | 2 | 0 | 0 | 78
Q2 | 4,95 | 6,80 | 3 | 0 | 0 | 115
Q3 | 5,35 | 7,68 | 3 | 0 | 0 | 171
MAX | 6,39 | 9,48 | 4 | 0 | 0 | 288

Characteristics of the number of matches (non-matched records omitted)
Over all samples | Mean | Std | Median | Mode | Min | Max
MIN | 5,64 | 5,34 | 4 | 1 | 1 | 49
Q1 | 6,60 | 6,41 | 5 | 1 | 1 | 78
Q2 | 6,99 | 7,18 | 5 | 1 | 1 | 115
Q3 | 7,53 | 8,21 | 5 | 2 | 1 | 171
MAX | 8,54 | 10,63 | 6 | 4 | 1 | 288

Source: own study.

In the simulation process, the mean number of matches over all samples equalled 3.80 for all records and 5.64 if the non-matched records were omitted (see Table 8), while the highest mean number of matches amounted to 8.54 (non-matched records omitted). In the study the following quality assessment measures were used:

- the total variation distance (D'Orazio, Di Zio, Scanu, 2006):

$$d_{TV} = \frac{1}{2} \sum_{i} \left| p_{f,i} - p_{d,i} \right| \qquad (14)$$

- the Bhattacharyya coefficient (Bhattacharyya, 1943):

$$BC = \sum_{i} \sqrt{p_{f,i} \, p_{d,i}} \qquad (15)$$

where $p_{f,i}$ is the proportion of the $i$-th category of a variable in the fused file and $p_{d,i}$ is the proportion of the $i$-th category of the variable in the donor file. Both of these coefficients take values in the range $[0, 1]$. In the case of the total variation distance, the lower the coefficient, the greater the distribution compatibility achieved; values below a commonly assumed threshold indicate acceptable similarity of distributions. Conversely, the lower the value of the Bhattacharyya coefficient, the lower the

18 There are four criteria specified by Raessler: (i) the true, unknown distribution of the matched variables is reproduced in the newly created, synthetic file; (ii) the real, unknown cumulative distribution of the variables is maintained in the newly created, synthetic dataset; (iii) correlations and higher moments of the cumulative distribution and the marginal distributions of the variables are preserved; (iv) at least the marginal distributions of the variables in the fused file are preserved.


Page 107 of 199 compatibility of distributions achieved. As the coefficient proposed by Bhattacharyya generally takes high value, two other measures of structure similarity were applied: and , (16) where: the minimum proportion of i-th category in the fused and donor file, the maximum proportion of i-th category in the fused and donor file. These coefficients take values from the interval and is generally greater than . The greater the value of any of these coefficients, the greater the compatibility of the distributions. Values that indicate the acceptable similarity of distributions are usually assumed to be and (see Roszka 2011).

Table 9: Total variation distance as a matching quality measure

Matching variable | Place of residence | Gender | Marital Status | Source of maintenance
MIN | 0,0830 | 0,0000 | 0,0070 | 0,0040
Q1 | 0,1528 | 0,0030 | 0,0129 | 0,0150
Q2 | 0,1790 | 0,0050 | 0,0160 | 0,0198
Q3 | 0,2201 | 0,0100 | 0,0221 | 0,0245
MAX | 0,2920 | 0,0270 | 0,0405 | 0,0370
Source: own study.

Table 10: Bhattacharyya coefficient as a matching quality measure

Matching variable | Place of residence | Gender | Marital Status | Source of maintenance
MIN | 0,9355 | 0,9996 | 0,9976 | 0,9978
Q1 | 0,9607 | 0,9999 | 0,9988 | 0,9991
Q2 | 0,9691 | 1 | 0,9993 | 0,9995
Q3 | 0,9769 | 1 | 0,9996 | 1
MAX | 0,9916 | 1 | 1 | 1
Source: own study.

Very good matching quality coefficients were achieved for the variables "gender", "marital status" and "source of maintenance". Much worse quality measures were obtained for the variable "place of residence" (see Tables 9 and 10). This results from the fact that the "class of place of residence" variable was characterized by weaker compatibility already prior to integration. The similarity coefficients presented in Tables 9 and 10 characterise the matching quality in a synthetic way, that is, over all replications; additionally, they do not take into account differences of distributions across domains. Compatibility of the distributions observed for the whole sample does not, of course, translate automatically to all domains for which estimation of economic activity was conducted in the next stage. The discrepancy in the compliance applies both to individual samples and to domains. Typically, in the conformity assessment the distribution of the matching variables is taken into account. In the case of a simulation study, there was also the possibility to evaluate the distribution of the matched variable.


Table 11: Education distribution by regions in the population and direct estimates upon an exemplary sample with the matched variable*

NTS3 | Exemplary sample: Elementary | Vocational | Secondary | University | Population: Elementary | Vocational | Secondary | University | BC(pf;pd) | Wp1 | Wp2
1 | 0,47 | 0,27 | 0,20 | 0,06 | 0,45 | 0,28 | 0,21 | 0,06 | 0,9997 | 0,976 | 0,954
2 | 0,55 | 0,16 | 0,24 | 0,05 | 0,43 | 0,29 | 0,22 | 0,06 | 0,9872 | 0,867 | 0,765
3 | 0,54 | 0,19 | 0,18 | 0,08 | 0,47 | 0,30 | 0,18 | 0,04 | 0,9900 | 0,894 | 0,808
4 | 0,25 | 0,16 | 0,41 | 0,18 | 0,29 | 0,19 | 0,34 | 0,19 | 0,9967 | 0,923 | 0,857
5 | 0,49 | 0,29 | 0,16 | 0,05 | 0,42 | 0,31 | 0,20 | 0,06 | 0,9970 | 0,929 | 0,867
6 | 0,50 | 0,28 | 0,16 | 0,06 | 0,49 | 0,26 | 0,19 | 0,06 | 0,9994 | 0,971 | 0,944
38 | 0,51 | 0,26 | 0,21 | 0,03 | 0,48 | 0,29 | 0,19 | 0,05 | 0,9980 | 0,952 | 0,908
39 | 0,46 | 0,33 | 0,16 | 0,06 | 0,42 | 0,33 | 0,19 | 0,06 | 0,9988 | 0,961 | 0,925
40 | 0,46 | 0,34 | 0,13 | 0,07 | 0,43 | 0,30 | 0,20 | 0,06 | 0,9944 | 0,924 | 0,858
41 | 0,52 | 0,25 | 0,17 | 0,05 | 0,51 | 0,25 | 0,19 | 0,05 | 0,9998 | 0,984 | 0,969
42 | 0,54 | 0,20 | 0,20 | 0,07 | 0,24 | 0,24 | 0,34 | 0,18 | 0,9467 | 0,705 | 0,545
All domains | 0,49 | 0,27 | 0,18 | 0,06 | 0,44 | 0,28 | 0,21 | 0,07 | 0,9990 | 0,956 | 0,916

* Proportions of the population with the given education level; the first sample was compared.
Source: Own calculations

Comparability of the distributions of the variable of interest, education, showed that the distributions were preserved. Table 11 provides the comparison of the education distribution by domains in the population with direct estimates upon one exemplary sample after matching the variable education. The Bhattacharyya coefficient is generally close to one, on average greater than 0.99. Only for domain 42 does it take a value lower than 0.95. For this specific domain the other two similarity coefficients also take exceptionally low values. A more detailed analysis of these coefficients indicates that the education distribution is well maintained only for three domains (numbers 1, 6 and 41). The results presented refer to the situation when the original sampling weights were applied; in the case of weights calibrated for domains, the distributions were identical.

Domain Specific Evaluation of Estimation Precision

Assessing the quality of the estimates from a domain-specific perspective, one can take into account both single samples and average values for each domain over 100 replications. The results obtained for the estimators used in the study and the different research approaches (with imputed education and with calibrated weights) are presented in graphical and tabular form in Annex 2 (tab. A1). The exemplary estimates obtained for domain 1 in each of the 100 replicates are shown in Figure 6, and Figure 7 presents the expected values of one selected estimator (EBLUP) for different approaches by domain. First, it can be noticed that calibrated weights applied to the direct estimator gave the 'true' value in each replicate. As concerns the GREG estimator, the variant with imputed education and calibrated weights resulted in estimates close to the 'true' value in all replicates, and the variation of the estimates was also small. Combining GREG with the synthetic estimator resulted in a considerable increase in the variation of the EBLUP estimates, even in comparison with the direct estimator.



Figure 6: Estimates of the percentage of economically active, different estimators and research approaches, Domain 1 Source: Own calculations



Figure 7: Expected value of the EBLUP estimator for different approaches by domains. Source: Own calculations

It is worth noticing that, thanks to the simulation approach, the results discussed can be analysed with reference to the 'true' value, which usually is unknown. Other reference values might be the estimates obtained from the model including education or not (fig. 7). No matter which reference value is chosen, the estimates taking into account the imputed education are on average clearly overestimated in two domains (4 and 42). These results confirm the need for careful evaluation of the integration process and of the convergence of the distributions of all variables, especially those exploited as auxiliary ones.

Synthetic Evaluation of estimation precision over all domains

Assessing the estimation precision over all domains, average values of the mean squared error and the relative estimation error (MSE and REE) obtained for the different research approaches were analysed.

Figure 8: REE(GREG) for different research approaches by domains Source: Own calculations



Figure 9: REE(SYNTH) for different research approaches by domains Source: Own calculations

Figure 10: REE(EBLUP) for different research approaches by domains Source: Own calculations

As follows from the presentation of the relative estimation error for the GREG and EBLUP estimators across all domains, estimates including imputed education improve the precision obtained (red and yellow bars in fig. 8 and 10). Of course, this statement should not be generalised, as in the case of the SYNTH estimator the presented results indicate just the opposite (for each domain, fig. 9).


As the main issue in the study was to evaluate the estimates for linked data, the results obtained for samples with real education were considered for reference purposes (see Tables 12 and 13). However, results obtained for samples with imputed education included in the model (with original or calibrated weights) might also be compared to the ones with no education, as this reflects a more realistic situation.

Table 12: MSE for different estimators and research approaches

Research approach | Average of MSE over all domains: DIR | GREG | SYNTH | EBLUP | Weighted average of MSE over all domains: DIR | GREG | SYNTH | EBLUP
Education | 0,0136 | 0,0115 | 0,0082 | 0,0108 | 0,0117 | 0,0099 | 0,0081 | 0,0094
No Education | 0,0136 | 0,0120 | 0,0094 | 0,0113 | 0,0117 | 0,0103 | 0,0093 | 0,0099
Imputed Education | 0,0136 | 0,0115 | 0,0117 | 0,0111 | 0,0117 | 0,0098 | 0,0116 | 0,0096
Imputed Education, Calibration Weights | 0,0154 | 0,0131 | 0,0117 | 0,0111 | 0,0125 | 0,0106 | 0,0116 | 0,0096
Source: Own calculations

Table 13: REE for different estimators and research approaches

Research approach | Average of REE over all domains: DIR | GREG | SYNTH | EBLUP | Weighted average of REE over all domains: DIR | GREG | SYNTH | EBLUP
Education | 0,0282 | 0,0239 | 0,0171 | 0,0223 | 0,0242 | 0,0205 | 0,0169 | 0,0196
No Education | 0,0282 | 0,0248 | 0,0195 | 0,0235 | 0,0242 | 0,0213 | 0,0191 | 0,0205
Imputed Education | 0,0282 | 0,0229 | 0,0234 | 0,0221 | 0,0242 | 0,0199 | 0,0232 | 0,0194
Imputed Education, Calibration Weights | 0,0318 | 0,0273 | 0,0234 | 0,0221 | 0,0259 | 0,0220 | 0,0232 | 0,0194
Source: Own calculations

Similarly as in the simulation study for business statistics, weighting the measures of estimation precision with domain size indicates, on average, a higher quality assessment. It can also be noticed that small domain estimators perform for linked data in the same typical way as for real data. The synthetic estimator (SYNTH) provides the most efficient estimates, but, as usual, they may often be biased. The precision depends on the relation between the matched variable and the estimated one. In the presented study, including imputed education in the model slightly improved the estimates of the percentage of the economically active population.

5. Conclusions

Data Integration is used to combine information from distinct sources of data which are jointly unobserved and which refer to the same target population. Fusing distinct data sources into one set enables joint observation of variables from both files. The integration process is based on finding similar records, and the similarity is calculated on the basis of variables common to both datasets. Similarities between the ideas underlying small domain estimation and data integration techniques can be specified as follows19:
1. Auxiliary information. Both techniques refer to external data sources:

19 This specification should, of course, not be considered full and final.


- SDE in order to obtain auxiliary variables that can help to improve estimation precision for domains;
- DI to provide more comprehensive data sets which allow reducing the respondent burden and the resulting bias.
Joint application of both methods might result in increasing both the estimation precision and the scope of information available, especially in the context of small domains. But estimates on linked data require good matching quality:
- a method for data integration;
- a direct measure of consistency of the distribution of the matched variable is needed;
- earlier constraints help to avoid improper values (micro integration processing);
- calibration might be considered as a method of adjusting the sample design to estimation for unplanned domains.
2. Correlation and regression. The two data sources are combined upon in-depth correlation analysis:
- in SDE through model-based estimation for domains;
- in DI this correlation is crucial in the matching process for (a) the common matching variables and (b) the 'imputed', jointly unobserved variable Z.
Taking the above into account, in both groups of methods variable harmonisation is important. This involves not only the definition of the variables, grouping and classification issues, but also the designation of statistical units and the resulting aggregation level of the analysis. Here appears the danger of the ecological fallacy: studying the relationship between variables that are specified for different territorial units, or at different levels of aggregation. The possibility of recognizing a variety of statistical units brings a methodological problem, namely how to estimate the relation for a number of levels simultaneously. In practice, estimates for small areas frequently use regression estimators, tacitly assuming that the true values of the parameters (β) in the regression equation at the level of individual units are the same as the parameters obtained from the mean values for the spatial units (see Heady and Hennel, 2002, p. 5). Application of mixed models might be considered as one of the solutions suggested to avoid the 'ecological effect'. It should be stressed that the success of any model-based method depends on the distributions of the estimated variables and covariates, on the correlation analysis (the choice of good predictors of the study variables) and on model diagnostics.
3. Sampling design. Often the two data sets are obtained from independent sample surveys with complex designs, and this raises a number of methodological problems:
- in SDE, in providing a sampling scheme that would be optimal for estimation for domains and in assessing the precision of the estimates. According to Rao (2003), the most important design issues for small domain estimation are the following: the number of strata, the construction of strata, optimal allocation of the sample, selection probabilities. This list can be enlarged by adding the problem of defining the optimisation criteria, the possibilities of obtaining strongly correlated auxiliary information, and the choice of estimators taking into account their efficiency under specific sampling designs;
- in DI, the sampling design cannot be ignored and the different weights assigned to each sample unit must be considered in order to preserve the population structure and variable distributions. In the literature, Rubin's file concatenation (1986) or Renssen's calibration (1998) is proposed; alternatively, Wu (2004) suggests the empirical likelihood method.


4. Stratification. In both methods stratification plays a significant role. In SDE, where data are drawn from the population with no regard to the domains for which estimation is finally conducted, post-stratification can be considered as a method of optimising the sampling scheme. In DI, introducing stratification optimises the integration process by reducing the computing time (a minimal sketch of stratified matching is given at the end of this section).
5. "Theory and practice". For both groups of methods it is often observed that the situations encountered in practice do not correspond to the theoretical solutions. On the basis of the study conducted, the following can be mentioned:
- high differentiation across domains in the correlation between variables estimated on the basis of DG-1 statistical reporting and auxiliary variables from administrative databases, including PIT and CIT;
- the non-homogeneous distributions of the estimated variables and covariates may imply the need for robust estimation (modified GREG, Winsorisation and local regression); this solution, however, involves highly complicated and time-consuming estimation techniques;
- administrative problems connected with access to auxiliary data, which limit their usefulness in short-term statistics.
6. Estimates on linked data.

According to Rao (2005), small area estimation is a striking example of the interplay between theory and practice. But he stresses that, despite significant achievements, many issues require further theoretical solutions as well as empirical verification. Among these issues Rao points primarily to: a) benchmarking model-based estimators so that they agree with reliable direct estimators at large area levels, b) developing and validating suitable linking models and addressing issues such as errors in variables, incorrect model specification and omitted variables, c) developing methods that satisfy multiple goals: good area-specific estimates, good rank properties and a good histogram for small areas. Similarly, data integration is becoming a major issue in most countries, with a view to using information available from different sources efficiently so as to produce statistics on a given subject while reducing costs and response burden and maintaining quality. However, the use of DI methods requires not only further theoretical solutions, but also much practical testing. Typically, DI methods seem understandable and easy to use, but in practice significant complications occur. The similarity of both groups of methods should thus also be understood as a set of common problems requiring further research and analysis that could enable their wider use in official statistics.
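As announced under point 4 above, stratification (blocking) is the usual way to keep the DI matching step computationally feasible. The following R sketch illustrates only the idea, with hypothetical data frames rec and don, a hypothetical stratum variable region and a single matching variable x: nearest-neighbour matching is carried out within strata only, so the number of pairwise comparisons drops from |rec|·|don| to the sum of the within-stratum products. The StatMatch package used elsewhere in this project offers the same device through donation classes; the sketch below is not the procedure applied in the study.

```r
# Minimal sketch: stratified nearest-neighbour matching (hypothetical data).
# 'rec' is the recipient file, 'don' the donor file; 'region' is the stratum
# (blocking) variable and 'x' the common matching variable.
set.seed(1)
rec <- data.frame(id = 1:1000, region = sample(1:16, 1000, TRUE), x = rnorm(1000))
don <- data.frame(id = 1:5000, region = sample(1:16, 5000, TRUE), x = rnorm(5000),
                  z = runif(5000))        # z is observed only in the donor file

match_stratum <- function(r, d) {
  # for each recipient record find the donor with the closest x *within* the stratum
  idx <- sapply(r$x, function(xi) which.min(abs(d$x - xi)))
  cbind(r, z = d$z[idx])
}

fused <- do.call(rbind, lapply(split(rec, rec$region), function(r) {
  match_stratum(r, don[don$region == r$region[1], ])
}))
head(fused)   # recipient records enriched with the donated variable z
```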

References

Bacher J. (2002) Statistisches Matching - Anwendungsmöglichkeiten, Verfahren und ihre praktische Umsetzung in SPSS, ZA-Informationen, 51. Jg.
Balin M., D'Orazio M., Di Zio M., Scanu M., Torelli N. (2009) Statistical Matching of Two Surveys with a Common Subset, Working Paper n. 124, Università degli Studi di Trieste, Dipartimento di Scienze Economiche e Statistiche.
Bracha (1994) Metodologiczne aspekty badania małych obszarów [Methodological Aspects of Small Area Studies], Studia i Materiały. Z Prac Zakładu Badań Statystyczno-Ekonomicznych, nr 43, GUS, Warszawa (in Polish).


Chambers R., Saei A. (2003) Linear Mixed Model with Spatial Correlated Area Effect in Small Area Estimation.
Chambers R., Saei A. (2004) Small Area Estimation Under Linear and Generalized Linear Mixed Models With Time and Area Effects, Southampton Statistical Sciences Research Institute.
Chambers R.L., Falvey H., Hedlin D., Kokic P. (2001) Does the Model Matter for GREG Estimation? A Business Survey Example, Journal of Official Statistics, Vol. 17, No. 4, 527-544.
Chambers R.L. (1996) Robust case-weighting for multipurpose establishment surveys, Journal of Official Statistics, Vol. 12, No. 1, 3-32.
Choudhry G.H., Rao J.N.K. (1993) Evaluation of Small Area Estimators. An Empirical Study, in: Small Area Statistics and Survey Designs, eds G. Kalton, J. Kordos, R. Platek, vol. I: Invited Papers, Central Statistical Office, Warsaw.
D'Orazio M., Di Zio M., Scanu M. (2006) Statistical Matching. Theory and Practice, John Wiley & Sons, Ltd.
Dehnel G. (2010) Rozwój mikroprzedsiębiorczości w Polsce w świetle estymacji dla małych domen [Development of micro-business in Poland in the light of estimation for small domains], Wydawnictwo Uniwersytetu Ekonomicznego w Poznaniu, Poznań.
Deville J.-C., Särndal C.-E. (1992) Calibration Estimators in Survey Sampling, Journal of the American Statistical Association, Vol. 87, 376-382.
Di Zio M. (2007) What is statistical matching, Course on Methods for Integration of Surveys and Administrative Data, Budapest, Hungary.
Eurarea Project Reference Volume, All Parts (2004) The EURAREA Consortium, http://www.ons.gov.uk/ons/guide-method/method-quality/general-methodology/spatial-analysis-and-modelling/eurarea/downloads/index.html.
Ghosh M., Rao J.N.K. (1994) Small Area Estimation: An Appraisal, Statistical Science, vol. 9, no. 1.
Gołata E. (2009) Opracowanie dla wybranych metod integracji danych reguł, procedur integracji danych z różnych źródeł [Development, for selected data integration methods, of rules and procedures for integrating data from various sources], GUS internal materials, Poznań, Poland.
Gołata E. (2011) A study into the use of methods developed by small area statistics, in: Use of Administrative Data for Business Statistics (2011), Final Report under grant agreement No. 30121.2009.004-2009.807, GUS, Warszawa.
Heady P., Hennel S. (2002) Small Area Estimation and the Ecological Effect – Modifying Standard Theory for Practical Situations, Office for National Statistics, London, IST 2000-26290 EURAREA, Enhancing Small Area Estimation Techniques to Meet European Needs.
Herzog T.N., Scheuren F.J., Winkler W.E. (2007) Data Quality and Record Linkage Techniques, Springer, New York.
Kadane J.B. (2001) Some Statistical Problems in Merging Data Files, Journal of Official Statistics, No. 17, 423-433.
Lehtonen R., Veijanen A. (1998) Logistic Generalized Regression Estimators, Survey Methodology, vol. 24.
Moriarity C., Scheuren F. (2001) Statistical Matching: A Paradigm for Assessing the Uncertainty in the Procedure, Journal of Official Statistics, No. 17, 407-422.
Paradysz J. (ed.) (2008) Multivariate analysis of systematic errors in the Census 2002, and statistical analysis of the variables of NC 2002 supporting the use of small area estimates, Report for the Central Statistical Office, November 2008, Centre for Regional Statistics, University of Economics in Poznań.
Pfeffermann D. (1999) Small Area Estimation – Big Developments, in: Small Area Estimation, International Association of Survey Statisticians Satellite Conference Proceedings, Riga 20-21 August 1999, Latvia.
Pietrzak-Rynarzewska B., Józefowski T. (2010) Assessment of the possibilities of using population register in the census, in: Measurement and Information in the Economy (in Polish: Pomiar i informacja w gospodarce), Poznań University of Economics.
Raessler S. (2002) Statistical Matching. A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches, Springer, New York, USA.


Rao J.N.K. (1999) Some Recent Advances in Model-Based Small Area Estimation, Survey Methodology, vol. 25, Statistics Canada.
Rao J.N.K. (2003) Small Area Estimation, Wiley-Interscience.
Rao J.N.K. (2005) Interplay Between Sample Survey Theory and Practice: An Appraisal, Survey Methodology, Vol. 31, No. 2, 117-138.
Renssen R.H. (1998) Use of Statistical Matching Techniques in Calibration Estimation, Survey Methodology, Vol. 24, No. 2, 171-183, Statistics Canada.
Roszka W. (2011) An attempt to apply statistical data integration using data from sample surveys, in: Economics, Management and Tourism, South-West University "Neofit Rilsky", Faculty of Economics and Tourism Department, Duni Royal Resort, Bulgaria.
Rubin D.B. (1986) Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations, Journal of Business and Economic Statistics, Vol. 4, No. 1, 87-94, stable URL: http://www.jstor.org/stable/1391390.
Särndal C.-E., Swensson B., Wretman J. (1992) Model Assisted Survey Sampling, Springer Verlag, New York.
Särndal C.-E. (2007) The Calibration Approach in Survey Theory and Practice, Survey Methodology, Vol. 33, No. 2, 99-119.
Särndal C.-E., Lundström S. (2005) Estimation in Surveys with Nonresponse, John Wiley & Sons, Ltd.
Scanu M. (2010) Introduction to statistical matching, in: ESSnet on Data Integration, Draft Report of WP1: State of the art on statistical methodologies for data integration, ESSnet.
Scheuren F. (1989) A Comment on "The Social Policy Simulation Database and Model: An Example of Survey and Administrative Data Integration", Survey of Current Business, 40-41.
Skinner C. (1991) The Use of Estimation Techniques to Produce Small Area Estimates, A report prepared for OPCS, University of Southampton.
Szymkowiak M. (2011) Assessing the feasibility of using information from administrative databases for calibration in short-term and annual business statistics, in: Use of Administrative Data for Business Statistics (2011), Final Report under grant agreement No. 30121.2009.004-2009.807, GUS, Warszawa.
Use of Administrative Data for Business Statistics (2011) Final Report under grant agreement No. 30121.2009.004-2009.807, GUS, Warszawa.
van der Putten P., Kok J.N., Gupta A. (2002) Data Fusion through Statistical Matching, Center for eBusiness, MIT, USA.
Veijanen A., Djerf K., Sõstra K., Lehtonen R., Nissinen K. (2004) EBLUPGREG.sas, program for small area estimation borrowing strength over time and space using a unit level model, Statistics Finland, University of Jyväskylä.
Wallgren A., Wallgren B. (2007) Register-based Statistics: Administrative Data for Statistical Purposes, John Wiley & Sons Ltd.
Winkler W.E. (1990) String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage, in: Section on Survey Research Methods, 354-359, American Statistical Association.
Winkler W.E. (1994) Advanced Methods for Record Linkage, Bureau of the Census, Washington DC 20233-9100.
Winkler W.E. (1995) Matching and Record Linkage, in: Business Survey Methods, B. Cox (ed.), 355-384, J. Wiley, New York.
Winkler W.E. (1999) The State of Record Linkage and Current Research Problems, RR99-04, U.S. Bureau of the Census, http://www.census.gov/srd/www/byyear.html.
Winkler W.E. (2001) Quality of Very Large Databases, RR2001/04, U.S. Bureau of the Census.
Wu C. (2005) Algorithms and R Codes for the Pseudo Empirical Likelihood Method in Survey Sampling, Survey Methodology, Vol. 31, No. 2, 239-243.


Annex 1. Small Domain Estimators and methods to evaluate estimation precision

A. Small Domain Estimators

Direct estimator

The direct estimator is commonly used in small area estimation studies as a benchmark for comparing estimator performance:

$$\hat{Y}_d^{DIRECT} = \frac{1}{\hat{N}_d}\sum_{i \in u_d} w_{id}\, y_{id} \qquad (A1)$$

where $\hat{N}_d = \sum_{i \in u_d} w_{id}$, $w_{id} = 1/\pi_{id}$, assuming that $\pi_{id,jd'} = 0$ for each $d \neq d'$ or $i \neq j$. The standard estimation error is calculated using the following formula:

$$\widehat{MSE}(\hat{Y}_d^{DIRECT}) = \left(\frac{1}{\hat{N}_d}\right)^{2}\sum_{i \in u_d} w_{id}\,(w_{id}-1)\,(y_{id} - \hat{Y}_d^{DIRECT})^{2} \qquad (A2)$$

It is characterised by high variability for most small areas; besides, its application does not guarantee estimates of the target variable for all domains – particularly when no units of a given domain are included in the sample. For this reason it is not very useful for small area estimation (see also Särndal et al., 1992, Ghosh and Rao, 1994, Rao, 1999, Lehtonen and Veijanen, 1998, Veijanen et al., 2004, Eurarea Documents: Standard Estimators, 2004).
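To make formulas (A1)–(A2) concrete, the following R sketch computes the direct domain estimate and its estimated error from unit-level sample data. The data frame smp and its columns (y, w for the design weight 1/π, and domain) are hypothetical names used only for illustration; this is not the code used in the study.

```r
# Direct (Hajek-type) domain estimator, formulas (A1)-(A2).
# 'smp' is a hypothetical sample data frame with columns: domain, y, w = 1/pi.
direct_estimate <- function(smp) {
  do.call(rbind, lapply(split(smp, smp$domain), function(d) {
    N_hat <- sum(d$w)                                    # estimated domain size
    y_hat <- sum(d$w * d$y) / N_hat                      # (A1)
    mse   <- (1 / N_hat)^2 *
             sum(d$w * (d$w - 1) * (d$y - y_hat)^2)      # (A2)
    data.frame(domain = d$domain[1], N_hat = N_hat,
               estimate = y_hat, SE = sqrt(mse))
  }))
}

# toy data: note the much larger SE in the smallest domain
smp <- data.frame(domain = rep(1:3, times = c(50, 20, 5)),
                  y = rnorm(75, mean = 10), w = runif(75, 20, 60))
direct_estimate(smp)
```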

Generalised REGression estimator – GREG

The GREG estimator is treated as a specific case of the direct estimator. The direct estimator for a given small area is adjusted for differences between the sample and population area means of the covariates. The auxiliary variables are transformed and adapted to the value of the target variable. For this purpose, various models are used which describe the relationship between the target variable Y and the auxiliary variable X. The standard approach is to use the ordinary regression model:

$$\hat{Y}_d^{GREG} = \frac{1}{\hat{N}_d}\sum_{i \in s_d}\frac{y_i}{\pi_i} + \left(\bar{X}_d - \frac{1}{\hat{N}_d}\sum_{i \in s_d}\frac{x_i}{\pi_i}\right)^{T}\hat{\beta} \qquad (A3)$$

where $\hat{N}_d = \sum_{i \in s_d} 1/\pi_i$ and $\hat{\beta}$ is estimated using the least squares method. When a domain contains no data, the GREG estimator reduces to a synthetic estimator, $\bar{X}_d^{T}\hat{\beta}$. The formula for the MSE estimator is:

$$\widehat{MSE}(\hat{Y}_d^{GREG}) = \sum_{i \in u_d}\sum_{j \in u_d}\frac{\pi_{ijd} - \pi_{id}\,\pi_{jd}}{\pi_{id}\,\pi_{jd}\,\pi_{ijd}}\, g_{id}\, r_{id}\, g_{jd}\, r_{jd} \qquad (A4)$$

The use of the auxiliary variable X can be justified by its strong correlation with the target variable Y. In this case, the variance of the GREG estimator is lower than the variance of the direct estimator. A small sample size in a domain is conducive to an increase in variance, but with increasing correlation between variables Y and X,


variance is considerably reduced. One advantage of the GREG estimator is its approximate lack of bias: assuming that multiple samples are drawn, the expected value of the GREG estimator for a domain is close to the real value of the variable for this domain in the population, where:

$\hat{Y}_d^{GREG}$ – estimate in domain d obtained by applying the GREG estimator,
d – domain,
$g_i$ – weight of the i-th individual observation, defined as:

$$g_i = 1 + \left(X_d - \hat{X}_{HT,d}\right)^{T}\left(\sum_{i \in s_d} w_i\, x_i x_i^{'}/c_i\right)^{-1} x_i/c_i \qquad (A5)$$

X – auxiliary variable,
$\hat{X}_{HT,d}$ – direct Horvitz-Thompson estimate of the total value of the auxiliary variable x in domain d,
$X_d$ – total value of the auxiliary variable x in domain d.
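A compact way to see how (A3) works in practice is to fit the assisting regression on the whole sample and then correct the direct domain estimate with the model. The hedged R sketch below uses hypothetical column names (y, x, w, domain) and an unweighted least-squares fit for simplicity; it illustrates the mechanics only and is not the exact estimator used in the study.

```r
# GREG-type domain estimates in the spirit of (A3) (illustrative only).
# 'smp' is a sample data frame (y, x, w = 1/pi, domain); 'Xbar_pop' holds the
# known population means of x per domain (columns: domain, xbar).
greg_estimate <- function(smp, Xbar_pop) {
  beta <- coef(lm(y ~ x, data = smp))                        # assisting model
  do.call(rbind, lapply(split(smp, smp$domain), function(d) {
    N_hat <- sum(d$w)
    y_dir <- sum(d$w * d$y) / N_hat                          # direct part
    x_dir <- sum(d$w * d$x) / N_hat
    Xbar  <- Xbar_pop$xbar[Xbar_pop$domain == d$domain[1]]
    data.frame(domain = d$domain[1],
               greg = y_dir + (Xbar - x_dir) * beta["x"])    # regression adjustment
  }))
}
```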

Synthetic estimator

In synthetic estimation for a population divided into homogeneous categories it is assumed that the means computed for units belonging to each category are identical. The estimate for a domain is the weighted mean of the estimated category means determined on the basis of the sampled units, with weights depending on the share of the small area within each category. The synthetic estimator is unbiased provided this assumption is met; in reality, however, this happens extremely rarely. The regression synthetic estimator is constructed on the basis of a two-level model for unit data of the variable Y, accounting for the correlation with the values of the covariates X at the level of individual units and territorial units:

$$y_{id} = x_{id}^{T}\beta + u_d + e_{id} \qquad (A6)$$

where $u_d \sim iid\; N(0, \sigma_u^2)$ and $e_{id} \sim iid\; N(0, \sigma_e^2)$, and is described by the formula:

$$\hat{Y}_d^{SYNTH} = \bar{X}_{.d}^{T}\hat{\beta} \qquad (A7)$$

The estimator does not account for sampling weights, and its MSE can be estimated using the formula:

$$\widehat{MSE}(\hat{Y}_d^{SYNTH}) = \hat{\sigma}_u^{2} + \bar{X}_{.d}^{T}\hat{V}\bar{X}_{.d} \qquad (A8)$$

where $\hat{V}$ is the estimated covariance matrix of the regression coefficients of the auxiliary variables.

The EBLUP estimator

Empirical Best Linear Unbiased Predictors (EBLUP) can be explained in the following manner: they are predictors for small areas that are best in the sense of having the least model variance, linear in the sense of being linear functions of the sample values y, and unbiased in the sense of lacking model-based bias. EBLUP is a composite estimator, combining direct linear estimators and regression synthetic estimators with weights depending on the value of the MSE estimators. In the case of the unit-level model, EBLUP can be defined as a weighted mean of the synthetic and GREG estimators. In the area-level model, EBLUP is a weighted mean of the direct and


regression synthetic estimators. The EBLUP estimator is constructed by replacing the unknown value of the variance with its estimate. The general formula of the EBLUP estimator takes the following form:

$$\hat{Y}_d^{EBLUP} = w_d^{EBLUP}\,\hat{Y}_d^{GREG} + \left(1 - w_d^{EBLUP}\right)\hat{Y}_d^{SYNTH} \qquad (A9)$$

In a more developed form, the model can be described as:

$$\hat{Y}_d = \gamma_d\left(\bar{y}_{.d} - \bar{x}_{.d}^{T}\hat{\beta}\right) + \bar{X}_{.d}^{T}\hat{\beta} \qquad (A10)$$

where:

$$w_d^{EBLUP} = \gamma_d = \frac{\hat{\sigma}_u^{2}}{\hat{\sigma}_u^{2} + \hat{\sigma}_e^{2}/n_d} \qquad (A11)$$

$\bar{y}_{.d}$ and $\bar{x}_{.d}$ are the sample means of y and of the covariates for area d respectively, and $\hat{\beta}$, $\hat{\sigma}_u^{2}$, $\hat{\sigma}_e^{2}$ are parameters estimated on the basis of the standard linear two-level model. The MSE can then be estimated using the formula:

$$\widehat{MSE}(\hat{Y}_d) = \gamma_d\,\frac{\hat{\sigma}_e^{2}}{n_d} + \left(1 - \gamma_d\right)^{2}\bar{X}_{.d}^{T}\hat{V}\bar{X}_{.d} \qquad (A12)$$

where $\hat{V}$ is, as above, the estimated covariance matrix of the regression coefficients.
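The shrinkage weight (A11) and the composition (A9) can be reproduced directly once the variance components of the two-level model have been estimated, e.g. with lme4. The sketch below uses hypothetical column names (y, x, domain) and simply illustrates the mechanics; it is not the EURAREA implementation used in the study.

```r
# EBLUP as a shrinkage combination of GREG and synthetic estimates, cf. (A9)-(A11).
library(lme4)

# 'smp' is a hypothetical unit-level sample with columns y, x, domain
fit <- lmer(y ~ x + (1 | domain), data = smp)      # standard two-level model
vc  <- as.data.frame(VarCorr(fit))
sigma2_u <- vc$vcov[vc$grp == "domain"]            # between-domain variance
sigma2_e <- vc$vcov[vc$grp == "Residual"]          # residual variance

n_d     <- table(smp$domain)                       # domain sample sizes
gamma_d <- sigma2_u / (sigma2_u + sigma2_e / as.numeric(n_d))   # (A11)

# given vectors of per-domain GREG and synthetic estimates (same domain order):
# eblup_d <- gamma_d * greg_d + (1 - gamma_d) * synth_d          # (A9)
```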

B. Precision assessment methods

Domain-specific assessment of estimates

As in both cases a simulation study was conducted for each of the estimators applied, expected values were computed from the results obtained in k (1000 or 100) replications to determine estimator variance, relative estimation error and relative bias. Measures of estimation precision were delivered for each domain and for all domains combined. Thus, it was possible to make both a synthetic assessment of estimator properties and one that accounted for domain size and the domains' unique characteristics. Mean values of estimator properties were estimated after k replications of the simulation study. In addition, distribution characteristics of the estimators were presented where possible. The mean value of the estimates after 1000 replications can be calculated from:

$$\hat{\bar{Y}}_d = \frac{1}{1000}\sum_{p=1}^{1000}\hat{Y}_{dp} \qquad (A13)$$

where d denotes the domain and p = 1, ..., 1000 the sample number. During the study this mean value was treated as the expected value of the estimator. The approximate value of the estimator variance was thus expressed as (Bracha, 1994, p. 33):

$$\hat{V}(\hat{Y}_d) = \frac{1}{999}\sum_{p=1}^{1000}\left(\hat{Y}_{dp} - \hat{\bar{Y}}_d\right)^{2} \qquad (A14)$$

The approximate value of the MSE estimate was computed using the following formula (Choudhry and Rao, 1993, p. 276):


$$\widehat{MSE}(\hat{Y}_d) = \frac{1}{999}\sum_{p=1}^{1000}\left(\hat{Y}_{dp} - Y_d\right)^{2} \qquad (A15)$$

where $Y_d$ denotes the "real" value of the estimated variable in the population in domain d. Root MSE is a measure which combines variance and squared bias. Its estimate is defined on the basis of the MSE:

$$\hat{S}_{MSE}(\hat{Y}_d) = \sqrt{\widehat{MSE}(\hat{Y}_d)} \qquad (A16)$$

The Relative Error of the Estimate (REE) was calculated on the basis of the value of the MSE:

$$\widehat{REE}_{MSE}(\hat{Y}_d) = \frac{\hat{S}_{MSE}(\hat{Y}_d)}{Y_d} \qquad (A17)$$

The absolute bias of the estimator was defined as the difference between the expected and the real value:

$$BIAS(\hat{Y}_d) = \hat{\bar{Y}}_d - Y_d = \frac{1}{1000}\sum_{p=1}^{1000}\hat{Y}_{dp} - Y_d \qquad (A18)$$

On the basis of the above characteristics, computed for each domain, it was possible to assess estimation precision for a domain, accounting for its specific nature, especially the number of units.
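Formulas (A13)–(A18) are straightforward to compute from a matrix of simulation results. A hedged R sketch, with a hypothetical matrix est of dimension (domains × replications) and a vector Y_true of "real" domain values:

```r
# Simulation-based precision measures, cf. (A13)-(A18).
# 'est' is a hypothetical D x 1000 matrix: est[d, p] = estimate for domain d in replication p.
# 'Y_true' is the vector of "real" domain values from the population.
precision_measures <- function(est, Y_true) {
  K        <- ncol(est)
  mean_est <- rowMeans(est)                              # (A13)
  v        <- rowSums((est - mean_est)^2) / (K - 1)      # (A14)
  mse      <- rowSums((est - Y_true)^2)   / (K - 1)      # (A15)
  data.frame(mean = mean_est,
             var  = v,
             rmse = sqrt(mse),                           # (A16)
             ree  = sqrt(mse) / Y_true,                  # (A17)
             bias = mean_est - Y_true)                   # (A18)
}
```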

Synthetic assessment of estimates for all domains

The arithmetic and weighted means (depending on domain size) of the estimated MSE in the domains are expressed by:

$$\widehat{MSE}(\hat{Y}) = \frac{1}{D}\sum_{d=1}^{D}\left[\frac{1}{999}\sum_{p=1}^{1000}\left(\hat{Y}_{dp} - Y_d\right)^{2}\right] = \frac{1}{D}\sum_{d=1}^{D}\widehat{MSE}(\hat{Y}_d)$$

$$\widehat{MSE}_W(\hat{Y}) = \sum_{d=1}^{D}\frac{N_d\,\widehat{MSE}(\hat{Y}_d)}{N} \qquad (A19)$$

where D is the number of domains, $N_d$ the population size of domain d, and $N = \sum_{d=1}^{D} N_d$. The values of the arithmetic and weighted mean of the root MSE (RMSE) were calculated using the following formulas:

$$\hat{S}_{MSE}(\hat{Y}) = \frac{1}{D}\sum_{d=1}^{D}\hat{S}_{MSE}(\hat{Y}_d) \quad\text{and}\quad \hat{S}_{MSE,W}(\hat{Y}) = \sum_{d=1}^{D}\frac{N_d\cdot\hat{S}_{MSE}(\hat{Y}_d)}{N} \qquad (A20)$$

The mean REE was determined using:

$$\widehat{REE}_{MSE} = \frac{1}{D}\sum_{d=1}^{D}\widehat{REE}_{MSE}(\hat{Y}_d) \quad\text{and}\quad \widehat{REE}_{MSE,W} = \sum_{d=1}^{D}\frac{N_d\cdot\widehat{REE}_{MSE}(\hat{Y}_d)}{N} \qquad (A21)$$

The above characteristics enable a synthetic assessment of estimation precision regardless of domain size.
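Continuing the previous sketch, the across-domain summaries (A19)–(A21) are simple and weighted means of the per-domain measures; N_d is a hypothetical vector of domain population sizes:

```r
# Across-domain summaries, cf. (A19)-(A21); 'pm' is the output of precision_measures()
# and 'N_d' a hypothetical vector of domain population sizes (same ordering).
summarise_domains <- function(pm, N_d) {
  w <- N_d / sum(N_d)
  c(MSE  = mean(pm$rmse^2), MSE_W  = sum(w * pm$rmse^2),   # (A19)
    RMSE = mean(pm$rmse),   RMSE_W = sum(w * pm$rmse),     # (A20)
    REE  = mean(pm$ree),    REE_W  = sum(w * pm$ree))      # (A21)
}
```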


Annex 2

Table A1: Relative Estimation Error of economic activity estimates for domains according to different research approaches

Domain | Research approach                       | REE(DIR) | REE(GREG) | REE(SYNTH) | REE(EBLUP)
-------|-----------------------------------------|----------|-----------|------------|-----------
1      | Education                               | 0,0192   | 0,0165    | 0,0168     | 0,0159
1      | No Education                            | 0,0192   | 0,0171    | 0,0190     | 0,0166
1      | Imputed Education                       | 0,0192   | 0,0163    | 0,0232     | 0,0158
1      | Imputed Education, Calibration Weights  | 0,0193   | 0,0261    | 0,0148     | 0,0158
2      | Education                               | 0,0291   | 0,0247    | 0,0169     | 0,0232
2      | No Education                            | 0,0291   | 0,0258    | 0,0190     | 0,0243
2      | Imputed Education                       | 0,0291   | 0,0240    | 0,0232     | 0,0232
2      | Imputed Education, Calibration Weights  | 0,0300   | 0,0243    | 0,0242     | 0,0232
3      | Education                               | 0,0321   | 0,0277    | 0,0162     | 0,0250
3      | No Education                            | 0,0321   | 0,0286    | 0,0185     | 0,0263
3      | Imputed Education                       | 0,0321   | 0,0273    | 0,0224     | 0,0257
3      | Imputed Education, Calibration Weights  | 0,0334   | 0,0231    | 0,0279     | 0,0257
4      | Education                               | 0,0455   | 0,0379    | 0,0193     | 0,0337
4      | No Education                            | 0,0455   | 0,0396    | 0,0231     | 0,0366
4      | Imputed Education                       | 0,0455   | 0,0332    | 0,0250     | 0,0321
4      | Imputed Education, Calibration Weights  | 0,0614   | 0,0281    | 0,0473     | 0,0321
5      | Education                               | 0,0260   | 0,0218    | 0,0167     | 0,0210
5      | No Education                            | 0,0260   | 0,0227    | 0,0191     | 0,0220
5      | Imputed Education                       | 0,0260   | 0,0209    | 0,0230     | 0,0206
5      | Imputed Education, Calibration Weights  | 0,0272   | 0,0243    | 0,0215     | 0,0206
6      | Education                               | 0,0192   | 0,0162    | 0,0171     | 0,0159
6      | No Education                            | 0,0192   | 0,0168    | 0,0190     | 0,0165
6      | Imputed Education                       | 0,0192   | 0,0158    | 0,0236     | 0,0156
6      | Imputed Education, Calibration Weights  | 0,0197   | 0,0244    | 0,0161     | 0,0156
38     | Education                               | 0,0309   | 0,0260    | 0,0169     | 0,0245
38     | No Education                            | 0,0309   | 0,0271    | 0,0187     | 0,0256
38     | Imputed Education                       | 0,0309   | 0,0253    | 0,0233     | 0,0245
38     | Imputed Education, Calibration Weights  | 0,0318   | 0,0249    | 0,0250     | 0,0245
39     | Education                               | 0,0204   | 0,0169    | 0,0161     | 0,0167
39     | No Education                            | 0,0204   | 0,0178    | 0,0185     | 0,0175
39     | Imputed Education                       | 0,0204   | 0,0166    | 0,0223     | 0,0166
39     | Imputed Education, Calibration Weights  | 0,0207   | 0,0231    | 0,0165     | 0,0166
40     | Education                               | 0,0205   | 0,0172    | 0,0166     | 0,0170
40     | No Education                            | 0,0205   | 0,0179    | 0,0191     | 0,0177
40     | Imputed Education                       | 0,0205   | 0,0168    | 0,0230     | 0,0167
40     | Imputed Education, Calibration Weights  | 0,0210   | 0,0225    | 0,0179     | 0,0167
41     | Education                               | 0,0256   | 0,0218    | 0,0171     | 0,0209
41     | No Education                            | 0,0256   | 0,0224    | 0,0188     | 0,0218
41     | Imputed Education                       | 0,0256   | 0,0214    | 0,0236     | 0,0207
41     | Imputed Education, Calibration Weights  | 0,0269   | 0,0224    | 0,0240     | 0,0207
42     | Education                               | 0,0441   | 0,0371    | 0,0188     | 0,0327
42     | No Education                            | 0,0441   | 0,0389    | 0,0230     | 0,0357
42     | Imputed Education                       | 0,0441   | 0,0327    | 0,0245     | 0,0307
42     | Imputed Education, Calibration Weights  | 0,0609   | 0,0255    | 0,0544     | 0,0307

Source: Own calculations


Quality Assessment of Register-Based Statistics – Preliminary Results for the Austrian Census 2011

Predrag Ćetković1, Stefan Humer1, Manuela Lenk2, Mathias Moser1, Henrik Rechta2, Matthias Schnetzer1, Eliane Schwerer2

1 Vienna University of Economics and Business (WU), Augasse 2-6, 1090 Vienna, Austria.
2 Statistics Austria, Guglgasse 13, 1110 Vienna, Austria.
[email protected]

Abstract The present paper investigates the quality of register data in the context of a standardized quality framework. The focus lies on the assessment of the quality of derived attributes. Such attributes are of high importance for the register-based census in Austria. In order to get a quality measure for the necessary attributes of the census, we have to check the accuracy of the register data. Among other things, the congruency of data between the registers and a comparison data source have to be examined. This may lead to complications in the case of derived attributes, since there may be no data available, which could be used directly for comparison with the register data. Therefore, we have to consider alternative methods in applying our quality framework for derived attributes.

Keywords: administrative data, register-based census, derived attributes.

1. Introduction

Administrative records have become more important for statistical analyses in recent years. The use of administrative data sources has a long tradition in Scandinavian countries and is applied extensively for statistical purposes. One major application is, for example, the register-based census. Administrative data have several advantages over standard surveys. For example, they are already recorded and reduce the statistical burden of respondents significantly. On the contrary, the quality of administrative data heavily depends on the data provider. In general, national statistical institutions (NSIs) have little information on the accuracy and reliability of these data. Since Austria, among other countries, will carry out its first register-based census in 2011, it is a central task to assess administrative registers and to evaluate their quality. Quality assessment of register data has to fulfill several properties like transparency, accuracy or feasibility. To achieve these goals, we set up a general framework, which makes it possible to evaluate the quality of registers with regard to all available information. The present paper deals with the application of this quality framework for the case of derived attributes. These attributes are of high importance, because it is possible that none of the available registers contains an attribute, which is necessary for the register-based census. In this case, related attributes, which could be used for the derivation of the relevant attribute, have to be found. Since a relevant attribute may be derived from several raw data attributes, we would have to check the accuracy of all raw data information. Thus, appropriate comparison data for each raw data attribute should be available in order to check for congruency of data between the registers and a comparison data source. If there is no such comparison data available, we would rely on expert opinions. Since expert opinions may be associated with problems of subjectivity, we

Page 124 of 199 consider an alternative method, where only the congruency between the derived attribute itself (data in the Census Database) and the comparison data is checked. The derived attribute we have analyzed in this paper, is the current activity status. For this attribute it is also possible to check the congruency between registers, which were used in the derivation process, and the comparison source. Thus, we are able to compare the results of both methods in order to check for possible discrepancies between the two alternatives. The remainder of the paper is structured as follows. Section 2 gives a general overview of the quality framework and explains its most important elements. The application of the quality framework for analyzing derived attributes is explained in detail in section 3. Section 4 then shows the results of the quality assessment of the attribute current activity status. The last section concludes.

2. Quality Framework

Statistical data quality can be covered by several dimensions like timeliness or accuracy (Eurostat, 2003a). This also applies to administrative data as has been stressed by Eurostat (2003b). There is only few literature which deals with quality assessment of administrative data sources. Some national statistical institutions, like Statistics Finland, focus on the comparison between administrative and survey data (Ruotsalainen, 2008). Other countries, for example the Netherlands, take a more structural approach (Daas et al., 2009). Their aim is to cover the quality of different registers in a framework using different dimensions to assess data quality and accuracy. They developed a checklist for the quality evaluation of administrative data sources, which is structured in three different hyperdimensions of quality aspects. Our approach is an extension of the framework proposed by Daas et al. (2009) and it contributes a framework for structural assessment of administrative data to the field of quality research. This allows both the NSI and external researchers to assess the data sources they use. In our quality framework we focus on data accuracy, since this is the most challenging dimension. Moreover, accuracy is essential for the quality of the register-based census and is at the same time a major unknown property of register data. Quantification of data accuracy is realized by a framework, which is closely tied to the data flow, but independent from data processing. This is necessary since results of the quality assessment must not influence but evaluate the processing procedure. Whether low quality ratings lead to a revision of the data processing steps has to be determined for each statistical application independently. Experience from the test census suggests that this is not a major concern for the Austrian Census, since data quality is expected to be fairly high (Lenk, 2008). The quality framework, which is shown in Figure 1, is linked to the data flow on three different levels. In a first step, Statistics Austria receives the raw data (henceforth registers, see boxes on the left-hand side in Figure 1). In the next step, these different sources are combined to data cubes, the Census Database (CDB), by using unique IDs. These cubes solely include information available from the registers (raw data). Finally, we enrich the CDB with imputations of item non-responses. These steps result in a Final Data Pool (FDP), which consists of both real and estimated values. In each of these three steps (Registers, CDB and FDP) the data flow is linked to the quality assessment, so that changes can be monitored from a quality perspective. As a result, exactly one quality indicator for each attribute in each register or data pool is calculated (qij in Figure 1).

Figure 1: Quality Framework

[Figure 1 is a diagram: it shows the data flow from the raw-data registers (Reg 1, Reg 2, ...) to the Census Database (Ψ) and on to the Final Data Pool (Ω), with a quality indicator q_ij attached to every attribute at every stage. Legend: q_ij – quality indicator; HD^D – Documentation; HD^P – Pre-processing; HD^E – External sources; HD^I – Imputations; attributes may be unique, multiple or derived.]

The quality assessment of the registers consists of three hyperdimensions: Documentation (HD^D), Pre-processing (HD^P) and External Source (HD^E). The first hyperdimension, HD^D, includes all quality-related aspects prior to seeing the data. Such aspects are, for example, plausibility checks, data collection methods or legal enforcement of data recording by the provider of the administrative data. Thus, it is a measure of the degree of confidence we put in the data provider. HD^D is realized through a questionnaire which is filled out in accordance with the register authority. For each question there is a maximum score that can be obtained. Summing up the scores for the questions and comparing this sum to the maximum score leads to the quality indicator

$$HD^D = \frac{\text{obtained score}}{\text{maximum score}} \qquad (1)$$

The second aspect of the quality framework, HD^P, is concerned with formal errors in the raw data. Thus, it checks for definition and range errors, as well as missing primary keys and item non-responses. Usable records are therefore calculated by subtracting all incorrect entries from the total number of observations. The quality measure for the hyperdimension Pre-processing is given by:

$$HD^P = \frac{\text{number of usable records}}{\text{total number of records}} \qquad (2)$$

In the last step we then investigate the congruency of the data by comparing it to an external source (HD^E). This is primarily done using existing surveys (i.e. the Austrian

Microcensus). The Microcensus is an appropriate comparison source, because we can link its data via a unique key with the data in the registers or the CDB in order to compare the values and check for consistency on the unit level. As a result, we get the quality measure

$$HD^E = \frac{\text{number of consistent values}}{\text{total number of linked records}} \qquad (3)$$

If an attribute is not found in the Microcensus, we rely on expert opinions. The expert is a person at Statistics Austria who is responsible for the administrative register and therefore has experience with the quality of the data. For further information on the three hyperdimensions see Berka et al. (2010). The quality indicator q_ij on register level results from a weighted combination of the three hyperdimensions. Thus, appropriate weights, which reflect the relative importance of each hyperdimension, have to be chosen. In a further step, we can use the quality indicators to assess the quality of the data in the Census Database. In comparing the CDB with the raw data registers, we can generally distinguish three cases: a) a single comparison register is available (see Figure 1, attribute C), b) multiple registers to compare with are available (see Figure 1, attribute A), and c) no raw data register with a similar attribute is available (see Figure 1, attributes F and G). Case a) is trivial to assess, since the confidence we put in the CDB is simply q_ij, the quality indicator for the specific attribute j in register i. A unique attribute is, for example, the level of education. For multiple attributes (e.g. sex), a specific method must be applied in order to deal with quality indicators from different data sources. This is most important in cases where the information differs between these data sources. In this case, the Dempster-Shafer theory is an appropriate method to assess the quality of the data (Dempster, 1968; Shafer, 1992). A detailed investigation of the quality of multiple attributes is provided in Berka et al. (forthcoming). The case of derived attributes (e.g. current activity status) is the subject of the present paper and will be explained in detail in the following sections.
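A hedged sketch of how the three indicators (1)–(3) can be computed for one register and one attribute; the objects questionnaire, reg and linked, together with their column names, are hypothetical placeholders and not part of the authors' framework:

```r
# Hyperdimension indicators (1)-(3) for a single register (illustrative).
# 'questionnaire' : data frame with columns score, max_score (HD^D questionnaire)
# 'reg'           : raw register with a primary key 'id'
# 'in_range'      : logical vector flagging records that pass definition/range checks
# 'linked'        : unit-level file linking the register to the Microcensus, with
#                   columns value_register and value_microcensus for one attribute
hd_documentation <- function(questionnaire)
  sum(questionnaire$score) / sum(questionnaire$max_score)               # (1)

hd_preprocessing <- function(reg, in_range) {
  usable <- !is.na(reg$id) & !duplicated(reg$id) & in_range             # formal checks
  sum(usable) / nrow(reg)                                               # (2)
}

hd_external <- function(linked)
  mean(linked$value_register == linked$value_microcensus, na.rm = TRUE) # (3)

# register-level quality indicator as an equally weighted combination:
# q_ij <- (hd_D + hd_P + hd_E) / 3
```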

3. Quality assessment of derived attributes

Derived attributes are such, for which the registers do not contain any information in the required specification. However, if the raw data contain attributes, which are related to those we are looking for, they could be used for the derivation of the latter (e.g. attribute F in Figure 1). Such an attribute is, for example, the current activity status, which can be derived from various registers, like the Unemployment Register or the Central Social Security Register. Moreover, a relevant attribute may also be derived from an attribute in the CDB (e.g. attribute G in Figure 1). This may be necessary, if there is no information on raw data level, which could be used directly for the derivation of the relevant attribute. An example for this type of attribute is the occupation, which is derived on CDB level (among other) from the current activity status, which is a derived attribute itself. As has been mentioned, more than one register may be used for the derivation of a specific attribute. Thus, if the number of used registers gets large, we would have to assess the quality for a high number of attributes used in the derivation process. Apart from the extended number of applications, no further problems will arise for the hyperdimensions

Documentation and Pre-processing. By contrast, the hyperdimension External Source may lead to further complications, since the congruency of the data between all used registers and the comparison source has to be checked. Particularly cases where no appropriate information for each raw data attribute can be found in the primary comparison source (the Austrian Microcensus) will be associated with additional problems and will require other external sources. Alternative external sources are, for example, expert opinions. However, since expert opinions may suffer from subjectivity, the reliability of this type of external source could be questioned. Additionally, such expert interviews would be associated with an increased work effort. In order to deal with these shortcomings, we consider an alternative method, which differs from the first one only with respect to the application of the hyperdimension External Source. The first method of assessing the quality measure for derived attributes is shown in Figure 2. The three hyperdimensions are all applied on raw data. This results in the quality indicators q_1B, q_2D and q_2E for the attributes B, D and E respectively. A weighted combination of the three quality indicators will then lead to the quality indicator for the derived attribute (q_F). It may also be necessary to assess the errors of the derivation process itself. Therefore, it can be helpful to check the validity of the derived attribute using an external source (HD^E for the attribute F in Figure 2). A combination of q_F with the hyperdimension HD^E on CDB level leads to the quality indicator q_ΨF. However, since for the purpose of this paper we are interested in the quality measure q_F, the indicator q_ΨF has not been calculated for the first method.

Figure 2: Derived attributes, method a

In the second method, only the hyperdimensions Documentation and Pre-processing are carried out on raw data level, while the hyperdimension External Source is applied on CDB level (Figure 3). Since there is no direct measure of HD^E for raw data, the quality indicators for the attributes on raw data level, q_1B, q_2D and q_2E, cannot be assessed. This is due to the possibility that different attributes in the CDB may be derived from the same raw data attribute. As the hyperdimension External Source is applied on CDB level, a variety of HD^E quality indicators would be available for the same raw data attribute. Thus, an assignment of the calculated quality measures to raw data would lead to ambiguous results. However, it is possible to assess the quality indicator for the derived attribute (q_ΨF), which is calculated by a weighted combination of HD^D and HD^P on raw data level with HD^E on CDB level.

Figure 3: Derived attributes, method b

4. Results

For the Austrian register-based census, many attributes are of the nature of derived attributes. The first attribute of this type we deal with, is the current activity status. For the derivation of this attribute, we use several registers. Since none of these registers contain the current activity status in the required specification, related attributes have to be found. Additionally, the specification of these related attributes differs from register to register, so that we end up with 8 different attributes, each of which is included in a separate register. The used registers are the Central Social Security Register (CSSR), the Unemployment Register (UR), the Register of Social Welfare Recipients (RSWR), the Data of the Federal Chambers (FC), the Registers of Public Servants of the Federal State and the Laender (RPS), the Conscription Register (CR), the Tax Register (TR) and the Register of Enrolled Pupils and Students (REPS). As has been mentioned in Section 3, the application of the hyperdimension HDE may lead to complications for the case of derived attributes. This is due to the possibility that the primary comparison source, the Austrian Microcensus, may either not contain all attributes, which are necessary for comparison or the specification in the Microcensus does not fit with the specification of raw data. For the current activity status, it was possible to find an attribute in the Microcensus (activity status), which could be used for comparison with the current activity status on CDB level and all attributes on raw data level except the data from the Register of Enrolled Pupils and Students. The latter was therefore compared with an other attribute in the Microcensus (participation in education). However, it was necessary to respecify the relevant raw data attributes as well as the current activity status, so that they could be compared with Microcensus data.1 The eventual categories for the REPS are: currently in education and currently not in education. All other raw data attributes as well as the current activity status itself have been classified into the following specifications: employed, unemployed, not

1 The Register of Enrolled Pupils and Students contains only persons currently in education. The attribute participation in education from the Microcensus is classified into the categories currently in education and not currently in education. The other raw data attributes have in sum more than 1,100 specifications. By applying a ruleset, Statistics Austria reduces these different categories to about 40. In a further step, we reduce these 40 classes to 5 in order to make a comparison with Microcensus data possible.

Page 129 of 199 economically active, military and civil servants and persons under 15 years.2 The whole population in the Central Database consists of all unique entries in the Central Population Register (CPR). The current activity status in the CDB is derived by using a predefined ruleset, where each applied register contributes to a different degree in the derivation process. The applied ruleset is in accordance with international standards. In order to get an overall quality measure for the attribute current activity status, the quality indicators of the relevant raw data attributes have to be weighted by their contribution to the derivation of the current activity status. These contribution shares are shown in Table 1, where it can be seen that the current activity status has been derived in most cases from the Central Social Security Register (77.18% of all CDB entries). Because of a lack of data in other registers, 5.02% of the entries in the CDB have been derived from data in the Central Population Register (last column in Table 1).

Table 1: Shares of registers in the derivation of the current activity status in per cent

Register | CSSR  | UR   | RSWR | FC   | RPS  | CR   | TR   | REPS  | CPR
---------|-------|------|------|------|------|------|------|-------|-----
w_i      | 77.18 | 3.15 | 0.79 | 0.09 | 0.18 | 0.18 | 0.24 | 13.17 | 5.02

The results of the first method, where all three hyperdimensions are applied on raw data level, is shown in Table 2. As can be seen, the hyperdimension Documentation shows a high variability between the registers. It should be mentioned here that this hyperdimension has been hitherto conducted only for the Central Social Security Register, the Unemployment Register and the Register of Enrolled Pupils and Students. The values for the remaining registers are therefore approximated values. The hyperdimension Pre- Processing assigns a high quality to all attributes. Because there are in general only a few items, which do not have an unique ID, the measure for HDP is in most cases slightly less than one. By contrast, the raw data do not suffer from item non-responses or out of range-values. According to the hyperdimension External Source, raw data is in most cases consistent with data in the Microcensus. However, with a value of 0.38, the attribute from the Unemployment Register has a very low quality when it is applied for the derivation of the current activity status. This is probably due to the different definition of unemployment between the Unemployment Register and the Microcensus.3 As the Central Population register does not contain any information regarding the current activity status, those entries, which have been derived from the CPR have been defined as not economically active and their quality indicator has been set to 0. The three hyperdimensions have been equally weighted by 1/3. The combination of the quality

2 Persons under 15 years are not directly surveyed in the Austrian Microcensus regarding the attribute activity status. Thus, for the application of the hyperdimension External Source, these persons have been dropped out of Microcensus data. As persons under 15 years are not highly represented in most registers, dropping out this group will not really influence the results. 3 In comparing the Unemployment Register with the Microcensus, 1,539 persons could be linked for the 4th quarter 2009. In the Unemployment Register, 1,216 out of these 1,539 cases have the status unemployed. The remaining cases are mostly persons, which participate in job-training courses and thus are not counted as unemployed. From the 1,216 cases, which are unemployed according to the Unemployment Register, only 481 are also declared as unemployed in the Microcensus, whereas 354 persons are considered as employed and 381 as not economically active.

Table 2: Results, method a

Register | HD^D_i | HD^P_i | HD^E_i | q_ij
---------|--------|--------|--------|-----
CSSR     | 0.86   | 0.97   | 0.92   | 0.92
UR       | 0.62   | 1.00   | 0.38   | 0.67
RSWR     | 0.93   | 0.99   | 0.91   | 0.94
FC       | 0.38   | 0.98   | 0.95   | 0.77
RPS      | 1.00   | 0.98   | 0.97   | 0.98
CR       | 0.88   | 1.00   | 0.77   | 0.88
TR       | 0.79   | 0.96   | 0.95   | 0.90
REPS     | 0.86   | 0.98   | 0.83   | 0.89

indicators q_ij of the raw data attributes (using the weights w_i) results in the quality measure for the attribute current activity status (q_current activity status), which has a value of 0.862.

$$q_F = \sum_i \left(q_{ij}\cdot w_i\right) = \sum_i \left[\left(\tfrac{1}{3}HD_i^D + \tfrac{1}{3}HD_i^P + \tfrac{1}{3}HD_i^E\right)\cdot w_i\right] = 0.862 \qquad (4)$$

Table 3 shows the results for the second method. The values for the hyperdimensions Documentation and Pre-processing as well as the weights for the three hyperdimensions are the same as in the first method. The hyperdimension External Source, which has been assessed for the attribute current activity status in the Census Database, has a high quality. As a consequence, the quality indicators of four registers would be improved in comparison to the first alternative. This is particularly true for the Unemployment Register, where the quality indicator would now be 0.85, compared to 0.67 in the first method. However, as the hyperdimension has been assessed on CDB level, we cannot really assign these quality indicators to the registers (see Section 3 for an explanation). The weighted quality measure for the current activity status (q_Ψ,current activity status) now has a value of 0.872, which is slightly higher than in the first method.

Table 3: Results, method b

Register | HD^D_i | HD^P_i | HD^E_Ψ | (q_ij)
---------|--------|--------|--------|-------
CSSR     | 0.86   | 0.97   | 0.92   | 0.92
UR       | 0.62   | 1.00   | 0.92   | 0.85
RSWR     | 0.93   | 0.99   | 0.92   | 0.95
FC       | 0.38   | 0.98   | 0.92   | 0.76
RPS      | 1.00   | 0.98   | 0.92   | 0.97
CR       | 0.88   | 1.00   | 0.92   | 0.93
TR       | 0.79   | 0.96   | 0.92   | 0.89
REPS     | 0.86   | 0.98   | 0.92   | 0.92

$$q_{\Psi F} = \sum_i \left[\left(\tfrac{1}{3}HD_i^D + \tfrac{1}{3}HD_i^P + \tfrac{1}{3}HD_\Psi^E\right)\cdot w_i\right] = 0.872 \qquad (5)$$
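Because Tables 1–3 publish both the contribution shares and the (rounded) hyperdimension scores, equations (4) and (5) can be reproduced with a few lines of R. The sketch below uses the rounded table values, so agreement with 0.862 and 0.872 holds up to rounding; the object names are our own, and entries derived from the Central Population Register are assumed to enter with a quality indicator of 0 in both methods, as stated in the text for method a.

```r
# Reproducing (4) and (5) from the rounded values in Tables 1-3.
w <- c(CSSR = 77.18, UR = 3.15, RSWR = 0.79, FC = 0.09, RPS = 0.18,
       CR = 0.18, TR = 0.24, REPS = 13.17, CPR = 5.02) / 100   # Table 1 shares

q_a <- c(CSSR = 0.92, UR = 0.67, RSWR = 0.94, FC = 0.77, RPS = 0.98,
         CR = 0.88, TR = 0.90, REPS = 0.89, CPR = 0)           # Table 2, method a
q_b <- c(CSSR = 0.92, UR = 0.85, RSWR = 0.95, FC = 0.76, RPS = 0.97,
         CR = 0.93, TR = 0.89, REPS = 0.92, CPR = 0)           # Table 3, method b

sum(w * q_a)   # ~0.862, equation (4)
sum(w * q_b)   # ~0.872, equation (5)
```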

5. Conclusion

In this paper we investigated the quality of administrative data for the purpose of applying these data for the register-based census. A general quality framework was adapted in order to deal with derived attributes, which are of high importance for the census. For this purpose, two different methods have been carried out. The first method does the whole quality assessment (all three hyperdimensions) on raw data, whereas the second method shifts the hyperdimension External Source to the data in the Census Database. The first derived attribute we have dealt with is the current activity status. In order to get this attribute, related attributes from 8 different registers have been used. The results for the first method show that most of the used raw data attributes have a high quality measure. Thus, the overall quality indicator for the current activity status is 0.862, which is fairly high. If the second method is applied, the quality indicator for the current activity status increases slightly to 0.872. The similarity of the results of the two alternatives indicates that there are no problems in applying the second method for the quality assessment of the attribute current activity status. This is a positive finding, because the hyperdimension External Source has to be done only for the derived attribute and not for all raw data attributes. Thus, complications associated with non-availability of comparison data or subjectivity of potential expert opinions are reduced. However, the positive result for the current activity status does not guarantee that the application of the two alternative methods would lead to the same conclusion for other derived attributes.

References

Berka C., Humer S., Lenk M., Moser M., Rechta H., Schwerer E. (2010) A quality framework for statistics based on administrative data sources using the example of the Austrian census 2011, Austrian Journal of Statistics, 39.
Berka C., Humer S., Lenk M., Moser M., Rechta H., Schwerer E. (forthcoming) Combination of evidence from multiple administrative data sources – Quality assessment of the Austrian register-based census 2011, Statistica Neerlandica.
Daas P., Ossen S., Vis-Visschers R., Arends-Tóth J. (2009) Checklist for the quality evaluation of administrative data sources, Statistics Netherlands Discussion Paper.
Dempster A. (1968) A generalization of Bayesian inference, Journal of the Royal Statistical Society, Series B (Methodological), 30, 205-247.
Eurostat (2003a) Item 4.2: Methodological Documents – Definition of Quality in Statistics, in: Working group assessment of quality in statistics.
Eurostat (2003b) Quality assessment of administrative data for statistical purposes, in: Assessment of quality in statistics.
Lenk M. (2008) Methods of register-based census in Austria, Statistics Austria Technical Report.
Ruotsalainen K. (2008) Finnish register-based census system, Statistics Finland Technical Report.
Shafer G. (1992) Dempster-Shafer Theory, in: Encyclopedia of Artificial Intelligence, Shapiro S. (ed.), Wiley, 330-331.

The integration of the Spanish labour force survey with the administrative source of persons with disabilities

Amelia Fresneda INE, [email protected]

Abstract: The “Employment of persons with disabilities” investigates the situation, with regard to the labour market, of the group of persons between the ages of 16 and 64 years old who hold disability certificates. This operation provides data regarding the labour force (employed persons, unemployed persons) and for the population outside the labour market (inactive persons) within the group of persons with disabilities. It is formed as a periodic operation of an annual nature that uses information deriving from integrating statistical data supplied by the Economically Active Population Survey (EAPS, that is the Spanish Labour Force Survey, LFS) with the administrative data recorded in the State Database of Persons with Disabilities (SDPD).

1. Introduction

The “Employment of persons with disabilities” investigates the situation, with regard to the labour market, of the group of persons between the ages of 16 and 64 years old who hold disability certificates.

This operation provides data regarding the labour force (employed persons, unemployed persons) and for the population outside the labour market (inactive persons) within the group of persons with disabilities.

It is formed as a periodic operation of an annual nature that uses information deriving from integrating statistical data supplied by the Economically Active Population Survey (EAPS, that is the Spanish Labour Force Survey, LFS) with the administrative data recorded in the State Database of Persons with Disabilities (SDPD).

1.1 Necessity of information

The group of persons with disabilities has formed an axis for priority action in social policies carried out in recent years in order to achieve integration of these persons in the workplace.

In particular, it is an essential point of interest for:

- The Spanish Committee of Representatives of Persons with Disabilities (CERMI) that is the Spanish umbrella organisation representing the interests of more than 3.8 million women and men with disabilities in Spain. The mission of CERMI is to guarantee equal opportunities of women and men with disabilities and to protect their human rights, ensuring they are fully included in society.


- The ONCE Foundation, whose main objective is to implement integration programmes of work-related training and employment for people with disabilities, and universal accessibility, promoting the creation of universally accessible environments, products and services.

- The Elderly and Social Services Institute (IMSERSO) that is the Social Security Administration Body responsible for handling Social Services supplementing Social Security System provisions and which deals with older and dependent persons.

- The Spanish National Institute of Statistics (INE), that, as coordinator of the official statistics, has the mission of completing the lack of information and of promoting the utilization of administrative sources to produce data without increasing the budget and without overburdening the informants.

In 2009, the available information on Spanish employment and disability was:

1. SDIH-1986: the Survey on Disabilities, Impairments and Handicaps
2. SDIHS-1999: the Survey on Disabilities, Impairments and Health Status
3. Ad-hoc Module 2002 on employment of disabled people for the labour force sample survey, provided for by Council Regulation
4. DIDSS-2008: the Disabilities, Independence and Dependency Situations Survey.

These surveys are carried out in wide, irregular and infrequent periods, so that a continuous monitoring of the situations of the persons with disabilities cannot be performed. But information is required for evaluating the effectiveness of the policies and the current situation of persons with disabilities, mainly in 2008, year in which there is special interest in measuring the crisis effects.

1.2 Looking for a solution

In June 2009, a meeting between INE, CERMI and the ONCE Foundation took place. Both organisations demanded from INE periodic information about the employment of persons with disabilities and, to obtain it, suggested introducing the variable "disability" into all the social surveys.

INE shows several considerations against this proposal:

- On the one hand, the reduction of the burden on informants is a primary objective for the institute.

- On the other hand, studying any phenomenon requires a set of questions in the questionnaires to ensure the correct definition of the concept, so the questionnaires would have to be extended. This would reduce the quality of the responses, mainly in the EAPS, which already has a large questionnaire with a great number of questions (labour, family relations, education, annual modules…).

In this situation INE and CERMI propose looking for alternative ways. There are two possible sources of information:


- The Economically Active Population Survey (EAPS): It is the Spanish Labour Force Survey (LFS) and involves the most complete source of information about the situation of the labour market. The main objective of the EAPS is to reveal information on economic activities as regards their human component.

- The State Database of Persons with Disabilities (SDPD): It is a registration system, with national scope, for persons with disabilities. It provides information regarding the features of citizens who have officially been recognised as persons with disabilities by the State administrative bodies with jurisdiction. It is managed by the IMSERSO.

Therefore it was decided that the best solution is to use information derived from integrating the statistical data supplied by the EAPS with the administrative data recorded in the SDPD.

This solution will permit getting the required information without overburdening the informants and without increasing the INE budget (given that the introduction of this system would be partially financed by the ONCE Foundation).

1.3 Working group

For accomplishing this new register-based statistical operation regarding employment and disability, the CERMI, ONCE-Foundation, IMSERSO and INE have created a working group with the aim of sharing work, knowledge and experience.

During 2010 this working group evaluated 2008 data to study the feasibility of obtaining relevant, reliable and periodic information on persons with disabilities and their labour market situation by merging the EAPS with the SDPD. Over that year, "The employment of persons with disabilities 2008" was developed as a pilot study for evaluating the possibility of obtaining information on "Disability and Employment" by crossing the 2008 EAPS with the 2006 SDPD (and, previously, the 2008 EAPS with the INE Population Register, in order to assign the identification number to the EAPS records).
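The crossing described above is, at its core, a deterministic record linkage on the personal identification number followed by the derivation of a disability flag. The R sketch below illustrates that step with entirely hypothetical object and variable names (eaps, population_register, sdpd, id_number, etc.); it is not the actual INE production procedure. The 33% threshold is the legal degree of disability mentioned later in Section 2.2.1.

```r
# Illustrative linkage of the LFS (EAPS) with the disability database (SDPD).
# All object and column names are hypothetical placeholders.

# step 1: attach the personal identification number to the EAPS records
#         via the population register
eaps_id <- merge(eaps,
                 population_register[, c("household", "person", "id_number")],
                 by = c("household", "person"), all.x = TRUE)

# step 2: flag EAPS persons found in the SDPD with a recognised degree >= 33%
sdpd_valid <- subset(sdpd, degree >= 33)
eaps_id$disability <- eaps_id$id_number %in% sdpd_valid$id_number

# step 3: weighted estimate of, e.g., the employed population with disabilities
with(subset(eaps_id, disability & status == "employed"), sum(weight))
```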

The IMSERSO has provided the 2006-SDPD that at that moment was the latest available update.

The CERMI contributes with its knowledge, experience and technical support.

The ONCE-Foundation, besides technical support as CERMI, provides financing for hiring supporting staff.

The INE supplies the 2008-EAPS data and assumes the technical and operative tasks for establishing the methodological bases and the procedures to get periodical results.

The success of this study has led INE to establish it as a periodic operation of an annual nature and to extend it by merging with other administrative sources (Social Security, Pensions and Dependency)


2. Description of the project

2.1 Description of the objectives

The overall objective of the project is to fulfil the demand of information on the situation of the persons with disabilities with respect to the labour market.

The focuses of the study are:
• To estimate the number of employed, unemployed and inactive persons inside the collective of persons with disabilities, as well as their comparison with the persons without disabilities.
• To carry out the analysis of disability and the labour market from the perspective of gender.
• To implement the analysis of participation in the labour market from the perspective of the type of impairment.
Also, there are complementary objectives derived from the information available in the EAPS and SDPD data:
• To obtain the evolution and variation over time of the number of active/inactive persons with disabilities and their comparison with the evolution among persons without disabilities.
• To ascertain the characteristics of persons with disabilities, relating their personal, family and geographical features to their labour market situation.
• To determine the features of the type of disability and its severity (SDPD variables) versus the employment and household variables (EAPS variables).

2.2 Characteristics

2.2.1. DEFINITIONS (established by the sources of information used)

The variables employed/unemployed/inactive are defined by the LFS regulation. The variable disability is defined by the Spanish legislation.

- Employed population: all persons aged 16 and over who, during the reference week: a) either worked for at least one hour, even sporadically or occasionally, in exchange for a salary, wages or another form of remuneration in cash or in kind; b) or were employed but not working (due to illness, holidays, leave, labour disputes, bad weather, etc.).
- Unemployed population: persons aged 16 and over who simultaneously meet the following conditions: a) without work; b) seeking work; c) available for work.
- Economically active population: employed plus unemployed population.

- Inactive population: persons aged 16 and over who in the reference week cannot be classified as either employed or unemployed.

- Disability: It is a wide concept that can be analysed from several points of view.

For the World Health Organization (WHO), disability is an umbrella term encompassing impairments, activity limitations and participation restrictions. The WHO expands the concept of health by incorporating environmental factors (the physical, social and attitudinal environment in which persons live and carry out their lives). From the legal point of view, however, there is an administrative procedure, defined by regulation[1], in which persons with disabilities voluntarily participate in order to be evaluated by qualified teams that determine the degree of disability according to different parameters defined by law. The evaluated persons with a degree of disability equal to or greater than 33% are those who are legally considered persons with disabilities and who receive an official disability certificate. The SDPD includes all the persons who have been evaluated through this procedure. As the project is built on the SDPD, the definition of disability is delimited by the legal definition used for the SDPD, which mainly checks for illnesses and impairments and is different from the WHO recommendation, although it is important to note that in many cases both definitions coincide.

2.2.2. SCOPE (established by the sources of information used)
The scope of the project is delimited by the EAPS and SDPD scopes.
- EAPS scope and sampling: the EAPS is a quarterly survey whose scope covers the population living in family dwellings, excluding group or collective dwellings (hospitals, residences, barracks, etc.) and secondary or seasonal dwellings (used during holiday periods, at weekends, etc.). The survey uses two-stage sampling with first-stage unit stratification. The first-stage units are the census sections (areas established for electoral purposes). The second-stage units are family dwellings: information is collected on all persons regularly living in the dwelling. The total sample, formed by around 65.000 households, is divided into six subsamples. Family dwellings are partially renewed every quarter of the survey, in order to avoid tiring the families. Each quarter, the dwellings in the sections of a specific subsample are renewed.
- SDPD scope: it only includes people who have freely and voluntarily requested the evaluation procedure. In return, it is a census that includes the whole population with legally and officially recognised disability, and it has a high degree of reliability because, by law, the evaluating teams are formed by doctors, psychologists and social workers.
- “The employment of persons with disabilities” scope: because of the nature of the project, it is defined by the EAPS and SDPD scopes (defined above), so it includes:

- Persons with a disability certificate
- Persons living in family dwellings (not collective dwellings)
- Persons between 16 and 64 years old.

3. Treatment of the information

[1] “Real Decreto 1971/1999, de 23 de diciembre, de procedimiento para el reconocimiento, declaración y calificación del grado de discapacidad.Real Decreto 1856/2009, de 4 de diciembre, de procedimiento para el reconocimiento, declaración y calificación del grado de discapacidad, y por el que se modifica el Real Decreto 1971/1999, de 23 de diciembre”.

The objective is to link the EAPS and the SDPD to obtain information on “Disability and Employment”. The information is joined at microdata level: every unit in the EAPS sample is completed with information about disability (degree, impairment, severity) from the SDPD, so that the EAPS estimators can be applied either to the collective of persons with disabilities or to the population without disabilities. Unfortunately, when integrating data from different sources, several problems and conflicts may appear. Firstly, the SDPD data lack a unique and global identifier that permits the linking operation. Furthermore, the data are not carefully controlled and are affected by many factors, including data entry errors, lack of a standard format, incomplete information, or any combination of these. Finally, as the initial sample was designed for the EAPS objectives and not for estimating data on disability, and as disability is a phenomenon that affects a small percentage of the population (2,8% of the population between 16 and 64 years old), the weights have to be readjusted to obtain reliable estimates on disability and employment. To achieve the objective of the project, several tasks have to be undertaken:
- Data preparation
- Field matching
- Choice of appropriate weights

3.1. Data preparation
- EAPS data: the survey data are consistent and correct, because they have been previously checked. The only unresolved question in the EAPS is the lack of an identifier for each person. There is a specific project in the INE (managed by the EAPS area) whose mission is to assign identifiers to EAPS records by matching the survey records with the Population Register. These identifiers are provided in time to undertake the project “The employment of persons with disabilities”. In summary, the EAPS information is ready to be linked with other sources.

- SDPD data: data from the SDPD must be standardized and cleaned. In particular, the outstanding tasks for preparing the SDPD to be linked with the EAPS are:

3.1.1. Standardizing identifiers (in SDPD data):

The SDPD identifier (IDEN) is recorded in a non-standard format. To standardize it, a unique order number is assigned to each record in the SDPD. After that, every numerical code inside the non-standard identifier is extracted from the SDPD and considered as a candidate for crossing with the EAPS and with the Population Register[2] (PR). Hence, every SDPD record can have zero, one or more candidate standard identifiers, and all of them are used in the joins with the different sources.

[2] The Population Register (PR) is the administrative register in which inhabitants are recorded. Its data constitute proof of residence in the municipality. Everybody who resides in Spain is obliged to register in the Municipal Register of the municipality in which they habitually reside. Anyone who lives in several municipalities has to register only in the one in which they spend more time over the year. Amongst others, the PR includes the variables identifier (identity card, foreign identity number or passport), name and surname, address, incidences (including death), and type of dwelling (normal/collective).

These joins are made taking into account not only the standard identifier candidates but also the birth date, the presence of duplications and the locality. When the join is accepted, the standard candidate that has been matched is chosen as definitive and stored for future queries or joins (COD_FIN).

OBTAINING A STANDARD IDENTIFIER

NORDER       IDEN (non-standard)   NUM1       NUM2       NUM3   COD_FIN (chosen)   Result
0100000001   'LLLLLLL'             -          -          -      -                  -
0100000002   '0'                   00000000   -          -      -                  -
0100000003   '12345678L'           12345678   -          -      12345678           Crosses
0100000004   '123/12345678L'       00000123   12345678   -      12345678           Crosses
0100000005   '12345/1234L'         00123456   00001234   -      -                  Does not cross

(NORDER = order number; IDEN = non-standard identifier; NUM1-NUM3 = cross candidates; COD_FIN = chosen standard identifier)
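As an illustration of this extraction step, the sketch below (in Python) pulls every run of digits out of a non-standard identifier and pads it to the 8-digit standard length. The function name, the padding rule and the sample inputs are illustrative assumptions, not the actual INE routine.

```python
import re

def extract_candidates(iden: str, width: int = 8) -> list:
    """Extract every run of digits from a non-standard identifier (IDEN) and
    left-pad it to the standard length, giving the candidate standard
    identifiers used for crossing with the EAPS and the Population Register."""
    runs = re.findall(r"\d+", iden or "")
    return [run.zfill(width) for run in runs if len(run) <= width]

# Hypothetical identifiers in the same spirit as the table above
for iden in ["LLLLLLL", "12345678L", "123/12345678L"]:
    print(iden, "->", extract_candidates(iden))
# LLLLLLL -> []
# 12345678L -> ['12345678']
# 123/12345678L -> ['00000123', '12345678']
```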

3.1.2. Checking duplications (in SDPD data):

The SDPD data are managed by the autonomous communities (Spanish regions). There can be cases in which the same person with a disability has been evaluated at two different moments in two different autonomous communities. These cases are duplicated registers in the SDPD. To detect duplications, both the standard and the non-standard identifier are taken into account, together with the birth date (BD):

DUPLICATED BY IDEN & BIRTH DATE (BD)
NORDER       REG   IDEN         BD
3900000306   06    2707608      19570723
3900000307   06    2707608      19570723
1100000579   01    D015382294   19640107
2100000284   01    D015382294   19640107
3300032878   03    D015382294   19640107

DUPLICATED BY COD_FIN & BD
NORDER       REG   IDEN         COD_FIN   BD
3600021773   12    X00104798    104798    19520214
3600021786   12    X0104798     104798    19520214
0300000057   10    377512       377512    19530622
1800000018   01    D000377512   377512    19530622
2800003408   13    00377512J    377512    19530622

DUPLICATED BY COD_FIN or IDEN & BD
NORDER       REG   IDEN         COD_FIN    BD
0400008988   01    D031789529   31789529   19450328
1200008515   10    31789529     31789529   19450328
4600048258   10    31789529     31789529   19450328
1100016693   01    D031789529   .          19450328


To solve duplications, the following ordered rules are applied; the register chosen is the one whose:
1. SDPD location coincides with the updated PR location;
2. SDPD location coincides with the 2006 PR (2006 is the reference period for the SDPD; June the 31st is the reference date used for the 2008 PR);
3. SDPD location coincides with the birth location in the PR;
4. SDPD province coincides with the PR province;
5. date of update in the SDPD is the most recent;
6. the remaining duplications are solved randomly.
Note: the latest update is used in the last place (before random selection) because the variable ‘update’ in the SDPD is missing in most of the registers. With this process, only 5,7% of the duplications are solved randomly, and the solutions can be reused from one year to the next.
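A minimal sketch of how such an ordered rule cascade could be applied to a group of duplicated SDPD registers is shown below; the record fields and rule functions are hypothetical, and the "most recent update" criterion is only noted in a comment.

```python
import random

def resolve_duplicates(records, rules, rng=random.Random(0)):
    """Select one register from a group of duplicates by applying an ordered
    list of rules; each rule narrows the candidate set, and a random draw is
    used only if the rules leave more than one candidate."""
    candidates = list(records)
    for rule in rules:
        kept = [r for r in candidates if rule(r)]
        if len(kept) == 1:
            return kept[0]
        if kept:
            candidates = kept
    # In the real procedure, the register with the most recent SDPD update
    # date would be preferred here before falling back to a random choice.
    return rng.choice(candidates)

# Hypothetical rule set mirroring the ordered criteria above
rules = [
    lambda r: r.get("sdpd_local") == r.get("pr_local_current"),
    lambda r: r.get("sdpd_local") == r.get("pr_local_2006"),
    lambda r: r.get("sdpd_local") == r.get("pr_birth_local"),
    lambda r: r.get("sdpd_prov") == r.get("pr_prov"),
]
```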

3.1.3. Checking valid values (in SDPD data):

The errors detected are:
- Invalid identifiers: in some cases the IDEN variable contains the birth date. These cases are detected and marked as possibly wrong identifiers that should be treated carefully if they are matched with some other source.
- Locality standardization: the SDPD uses an old and obsolete system of codes for the variable locality (LOCAL). The INE therefore assumes the task of translating this old-fashioned system into the current standard one.
- Age and sex correction: for all the registers linked with the PR, the inconsistencies between the variables birth date (BD) and SEX are reviewed.
a) If the register does not cross with the EAPS, priority is given to the POPULATION REGISTER information over the SDPD one. Moreover, for the SEX variable, the inconsistencies between the SDPD and the PR are solved by kind of name: a list of the names with inconsistencies is obtained and each name is classified as ‘female name’, ‘male name’ or ‘unisex name’. For female/male names the sex is assigned according to the gender of the name. For unisex names, a detailed one-by-one revision makes it possible to establish the sex of the person on the basis of the relationships between the members of the household. Finally, in case of doubt, the PR value is chosen.
b) If the register crosses with the EAPS, then:
b1) the information that coincides in two of the three sources is chosen;
b2) if there is no coincidence among the three sources, each disagreement is revised in order to make a choice, always under the following order of priority: first the EAPS information, secondly the POPULATION REGISTER information, and finally the SDPD information.

3.1.4. Deleting registers:

Finally, there are three classes of registers that have been removed from the SDPD:

- Out-of-age-range deletions: after the sex and age variables are corrected, persons with an age outside the range [16,64] are deleted.
- Deleting dead persons: for all the registers linked with the PR, it can be determined whether the person is alive. If the person has died, the register is considered out of scope.
- Removing persons living in collective dwellings: the EAPS scope excludes people living in collective dwellings, so, to keep consistency, the project should delete from the SDPD the people living in these kinds of dwellings. The INE has a DIRECTORY of CENTERS that was used in the Disabilities, Independence and Dependency Situations 2008 Survey and that is now crossed with the SDPD through the variable street-code. This code is assigned to the SDPD through a distance function between the SDPD street literal and the code-list street literal. Finally, the cross is reviewed using the PR value on collective/normal household.

CLEANING THE SDPD
Total number of registers (initially)          1.048.838
Persons living in collective establishments       17.393
Removed duplicates                                 19.039
Deaths (up to 2008)                                96.326
Age out of range (16-64)                           33.537
Final number of registers                         892.455

3.2 Matching

The objective is to merge the EAPS registers with the SDPD ones to obtain the subsample of persons with disabilities inside the sample of EAPS.

Four ways for matching these registers are considered:

3.2.1. Match 1: through the standard identifier

The SDPD registers which have a numerical code as standard identifier are merged with the EAPS through this code (COD_FIN). To decide whether the matching is valid or not, the variables birth date (BD) and locality code (LOCAL) are taken into account, as shown below.

SUMMARY OF SDPD REGISTERS MATCHED WITH EAPS THROUGH COD_FIN

Freq    Coincidences (by CODFIN-BD-LOCAL)   Description and treatment
5.035   Coincidences by CODFIN              Total of matches between the SDPD and the EAPS
3.380   CODFIN+AAAAMMDD+PPMMM               Correct matches: revised through a sample
  752   CODFIN+AAAAMMDD+PP
  222   CODFIN+AAAAMMDD
  133   CODFIN+AAAAMM+PPMMM
   30   CODFIN+AAAAMM+PP
   11   CODFIN+AAAAMM
   56   CODFIN+AAAA+PPMMM                   Correct matches: revised exhaustively
   24   CODFIN+AAAA+PP
    6   CODFIN+AAAA
   54   CODFIN+AAA+PPMMM
   13   CODFIN+AAA+PP
   40   CODFIN+AAA
  100   CODFIN+PPMMM                        Family relations: revised exhaustively
   37   CODFIN+PP                           (to find the correct CODFIN through the MR)
  177   CODFIN                              Not valid matches, rejected

CODFIN = standard identifier; BD = birth date (year AAAA + month MM + day DD); LOCAL = locality code (province PP + locality MMM)
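The coincidence levels in the table can be thought of as agreement patterns between the two sources. The sketch below (with assumed field names) labels a COD_FIN match by how much of the birth date (AAAAMMDD) and the locality code (PPMMM) also coincide; it is an illustration of the grouping logic, not the software actually used.

```python
def agreement_pattern(eaps: dict, sdpd: dict) -> str:
    """Label a COD_FIN match by the degree of agreement on the birth date
    (AAAAMMDD) and the locality code (PPMMM), mirroring the groups above."""
    parts = ["CODFIN"]
    bd_e, bd_s = eaps["bd"], sdpd["bd"]          # strings 'AAAAMMDD'
    loc_e, loc_s = eaps["local"], sdpd["local"]  # strings 'PPMMM'
    for length, label in ((8, "AAAAMMDD"), (6, "AAAAMM"), (4, "AAAA"), (3, "AAA")):
        if bd_e[:length] == bd_s[:length]:
            parts.append(label)
            break
    if loc_e == loc_s:
        parts.append("PPMMM")
    elif loc_e[:2] == loc_s[:2]:
        parts.append("PP")
    return "+".join(parts)

print(agreement_pattern({"bd": "19570723", "local": "28079"},
                        {"bd": "19570722", "local": "28001"}))  # CODFIN+AAAAMM+PP
```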

3.2.2. Match 2: through the birth date and the locality code

The SDPD registers for which there is no standard identifier are investigated through the birth date (BD) and the locality code (LOCAL) as follows:
a) For each pair (BD, LOCAL) in the EAPS, its frequency in the survey is obtained.
b) The SDPD registers without a standard identifier are merged with the EAPS through the variables BD and LOCAL.
c) The cases from b) whose (BD, LOCAL) frequency in the EAPS equals 1 are investigated in the Population Register (PR). If their frequency in the PR is also 1, this means that in Spain there is only one person with these features and the match is correct.
d) The cases from b) whose (BD, LOCAL) frequency in the EAPS equals 2 and which correspond to persons of different sex are investigated in the Population Register. If the frequency in the PR is also 2, the sex makes it possible to distinguish which of them is the correct one and, again, this means that in Spain there is only one person with these features, so the match is correct.
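Cases a) to c) can be expressed compactly with pandas; the sketch below assumes columns named bd and local in the three data sets and only covers the uniqueness check of case c), leaving the sex-based disambiguation of case d) aside.

```python
import pandas as pd

def match_by_bd_local(eaps: pd.DataFrame, sdpd: pd.DataFrame, pr: pd.DataFrame) -> pd.DataFrame:
    """Match SDPD registers without a standard identifier to EAPS records on
    (birth date, locality), keeping only pairs that are unique both in the
    EAPS sample and in the Population Register (case c above)."""
    key = ["bd", "local"]
    n_eaps = eaps.groupby(key).size().rename("n_eaps")
    n_pr = pr.groupby(key).size().rename("n_pr")
    candidates = sdpd.merge(eaps, on=key, suffixes=("_sdpd", "_eaps"))
    candidates = candidates.join(n_eaps, on=key).join(n_pr, on=key)
    return candidates[(candidates["n_eaps"] == 1) & (candidates["n_pr"] == 1)]
```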

3.2.3. Match 3: through the non-standard identifier

The SDPD registers that do not have a standard identifier are merged with the EAPS through the non-standard code (passports). Although this concerns only a small number of cases, some of them can turn out to be correct matches.

3.2.4. Match 4: investigation of strange identifiers in small localities

SDPD registers with “strange identifiers” that are located in localities that are in the EAPS sample and whose population is below 30.000 can be investigated in the Population Register to determine whether some of them really match with the EAPS. Strange identifiers are those that in the SDPD have:
- an identifier equal to a missing value, or
- the birth date inside the identifier code, or
- an identifier that contains “O” instead of “0”, or
- identifiers with code numbers longer than 8 digits (8 is the length of the standard identifier in Spain).
As shown below, the main part of the sample of persons with disabilities obtained from the merging of the SDPD and the EAPS comes from the standard identifier (match 1), but part of the sample can be obtained from the other kinds of investigation and, as the sample size of persons with disabilities is small, it is worth carrying out all the methods explained above.

EAPS ∩ SDPD SAMPLE
MATCH 1   98,0 % of the total match
MATCH 2    0,7 % of the total match
MATCH 3    0,2 % of the total match
MATCH 4    1,1 % of the total match

3.3 Estimation and weights

The expression of the EAPS estimator for a specific characteristic Y in a certain quarter of the survey is as follows:

$$\hat{Y} = \sum_{h} F_h Y_h = \sum_{h} F_h \sum_{i=1}^{n_h} y_{hi}$$

where the sum over h extends to the strata of a province, an autonomous community or the national total, and:
$F_h$ is the weight for stratum h;
$n_h$ is the number of persons in the sections of the sample in stratum h;
$y_{hi}$ is the value of the characteristic researched for the i-th person of stratum h.
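A toy numerical illustration of this estimator (with invented weights and values, not EAPS data) is given below.

```python
# Each stratum h carries a weight F_h and the observed values y_hi
# (here y = 1 if the person has the characteristic, 0 otherwise).
strata = {
    "h1": {"F": 250.0, "y": [1, 0, 1, 1]},
    "h2": {"F": 310.0, "y": [0, 1, 0]},
}

# Y_hat = sum_h F_h * sum_i y_hi
y_hat = sum(s["F"] * sum(s["y"]) for s in strata.values())
print(y_hat)  # 250*3 + 310*1 = 1060.0
```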

Disability is an atypical phenomenon that affects a small percentage of the population, and the EAPS is a survey designed to obtain labour market results, not disability figures. It is therefore expected that the sample of persons with disabilities obtained from the integration of the SDPD with the EAPS might not be large enough.

To determine whether the sample size is sufficient, it is compared with a sub-scope of the EAPS for which quarterly results are delivered. The reference sub-scope chosen is the “region”, which in Spain is defined by the autonomous communities:

QUARTERLY SAMPLE SIZE IN EAPS BY REGION
(sample size n obtained to estimate the total population N)

Region                          n         n/N
Total (Spain)                   141.118   0,45%
Andalucía                        24.821   0,45%
Aragón                            6.305   0,73%
Asturias (Principado de)          4.150   0,57%
Balears (Illes)                   3.384   0,46%
Canarias                          7.316   0,50%
Cantabria                         3.825   0,97%
Castilla y León                  14.303   0,86%
Castilla - La Mancha             10.216   0,77%
Cataluña                         14.644   0,29%
Comunitat Valenciana             11.715   0,35%
Extremadura                       5.729   0,80%
Galicia                           9.686   0,52%
Madrid (Comunidad de)             7.355   0,17%
Murcia (Región de)                4.486   0,46%
Navarra (Comunidad Foral de)      3.134   0,76%
País Vasco                        6.586   0,45%
Rioja (La)                        2.465   1,16%
Ceuta (Ciudad Autónoma de)          552   1,14%
Melilla (Ciudad Autónoma de)        446   0,97%

Persons with disabilities in the QUARTERLY EAPS sample: nD = 3.000; nD/ND = 0,33%.

As shown in the figures, some regions whose estimates are provided quarterly have a sample size similar to that of the persons with disabilities.

The annual breakdown (ANNUAL SAMPLE SIZE IN EAPS BY REGION) shows the same regional figures; for persons with disabilities in the ANNUAL EAPS sample:
nU = 5.000 (UNION = number of different units); nU/ND = 0,56%
nS = 12.000 (SUM = total number of interviews); nS/ND = 1,38%

In this way, annual data will provide more reliable estimates. The estimator used is therefore the annual average, calculated as the average of the quarterly estimators:

$$\hat{Y} = \frac{1}{4}\left(\hat{Y}_1+\hat{Y}_2+\hat{Y}_3+\hat{Y}_4\right) = \frac{1}{4}\left(\sum_{h} F_{h1}Y_{h1}+\sum_{h} F_{h2}Y_{h2}+\sum_{h} F_{h3}Y_{h3}+\sum_{h} F_{h4}Y_{h4}\right) = \sum_{j=1}^{n_1+\dots+n_4}\frac{F_j}{4}\,Y_j$$

where:
$n_i$ = sample size in quarter i, for i = 1 to 4;
$n_U$ = total number of different units in the sample = union of the four quarterly samples;
$F_{hi}$ = original weight of unit h if the unit is interviewed in quarter i, and 0 otherwise;
$Y_{hi}$ = value of the variable Y for unit h if the unit is interviewed in quarter i, and 0 otherwise;
j = re-numbering of the units, from 1 to $n_1 + \dots + n_4$.

With this method there are really four scopes, one for each quarter:

        SDPD registers aged [16,64]   Persons with disabilities in the EAPS sample
T1      883.010                       3.155
T2      876.038                       3.047
T3      869.727                       2.966
T4      863.421                       2.944
        (reference population)        AVERAGE SIZE = 3.028
                                      UNION = 5.056 = n_U
                                      SUM = 12.112 = n_1 + ... + n_4

The UNION gives the total number of different interviewed units (units common to several quarters are counted once); with it, the estimator can be written as a linear combination of the responses for each quarter. The SUM gives the total number of interviews, independently of the units (a unit can have from 1 to 4 interviews in a year); with it, the estimator can be written as a Horvitz-Thompson estimator whose weights are equal to the original ones divided by four.
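The equivalence of the two readings (average of quarterly estimators versus a Horvitz-Thompson estimator with the original weights divided by four) can be checked on a toy data set, sketched below with invented figures.

```python
import pandas as pd

# One row per interview; a unit can appear in up to four quarters.
interviews = pd.DataFrame({
    "unit":    ["a", "a", "b", "c", "c", "c", "c"],
    "quarter": [1, 2, 1, 1, 2, 3, 4],
    "F":       [200.0, 205.0, 150.0, 300.0, 298.0, 301.0, 299.0],
    "y":       [1, 1, 0, 1, 1, 0, 1],
})

# Reading 1: average of the four quarterly weighted totals
quarterly = (interviews["F"] * interviews["y"]).groupby(interviews["quarter"]).sum()
annual_avg = quarterly.reindex([1, 2, 3, 4], fill_value=0.0).mean()

# Reading 2: Horvitz-Thompson over all interviews with weights F/4
annual_ht = (interviews["F"] / 4 * interviews["y"]).sum()

print(annual_avg, annual_ht)  # both give the same annual estimate (325.5)
```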

3.3.2. Final weights

The survey process includes a calibration of the design factors using the following auxiliary variables inside each autonomous community:
- X1: population aged 16 years and over by age group and sex;
- X2: population aged 16 years and over by autonomous community and nationality (Spanish or foreign);
- X3: population aged 16 years and over by province.
In the case of estimates for persons with disabilities:
- the population with disabilities represents a small part of the total population (5,1% of the Spanish population aged [16,64]);
- as disability is not an objective of the EAPS design, it is expected that the subsample of persons with disabilities in the EAPS would produce underestimates in the results.

NUMBER OF PERSONS WITH DISABILITIES - FIRST TEST
         Estimation   SDPD (1)   Difference
TOTAL    717.221      960.403    -25,3%
MEN      406.659      540.294    -24,7%
WOMEN    310.562      420.109    -26,1%
(1) SDPD size at the initial cleaning state

Hence, to obtain data for persons with disabilities, the EAPS weights are calculated again by applying new auxiliary variables to reweight and adjust the survey estimates to the information from the SDPD in each autonomous community. As the EAPS results have already been delivered and published, a different number of employed/unemployed people in Spain cannot be obtained for the same period. It is therefore also necessary to adjust the survey estimates to the previously published results. So, finally, the EAPS reweighting procedure is recalculated including, inside each autonomous community (in the same way as the EAPS):
a. Adjustment to the same variables as the original EAPS
- X1. Population by province
- X2. Population by age and sex groups
- X3. Population by nationality (Spanish or foreign)
b. Adjustment to the main EAPS results (quarterly)
- E1. Number of employed by sex
- E2. Number of unemployed by sex
- E3. Number of inactive by sex
- E4. Number of households
c. Adjustment to the principal SDPD information
- B1. Disability population by sex
- B2. Disability population by age
- B3. Disability population by impairment
- B4. Disability population by severity
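The calibration itself is not spelled out in the text; the sketch below shows one common way to implement this kind of adjustment (iterative proportional fitting, or raking, over categorical margins). It is only a stand-in for the INE procedure, which also constrains continuous EAPS totals; the column names and the age-group split are assumptions, while the sex totals reuse the SDPD figures of the table above.

```python
import pandas as pd

def rake(df: pd.DataFrame, weight_col: str, margins: dict,
         n_iter: int = 50, tol: float = 1e-8) -> pd.Series:
    """Adjust design weights so that the weighted counts reproduce known
    category totals (simple raking over categorical margins)."""
    w = df[weight_col].astype(float).copy()
    for _ in range(n_iter):
        max_shift = 0.0
        for var, totals in margins.items():
            current = w.groupby(df[var]).sum()
            factors = pd.Series(totals) / current
            adjustment = df[var].map(factors)
            max_shift = max(max_shift, (adjustment - 1).abs().max())
            w = w * adjustment
        if max_shift < tol:
            break
    return w

# Sex totals taken from the SDPD table above; the age-group split is invented.
margins = {
    "sex": {"M": 540_294, "F": 420_109},
    "age_group": {"16-44": 380_000, "45-64": 580_403},
}
# df["final_weight"] = rake(df, "original_weight", margins)
```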

As a consequence of this calibration, the average weight for persons with disabilities increases, while the average weight for persons without disabilities decreases slightly.

AVERAGE EAPS WEIGHT
                                Original weights   Final weights
TOTAL                           280,5745           280,5714
Persons without disabilities    281,9268           280,5041
Persons with disabilities       234,5017           282,8635

4. Results, comparison with other sources

The results of “The employment of persons with disabilities” (EPD) can be compared with those of the Disabilities, Independence and Dependency Situations Survey (DIDSS-2008).

PERSONS WITH DISABILITIES AS A PERCENTAGE OF THE [16,64] POPULATION
                                     EPD-2008   DIDSS-2008
Total                                2,8%       4,8%
With legal disability certificate               4,3%

PERSONS WITH DISABILITIES AND ACTIVITY (total population, thousands)
                        EPD-2008   DIDSS-2008
Total                   873,3      1.322,2
Active - Total          292,3      408,0
Active - Employed       244,6      331,5
Active - Unemployed     47,7       76,5
Inactive                581,0      914,0
Not known               0,0        0,3

The main differences between the two surveys are:
- The target population, which in the DIDSS is the population with activity limitations and participation restrictions in everyday situations, while in the EPD it is the persons with a legal disability certificate.
- The way the information is collected: the DIDSS collects it from the self-declaration of the persons, who answered both whether they have limitations or restrictions and whether they have a disability certificate, while the SDPD data come from the official database of disability certificates. Self-declaration always overestimates the real results, because there are persons declaring certificates merely because they have been evaluated (although they do not reach a 33% degree of disability), or merely because they have some kind of physical problem (without holding an official certificate).
- The different collection periods: the complete year 2008 in the EPD, versus November 2007 to February 2008 in the DIDSS.
In summary: different objectives, different samples and different collection periods. As a consequence, the DIDSS shows higher disability figures than the EPD.

However, without forgetting the differences in definitions and scopes, and taking into account that the main objective of the EPD is to provide information about the activity of persons with disabilities, the appropriate way to compare the EPD with the DIDSS is in percentage terms.

The comparison between the percentages of persons with disability who are employed, unemployed or inactive shows similar results in both surveys

5. Conclusions

“The employment of persons with disabilities” is a survey that has been carried out with a low-cost and efficient method. It is a model of the use of administrative sources that provides reliable and periodic figures on a priority variable that is the object of social and labour policies.

Comparative analysis of different income components between the administrative records and the Living Conditions Survey

Jose Maria Mendez Martin National Statistics Institute (INE-Spain) [email protected]

Abstract:

The Encuesta de Condiciones de Vida (Spanish SILC Survey) is an annual survey carried out by the National Statistics Institute (INE-Spain). The primary aim of this survey is the systematic production of statistics on household income and living conditions. The survey, which is harmonised across EU countries by a Community Regulation, provides comparable data about the level and composition of poverty and social exclusion.

Access to administrative records offers a good opportunity to improve the quality of income data and allows the use of a more efficient collection method. This paper offers a comparative analysis of different income components by linking the survey data – at microdata level using the Spanish Tax ID number (NIF) – with available data from the Spanish Tax Agency or Social Security system.

Keywords: Living Conditions Survey, administrative records, household income

1. Introduction

A difficult task in household surveys is the collection of income data through personal interviews. This type of variable usually has a high rate of partial non-response, and therefore imputation is needed to calculate the total disposable household income. Besides, in SILC, income must be recorded both gross and net, and in many cases the respondent cannot give gross amounts. Gross amounts must then be obtained using net-gross conversion models.

Access to administrative registers would give us the opportunity to improve the quality of income data and reduce the respondent burden. The link between the individuals in the sample and the data available at the Tax Agency or the Department of Social Security, at microdata level, would provide us with detailed information on the majority of income components.

There are several methodological issues that need to be addressed when accessing this type of data, including the availability of a NIF (the common personal identification variable in the SILC and the administrative records) and the mapping of the concepts used in the SILC onto those of the administrative sources.

Until the 2008 SILC, the data collection process did not include the entry of NIFs (personal identification). A list of households was used for data collection, to which a reference person was assigned. For this study, data from the 2007 SILC were used and the NIF was assigned afterwards. It was possible to obtain NIFs in approximately 80% of cases. These records were linked with Social Security data on social benefits and with data from the Tax Agency on different income components.

Since the 2009 SILC, data collection has been adapted to make use of the municipal register of inhabitants, indicating the people registered in the household (with their associated details, full name, date of birth, NIF, etc). A NIF will be available for approximately 98% of adults.

This study makes a comparative microdata analysis of a selection of household income components using data from the 2007 SILC. The information collected in the survey is compared to the data available in the administrative records. A study is attached at the end on the impact of the use of administrative records on the basic indicators obtained from the SILC. The results presented here should be interpreted with caution due to their partial coverage, given NIF availability in the 2007 survey.

We would like to thank the Spanish Tax Agency and the Department of Social Security for their invaluable assistance in providing the necessary information for this study. We would also like to express our gratitude to the various units of the INE for their support in this project.

2. Analysis of Social Security information

2.1. Information from the Social Security system

Social Security databases have relevant information about social benefits paid to households. There is information in a centralized Register (Social Benefits Register) about social benefits paid by different public bodies (Social Security, Autonomous Communities, Other Public Bodies).

A very precise statistical classification must be adopted for social benefits. The social benefits included in the SILC must be converted following a classification based on ESSPROS (European system of integrated social protection statistics), which harmonises the presentation of data on social protection.

2.2. Comparative analysis

The information of the 2007 SILC survey was linked with Social Security data on the social benefits paid to people aged 65 and over (NIFs were available for 82% of this group).

In the first analysis, differences are observed in the type of benefit received. For example, some benefits are considered by the survey to be non-contributory old-age benefits, while Social Security records consider them to be contributory old-age or survival benefits.

Comparison of amounts. A certain underreporting can be seen in the amounts of social benefits included in the SILC, as shown in the graph of the distribution of the relative difference, at microdata level, between the value of the amount in the administrative file and the value of the amount in the survey.

Figure 1: Social benefits. Difference between the Soc. Sec. system and the survey
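The relative difference plotted in Figure 1 can be computed per linked person as sketched below; the column names, and the choice of the administrative amount as the denominator, are assumptions.

```python
import pandas as pd

def relative_difference(linked: pd.DataFrame) -> pd.Series:
    """Relative difference, per linked person, between the benefit amount in
    the administrative file and the amount reported in the survey; positive
    values indicate underreporting in the survey."""
    return (linked["amount_admin"] - linked["amount_survey"]) / linked["amount_admin"]

# relative_difference(linked).describe() summarises the distribution
# of the kind plotted in Figure 1.
```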

3. Analysis of Tax Agency information

3. 1. Information from the Spanish Tax Agency

The information contained in personal income tax returns is detailed enough to work out the various components of income for the households in the sample. However, there may be some difficulties: firstly, there is a rather large group of people who are not required to file returns and, secondly, the possibility of filing joint returns can make it difficult to identify individual incomes, which is almost always necessary with the SILC.

As a result, access to other information available at the Tax Agency is required. Besides personal income tax returns, the Tax Agency has a series of self-assessment forms containing very valuable data, as well as information returns presented by withholders, which even include tax-exempt income or income on which no withholdings have been made.

Specifically, the information supplied to INE in this study includes:

- Filed returns (individual and joint). These returns contain data on income broken down into different components.

- Imputed individual returns (individual tax information). These contain individual information for certain sources of income, based on information from the Tax Agency.

The geographical scope is Spain, with the exception of the Basque Country and the region of Navarre.

In relation to the data transmissions between the National Statistics Institute and the Tax Agency, a specific procedure was used for these tests. Nevertheless, in the future production of the Spanish SILC we will implement a secure connection with the Tax Agency using a web service. The Tax Agency provides these web services for the supply of information to public administrations for non-tax-related purposes. Using a secure internet connection, the National Statistics Institute sends the personal identifications (NIF) and the Tax Agency immediately returns the requested information.

3.2. Comparative analysis

3.2.1 Interest, dividends and profits from capital investment

All adults were taken from the survey (28,656). After eliminating those residing in Autonomous Communities with the charter system (leaving 26,237), NIF availability gave a coverage of 79%, or 20,677 people.

Investment income is analysed on a per-household basis. Hence, we selected the households in which a NIF was available for all of its adult members. This gave a total of 15,804 people (76% of the previous figure).

If we exclude small amounts, we see that a large percentage of households claiming to have no investment income in the survey actually do according to the Tax Agency.

Some households also indicate in the survey that they have income from investments, but actually do not according to the Tax Agency. This is possibly due to the inclusion of investment funds, which the Tax Agency considers as capital gains.

Table 1: Distribution of households by investment income (SILC and Tax Agency) (income over EUR 100) (sample data). Horizontal percentages

Survey                            Number of obs.   Total   T1. With investment income   T2. Without investment income
E1. With investment income        1,052            100.0   83.7                         16.3
E2. Without investment income     6,271            100.0   34.7                         65.3
Total                             7,323            100.0   41.7                         58.3
(Columns T1/T2 refer to the Tax Agency classification.)

Comparison of amounts. If we analyse the distributions of the two sources, we can see a significant underreporting in the amounts of investment income in the survey.


3.2.2 Employee income and self-employment income

In this analysis, to avoid any overlap with social benefits (which are also treated as earnings from employment in the income tax system), we selected from the survey all people aged 18 to 64 who stated that they were employed or self-employed for all 12 months of the year and did not receive social benefits. This gave a total of 12,047 people.

Of this figure, those residing in Autonomous Communities with the charter system were eliminated (leaving a total of 10,954 individuals). This gave a coverage of NIF availability of 79%, which left 8,613 people in the end. The analysis in this section is on a per-person basis.

Earnings from employment can be classified as earnings from salaried employment (employee income) or as earnings from self-employment (self-employment income). There is not a complete correspondence between the two sources for this classification, since some businessmen and women set up companies and are listed as employees by the Tax Agency. It is also possible that workers who are self-employed according to the Tax Agency and who, for example, work for a single client, may be seen as salaried employees in the SILC.

Table 2: Distribution of individuals by earnings from salaried employment or self-employment (SILC and Tax Agency) (sample data). Percentages

Survey                                              Total   T1. Only      T2. Only      T3. Salaried and   T4. No earnings
                                                            salaried      self-empl.    self-employment    from work
E1. Only earnings from salaried employment          80.9    71.2          0.6           4.7                4.4
E2. Only earnings from self-employment              14.9    3.2           8.0           2.3                1.4
E3. Earnings from salaried and self-employment       1.5    0.4           0.2           0.8                0.0
E4. No earnings from work                             2.7    0.5           1.3           0.4                0.5
Total                                               100.0   75.3          10.1          8.3                6.3
(Columns T1-T4 refer to the Tax Agency classification.)

A separate comparative study will now follow of earnings from salaried employment and self-employment.

Self-employment income

Comparison of amounts. A significant underreporting can be seen in the Tax Agency amounts of earnings from self-employment, as shown in the graph of the distribution of the relative difference, at microdata level, between the value of the amount in the administrative file and the value of the amount in the survey.

Figure 2: Self-employed. Difference between the Tax Agency and the survey

Note that in the case of objective tax assessment (modules system), the “net reduced earnings” were taken as income for the Tax Agency, although these are actually an imputation of profit from the activity.

Employee income

For earnings from salaried employment, a separate study of the formal and informal economies is conducted,1 given that a different behaviour is detected. In the case of the formal economy, regular earnings lead to a similar situation to that of social benefits. In the case of the informal economy, the situation could go in the direction of earnings from self-employment.

Comparison of amounts. An underreporting is seen in the salary amounts of the Survey in the formal economy and a slight underreporting is seen in the salary amounts of the Tax Agency in the informal economy.

1 This study adopts a basic breakdown of the formal and informal economies, based on economic activity and the number of persons working at the local unit of activity:
- Informal economy: local units with 10 workers or fewer, or economic activity (NACE Rev. 1) in (1, 5, 14, 18, 19, 22, 29, 31, 36, 37, 45, 50, 51, 55, 63, 67, 70, 72, 74, 91, 93, 95)
- Formal economy: others

4. Impact of the use of administrative records on indicators

We will now study the potential impact of using administrative files on the basic indicators produced from the Living Conditions Survey. Where possible, this simulation will attempt to replace the survey data with the data from the administrative file. If this substitution cannot be made, the original survey value will be left. No records are eliminated.

The basic indicators of the SILC based on household income are of two types: firstly, indicators measuring the distribution of income (relative poverty rate, Gini coefficient, etc.) and, secondly, indicators based on level of income (average income, poverty threshold, etc). This report will analyse the impact of using administrative records on the relative poverty rate (broken down by age brackets) and on the average equivalised household income.

Table 3 contains the indicators, using different sources of income. The first column contains the original survey results, together with the 95% confidence intervals. We then take the value of social benefits obtained from the Social Security system and recalculate the indicators.

The last two columns incorporate information from the Tax Agency, taking investment income and earnings from salaried employment and self-employment (in the case of self-employment, we take the maximum of the amount recorded in the survey and the amount indicated by the Tax Agency). The last column calculates the indicators using the methodology of the maximum amount for earnings from salaried employment in the informal economy.

Table 3: Impact of the use of administrative records on indicators (poverty rate and average equivalised household income)

Sources of income used:
(A) Survey (original), with 95% confidence interval (lower end, upper end)
(B) With social benefits (Soc. Sec.)
(C) With social benefits (Soc. Sec.), investment income, self-employment (maximum) and salaries (Tax Agency)
(D) With social benefits (Soc. Sec.), investment income, self-employment (maximum) and salaries (maximum in the informal economy) (Tax Agency)

                                       (A) Survey   Lower end   Upper end   (B)      (C)      (D)
Poverty rate - Total                   19.7         18.3        21.1        19.7     19.6     19.9
Poverty rate - Under 16                23.4         19.9        26.9        23.8     24.6     24.4
Poverty rate - 16 to 64 years          16.8         15.5        18.1        17.0     16.9     16.8
Poverty rate - 65 years and over       28.5         25.4        31.6        27.0     26.1     28.3
Average equivalised household income   13,613       13,293      13,933      13,674   14,202   14,539

The table above shows that:

- If social benefits from the Social Security system are included, the relative poverty rate of older people is reduced, since the amounts in the administrative file were higher on average. The reduction is not significant and remains within the confidence interval.

- If we also take the information from the Tax Agency, the situation is close to the original one. In the last column, we take the earnings from salaried employment, making a distinction between the formal and informal economies (for the formal economy, the data is taken from the administrative file and, for the informal economy, the maximum is taken from the administrative file and the survey data) and, in the case of earnings from self-employment, we take the maximum of the amount recorded in the survey and the profit declared to the Tax Agency.

- In relation to the average equivalised household income, it increases with the change in methodology (the recording of earnings progressively improves) obtaining a significantly higher value than the original one.

5. Conclusions

In this paper, we present the preliminary studies on the analysis of the linking of information on household income from the Living Conditions Survey and from data contained in administrative records.

For each component of income, we observe different situations in the comparison of the income amounts and in the classification of the income recipient.

In the calculation of the basic indicators using administrative sources, we see that the use of administrative records does not appear to have a significant impact on indicators based on distribution of income. However, it does have an impact on indicators based on income level as it significantly increases their value.

References

Regulation (EC) No. 1177/2003 of the European Parliament and of the Council of 16 June 2003 concerning Community statistics on Income and Living Conditions (EU-SILC). Official Journal of the European Union (Law), Vol. 46, No. 165 (3 July 2003).
INE. Encuesta de Condiciones de Vida. Metodología. www.ine.es
INE. Encuesta de Condiciones de Vida. La pobreza y su medición. Análisis de la renta y el gasto de los hogares. www.ine.es

Administrative data as input and auxiliary variables to estimate background data on enterprises in the CVT survey 2011

Eva-Maria Asamer Statistics Austria, Guglgasse 13, Vienna, [email protected]

Abstract: In this paper an example of adding information from administrative data is introduced. A step-by-step method to transfer information from one survey to another, using administrative data they have in common, is described in detail. For the example of total hours worked, it is shown that this method, using days worked as an auxiliary variable, leads to good results at enterprise level.

Keywords: administrative data, combining surveys, linear regression

1. Introduction

Our approach of combining different, already existing administrative sources as well as integrating other survey data to reduce the response burden will be presented for the 4th Continuing Vocational Training Survey (CVTS4). For the CVTS4 in Austria a number of items are not collected directly, but added using administrative sources. This includes the total number of persons employed, total labour costs and the total number of initial vocational training participants. These variables can be determined using mainly administrative sources, like the social security and tax registers. Those data are linked on enterprise level. For data with missing links, other sources are investigated. In contrast, the total number of hours worked cannot be determined by administrative data directly. For CVTS, a procedure to estimate this variable using other surveys, with administrative data (primarily the number of days worked full time and part time per enterprise) as auxiliary variables, was developed. Here, different reference periods and unequal definitions of the population have to be taken into account. This year the total number of hours worked is being collected directly as well as being estimated, providing the opportunity to evaluate the estimation procedure. Furthermore the estimated values are used for control and imputation. The matching and estimation procedures along with first results will be discussed.

2. Data Sources

The Survey on continuing vocational training (CVTS) is an EU survey performed every five years, asking a variety of questions on initial and continuing vocational training in enterprises. Apart from these variables a number of structural variables are asked such as the NACE (Nomenclature générale des activités économiques dans les Communautés européennes) category or the number of people employed. Further, the total labour costs

and the total hours worked are asked as reference values, as well as to determine the indirect costs of vocational training due to the non-productive time spent attending a training course. To ease the response burden for enterprises, as many variables as possible are added using existing administrative data. This is especially important as it is an optional survey: the more questions that have to be completed, the less likely it is that the form will be completed at all.

For business data, a variety of administrative data is available in Austria. The main sources, social security and tax information, as well as data of the chamber of commerce are included in the statistical business register (BR) maintained by Statistics Austria. Every enterprise has a unique key in this register, and foreign keys from external data sources are matched to this key.

In preparation for the first register-based census in Austria in 2011, an Austrian Activity Register (EVA, German: Erwerbstätigen-Versicherten-Arbeitslosen-Datenbank) was built. Here, data from the social security, tax and unemployment registers are available on a personal basis, containing an anonymous key. In EVA, the business register key is available as a foreign key. The key for the local units of work is also used according to the business register. The links are not available for all combinations, but there is steady work going on to improve the linking within EVA as well as to the BR.

Some information, like some parts of labour costs, or the hours worked by person or by enterprise, is not included in administrative data. But there are some business surveys containing this information. The labour cost survey (LCS) is held every four years in Austria, the last time in 2008. In this survey, enterprises are asked about the average number of employees, subdivided by full-time and part-time workers, with apprentices counted separately. The main focus of the survey is labour costs, which are asked in detail. Furthermore the total amount of hours actually worked is asked for the subgroups separately.

Every month, data is collected for a short term statistics survey in industry and construction (KJE (German: Konjunkturerhebung)), which is published every year. Small companies do not have to answer the survey, but their values are estimated. For all bigger companies some information is included from the BR, other data is asked directly. The economic sectors of KJE are industry and construction, which is only a part of the enterprises of interest for CVTS. But for those enterprises within the NACE and size categories, data for the same period of time as CVTS is available.

3. From administrative to statistical data on employment

A key variable of an enterprise is the number of persons employed. Employed persons contain employees and self-employed persons, with persons in training counted separately. Further, male and female should be counted separately as well. The number of persons is asked on two reference days, 31.12.2010 and 31.12.2009, as well as an average for the year 2010.

In EVA, all periods of employment since 2002 are stored. But whereas the periods concerning employees usually hold a connection to the enterprise they work in, working proprietors hold no connection to their enterprise in EVA. Thus, in a first step, all variables are calculated for employees only. For all enterprises in the CVTS sample these preliminary variables could be determined. For working proprietors and family members a matching procedure is currently being developed for the census 2011, using data from 2009. First results from these matching procedures are used to add self-employed persons to the number of persons working.

To determine the total labour costs for the year 2010, administrative data was used too. In EVA, tax and social security information is available per person (with an anonymous key) and per enterprise. The social security information is used to determine all persons working in the surveyed companies in the given period of time (the whole year 2010). Furthermore, there is some income information in the social security data, but only above the marginal income and up to a maximum income, so this income information is only used if no link to tax information exists. For the majority of people the salary from the tax register is processed. A simple model is applied to determine the statutory social security contributions of the enterprises per person employed. This linking of tax information to persons and enterprises is already standardized in EVA, using different data sources to improve missing links. This linking information is available in so-called “linking tables” containing an additional quality attribute.

4. Estimation from other surveys

EVA contains no exact information about the hours worked in a company. In Austria, there is no administrative data source (apart from some very small subgroups) where hours worked are registered, either at personal or at enterprise level. There is a variable full time or part time, a variable indicating whether a person holds a marginal job, and some information about the yearly income and the period of time a person worked in past years. Total hours worked are, however, included in a variety of partly compulsory business surveys, which can be used as a basis for estimation.

In 2006, for CVTS3 a procedure to estimate hours worked from another survey was developed, but as the true values were not known, the operating department was not sure about the quality of these estimates. This time the item was asked directly, but a high item-non-response rate is expected. So the estimation process was performed for CVTS4 too.

First challenges to be faced are different basic populations (e.g. different NACE sections, different minimum employees) and different definitions of hours worked (e.g. breaks or waiting time either counted or not). So as preparatory work, a variety of surveys performed by Statistics Austria were analyzed and the population range as well as the definitions and subdivisions were compared. This has to be done for every new project, and it is recommended to monitor changes in questions and definitions of the surveys analyzed as those can change over time as well.

For CVTS, the labour cost survey (LCS) proved to be appropriate regarding the definitions of hours actually worked, as well as the subgroups available. The KJE survey is available for only half of the NACE categories, and only data with similar definitions and for wider groups are available. On the other hand, KJE data for the same period of time as CVTS will be available soon. This information can be used to verify the assumption that there are no changes in the relation between administrative information (e.g. number of days worked) and hours worked.

To build a model to transfer worked hours from one survey (LCS) to another (CVTS), a stepwise process was performed.

First, administrative data for the period of time and the enterprises in the LCS was extracted. From this data a reference value, in this case the average number of employees in the reference year 2008, was derived. This value is asked in the survey and can be determined from the administrative source as well. We found that such a reference value is very useful for eliminating data sets which are not plausible and would therefore worsen the quality of the model. Such data sets can appear for a number of reasons, such as missing links in administrative sources, different definitions of units, different classifications of employees, or wrong data in the survey. As a benchmark, the ratio of the administrative value (Xadmin) to the survey value (Xsurvey) is used:

$$\left|\, 1 - \frac{X_{\text{admin}}}{X_{\text{survey}}} \,\right| \le \alpha \qquad (1)$$

Only data which lie within this threshold (α) are used in the next steps. In our sample, there were 6585 enterprises in the NACE categories of interest; 6105 data sets lay within the threshold α = 0.25 and were therefore used for estimation.
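Formula (1) translates directly into a vectorised filter; a minimal sketch (with assumed column names) is given below.

```python
import pandas as pd

def within_benchmark(x_admin: pd.Series, x_survey: pd.Series, alpha: float = 0.25) -> pd.Series:
    """Boolean filter implementing formula (1): keep a unit only if the
    administrative and survey values of the benchmark variable agree to
    within the relative threshold alpha."""
    return (1 - x_admin / x_survey).abs() <= alpha

# e.g. lcs_ok = lcs[within_benchmark(lcs["employees_admin"], lcs["employees_survey"])]
```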

In a second step, the numbers of part-time and full-time workers were derived from administrative data. If those values differ strongly from the numbers in the survey, but the overall number of employees is similar, a model without distinction by these categories is used, as in this case the definitions of full and part time are apparently not the same in the administrative data and the survey.

Next, auxiliary variables are derived from administrative data. In our case, the number of days worked in the reference year is calculated for full-time employees, part-time employees and apprentices separately, as well as for all employees together. This is done first for the enterprises in the LCS; then these variables are calculated analogously for the reference period and the enterprises of the CVTS, so they can be used as independent variables in our model.

Before the model is estimated, the scope of the samples has to be considered. For CVTS, the NACE sections B to N and R, S are part of the survey. For LCS, the sections P and Q are surveyed too. These NACE sections do not need to be modeled, as they won't be transmitted to the CVT-Survey. A different scope of size of the enterprise could be considered as well, for CVTS and LCS they are nearly the same, so no further cut is made.

The following formulae are assumed:


$$d_{ft} \cdot w_{ft} = h_{ft} \qquad d_{pt} \cdot w_{pt} = h_{pt} \qquad (d_{ft}+d_{pt}) \cdot w_{tot} = h_{tot} \qquad (2)$$

$d_{ft}$ … days worked full time
$d_{pt}$ … days worked part time
$w_{ft}$ … average hours actually worked per day by full-time employees
$w_{pt}$ … average hours actually worked per day by part-time employees
$w_{tot}$ … average hours actually worked per day by employees (without apprentices)
$h_{ft}$ … total hours actually worked by full-time employees
$h_{pt}$ … total hours actually worked by part-time employees
$h_{tot}$ … total hours actually worked by employees (without apprentices)

Theoretically, there should be no intercept, as zero days of work imply also zero hours of work, but this constraint was not set in advance.

It is assumed that the hours worked per day are not equal for all NACE categories, so regressions are estimated by groups of categories, which we found by analysing the data. The data are also divided into different size classes, to ensure the dependent variable h_tot is normally distributed. First, models for full-time and part-time employees together were estimated. The models show a good quality, with an adjusted R² between 0.75 and 0.99, and corroborate the hypothesis that w_tot differs significantly between groups of NACE categories. As we found some outliers with high leverage in our data, we decided to use robust regression to minimise their effects.
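A minimal sketch of such a model, fitted within one group of NACE categories and one size class, is shown below using the robust linear model (Huber M-estimator) from statsmodels; the column names are assumptions and the sketch is not the production code of Statistics Austria.

```python
import statsmodels.formula.api as smf

def fit_hours_model(lcs_group):
    """Regress total hours actually worked on days worked full time and part
    time (cf. formula (2)) within one group of NACE categories and one size
    class, using a robust M-estimator to damp high-leverage outliers."""
    return smf.rlm("h_tot ~ d_ft + d_pt", data=lcs_group).fit()

# results = fit_hours_model(lcs_subset)
# print(results.params)  # the intercept is expected to be close to zero
# cvts["h_tot_est"] = results.predict(cvts[["d_ft", "d_pt"]])
```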

In a second step full and part time employees were examined separately, but these models did not lead to a significant improvement. Apparently, dividing part and full time employees according to administrative information is not similar to the separation performed by the enterprises for answering the survey.

These models were then used to estimate the hours worked in the enterprises of the CVT survey. So far only first raw data are available; they are used to assess the quality of the estimation models on data not used for building them. For the CVT data a benchmark variable is again used to determine whether the questionnaire was answered for the same unit as identified in the administrative data. For the CVTS, the number of persons employed on 31 December 2010 is both asked in the survey and calculated from administrative data. Again, formula (1) is used to filter the data sets with a similar value for this variable, and the estimation model is applied to them.

The estimated values are then compared with the values reported in the survey. Of the 1230 enterprises for which the survey has been answered so far, 986 responded to the item total hours worked; 890 of them pass the filter of formula (1). For 80% of these, the ratio of estimated to survey value lies between 0.75 and 1.25.
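The comparison can be sketched in R as follows; the data frame cvts and its column names are assumptions for illustration only.

# Ratio of model estimate to reported survey value, and the share within the band.
ratio <- cvts$h_tot_estimated / cvts$h_tot_survey
mean(ratio >= 0.75 & ratio <= 1.25, na.rm = TRUE)   # about 0.80 in the application described above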

For the smaller enterprises the results are plotted in Figure 1. The squares belong to enterprises which did not pass the filter of formula (1), the circles to the remaining ones. The dotted line corresponds to estimated value = survey value, the solid line is the mean line of the data passing the filter, whereas the dash-dotted line would be the mean line of all data, including the squares. The dashed lines mark the (0.75, 1.25) interval. For enterprises further apart, a plausibility check is performed by the operating department, especially for those with a survey value close to zero.

[Scatter plot: total hours worked estimated from the model (x-axis) against total hours worked reported in the survey (y-axis); data passing the filter and filtered-out data are shown with different symbols.]

Figure 1: Estimated vs. collected data on hours of work

5. Conclusion and further work

In our case study, estimating total hours worked using total days worked derived from administrative data as auxiliary variable led to reasonable results. Considering that the survey value is not always exact for this item, using such models to lower the response burden or to improve the quality of imputation seems advisable. Filtering the data with a benchmark variable avoided applying the model to units where administrative and survey data do not correspond; for these units other methods of imputation should be used.

When KJE data for 2010 are available, a model will be estimated on these data and possible changes in the weights w_i will be analysed.

The whole procedure of processing the administrative data will be automated. In a next step, the selection of the usable data files from the “donor” survey will be standardized as well, as will the selection of the data files from the receiving survey. The calculation of labour costs will be done using already processed data from EVA.

A further problem to be tackled will be the determination of the quality of these variables in the new survey.

References

Čiginas A., Kavaliauskienė D. (2010) Overview of use of Administrative Data in STS, ESSnet Seminar, Rome, March 2010.

Bundesstatistikgesetz 2000, BGBl. I Nr. 163/1999, Austria.

Fox J. (2002) Robust Regression, in: An R and S-Plus Companion to Applied Regression, Appendix, Sage.

Salfinger B., Sommer-Binder G. (2007) Erhebung über betriebliche Bildung (CVTS3), in: Statistische Nachrichten 12/2007, pp. 1106–1119.

Silva D.B.N., Clarke P. (2008) Some Initiatives on Combining Data to Support Small Area Statistics and Analytical Requirements at ONS-UK, IAOS Conference, Shanghai.

Transforming administrative data to statistical data using ETL tools

Paulina Kobus, Paweł Murawski Central Statistical Office, Poland, [email protected], [email protected]

Abstract: This paper focuses on administrative data sources and on ETL tools as instruments for transforming administrative records into statistical registers. It presents the process in its various stages, starting from data extraction, and indicates sample registers processed by official statistics. The most extensively developed part of the paper concerns data transformation, i.e. the transformation of public registers into a statistical register. The final stage of the ETL process, loading, is then briefly discussed. The summary focuses on the problems and difficulties associated with data transformation and on the benefits of administrative data for statistics.

Keywords: data integration, ETL tools, statistical register

1. Introduction

The aim of this paper is to describe the processing of data from administrative sources as part of the ETL process, covering all activities on the data sets needed to obtain, as a result, a statistical register: a complete set of data that allows research to be carried out in official statistics.

2. Extract data

This section presents the loading of data into the database and the work on consolidating data from various source systems: extracting data into the production environment based on SAS software and converting them into one format suitable for processing, SAS tables. The first stage of work on the data sets is to extract them and put them into the production environment based on software from SAS Institute. We use the application Data Integration Studio, as well as Enterprise Guide. The obtained data come in various formats, for example .txt, .xls, .csv, .xml and MS SQL databases. Import means consolidating data from various source systems and converting them into a format suitable for processing, i.e. SAS tables. An integral part of the import is checking the correctness of the data and their structure. This includes in particular the number of imported records (whether it agrees with the number of records submitted by the provider of the information) and verifying the correct assignment of data to individual columns (checking that text fields contain text, that field lengths are suitable for the variables, etc.).
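To make the import checks concrete, a small R sketch is given below. It only illustrates the type of checks described (record count against the provider's count, field length, text content); the file name, column names and expected count are hypothetical, and the actual production environment is SAS-based.

# Import a source file and run basic structural checks on the result.
raw <- read.csv("register_extract.csv", stringsAsFactors = FALSE, fileEncoding = "UTF-8")
stopifnot(nrow(raw) == 1250000)              # does it agree with the record count reported by the provider? (assumed figure)
stopifnot(all(nchar(raw$postal_code) <= 6))  # field length suitable for the variable
stopifnot(!any(grepl("[0-9]", raw$city)))    # text columns really contain text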

3. Transform data

Data transformation means a series of activities in the production environment consisting of: profiling (the creation of a report on data quality), unification/standardisation of data, parsing (separation) or combining of variables, standardisation against schemes, conversion, validation, deduplication and data integration. When a data set has been successfully extracted, a process called profiling takes place. We create a profile, a report on the quality of the data, so that we can check (in absolute numbers and percentages) the rate of errors for each variable in the set. In profiling we can obtain information about the number of completed records, the number of unique entries, patterns and incorrect data. The next step is data standardisation: the values occurring in certain columns are unified and brought down to a defined standard. Parsing is the separation of variables, for example the division of one column 'address' into the columns 'street', 'town' and 'house number', or the separation of name and surname from one text field.

Table 1. Example of data standardization

Incorrect data format    Format after standardization
1985-02-21               19850221
1985.02.21               19850221
1985 02 21               19850221

The example shows the date variable before and after unification.
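As an illustration only (the office's actual rules live in the SAS environment), the unification shown in Table 1 could be expressed in R as follows.

# Normalise several incoming date formats to YYYYMMDD.
std_date <- function(x) {
  x <- gsub("[^0-9]", "", x)              # drop separators such as '-', '.' and spaces
  ifelse(nchar(x) == 8, x, NA_character_) # keep only complete dates, flag the rest for review
}
std_date(c("1985-02-21", "1985.02.21", "1985 02 21"))   # all become "19850221"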

Table 2. Example of data standardization

Incorrect data:

Voivodeship     City         Street         Place of birth
MAZOWQIECKIE    WARZSWA      ul. DŁUGA      LONDYN - ANGLIA
MAZPWOECKIE     WARS-AWA     Ulica DŁUUGA   LONDYN – WLK BRYTANIA
ZAZOWIEVCKIE    AWRSZAWA     DLUGAA         LONDYN/CHELSEA
MZAOWIECIE      WARSZAAAWA   DŁUGA (ul.)    LONDYN BRIDGE

After parsing and standardization:

Voivodeship     City         Prefix   Street   Place of birth
MAZOWIECKIE     WARSZAWA     UL       DŁUGA    LONDYN

The above example illustrates the effect of parsing and standardisation of the variable ‘street’. After these processes, incorrect values are replaced by the correct ones.
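A hedged sketch of the parsing step in R: splitting a single address field into a street name and a house number. The field name and the patterns are assumptions for illustration, not the actual production rules.

# Separate street name and house number from one 'address' text field.
addr   <- c("ul. DŁUGA 15", "Ulica DŁUGA 7A")
street <- trimws(gsub("^(ul\\.|Ulica)\\s*|\\s*[0-9]+[A-Za-z]?$", "", addr))
number <- regmatches(addr, regexpr("[0-9]+[A-Za-z]?$", addr))
street; number   # "DŁUGA" "DŁUGA"  and  "15" "7A"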

The next step of data transformation is validation. Validation is the process of checking the correctness of the data and correcting abnormal values according to algorithms prepared by methodologists. Sometimes it is also necessary to exclude from further processing records whose improvement is impossible. Through this process we are able to obtain better data quality. Validation is performed on data sets already pre-'cleaned' in the previous stages of work. Another action is data deduplication. Deduplication is the process of removing repeated units and merging the information contained in records referring to the same unit. It requires a detailed

analysis, often including the analysis of legal acts, and it is individual for each register. As a result of deduplication we obtain one unique record containing all the available and unique information. One of the last actions is data integration: the process of selecting the best, most current and correct value from several or a dozen registers. Its result is the statistical record, which will be available for use by analysts.
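One simple deduplication rule of the kind described above, sketched in R; the key column id and the date column are assumed names, and the real rules are register-specific.

# Keep, for each unit, the most recent record; further duplicates of the same id are dropped.
records <- records[order(records$id, records$reference_date,
                         decreasing = c(FALSE, TRUE), method = "radix"), ]
dedup   <- records[!duplicated(records$id), ]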

4. Loading data

The statistical register is transferred from the production area to the analytical environment. In this process it is important to use mechanisms for quickly loading large amounts of data. In the analytical area further work on the data goes on, such as the production of summary tables or the generation of reports.

Case study – Job churn project at CSO, Ireland

Dunne, John

Administrative Data Centre, Central Statistics Office (CSO), Skehard Rd, Cork, Ireland (Email: [email protected])

Abstract: The paper will cover experiences from the Job Churn Explorer project at CSO with a particular focus on a sectoral flow analysis of job separations - where do those leaving jobs get re-employed. The project adapts and develops the underlying methodology outlined to date to the situation in Ireland to provide a detailed insight into the dynamics of job churn and its components as Ireland entered the current recessionary period.

The analysis datasets used are derived from linking the following three sources

- Business register

- Employer tax returns

- Social Protection records

The comprehensiveness of the resulting analysis dataset, containing attributes on both workers and enterprises, provides for significant new opportunities to inform policy and decision making with respect to the labour market.

While the data integration employed for this project is simple and straightforward, there are significant opportunities with these datasets through record linkage and data integration techniques with other sources such as the Labour Force Surveys (person based surveys) and other business registers/surveys (business based attributes).

It should be noted also that the resulting analysis datasets from this project also provide a key link between business and social statistics.


Introduction and Background

- Demand from users

Ireland has, in recent times, experienced an unprecedented period of sustained growth followed by a sharp downturn at the end of 2007. This sharp downturn has had a significant effect on employment in Ireland, with the unemployment rate rising from 4.6% in Q3 2007 to 13.9% in Q3 2010. GDP at constant market prices1 fell nearly 12%, from just over €186bn in 2007 to just under €165bn in 2010.

Jobs, and any insights about the labour market that can inform decisions, are therefore of significant value to government, business, workers and work seekers (Fox, 2009).

In particular, survey vehicles such as Labour Force Surveys are limited in their ability to track the flow of workers between jobs.

The work presented in this paper is informed by that in international literature and builds on the increasing relationship between CSO and Government organisations with respect to exploiting administrative data for statistical purposes. In particular, this work allows for significant insights into job churn and its components in the Irish jobs markets, in other words the work provides information about those leaving, staying or taking new jobs and the firms in which these jobs are located.

- Strategic context

In developing the Irish Statistical System, significant emphasis is put on exploiting the untapped statistical potential of administrative records. This project serves to demonstrate this potential by delivering an example. Furthermore, the work done in this project also provides a key linkage between business and social statistics by linking persons with businesses through an employment relationship.

Summary review of literature

There is a significant amount of literature available with respect to the investigation of job churn and its respective components. Significant challenges are presented in much of the literature with respect to bringing the firm based components, job creation (JC) and job destruction (JD), together with the person based components, hirings (H) and separations (S), as typically they are derived from different sources. However, with the increasing recognition of the value of administrative data for statistical purposes, the use of employer-employee returns to tax authorities is of high value due to the ‘single source’ nature of the data when calculating and comparing the various components. Work has been identified in the US (Burgess, Lane, & Stevens, 2000), Finland (Ilmakunnas & Maliranta, 2001), Germany (Guertzgen, 2007) and Norway (Li, 2010) where employer-employee linked data sources have been able to facilitate more comprehensive and in-depth insights into both the job and worker components of job churn and how they interact with each other. The

1 Chain linked and referenced to year 2008

potential of such linked datasets is significant for obtaining insights into the movements of jobs and workers. These insights are of particular value to policy analysts in evaluating and informing policy with respect to market dynamics for both jobs and workers.

Bassanini and Marianna (2009) bring together the material from key papers and present the underlying theory (including the calculations and how they are derived) with respect to job churn in a clear manner, with a view to bringing together results from many different studies and countries to undertake cross-country comparisons. The theory as presented has formed the basis of how the author has developed and calculated the various job churn components in this work.

Definitions and Methodology

The definitions and methodology used are adapted from those in Bassanini and Marianna (2009) to take account of the methodology used in the Eurostat-OECD Manual on Business Demography Statistics and of shortcomings in the available data sources. The available data source does not have point-in-time measurements. The Business Demography Statistics manual uses a methodology where year t is compared with year t-1.

The business unit of observation is that of an enterprise as defined in statistical legislation. Where administrative units have not been properly profiled into statistical units a one to one correspondence is assumed.

The primary variables for analysis at the business unit level are obtained by comparing data between two periods (calendar years) such that the following identity holds for each business unit

∆E = Et − Et-1 = JC − JD = H − S

where E, JC, JD, H and S represent employment, job creation, job destruction, hirings and separations, and ∆ denotes the difference between period t-1 and period t.

Employment for the business unit in period t is estimated as the number of valid employment records with non zero reckonable pay2 for that business unit in the period. This estimate does not factor in duration of employment or whether an employment is part-time or full-time in nature.

Job creation is measured as the difference in the number of employment records with non zero reckonable pay between two periods, t and t-1, if that difference is positive, zero otherwise and is assigned to period t.

Conversely, job destruction is measured as the difference in the number of employment records with non zero reckonable pay between two periods if that difference is negative, zero otherwise and is assigned to period t. In order for the identity to hold the jobs destructed figures are assigned to period t even though technically the jobs were lost in period t-1.

Hirings for the business unit are calculated as the number of employment records assigned to an individual in period t for which a corresponding employment record for that individual did not exist in period t-1 with respect to the business unit.

Conversely, separations for the business unit are calculated as the number of employment records assigned to an individual in period t-1 for which a corresponding employment record for that individual did not exist in

2 The primary difference between reckonable pay and gross pay is that reckonable pay excludes any payments to pension schemes or permanent health insurance schemes recognised by the Irish Tax Authorities.

period t with respect to the business unit. Again, while technically the separations occur sometime in period t-1, for the identity to hold the estimated separations figure is assigned to period t.

Job stayers (JS) for the business unit are calculated as the number of employment records assigned to an individual in period t-1 for which a corresponding employment record exists for that individual in period t.

Job destruction figures for a group of business units are obtained by summing the figures for the business units in that group (e.g., for a group of business units classified to a specific sector). Job creation, hirings, job stayers and separations for a group of business units are obtained in the same way.

Total job reallocation (REALJ) refers to the sum of job creation (JC) and job destruction (JD) for a group of business units. Excess job reallocation (EXCJ) for a group of business units is defined as the difference between total job reallocation (REALJ) and the absolute net change in total employment (|JC − JD|). So for group j at period t,

EXCJ_j,t = REALJ_j,t − |JC_j,t − JD_j,t| = JC_j,t + JD_j,t − |JC_j,t − JD_j,t|

Excess job reallocation provides a measure of the offsetting job creation and job destruction within a group of firms.

When aggregating over a group of business units with similar characteristics, generally speaking, job creation (JC) can be considered as the sum of employment growth from all expanding and new firms, while job destruction (JD) can be considered as the number of jobs lost from contracting or exiting firms. It should be noted that expanding and contracting business units are assigned these attributes based on volume or number of weeks work paid – therefore it is possible for contracting firms to have job creation and expanding firms to have job destruction (i.e., two employees each with 16 recorded weeks paid compared with one employee with 52 recorded weeks paid).

Worker reallocations are dealt with in a similar manner. Total worker reallocation (REALW) is obtained by summing hirings (H) and separations (S) over all members of a specified group; the group can be defined either by a group of firms or by particular demographic characteristics (age, gender, etc.). Excess worker reallocation (EXCW) for a group is defined as the difference between total worker reallocation (REALW) and the group’s absolute net change in employment (|H − S|). So for group j at period t,

EXCW_j,t = REALW_j,t − |H_j,t − S_j,t|

Excess worker reallocation provides a useful measure of the number of job matches over and above the minimum necessary to accommodate net employment growth; in other words, it reflects the reallocation of job matches (reshuffling of jobs and workers) within the same group (Bassanini & Marianna, 2009).

At the business unit level, churning flows (CH) are defined as the difference between excess worker reallocation and excess job reallocation. Churning flows represent labour reallocation arising from firms churning workers through continuing jobs, or from employees quitting and being replaced on those jobs. So for group j in period t,

CH_j,t = EXCW_j,t − EXCJ_j,t

All flow measures from period t-1 to period t are expressed as rates by dividing flow totals by relevant average employment figures in period t-1 and period t.

In adhering to recommendations in the literature, an average of the number of employments at year t and t-1 is used as the denominator in the calculation of rates with respect to reference period t.
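As a rough illustration of how these components can be derived from a linked employer-employee file, a short R sketch follows; the data frame emp, its columns (enterprise_id, person_id, year) and the two reference years are assumptions for illustration, not the CSO's actual processing code.

# 'emp' holds one row per employment record (enterprise x person x year) with non-zero reckonable pay.
library(dplyr)
library(tidyr)

jobs <- emp %>% count(enterprise_id, year, name = "E")            # employment per enterprise and year
flows <- jobs %>%
  pivot_wider(names_from = year, values_from = E,
              values_fill = 0, names_prefix = "E_") %>%
  mutate(JC = pmax(E_2009 - E_2008, 0),                           # job creation, assigned to period t
         JD = pmax(E_2008 - E_2009, 0))                           # job destruction, assigned to period t

# Hirings: records in t with no corresponding record in t-1 for the same enterprise/person;
# separations are obtained symmetrically by swapping the two years.
hirings <- anti_join(filter(emp, year == 2009), filter(emp, year == 2008),
                     by = c("enterprise_id", "person_id"))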

Data sources

The datasets used in this analysis come from merging three separate sources as follows:

• P35L data source from the Revenue Commissioners on employment records
• CRS Client Record System from the Department of Social Protection related to Personal Public Service Numbers (PPSN)
• CBR Central Business Register at CSO

The P35L is the primary source of data and contains a record for each registered employment, i.e. employer/employee relationship, in the given year. The dataset contains an Employer Registration Number (PREM number) that facilitates merging with the CBR to assign business based attributes, and also contains the Personal Public Service Number, which facilitates merging with the CRS to assign person based attributes. The P35L file also contains some records relating to Pension payments and these are excluded from the analysis. The P35L also contains information on number of weeks paid and reckonable pay (for tax purposes) for each employment record, which can be used as indicators of job volume and value (and can be combined to give mean reckonable pay or an indicator of job quality). The P35L also contains the PPSN or person based public service identifier3. While this source can in general be considered exhaustive, there are a small number of quality issues with respect to statistical purposes worth noting. In particular, validation of the personal identifier showed that for a small number of records (< 5%) an invalid number was recorded. However no significant pattern was identified to these invalid records. An interim decision was taken to work solely with records where the person number is identified, to keep the methodology as simple as possible.

The CBR is the Business Register of enterprises maintained by CSO to support the compilation of statistics on business as laid down in EU statistical legislation. The business register became fully aligned to administrative data sources for reference year 2007. In general there is a one to one relationship between the enterprise as defined by the CBR ID and the employer registration number. However, in a small number of cases an enterprise group may pass all of its employment through a single PREM number attached to a single enterprise. Another type of exception occurs where an enterprise can comprise a number of legal units and hence have multiple PREM numbers. The CBR also does not have comprehensive coverage of all employment sectors. These difficulties arise due to the lack of a Unique Business Identifier across all public administration systems and also the lack of a standard methodology to profile enterprises in the Public Sector.

The CRS is a master register of all PPSNs assigned and contains information collected at registration on date of birth, sex and nationality as declared by the applicant. Nationality has only been collected since 2002. Any PPSNs assigned prior to this period are assumed to have Irish nationality for the purposes of creating the analysis datasets. This is done on the basis that prior to 2002 Ireland did not have the same influx of foreign nationals as it did after the enlargement of the EU to EU25. The PPSN came into being in 1998 and replaced the old RSI number used for tax and social welfare purposes. The PPSN serves to uniquely identify persons/customers when engaging or transacting with the state and is assigned when a person first interacts with the State. For those born in Ireland the PPSN is assigned shortly after birth (and is required to avail of child benefit). It is acknowledged that there are some quality issues with respect to PPSNs inherited from the old RSI number, such as duplicate numbers, persons being assigned more than one RSI number or an identical RSI number (with a suffix of M or F) for husband and wife. However for statistical purposes these quality issues are not considered significant.

3 In line with its data protocols, CSO replaces the official PPSN on analysis based datasets with a proxy for PPSN called the CSOPPSN. It is this proxy that is used to link person based data.

Two units of observation are available in the data for businesses: the first, the employer unit, refers to the unique registration number of each employer, while the second refers to the statistical definition of an enterprise in EU statistical legislation as applied in Ireland. The project uses the latter.

In summary, the CBR contributes the legal form and activity breakdown (NACE 1 and 2) attributes on the enterprise while the CRS contributes DOB, nationality and sex attributes on the person. This paper works with NACE Rev 2.
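The (straightforward) integration step can be sketched in R as below; the data frame and column names are assumptions for illustration, with the CSOPPSN proxy used as the person key as described in footnote 3.

# Merge employment records with business attributes (via PREM number) and person attributes (via CSOPPSN).
p35 <- subset(p35, !is.na(csoppsn))   # keep only records with a valid person identifier
p35 <- merge(p35, cbr[, c("prem_no", "cbr_id", "nace2", "legal_form")], by = "prem_no", all.x = TRUE)
p35 <- merge(p35, crs[, c("csoppsn", "dob", "sex", "nationality")],      by = "csoppsn", all.x = TRUE)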

The P35 data source is available for reference years 2005 onwards. The CRS provides sufficient information on persons in the P35 files for reference year 2005 onwards.


Investigating worker flows between sectors

- Sectoral flow of workers in context – job churn

Looking at volume of work by sector (figure on top left), the largest and second largest sectors are the Wholesale and retail trade sector (G) and Manufacturing sector (C). The drop in employment since 2007 in the Construction sector (F) is also very apparent falling from approx 175,000 person years in 2007 to under 100,000 person years in 2009. For other significant sectors lesser declines in volume of employment are observed, commencing in 2007 for Manufacturing sector (C) and in 2008 for the Wholesale and retail trade sector (G) and the Accommodation and food sector (I). The figure at top right presents the absolute job churn figures or the movement of workers between firms above and beyond that required to satisfy the movement of jobs (job creation and job destruction) while the near left figure presents the job churn figures standardised as rates. The largest sector in terms of volume of work, Wholesale and retail trade (G) is also the sector providing the largest number of opportunities with respect to the movement of workers. The Accommodation and food

sector (I) provides more opportunities for job movers/seekers than Manufacturing (C), despite being a significantly smaller sector in terms of volume of work. In fact, looking at churn rates, Accommodation and Food (I) has a significantly higher rate than all other sectors, with Wholesale and retail trade (G) having the second highest churn rate. The sectoral analysis of churn rates is further evidence for the assumption that job churn is pro-cyclical, with decreases in the job churn rate evident in all sectors.

- Sectoral flow of workers in context – re-employability of separations

Table 1: Analysis of separations4 by year and whether a new employment record was found

        No new employment      All                 New employment
        Number       %         Number       %      Number       %
Business economy excluding activities of holding companies (B to N, -642)
2006    156,572      32        483,788      100    327,216      68
2007    163,820      31        532,238      100    368,418      69
2008    205,006      35        586,201      100    381,195      65
2009    281,253      47        600,740      100    319,487      53
Business economy services excluding activities of holding companies (G to N, -642)
2006    113,296      31        362,803      100    249,507      69
2007    112,657      29        395,016      100    282,359      71
2008    137,166      32        431,906      100    294,740      68
2009    187,274      43        438,509      100    251,235      57
Industry (B to E)
2006    19,053       36        52,830       100    33,777       64
2007    22,152       38        57,998       100    35,846       62
2008    24,416       41        59,436       100    35,020       59
2009    35,027       51        68,436       100    33,409       49
Construction (F)
2006    24,223       36        68,155       100    43,932       64
2007    29,011       37        79,224       100    50,213       63
2008    43,424       46        94,859       100    51,435       54
2009    58,952       63        93,795       100    34,843       37

When looking at the reference year 2006 in table 1 above, of the 484,000 separations in the business economy for the previous year, 68% or 327,000 were identified as being employed in some sector whether in the business economy (B to N, - 642)5 or not. For 2009, this ‘re-employability’ figure of 68% had fallen to 53%. The drop in re-employability however is significantly greater for those in the Construction sector (F) compared to other sectors with the re-employability figure falling from 64% in 2006 to 37% in 2009.

4 Note: where a person is identified as having more than one separation, only the separation with the highest number of weeks paid is counted.
5 Note: sectors A and O through U are excluded from the business economy analysis in this paper. The classification used is based on published Business Demography data for CSO, Ireland.


- Sectoral flow of separating workers

Table 2: Sectoral flow of separations in business economy finding re-employment (rows: sector of separation; columns: sector of re-employment)

        B to N,-642       Sector C        Sector F        Sector G        Sector I        Sector N
        Number      %     Number    %     Number    %     Number    %     Number    %     Number    %
Business economy excluding activities of holding companies (B to N,-642)
2006    277,696     100   26,175    9     44,291    16    66,045    24    40,110    14    34,996    13
2007    315,185     100   28,565    9     47,820    15    74,205    24    44,253    14    41,540    13
2008    325,051     100   28,114    9     39,467    12    80,333    25    48,046    15    47,163    15
2009    270,112     100   25,792    10    22,024    8     78,360    29    43,788    16    37,207    14
Manufacturing (C)
2006    26,461      100   7,793     29    4,125     16    5,385     20    1,726     7     2,655     10
2007    27,391      100   8,755     32    3,684     13    5,370     20    1,779     6     2,864     10
2008    27,492      100   8,290     30    2,613     10    6,021     22    1,879     7     2,879     10
2009    26,175      100   10,485    40    1,532     6     6,394     24    1,794     7     2,522     10
Construction (F)
2006    40,398      100   3,005     7     24,765    61    3,036     8     1,392     3     4,142     10
2007    46,561      100   3,417     7     28,183    61    3,761     8     1,785     4     5,153     11
2008    46,995      100   3,997     9     24,866    53    5,037     11    2,397     5     5,633     12
2009    30,493      100   2,772     9     13,522    44    3,706     12    2,177     7     4,260     14
Wholesale and retail trade; repair of motor vehicles and motorcycles (G)
2006    69,722      100   5,365     8     5,175     7     31,591    45    7,666     11    6,876     10
2007    74,930      100   5,400     7     4,689     6     35,564    47    7,716     10    7,449     10
2008    76,874      100   5,137     7     3,320     4     37,623    49    8,745     11    7,545     10
2009    72,824      100   4,860     7     1,751     2     42,571    58    7,479     10    6,320     9
Accommodation and food service activities (I)
2006    50,730      100   2,784     5     2,704     5     11,087    22    22,407    44    4,994     10
2007    56,349      100   3,039     5     2,490     4     12,378    22    25,113    45    6,037     11
2008    55,831      100   2,899     5     1,825     3     12,683    23    26,066    47    5,607     10
2009    43,975      100   1,732     4     962       2     9,002     20    23,590    54    4,157     9
Administrative and support service activities (N)
2006    40,526      100   3,873     10    4,102     10    7,251     18    4,138     10    10,654    26
2007    46,463      100   4,172     9     5,059     11    7,973     17    4,644     10    12,929    28
2008    54,437      100   4,250     8     3,966     7     9,428     17    5,382     10    18,826    35
2009    39,872      100   2,842     7     2,160     5     7,384     19    4,828     12    13,365    34

Table 2 describes the flow of workers between sectors in the business economy over time in absolute and percentage terms. Over the period 2006 – 2009 generally for each sector those changing jobs are more likely to take a new employment in the same sector, for example, the percentage finding re-employment that do not change sectors increases from 44% to 54% in the Accommodation and Food Sector (I), 29% to 40% in Manufacturing (C) and 45% to 58% in the Wholesale and retail trade sector (G). The exception is the

Construction sector which shows a decrease from 61% to 44% of workers finding re-employment in the same sector over the period 2006 to 2009. As the re-employability of Construction workers in the construction sector fell in 2008 and 2009 there was an increase in the proportionate flow of workers from the Construction sector into the Retail and Wholesale sector (G) up to 12% in 2009 from 8% in 2006, the Accommodation and food sector (I) up to 7% in 2009 from 3% in 2006 and the Administration and other activities sector (N) up to 14% in 2009 from 10% in 2006. In general the Wholesale and retail trade sector (G) is identified as the biggest recipient sector of cross sector flow of workers from other sectors. This is followed by Accommodation and food sector (I). The biggest cross sector flows happen between these two sectors, (G) and (I), with between 20% and 23% of re-employed workers from the Accommodation and food sector (I) finding re-employment in the Wholesale and retail trade sector (G) in any year and a reciprocal percentage flow of between 10% and 11% over the period 2006 and 2009. These are the two largest sectors in terms of job churn identified earlier.

Note that the difference between the total number of primary separations in the business economy finding re-employment in table 1 (e.g. 327,216 for 2006) and the corresponding number in table 2 (e.g. 277,696 for 2006) is explained by those separations that find re-employment outside the business economy (i.e., in sectors A and O through U).

Concluding remarks

This paper has presented summary statistics from the Job Churn project at CSO with a particular focus on those leaving jobs and on whether and where they go back into employment. Detailed statistical information from this project is available through CSO online databases at http://www.cso.ie/px in order to facilitate further exploration by researchers and policy analysts. This detail includes

- An economic activity breakdown (150 codes) as per the Business Demography system across all datasets
- Job churn components described in this paper for each economic activity code and employment size class
- Age, sex and economic activity breakdown for Hirings, Separations and Job Stayers (measures also include Employment records, Value of reckonable pay, Volume of Work)
- Separations analysis (as per table 1) by economic activity and whether re-employed or not (measures also include mean weekly reckonable pay from the separating employment)
- Sectoral flow analysis (as per table 2) of those separations finding re-employment, categorised by whether mean weekly reckonable pay increased or not

Potential uses of this information include (but are not constrained to)

- Informing on the demographic structure of employees by sector
- Providing input to labour costs analysis
- Providing information on gender pay gaps
- Identification of sectors providing job opportunities (and the quality of those jobs in terms of pay)
- Contributing to longer term evaluation of jobs policy

Further enhancements of the information provided in the online database could include
- Geographical flow of workers through looking at county/location of employer
- Breakdown of analysis into contracting and expanding firms
- Analysis by Country of Ultimate Controlling Interest (UCI) – foreign ownership
- Investigation of exporting enterprises

The work undertaken in this project is an illustration of the untapped potential hidden in administrative data systems across Public Authorities. The project also demonstrates the considerable added statistical value that is available through the linking of such data sources.

The project to date has not explored the further potential value that can be derived through the use of record linkage and integration techniques to combine survey data with the linked employer employee datasets created as part of this project. However the project team recognise that there are significant opportunities through the deployment of such techniques and methodologies.

Bibliography

Bassanini, A., & Marianna, P. (2009). Looking inside the perpetual motion machine: job and worker flows in OECD countries. Retrieved from http://www.oecd.org.

Burgess, S., Lane, J., & Stevens, D. (2000). Job Flows, Worker Flows and Churning. Journal of Labor Economics, 18(3).

Fox, R. (2009, June). Job Opportunities in the Downturn. Retrieved March 15, 2011, from http://www.fas.ie/NR/rdonlyres/9ABC5EE1-CF20-4AA5-ACA4-C5B81DD9FE5E/793/jobsdownturn96.pdf

Guertzgen, N. (2007). Job and Worker reallocation in German establishments: the role of employers' wage policies and labour market equilibriums. Discussion paper, Centre for European Economic Research, Mannheim.

Ilmakunnas, P., & Maliranta, M. (2001). The turnover of jobs and workers in a deep recession: evidence from the Finnish business sector. Helsinki School of Economics and Business Administration; The Research Institute of the Finnish Economy. Helsinki: The Research Institute of the Finnish Economy.

Li, D. (2010). Job reallocation and labour mobility among heterogeneous firms in Norway. Working Paper, Ragnar Frisch Centre for Economic Research.

The system of short term business statistics on labour in Italy. The challenges of data integration

C. Baldi, D. Bellisai, F. Ceccato, S. Pacini, L. Serbassi, M. Sorrentino, D. Tuzi

Istat DICS/DCSC/OCC, Rome, Via Tuscolana, n.1786 ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected])

Abstract: Italy produces labour market short term statistics both for national releases and for EU (STS, LCI and JV) regulations through a system of three surveys: the census monthly survey on Large Enterprises (LES), the quarterly sample survey on job vacancies and hours worked (VELA), and the survey on employment, wages and labour cost (OROS) mainly based on social security data. This paper describes the rationale behind the integration of the three sources into a system and its maintenance over time. Record linkage is used to integrate administrative and survey data both for the definition of the current target population and for editing, imputation and grossing up. Aims of the system are to ensure timely data, consistency among series and over time.

Keywords: data integration, consistency

1. Introduction

Italy produces short-term business statistics on the labour market through a system of three sources: the census monthly survey on large enterprises (LES), the quarterly sample survey on job vacancies and hours worked (VELA), and a survey on employment, wages and labour cost (OROS) mainly based on social security data. The system has been set up during the last decade under the pressure of the new EU regulations. At the beginning of 2000 Italy produced monthly data on employment, hours, wages and labour cost through the LES survey which covered, on a census basis, the population of firms with at least 500 employees. These nationally released indicators on the population of the largest enterprises provided very timely information on the evolution of labour input, while the task of producing more encompassing indicators was carried out by the National Accounts. The approval of the Short Term Statistics regulation (STS - 1998), the Labour Cost Index regulation (LCI - 2003) and the Job Vacancy statistics regulation (JV - 2008) has imposed an adaptation of the statistics production to comply with the standards required in terms of new target indicators (other labour costs, job vacancies) and population coverage (all firms with employees). Instead of a new large-scale sample survey, discouraged both by the considerable financial costs and the burden on enterprises, ISTAT started the OROS Project to exploit the social security (INPS) registers, which represent a low cost source of data on employment, wages and labour costs on the whole population of enterprises. Since INPS data did not contain any information on job vacancies and hours worked, except for paid time, a survey (VELA) was launched to collect information on these two variables. Due to budget constraints, this new survey was limited to the population of

firms with at least 10 employees, with take-all strata for those over 500 employees. To limit the burden on enterprises, the information on hours worked is requested only from the firms not responding to the LES survey, while, in order to check the data on vacancies, the questionnaire does not exempt these firms from providing information on jobs. In other words, all three sources contain information on jobs for the large enterprises (LEs); OROS and LES contain information on the labour cost variables for the LEs; OROS and VELA overlap for jobs on small and medium-sized enterprises (SMEs). All three sources cover firms belonging to the private business sector excluding agriculture (sections B to N of NACE Rev.2) and their economic activity code is mainly drawn from the statistical business register (ASIA). The LES-OROS-VELA system has the objective of producing consistent quarterly estimates on jobs, wages and labour costs, hours worked and job vacancies for the population of firms with at least one employee, while continuing to produce the monthly survey figures for LEs. However, due to the above mentioned constraints, the statistics on job vacancies and hours worked are limited to the firms with at least 10 employees. Figure 1 shows, for the three size-class subpopulations, which sources are used for which variables. The pillars of the system are: a) OROS+LES, which is used both as the current quarter population frame and as the census based source of information for average quarterly jobs, wages and other labour costs; b) VELA+LES, which is used as the sample based source of information for jobs at the end of the quarter, job vacancies and hours worked.

Figure 1: The integrated system: sources, variables and coverage.

[Diagram: for each variable (total labour cost TLC, jobs J, hours worked HW, job vacancies JV) and each size class (500+, 10-499 and 1-9 employees) the figure indicates which source – LES, VELA or OROS – is used and where the target population is covered.]

Although not coherent in all aspects, the system has some interesting features. It provides quarterly indicators on employment, wages and labour cost on a census basis. This census of a quarterly up-to-date population of all the firms with employees represents a substantial improvement for short-term statistics in Italy since the traditional sample surveys are all based on the Business Register. The delay of 15-24

months of the latter with respect to the current quarter implies difficulties in measuring the changes related to business demography. Another relevant aspect of the OROS-LES-VELA system is that the indicators produced are (internally) consistent with the monthly LES ones for the overlapping firms. Moreover, the quarterly indicators on job vacancies and hours worked are consistent with the estimate of jobs derivable from the OROS+LES subsystem on the population of firms with at least 10 employees. To preserve the consistency with the employment totals, the estimates of job vacancies and hours worked are obtained by reweighting the sample data to the portion of the current quarterly population with at least 10 employees. One of the final outputs of this integrated system is the Labour Cost Index, described in paragraph 2, which is subject to strict timeliness constraints. Paragraph 3 illustrates the methodology used to build the subsystem OROS+LES with particular emphasis on the construction and maintenance over time of a unified list of enterprises. Paragraph 4 describes the procedures of micro integration needed to build the subsystem VELA+LES and the calibration to the OROS+LES universe. Some concluding remarks close the paper.

2. The LCI as an integrated output from the system

The LCI is a short-term indicator measuring the quarterly changes of the hourly labour cost and its single components (wages and salaries and other labour costs). Its transmission is due 70 days after the end of the reference quarter and it is used by Eurostat to compile the aggregated Euro indicator on labour cost. The Italian NSI can nowadays satisfy the LCI regulation by combining coherent and harmonized information on businesses produced by the LES-OROS-VELA integrated system on labour market statistics. Figure 2 gives a picture of the flow characterizing the system, from the inputs to the production of the main outputs, through the interrelation of the three subsystems. Focusing on the LCI compilation, the flow chart illustrates how, starting from the data collection, the three surveys go through specific phases aimed at combining micro data and variables, up to the production of coherent indicators, both on labour cost and on labour input variables. To estimate the LCI, the hourly total labour cost indicator (hwTLC^q), with reference to quarter q, is derived as follows:

hwTLC^q = TLC^q / THW^q = ( jTLC^q_OROS+LES · J^q_OROS+LES ) / ( jTHW^q_VELA+LES · J^q_OROS+LES )    (1)

where jTLC^q_OROS+LES is the per-capita indicator on total labour cost and jTHW^q_VELA+LES is the per-capita indicator on hours actually worked. The reconciliation of the three sources is guaranteed by the number of jobs (J^q_OROS+LES) drawn from the OROS+LES subsystem and used by VELA as auxiliary variable for the estimation of hours worked (and job vacancies). Dividing hwTLC^q by the annual average of the same indicator calculated in the base year and applying the chain-linked Laspeyres formula, the LCI is finally obtained (Ciammola et al, 2009). Figure 2 shows the three lines along which the whole production process is completed, moving through more than one phase of integration: LES data versus OROS+LES data, OROS+LES data versus VELA+LES data.
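A purely numerical illustration of formula (1), with invented values, may help; note that the jobs total J^q_OROS+LES enters both numerator and denominator, reflecting its role as the common reference for the labour cost and hours worked components.

# Invented values, for illustration only.
jTLC  <- 9500      # per-capita total labour cost from OROS+LES, quarter q
jTHW  <- 410       # per-capita hours actually worked from VELA+LES, quarter q
J     <- 12.4e6    # jobs from OROS+LES, quarter q (common to both components)
hwTLC <- (jTLC * J) / (jTHW * J)   # hourly total labour cost, here about 23.2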

The scheme puts in evidence how reaching the output depends on the strict keeping of the scheduled deadlines in a short-term context (right bar in Figure 2). At the beginning of the process, the validated monthly LES data are available at about 58 days from the end of the reference month. Once a year, with the release of the indicators referring to the first month, LES data for the whole previous year are revised in order to take into account late respondents and other updated information. LES data, for the three months of the quarter, are the first input for the system. The OROS project, whose main goal is to cover the SMEs subpopulation, exploits administrative data. The source refers to the monthly contribution declarations that employers with at least one employee have to submit to INPS (Rapiti et al., 2010). Each quarter two main data sets are acquired: a set of preliminary data for quarter q, available after 45 days from the end of the quarter, which gives an almost complete coverage of the total OROS population and is used for the preliminary estimates, and a final version of the data for quarter q-4. This second set is the basis to produce final estimates and differs from the preliminary version due to the late reporters and some measurement errors that have been corrected meanwhile by INPS. Given the administrative nature of the source, about 15 days are needed to make the data usable for statistical purposes, during which a non-trivial preliminary phase of checks and computation is carried out1 (Congia et al., 2008). At the same time, some structural information that contributes to the definition of the estimation domain (NACE code, etc.) is drawn from other statistical and administrative sources (Business Register, Tax Register, etc.) and matched with the OROS microdata using the fiscal code as unique linking key. After the calculation of the statistical variables, checks are implemented at different levels in order to identify possible anomalous values, both at monthly and quarterly frequencies. These checks are based on selective editing rules that exploit cross-sectional and longitudinal relations among the analyzed variables. At 60 days from the end of quarter q, the OROS microdata are replaced by the LES survey data for the overlapping firms. The availability of LES data at 58 days from the end of the quarter starts the record linkage and micro-integration process in order to single out from OROS the common sub-population of firms to avoid double counting. This process implies taking into account variable harmonization and linkage issues (two days) and produces the quarterly OROS+LES microdata (see §3). The almost complete coverage of the target population implies that the basis for estimation is a sum of the quarterly OROS+LES data. The estimates obtained turn out to be very accurate for the ratio variables (per-capita wages, labour cost and other labour costs). However, the jobs level estimate has to be adjusted to correct for the incompleteness of the OROS data file due to the late reporters. The method currently used is a macro approach that exploits past revision errors and other longitudinal information. It is carried out on data aggregated at the two-digit level of NACE. The estimates of the total number of jobs are usually available 60-62 days after the end of the reference quarter. The VELA survey completes the short-term statistical system with the production of the job vacancy variable and the extension of the hours worked coverage.
Each quarter the data collection ends around 55 days after the end of the reference quarter, but a subsequent recalling phase implies that late responses are accepted until 59-60 days after the end of the quarter. The first phase of the integration OROS+LES versus

1 The correct exploitation and the translation of these data into statistical information entail coping with frequent changes in laws, regulations and other technical aspects regarding social security contribution. The necessary metadata are not made functionally available by INPS but have to be collected and regularly updated into an electronic format.

VELA+LES starts with the availability of the OROS+LES combined data at 60 days. These microdata are used both as auxiliary variables for editing and imputation of the SMEs jobs collected by VELA and as microdata source for the LEs sub-population estimates of jobs and hours worked (see §4). Both item and unit non-responses to the VELA survey are imputed for LEs, while only item non-responses are imputed for SMEs. The second phase of the integration becomes possible when the macro-adjusted estimates of jobs are provided by OROS+LES at 60-62 days. These estimates are used as known totals in the calibration procedure to obtain the weights for grossing up to the total population with more than 10 employees (see §4). After calibration and the calculation of the aggregates on the study domains, both the job vacancies and the hours worked estimates are validated through micro and macro checks. At this step the priority is given to the compilation of the job vacancy indicators, which need to be transmitted to Eurostat within 70 days from the end of the quarter.

Figure 2: The integration flow

[Flow chart: raw monthly LES data (census), monthly administrative OROS data (census) and raw quarterly VELA data (sample) are pre-treated, edited and imputed, then combined through record linkage and micro-integration into the OROS+LES and VELA+LES data sets; after macro adjustment, calibration and grossing up, the outputs are jobs, total labour cost, hours worked (including a forecast for the current quarter), job vacancies and the LCI. The right-hand bar of the figure gives the time schedule in days from the end of the reference quarter.]

In this integrated system, the timeliness requirements for LCI purposes face some difficulties. The hours worked from VELA+LES are in fact provided at about 75 days

from q and a flash estimate is not available as it is for quarterly job vacancies (see §4). Nevertheless, since LES data on hours worked are available at 58 days, the current per-capita hours worked can be forecast, using the available time series information from the VELA+LES indicators (Ceccato et al., 2011). In order to use LES as a leading indicator in the forecast, a quarterly aggregation procedure on monthly data and a harmonization treatment are needed, requiring four days of work. The VELA+LES hours worked available at 75 days are used to revise the LCI in the next release, because of their non-negligible effects on the indicator. For the sake of completeness, it is worth stressing the interrelations between the three subsystems as regards revisions and their effects on the LCI quality. The LES revisions produce a “domino effect” on the other two processes: they are included in the OROS+LES quarterly data in the first quarter release (in June) of the year (y), and affect the estimates of the four quarters of the previous year (y-1). Furthermore, due to the availability of a final version of the OROS microdata, an additional cause of revision affects each quarter the q-4 TLC estimates. On the other hand, VELA introduces yearly revisions referring to: the four quarters of the previous year (y-1), to acquire the LES revisions; the four quarters of the year before the previous one (y-2), to take into account the q-4 revisions of the jobs estimates. The final outcome on the LCI is a combined and more extended revision effect, driven by the single subsystems: the quarters of the current year are revised each quarter (both via annual weights, and because of the availability of the hours worked on q-1); once a year all the eight quarters of the previous two years are revised. The relatively long time period concerned by revisions and the unavailability of the last observed data affect the quality of the provided LCI indicator. The next improvements of the system integration should start by tackling these two aspects.

3. The administrative and survey data integration

The integration of OROS administrative data with the monthly large firms survey defines first of all the current target population frame for the LES-OROS-VELA short term business statistics system. This integration was a necessity until 2004, justified by the fact that large firms were underrepresented among the early reporters (transmitting the administrative data through an electronic form) used for the OROS preliminary estimates (Baldi et al., 2004). Afterwards, when the data supplied by INPS were almost complete following a legislation change that obliged all firms to declare the social security contributions electronically, the integration became mainly a choice. Since large firms have a relevant influence on the estimates (around 1,400 units in NACE sections B to N, accounting for about 20 per cent of all employee jobs), the survey data, collected and processed monthly by a group of specialized workers with continuous contacts with the enterprises, guarantee a higher quality of the information. The integration followed a feasibility study which had found a good degree of comparability between the target variables of the two sources. Of course a harmonization is necessary, in particular for jobs, which are measured with respect to different definitions (see Amato and Pacini, 2004a). In fact, while LES collects the number of jobs at the beginning and at the end of the month, OROS measures the quarterly average of jobs with at least one hour paid in the month. Moreover, since the publication of monthly figures has been possible only through the LES survey, this type of integration would have opened the possibility to provide

quarterly figures by size classes consistent with the monthly LES ones. This choice implied that for the LES survey, whose main objective in the past was the release of aggregate figures, the production of high quality micro data and meta information became a new important aim rather than being just a by-product. The integration procedure aims at replacing admin with survey data for the overlapping enterprises. The main operation consists in identifying and excluding from the OROS source the enterprises belonging to the LES survey. In this process four specific features are noteworthy:
1. the linkage of the two sources is not a one-off operation but rather a process that must be carried out each quarter;
2. the dynamics of large firms due to corporate events must be monitored in order to guarantee the correct identification of the LES units in the OROS population;
3. to provide high quality data the acceptance threshold of mismatch errors has to be close to zero: the integration of the two sources must be carried out very carefully to avoid any misalignment and duplication;
4. considering the release calendars of LCI and JVS and the availability dates of OROS and LES data, the time allowed for the integration process is about 2 days (Figure 2).
To take into account all these aspects the integration process proceeds in two distinct steps. The first one, quite effort- and time-consuming, is carried out every five years, when the census list of large enterprises is defined on the basis of the Business Register and the administrative units in the OROS current population, with reference to the introduction of the new base year (STS, Reg. CE n.1165/1998). Starting from this LES list, a residual list of OROS units must then be identified. Theoretically speaking, in this year it would be possible to perform the integration procedure by removing from OROS all the firms with more than 500 employees and replacing them with LES data. In practice, a record linkage is needed because of the importance of defining a list to carry forward beyond the base year in order to reduce the quarterly integration costs. The second phase consists in maintaining each quarter the complementarity of the two lists, considering the LES list of the base year as a fixed panel. The panel definition implies that no unit gets in or out of the lists as long as the base year remains the same, even if its size falls below the 500 employees threshold, and that all demographic company changes have to be considered to guarantee a longitudinal panel. Therefore, for example, new units resulting from a split-up of a panel enterprise are included, as are those deriving from a merger between panel and non panel firms. Hence, the base activity to build the two complementary lists is performed in the base year (first step), mainly through exact matching. Although the OROS and LES data do not have the same statistical identifier, the former having an administrative code and the latter an internal survey code, the match is possible using the fiscal code as unique business identification number (BIN). To be used as key linking variable, this code must undergo a preprocessing treatment in both sources to guarantee a formally correct and never missing variable. Despite this accurate pre-matching process, some problems in using the BIN equality function as unique linkage key still remain, causing no matched and false matched pairs (Fig. 3).
The first case occurs when two records belonging to the same unit are not linked, while false matches occur when the BIN equality function gives a positive result but records belonging to substantially different units are linked (the linkage key is neither perfect nor exhaustive). Two of the most important reasons for this phenomenon are the different rules for updating the register in the two sources, together with the frequent occurrence of company changes, which weaken the usefulness of the fiscal code as BIN.

Information to follow the enterprises over time is needed but is not available in the administrative data, so their management is almost completely delegated to the LES survey experts. This information is stored in a database of events, where specific rules for the registration and statistical treatment of longitudinal business changes are applied, taking into account the features of the LES panel (see Amato and Pacini, 2004b).
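The base-year exact match can be illustrated with a minimal sketch in R; all object names, column names and the preprocessing rule are assumptions made for the example, not the production code.

```r
# Illustrative sketch of the base-year exact matching on the fiscal code (BIN).
normalise_bin <- function(x) {
  x <- toupper(gsub("[^A-Za-z0-9]", "", as.character(x)))  # keep alphanumerics only
  x[x == ""] <- NA                                         # flag empty codes
  x
}

les  <- data.frame(fiscal_code = c(" 01234567890", "987-654-321"), les_id  = 1:2)
oros <- data.frame(fiscal_code = c("01234567890", "55555555555"),  oros_id = 101:102)

les$bin  <- normalise_bin(les$fiscal_code)
oros$bin <- normalise_bin(oros$fiscal_code)

matched   <- merge(les, oros, by = "bin")     # candidate pairs via BIN equality
unmatched <- les[!les$bin %in% oros$bin, ]    # LES units with no OROS counterpart
print(matched); print(unmatched)
```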

Figure 3: The record linkage between LES and OROS data.

[Figure 3 is a diagram; only its main elements are reproduced here. The OROS source (firms with 1+ employees) and the LES source enter the BIN equality function f(BIN_LES, BIN_OROS). Pairs with equal BINs are matched; among these, a pair is treated as a true match if the absolute jobs difference |J_LES - J_OROS| is within the threshold, and as a potential false match if it is above the threshold. Records with unequal BINs are non-matches. Doubtful cases undergo clerical review using other sources and the list of residual OROS large units not matched with LES. The output is the integrated list: OROS firms with 1-499 employees plus the LES panel.]

In order to detect false matches, after the automatic linkage another indicator function is considered, based on the difference in terms of jobs between the two units with the same BIN. An acceptance threshold is established to take into account the slight residual difference in jobs remaining after the harmonization of the LES ones. For specific sectors, characterized by a high turnover that implies wider differences between the two sources' job definitions (see §4), this threshold is higher. Matched units with an absolute jobs difference above the threshold, together with the non-matches, are passed to a clerical review carried out using different sources of information, such as the Statistical Business Register, some online administrative sources or firms' web sites2. To avoid any further mistakes, a test is carried out listing and checking whether there are residual large firms in the OROS data not matched with LES ones.

2 To find the correct match, a record linkage based on other identification variables (such as firm name, address, telephone number, etc.) could also be considered. At the moment this solution has not been implemented, due to the small weight of these residual units compared with the high cost of standardizing alphanumeric variables.
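The jobs-based check on the automatically matched pairs can be sketched in R as follows; the thresholds, sector codes and column names are illustrative assumptions, not the values actually used.

```r
# Flag candidate false matches: pairs with the same BIN whose absolute jobs
# difference exceeds a sector-specific threshold are routed to clerical review.
pairs <- data.frame(
  bin       = c("A", "B", "C"),
  nace      = c("I", "C", "N"),        # economic activity section
  jobs_les  = c(520, 1300, 800),
  jobs_oros = c(515, 1210, 802)
)

threshold <- c(I = 50, C = 20, N = 20)  # higher for high-turnover sectors (illustrative)

pairs$flag_review <- abs(pairs$jobs_les - pairs$jobs_oros) >
                     threshold[as.character(pairs$nace)]
clerical_review <- pairs[pairs$flag_review, ]  # checked against the BR, web sites, etc.
print(clerical_review)
```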

The list of units thus created is updated quarterly in a very short time (second step). The linkage scheme described above is applied only to the panel firms which have undergone changes identifiable through both new BINs and new mismatches in the jobs function. This second step makes it possible to rapidly maintain a dynamic integration between the two sources. The efficiency and effectiveness of the list updating are tightly related to the capability of the LES survey to trace corporate changes, also considering that late responses and missing data in the survey increase precisely for the units undergoing such changes. After the identification of the LES units in the OROS source, the administrative data on the economic variables are replaced with the survey data. The availability in the LES survey of detailed wages and other labour cost components allows the calculation of the OROS survey target variables.

4. Implications for editing, imputation and grossing up in an integrated system

The OROS+LES data are subsequently integrated with those collected by the VELA survey in order to produce indicators on hours worked and job vacancies. The integration occurs at two levels:
1) at the micro level, the information from LES and VELA is combined to produce a unified sample dataset in which the fields on jobs, job vacancies and hours worked are filled for every unit that has responded to VELA and for all those belonging to the LES panel; this unified sample dataset is then linked at the micro level with the OROS+LES population;
2) at the macro level, the sample thus obtained is reweighted to represent the portion of the OROS+LES population with at least 10 employees.
The integrated sample dataset is obtained mainly through a deterministic record linkage based on the fiscal code, using the large enterprises list of survey units created as shown in paragraph 3. The same technique is also used for the link with the OROS+LES population. The features of the micro integration procedure differ depending on whether or not the units belong to the LES survey. For the first type of units, the integration consists in acquiring the data on the number of jobs, job flows and hours worked from the LES survey, even for the units that have responded to VELA (since the LES data are carefully edited by expert personnel and imputed for all unit non-responses), and in adding the job vacancy variable collected only by VELA. The acquisition poses no particular problems, since the definitions of the variables are the same in the two surveys. In particular, for jobs at the beginning and at the end of the quarter, the data collected by LES for the same dates are used. For hours worked during the quarter, the data collected (or imputed) by LES for the three months of the quarter are added up to a quarterly figure. For job vacancies, which are collected only by VELA, if a LES firm is a respondent in VELA and the difference between jobs as measured by the two surveys is limited, the collected vacancies-to-jobs ratio is multiplied by the LES end-of-quarter jobs to obtain the number of vacancies for that unit; otherwise, a hot-deck nearest neighbour donor imputation is carried out (see for example Chen and Shao, 2000). For the units not belonging to the LES survey, the integration plays a role in the editing and imputation of the number of jobs collected by VELA, through the comparison with the OROS variable, bearing in mind that the two jobs variables are measured with respect to different definitions (§3).

In fact, while VELA collects the number of jobs at the beginning and at the end of the quarter, the figure available in the OROS quarterly data is the average over the three months of the quarter of the monthly number of employees with at least one hour of work paid in the month. These two measures show a very high correlation, and the distribution of their differences is sharply concentrated around zero (see Bellisai, Pacini and Pennucci, 2005a and 2005b). The method used to check for large differences is a variant of the resistant fences one (see Thompson and Sigman, 1999)3, applied within classes defined by economic activity and turnover to take into account the differences between the two jobs variables. Beyond the differences caused by record linkage or measurement errors in one of the two sources4, those related to definitions may occur particularly in firms with a high turnover of employees within the quarter. A turnover proxy can be constructed using the information collected by VELA on the number of hires and separations. The ordered values of the score function allow the data to be split into three subsets, according to an empirically chosen threshold based on a cost-benefit assessment:
• a critical flow, consisting of the records above the threshold, that will undergo an interactive check since they possibly contain influential errors;
• a non-critical flow, consisting of records between the fences and the threshold, whose errors can be considered non-influential; these observations will undergo a subsequent automatic check of internal consistency and will possibly be treated automatically;
• a flow of observations (those between the fences) considered correct.
The observations of the non-critical flow which have not passed the subsequent check, and those of the critical flow that have been judged incorrect in the interactive phase, are set to missing and passed to an imputation phase. Here a hot-deck nearest neighbour donor imputation is carried out, where the matching variable on which the distance is computed is the OROS jobs (which is reasonable given their strong correlation with the VELA ones). The imputation is performed within classes defined by economic activity and firm size. The dataset resulting from the micro integration procedure is then grossed up to the population of units with at least 10 employees, based on the OROS+LES data built as described in paragraph 3. The grossing up is carried out through calibration, with jobs as measured in this reference population as the auxiliary variable. An exception to this rule is applied in the few cases of firms (not in the LES panel) for which the difference between jobs as measured by VELA and OROS is above a threshold: in this case, to avoid the risk of unsuitably large or small grossing-up weights, the quarterly average number of jobs as measured by VELA is used in place of the OROS jobs as the auxiliary variable.
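Footnote 3 below lists the ingredients of the score function used in the resistant fences check. One possible reading is sketched here in R; the exponent, the fence multiplier and the non-linear function of the interquartile range are assumptions chosen purely for illustration, not the published specification.

```r
# Sketch of a resistant-fences check on the OROS/VELA jobs comparison.
score_fences <- function(jobs_oros, jobs_vela, u = 0.5, cmult = 3, k = 1.5) {
  # score: log ratio of the two sources, scaled by a size (magnitude) term
  score <- log(jobs_oros / jobs_vela) * pmax(jobs_oros, jobs_vela)^u
  q   <- quantile(score, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  list(score = score,
       lower = q[1] - cmult * iqr^k,   # fence: quartile minus/plus a non-linear function of the IQR
       upper = q[2] + cmult * iqr^k)
}

set.seed(1)
vela_jobs <- rpois(200, 50) + 10
oros_jobs <- round(vela_jobs * exp(rnorm(200, 0, 0.05)))
res <- score_fences(oros_jobs, vela_jobs)
flagged <- which(res$score < res$lower | res$score > res$upper)  # sent to further checks
length(flagged)
```

In the procedure described in the text, records flagged in this way would then be split, by a further empirical threshold on the score, into the critical and non-critical flows.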

3 In this case the score function is built using the logarithm of the ratio between the OROS and VELA values, a magnitude adjustment exponent applied to the maximum of the VELA and OROS jobs, and a lower (upper) fence given by the first (third) quartile minus (plus) a non-linear function of the interquartile range.
4 Measurement errors in the OROS micro data are typically related to late reporters in one of the three months of the quarter, or to the fact that some categories of employees (such as workers in non-agricultural firms who carry out work classified as agricultural) are not included in the OROS administrative data.
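The hot-deck nearest neighbour donor imputation described above can be sketched as follows in R; the class variables, column names and toy data are illustrative assumptions.

```r
# Within each economic activity x size class, a record with missing VELA jobs
# receives the value of the donor whose OROS jobs are closest to its own.
impute_nn <- function(df) {
  for (cl in split(seq_len(nrow(df)), list(df$activity, df$size_class), drop = TRUE)) {
    recipients <- cl[is.na(df$jobs_vela[cl])]
    donors     <- cl[!is.na(df$jobs_vela[cl])]
    for (r in recipients) {
      if (length(donors) == 0) next                         # no donor in this class
      d <- which.min(abs(df$jobs_oros[donors] - df$jobs_oros[r]))
      df$jobs_vela[r] <- df$jobs_vela[donors[d]]
    }
  }
  df
}

vela <- data.frame(activity = c("C", "C", "C", "N"), size_class = c(1, 1, 1, 1),
                   jobs_oros = c(40, 42, 120, 30), jobs_vela = c(41, NA, 118, 29))
print(impute_nn(vela))
```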

The known totals of the auxiliary variable are based on the OROS+LES microdata, adjusted in a way that incorporates the macro correction, described in paragraph 2, adopted to overcome the problem of late reporters (see Ceccato et al., 2011). The calibration procedure is performed within cells defined by economic activity and firm size. The initial weight is the inverse of the inclusion probability, adjusted by the response rate, for the units belonging to the non-LES portion, and a unit weight for the firms belonging to the LES portion (these units are drawn with certainty and all their unit non-responses are subsequently imputed). The calibration weights for the LES units will in general be slightly different from one (generally larger), as the weights also adjust for the non-responses of the non-LES large firms. The editing, imputation and grossing up procedures described so far in this paragraph are used to produce the quarterly aggregate figures on hours worked and job vacancies for firms with at least 10 employees. However, the EU job vacancy regulation also requires the transmission of data, at least for the B-N aggregate, within 45 days of the end of the reference quarter. Because the LES data for the last month of the quarter and the OROS data for the quarter are not available in time to comply with this deadline, the procedures described above have been adapted for this purpose. In particular, the LES jobs at the end of the second month of the quarter are used as estimates of the LES end-of-quarter ones, and the OROS microdata on jobs for the previous quarter and for the same quarter of the previous year (to take into account possible seasonal effects) are used in place of those for the reference quarter in editing, imputation and grossing up.
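A simplified sketch of the grossing-up step is given below in R: a ratio calibration within economic activity by size cells, with jobs in the OROS+LES reference population as the single auxiliary variable. The production system may use a richer calibration; cell labels, data and column names are illustrative assumptions.

```r
# Ratio calibration within cells: the initial weights are scaled so that the
# weighted sample jobs reproduce the known population jobs total in each cell.
calibrate_cells <- function(sample_df, pop_totals) {
  for (i in seq_len(nrow(pop_totals))) {
    cell <- sample_df$cell == pop_totals$cell[i]
    g    <- pop_totals$jobs_total[i] /
            sum(sample_df$w_init[cell] * sample_df$jobs[cell])
    sample_df$w_cal[cell] <- sample_df$w_init[cell] * g
  }
  sample_df
}

smp <- data.frame(cell   = c("C.small", "C.small", "C.large"),
                  jobs   = c(20, 35, 600),
                  w_init = c(12, 12, 1),     # unit weight for the LES (take-all) firm
                  w_cal  = NA_real_)
totals <- data.frame(cell = c("C.small", "C.large"), jobs_total = c(900, 620))
print(calibrate_cells(smp, totals))   # the LES unit ends up with a weight slightly above 1
```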

5. Concluding remarks

The Italian integrated system of administrative and survey data for the production of short term business statistics on the labour market has been implemented over the last 10 years to guarantee compliance with EU regulations. There are of course unsolved problems and inefficiencies, mainly linked to an integration process that was not designed from the outset but realized ex post. Some improvements to increase the efficiency and the quality of the general system are currently being studied or implemented. For example, a project is under way to include the questions on job vacancies in the LES questionnaire starting from 2012, in order to reduce both the statistical burden on large firms (which would no longer receive the VELA questionnaire) and the costs for the Italian NSI. It is also expected that this integration in the data collection phase would lead to a higher quality of the disseminated indicators, due to the greater role of the LES experts and the reduced need for micro integration of sample data. Moreover, in the OROS survey, experiments on the micro imputation of missing units (late reporters) are being carried out, with the aim of replacing the macro adjustment. This would increase the quality of the microdata used in the editing and imputation of VELA jobs and in the reweighting procedures. It would be more complex to achieve a timeliness in the production of the hours worked indicators that would allow the use of current quarter data on this variable in the LCI, rather than its time series forecast. To achieve this goal, increases in the timeliness of each process are needed, in a context where resources are already stretched and production times very tight. The fulfillment of new European requirements (such as, for example, the coverage of NACE Rev. 2 sections P to S) is also a big challenge.

Furthermore, the implications for the system of the revision policies of the individual sources should be considered more carefully. Finally, the general context of the statistical sources available in Italy and its evolution have to be considered. In recent months big changes have been occurring as regards the availability of new administrative information with good timeliness. The use of these new administrative sources for a yearly virtual business census could reduce the delay of some statistical registers (such as, for example, the Business Register) and allow a redesign of the sample surveys. This would imply a general reorganization of the system described here.

References

Amato G., Pacini S. (2004a) L'Occupazione delle indagini GI e OROS a confronto, mimeo, Istat.
Amato G., Pacini S. (2004b) Le retribuzioni delle indagini GI e OROS a confronto, mimeo, Istat.
Baldi C., Ceccato F., Cimino E., Congia M.C., Pacini S., Rapiti F., Tuzi D. (2004) Use of Administrative Data to Produce Short Term Statistics on Employment, Wages and Labour Cost, in: Essays, n. 15, Istat.
Bellisai D., Pacini S., Pennucci M.A. (2005a) Analisi preliminare sui dati Oros e Vela – 1a parte, mimeo, Istat.
Bellisai D., Pacini S., Pennucci M.A. (2005b) Analisi preliminare sui dati Oros e Vela – 2a parte, mimeo, Istat.
Ceccato F., Tuzi D. (2011) Labour Cost Index (LCI). Quality Report 2010. Eurostat Internal Report.
Ceccato F., Cimino E., Congia M.C., Pacini S., Rapiti F., Tuzi D. (2011) I nuovi indicatori trimestrali delle retribuzioni lorde, oneri sociali e costo del lavoro della rilevazione Oros in base 2005 e Ateco 2007, mimeo, Istat.
Chen J., Shao J. (2000) Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113-131.
Ciammola A., Ceccato F., Congia M.C., Pacini S., Rapiti F., Tuzi D. (2009) The Italian Labour Cost Index (LCI): sources and methods, Contributi Istat, vol. 8, Rome (http://www3.istat.it/dati/pubbsci/contributi/Contributi/contr_2009/08_2009.pdf).
Congia M.C., Pacini S., Tuzi D. (2008) Quality Challenges in Processing Administrative Data to Produce Short-Term Labour Cost Statistics, in: Proceedings of Q2008 European Conference on Quality in Official Statistics, Rome (http://q2008.istat.it/sessions/paper/29Congia.pdf).
Rapiti F., Ceccato F., Congia M.C., Pacini S., Tuzi D. (2010) What have we learned in almost 10-years experience in dealing with administrative data for short term employment and wages indicators? Paper presented at the ESSnet seminar, Rome, March (http://www.ine.pt/filme_inst/essnet/papers/Session2/Paper2.4.pdf).
Thompson K.J., Sigman R.S. (1999) Statistical Methods for Developing Ratio Edit Tolerances for Economic Data. Journal of Official Statistics, 15, 517-535.

Obtaining Statistical Information in Sampling Surveys from Administrative Sources: Case Study of Spanish LFS ‘Wages from the Main Job’

Javier Orche Galindo and Honorio Bueno Maroto National Statistics Institute of Spain e-mail: [email protected]

Abstract: In 2009 the variable ‘wages from the main job’ was added to the Spanish LFS. The information was taken from administrative registers (Social Security and fiscal files), to avoid overburdening respondents and to improve the quality of the variable without increasing costs. However, the Spanish LFS does not ask for the personal identification number of the respondents, so the link with administrative registers is not immediate. The solution applied was to incorporate the PIDN (personal identification number) from the population register by matching both personal and location variables (names and surnames, address, date of birth, place of birth).

The PIDN assigned to the sample of employees in the LFS is used to link to the Social Security and Tax databases and to incorporate the data on salaries needed to calculate the variable requested in the LFS. One intricate issue is transforming annual data into monthly data referred to the reference week of the survey. Another problem is that no single source has complete coverage of all employees. Consequently, it was necessary to combine the information from all the different sources in order to estimate the soundest wage.

Keywords: labour force survey, record linkage, population register, personal identification number, business identification number, micro integration, combination of sources, validation of sources, best estimation method.

1. The 3-dimensional record linkage strategy to assign the PIDN in the LFS

1.1. The administrative personal identification number (PIDN) of the wage-earner

The aim of this part is to describe in some detail the strategy followed to assign the correct personal identification number (PIDN) to the persons surveyed in the LFS, in order to make it possible to capture administrative information in subsequent phases. The PIDN can then be used for linking to obtain income information, as described in the second part of this paper, or to any other administrative source of interest.

1.2. Persons to be searched

The Spanish LFS interviews around 230,000 different persons aged 16 and over every year, distributed in six subsamples in each of the four quarters of the year. This means that every quarter the PIDN of about 140,000 persons would have to be obtained, even though many of them have already been interviewed in previous quarters. In the process described in this document that number is reduced, because each quarter only the ‘new’ persons appearing for the first time are searched for, cutting to about 28,000 the number of persons for whom a Personal Identification Number (PIDN) must be obtained each quarter. These persons are searched for in the Population Register database (the ‘Padrón continuo’).

1.3. The three dimensions

The search in the Population Register is made by applying selection criteria to the information available for each person, in order to select candidates who could correspond to the LFS person.

Since 2005 different approaches have been tested in order to improve the output obtained (see figure 1).

D1 = Personal data (name, birth date, birthplace)
D2 = Residence data (county, city, street, building)
D3 = Human group (identity group: people surveyed in the same household)

The traditional search variables are the ‘personal variables’: name and surname, birth date and birthplace, which together form Dimension 1 (D1). The second dimension (D2) arose when we considered the advantages of using street codes, which allow a fine-tuned search by building and increase the probability of finding the ‘correct’ person. At the same time, experience with the different attempts showed that easy-to-find people (with few mistakes) are found more quickly using a battery of criteria (‘waterfall searching mode’), while difficult-to-find people (with some mistakes in name, birth date or birthplace) are better found using a parallel mode. With moderate returns from both searches, the improvement in the output is very high: in the first quarter of 2011 we obtained 99.1% of returned persons. The criteria used to search for candidates are shown in figure 2. The return for the first quarter, after confronting the file with the population register, was 31,550 candidates (for 24,409 persons searched).

1.4. Record linkage process

With this file (1st quarter 2011) the process to assign the correct PIDN ran in six steps:

1. Distances: we calculate distances for four selected variables: name, birth date, birthplace and residence. All the distances have been developed in our unit. The distances applied are based on letter ‘coincidences’ (name, residence), or on functions such as a ‘birth-code distance’ and a ‘date distance’ (WMWEDD, ‘weighted match and weighted exact date difference’) for the birth date.

2. Selection: we select only one candidate for every person, the best one within each criterion. This means that the selection can differ depending on whether a person has been found under criterion 1 or criterion 4.

3. Segmentation: we segment the distances in order to separate candidates according to the probability of being the correct person. For example, the combination 100-100-100-100 (name, birth date, birthplace and residence distances) corresponds to a perfect match and is assigned to level 10, while 100-100-100-80 has a lower probability and is assigned to level 9.

4. Iteration: after selecting the best choice at every level, we build the database of persons already linked and make another selection of candidates. This iterative process at each level gives the best performance and avoids bad links between very similar people such as family members, relatives or neighbours.

When this process is finished, a set of results emerges (see figure 3). Levels 10 and 9 have a high probability of identifying the ‘correct’ person, so we accept them as ‘correctly’ linked. The lower the score, the less the information from the LFS and the population register coincides. Below level 6 we reject the link directly. For scores of 8, 7 and 6 we apply additional criteria to confirm or reject these doubtful candidates. This is the moment when we use the third dimension: the human group.

5. Confirmation: these doubtful candidates are resolved using information on the human group, i.e. the people who have been interviewed at the same address (same dwelling). We have the identification number of the group (IDGROUP) and the selected address of every member of the group already linked. Note that the interview address can be different from the address where they are currently living.

With the selection of these members we implicitly use the probability of being part of that group (see figure 4). If the members of the human group (located by the IDGROUP) have been selected at the same address, it means that this group very probably lives together.

If we connect these two probabilities, namely that the doubtful candidate is at the same time part of a human group (family, household, people living together) that lives at the same address where all the other members are already linked and confirmed, we can assume that we have found the correct candidate and, after checking, we confirm this link as correct as well. As can be guessed, this process is not free of exceptions; for this reason we repeat it for different kinds of human groups, which imply different probabilities (see table 5).

6. Manual search: to end the process, the rejected persons are searched for manually in the Population Register database.
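The decision logic of steps 1-5 can be sketched in R as follows; the formula mapping the four distances to a level is an illustrative assumption, not the actual INE segmentation map.

```r
# Each candidate carries four 0-100 distances (name, birth date, birthplace,
# residence). A level is derived from them and drives acceptance, human-group
# confirmation or rejection.
decide <- function(d_name, d_birth, d_place, d_resid, in_confirmed_group = FALSE) {
  level <- floor(min(d_name, d_birth, d_place, d_resid) / 10)  # crude proxy for the map
  if (level >= 9) return("accept")
  if (level >= 6 && in_confirmed_group) return("accept after human-group check")
  if (level >= 6) return("doubtful: group check / manual search")
  "reject"
}

decide(100, 100, 100, 100)                           # perfect agreement
decide(100, 100, 100, 80)                            # doubtful under this proxy
decide(100, 100, 100, 80, in_confirmed_group = TRUE) # confirmed via the human group
decide(60, 100, 50, 40)                              # rejected
```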

1.5. Final results

The final output of this linkage process is 95.6% of automatically and correctly linked persons in the first quarter of 2011; in coming quarters the aim is to reach 97-98%. At the same time, the process described brought an important saving in processing time, reducing the time needed to complete the process to 4 months once the annual sample data are available (end of January of year N+1 for the survey of year N) (see figure 6). Another key point is the good quality indicators obtained: according to our analysis, we estimate a precision of 0.9890, with a match error of only 0.0111.

2. Case study of Spanish LFS variable “wages from the main job” (integrating administrative data into the data collection)

2.1. Requirements and conditions of the variable

Regulation (EC) No 1372/2007 of the European Parliament and of the Council establishes the mandatory inclusion of the variable “wages from the main job” in the Community Labour Force Survey, amending accordingly the basic LFS regulation, Reg. 577/1998. This will improve the analytical potential of the survey by introducing the level of wages as a classification variable in the analysis of the characteristics of the main job for employees. Commission Regulation No 377/2008 stated that the variable must be coded into deciles and allowed the information to be captured either through interviews or by using administrative records.

In the Labour Force Survey of Spain, after several qualitative tests (the last one in 2003, financed by the Commission under grant number 2002-32100015), it was considered very problematic to include additional questions in the survey to request information on wages. The main concerns were the lack of reliability of the information obtained by interview on this topic (the studies carried out and the experience of income surveys showed that respondents did not provide good quality answers on income) and that the reluctance of respondents to answer income questions might eventually spread to the rest of the labour status questions. Additional concerns were detected in telephone and proxy interviews, both of which are very frequent in the Spanish LFS. Taking these problems into account, it was decided to look into the possibility of obtaining the salary information from administrative sources.

The main advantages envisaged for using administrative records to estimate the variable were, first, that it would not increase the burden on informants and, secondly, that the survey as a whole would not be affected by a lower response rate. The principal drawback is that it takes more time to capture the data, since this depends on when the administrative records become available. Another possible drawback is the need to adapt the procedure should the characteristics of such records change over time.

2.2. Information gathering

Unfortunately, and not surprisingly (otherwise the exercise would have been undertaken before), there is no single administrative source that meets a suitable definition and can be managed in a straightforward way.

What we found were several administrative sources with different methodologies and limited coverage. In trying to find the best estimate of the target variable, we had to obtain information from various administrative records, combining their data with the information of the LFS. Therefore, the estimate of the wage from the main job is what the statistical literature terms a “derived variable”1. Since the need for information cannot be met immediately by direct reference to the information available, this variable is obtained by linking different sources that provide the required information, if not with absolute precision then at least with a good approximation.

Following this methodology, two main sources of economic and labour information have been used to estimate the salary: on the one hand, information on affiliation and contribution bases to the Social Security System from the records of the General Treasury of Social Security (Form TC-2); on the other hand, the information on income from the annual statements of withholdings and advance payments on account of personal income tax declared to the tax agencies (Form 190)2. Beforehand (part 1 of this paper deals with this issue), it was necessary to set up a procedure that allowed us to assign a (correct) identifier to each of the respondents in the survey, in order to transfer and accumulate the needed information across the different sources. Some details of the processes followed are described below (see figure 7).

2.3. The link to the register of ‘Affiliations and Social Security Contributions’

The link to the General Treasury of Social Security files allows us to determine the main job affiliation in the reference week of the survey and to obtain two pieces of information:

• The Business Identification Number (BIDN) of the principal employer in the reference week. This BIDN makes it possible to continue linking with both the tax administration and the social security contributions databases.

• The main characteristics of the contract(s). In particular, the number of days worked is crucial to determine the monthly salary, either over the whole reference year for annual totals or referred to the month of the reference week for monthly amounts.

To do this, first the affiliations under special schemes for self-employment or those belonging to special trading agreements are excluded (these people are affiliated but are not actually working). Then the affiliation corresponding to the reference week is assigned. If there are several affiliations for the same worker in that week, one must be chosen as the ‘main’ one: the job selected is the one whose characteristics (activity of the establishment, duration of the contract, seniority in it, etc.) most resemble those declared in the LFS questionnaire. Once the affiliation for the main job in the reference week has been established, both the Business Identification Number (BIDN) of the employer and the number of days affiliated in the year and in the month of the reference week are recorded. Other affiliation circumstances that may affect the estimation of the monthly salary from the annual total are also considered.
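One way to read the selection of the ‘main’ affiliation is as a simple scoring of the candidate jobs against the characteristics declared in the LFS; the weights, variables and toy data below are illustrative assumptions, not the actual selection rules.

```r
# Score candidate affiliations by agreement with the LFS answers and keep the best one.
pick_main_affiliation <- function(cands, lfs) {
  score <- 2 * (cands$activity == lfs$activity) +
           1 * (cands$contract == lfs$contract) +
           1 * (abs(cands$seniority_years - lfs$seniority_years) <= 1)
  cands[which.max(score), ]
}

cands <- data.frame(bidn = c("B1", "B2"),
                    activity = c("47", "56"),
                    contract = c("temporary", "permanent"),
                    seniority_years = c(0.5, 4))
lfs <- list(activity = "56", contract = "permanent", seniority_years = 3)
pick_main_affiliation(cands, lfs)   # keeps the affiliation with employer B2
```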

Some employees in the public sector, who do not have contributions withheld for the Social Security system but contribute through their own mutual funds, must be dealt with specifically. Given the expected stability of their employment, it is possible to estimate the number of days worked in the year from information derived from the LFS questionnaire, although the Business Identification Number (BIDN) of the principal employer may not be available in the Social Security databases. This job is assumed to be unique and at least the largest in terms of revenue for the employee. This hypothesis is validated after cross-checking with the tax agencies.

1 On this issue, the approaches described in J.K. Tonder, Register-based Statistics in the Nordic Countries: Review of Best Practices with Focus on Population and Social Statistics, and in A. Wallgren and B. Wallgren, Register-based Statistics: Administrative Data for Statistical Purposes, have been considered.

2 State Tax Administration Agency (AEAT) and Navarre. In the period 2006-2009 it was not possible to obtain data from the regional Basque Treasuries.


Since contribution bases are recorded for each calendar month of the reference year, different calculations can be obtained:

• An estimate can be obtained from the social security contribution base of the month of the reference week. In this case the base is multiplied by the ratio between the number of days in the month and the number of days affiliated in the reference month.

• The annual ‘average’: the total contribution base of the reference year is divided by twelve and multiplied by the ratio between the number of days in the year and the number of days in that year affiliated with the principal employer of the reference week.
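The arithmetic of these two estimates reduces to simple prorations; a minimal sketch in R (function and argument names are illustrative, and the real computation also handles contribution limits and special cases):

```r
# Method 1: contribution base of the reference month, prorated to a full month.
monthly_wage_m1 <- function(base_month, days_in_month, days_affiliated_month) {
  base_month * days_in_month / days_affiliated_month
}

# Method 2: annual contribution base divided by twelve, prorated by the days
# affiliated with the principal employer of the reference week.
monthly_wage_m2 <- function(base_year_total, days_in_year, days_affiliated_year) {
  (base_year_total / 12) * days_in_year / days_affiliated_year
}

monthly_wage_m1(base_month = 1200, days_in_month = 30, days_affiliated_month = 20)       # 1800
monthly_wage_m2(base_year_total = 14400, days_in_year = 365, days_affiliated_year = 200) # 2190
```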

Some limitations in the calculation of wages by these two methods are:

• Contribution bases have both maximum and minimum limits, which makes the estimation difficult, especially in the case of the maximum limit.

• It is not applicable to employees in mutual funds outside the General Social Security System, for example public servants.

• There can be two different monthly contribution bases. The contribution base for common contingencies does not include overtime pay, so, whenever possible, we use the quota for work accidents and occupational diseases, which does incorporate overtime.

2.4. The link with ‘Annual registration statements of income and deductions and income tax revenue on account’

The pair formed by the “Personal Identification Number” (PIDN) of the employee and the “Business Identification Number” (BIDN) of the employer is linked to the annual statements of income, deductions and income tax payments on account held by the tax agencies, in order to obtain the “full annual return”. This annual information (the only information available from the Spanish Tax Administration) must be converted into monthly estimates.

Once the link has been successfully achieved and the information obtained has been checked, an estimate of the monthly salary can be made by dividing the annual full return by twelve and multiplying by the ratio between the number of days in the year and the number of days in that year affiliated to Social Security with the principal employer of the reference week. The following limitations of this third estimation method must be noted:

• Some extra pay components (severance payments beyond the legally established amounts, arrears, etc.) may be included in the full annual return of the reference year that do not correspond to the targeted ‘monthly wage’ variable.

• This is an estimate of the wage for the whole year and not for the month of the reference week, and the working conditions in the same company may have changed during the year (part-time to full-time or vice versa, change of occupation, etc.), which may affect the wage in other months of the year.

• The tax administration in Spain is split into different agencies that must be dealt with independently (the main source of information is the national tax agency, but there are also four so-called ‘foral’ administrations).

2.5. Integration, editing and imputation

As described above, in many cases it is possible to estimate salaries by several methods using the information available in the administrative records and the LFS. This enriches the possibilities for editing. In the rare event of discrepancies between the different methods (see figure 8), we must first determine which is the most suitable estimate of income among all those available and validate it as the best one. Thus, the final estimated salary is obtained through a combination of all the sources used and does not correspond exactly to the information received from any single one of them.

For those employees whose salary could not be established from administrative records, or whose estimate was not considered sufficiently reliable, an imputation is made using the distribution of wages by type of working time (full-time or part-time) and occupation (three-digit classification according to ISCO).
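A random within-cell donor is one simple way to reproduce the wage distribution within these classes; the sketch below in R is an illustration only, as the exact imputation method is not fully specified here.

```r
# Draw a donor wage at random within the (working time x ISCO) cell of each
# record whose wage is still missing.
set.seed(42)
wages <- data.frame(time_type = c("FT", "FT", "FT", "PT", "PT"),
                    isco3     = c("522", "522", "522", "911", "911"),
                    wage      = c(1500, 1650, NA, 700, NA))

for (i in which(is.na(wages$wage))) {
  same   <- wages$time_type == wages$time_type[i] & wages$isco3 == wages$isco3[i]
  donors <- which(same & !is.na(wages$wage))
  if (length(donors) > 0)
    wages$wage[i] <- wages$wage[donors[sample(length(donors), 1)]]
}
print(wages)
```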

2.6. Encoding

Finally, the wages are sorted and coded into deciles from “01” to “10”, where decile “01” corresponds to the 10 per cent of employees receiving the lowest wages and decile “10” to the 10 per cent of employees receiving the highest wages (see table 9 and the selected graphics in figure 10). From the results by deciles, some interesting indicators can be calculated (see table 11).
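The coding step itself is straightforward; a minimal unweighted sketch in R (in production, survey weights would normally enter the decile computation):

```r
# Code wages into deciles "01" (lowest-paid 10%) to "10" (highest-paid 10%).
code_deciles <- function(wage) {
  brk <- quantile(wage, probs = seq(0, 1, 0.1), na.rm = TRUE)
  dec <- cut(wage, breaks = unique(brk), include.lowest = TRUE, labels = FALSE)
  sprintf("%02d", dec)
}

set.seed(7)
w <- round(rlnorm(1000, meanlog = 7.3, sdlog = 0.4))  # toy wage distribution
table(code_deciles(w))
```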

3. Figures and tables

Figure 1: Historical view of searching and linking process in LFS-Spain

Figure 2: Searching criteria used in 1Q-2011 LFS-Spain


Figure 3: Segmentation map of linkage process. LFS-Spain

Figure 4: Confirmation process of candidates selected using human groups variables (3rd dimension).

Table 5: Typologies of human groups used

Figure 6: Development in the process time and correct automatic ratio– LFS Spain

Figure 7: Flow chart of the estimation process in the LFS variable: “Wages from the main job”

[Figure 7 is a flow chart; only its main elements are reproduced here. The Social Security source (monthly statements and payments to the General Treasury of Social Security, Form TC2) provides the monthly gross wage amount, the days of social contribution and the part-time rate, feeding estimation methods 1 and 2 (monthly or yearly reference period). The Personal Income Tax source (annual statements to the tax agencies of withholdings and advance payments on account, Form 190) provides the yearly gross wage amount, feeding estimation method 3 (yearly reference period). Both sources carry personal data, the PIDN and the BIDN. The LFS yearly subsample of employees, linked through the Population Register, enters the estimator selection and validation steps (score method), followed, where necessary, by imputation according to the main job characteristics (full time/part time and ISCO occupation). The result is the monthly gross pay from the main job (MGPMJ) variable, coded into deciles from 01 (lowest) to 10 (highest) and stored in the LFS variable INCDECIL.]

Variables for record linkage: (1) personal data of the employee (name and date of birth); (2) Personal Identification Number (PIDN) of the employee; (3) Business Identification Number (BIDN) of the employer. Variables for choosing the best estimation method, validation and imputation: labour market characteristics from the LFS: (1) full time/part time; (2) occupation (ISCO coded at 2 or, if possible, 3 digit level).

Figure 8: Social security vs. tax agencies estimation (2009, employees LFS subsample). [Scatter plot, not reproduced, comparing the monthly gross pay from the main job estimated from Source 1 (General Treasury of Social Security) and from Source 2 (tax agencies).]

Table 9: Deciles by full time / part time and occupation (ISCO - coded at 1 digit level)

2009 – employees LFS subsample. Monthly (gross) pay from the main job, by decile (counts of employees; '.' = no cases).

                        N      D1     D2     D3     D4     D5     D6     D7     D8     D9    D10
Total              32.747   3.082  3.089  3.196  3.135  3.157  3.254  3.371  3.346  3.527  3.590
Full time jobs     28.353     332  2.162  2.846  3.039  3.099  3.189  3.290  3.302  3.504  3.590
  ISCO 0              236       .      3      1     15     34     32     32     24     58     37
  ISCO 1              868       .     13     12     27     18     28     49     93    145    483
  ISCO 2            4.681      17     34     45     70     81    134    264    621  1.431  1.984
  ISCO 3            3.852      20    151    256    247    312    394    656    695    603    518
  ISCO 4            3.040      31    240    326    317    326    497    467    405    249    182
  ISCO 5            4.680      98    606    820    740    564    490    502    386    350    124
  ISCO 6              334       4     48     68     57     51     60     19     22      5      .
  ISCO 7            4.213      16    159    367    668    819    664    581    487    305    147
  ISCO 8            3.066      14    188    296    341    406    527    470    408    306    110
  ISCO 9            3.383     132    720    655    557    488    363    250    161     52      5
Part time jobs      4.394   2.750    927    350     96     58     65     81     44     23      .
  ISCO 0                2       1      .      .      .      1      .      .      .      .      .
  ISCO 1               26      16      3      1      1      .      2      3      .      .      .
  ISCO 2              447      98     87     39     38     28     40     50     44     23      .
  ISCO 3              498     263    111     43     13     17     23     28      .      .      .
  ISCO 4              495     236    171     43     33     12      .      .      .      .      .
  ISCO 5            1.285     949    235    101      .      .      .      .      .      .      .
  ISCO 6               18      13      3      2      .      .      .      .      .      .      .
  ISCO 7              101      63     31      4      3      .      .      .      .      .      .
  ISCO 8              124      77     30      9      8      .      .      .      .      .      .
  ISCO 9            1.398   1.034    256    108      .      .      .      .      .      .      .

Figure 10: Selected graphics on decile main job wage. 2009 data for Spain

Table 11: Average wages calculated from deciles, by sex. Gender Pay Gap calculation.

2006-2009 series          2006        2007        2008        2009
Total                 1.570,66    1.635,89    1.771,55    1.811,48
Males                 1.724,31    1.796,86    1.961,31    2.015,79
Females               1.365,87    1.420,11    1.534,60    1.576,09
Gender Pay Gap           79,21       79,03       78,24       78,19

(Averages refer to the monthly gross pay from the main job; the Gender Pay Gap row corresponds to the ratio of the female to the male average, in per cent.)

4. References

Official Journal of the European Union (2007) Regulation (EC) No 1372/2007 of the European Parliament and of the Council of 23 October 2007 amending Regulation (EC) No 577/98 on the organisation of a labour force sample survey in the Community.
National Statistics Institute of Spain (2008) Labour Force Survey. Methodology 2005. Description of the survey, definitions and instructions for completing the questionnaire.
National Statistics Institute of Spain (2008) Labour Force Survey. Methodology 2005. Variables in the subsample.
Tonder J.K. (Coordinator), UNECE (2007) Register-based Statistics in the Nordic Countries. Review of Best Practices with Focus on Population and Social Statistics.
Wallgren A., Wallgren B. (2007) Register-based Statistics. Administrative Data for Statistical Purposes. Statistics Sweden / John Wiley & Sons, Ltd.
