Matching of Census and Administrative Data for Census Data Quality Assurance in the 2011 Census of England and Wales

Louisa Blackwell, Andrew Charlesworth, Nicola Rogers, Richard Thorne

Office for National Statistics, [email protected]
Office for National Statistics, [email protected]
Office for National Statistics, [email protected]
Office for National Statistics, c/o [email protected]

Abstract

This paper describes the role of administrative data matching in the quality assurance process for the 2011 Census in England and Wales. The Census provided a comprehensive snapshot of the population of England and Wales in March 2011 and presented a unique opportunity to explore and understand the quality and completeness of administrative sources. The innovative data matching methods and systems designed for this task are described. A flexible, reactive approach applied analytic methods that were appropriate to the research questions that arose during the census quality assurance process. Record matching and interpretation of the linked data drew on all of the available information from the Census. Matching this rich information to administrative sources provided new insights, both empirical and theoretical, into the relationship between different administrative datasets.

Keywords: Record linkage, data architecture, census validation

1. Introduction

A review of the 2001 Census data quality assurance process concluded that administrative data could potentially have three uses in 2011 (White et al, 2006):

• As an auxiliary variable in coverage assessment and adjustment
• At record level, in triple system estimation
• To help identify, measure and correct for any bias found in the dual-system estimator

If administrative data were used for both estimation and quality assurance, the independence of the quality assurance process could be compromised.
This was avoided during 2011 Census processing by reserving the use of administrative sources for quality assurance, with the possibility of using them to calibrate census estimates in the event of a failure of the census field operation or if the coverage adjustment process did not provide robust and plausible estimates.

The 2007 Statistics and Registration Service Act opened up new opportunities for ONS to access administrative microdata from other government departments for the purpose of population estimation. Through these and other provisions, the following were available at record level for census quality assurance: Patient Register, School Census (England and Wales), Higher Education Statistics Agency (HESA) data, the Department for Work and Pensions' Migrant Worker Scan, Birth Registrations, Death Registrations, Electoral Registers and Valuation Office Agency data. These datasets were analysed for quality and completeness, and their formats were standardised.

They were used in two ways. Firstly, they were used at the aggregate level, as comparators for census counts and estimates of addresses, households and individuals. Often the comparison was made at low geographic levels, down to postcodes. Secondly, where the aggregate-level comparison was unable to explain a discrepancy, record matching was carried out.

This report describes the methods, systems and processes used for the matching, and provides an overview of the results and the conclusions they inform. But first we outline some of the challenges to be overcome. These challenges largely stemmed from the awkward reality that the research questions to be addressed by matching could not be known in advance.

2. Building a matching methodology and architecture for Census QA and the imperative for a flexible approach

Administrative data matching systems and methods needed to respond promptly to issues raised by the Census Quality Assurance (QA) process.
This presented a number of conceptual challenges, especially as analysis requirements were unpredictable, generally had short deadlines and were based on data that were not uniformly available at the beginning of the QA process. Flexibility in the data used, the analysis geography, the level of matching and the interpretation of results was therefore fundamental to the design of data matching systems and methods. In addition, the security requirements of working with and storing record-level matched data had to be taken into account. High-quality record-level matching is a time-consuming process. Address matching in high priority areas ahead of QA helped to maximise the number of areas that could be matched during the live operation.

2.1 Challenge 1: Security risks

All Census data processing took place within a secure, closed IT environment. Access to this environment and outputs from it were tightly managed and controlled. However, the security risk posed by matched record-level administrative and census data in combination is higher than that presented by the individual datasets. To mitigate this enhanced risk, the matching architecture stored the results of matching as pairs of anonymised identifiers. These anonymised identifiers were created for all addresses and individuals in each source and are independent of any supplied identifiers such as name, address, National Health Service Number or date of birth. Lookups relating original and newly created identifiers were held separately from attribute data. In this way, the newly combined information was stored so that it could not be attributed to an individual or address without reference to the separately held lookup.

2.2 Challenge 2: Uncertain analytic requirements

It was impossible to predict in advance the issues that record matching would need to address. The geography or population sub-group under consideration would determine which administrative data should be used.
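The storage scheme of Section 2.1 can be illustrated with a minimal sketch. It is not the ONS implementation: the function names, the record layout and the use of random hexadecimal tokens as anonymised identifiers are all assumptions made for illustration. The point it shows is that match results hold only pairs of anonymised identifiers, and that re-identification requires the lookups, which are kept apart from the attribute data.

```python
import secrets


def anonymise(records):
    """Split a source into a lookup and attribute rows keyed by a random ID.

    `records` maps a supplied identifier to a dict of attributes.
    The lookup (supplied ID -> anonymised ID) is returned separately so that
    it can be stored apart from the attribute data. Hypothetical helper.
    """
    lookup = {}
    attributes = {}
    for supplied_id, attrs in records.items():
        anon_id = secrets.token_hex(8)  # random, so unrelated to name, NHS number, etc.
        while anon_id in attributes:    # guard against an (unlikely) collision
            anon_id = secrets.token_hex(8)
        lookup[supplied_id] = anon_id
        attributes[anon_id] = attrs
    return lookup, attributes


def store_matches(supplied_pairs, lookup_a, lookup_b):
    """Convert matched supplied-ID pairs into pairs of anonymised IDs."""
    return {(lookup_a[a], lookup_b[b]) for a, b in supplied_pairs}


def reidentify(pair, lookup_a, lookup_b):
    """Recover supplied IDs for a pair; only possible with both lookups."""
    reverse_a = {anon: supplied for supplied, anon in lookup_a.items()}
    reverse_b = {anon: supplied for supplied, anon in lookup_b.items()}
    return reverse_a[pair[0]], reverse_b[pair[1]]


# Illustrative records only; identifiers and attributes are invented.
census = {"C-001": {"age_band": "20-24"}, "C-002": {"age_band": "40-44"}}
patient_register = {"PR-9": {"age_band": "20-24"}, "PR-7": {"age_band": "60-64"}}

census_lookup, census_attrs = anonymise(census)
pr_lookup, pr_attrs = anonymise(patient_register)

# A matching step (not shown) has linked C-001 to PR-9.
matched = store_matches([("C-001", "PR-9")], census_lookup, pr_lookup)
```

With only `matched` and the attribute tables, a linked pair can be analysed but not attributed to a supplied identifier; `reidentify` works only when both lookups are also available.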
The architecture therefore had to allow matching between all or just some sources, with capacity to add new ones if they became available. Results were held at local authority level (there are 348 local authorities in England and Wales), but could also be held for any geographic subset within a local authority. Using local authorities in this way provided a meaningful search space for record matching, while keeping processing speeds reasonable.

2.3 Challenge 3: Late availability and uneven quality of data

Of the data available for matching, summarised in Table 1, only the Patient Register, Valuation Office Agency and Census Address Register were available from the start of the QA process. Census person data became available as local authorities were processed (mirroring the order in which QA issues were raised), while Census Coverage Survey (CCS) and other Census information were only available late in the QA process. HESA, English and Welsh School Census and Births data were also only supplied to ONS once the QA process had begun, since extracts pertaining to Census day were not immediately available. Electoral Register data were available for most local authorities but these were inconsistently formatted and required substantial cleaning and standardisation. A key requirement for the data matching architecture was the ability to incorporate new data if and when they became available.

This paper focuses on matching between the Patient Register and the Census. However, a key feature of the architecture was the ability to incorporate results from a ‘toolkit’ of matching methods. As different data sources contain different matching variables or are stored at different levels (address or individual), there was not a single method that could be applied to all data.

2.4 Challenge 4: the requirement for timely results

The QA process involved the review and approval of 348 local authority estimates at a series of QA Panels. Some discussions at QA Panels demanded further analysis.
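Returning briefly to the storage requirements of Sections 2.2 and 2.3, the sketch below shows one way a results store could meet them. It is an assumption-laden illustration, not the system ONS built: the `MatchStore` class, its method names, and the area and local authority codes are all hypothetical. It holds pairs of anonymised identifiers per local authority and per pair of sources, allows an optional sub-area tag, and accepts sources that are registered late.

```python
from collections import defaultdict


class MatchStore:
    """Match results grouped by local authority and by pair of sources."""

    def __init__(self):
        self.sources = set()
        # (la_code, source_a, source_b) -> {(anon_id_a, anon_id_b): sub_area}
        self._results = defaultdict(dict)

    def register_source(self, name):
        """Make a source available for matching; can be called at any time."""
        self.sources.add(name)

    def add_matches(self, la_code, source_a, source_b, pairs, sub_area=None):
        """Store matched pairs for one local authority and one source pair."""
        for source in (source_a, source_b):
            if source not in self.sources:
                raise KeyError(f"unregistered source: {source}")
        for pair in pairs:
            self._results[(la_code, source_a, source_b)][pair] = sub_area

    def matches(self, la_code, source_a, source_b, sub_area=None):
        """Return stored pairs, optionally restricted to one sub-area."""
        stored = self._results.get((la_code, source_a, source_b), {})
        return {
            pair
            for pair, area in stored.items()
            if sub_area is None or area == sub_area
        }


store = MatchStore()
store.register_source("census")
store.register_source("patient_register")
store.add_matches("LA001", "census", "patient_register",
                  [("c1", "p1"), ("c2", "p2")], sub_area="AREA_A")
store.add_matches("LA001", "census", "patient_register",
                  [("c3", "p3")], sub_area="AREA_B")

# A source that arrives after matching has started is simply registered later.
store.register_source("school_census")
store.add_matches("LA001", "census", "school_census", [("c1", "s9")])
```

Keying results by local authority mirrors the search space described above, and the sub-area tag stands in for any geographic subset within it.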
Where issues could not be resolved using data at aggregate level, record matching was required. Data matching systems and methods were designed to respond quickly to these requests and processes were automated where possible.

To reduce the turnaround time required for data matching projects, Patient Register addresses were matched to the Census address register in 37 local authorities ahead of the QA process. These local authorities were areas of high population churn, taking into account migration patterns since 2001. As Census processing got underway, they were prioritised by the expected delivery date for their processed person-level data, in anticipation of the order in which they would be considered by the QA Panels. Record matching for each of these 37 local authorities was suspended if they were approved by the QA Panel, and new areas not in the original list of 37 were added as issues arose. By the end of the QA operation, record matching had been done in 55 local authorities. Identifying the more challenging local authorities and address matching in these areas ahead of the live operation allowed preliminary work to proceed in an intelligent way, and maximised the number of local authorities overall that could be matched.

Included in the 55 local authorities were a number of ‘control’ local authorities that were expected to pose little enumeration challenge, for example because they had low levels of international and internal migration. These served two purposes. Firstly, they provided context to results for more challenging areas. Secondly, they provided a validation of the matching methods used.

2.5 Challenge 5: Keeping the scale of the matching task at a manageable level

Some QA issues focussed on small geographic areas or population sub-groups, such as students in communal establishments
