
Matching of Census and Administrative Data for Census Data Quality Assurance in the 2011 Census of England and Wales


Louisa Blackwell, Andrew Charlesworth, Nicola Rogers, Richard Thorne
Office for National Statistics

Abstract

This paper describes the role of administrative data matching in the quality assurance process for the 2011 Census in England and Wales. The Census provided a comprehensive snapshot of the population of England and Wales in March 2011 and presented a unique opportunity to explore and understand the quality and completeness of administrative sources. The innovative data matching methods and systems designed for this task are described. A flexible, reactive approach applied analytic methods that were appropriate to the research questions that arose during the census quality assurance process. Record matching and interpretation of the linked data drew on all of the available information from Census. Matching this rich information to administrative sources provided new insights, both empirical and theoretical, into the relationship between different administrative datasets.

Keywords: Record linkage, data architecture, census validation

1. Introduction

A review of the 2001 Census data quality assurance process concluded that administrative data could potentially have three uses in 2011:
• As an auxiliary variable in coverage assessment and adjustment
• At record level, in triple system estimation
• To help identify, measure and correct for any bias found in the dual-system estimator
(White et al, 2006).

If administrative data were used for both estimation and quality assurance, the independence of the quality assurance process could be compromised. This was avoided during 2011 Census processing by reserving the use of administrative sources for quality assurance, with the possibility of using them to calibrate census estimates in the event of a failure of the census field operation or if the coverage adjustment process did not provide robust and plausible estimates.

The 2007 Statistics and Registration Service Act opened up new opportunities for ONS to access administrative microdata from other government departments for the purpose of population estimation. Through these and other provisions, the following were available at record level for census quality assurance: Patient Register, School Census (England and Wales), Higher Education Statistics Agency data, the Department for Work and Pensions’ Migrant Worker Scan, Birth Registrations, Death Registrations, Electoral Registers and Valuation Office Agency data. These datasets were analysed for quality and completeness, and their formats were standardised. They were used in two ways. Firstly, they were used at the aggregate level, as comparators for census counts and estimates of addresses, households and individuals. Often the comparison was made at low geographic levels, down to postcodes. Secondly, where the aggregate-level comparison was unable to explain a discrepancy, record matching was carried out.

This report describes the methods, systems and processes used for the matching, and provides an overview of the results and the conclusions they inform. But first we outline some of the challenges to be overcome. These challenges largely stemmed from the awkward reality that the research questions to be addressed by matching could not be known in advance.

2. Building a matching methodology and architecture for Census QA and the imperative for a flexible approach

Administrative data matching systems and methods needed to respond promptly to issues raised by the Census Quality Assurance (QA) process. This presented a number of conceptual challenges, especially as analysis requirements were unpredictable, generally had short deadlines and were based on data that were not uniformly available at the beginning of the QA process. Flexibility in the data used, the analysis geography, the level of matching and the interpretation of results was therefore fundamental to the design of data matching systems and methods. In addition, the security requirements of working with and storing record-level matched data had to be taken into account. High quality record-level matching is a time-consuming process. Address matching in high priority areas ahead of QA helped to maximise the number of areas that could be matched during the live operation.

2.1 Challenge 1: Security risks

All Census data processing took place within a secure, closed IT environment. Access to this environment and outputs from it were tightly managed and controlled. However, the security risk posed by matched record-level administrative and census data in combination is higher than that presented by the individual datasets. To mitigate this enhanced risk, the matching architecture stored the results of matching as pairs of anonymised identifiers. These anonymised identifiers were created for all addresses and individuals in each source and were independent of any supplied identifiers such as name, address, National Health Service Number or date of birth. Lookups relating original and newly created identifiers were held separately from attribute data. In this way, the newly combined information was stored so that it could not be attributed to an individual or address without reference to the separately-held lookup.

2.2 Challenge 2: Uncertain analytic requirements

It was impossible to predict in advance the issues that record matching would need to address. The geography or population sub-group under consideration would determine which administrative data should be used. The architecture therefore had to allow matching between all or just some sources, with capacity to add new ones if they became available. Results were held at local authority level, numbering 348 in England and Wales, but could also be held for any geographic subset within the local authority. Using local authorities in this way provided a meaningful search space for record matching, while keeping processing speeds reasonable.
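The storage approach described under Challenge 1 can be illustrated with a minimal sketch. It is not the ONS architecture: the class names `IdentifierLookup` and `MatchStore`, the use of random hexadecimal tokens and the example record keys are all assumptions made for illustration. The point it shows is that match results hold only pairs of anonymised identifiers, while the lookup back to supplied identifiers lives in a separate object.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class IdentifierLookup:
    """Maps supplied record keys to random anonymised identifiers.

    Hypothetical structure: held separately from attribute data and
    from the match results.
    """
    source: str
    _forward: dict = field(default_factory=dict)

    def anonymise(self, supplied_key: str) -> str:
        # A random token carries no information about name, address,
        # NHS number or date of birth.
        if supplied_key not in self._forward:
            self._forward[supplied_key] = f"{self.source}-{secrets.token_hex(8)}"
        return self._forward[supplied_key]


@dataclass
class MatchStore:
    """Match results stored only as pairs of anonymised identifiers."""
    pairs: set = field(default_factory=set)

    def add(self, id_a: str, id_b: str) -> None:
        self.pairs.add((id_a, id_b))

    def matches_for(self, id_a: str) -> list:
        # A one-to-many match is simply several pairs sharing id_a.
        return sorted(b for a, b in self.pairs if a == id_a)


census_lookup = IdentifierLookup("CEN")
pr_lookup = IdentifierLookup("PR")
store = MatchStore()

# One Patient Register record matched to two census records.
pr_id = pr_lookup.anonymise("patient-record-001")
store.add(pr_id, census_lookup.anonymise("census-person-17"))
store.add(pr_id, census_lookup.anonymise("census-person-42"))
```

Because `MatchStore` never sees a supplied key, the matched pairs cannot be attributed to a person or address without access to the separately held `IdentifierLookup` objects.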

2.3 Challenge 3: Late availability and uneven quality of data

Of the data available for matching, summarised in Table 1, only the Patient Register, Valuation Office Agency and Census Address Register were available from the start of the QA process. Census person data became available as local authorities were processed (mirroring the order in which QA issues were raised), while CCS and other Census information were only available late in the QA process. HESA, English and Welsh School Census and Births data were also only supplied to ONS once the QA process had begun, since extracts pertaining to Census day were not immediately available. Electoral Register data were available for most local authorities but these were inconsistently formatted and required substantial cleaning and standardisation. A key requirement for the data matching architecture was the ability to incorporate new data if and when they became available.

This paper focuses on matching between the Patient Register and the Census. However, a key feature of the architecture was the ability to incorporate results from a ‘toolkit’ of matching methods. As different data sources contain different matching variables or are stored at different levels (address or individual), there was not a single method that could be applied to all data.

2.4 Challenge 4: The requirement for timely results

The QA process involved the review and approval of 348 local authority estimates at a series of QA Panels. Some discussions at QA Panels demanded further analysis. Where issues could not be resolved using data at aggregate level, record matching was required. Data matching systems and methods were designed to respond quickly to these requests and processes were automated where possible.

To reduce the turnaround time required for data matching projects, ahead of the QA process Patient Register addresses were matched to the Census address register in 37 local authorities. These local authorities were areas of high population churn, taking into account migration patterns since 2001. As Census processing got underway, they were prioritised by the expected delivery date for their processed person-level data, in anticipation of the order in which they would be considered by the QA Panels. Record matching for each of these 37 local authorities was suspended if they were approved by the QA Panel, and new areas not in the original list of 37 were added as issues arose. By the end of the QA operation, record matching had been done in 55 local authorities. Identifying the more challenging local authorities and address matching in these areas ahead of the live operation allowed preliminary work to proceed in an intelligent way, and maximised the number of local authorities overall that could be matched.

Included in the 55 local authorities were a number of ‘control’ local authorities that were expected to pose little enumeration challenge, for example because they had low levels of international and internal migration. These served two purposes. Firstly, they provided context to results for more challenging areas. Secondly, they provided a validation of the matching methods used.

2.5 Challenge 5: Keeping the scale of the matching task at a manageable level

Some QA issues focussed on small geographic areas or population sub-groups, such as students in communal establishments or babies under the age of one. Where issues were generalised across the population, matching typically focussed on the postcode clusters used for the Census Coverage Survey. The Census Coverage Survey (CCS) is an approximately one per cent sample of the country carried out after the main Census and used to create the Census estimate. The CCS samples postcode clusters within local authorities. Administrative data matching was carried out within these clusters, fulfilling a dual purpose in providing a sample of the local authority and an additional data source for comparison, the Census Coverage Survey. Crucially, by using this sample the data matching team were able to provide analysis on a greater number of areas.

2.6 Challenge 6: Ensuring quality and consistency in record matching

The quality of record matching was monitored and managed through two processes, both involving the use of expert clerical matchers. The first involved a continuous feedback loop of matching best practice for the clerical matching team. An example is the accumulation of knowledge and experience in ethnically-specific naming conventions and variations. The second involved expert matchers’ review of matching decisions by the matching team, using both a random sample and having two matchers complete the same matching. Discrepancies found either through the sample or by reviewing the differences between matchers were addressed through further training and review.

2.7 Challenge 7: Complexity of matching and storing results at both individual and address levels

Storage of the results from matching was complicated by the large number of sources used, the two levels at which matching took place (addresses and individuals) and the reality of one-to-many matches for both addresses and individuals.

One-to-many matches for addresses arose from less precise recording of addresses, for example in the Patient Register. This typically involved sub-divisions within buildings (for example ‘Flat 1’) being omitted from a Patient Register address, and it was therefore possible for a number of addresses in the census, referenced in more detail, to match to a Patient Register ‘shell’ address. One-to-many individual matches arose from multiple enumerations of individuals in the census (discussed more fully in ONS 2012b). In addition, the matching process allowed unmatched addresses to be matched as a result of individual matching, for example where capture errors produced address differences that confounded the address matching first time around.


3. Data available for matching

Table 1: Microdata available for Census Quality Assurance matching

| Data source (date) | Coverage and data units | Supplier | Key variables for data matching (individuals) |
|---|---|---|---|
| GP patient registrations (23 April 2011) | England & Wales; individuals, addresses; microdata | Connecting for Health Systems and Service Delivery, formerly NHAIS | Forename and surname; date of birth, sex; residential address including postcode |
| English/Welsh School Census (January 2011) | England & Wales; all maintained schools, special and non-maintained special schools; individuals, addresses (England only); microdata | England: Department for Education (DfE); Wales: Welsh Government (WG) | Forename and surname; date of birth, sex; residential and educational institution postcodes; England: residential address including postcode |
| HESA Student Record (academic year 2010/11) | England & Wales; individuals; microdata | Higher Education Statistics Agency (HESA) | Forename and surname; date of birth, sex; term-time and domicile postcodes |
| Live Births Register (extract year up to census day) | England & Wales; individuals aged under 1 at Census; microdata | Office for National Statistics (ONS) | Forename and surname; date of birth, sex; birth mother name; birth mother date of birth; postcode for place of birth and mother’s residence at time of birth |
| Deaths (individuals with death date 2 years prior to census) | England & Wales; microdata | Office for National Statistics (ONS) | Forename and surname (including aliases); date of birth, sex; address including postcode of residence at time of death |
| Electoral Registers (December 2010) | England & Wales; individuals, addresses; microdata | Local authorities in England & Wales | Forename and surname; date of birth (only for attainers, approaching 18 years of age); address including postcode |
| Valuation Office Agency Council Tax (September 2010) | England & Wales; residential addresses; addresses; microdata | Valuation Office Agency (VOA) | n/a, no individual-level information |


The administrative data used for matching to support census quality assurance are listed in Table 1. Following an assessment of quality and utility for record matching, it was decided not to use the Migrant Worker Scan data for matching. This was because names were not supplied with the data, and because address information was found to be untimely and often not a residential address.

The census data used in record matching included:
• Census responses (both household and individual questions, taken after all initial data cleaning)
• Census address register (the version generated at the time the questionnaires were printed was used initially, then the final version incorporating additions from the Census operation was used)
• Census Address Register History File (ARHF)
• Census ‘Associated Address’ records, including Usual Address One Year Ago responses, Second Residence Addresses (including Students’ Term-time Addresses) and Visitors’ Usual Residence
• Census Management Information System (CMIS) field operation data, including ‘dummy form’ information supplied by enumerators for non-responding households
• Census images
• Census Coverage Survey responses

4. The matching process

4.1 Data Preparation

Each administrative dataset was standardised and cleaned, including de-duplication of records, checking and aligning variable formats, checks for coding inconsistencies and checking the number of unknown or missing values for each variable. The administrative sources were found to need varying amounts of preparation. The most resource-intensive were the Electoral Registers. Maintained and supplied by individual local authorities, these were found to be held in a wide range of formats. Some of the standardisation could be automated but there were also rare and unique differences between the files which required manual intervention to correct.

Addresses were georeferenced using the software package ‘Matchcode’, supplied by Capscan. An early evaluation found that data capture errors could lead to details such as subdivisions within properties (for example, Flat 8) being dropped from the standardised address, and we therefore also retained the original addresses for further reference.

In some cases it was necessary to align the administrative sources to census definitions, for example HESA (Higher Education Statistics Agency) data. The HESA data record all students on a course at an institution within an academic year, regardless of the course duration, and individuals may have multiple instances within an institution in the same academic year. To align with census student definitions, a subset of HESA records with start dates before March 27th 2011 (census day) and end dates after, or continuing, were used. Rules to prioritise multiple records were applied to select just one for matching.
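The HESA alignment rule can be sketched as follows. The date filter follows the text; the prioritisation rule does not, because the paper does not state which rules were applied. Keeping the most recently started record is an assumption made purely for illustration, as are the field names `student_id`, `start` and `end` and the example records.

```python
from datetime import date
from typing import Optional

CENSUS_DAY = date(2011, 3, 27)


def active_on_census_day(start: date, end: Optional[date]) -> bool:
    """True if the course started before census day and ends after it,
    or is still continuing (end is None)."""
    return start < CENSUS_DAY and (end is None or end > CENSUS_DAY)


def select_one_per_student(records):
    """Keep one census-day record per student.

    ASSUMPTION: where several records qualify, the most recently started
    one is kept. The actual prioritisation rules are not given in the paper.
    """
    chosen = {}
    for rec in records:
        if not active_on_census_day(rec["start"], rec["end"]):
            continue
        best = chosen.get(rec["student_id"])
        if best is None or rec["start"] > best["start"]:
            chosen[rec["student_id"]] = rec
    return list(chosen.values())


# Made-up records: S1 has two qualifying instances, S2 finished before
# census day, S3 started after it.
records = [
    {"student_id": "S1", "start": date(2009, 9, 21), "end": date(2012, 6, 30)},
    {"student_id": "S1", "start": date(2010, 9, 20), "end": None},
    {"student_id": "S2", "start": date(2010, 9, 20), "end": date(2011, 1, 31)},
    {"student_id": "S3", "start": date(2011, 4, 4), "end": None},
]
selected = select_one_per_student(records)
```

In this example only student S1 survives the filter, represented by a single record.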


Figure 1: Data matching process
[Flow chart of five stages: 1. Data Preparation; 2. Address Matching; 3. Person Matching; 4. Residual Resolution - Persons; 5. Data Analysis - Persons and Addresses.]

4.2 Address matching

Addresses in the Patient Register, Electoral Register, Valuation Office Agency data and the English School Census were matched against those in the Census address register within Census Coverage Survey postcode clusters in selected local authorities. For each source, the first stage of address matching involved exact matching between sources. Only three components of the addresses were used: Flat Number/Property subdivision/House name, House Number and Road, and Postcode. Variables with low discriminatory power such as ‘town’ were excluded as they could only introduce error.

To improve automatic match rates, a second stage used ‘Term Frequency Inverse Document Frequency’ (tf.idf) matching, which assigns a weight to each matched pair of words in a pair of addresses depending on how commonly the words within the addresses appear in each of the datasets (discussed more fully in Winkler 2006, see also Li et al 2010). Tf.idf matching used all available address elements. Matched records incorporating ‘Hill Street’ would have a lower weight than ‘Segensworth Road’, due to the rarity of ‘Segensworth’. Scores for each address are weighted according to the number of words included in the address. The best-scoring candidate match for each administrative source address was referred for clerical review and confirmation.

A third stage of address matching involved a clerical matcher searching for an address match, firstly within the given postcode and then across the local authority as a whole.
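The second, tf.idf stage can be illustrated with a simplified sketch. It is not the ONS implementation: the scoring formula is an assumption (the summed inverse document frequency of shared words, divided by the word count of the longer address), and the addresses and postcode are invented. It shows why a shared rare word such as ‘Segensworth’ contributes more to a match than a common one such as ‘Hill’.

```python
import math
from collections import Counter


def tokens(address: str) -> list:
    """Upper-case the address and split it into words."""
    return address.upper().replace(",", " ").split()


def idf_weights(addresses) -> dict:
    """Inverse document frequency of each word across all addresses."""
    n = len(addresses)
    doc_freq = Counter(word for a in addresses for word in set(tokens(a)))
    return {word: math.log(n / df) for word, df in doc_freq.items()}


def score(a: str, b: str, idf: dict) -> float:
    """Sum the weights of shared words, scaled by address length.

    ASSUMPTION: dividing by the longer address's word count stands in for
    the paper's weighting by number of words; the exact formula is not given.
    """
    ta, tb = set(tokens(a)), set(tokens(b))
    shared = ta & tb
    return sum(idf.get(w, 0.0) for w in shared) / max(len(ta), len(tb))


def best_candidate(admin_address: str, census_addresses, idf: dict) -> str:
    """Return the best-scoring census address, to be sent for clerical review."""
    return max(census_addresses, key=lambda c: score(admin_address, c, idf))


# Invented census address register entries within one postcode.
census = [
    "1 HILL STREET PO15 5AA",
    "2 HILL STREET PO15 5AA",
    "1 SEGENSWORTH ROAD PO15 5AA",
    "3 HILL ROAD PO15 5AA",
]
idf = idf_weights(census)
candidate = best_candidate("1 Segensworth Rd, PO15 5AA", census, idf)
```

The administrative address fails an exact match because ‘Rd’ differs from ‘ROAD’, but the shared rare word still makes the Segensworth Road entry the top candidate. Words that appear in every address, such as the postcode parts here, receive a weight of zero.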


Inaccuracies in recording addresses led to some addresses being falsely unmatched. Some of these addresses were subsequently matched through person matching. Where individuals living in unmatched addresses were matched, a check was made to see if these were falsely unmatched addresses due to data discrepancies. Finally, in addition to searching for matches within census data, the Address Register History File (ARHF) was also checked. The ARHF contained addresses that had not been sent a census questionnaire, for example because they were commercial addresses or known to be derelict buildings.

4.3 Person matching

Individuals within the Patient Register were matched against census records. Unmatched patient registrations were then searched for within the Electoral Register, School Census and HESA data. The rationale for focussing on Patient Register matching was that the Patient Register was the only record-level source available to us with near-universal coverage. Understanding comparability between the census and the Patient Register would be most informative in terms of understanding coverage of the respective sources. We also anticipated that local authorities would query census estimates that fell below the number of patient registrations, so it was important to understand the characteristics and geography of Patient Register list inflation.

As with address matching, the first stage of person matching was exact matching, using forename initial, the first three characters of surname and full date of birth (dd/mm/yyyy). Then, within matched addresses, the match criteria were relaxed to forename initial or a SPEDIS value of less than 100, first three characters of surname or a SPEDIS value of less than 100, and two of the three date of birth elements matched. SPEDIS is a measure of how close the spellings of two words are (Gershteyn, 2000). Within matched postcodes, match criteria were the first three characters each of forename and surname and two of three elements of date of birth. Searching more widely within CCS postcode clusters within a local authority, forename, surname, date of birth and sex all needed exact matching.

The next stage of person matching adopted probabilistic techniques. Two strategies were used. The first looked within local authorities at individuals with the same day and year of birth and sex, then matched records using month of birth, exact forename and surname with a qgram threshold of 0.4 or above (the code for this is available from ONS on request). Qgrams measure the level of agreement between groups (in our case, pairs) of characters within the two character strings being compared. The second strategy required exact surname matches and forenames with a qgram threshold of 0.4 or above.

Among these matching strategies, all of the exact matches were recorded without further scrutiny. Where individuals were matched within matched addresses, these were referred for clerical confirmation where there were name discrepancies, where sex was uncoded and where there was error in dates of birth. All matches within postcodes and local authorities were referred to clerical matchers for review, as were duplicate matches and all matches identified through the probabilistic strategies.

Remaining unmatched patient registrations were searched for clerically, firstly across the local authority and secondly through ‘associated address’ information. This involved looking at questionnaires where census respondents had listed the Patient Register address as their usual address one year ago, as a second residence or as a usual residence for visitors. Clerical matchers were able to carry out free text searches on name and address and any combination of day, month and year. A further person matching stage involved searching, using a combination of fuzzy matching algorithms, across England and Wales as a whole. Finally, to identify census matches missed because of potential data scanning error, census form images were checked.

Any remaining unmatched patient registrations were matched against the Electoral Register, HESA and School Census data. Presence on any of these sources was used to ‘confirm’, for analytical and reporting purposes, an NHS patient missed by the census.
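Two of the comparisons described in this section can be sketched in a few lines: the first-pass exact key and the q-gram comparison used in the second probabilistic strategy. This is an illustration, not the ONS code. In particular, the q-gram score here is a Dice coefficient on sets of character pairs, which is an assumption, since the paper does not say which q-gram measure its 0.4 threshold refers to. SPEDIS is a SAS function and is not reproduced. The function names and example records are hypothetical.

```python
def exact_key(forename: str, surname: str, dob: str) -> tuple:
    """First-pass key: forename initial, first three characters of
    surname, and full date of birth (dd/mm/yyyy)."""
    return (forename[:1].upper(), surname[:3].upper(), dob)


def dob_elements_agreeing(dob_a: str, dob_b: str) -> int:
    """Count agreeing day, month and year elements of two dd/mm/yyyy dates."""
    return sum(x == y for x, y in zip(dob_a.split("/"), dob_b.split("/")))


def qgrams(text: str, q: int = 2) -> set:
    """Set of overlapping character q-grams (pairs by default)."""
    text = text.upper()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Share of q-grams in common, between 0 and 1.

    ASSUMPTION: a Dice coefficient on q-gram sets; the paper does not
    state which q-gram score its 0.4 threshold applies to.
    """
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def probabilistic_candidate(pr: dict, census: dict, threshold: float = 0.4) -> bool:
    """Second probabilistic strategy: exact surname, similar forename."""
    return (pr["surname"].upper() == census["surname"].upper()
            and qgram_similarity(pr["forename"], census["forename"]) >= threshold)


# Invented records for the same person with a forename spelling variant.
pr = {"forename": "Katherine", "surname": "Smith", "dob": "05/11/1984"}
cen = {"forename": "Catherine", "surname": "Smith", "dob": "05/11/1984"}

missed_by_exact = exact_key(**pr) != exact_key(**cen)
caught_by_qgram = probabilistic_candidate(pr, cen)
```

The pair fails the exact pass because the forename initials differ, but the forenames share seven of their eight character pairs, giving a similarity of 0.875 and making the pair a candidate for clerical review.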

4.4 Residual Resolution - Persons

This was an analytic stage that used all available information to assess whether patient registrations that remained unmatched were still resident at the address provided in the Patient Register, or elsewhere in England and Wales. The analysis included checking to see whether patient registrations in 2011 had been de-registered by 2012, although this information was of limited value as there was no indication of the date on which the de-registration took place; someone de-registered by April 2012 may have been present on census day, 27 March 2011. A check was made for duplicate records within the Patient Register. In addition, the England and Wales database was searched for residual records at new addresses. Finally, all available information for the addresses and households that unmatched patient registrations were associated with was checked for evidence that these were ‘ghost’ registrations, left at an old address because the individual had either moved house and not registered with a new GP or had left the country.

5. Results and further analysis

In addition to the individual data discrepancies we were able to investigate using record matching, some general findings emerged. The most important of these for Census QA was in relation to the Patient Register, which has been used by a number of local authorities to question their Census estimates. The matching found a clear pattern of Patient Register list inflation that varied by age and sex and in different types of geography. Census and the Patient Register were largely in agreement for under-18s and over-65s. Among the student-age population, there was often an undercount in the Patient Register, which could be explained by students moving away from home and not yet registering with a GP, though this was not consistent across all university towns. In university towns and particularly in Inner London, there were Patient Register overcounts among young adults, which continued but declined with age, and this was more marked for men than for women. This can be explained by a lack of timeliness in the Patient Register, with men less likely than women to update their patient registrations when they move. Further matching sought to bring all available evidence to bear on unmatched Patient Register residuals to assess the likelihood that they were present and missed by Census, or that they were no longer there. Results will be published more fully during Spring 2013.


6. Conclusions

Quality assurance of the 2011 Census used record-level matching to help explain discrepancies that could not be resolved at the aggregate level. This uncovered new evidence of the relationship between Census and other administrative sources. Most usefully, we gained greater understanding of the relationship between Census and the Patient Register. The latter is often used as a comparator for Census as it has near-universal coverage. Record matching was labour-intensive and reserved as a contingency. The imperative for timely results during the Census QA process, together with uncertainty about the data to be matched and areas to be investigated, meant that systems and processes necessarily incorporated the ability to respond flexibly to the research questions that arose. The data architecture that was developed for Census and administrative data matching was fit for purpose and delivered valuable evidence, both for Census QA and more widely, to inform future record matching. For the future, there is a requirement to identify and develop a more robust system for storing multi-level data, with the possibility of one-to-many matches between records and data levels, which can provide the same functional flexibility.

References

Fellegi, I. and Sunter, A. (1969) ‘A Theory for Record Linkage’, Journal of the American Statistical Association, 64(328): 1183-1210.

Gershteyn, Y. (2000) Use of SPEDIS Function in Finding Specific Values. Proceedings of the 25th Annual SAS® Users Group International Conference. Available at: http://www2.sas.com/proceedings/sugi25/25/cc/25p086.pdf

Li, D., Wang, S. and Mei, Z. (2010) Approximate Address Matching. In 2010 International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (pp. 264-269).

ONS (2012a) Within-household Bias Adjustment. Available at: http://www.ons.gov.uk/ons/guide-method/census/2011/census-data/2011-census-data/2011-first-release/first-release--quality-assurance-and-methodology-papers/within-household-bias-adjustment.pdf

ONS (2012b) Overcount Estimation and Adjustment. Available at: http://www.ons.gov.uk/ons/guide-method/census/2011/census-data/2011-census-data/2011-first-release/first-release--quality-assurance-and-methodology-papers/overcount-estimation-and-adjustment.pdf

White, N., Abbott, O. and Compton, G. (2006) Demographic analysis in the UK Census: a look back to 2001 and looking forward to 2011. 2006 Proceedings of the American Statistical Association, Survey Research Section [CD-ROM], American Statistical Association, Alexandria, VA.

Winkler, W. E. (2006a) Overview of Record Linkage and Current Research Directions. U.S. Bureau of the Census, Statistical Research Division Report. Available at: http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf
