Ref. Ares(2017)1768590 - 03/04/2017

30 December 2016

Eurostat Grant Contract No. 07112.2015.002-2015.358 Czech Republic

Improvement of the use of administrative sources (ESS.VIP ADMIN WP6 Pilot studies and applications)

Final report

1. Introduction

The Czech Statistical Office (CZSO) is working intensively on preparations for the 2021 Population and Housing Census and aims to use administrative data sources to the greatest extent possible. In the 2011 census the CZSO used ISEO/CIS (the central population register) and the statistical register of census units and buildings, but only as additional sources for a full field enumeration. In the long term, the CZSO plans to move to an entirely register-based census. The main reasons for this transition are a reduced respondent burden, lower census costs in the long term, and increased timeliness and frequency of published results. Above all, a full field enumeration may no longer be acceptable to the public or to political representatives in the future. Employing administrative data sources is therefore necessary to guarantee census data production in the long term.

In 2014, the government adopted Resolution No. 734 on Census 2021 preparations and agreed on using existing or new administrative data sources to the greatest extent possible. On the basis of the resolution, an Interdepartmental working group was established, comprising mainly representatives of the ministries that own important administrative data. The grant project followed up on the outcomes of the Interdepartmental working group. The aim was to assess the suitability of the available administrative sources for census needs.

The project was divided into four stages. The first stage focused on getting access to the data sources. It included establishing cooperation with data owners, concluding agreements and transmitting data to the CZSO database. Following a statement of the data protection authority, the data obtained from the sources covered only the population of a defined region (LAU1 Hradec Králové), representing approximately 2% of the whole Czech population.

The second stage was an analysis of the extracted data samples from each source. The data contents, completeness of records, accordance with classifications and code lists, and logical relations between attributes were assessed. Where possible, information from the data sources was compared with a reference data source (mainly the 2011 census).

The third stage comprised linking individual records from the main data source (ISEO/CIS) with the other sources, including verification of the linking process. The linked records made it possible to assess the consistency of information coming from different sources at the individual level.

In the fourth stage, the findings were applied. This mostly involved setting rules (correction of inconsistencies, handling of over-coverage etc.) for the primary data sources in order to create validated, harmonized outputs.
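The checks described for the analysis of the data samples (completeness of records, accordance with code lists) can be illustrated with a minimal sketch. The field names, the sample records and the code list below are hypothetical and are not taken from any of the actual sources; the sketch only shows the kind of counting such an assessment involves.

```python
from collections import Counter

# Hypothetical code list; real nomenclatures are supplied by the data owners.
SEX_CODES = {"M", "F"}

def check_records(records, required_fields, code_lists):
    """Count missing required values and values outside their code list."""
    missing = Counter()
    invalid = Counter()
    for rec in records:
        for field in required_fields:
            if rec.get(field) in (None, ""):
                missing[field] += 1
        for field, allowed in code_lists.items():
            value = rec.get(field)
            # Empty values are counted as missing, not as invalid codes.
            if value not in (None, "") and value not in allowed:
                invalid[field] += 1
    return missing, invalid

# Made-up sample records for illustration only.
sample = [
    {"person_id": "1", "sex": "M", "birth_year": "1980"},
    {"person_id": "2", "sex": "X", "birth_year": ""},
    {"person_id": "3", "sex": "", "birth_year": "1975"},
]
missing, invalid = check_records(sample, ["sex", "birth_year"], {"sex": SEX_CODES})
print(dict(missing), dict(invalid))
```

Keeping missing values and invalid codes in separate counters makes it easier to tell an incomplete source from one that uses a different classification.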

2. Preparation and extracting administrative data

2.1. Background

The first activity of the project was to collect information on the use of administrative data in other countries. Although the practices of countries conducting a register-based or combined census were generally well known, it was very useful to bring the information together in one comprehensive document, including references.

2.2. Consultation with Statistics Netherlands

In August 2016, experts from the Czech Statistical Office made a study visit to Statistics Netherlands (CBS) and attended a course on ‘Using Registers and Administrative Data in the Census’ hosted by Eric Schulte Nordholt.

The last traditional enumeration in the Netherlands was conducted in 1971. Until then, traditional censuses had been performed since 1829 by the Ministry of Home Affairs and, from 1899, by Statistics Netherlands. Because of the unwillingness of respondents to take part in censuses (non-response) and the need to reduce costs, traditional censuses were stopped after the 1971 census. Statistics Netherlands now conducts a register-based census and uses data already available to it, thus placing no burden on individuals. For the 2011 Census, Statistics Netherlands compiled the required census tables by combining existing register data with sample survey data.

There are thirteen basic registers in the Netherlands, eleven of which are functioning and filled with data. The Minister of the Interior is responsible for the system of basic registration, and individual cabinet ministers are responsible for each basic register. For instance, the Minister of the Interior is responsible for the Population Register, the Minister of Economic Affairs for the Register of Enterprises, the Minister of Infrastructure and the Environment for Addresses and Buildings, Real Estate, Topography, Motor Cars, and Subsoil, and the Minister of Social Affairs for Labour. The registers are managed by municipalities (e.g. the Population Register), central government organisations, and the Chamber of Commerce. Data are always kept in only one register; for example, an address is recorded in the Basic Register of Addresses and other registers refer to it. Use of basic register data is compulsory for all governmental organisations.
Organisations are not allowed to ask inhabitants for data already included in the basic registers, such as their birth date or address. Users are obliged to notify the basic registers of alternative data that they consider to be better. In this way, all users of the system contribute to the quality of the data. All objects in the basic registers (persons, enterprises, addresses etc.) have a unique PIN (key). The basic registers are linked to one another through PINs, which means that the statistical data are coherent. Each basic register has its own project board, which operates within the legal framework and ensures that the register data meet legal requirements and are correctly applied. The project board meets 4 to 6 times a year.

Statistics based on the basic registers, including the population and housing census, need only a limited amount of data editing. Unlike traditional enumeration, the register-based census has a very low, almost zero non-response rate. Users can rely on the validity of these statistics. Statistics production based on register data is usually faster and cheaper. On the other hand, the costs of setting up separate registers are high. Compared to traditional census data, the data in basic registers do not always apply to the same point in time. Some data are also delayed, such as the income of self-employed persons. Most of the registers were not established for statistical purposes and therefore do not contain statistical concepts (e.g. households).

Apart from register data, CBS also uses the Labour Force Survey for the census. The survey contains a personal identifier, so it is possible to link economic characteristics from the survey to a person in the Population register. Data from seven registers are used in the census (Population register, Jobs file, Self-employed file, Fiscal administration, Social security administration, Pensions and benefits, Housing register). All data sources, both register and survey data, are linked at record level by the personal PIN, which is later replaced by a different key for confidentiality reasons. Approximately 99% of data are linked.

Micro-integration aims at improving data quality in combined sources by searching for and correcting errors at record level. The most common errors occur in dates (day and month). Collecting data from several sources gives more comprehensive and coherent information on aspects of a person’s life. However, it is necessary to compare the coverage and reliability of the sources, especially in the case of conflicting information. For integration it is essential to always check, adjust and impute data. Only in this way is it possible to ensure optimal use of the information and improve its quality.

One of the presentations dealt with “The System of social statistical datasets (SSD)”. SSD is a central database containing statistical registers: Persons, Relations between persons, Households, Jobs, Self-employment, Social security benefits, State and employer pensions, Income, Education, Hospitalizations, Causes of death, Criminal offense reports, Houses and Vehicles. These registers are linked.
SSD is defined as a set of integrated microdata files with coherent and detailed demographic and socio-economic data on persons, households, jobs and benefits. Development of the system began in the 1990s in order to support the 2001 Population and Housing Census. Before the system could be set up, the relevant legislation, the Statistics Netherlands Act and the Netherlands Data Protection Act, had to be adopted. These Acts authorize CBS to use personal data and oblige CBS to take adequate measures to protect privacy.

Using registers and administrative data in the census has a long-standing tradition in the Netherlands. Register-based censuses have been carried out since 1980. The CZSO may benefit from this experience, especially in the wider use of administrative data sources and in the gradual transition from a traditional to a register-based census.

The study visit consultations helped to identify aspects that lead to large differences between the use of administrative data in the Czech Republic and in the Netherlands:

1. In the Netherlands, unlike the Czech Republic, all essential data are included in the registers. The biggest problem in the Czech Republic is the absence of a register of dwellings (housing units), which makes a census without field enumeration impossible.
2. Statistics Netherlands is entitled to use all register data for official statistics; register holders do not question its need to use the data.
3. The CZSO lags behind in the intensity of cooperation within the Office, especially between statisticians and IT experts. CBS is noticeably more self-sufficient in the IT field; for example, it has software for linking data available. CBS has an Innovation Lab which looks for new solutions in statistics, such as the use of big data.
4. The Dutch basic registers, especially the population register, contain up-to-date data and are regularly and easily updated.
For instance, changing the address of residence in the Netherlands is very simple: it is sufficient to inform the authority via email, and it is not necessary to submit any documents. On top of that, the public and the authorities have more respect for their own legal responsibilities. That is why the quality of Dutch registers is generally very high.

Clearly, these problems cannot be solved within a few years. The transition from a traditional to a register-based census cannot take place very soon. Creating the conditions for a register-based census will require complex changes in many areas. Preparations for the 2021 census must take the next census into consideration; otherwise field enumeration will not be avoidable in 2031 either.
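The record-level linking described for Statistics Netherlands (linking by PIN, then replacing the PIN with a different key) could look roughly like the following sketch. The report does not say how CBS derives the replacement key; the keyed hash used here, the secret and all field names are assumptions for illustration only.

```python
import hashlib
import hmac

# Hypothetical secret; a real key would be stored and managed securely.
SECRET = b"demo-only-secret"

def replacement_key(pin):
    """Derive a stable key from a PIN with a keyed hash (assumed technique)."""
    return hmac.new(SECRET, pin.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def link_by_pin(population, source):
    """Attach source attributes to population records and drop the PIN.

    Returns the linked records and the share of source records that
    found a match in the population register. If the source holds several
    rows per PIN, only the last one is kept in this simplified sketch.
    """
    known_pins = {person["pin"] for person in population}
    by_pin = {row["pin"]: row for row in source}
    matched = sum(1 for row in source if row["pin"] in known_pins)

    linked = []
    for person in population:
        record = {k: v for k, v in person.items() if k != "pin"}
        extra = by_pin.get(person["pin"], {})
        record.update({k: v for k, v in extra.items() if k != "pin"})
        record["key"] = replacement_key(person["pin"])
        linked.append(record)

    link_rate = matched / len(source) if source else 0.0
    return linked, link_rate

# Made-up records for illustration only.
population = [
    {"pin": "100", "sex": "F"},
    {"pin": "200", "sex": "M"},
    {"pin": "300", "sex": "F"},
]
jobs = [
    {"pin": "100", "job": "A"},
    {"pin": "300", "job": "B"},
    {"pin": "999", "job": "C"},  # no match in the population register
]
linked, link_rate = link_by_pin(population, jobs)
```

The link rate computed here is the kind of figure behind a statement such as "approximately 99% of data are linked", although the exact definition used by CBS is not given in this report.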

2.3. Establishing relationships with administrative data source owners and collecting information on the sources

The ministries involved in the Interdepartmental working group were asked to provide the CZSO with descriptions of their administrative sources. At the start of the grant project, the descriptions of the sources had already been collected, so the CZSO had enough information to make a preliminary assessment of their suitability for the 2021 Census. The CZSO has assessed 37 administrative data sources on population and housing issues. Concerning data on persons, the following sources are potentially usable for Census 2021:

- Central Population Register owned by Ministry of Interior
- Information system of Population owned by Ministry of Interior
- Information system of Alien owned by Ministry of Interior
- Central Insurance Register owned by General Health Insurance Company/Ministry of Health
- Social Information System of Czech Social Security Administration
- Register of Unemployment owned by Ministry of Labour and Social Affairs
- Information system of Social support owned by Ministry of Labour and Social Affairs
- Tax registers owned by Ministry of Finance
- Education databases owned by Ministry of Education, Youth and Sports (information on pupils and students)
- Register of schools and other institutions owned by Ministry of Education, Youth and Sports

Based on the assessments of the sources, the CZSO prepared the document “Proposal of the data collection method in the Population and Housing Census 2021” and submitted it to the Czech Government. In the document the CZSO proposed to conduct the 2021 population census as a combination of a full field enumeration and the administrative sources listed above. However, the usability of each source had to be verified by analysing individual records. Therefore, in the document the CZSO requested individual records from the sources for the purpose of census preparation. On 13 January 2016 the Government of the Czech Republic approved the document and passed a resolution on Census 2021 that instructed the respective administrative data owners to assist the CZSO with census preparation and to transfer samples from their administrative data sources (individual records of persons living in a defined territory, the Hradec Králové district, representing approximately 2% of the Czech population). Nevertheless, the request for data first had to be validated by the Office for Personal Data Protection. The Office pointed out the importance of bilateral contracts. Every contract should include a data specification, ensure the protection of individual data and respect the privacy of registered persons (e.g. only variables necessary for census preparation should be requested).

2.4. Adoption of agreements with administrative data owners and data transmission

Following the instructions of the Office for Personal Data Protection, the CZSO started negotiations with data owners.

Ministry of Labour and Social Affairs (including Czech Social Security Administration)
Administrative data sources: Register of Unemployment, Social Information System and other sources

The contract between the Ministry and the CZSO was drafted as early as January 2016. The draft passed the approval procedure and the contract was signed at the beginning of April 2016. The datasets were prepared and delivered to the CZSO on 11 April 2016. The Ministry and the Czech Social Security Administration accepted the draft trilateral contract on providing data from the Social Information System, and the trilateral contract was signed at the end of April 2016. The data on employees and self-employed persons were submitted on 13 May 2016. The Czech Social Security Administration also administers data on pensions and women on maternity leave, which it was not willing to include in the original contract. On 11 July 2016 a trilateral meeting was held and an additional contract was drafted. The contract was signed and the new datasets were submitted on 10 October 2016.

Ministry of Education, Youth and Sports
Administrative data sources: Education databases, Register of schools

Although the first meeting took place in mid-January 2016 and the contract had already been drafted, the Ministry asked the Office for Personal Data Protection for an official opinion on the legitimacy of using individual records from the education databases for the purpose of census preparations. The Office confirmed the legitimacy and the Ministry signed the contract at the beginning of April. Data on pupils, students and schools were submitted on 28 April 2016.

Ministry of Interior
Administrative data sources: Population register (ROB), Information system of Population (ISEO), Information system of Alien (CIS)

The first draft of the contract was discussed in mid-January 2016. After assurances that the draft met the confidentiality rules, the Ministry did not raise any objections. Consultations continued on the annex to the contract, which specified the required data. The Ministry agreed to provide all the items in the proposed specification. The Minister of Interior signed the contract on 31 March 2016. The datasets were transmitted to the CZSO on 4 April 2016.

Ministry of Finance
Administrative data sources: Tax registers (Personal Income Tax (self-employed persons), Real Estate tax Return), Annex to corporate income tax return (employers and numbers of employees in municipalities)

The CZSO proposed the draft contract in January 2016. After several rounds of comments the contract was signed by both sides on 23 March 2016. The data from the annex to corporate income tax return and real estate tax return were submitted to the CZSO on the same day. Data on personal income tax returns were received on 3 June 2016.

Table 1: The timetable of submitting the administrative data samples

OWNER                                    SOURCE                          DATA SUBMITTED
Ministry of Finance                      Corporate income tax return     23 March 2016
                                         Real estate tax return
Ministry of Interior                     ISEO/CIS                        4 April 2016
Ministry of Labour and Social Affairs    OK prace                        11 April 2016
                                         OK centrum
                                         OK nouze
Ministry of Education, Youth and Sports  Regional education databases    28 April 2016
                                         Tertiary education register
Czech Social Security Administration     Social security register        13 May 2016
Ministry of Finance                      Personal income tax return      3 June 2016
General Health Insurance Company         Central insurance register      3 October 2016
Czech Social Security Administration     Pension beneficiary register    10 October 2016
                                         Maternity beneficiary register

Ministry of Health, General Health Insurance Company (VZP)
Administrative data sources: Central Insurance Register

Representatives of the Ministry understood the requirements and recommended that the CZSO conclude the contract directly with the General Health Insurance Company as the owner of the data. The General Health Insurance Company declared that, from a technical point of view, there were no problems with preparing the data. However, the company refused to provide the data because it considered the request illegal. In the opinion of VZP, Regulation (EC) No 223/2009 (on European statistics) did not entitle the CZSO to demand individual data records from the Central Insurance Register. The CZSO consulted the Office of the Government of the Czech Republic, Department for Compatibility with EC/EU legislation. The Office confirmed that the CZSO’s request for data was justified, and VZP accepted the Office’s statement. The contract was signed on 30 September 2016 and the data were transmitted to the CZSO on 3 October 2016.

2.5. Preparation for data storage and data analysis

All contracts with data owners included conditions for accessing and processing the administrative data, especially to ensure personal data protection. The preconditions were also defined in cooperation with the Office for Personal Data Protection. To set up the preconditions (technical aspects of the database environment, rules for accessing data, organizational arrangements etc.), internal cooperation with the relevant units at the CZSO was established and an internal directive concerning data processing was adopted. In order to store the administrative data safely, a testing database, SLDBT, was established. The database allowed data from different sources to be processed and linked. The SLDBT data were stored on a database server at the Central information department of the CZSO.

3. Analysis of administrative data

3.1. Characteristics of data samples

Central Population Register (ROB), Information system of Population (ISEO), Information system of Alien (CIS)

These registers (ROB, ISEO, CIS), owned by the Ministry of Interior, will form an essential part of the Census, and other data sources will be linked to them. The Central population register (ROB) keeps reference records about persons. It covers Czech citizens, EU citizens and foreigners with a residence or asylum permit. ROB is designed to always provide correct and current information. This means that ROB cannot be used for specific dates in the past, e.g. the decisive (reference) moment of the census. That is why ROB was not used for the administrative data analysis. Instead, the CZSO asked for data samples from ISEO and CIS, which are the basic sources for ROB. The Information system of Population (ISEO) keeps records on all Czech citizens and on foreigners with a Czech relationship. The Information system of Alien (CIS) keeps records on foreigners. Both systems have the same structure and contain the history and dates of changes. Therefore ISEO and CIS are more suitable for Census purposes.

The data from the ISEO/CIS registers will cover some demographic characteristics in the Census for the whole population; the topics are Age, Sex, Legal marital status, Number of children, Country/place of birth, Year of arrival to the country, Country of citizenship, Relationship between household members (partly), Household status (partly), Family status (partly).

The data sample from ISEO/CIS was submitted in three files on persons and covers 164 854 persons, i.e. 160 040 Czech citizens and 4 814 foreigners, permanently resident in Hradec Králové district as of 1 January 2016. The data contain a personal identifier (RC) that allows linking to other data sources and a person ID that allows linking within ISEO/CIS tables. The data sample also includes five files on current and previous citizenships, three files on relationships between a person, parents and children, and one file on the previous residence of foreigners. Person files contain an address ID; the actual address can be found using the address ID in the Register of census districts and buildings (RSO) owned by the CZSO. For other attributes, such as marital status, type of relationship or type of residence, nomenclatures are available. ISEO/CIS uses the same coding for place of residence (municipality, district, country) as the CZSO does in its files.

Social security register

The Social security register is a part of the Social Information system (IIS) owned by the Czech Social Security Administration (CSSZ), which is subordinate to the Ministry of Labour and Social Affairs. IIS represents one of the most important data sources for the census. It covers a significant part of the population, specifically employees, own-account workers, pensioners and women on maternity leave. The data contain a personal identifier which will serve as a link to other administrative data sources. The Social security register data may be used in the Census to determine the specific economic activity of the economically active population (employees or self-employed). In the case of employees, the place of work could be set too. The data also contain the business identifier of the employer. By linking the data with the Business register, Industry could be determined.

The data sample from the Social security register was submitted in one file covering 154 032 employees and self-employed persons living in Hradec Králové district as of 1 January 2016. However, the address was determined only by postal ZIP code, so the sample may also contain persons not resident in the district. The sample contains the type of security (employee or self-employed), the address and the period of security. Unfortunately, addresses are available only for employees and refer to the employer’s headquarters, not the actual place of work. Another complication concerns the period of security, where the end date is often missing.
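The missing end dates matter when deciding whether a person was covered on the reference date. A minimal sketch of such a check follows; treating a missing end date as a period that is still open is an assumption made here for illustration, not a rule stated by the data owner, and the record layout is hypothetical.

```python
from datetime import date

REFERENCE_DATE = date(2016, 1, 1)

def covered_on(start, end, ref=REFERENCE_DATE):
    """True if a period of security covers `ref`.

    Assumption: a missing end date (None) means the period is still open.
    """
    if start > ref:
        return False
    return end is None or end >= ref

# Made-up periods for illustration only.
periods = [
    {"person": "a", "start": date(2010, 5, 1), "end": None},
    {"person": "b", "start": date(2012, 3, 1), "end": date(2015, 12, 31)},
    {"person": "c", "start": date(2016, 2, 1), "end": None},
]
covered = [p["person"] for p in periods if covered_on(p["start"], p["end"])]
open_ended = sum(1 for p in periods if p["end"] is None)
```

Counting the open-ended periods separately shows how much of the result depends on the assumption.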

Pension beneficiary register

The Pension beneficiary register is also a part of IIS. It covers beneficiaries of old-age pension, invalidity pension, widow’s and widower’s pension and orphan’s pension. Pension beneficiary register data may serve in the Census to determine the economic activity of pensioners, or to identify working pensioners by combining data from the pension beneficiary and social security registers. The data sample includes 65 111 persons, all current beneficiaries of old-age or widow’s and widower’s pension living in Hradec Králové district as of 1 January 2016. The sample contains a personal identifier that allows linking to other administrative sources.
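Combining the two registers to separate working pensioners from other pensioners amounts to comparing two sets of personal identifiers. The sketch below uses made-up identifiers and hypothetical function and variable names; it ignores reference dates and pension types, which a real derivation would have to handle.

```python
def split_pensioners(pension_ids, security_ids):
    """Split pension beneficiaries by presence in the social security register."""
    pension_ids = set(pension_ids)
    security_ids = set(security_ids)
    working = pension_ids & security_ids      # in both registers
    not_working = pension_ids - security_ids  # pension only
    return working, not_working

# Made-up identifiers for illustration only.
working, not_working = split_pensioners(["p1", "p2", "p3"], ["p2", "e1"])
```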

Maternity beneficiary register

The Maternity beneficiary register is also a part of IIS. It covers women on maternity leave receiving maternity benefits. The data from the Maternity beneficiary register may be used to determine the economic activity of women on maternity leave. The data sample covers 1 031 persons, all maternity beneficiaries living in Hradec Králové district as of 1 January 2016. The sample contains a personal identifier that allows linking to other administrative sources and the starting date of the benefit. There is no ending date; however, since maternity leave lasts 28 weeks (37 weeks in the case of the birth of more than one child), the ending date can easily be determined.
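The derivation of the ending date from the starting date can be sketched as follows. The 28 and 37 week durations come from the text above; counting the starting date as the first day of leave and the existence of a multiple-birth flag are assumptions, since the register sample is only said to contain the starting date.

```python
from datetime import date, timedelta

def maternity_end(start, multiple_birth=False):
    """Last day of maternity leave: 28 weeks, or 37 weeks for a multiple birth.

    Assumption: the starting date counts as the first day of leave.
    """
    weeks = 37 if multiple_birth else 28
    return start + timedelta(weeks=weeks) - timedelta(days=1)

def on_leave(start, ref, multiple_birth=False):
    """True if `ref` falls within the derived maternity leave period."""
    return start <= ref <= maternity_end(start, multiple_birth)

print(maternity_end(date(2015, 9, 1)))            # 2016-03-14
print(on_leave(date(2015, 9, 1), date(2016, 1, 1)))  # True
```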

Register of Unemployment (OK prace)

OK prace is owned by the Ministry of Labour and Social Affairs. It covers unemployed persons registered with the Labour office of the Czech Republic, i.e. most unemployed persons seeking work, including persons seeking their first job. By registering with the Labour office, unemployed persons become entitled to unemployment support paid by the state. This is the reason why all data on unemployed persons contained in OK prace should be correct. OK prace data may be used in the Census to cover the core topics Occupation, Industry and Education for most unemployed persons. OK prace was submitted in two tables, persons and records, covering 8 180 persons permanently resident in Hradec Králové district who were registered at the Labour office as of 1 January 2016 or whose registration ended in October-December 2015. Two tables contain information on addresses. The sample contains nomenclatures for education, occupation, citizenship, reason for registering at the office and reason for ending the registration. OK prace contains a person ID that allows linking within the source and RC for linking to other administrative sources.

Social supports register (OK centrum)

OK centrum is owned by the Ministry of Labour and Social Affairs and covers persons receiving social support, such as parental allowance and child care support. The data contain a personal identifier (RC) which serves as a link to other administrative data sources. The data sample was submitted in three tables on persons and their applications and seven tables concerning addresses. The person table contains data on social support applicants as well as their children or other relatives, if applicable. The data sample covers 12 957 persons permanently resident in Hradec Králové district with a current application for social support as of 1 January 2016 or with an application that ended in October-December 2015. The tables contain a person ID that allows linking data within the source.

The System of Assistance in Material Need (OK nouze)

OK nouze is owned by the Ministry of Labour and Social Affairs and covers persons receiving social support, such as housing support or special social benefits. The data contain a personal identifier (RC) which serves as a link to other administrative data sources. The data sample was submitted in one table on persons and seven tables on addresses. The sample contains nomenclatures for marital status and citizenship. The data sample covers 4 466 persons permanently resident in Hradec Králové district with a current application for material need benefit as of 1 January 2016 or with an application that ended in October-December 2015. The tables contain a person ID that allows linking data within the source.

Regional education databases

Regional education databases are owned by the Ministry of Education, Youth and Sports and include information on pupils and students currently in schools. All schools are required to transfer data from their school registers to the education databases regularly, as of 31 March and 30 September. There are separate databases for each level of school: primary, secondary, post-secondary and conservatories. The data samples from the education databases were submitted for each level of school. The sample contains four files on pupils and students of primary, secondary and post-secondary schools and conservatories, and 15 nomenclatures on attributes such as sex, citizenship, school grade etc. The data sample covers 14 935 pupils of primary (basic) schools, 8 399 pupils of secondary schools, 65 students of conservatories and 565 students of post-secondary schools permanently resident in Hradec Králové district as of 30 September 2015. The data from this source could be used to determine the economic activity of pupils and students. For secondary school students, the highest level of education could also be set. The data on pupils and students in the education databases should be complete and accurate and are updated twice a year. The data contain a personal identifier which will serve as a link to other administrative data sources.

Tertiary education registers (SIMS)

SIMS is owned by the Ministry of Education, Youth and Sports and includes information on current and previous tertiary studies.

The data sample covers 38 668 tertiary studies and 21 411 university students in the school year 2015/2016 permanently resident in Hradec Králové district. The sample contains 11 nomenclatures on addresses, citizenship, college, level, type and field of education. The data contain a personal identifier (RC) which serves as a link to other administrative data sources. The data from this source could be used to determine the economic activity of students. The highest level of education could also be set.

Tax registers

Tax registers are owned by the General Financial Directorate of the Ministry of Finance. The General Financial Directorate (GFR) is the central body of financial administration in the Czech Republic. Besides other activities, GFR keeps the records and registers needed for the activities of the financial administration bodies. For Census needs, we examined three tax sources: Personal income tax, Annex to corporate income tax and Real estate tax. The data sample was submitted in one file with data on real estate tax and two files with data from the annex to corporate income tax. The data sample covers 34 037 real estate properties in Hradec Králové district and 7 036 persons with permanent residence in Hradec Králové district. The data from personal income tax were loaded directly into the SLDBT database; they cover the whole population of the country. The data include a personal identifier (RC) or a business identification number (ICO) that allow linking to other administrative data. Unfortunately, the data from the tax registers were submitted as of 31 December 2014. Therefore further analysis and comparison or linking with data from other sources was impossible.

Central insurance register (CRP)

The CRP is owned by the General Health Insurance Company. It contains data on insured persons. Since health insurance is compulsory, this system should cover the whole population. The CRP data could be used in the Census especially for the assessment of the usually resident population. The population usually resident in the Czech Republic has to be registered with the General Health Insurance Company. The data contain a personal identifier which will serve as a link to other administrative data sources. Based on the type of insurance, a person’s economic activity could also be partly determined, mainly for the population not currently economically active: Children and students (without specific information about schooling), Pension recipients, Homemakers, Women on maternity leave and Unemployed persons registered with the Labour office of the CR. The sample data was submitted in one file and covers 183 541 insured persons.

3.2. Technical aspects of data samples

In accordance with the contracts between the CZSO and the administrative data source owners, the data samples were submitted and then stored in the SLDBT database.

Information system of Population (ISEO), Information system of Alien (CIS)

The Ministry of Interior submitted data in the ASCII structure. The data sample contained metadata on the number of records in tables, the type of coding and the date of export from the data sources. The data were loaded by a standard Oracle Database utility.

Table 2: Number of tables in SLDBT database by sources

OWNER                                    SOURCE                          NUMBER OF TABLES
Ministry of Finance                      Real estate tax return          1
                                         Corporate income tax return     2
                                         Personal income tax return      1
Ministry of Interior                     ISEO/CIS                        12
Ministry of Labour and Social Affairs    OK prace                        11
                                         OK centrum                      11
                                         OK nouze                        10
Ministry of Education, Youth and Sports  Regional education databases    19
                                         Tertiary education register     12
Czech Social Security Administration     Social security register        1
                                         Pension beneficiary register    1
                                         Maternity beneficiary register  1
General Health Insurance Company         Central insurance register      1

Social security register The Czech Social Security Administration submitted data in the ASCII structure. The data sample contained metadata: the structure of the security number and a nomenclature for the type of activity. The data were loaded by the standard Oracle Database Utility.

Tax registers The General Financial Directorate of the Ministry of Finance submitted part of the data in the ASCII structure and another part in XLSX files. No metadata were included. It was impossible to determine the specific form of the data in the original source; therefore all data were transformed to the ASCII structure. The data were loaded by the standard Oracle Database Utility and defined as a VARCHAR2 type in the database.

Register of Unemployment, Social supports register, the System of Assistance in Material Need Ministry of Labour and Social Affairs provided the CZSO with data exports (Oracle DMP files) from their databases. The files were created by the Oracle Data Pump Utility. The submitted data included a database scheme. The data were loaded to SLDBT database also by the Oracle Data Pump Utility. The data in SLDBT are identical to the data source.

Regional education databases, Tertiary education registers The Ministry of Education, Youth and Sports submitted data in XLSX files for each level of education. The files contained the related nomenclatures in separate XLSX sheets. The sheets had to be saved to individual ASCII files. These were then loaded as a VARCHAR2 type to the SLDBT database. The data sample included a description of the data as well.


Central insurance register The General Health Insurance Company submitted data in two sheets of an XLSX file. The first sheet contained fact data and nomenclatures, the second sheet included addresses. The sheets were saved to individual ASCII files. These files were then loaded by the Oracle utility SQL Loader as a VARCHAR2 type to the SLDBT database. The data sample also contained metadata on persons whose health insurance is covered by the government.

3.3. Preliminary analysis of data samples with respect to completeness and logical relations between attributes

Information system of Population (ISEO) The data samples corresponded with a data description in the contract between the Ministry of Interior and the CZSO. The data contained personal identifier (RC); only one record was missing the RC. Most nomenclatures were submitted as a text, therefore consistency between data and nomenclatures was not guaranteed. Attribute Citizenship contained 2 code items that were not a part of Citizenship nomenclature. Marital status and Type of relationship corresponded with the nomenclature. The table Relationship parents-children included 499 records that had no link to the main table of population. Tables of Residences and Citizenships contained correct records and corresponding links.

Information system of Alien (CIS) The data samples corresponded with a data description in the contract. Just as in ISEO, nomenclatures were submitted as a text. Attribute Citizenship contained 4 code items that were not a part of Citizenship nomenclature. The table Citizenship contained 1 record that did not exist in the main population table. The table Relationship parents-children included 8 records that did not have any identifier existing in the main table.

Social security register The submitted data sample corresponded with a data description in the contract. Most records contained a personal identifier (RC) and some of them security number. A list of security numbers was not available to the CZSO; linking was not possible. The nomenclature for a type of activity did not correspond with the data. The sample did not contain codes for type of activity; instead a description of the type of activity in a text format was included. The ending date of the activity was often missing even if a person was very old and not likely employed or self-employed anymore or in cases when a person held multiple jobs. Also there was a problem with addresses. The source did not contain a real address; the area was defined only by postal codes. The data did not seem very reliable.

Pension beneficiary register The submitted data sample corresponded with a data description in the contract. All records contained a personal identifier (RC); however, the RC item often included spaces, which made direct linking impossible. Type of pension was in a text format; nomenclatures were not submitted.

Maternity beneficiary register


The submitted data sample corresponded with a data description in the contract. All records contained a personal identifier (RC); however, just as in the Pension beneficiary register, the RC item often included spaces, which made direct linking impossible. The data contained only the starting date of the benefit, but the ending date could be easily determined.

Register of Unemployment (OK prace) The data sample corresponded with a data description in the contract. The sample was submitted as a database dump together with address tables and nomenclatures for education, occupation, citizenship, reason for registering at the office and reason for ending the registration. Relations between data and nomenclatures were consistent. All records contained a personal identifier (RC) in a normal format.

Social supports register (OK centrum) The data sample corresponded with a data description in the contract. The sample was submitted as a database dump together with a nomenclature for citizenship, tables on the applicant's children and other relations, and tables of addresses. Relations between data, nomenclature and tables were consistent. All records contained a personal identifier (RC) in a normal format.

The System of Assistance in Material Need (OK nouze) The data sample corresponded with a data description in the contract. The sample was submitted as a database dump together with tables of addresses. Relations between data and tables were consistent. All records contained a personal identifier (RC) in a normal format.

Regional education databases The data sample corresponded with a data description in the contract. The data samples included data files and nomenclatures; relations between them were consistent. All records contained a personal identifier (RC) in a normal format.

Tertiary education registers (SIMS) The data sample corresponded with a data description in the contract. The data sample contained data and nomenclatures. Attribute College included 2 code items that did not exist in the College nomenclature. Other nomenclatures were consistent with the data submitted. All records contained a personal identifier (RC) in a normal format.

Central insurance register (CRP) The data sample corresponded with a data description in the contract. All records contained a policy holder number. Most of them were equal to a personal identifier (RC) in a normal format. Some policy holder numbers were different and contained different symbols. These records were impossible to link to other sources. Nomenclatures for Types of insurance were submitted separately. Relations between data and nomenclatures were consistent.
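The two kinds of consistency check reported in this section (code items missing from a nomenclature, and records without a link to the main population table) can be illustrated with a minimal Python sketch. The function names, field names and sample values are hypothetical and serve only as an illustration; they do not reproduce the procedures actually run in the SLDBT database.

```python
def unknown_codes(values, nomenclature):
    """Return the distinct codes used in the data that the nomenclature does not define."""
    return sorted(set(values) - set(nomenclature))


def orphan_records(child_rows, parent_ids, key):
    """Return rows of a dependent table whose key has no match in the main table."""
    known = set(parent_ids)
    return [row for row in child_rows if row[key] not in known]


# Hypothetical sample data (not taken from the real registers).
citizenship_nomenclature = {"CZE", "SVK", "UKR"}
citizenship_values = ["CZE", "SVK", "XXA", "CZE", "XXB"]

population_ids = {"P1", "P2", "P3"}
parent_child_rows = [
    {"child_id": "P1", "parent_id": "P2"},
    {"child_id": "P9", "parent_id": "P3"},  # P9 is not in the main table
]

missing_codes = unknown_codes(citizenship_values, citizenship_nomenclature)
orphans = orphan_records(parent_child_rows, population_ids, "child_id")
print(missing_codes)   # codes to be reported as not part of the nomenclature
print(len(orphans))    # number of records with no link to the main table
```

Counting the results of such checks per source yields figures of the kind quoted above (e.g. the number of code items outside the Citizenship nomenclature).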

3.4. Quality assessment of usable variables

On the grounds of the previous findings, the following analysis focused only on the variables and data sources which had been identified as potentially usable for the 2021 Census purposes. Potentially usable data sources either covered the entire population of the Czech Republic or contained important data on the majority of the population which could be linked to the main (constitutive) data source.

Of the thirteen available data samples, only a few are potentially usable for the 2021 Census. The crucial data source is definitely ISEO/CIS. Firstly, it is the only constitutive data source, and secondly, it is the only data source providing usable data on the entire population.

The following ISEO/CIS variables are potentially usable for Census 2021:

• place of (registered) residence
• place of (registered) residence one year prior to the reference date
• the year of arrival in the country
• citizenship
• country / place of birth
• sex
• birth date (age)
• legal marital status

Since the remaining administrative data sources can be used only as complementary sources, i.e. after linking with the constitutive data source, the analysis below focuses only on ISEO/CIS variables.

3.4.1. Resident population, place of registered residence

ISEO/CIS covers all Czech citizens (ISEO) and all foreigners with permanent residence, temporary residence (EU citizens) or long-term visas and stays (third-country nationals). The definition is consistent with the definition of population used for national statistics purposes. However, it does not meet the definition of usually resident population since the data source contains also persons with registered residency usually living abroad and, vice versa, does not cover all foreigners usually living in the Czech Republic. The difference between registered place of residence and the place of usual residence is more significant at lower geographical level (note: the usual place of residence vs. permanent residence issue has been investigated in full detail in different EU grant project).

The only reference data sources for place of registered residence evaluation are (official) intercensal population estimates released regularly by the CZSO. Nevertheless, population estimates are based on census data updated by administrative data sources, which makes comparison a bit more complex. Despite the methodological difficulties, it is still the most suitable data source for analysis and comparisons.

In ISEO/CIS, the variable registered place of residence is always stated, i.e. known for 100% of the registered population. According to the data sample, the total population of the Hradec Králové district on 1 January 2016 was 164 854, while according to the intercensal population estimates it was 163 159. The sample therefore differs by only 1.0%. Even at a small geographic level, the differences between the population estimates and the data sample are not significant (e.g. municipalities' deviations range from -2.9% to 3.9%, see figure 1).
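The comparison with the population estimates can be reproduced with a short calculation. The sketch below is only illustrative: the function names are hypothetical, and the registered total of 164 854 persons is the district total reported in tables 3 and 5.

```python
def relative_difference(registered: int, estimated: int) -> float:
    """Relative difference (%) of the registered count from the population estimate."""
    return (registered - estimated) / estimated * 100


def per_100_estimated(registered: int, estimated: int) -> float:
    """Registered persons per 100 inhabitants according to the estimate (figure 1 measure)."""
    return registered / estimated * 100


registered_total = 164_854   # ISEO/CIS sample, district Hradec Králové
estimated_total = 163_159    # intercensal population estimate, 1 January 2016

district_diff = relative_difference(registered_total, estimated_total)
district_ratio = per_100_estimated(registered_total, estimated_total)
print(round(district_diff, 1))
print(round(district_ratio, 1))
```

Applied to each municipality separately, the second function gives the measure plotted in figure 1.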


Figure 1: Comparison of number of inhabitants according ISEO / CIS and official population estimates, 1. 1. 2016

Note: population registered in municipalities according to ISEO/CIS per 100 inhabitants according to population estimates
Source: ISEO/CIS, Population estimates

Nevertheless, at lower geographical levels the registered residence differs significantly from the usual residence as defined in the EU regulation 768/2009. The differences were analyzed using 2011 Census data, which provide both the place of usual residence and the registered residence at record level. The numbers show that the overlap rate of usual and registered residence depends significantly on many factors, such as geographic level, locality in terms of centrality rate, and socio-demographic characteristics of inhabitants. As figure 2 shows, if the place of usual residence were replaced by registered residence, population structures at lower geographical levels would be significantly affected. Therefore, in order to meet the definition of usual residence, the question on place of usual residence is going to be part of the census form. Only for persons who do not fill in the census form will the place of usual residence be replaced by the place of registered residence.


Figure 2: Overlap of the usual and registered populations at small geographic level in selected age groups

Source: Census 2011

3.4.2. Place of residence one year prior to the reference date

The same conclusion can be drawn for the variable place of residence one year prior to the reference date. The source ISEO/CIS can therefore be used only as an additional data source, for persons without a completed census form. Replacing the data coming from the census form by ISEO/CIS data is, however, possible only for people living in the Czech Republic. As table 3 shows, the data on the previous place of registered residence (in terms of country) is not available for immigrants in ISEO/CIS.

Table 3: Residence one year prior to the reference date, district Hradec Králové

RESIDENCE ONE YEAR PRIOR          CZECH CITIZENS        FOREIGNERS            TOTAL
TO THE REFERENCE DATE             COUNT         %       COUNT         %       COUNT         %
Unchanged                         151 777    94.8       3 869      80.4       155 646    94.4
Move from outside the country     N/A         N/A       280         5.8       280         0.2
Move within a country¹            6 521       4.1       476         9.9       6 997       4.2
Not stated                        105         0.1       170         3.5       275         0.2
Not applicable (age < 1)          1 637       1.0       19          0.4       1 656       1.0
Total                             160 040   100.0       4 814     100.0       164 854   100.0

Source: ISEO/CIS

¹ In terms of a change between two specific addresses.

3.4.3. Year of arrival in the country

The variable date of arrival in the current place of registered residence is available only for foreigners in CIS, not for Czech citizens or for foreigners who have gained Czech citizenship. Although the data is available only for present-day foreigners, 16.2% of them are still missing the date of arrival (see table 4). Related to the entire population, data on date of arrival is not available for 97.7% of persons.

Table 4: Date of arrival in the Czech Republic in comparison with date of arrival in current registered place of residence by citizenship, district Hradec Králové

DATE OF ARRIVAL IN THE COUNTRY              CZECH CITIZENS        FOREIGNERS            TOTAL
                                            COUNT         %       COUNT         %       COUNT         %
< date of arrival in the current place      N/A         N/A       2 250      50.0       2 250       1.4
= date of arrival in the current place      N/A         N/A       262         5.8       262         0.2
> date of arrival in the current place      N/A         N/A       1 256      27.9       1 256       0.8
Not stated                                  160 040   100.0       730        16.2       160 770    97.7
Total                                       160 040   100.0       4 498     100.0       164 538   100.0

Source: ISEO/CIS

For the rest of the population, i.e. persons with a stated date of arrival, there is no suitable reference data source for assessing the reliability of the data. Nevertheless, the relative frequency distribution of foreigners coming to the Czech Republic is not in accordance with the overall immigration trends during the same period of time. Furthermore, for 27.9% of foreigners the recorded date of arrival in the Czech Republic is later than the date of arrival in the current place / address (see table 4), which is logically inconsistent.

In conclusion, as it is shown in table 4, the variable date of arrival may be considered usable for census purposes just for 55.8% of foreigners. Since it represents only 1.6% of the entire population, the variable date of arrival is not expected to be used for 2021 Census.
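The shares quoted above can be recomputed from the counts in table 4. The following sketch assumes, as the figure of 55.8% suggests, that the usable records are those where the date of arrival in the country is earlier than or equal to the date of arrival in the current place; the variable names are illustrative only.

```python
# Counts of foreigners from table 4 (district Hradec Králové).
arrival_counts = {
    "earlier": 2_250,     # arrival in the country < arrival in the current place
    "equal": 262,         # arrival in the country = arrival in the current place
    "later": 1_256,       # arrival in the country > arrival in the current place
    "not_stated": 730,
}
total_population = 164_538  # total of table 4, Czech citizens included

foreigners_total = sum(arrival_counts.values())
usable = arrival_counts["earlier"] + arrival_counts["equal"]

usable_share_foreigners = usable / foreigners_total * 100
usable_share_population = usable / total_population * 100

print(foreigners_total)
print(round(usable_share_foreigners, 1))
print(round(usable_share_population, 1))
```

Computed from the counts, the share in the entire population comes out at about 1.5%; the 1.6% quoted in the text corresponds to adding the rounded table percentages 1.4 and 0.2.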

3.4.4. Place of birth (within the territory of the Czech Republic)

The place of birth should, by definition, be the registered residence of the mother at the time of birth, not the place where the person was physically born, which can be e.g. the location of a hospital. When using administrative data sources such as ISEO/CIS, the most suitable way to meet this definition is to use the place of registered residence of each person at the time of birth. However, this approach can be applied only to persons born in the Czech Republic. Figure 3 presents the relative frequency of registered persons with the place of birth (de jure) in the Czech Republic. For most people born in the Czech Republic the place of residence at birth is not available, or cannot be derived even with a higher degree of tolerance (e.g. residence recorded at the age of five years at the latest).

Furthermore, the variable deals with the same constraint as the previous variables, i.e. it refers only to the place of registered residence, not the usual residence. Therefore, the place of birth cannot be used at a lower than national level and the question on place of birth has to be part of the census form.


Figure 3: Persons born in the Czech Republic by age at the beginning of the first recorded residence, district Hradec Králové

[Bar chart. Vertical axis: Share in persons born within the Czech territory (%), scale 0 to 40. Horizontal axis: Age of persons at the beginning of the first recorded residence, in groups 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65+.]

Source: ISEO/CIS

3.4.5. Country of birth

While the usage of place of birth from ISEO/CIS at lower than national level is unsuitable, at the national level it is feasible. As the table 5 shows, the data on country of birth is available for all records in ISEO/CIS.

In order to assess the quality of the data, the ISEO/CIS data is compared with the 2011 Census data as the only available comparative data source. Given the nearly five-year time lag and the dynamics of migration movement only approximate comparisons can be performed. In total, 5.4% of population in ISEO/CIS was born abroad, which is almost identical to the share of people born abroad based on the 2011 Census data (see table 5).

Table 5: Registered population by the country of birth and citizenship according to ISEO/CIS and Census 2011, district Hradec Králové

                    ISEO/CIS (01/01/2016)                  CENSUS 2011 (26/03/2011)                                (%)
COUNTRY OF BIRTH    FOREIGNERS  CZECH CITIZENS  TOTAL      FOREIGNERS  CZECH CITIZENS  NOT STATED  TOTAL       ISEO/CIS TOTAL  CENSUS TOTAL
Czech Republic      406         155 498         155 904    273         153 452         67          153 792     94.6            94.2
Abroad              4 408       4 542           8 950      4 114       3 876           469         8 459       5.4             5.2
Not stated          -           -               -          258         223             481         962         -               0.6
Total               4 814       160 040         164 854    4 645       157 551         1 017       163 213     100.0           100.0

Source: ISEO/CIS, 2011 Census

Also the overall structure of the most frequent countries of birth is very similar (see table 6), albeit the rates show noticeable difference.


Table 6: The most frequent countries of birth (excluding CZE) of population in the district Hradec Králové according to ISEO/CIS and Census 2011

ISEO/CIS (01/01/2016)
RANKING  COUNTRY OF BIRTH  FOREIGNERS  CZECH CITIZENS  TOTAL
1        Slovakia          829         3 030           3 859
2        Ukraine           1 550       286             1 836
3        Vietnam           379         19              398
4        Poland            184         156             340
5        Germany           64          163             227
6        Russia            163         54              217
7        UK                40          123             163
8        US                44          101             145
9        Mongolia          122         6               128
10       Romania           51          42              93

CENSUS 2011 (26/03/2011)
RANKING  COUNTRY OF BIRTH  FOREIGNERS  CZECH CITIZENS  NOT STATED  TOTAL
1        Slovakia          526         3 045           40          3 611
2        Ukraine           1 885       254             193         2 332
3        Vietnam           416         11              72          499
4        Poland            146         139             21          306
5        Mongolia          137         ?               ?           183
6        Russia            114         36              11          161
7        Moldavia          106         1               15          122
8        Germany           41          61              3           105
9        Malaysia          89          1               4           94
10       Bulgaria          44          18              6           68

Note: for Mongolia (Census 2011) only three figures are given (137, 46, 183); the value 46 belongs either to CZECH CITIZENS or to NOT STATED.

Source: ISEO/CIS, 2011 Census

In conclusion, the ISEO/CIS data on country of birth is of no worse quality than the Census data. The completeness is 100% and the distribution by countries is as expected. Therefore, the data on country of birth can be obtained from administrative data sources without any significant quality loss. However, since the variable is just one part of the complex question on place of birth, there is no point in removing it from the census form as long as the remaining parts cannot be replaced by administrative data. Data on country of birth from ISEO/CIS will therefore be used only as a complementary source in order to improve the response rate as well as data quality.

3.4.6. Citizenship

Similar conclusions can be drawn from investigating the quality of the data on citizenship from the data source ISEO/CIS. In order to evaluate the reliability and the quality of the data, 2011 Census data is used as the only available comparative source, and only approximate comparisons are drawn. In total, 2.9% of the population in ISEO/CIS has other than Czech citizenship, which is equal to the share of people with other citizenship according to 2011 Census data (see table 7).

Table 7: Registered population by the citizenship according to ISEO/CIS and Census 2011, district Hradec Králové

                                    ISEO/CIS (01/01/2016)    CENSUS 2011 (26/03/2011)
CITIZENSHIP                         COUNT         %          COUNT         %
Czech (including Czech + other)     160 040    97.1          163 423    96.6
Other                               4 815       2.9          4 960       2.9
Stateless                           4           0.0          27          0.0
Not stated                          -             -          802         0.5
Total                               164 859   100.0          169 212   100.0

Source: ISEO/CIS, 2011 Census

Despite the 5-year time lag, the distribution of other citizenships (excluding the Czech Republic), illustrated by the ranking of the most frequent citizenships in table 8, is also similar to the distribution according to 2011 Census data. In particular, Ukraine remains among the most frequent citizenships and keeps its leading position among other countries. Differences are visible among less frequent countries, such as Bulgaria, Romania, Malaysia, Greece or the UK, which differ considerably, mostly because of very low numbers and the 5-year time lag.

Table 8: The most frequent citizenships (excluding CZE) of population in the district Hradec Králové according to ISEO/CIS and Census 2011

         ISEO/CIS (01/01/2016)                              CENSUS 2011 (26/03/2011)
RANKING  CITIZENSHIP  COUNT   SHARE OF FOREIGNERS (%)       CITIZENSHIP  COUNT   SHARE OF FOREIGNERS (%)
1        Ukraine      1 609   33.4                          Ukraine      2 052   41.4
2        Slovakia     889     18.5                          Slovakia     773     15.6
3        Vietnam      468     9.7                           Vietnam      517     10.4
4        Poland       188     3.9                           Poland       169     3.4
5        Mongolia     152     3.2                           Mongolia     169     3.4
6        Russia       139     2.9                           Moldavia     111     2.2
7        Germany      109     2.3                           Malaysia     102     2.1
8        Moldavia     77      1.6                           Russia       97      2.0
9        Bulgaria     71      1.5                           Greece       75      1.5
10       Romania      67      1.4                           UK           71      1.4

Source: ISEO/CIS, 2011 Census

To summarize, ISEO/CIS provides a good quality of data on citizenship at a record-level covering the whole population. Therefore it can be used for 2021 Census purposes by replacing the question on citizenship on the census form.

3.4.7. Sex and birth date (age)

In order to assess the quality of the data on sex and age from ISEO/CIS, the official intercensal population estimates, based on census data updated by administrative data sources on births, deaths and international migration, were used as a comparative data source. Figure 4 presents the population of the Hradec Králové district by age and sex according to two data sources: ISEO/CIS and the intercensal population estimates published by the Czech Statistical Office for the beginning of the year 2016. Both data sources show the same age-sex distribution. There are only slight differences between them, which are more visible in figure 5, presenting the variations between ISEO/CIS and the population estimates in detail. In general, the observed differences are very small, in the range between -2.3% and 9.97%, with the main peak of overestimation in the highest age groups, which is caused by missing or delayed death registration in ISEO/CIS. Underestimation is observed only in two age categories: infants and young people aged 24–33. While the underestimation of the number of infants is often explained by delayed birth registration, the underestimation of young people reflects missing registration of international migration. Nevertheless, the numbers are very low, amounting to no more than 63 records, which constitute only 0.04% of the population of the Hradec Králové district.


Figure 4: Population of district Hradec Králové by age, sex and data source, 1.1.2016

[Two population pyramids: ISEO/CIS; Population estimate. Vertical axis: Age (0 to 90+). Horizontal axis: Population (0 to 1 500 on each side). Series: Males, Females.]

Source: ISEO/CIS, Population estimates 2016 (CZSO)

Figure 5: Relative difference in the number of persons registered in ISEO/CIS from population estimate by sex and age, district Hradec Králové, 1.1.2016

[Line chart. Vertical axis: Difference (%), scale -4 to 10. Horizontal axis: Age (0 to 90+). Series: Males, Females.]

Source: ISEO/CIS, Population estimates 2016 (CZSO)

To summarize, both variables, sex and age, are of good quality and meet the requirements for replacing the corresponding questions on the census form when necessary. However, given that age (date of birth) is a key variable for linking records between the sources, it has to be included in each census form in any case. Only the data on sex at record level are expected to be based on ISEO/CIS data.

3.4.8. Legal marital status

Similarly to some previous variables, the quality of legal marital status data coming from ISEO/CIS can be evaluated by comparing its distribution with the 2011 Census data. Data on marital status in ISEO/CIS is available for 99.96% of the population. Furthermore, as figure 6 shows, the legal marital status distribution in the Hradec Králové district based on the ISEO/CIS data is almost identical to the numbers from the 2011 Census data. The main deviations, taking the 5-year time lag into consideration, are visible in similar age groups as the differences found when comparing age distributions, and the findings are coherent with the analysis above. Thus, the data on marital status can be obtained from the ISEO/CIS source.

Figure 6: Legal marital status distribution by sex and data source, district Hradec Králové

[Four stacked bar charts: % Males - 2011 Census; % Males - ISEO/CIS 2016; % Females - 2011 Census; % Females - ISEO/CIS 2016. Vertical axis: share (%), scale 0 to 100. Horizontal axis: five-year age groups from 0-4 to 80+. Categories: never married, married, divorced, widowed.]

Source: ISEO/CIS, 2011 Census

3.4.9. Conclusion

Despite the availability of several administrative sources, only ISEO/CIS covers the whole population and can thus be used as a constitutive data source. However, the quality of the administrative data had to be explored first. Therefore, the main distributions of the administrative records were assessed and compared either with 2011 Census data or with official intercensal population estimates. As a result, only a few variables can be taken from ISEO/CIS as a constitutive data source: sex, legal marital status and citizenship. The remaining ISEO/CIS data can be used only as complementary data for census purposes during data processing.


4. Linking data

Linking records from different sources to ISEO/CIS is a key action of the whole project. Since ISEO/CIS is the only constitutive source (besides the field enumeration in a real census), the records from other sources can be used only when linked to a valid ISEO/CIS record. Records not linked to ISEO/CIS bring no useful information for the census.

Personal identifier Each Czech citizen and most foreigners living in the Czech Republic are identified by a personal identifier (PIN) called birth number. The number is constructed as follows: YYMMDDXXX or YYMMDDXXXX, where
YY … year of birth
MM … month of birth and sex (male: MM + 0; female: MM + 50)
DD … day of birth
XXX … suffix for persons born in 1953 and before
XXXX … suffix for persons born since 1954

Table 9: Number of records and basic characteristics of PINs in the data samples

                         RECORDS    VALID AT     MORE VALID RECORDS     VALID RECORDS          RECORDS IDENTIFIED       NAMES
                         TOTAL      1.1.2016     PER PERSON             IDENTIFIED BY PIN      BY INVALID PIN           AVAILABLE
SOURCE                              TOTAL        ACCEPTABLE             TOTAL      DUPLICATE   TOTAL    CORRECTABLE
ISEO/CIS                 164 854    164 854      N                      164 533    40          -                        Y
CENTRAL INSURANCE REG.   183 541    160 659      N                      160 097    -           -        -               Y
SOCIAL SECURITY REG.     173 841    119 018      Y                      118 844    -           -                        N
PENSION BENEF. REG.      65 111     65 111       N                      65 111     -                                    Y
MATERNITY BENEF. REG.    1 032      1 032        N                      1 025      2           -        -               Y
OK PRACE                 8 180      5 902        N                      5 902      -           -                        Y
OK CENTRUM               12 957     12 957       N                      12 912     -           2        2               Y
OK NOUZE                 4 466      4 455        N                      4 455      -           114      114             Y
REG. ED. DB – I¹         15 532     13 390       N                      13 390     4           13       -               N
REG. ED. DB – II¹        8 938      6 662        Y                      6 662      -           7        -               N
REG. ED. DB – III¹       69         54           Y                      54         -           -        -               N
REG. ED. DB – IV¹        579        389          Y                      389        -           -        -               N
TERTIARY EDUC. REG.      38 668     5 573        Y                      5 573      -           904      109             Y

¹ Regional education databases: I – elementary, lower secondary, II – upper secondary, III – conservatories, IV – tertiary, non university education

The whole birth number for persons born since 1954 (10-digit number) has to be divisible by 11 (there is no such a rule for 9-digit numbers). This PIN is recorded in all analyzed administrative sources, although mostly just as an ordinary variable – in most of the sources another, internal ID is used as a main identifier. Table 9 presents basic information on PIN in the analyzed administrative sources.
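The structure of the birth number described above can be illustrated by a minimal Python sketch. It implements only the rules stated in this report (sex encoded in the month, 9-digit versus 10-digit form, divisibility by 11 for 10-digit numbers). The function name, the handling of the "/" separator and the choice of century are assumptions made for illustration; any further real-world exceptions to these rules are not handled.

```python
from datetime import date
from typing import NamedTuple, Optional


class BirthNumber(NamedTuple):
    birth_date: date
    sex: str  # "M" or "F"


def parse_birth_number(pin: str) -> Optional[BirthNumber]:
    """Decode a birth number; return None if it breaks the rules described above."""
    # Assumption: spaces and an optional "/" separator are redundant characters.
    digits = pin.replace("/", "").replace(" ", "")
    if not (digits.isascii() and digits.isdigit()) or len(digits) not in (9, 10):
        return None
    yy, mm, dd = int(digits[0:2]), int(digits[2:4]), int(digits[4:6])
    if len(digits) == 10:
        # 10-digit numbers (born since 1954) must be divisible by 11.
        if int(digits) % 11 != 0:
            return None
        # Assumption: YY of 54-99 means 19YY, YY of 00-53 means 20YY.
        year = 1900 + yy if yy >= 54 else 2000 + yy
    else:
        # 9-digit numbers: born in 1953 or before; assumption: always 19YY.
        if yy > 53:
            return None
        year = 1900 + yy
    sex = "M"
    if mm > 50:  # females have 50 added to the month
        sex, mm = "F", mm - 50
    try:
        return BirthNumber(date(year, mm, dd), sex)
    except ValueError:  # impossible calendar date
        return None


# Artificial example numbers, constructed only to satisfy the rules.
print(parse_birth_number("855214/0003"))
print(parse_birth_number("400101123"))
```

A check of this kind is one way to separate valid PINs from invalid but correctable ones (e.g. those containing redundant spaces) before linking.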


The following steps were taken to link the valid records from all sources to ISEO/CIS:

a) Identifying valid records Most of the samples provided to the CZSO contained records not valid at the reference date (1.1.2016), such as terminated studies etc. These records were identified and excluded from further processing. For a comparison of the total number of records and the number of records valid at the reference date, see table 9.

b) Assessing completeness and quality of PIN Several records with a missing PIN were found in the sources, but their share in the total number of records was almost negligible (table 9). In some (also very rare) cases errors in the PIN were detected, but part of the incorrect PINs could be transformed to the correct form (e.g. by deleting redundant characters). Where multiple records related to one person were allowed (e.g. tertiary education register), the combinations of PIN and name (first name, family name) were tested to check the uniqueness of the PIN within each source. Not a single inconsistency was found.

c) Identifying and deleting duplicate valid records A very low number of duplicate records was identified (table 9), mostly in ISEO/CIS, where 20 persons were found to be recorded both as Czech citizens and as foreigners (people who gained Czech citizenship but whose records were not deleted from the CIS subsystem). These duplicate records were then deleted.

Figure 7: Results of linking valid records in available sources to ISEO/CIS (direct linking using PIN)

[Stacked bar chart of linked / not linked records by source. Vertical axis: Linking success rate (%), scale 0 to 100. Horizontal axis: Source. Data labels of the linked share (%): 99.6, 99.1, 91.4, 96.8, 97.3, 97.7, 94.4, 96.6, 98.5, 73.3, 72.0, 69.9.]

d) Linking valid records to ISEO/CIS The findings above indicate that the PIN (birth number) can be used for linking valid records from all sources with the valid records in the constitutive source ISEO/CIS. Figure 7 presents the results of the exact linking. For most of the sources the linking success rate was above 95%. Significantly less linkable were all three sources of the Czech Social Security Administration, where only around 70% of valid records were linked to ISEO/CIS.


As a next step, the results of the linking process had to be verified. This means finding out, on the one hand, whether there really is no appropriate record in ISEO/CIS for the unlinked records in other sources (false non-match) and, on the other hand, checking whether the linked records really relate to the same persons (false match).

e) Analyzing unlinked records The appropriate way to deal with records not linked via an exact match of PIN is to use an alternative identifier, mainly a combination of birth date and names (first name, family name), optionally together with additional characteristics. These variables were not available in all sources (table 9 above). Where possible, exact linking was performed using the birth date and the concatenated string of first name and surname (without spaces and without diacritics). However, the success rates were zero or almost zero (6 records in the health insurance register was the highest number). Surprisingly, the sets of unlinked records showed similar characteristics (frequencies of values) as the sets of linked records. The CZSO asked the Ministry of Interior for assistance by providing the identifiers of all records not linked to ISEO/CIS (over 30 thousand PINs). The Ministry of Interior managed to find the vast majority (96%) of these PINs among the valid records in ISEO/CIS; however, these records had not met the criteria for inclusion in the sample provided to the CZSO. For most records the main reason was a registered place of residence outside the territory of the Hradec Králové district. In other words, the failures in the linking process were mainly caused by differences in place of residence, not by errors in identifiers. If the CZSO had obtained all ISEO/CIS records (not only the sample), the linking success rate would have been much higher; based on the findings of the Ministry, virtually 100%. Unfortunately, providing the CZSO with any records other than the defined sample would be out of the scope of the contract between the CZSO and the Ministry.
f) Verifying linked records

The following step was to verify whether the records linked by PIN refer to the same person. This was done by comparing alternative identifiers of the linked records. Since PINs already contain birth date and sex, the only variables usable as identifiers were first name and surname. First, names and surnames were transformed into a harmonized form (changed to capitals, converted to ASCII to eliminate diacritics, spaces deleted) and compared directly. The comparison was usually successful (names and surnames fully matched), though not in every case. For some sources (namely the tertiary education register) the share of linked records with different names was surprisingly high (see table 10). Therefore, further analysis of names in pairs of linked records had to be carried out in order to accept or reject the PIN as a reliable unique personal identifier across all administrative sources.

Similarities of first names, surnames and whole names (strings concatenating first name and surname) were assessed using the Jaro-Winkler method. Figure 8 presents the overall similarity distribution of names and surnames in pairs of records (ISEO/CIS and another source). It is clear that most of the differences were caused by different surnames, while first names corresponded much better. The reason is the common practice of women (rarely men) changing their surname after getting married. This had to be taken into consideration, so the rules for deciding whether a pair of records is linked correctly had to be lenient enough. After testing several approaches, it was decided that if any of the compared pairs of strings (first names, surnames or whole names) is similar enough (a 70% threshold was established), the linked records would be considered a true match.


Table 10: Comparison of names and surnames between ISEO/CIS and records linked by PIN (names and surnames without diacritics and spaces)

SOURCE                            LINKED TO ISEO/CIS, WHERE NAMES      NAMES AND SURNAMES (%)
                                  ARE AVAILABLE: NAMES AND SURNAMES
                                  CORRESPONDING     DIFFERENT          CORRESPONDING     DIFFERENT
Central insurance register             160 963           950                99.4             0.6
Pension beneficiary register            45 355           168                99.6             0.4
Maternity beneficiary register             719            18                97.6             2.4
Tertiary education register             16 108         1 616                90.9             9.1
OK prace                                 8 058            42                99.5             0.5
OK nouze                                 4 260            53                98.8             1.2
OK centrum                              11 647           159                98.7             1.3

Figure 8: Cumulative distribution of similarity scores – comparison of first names and surnames in pairs of records from ISEO/CIS and another source, directly linked by PIN (100% = total number of pairs of linked records where a difference in names occurred)
[Chart not reproduced. Series: first name, whole name, surname. Horizontal axis: similarity score (Jaro-Winkler), 0 to 1. Vertical axis: cumulative frequency (%), 0 to 100.]

Figure 9: Effect of deleting common suffix ‘-ová’ from female surnames on similarity scores (100% = total number of pairs of linked records where a difference in names occurred)
[Chart not reproduced. Series: surname, surname without suffix '-OVA', whole name, whole name without suffix '-OVA'. Horizontal axis: similarity score (Jaro-Winkler), 0 to 1. Vertical axis: cumulative frequency (%), 0 to 100.]

Before the final comparison, one more issue had to be investigated. Most Czech female surnames end with the suffix ‘-ová’, which increases the similarity score of surnames with completely different roots. To increase the quality of the assessment, the suffix ‘-ová’ was deleted from surnames before comparison. The effect of this step is presented in figure 9.

Table 11: Examples of comparison of first names and surnames in pairs of records from ISEO/CIS and another source, directly linked by PIN

ISEO/CIS                      LINKED SOURCE (LINKED BY PIN)   SIMILARITY SCORE (WITHOUT DIACRITICS AND SPACES)        DECISION
FIRST NAME, SURNAME           FIRST NAME, SURNAME             FIRST NAME   SURNAME - 'OVA'   WHOLE NAME - 'OVA'   MAX
Magdaléna Šmejdová            Lenka Kotková                   0.44         0.00              0.41                 0.44   False match
Julie Kapitánová              Yuliya Yagelska                 0.70         0.51              0.50                 0.70   False match
Karin Kateřina Šuchová        Kateřina Kafková                0.75         0.00              0.69                 0.75   True match
Denys Yatsyuk                 Denis Jacjuk                    0.87         0.64              0.74                 0.87   True match
Eliška Charlotte Wurfelová    Eliska Wurfel                   0.80         1.00              0.86                 1.00   True match
Marina Dundová                Martina Bartáková               1.00         0.00              0.56                 1.00   True match
Quoc Viet Le                  Viet Le Quoc                    0.81         0.81              1.00                 1.00   True match

The results of applying the defined rules were very positive. Almost every pair of records met the condition of higher than 70% similarity score. Only 8 pairs (from all sources together) were identified as a false match. Visual checks of the results confirmed the reliability of the results. Table 11 presents examples of measuring similarity of records.
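The decision rule described in point f) can be sketched in Python as follows. This is an illustrative sketch, not the code used in the project: the Winkler parameters (prefix weight 0.1, prefix length up to 4) are the commonly used defaults and are an assumption, as is the strict "greater than" comparison with the 70% threshold. All function names are hypothetical, and the scores may differ slightly from those in table 11 depending on the similarity variant actually used.

```python
import unicodedata


def harmonize(text):
    """Strip diacritics, remove all whitespace and convert to capitals."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(stripped.split()).upper()


def strip_ova(surname):
    """Delete the common suffix 'OVA' from a harmonized surname."""
    return surname[:-3] if surname.endswith("OVA") else surname


def jaro(s1, s2):
    """Jaro similarity of two strings (0.0 to 1.0)."""
    if not s1 or not s2:
        return 0.0
    if s1 == s2:
        return 1.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed default parameters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def is_true_match(name_a, name_b, threshold=0.70):
    """name_a, name_b: (first name, surname) of two records linked by PIN."""
    first_a, first_b = harmonize(name_a[0]), harmonize(name_b[0])
    sur_a = strip_ova(harmonize(name_a[1]))
    sur_b = strip_ova(harmonize(name_b[1]))
    scores = (
        jaro_winkler(first_a, first_b),                  # first names
        jaro_winkler(sur_a, sur_b),                      # surnames without 'OVA'
        jaro_winkler(first_a + sur_a, first_b + sur_b),  # whole names
    )
    return max(scores) > threshold
```

Taking the maximum of the three scores is what makes the rule lenient: a pair with a completely different surname is still accepted when the first names agree, which covers surname changes after marriage.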

g) Acceptance of the results of linking process, creating internal IDs (primary key, foreign keys) and relations in the census database

All results presented in this chapter indisputably show that the PIN (birth number) is a very reliable identifier for linking records from various administrative sources. Due to organizational and technical problems with the database it was not possible to create the keys and relations in the database. Therefore, in the following actions PINs were used to combine data from the sources. However, since almost no records were linked without a PIN (see point e) above), this limitation would have virtually no effect on the results.


5. Priority, removal of records (over-registration), testing

5.1 Assessing consistency of information from linked sources at individual data level

In the previous chapter all data sources were linked to the constitutive source ISEO/CIS using a personal identifier (PIN). Linked records make it possible to assess the consistency of information coming from several sources. Sex and age are (as mentioned in chapter 4) included in the PIN, so the values of these variables are fully consistent among all records linked to ISEO/CIS. Citizenship is a widely recorded variable. Legal marital status is recorded in ISEO and in OK nouze (system of assistance in material need). Economic activity status is another characteristic that can be obtained in several ways; however, it is much more complicated, because deriving the information sometimes requires combining several sources.[2]

5.1.1. Citizenship

Citizenship was included in the majority of available sources. Values of this variable were compared to citizenship in ISEO/CIS. Citizenship of Czech citizens was almost fully consistent (99.5 – 100.0%), see table 12. However, there was one exception: citizenship in OK centrum corresponded with the reference source for almost no Czech citizens, because most of its records (95.9%) had no entry for citizenship.

Table 12: Citizenship consistency of data sources with the reference source, district Hradec Králové (%)

SOURCE                          CITIZENSHIP       CONSISTENT   INCONSISTENT
Reg. ed. db – I [1]             Czech citizens      100.0           0.0
                                Foreigners           78.4          21.6
Reg. ed. db – II [1]            Czech citizens      100.0           0.0
                                Foreigners           89.4          10.6
Reg. ed. db – III [1]           Czech citizens      100.0           0.0
                                Foreigners          100.0           0.0
Reg. ed. db – IV [1]            Czech citizens      100.0           0.0
                                Foreigners          100.0           0.0
Tertiary education register     Czech citizens      100.0           0.0
                                Foreigners           94.6           5.4
OK prace                        Czech citizens       99.8           0.2
                                Foreigners           93.1           6.3
OK centrum                      Czech citizens        1.7          98.4
                                Foreigners           82.2          17.9
OK nouze                        Czech citizens       99.5           0.5
                                Foreigners           84.1          15.9

[1] Regional education databases: I – elementary, lower secondary, II – upper secondary, III – conservatories, IV – tertiary, non-university education

Citizenship for foreigners did not correspond with the reference source so closely. Good results were found in the tertiary education register and OK prace (94.6 and 93.1%). Post-secondary schools and conservatories had only one entry each, which made them unfit for analysis. The worst outcome was recorded for the basic schools education database (78.38%).

[2] In case of addresses, no comparison between data sources was possible, since the addresses usually relate to different time periods as well as to various geographical levels and address types. The addresses were entered into many data sources together with the record and were not updated later. In particular, the ISEO/CIS records include only unique numerical IDs of address points, while the other sources, for example regional education databases, contain only district and municipality. Therefore, the analyses of address points and geographical information are going to be performed in a separate project focused on geo-referenced information.

5.1.2. Marital status

Marital status was included in OK nouze. Compared to the reference source ISEO/CIS, 77.8% of the entries were consistent.

Table 13: Marital status consistency of data sources with the reference source, district Hradec Králové

SOURCE               CONSISTENT   INCONSISTENT   TOTAL
OK nouze    abs.        3 354          960       4 314
            %            77.8         22.2       100.0

5.1.3. Current economic activity status

The only potentially usable variable coming from sources other than ISEO/CIS is current economic activity status. The information is not directly available in any source, but has to be derived from a combination of information from several sources. One possible way to obtain the information is as follows:

- Employed: existence of a valid record in the social security register (employees, self-employed persons)
- Unemployed: existence of a valid record in OK prace (unemployment register); alternatively, information on unemployment is also available in the health insurance register
- Persons below the national minimum age for economic activity: information on age in ISEO/CIS
- Pension or capital income recipients: partly available – existence of a valid record in the pension beneficiary register while there is no valid record on current employment in the social security register; alternatively, information on pension recipients is also available in the health insurance register
- Students: existence of a valid record in the education databases while there is no valid record on current employment in the social security register; alternatively, information on education can be estimated from the health insurance register
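The derivation rules listed above can be written down as a small decision function. The sketch below is only an illustration under stated assumptions: the field names are hypothetical flags meaning "a valid record exists in the given source", the minimum age of 15 is taken from the "age of 15 or more" note in table 14, and the precedence of "employed" over "unemployed" is an assumption, since the report does not say how to resolve that conflict.

```python
MIN_AGE_FOR_ECONOMIC_ACTIVITY = 15  # assumption, see the note in table 14


def economic_activity_status(person):
    """Derive current economic activity status from linked source flags.

    `person` is a dict with the hypothetical keys:
      age, in_social_security, in_unemployment_register,
      in_pension_register, in_education_db
    """
    if person["age"] < MIN_AGE_FOR_ECONOMIC_ACTIVITY:
        return "below minimum age"
    if person["in_social_security"]:
        # Assumption: an employment record takes precedence over all others.
        return "employed"
    if person["in_unemployment_register"]:
        return "unemployed"
    if person["in_pension_register"]:
        # Pension recipient only if no valid record on current employment.
        return "pension recipient"
    if person["in_education_db"]:
        # Student only if no valid record on current employment.
        return "student"
    return "not determined"
```

A function like this always returns a single status, so it hides conflicts between sources. Counting how often several flags are set at once is what exposes such conflicts.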

Testing the possibilities of obtaining data on economic activity by combining the sources revealed seriously inconsistent information (see tables 14, 15, figure 10). In particular, data on the unemployed were highly inconsistent. There are two sources containing data on unemployment: the health insurance register and OK prace (which is the official unemployment register). However, high numbers of registered persons were unemployed according to only one of these sources. Another serious inconsistency was found when combining information on unemployment (both sources) with information on employment. Around 25% (!) of unemployed persons were simultaneously recorded as currently employed in the social security register (table 15). Furthermore, combining records of pension beneficiaries and employment resulted in implausibly high shares of employed pension recipients (see figure 10 for comparison with 2011 census results). Based on these findings, the CZSO decided to collect data on economic activity status in field enumeration, unless the quality of information in administrative sources improves in the near future. To address the issue, the CZSO is going to arrange negotiations with register owners in 2017.


Table 14: Consistency of information on selected categories of economic activity status, coming from two different sources (100% = number of all persons belonging to the given category according to any of the sources; i.e. union of sources A and B)

ECONOMIC ACTIVITY - SELECTED CATEGORIES              ONLY ACCORDING   SOURCES      ONLY ACCORDING
                                                     TO SOURCE A      CONSISTENT   TO SOURCE B
Unemployed                                  Total          718           5 045         1 171
                                            %             10.4            72.8          16.9
Pension recipients                          Total        1 139          44 205         2 527
                                            %              2.4            92.3           5.3
Economically inactive students              Total          801           8 519         2 176
(age of 15 or more)                         %              7.0            74.1          18.9

Note: Unemployment – source A: OK prace (unemployment register), source B: health insurance register; Pension recipients – source A: pension beneficiary register, source B: health insurance register; Students – source A: regional education databases/tertiary education register, source B: health insurance register
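The percentages in table 14 use the union of both sources as the base. A minimal Python sketch of that calculation is given below; the function name and the representation of each source as a set of PINs are assumptions made for illustration. For the unemployed, the table values read as 718 persons only in source A, 5 045 in both sources and 1 171 only in source B.

```python
def consistency_shares(pins_a, pins_b):
    """Split the union of two sources into 'only A', 'both', 'only B'.

    pins_a, pins_b: sets of person identifiers belonging to the category
    according to source A and source B. Returns (count, percent) pairs,
    with percentages based on the union of both sources.
    """
    only_a = len(pins_a - pins_b)
    both = len(pins_a & pins_b)
    only_b = len(pins_b - pins_a)
    total = only_a + both + only_b
    if total == 0:
        raise ValueError("both sources are empty")

    def share(count):
        return round(100.0 * count / total, 1)

    return {
        "only_a": (only_a, share(only_a)),
        "consistent": (both, share(both)),
        "only_b": (only_b, share(only_b)),
    }
```

With these counts the union contains 6 934 persons, and the shares come out as 10.4%, 72.8% and 16.9%, matching the first row of the table.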

Table 15: Consistency of information on unemployment coming from different sources

ECONOMIC ACTIVITY                                   SOURCE: OK PRACE   SOURCE: HEALTH INS. REG.
Registered as unemployed                                  5 763              6 216
of which registered as employed            abs.           1 808              2 207
at the same time                           %               23.9               26.2

Figure 10: Share of employed persons in pension recipients by age – comparison of sources
[Chart not reproduced. Series: admin. sources (1.1.2016), 2011 population census. Horizontal axis: age, 15 to 80+. Vertical axis: share of employed persons in pension recipients (%), 0 to 60.]

5.2. Prioritization

After comparing information from various sources at individual level, the plan was to set up rules for prioritizing data sources in order to resolve inconsistent information. The resulting values of each variable would have been stored in the final records, to be used for calculating derived attributes and producing results. However, based on the findings from the previous activities, this task would serve no purpose. Concerning administrative sources (not field enumeration), the most relevant source would always be ISEO/CIS. Variables contained in this source, such as citizenship, legal marital status or place of residence, have the highest priority. Since these variables are “obligatory” in ISEO/CIS, no other source will ever be used for them. Variables (partly) unavailable in ISEO/CIS (such as year of arrival in the country) cannot be taken from any other analyzed source. The only variable where the sources could be prioritized is economic activity status. However, information for this variable is split among various sources and highly inconsistent (see previous subchapter); therefore, it should be resolved in bilateral or trilateral negotiations with data owners. Given the large extent of the inconsistency, the solution should be sought in improving quality in the original sources rather than in prioritizing information in the CZSO database.

5.3. Dealing with over-coverage in ISEO/CIS (setting rules for identification and elimination of records of persons not living in the Czech Republic)

Over-coverage is a common problem in population registers, and ISEO/CIS is no exception. Dealing with over-coverage is often based on detecting so-called “signs of life”, which means looking for records of a given person in other sources. This approach is appropriate for the Czech administrative sources. In chapter 3 (see figure 7) the linkage success was evaluated from the point of view of the “complementary” sources. The opposite view is needed when dealing with over-coverage of the constitutive source. Figure 11 presents relative age-specific frequencies of the number of complementary sources in which persons recorded in ISEO/CIS were identified. Figure 12 shows the shares of individual sources (it also clearly shows that the only available complementary source covering the whole population is the health insurance register; the other sources represent only fragments of the population). Both figures show that there are persons registered in ISEO/CIS who could not be found elsewhere. Relatively higher shares of these persons were detected at ages from approximately 25 to 40 years and at the highest ages. That corresponds to previous experience with over-coverage in ISEO/CIS (it was partly used in the 2011 census) as well as to findings reported in many other countries. In total, 3840 (2.3%) persons registered in ISEO/CIS without a valid record in any other source were identified.

However, signs of life can also be found in the ISEO/CIS record itself. If there is a recent change in the record, such as a change of residence or a change of marital status, this can be considered a clue to the physical presence of the given person in the Czech Republic. For the purpose of the project, a one-year period was set. For the real 2021 census the period might be reconsidered based on a deeper analysis of the whole ISEO/CIS (not only a limited sample).
While analyzing the results of the described over-coverage detection, one more step appeared appropriate. The available sample from the health insurance register contains “historical” records of insurance terminations. Individual checks of these cases (such as comparison with recent reports of deaths) showed that the records are very reliable. This may have been caused by the late delivery of the sample (see chapter 2.4), but the records contain information, for instance, on deaths before the reference date (1.1.2016) that were not recorded in ISEO/CIS, at least at the moment of obtaining the sample (the ISEO/CIS sample was prepared in March 2016). The reliability of the information on terminated insurance was confirmed even when a valid record in another source was found. Therefore, it was decided to use terminated insurance (due to death, long-term stay abroad etc.) as relevant information in itself for identifying over-coverage. The whole process of dealing with over-coverage can be summarized in the following points:

a) Linking records of persons with terminated insurance in the health insurance register to valid records in ISEO/CIS (a2: possible step for the real 2021 census – asking the Ministry of the Interior to review the particular records identified in the previous step)
b) Excluding records in ISEO/CIS linked in step a)
c) Identifying remaining records in ISEO/CIS not linked to a valid record from any other source (including field enumeration in a real census)
d) Detecting the latest change in every record in ISEO/CIS identified in step c)
e) Based on steps c) and d) – excluding records not linked with any other source and (at the same time) unchanged for a defined period (one year)
f) Verifying results
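Steps a) to e) above can be expressed as a short set-based procedure. The following Python sketch is illustrative only: the input structures (a dict of latest-change dates keyed by PIN, and sets of PINs) and all names are assumptions, and a "period of one year" is approximated as 365 days before the reference date. Step f), verification, is not covered.

```python
from datetime import date, timedelta


def detect_overcoverage(iseo_last_change, terminated_pins, linked_pins,
                        reference_date, window_days=365):
    """Return the set of ISEO/CIS PINs identified as over-coverage.

    iseo_last_change: dict PIN -> date of the latest change in the record
                      (None if no change is known)
    terminated_pins:  PINs with terminated health insurance
    linked_pins:      PINs with a valid record in any complementary source
    """
    # a) + b) records linked to a terminated insurance are excluded
    terminated = {pin for pin in iseo_last_change if pin in terminated_pins}
    remaining = set(iseo_last_change) - terminated

    # c) remaining records without a valid record in any other source
    no_sign_of_life = {pin for pin in remaining if pin not in linked_pins}

    # d) + e) of those, exclude records unchanged for the defined period
    cutoff = reference_date - timedelta(days=window_days)
    stale = {
        pin for pin in no_sign_of_life
        if iseo_last_change[pin] is None or iseo_last_change[pin] < cutoff
    }
    return terminated | stale
```

In this sketch a terminated insurance excludes a record even when the person has a valid record elsewhere, which mirrors the decision to treat terminated insurance as relevant information in itself.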

Figure 11 Share of persons registered in ISEO/CIS by number of linked sources and age (100 % = Total number of persons at a given age registered in ISEO/CIS)
[Chart not reproduced. Series: ISEO/CIS only, ISEO/CIS + 1 source, ISEO/CIS + 2 sources, ISEO/CIS + 3-6 sources. Horizontal axis: age, 0 to 100+. Vertical axis: %, 0 to 100.]

Figure 12 Share of persons registered in ISEO/CIS linked to selected administrative sources by age (100 % = Total number of persons at a given age registered in ISEO/CIS)
[Chart not reproduced. Series: Linked to any (at least 1) source; Social Security Administration sources; Ministry of Labour sources; Ministry of Education - regional education databases; Ministry of Education - tertiary education register; Health insurance register. Horizontal axis: age, 0 to 80+. Vertical axis: %, 0 to 100.]


5.4. Application of the approach

The results of the approach to over-coverage described above are presented in figures 13 and 14. In total, 3612 (2.2%) ISEO/CIS records were identified as over-coverage. In most age groups (5-10, 50+) the rules applied to reduce over-coverage brought the analyzed data closer to the CZSO intercensal estimate (used as a reference source). In contrast, at ages from approximately 20 to 40 years the gap became even larger (and changed from mostly positive to negative, see figure 14). However, most of the difference was caused by excluding records of persons with terminated insurance. The vast majority of these persons also had no valid record in any other source, so they met both conditions for being excluded. Furthermore, data on persons aged between 20 and 40 are affected by over-coverage more often than most other age groups, mainly because of under-registered emigration (this issue biases even the intercensal estimates, in this case used as the reference source). For these reasons the difference from the reference source can be considered justified and the applied approach suitable for a real census.

Figure 13 Effect of applying over-coverage solution on age-sex structure of ISEO/CIS population
[Chart not reproduced: population pyramid. Series: Females - result after deleting over-coverage; Males - result after deleting over-coverage; Females - no "sign of life"; Males - no "sign of life"; Females - terminated health insurance; Males - terminated health insurance. Vertical axis: age, 0 to 90+. Horizontal axis: population, 0 to 1 500 on each side.]

Figure 14 Relative difference in the number of persons registered in ISEO/CIS from intercensal estimate by sex and age before and after applying over-coverage solution
[Chart not reproduced. Series: Without duplicate records, before solving over-coverage; After applying information on terminated insurance; Final results. Horizontal axis: age, 0 to 90+. Vertical axis: relative difference from CZSO intercensal estimate (%), -6 to 10.]


6. Conclusion

The aim of the project was to get access to administrative data sources for the purpose of assessing their usability for the 2021 population census. In the first stage of the project, bilateral negotiations were held between the CZSO and data owners in order to reach agreements on providing administrative data. After signing the agreements, the CZSO obtained samples from thirteen administrative sources during the period from March 2016 to October 2016. The samples covered the population of the district (LAU 1) Hradec Králové at 1 January 2016.

Preparing the samples by the owners, transmitting the data and storing them in the CZSO database environment revealed a serious issue: most register owners were not prepared to provide external users with data from their systems. Therefore, in almost all cases ad-hoc procedures had to be developed. Descriptions of files and other metadata were often incomplete or of poor quality, and sometimes data integrity was not ensured. Although unpleasant, these are very important findings of the project. As the number of records was quite limited, the CZSO was able to deal with the issues. However, for a real 2021 census standardized modes of data exchange have to be established.

After storing the data in the database, the samples were analyzed in order to assess the completeness, reliability and overall usability of selected variables potentially suitable for census purposes. ISEO/CIS (population register) will be the main (constitutive) administrative source; all the other sources will be used as complementary, to add information on persons recorded in the main source. However, as the complementary sources contain very few variables usable for the census and/or cover only specific parts of the population, these sources would be employed mainly to verify the validity of records in ISEO/CIS (dealing with over-coverage). Records from all complementary sources had to be linked to ISEO/CIS as the main source; unlinked records cannot be used in any way.
All sources contained the birth number, which is a widely used personal identifier in the Czech Republic. The linking process was performed and its results analyzed. In most of the sources the linking success rate was very high. Virtually all links were successfully verified by comparing alternative identifiers. The vast majority of unlinked records could be explained by minor inconsistencies in the variables defining the samples (place of residence within a selected territory). The results of the linking process clearly showed that the birth number is a very reliable identifier.

Linked data from the sources made it possible to analyze the consistency of information at individual level. Data on sex and age were fully consistent (these variables are included in the PIN), and a relatively satisfactory level of consistency was found for citizenship. Consistency of other variables coming from more than one source was poor, in particular the consistency of information on economic activity status.

The last task of the project was to develop and apply an approach for dealing with over-coverage in ISEO/CIS. Information on terminated health insurance available in the health insurance register, together with the detection of so-called signs of life (searching for valid records in complementary data sources), was successfully used to identify and eliminate over-coverage of ISEO/CIS. The overall amount of over-coverage was 2.3% of records, which is a relatively high percentage compared to numbers reported by some other countries (e.g. Austria, Slovenia), but corresponds to experience from the last (2011) Czech population census.

In the 2021 census it will be possible to collect some variables (such as citizenship or legal marital status) entirely from administrative data, or at least for persons not enumerated in the field, but full field enumeration will have to remain the main method of collecting census data. In the 2021 census the main benefit gained from administrative sources will be in increasing the completeness of the census, i.e. adding persons not enumerated in the field and reducing the number of not stated values. The limited number of usable variables available in administrative sources is the main reason why full field enumeration will be inevitable in 2021. Despite this fact, getting access to the administrative sources and starting cooperation with data owners are important steps on the way to wider use of administrative data in future censuses as well as in other statistical domains.
