Grant for 2016

Improvement of the use of administrative sources

(ESS.VIP ADMIN WP6 pilot studies and applications)

Methodological report

Statistics

2018

Content

Introduction

Detailed evaluation of the results of the action

I Detailed description of the methodology

1. Developing methodology for register-based population and housing census

1.1 Index-based methodology of using registers

1.2 Using partnership index to correct data on household structure

1.2.1 Implementing the partnership index

1.2.2 Development of partnership index

1.3 Deriving new variables from administrative variables and ensuring their compliance with census definitions and comparability

1.4 Improving the use of administrative data sources

1.4.1 Developing cooperation agreements with the owners of administrative data sources

1.4.2 Register of granting international protection (RAKS)

1.4.3 Building identifiers for linking administrative data

1.5 Compilation of activity status based on register data

1.6 Adapting quality framework for the evaluation of administrative data

1.7 Case study: How was the quality of the register measured for the Estonian Building Register (EBR)?

1.8 Developing statistical register systems for managing statistical registers of population, businesses, agricultural holdings, buildings and dwellings

1.8.1 Working out methodology for housing data management in statistical register system

1.9 Integration of new administrative sources

1.9.1 Developing register-based housing statistics

1.9.2 Building identifiers for linking administrative data

1.9.3 Deriving new variables from administrative variables and ensuring that they comply with the census definitions and are comparable

1.10 Developing statistical output production

1.10.1 Thematic studies

1.10.2 Methodology for compiling daytime and night-time population, study migration

1.11 Study of confidentiality issues for renewal of PHC 2000 and PHC 2011 datasets for releasing grid-based data

1.12 Exchange of experiences: study visits to the Dutch Statistics and Statistics Norway

II. Detailed evaluation of the results of the action including assessment of the quality of data

III. Adapting quality framework for the evaluation of administrative data

IV. Assessment of sustainability over time and plan for implementing further changes in the statistical system as a result of the action

4.1 Two possible strategies of using index-based methodology in population statistics

4.2 The work to develop partnership index continues

4.3 Register’s data quality assessment

4.4 Revision of parameters of indexes

V Assessment of the possible applicability of the methodology/procedure to other context

VI. Summary of the problems encountered

VII Description of how further actions could improve the use of administrative sources

Conclusion

ANNEXES


Introduction

The grant (Eurostat Grant Agreement on the ESS.VIP ADMIN WP6 pilot no 07112.2016.004-2016.586) had a dual purpose:

1. To develop the methodology for register-based population and housing census.

2. To develop methodology for measuring the data quality from data sources (registers).

The results of the grant enable Statistics Estonia (SE) to reap the benefits of using register data for the production of dwelling statistics and to guarantee the quality of the output when register data are used for the register-based census. Based on the grant results, SE is able and plans to carry out a second trial of the register-based census in 2019.

According to Statistics Estonia’s strategy, there is a great need to meet user needs and to map population changes and assess trends consistently and quickly. To achieve this, new data sources are needed that could make the production of statistics more varied. Therefore, one of the important objectives was to make greater and better use of register data. A major achievement was the production of housing statistics based on register data.

Regarding preparatory works for the next census round, it was vital to create the prerequisites for a register-based census and to develop a system of organised and harmonised state registers, which requires contributions towards raising the level of data quality. The measures required to ensure compliance of all registers with the requirements for statistics include improvement and monitoring of data quality by Statistics Estonia on a regular basis. In line with these goals, many problems were solved within the grant activities:

1. The management and use of administrative data sources at SE improved.

2. A quality framework for the evaluation of administrative data was adopted for the census.

3. A statistical register system was established for managing statistical registers of population, businesses, agricultural holdings, buildings and dwellings, and was further developed for census and population statistics purposes.

4. New administrative sources (e.g. the traffic register) were integrated into statistical production.

5. Statistical output production was developed (a thematic grid map with resolution 1 km x 1 km has been published for the whole territory of Estonia and a grid map with resolution 500 m x 500 m has been published for all cities and the adjacent areas).

6. An exchange of experiences took place: two study visits were organised, to the Dutch Statistics and Statistics Norway.

As already mentioned, the aim of the grant was to test the quality of the registers used to form the census characteristics and the functioning of the methodology for a register-based census.

The quality level of data was determined in relation to the established criteria and the quality requirements for statistics.

The methodologists have solved the following main problems connected with the forming of census characteristics:

1) Analysing the relationship of census definitions and value scales with the definitions and value scales used in registers;

2) Testing the quality of registers and making efforts to urge register holders to eliminate any shortcomings;

3) Establishing optimal rules for forming each census characteristic based on register data.

Implementation area of grant results: in regular population statistics, census, housing statistics.

The grant actions took place on time, according to the timetable. Grant activities started in October 2016 and finished in March 2018 (Annex I). The methodological report was prepared by Kristi Lehto, Kaja Sõstra, Maret Muusikus, Vassili Levenko, Ülle Valgma, Helle Visk, Pille Kool, Krista Türk, Maret Priima, Diana Beltadze.

Detailed evaluation of the results of the action

1. Improving the use of administrative data sources

Activities:

• Getting access to new data sources: the Employment Register, databases of the Border Guard Board, Housing loans, Estonian Rescue Board, Traffic Register;

• Developing cooperation agreements with the owners of administrative data sources;

• Getting access to metadata;

• Assessing the suitability and quality of a new data source for use in statistical production;

• Testing several sources and making the best choice.

Results:

1. Production and output of housing statistics

2. A partnership index and partnership markers were developed

3. Description of the metadata of the registers used in the metadata repository

2. Adapting quality framework for the evaluation of administrative data

Activities 1:

• Adapting the quality framework for the census in terms of the number and types of quality indicators.

Result: A methodology was worked out based on data source quality assessment, together with instructions for data quality assessment for register holders.

The data quality instructions developed for register holders include methodologies for measuring and ensuring the quality of data in the information system as a whole.

The instruction specifies a methodology for managing the monitoring and supervision of database quality and includes recommendations for metrics to be used in data quality monitoring.

The developed framework for data quality management includes three elements:


1. Data quality model for measuring and improving the categories associated with data quality for statistical purposes.

2. Set of data quality indicators, which can be used for testing different aspects of data quality.

3. Framework for data quality management, which is a set of iteratively implementable actions to ensure data quality.

Activities 2:

• Identification and comparison of all the quality aspects identified for administrative data (source, metadata, data and process) for the register-based census and other statistics.

Result: Report to register holders

Activities 3:

• Setting up procedures for the quality assessment of registers at organisational level

Result: Statistics Estonia has issued requirements for register holders:

Activities 4:

• Assessment and improvement of the quality (e.g., coverage) of registers used in the pilot Census in 2016

Statistics Estonia assessed the quality of data in databases. We worked out instructions for ensuring data quality:

Results: A data quality report containing the assessment results for 17 registers.

3. Developing statistical register systems for managing statistical registers of population, businesses, agricultural holdings, buildings and dwellings

Activities:

• Building identifiers for linking administrative data;

• Integrating a new administrative source in statistical production;

• Deriving new variables from administrative variables and ensuring that they comply with the census definitions and are comparable;

• Developing register-based housing statistics;

• Developing register-based study migration;

• Improving statistical frames.

Results: New data sources were implemented into the index-based methodology, and a methodology for the register-based census using 24 different registers was presented. A study and comparison of definitions was done for census purposes using new registers (4). The statistical register system was prepared for the housing census.

Housing statistics were produced, as well as new products in the field of migration.

4. Developing methodology for register-based population and housing census

Activities 1:

• To compare differences between statistical and register-based definitions.

Results: Work was started in October 2016. The analysis was finished before March 2018.

Activities 2:

• Determining whether the adoption of new administrative sources will help to remedy the problematic situation where the data on the registered and actual residence of an inhabitant differ by register.

Results: Preparatory works started in October 2016. In March, an analysis was done with LFS third-quarter data. EU-SILC data were analysed in December 2017. The end of this work was postponed to 27 March 2018. Despite that, we did not receive the LFS fourth-quarter data, so this data set remained outside the project scope.

Results: A report was presented to the census steering committee. The report concludes that for one fifth of the population the registered place of residence is different from the actual place of residence.

Activities 3:

• The examination of data on the place of work and occupation of employed persons.

Results: The data were reviewed and the decisions on their usability for the census were presented to the census steering committee.

Activities 4:

• Different registers are under- and over-covered in terms of the characteristics and total population of dwellings.

Results: The methodology was worked out and published in the Quarterly Bulletin of Statistics Estonia, 2017 No 3 and 2018 No 1.

5. Developing output of statistics

In the production of the GIS output based on the data of the previous census, it turned out that releasing geo-referenced data is problematic in Estonia because of the low population density. We had to implement a harmonised methodology for ensuring confidentiality in the case of all GIS products in all of our statistical output.

Activities 1:

• Overcoming confidentiality problems (safeguards and methods to protect confidentiality, anonymisation of datasets).

Activities 2:

• Geocoding of address and building registers for statistical purposes (adding missing codes and updating codes).

Activities 3:

• To implement a method for privacy and security, and to work out the security measures required for grid statistics for the census

Results:

1. Revision of rules and, according to the new rules, production of statistical output using register data

2. Blog post about disseminated georeferenced data: https://blog.stat.ee/2017/12/21/gumnasistide-koolitee-pikkus/

3. A methodology for overcoming confidentiality problems was tested for the next census round, and a methodology was worked out for ensuring the confidentiality of data for GIS products at SE.

6. Exchange of experiences: study visits to the Dutch Statistics and Statistics Norway

Results: A concept for a pilot census to test plan B, based on knowledge gained from the register-based statistics methodology of the Dutch Statistics.

I Detailed description of the methodology

1. Developing methodology for register-based population and housing census

1.1 Index-based methodology of using registers

By Ene-Margit Tiit

Definition of residency, external migration, family nucleus and living quarters of households

In using registers, we always suppose that they are correct, and that the information in registers is, in general, adequate. This premise is true if the registration of an event is more or less automatic, i.e. independent from the wish of a person.

The situation is different if a person must be active in registering himself/herself (for instance, in the case of external migration). The situation is especially difficult if some bonuses are connected with registration. An example of this is registering the living quarters, as in different places, there are different bonuses available – free transport, better or more convenient kindergarten, etc. There are also some events/situations that cannot be registered in any administrative register, for instance consensual partnership as a core of family nucleus.

As the above mentioned situations are very important for defining the central concepts of population statistics of a country – such as population size, external migration, household structure and spatial distribution of the population in the country – the official register data must be checked and, if necessary, corrected. In reality, it is generally not possible to correct data in administrative registers, but, in this case, it is necessary to build a statistical register containing statistically corrected data, and use these data for projections and decision-making.

Here somebody may ask how it is possible to correct registers using only register data. The answer comes from an old tradition of statistics: repeated measurements make the measurement results more precise. In a similar way, by using a large number of registers, it is possible to improve the quality of administrative data.

Necessary premises for developing index-based methodology


It is necessary that in the country there exists a rich set of administrative registers, satisfying the following conditions:

1. All persons in all registers are identified by their unique ID codes;

2. All living quarters (dwellings, family houses, etc.) in all registers are identified by their unique address IDs;

3. All registers cover the whole population of their scope and are regularly (at least yearly) updated.

There is a list of the enlarged population of persons that contains all persons who have lived in the country (at least during the past ten years) and who have been active in any register. All persons in this list have ID codes. In what follows, we will use indexes j and h for persons. The total number of persons in set H is, in general, larger than the population size.

There is a list of the enlarged population of dwellings, containing all liveable dwellings. As the index of dwellings, we will use g; G will denote the number of all liveable dwellings at the moment (year), including inhabited and not-inhabited ones.

Both of the enlarged lists are updated yearly, taking into account, in the case of persons, natural changes (births and deaths) but not migration. In a similar way, the list of dwellings must be updated yearly by adding new living quarters and deleting demolished ones.

Defining signs SOL, SOP and SOD

From each register containing information about people living in the country, it is possible to get signs which are useful for making decisions about persons. In what follows, we will use i to denote a register; the total number of registers is I. The year is denoted by k, and it is assumed that all calculations are made yearly; the decisions about the situation in year k are made using the information (events) in year k − 1.

Sign of life

A sign of life (SOL) shows if a person was active in the country of his/her residence in a fixed year. In general, SOL E(i,j,k) is a binary variable, having values 0 and 1, and showing if person j (j = 1, …, J) was active in register i in year k or not. To get the necessary information each year for all the people from the enlarged population, their activity in all possible registers (i = 1,2, …, I) is checked.


Examples of SOL: visiting a doctor in the country, learning in a school, working in an enterprise situated in the country, getting any social support, etc.

Sign of partnership

A sign of partnership (SOP) P(i,j,h, k) shows the connection between a couple – two persons j and h from the enlarged population; the list of possible registers i contains the registers that show possible connection between the persons in year k.

Examples of SOP: having a common child, being married, being divorced (negative sign), having common real estate, etc.

Sign of placement

A sign of placement (SOD) D(i,j,g,k) shows the connection between person j and living quarters (dwelling) g in year k.

Examples of SOD: being registered as a person living in a given dwelling; ownership of dwelling; paying for electricity, heating, etc. in a given dwelling.

Concept of index

Index is defined yearly as a linear combination of signs received during the previous year. The value of an index varies between 0 and 1, and it can be considered a probability of a positive event: the person is a resident, the pair of persons constitute a partnership couple, the household lives in the given dwelling. To make a decision, a threshold is used: if the index value is higher than the threshold, the event is considered positive.

In defining the index, there are two crucial questions:

1. Defining the weights of signs;

2. Defining the threshold that allows a decision to be made.

Calculation of weights

In principle, the task can be solved by using some multivariate procedure (logistic regression, discrimination) or machine learning procedure, if there exists a good and reliable training sample.


There are several possibilities: using census data (but they are useable only for a certain number of years after census), using some survey materials (here the problems are connected with sample sizes) or using data of previous years (which is impossible in the first year of study).

If the procedure has been used already for several years, there is a rather simple way to calculate the weights empirically. Let N(1) be the set of all people whose index value in the previous year was exactly 1 (“confident residents”) and N(0) the set of all people whose index value was exactly 0 (“confident non-residents”). Then, for each sign B(i,k), we calculate the weight a(i,k) as a ratio of averages:

$$a(i,k) = \frac{M_1 B(i,k)}{M_0 B(i,k)}, \tag{1}$$

where $M_1$ is the average frequency of sign i in the subset of confident residents and $M_0$ is the average frequency in the subset of confident non-residents. Here it is assumed that usually some non-residents also get some SOL, but such cases cannot be too frequent. This way the weight a(i,k) shows how “strong” or useful SOL i is. All weights are recalculated yearly; in general, their values are not constant, but the changes are quite small.

Instead of weights a(i,k), in practice, their logarithms p(i,k)= ln(a(i,k)) are used, as their variation is much smaller.
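As an illustration of formula (1), the following minimal Python sketch computes the weights and their logarithms from binary signs. The function name `sign_weights`, the dictionary layout of the input and the decision to reject signs whose average frequency is zero in either group are assumptions made for this example; they are not part of the methodology described here.

```python
import math


def sign_weights(signs, confident_residents, confident_non_residents):
    """Return a list of (a, ln a) pairs, one per register, as in formula (1).

    signs maps a person ID to a list of 0/1 values, one per register.
    The other two arguments are lists of person IDs whose index value in
    the previous year was exactly 1 or exactly 0.
    """
    n_registers = len(next(iter(signs.values())))
    result = []
    for i in range(n_registers):
        # Average frequency of sign i among confident residents (M1)
        m1 = sum(signs[j][i] for j in confident_residents) / len(confident_residents)
        # Average frequency of sign i among confident non-residents (M0)
        m0 = sum(signs[j][i] for j in confident_non_residents) / len(confident_non_residents)
        if m1 == 0 or m0 == 0:
            # Assumption: the ratio or its logarithm is undefined in this case
            raise ValueError(f"sign {i} has zero frequency in one group")
        a = m1 / m0
        result.append((a, math.log(a)))
    return result


# Hypothetical data: one register; persons 1-4 are confident residents,
# persons 5-8 are confident non-residents.
signs = {1: [1], 2: [1], 3: [1], 4: [0], 5: [1], 6: [0], 7: [0], 8: [0]}
weights = sign_weights(signs, [1, 2, 3, 4], [5, 6, 7, 8])
```

In this hypothetical data the sign occurs for three of four confident residents and one of four confident non-residents, so the weight is 0.75 / 0.25 = 3.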

Calculation of threshold

To calculate a threshold, empirical training data are used, and the value of the threshold is calculated in a way that the inclusion error and exclusion error are in balance and as minimal as possible.
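The text does not state how balance and minimality are combined, so the sketch below shows only one possible reading. It scans a list of candidate thresholds on a labelled training set and picks the one where the two errors are closest to each other, breaking ties by the smaller total error. The function name, the error definitions (shares within each true group) and the selection rule are all assumptions.

```python
def choose_threshold(training, candidates):
    """Pick a threshold from candidates using labelled training data.

    training is a list of (index_value, is_true_resident) pairs.
    """
    residents = [v for v, truth in training if truth]
    non_residents = [v for v, truth in training if not truth]
    best = None
    for c in candidates:
        # Inclusion error: true non-residents whose index exceeds the threshold
        inclusion = sum(v > c for v in non_residents) / len(non_residents)
        # Exclusion error: true residents whose index does not exceed it
        exclusion = sum(v <= c for v in residents) / len(residents)
        # Assumed rule: balance first, then the smallest total error
        score = (abs(inclusion - exclusion), inclusion + exclusion)
        if best is None or score < best[0]:
            best = (score, c)
    return best[1]


# Hypothetical training data
training = [(0.9, True), (0.8, True), (0.75, True),
            (0.6, False), (0.3, False), (0.1, False)]
threshold = choose_threshold(training, [0.2, 0.5, 0.7, 0.85])
```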

Stability term

Getting signs by a person is a random process. In decision-making, such randomness should be eliminated as much as possible. In fact, changes of residency, living quarters or household status do not occur very often. That is the reason why, in making a decision about a person’s status, we have to know his/her status in the previous year. In making a decision, it is useful to not only use the information about signs received in the previous year, but also information about the past situation. Hence, the common form for an index is the following:

$$R(j,k) = d\,R(j,k-1) + g \sum_{i=1}^{I} a(i,k)\,S(i,j,k), \tag{2}$$

where d and g are empirically defined parameters (0 ≤ d, g ≤ 1, d + g = 1), and S(i,j,k) is the sign received by person j from register i in year k. The first term of formula (2) is the stability term, ensuring that the status of a person does not change too rapidly (compared with empirical data). The index value calculated by formula (2) is truncated to the interval [0, 1]. The decision about the status of person j is made using threshold c. When the index is calculated for the first year, the first term equals zero.
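A minimal Python sketch of formula (2) with truncation and the threshold decision is given below. The default values d = 0.8, g = 0.2 and c = 0.7 are the ones reported for the residency index in this report; the function names and the example weights are hypothetical.

```python
def update_index(previous_index, signs, weights, d=0.8, g=0.2):
    """Formula (2): stability term plus weighted sum of signs, truncated to [0, 1].

    previous_index is R(j, k-1); use 0.0 in the first year.
    signs and weights are lists with one entry per register.
    """
    raw = d * previous_index + g * sum(a * s for a, s in zip(weights, signs))
    return min(1.0, max(0.0, raw))


def is_positive(index_value, c=0.7):
    """The event is considered positive if the index exceeds threshold c."""
    return index_value > c


# Hypothetical weights for two registers
weights = [2.0, 4.0]

first_year = update_index(0.0, [1, 1], weights)        # truncated to 1.0
no_signs_1 = update_index(first_year, [0, 0], weights)  # stability term only
no_signs_2 = update_index(no_signs_1, [0, 0], weights)
```

With these parameters a person with index 1 who receives no signs stays above the threshold for one year (index 0.8) and falls below it in the second year (index about 0.64), which shows how the stability term delays a change of status.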

Concrete indexes

1.Residency index

Residency index has been used in Statistics Estonia since 2015: the official population size and external migration size are calculated using the index. Also, the size of transnational population has been estimated using the residency index.

About 20 registers and sub-registers are used in the residency index; the number of registers increases over time. The values of the parameters are estimated as follows: d = 0.8, g = 0.2, c = 0.7. The estimated inclusion and exclusion errors were less than 3% in 2013 and less than 1% in 2017; a new check is in progress.

2.Partnership index

Partnership index was calculated for the first time in 2017, using about 10 signs from 7 registers. Additionally, some continuous explanatory variables using time were created (such as the age of the youngest child, duration of marriage and the age difference of partners).

A test population consisting of about 20,000 pairs was created on the basis of surveys conducted in the previous two years.

Four procedures were used for estimating weights – logistic regression analysis, linear discriminant analysis, weights calculated by formula (1) and their logarithms. In all the cases, the sum of inclusion and exclusion errors was 15% (in the optimal case, inclusion error was about 5%). As the calculations were made for the first time, it was not possible to use the stability term of the formula (2).

By using the partnership index and information from the Population Register, where all child-parent connections are fixed, it is possible to create all nuclear families:

1. Couple without children;

2. Couple with one child or more children of both partners (adolescent, but also older);

3. Single parent with one or more children.


For forming households, some additional information is needed.

3.Placement index

Placement index is a part of the general index methodology, and has not been used in Estonia. The common scheme of using indexes is the following:

1. Check the residency of persons and from then on consider residents only.

2. Form partnership couples and families.

For each family and single person, a placement index will be computed as a weighted sum of all the placement signs connecting a person (persons in the household) with particular living quarters. It is possible for some persons to have indexes connecting them with different living quarters, and particular living quarters can also be connected with several families.

We also suppose that for those who do not have any signs of placement, a “place of existence” is fixed in registers, i.e. a city or municipality where he/she probably lives.

For all families, the suitability characteristic is calculated on the basis of family size and characteristics of family members (sex, age, status). For creating this characteristic, expert estimates and empirical data on satisfaction of families with their living conditions are used.

In a similar way, characteristics are created for all living quarters, containing information about their size, number of rooms, etc. The placement index connecting a family with a dwelling will be supplemented with a term characterising the suitability of the living quarters for the particular family. In decision-making, also this supplementary term will be used.

In general, each dwelling will be connected with one family, but there are some exceptions:

a. In one dwelling there may live two families, if they are relatives (two generations);

b. It is possible that in the living quarters of a family there may live some single persons (e.g. students or other subtenants).

The people/families who do not have any placement signs will be connected with empty dwellings that are suitable for them and are situated near their place of existence. If there are several possibilities, random choice will be used to assign a dwelling for each person/family.


References

1. Maasing, E., Tiit, E.-M., Vähi, M. Residency index – a tool for measuring the population size. ACTA ET COMMENTATIONES UNIVERSITATIS TARTUENSIS DE MATHEMATICA Volume 21, Number 1, June 2017

2. Maasing, E. (2015). Eesti alaliste elanike määratlemine registripõhises loenduses. Tartu Ülikool, MSI magistritöö. [www] http://dspace.utlib.ee/dspace;/handle/10062/47557 (25.08.2015).

3. Tiit, E.-M. (2015). REGISTRIPÕHISE RAHVA JA ELURUUMIDE LOENDUSE METOODIKA JA SELLE ARENGUSUUNDUMUSED. Eesti Statistika Kvartalikiri 3/15, lk 42–64

4. Tiit, E.-M., Vähi, M. Indexes in demographic statistics: a methodology using nonstandard information for solving critical problems. Papers on Anthropology XXVI/1, 2017, pp. 72–87

5. Tiit, E.-M., Visk, H., Levenko, V. (2018). Partnership index. Eesti Statistika Kvartalikiri 2/18

6. Tiit, E.-M., Vähi, M., Kool, P. (2018). Paiksed ja hargmaised eestlased. Akadeemia, nr 2, lk 231–253.

1.2 Using partnership index to correct data on household structure

By Kristi Lehto, Pille Kool, Helle Visk, Ene-Margit Tiit, Vassili Levenko

1.2.1 Implementing the partnership index

In the couples’ dataset, fathers and mothers of minor children are recorded as single parents if no partner – spouse or cohabitee – is registered in the same dwelling. This is a ‘register-based’ determination of single parents, even though they may not actually be single parents. The primary goal of using the partnership index is to identify their partners to make the record reflect the actual family structure.

The partnership index rule (e.g., an aggregate index) is applied to the single parents' part of the couples’ dataset. This results in about 35,000 couples where a single mother is connected to a man and in about 6,000 couples where a single father is connected to a woman. The datasets of single parents are combined to form the basis for further processing. The numbers of couples were derived from actual methodology tests and this stage has been completed for the created dataset.

Establishing family composition

The dataset of couples is tested for multiple instances of the same individual. If there are two (or more) couples with the same person (man or woman), the couples are compared based on the timing of significant events. The couple with the most recent date of such an event (marriage, childbirth) is identified. The couple (couples) that had such events at earlier dates is (are) deleted from the couples’ dataset. The procedure is repeated until there are no longer any multiple instances of the same individual in couples.
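The removal of multiple instances can be sketched in Python as follows. The text does not specify the order in which persons are processed, so the sketch assumes one simple reading: couples are visited from the most recent event to the oldest, and a couple is kept only if neither partner already belongs to a kept couple. The function name and the tuple layout are hypothetical.

```python
from datetime import date


def resolve_duplicate_couples(couples):
    """Keep, for each person, only the couple with the most recent event.

    couples is a list of (person_a, person_b, latest_event_date) tuples.
    Assumption: couples are processed from the newest event to the oldest.
    """
    kept = []
    taken = set()  # persons already assigned to a kept couple
    for a, b, when in sorted(couples, key=lambda c: c[2], reverse=True):
        if a in taken or b in taken:
            continue  # an earlier-dated couple with the same person is dropped
        kept.append((a, b, when))
        taken.update((a, b))
    return kept


# Hypothetical couples: person B appears twice
couples = [
    ("B", "C", date(2010, 5, 1)),
    ("A", "B", date(2015, 3, 1)),
    ("C", "D", date(2005, 9, 1)),
]
kept = resolve_duplicate_couples(couples)
```

Other processing orders can give different results when chains of overlapping couples occur, so the rule would need to be fixed explicitly in a production implementation.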

The number of (minor) children in the family is calculated for each couple by adding up the children that the couple had together and the children each partner had separately. Adding student-age children (19−20 years) would be a conceivable option. Additional information on the number of children can be obtained from the Population Register.

Identifying the couple’s dwelling

If a couple has a shared dwelling, it is deemed to be the family’s dwelling.

However, in the usual scenario, the database does not indicate a shared dwelling (this is due to the assumptions used), but several potential dwellings are available:

• A registered dwelling of either partner;

• A vacant dwelling owned by either partner;

• A secondary dwelling of either partner.

Suitability of dwelling for the family

The following criteria were used to select the best-matching dwelling for a couple:

• The dwelling must be suitable for all-year-round habitation;

• The dwelling should not include other households (except for close relatives of the partners, such as adult children or elderly parents of the partners);

• The size of the dwelling should be adequate considering the size of the family (at least 1 room per adult member and 0.5 rooms per child; available heating and water supply system; a kitchen or kitchenette).
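The criteria above can be expressed as a simple check, sketched here in Python. The room requirement (1 room per adult and 0.5 rooms per child) is taken from the text; the class and field names are hypothetical, and other households are represented only by a count of unrelated households, which is an assumption about how the close-relative exception would be encoded.

```python
from dataclasses import dataclass


@dataclass
class Dwelling:
    rooms: int
    all_year_habitable: bool
    has_heating: bool
    has_water_supply: bool
    has_kitchen: bool              # kitchen or kitchenette
    unrelated_households: int = 0  # households other than close relatives


def is_suitable(dwelling, adults, children):
    """Return True if the dwelling meets all criteria for the family."""
    required_rooms = 1.0 * adults + 0.5 * children
    return (
        dwelling.all_year_habitable
        and dwelling.unrelated_households == 0
        and dwelling.rooms >= required_rooms
        and dwelling.has_heating
        and dwelling.has_water_supply
        and dwelling.has_kitchen
    )


# Hypothetical example: two adults and two children need at least 3 rooms
flat = Dwelling(rooms=3, all_year_habitable=True, has_heating=True,
                has_water_supply=True, has_kitchen=True)
small_flat = Dwelling(rooms=2, all_year_habitable=True, has_heating=True,
                      has_water_supply=True, has_kitchen=True)
```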

Choosing between family dwellings

If several dwellings meet the criteria, the dwelling with the most residence indicators is selected.

Residence indicators include:

• Official registration of residence in the dwelling;

• Ownership of the dwelling;

• Various payments associated with the dwelling.

For each dwelling associated with a family, the total value of residence indicators is calculated for all household members and the family is deemed to be residing in the dwelling with the highest total value of residence indicators. If this value is equal for several dwellings, the determination is based on the date of the latest event.
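The selection rule can be sketched in Python as follows. Each record carries the value of one residence indicator of one household member for one dwelling, together with the date of the underlying event; this record layout, the function name and the example values are hypothetical.

```python
from collections import defaultdict
from datetime import date


def choose_dwelling(indicators):
    """Select the dwelling with the highest total value of residence indicators.

    indicators is a list of (dwelling_id, value, event_date) records covering
    all household members. Ties are resolved by the date of the latest event.
    """
    totals = defaultdict(float)
    latest = {}
    for dwelling_id, value, event_date in indicators:
        totals[dwelling_id] += value
        if dwelling_id not in latest or event_date > latest[dwelling_id]:
            latest[dwelling_id] = event_date
    # Compare by total value first, then by the most recent event date
    return max(totals, key=lambda g: (totals[g], latest[g]))


# Hypothetical records: both dwellings reach a total of 2.0,
# but the latest event is linked to dwelling "g2".
indicators = [
    ("g1", 1.0, date(2015, 1, 10)),
    ("g1", 1.0, date(2016, 6, 1)),
    ("g2", 2.0, date(2017, 2, 20)),
]
selected = choose_dwelling(indicators)
```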

Establishing household composition

If other persons are living in the dwelling assigned to a family, these persons are counted as members of the associated household (outside the family nucleus).

Identifying dwellings of remaining families

If a family has no dwellings that meet the criteria and have relevant residence indicators, the family is placed in a suitable vacant dwelling based on geographic proximity (to the municipality of residence, workplace, or children’s school).

For all other persons, the dwelling data remain unchanged.

It is necessary to carry out a verification survey of households and dwellings to prepare for the census by performing the following tasks:

• Compare three survey methodologies in terms of coverage/loss and use of resources;

• Check the decisions made with regard to residency, incl. special attention to commuters and transnational residents;

• Check the relation of actual and registered dwellings of persons/households;

• Ascertain the composition of households (based on two different definitions), check partnership;

• Ascertain the share of tenants (for different types of tenancy);

• Ascertain the presence of secondary dwelling and the activity of its use.

1.2.2 Development of partnership index

Motivation The first pilot census on January 1st, 2016 (PC2016) confirmed insufficient accuracy of place of residence data in Population Register (PR). Comparing the structure of households formed in PC2016 with Population and Housing Census 2011 (PHC2011) revealed major differences. The number of lone parents in pilot census was 67% higher compared with PHC2011; the number of registered or cohabiting partners was 26% lower.

In the register-based census, household-dwelling concept is used to define a household. In principle, the household is formed of people who share the place of residence. Hence, the quality of household statistics relies on place of residence data. The results from pilot census hint that false registering breaks families apart.


Tiit et al. 2018 argue that disparities in household structure may be partially caused by different definition of household (household-dwelling vs. housekeeping concept) and changes in time, but the main cause is inaccurate place of residence data in PR [1].

Estonian Labour Force Survey 2015 contained a section on registering place of residence. 12% of 15-74-year-olds did not have their actual place of residence in the PR. The majority of them explained that they did not consider registering necessary, perceived their actual residence as temporary, or there were certain local benefits or services involved [2].

Our goal is to reunite families that are broken due to false registering. The first goal is to find the partners of people who appear as lone parents in registers. Instead of relying solely on place of residence, our idea is to add data from other administrative sources to estimate the probability of partnership [3]. The concept is similar to the residency index, where the probability of residence is modelled as a function of signs of life, i.e. indicators of activity in various registers, such as working in Estonia, buying a car or getting a drug prescription. Here we are interested in signs of partnership (SOPs): indicators of the presence or absence of partnership. Marriage, co-ownership of property and sharing a car are all examples of SOPs. We try to combine the SOPs in the most informative way to predict partnership status.

Signs of partnership In administrative sources, there are various pieces of information that show some connection between people. They vary in strength (married couples are probably partners, whereas co-owning a property merely hints that people may be partners) and in direction (positive SOPs, such as marriage, increase the probability of partnership, whereas negative SOPs, such as divorce, make it less likely). We assume that actual partners share more (positive) SOPs than non-partners.

We define quasi-couples as two persons who share at least one SOP. Note that one person may belong to several quasi-couples. Table 1 illustrates the structure of the quasi-couples’ data on SOPs. Each record represents a quasi-couple and their SOPs. Albert is married to Betty; they also have a child and co-own a property. Betty also appears in another quasi-couple with Charles. They are divorced, but have two mutual children. Charles and Doris have taken a housing loan jointly; they also co-own a piece of real estate.


Table 1. An example of quasi-couples' data

Male      Female   Married   Number of mutual children   Housing loan   Real estate   Divorce
Albert    Betty    Yes       1                           No             Yes           No
Charles   Betty    No        2                           No             No            Yes
Charles   Doris    No        0                           Yes            Yes           No

In the current version of the partnership index:

(i) same-sex couples are not considered;

(ii) both quasi-partners are at least 18 years old;

(iii) quasi-couples formed by close relatives are excluded; they are identified using parental (kinship) data from PR and PHC2011.

Table 2 includes SOPs that have been collected and used for analysis by March 2018, along with their sources and prevalence. It also includes data on partnership status from surveys: Estonian Labour Force Survey and Estonian Social Survey. Knowing the true partnership status of quasi-couples allows us to find optimal parameters for the partnership model and estimate its accuracy.

There were 536,127 quasi-couples altogether; the partnership status was known for 19,243 of them.

We extracted 200,382 married couples from the PR. In some cases, the records were contradictory, for example, two women ‘shared’ a husband. These conflicting cases formed a separate SOP, ‘half-marriage’. About 1% of all marriages were considered ‘half-marriages’. In the PR, there were 86,999 divorced quasi-couples; an additional 955 quasi-couples were ‘half-divorced’, i.e. they had conflicting records on divorce.

Place of residence in PR, as imperfect as it is, still serves as an important SOP. Over half of the quasi-couples shared a place of residence, making it the most prevalent SOP with 275,092 quasi-couples. The Land Register allows us to find co-owners of real estate (90,308 quasi-couples). We also formed 256,085 quasi-couples where one quasi-partner’s place of residence was in the other quasi-partner’s property.

When declaring income, spouses could submit a joint tax return in 2016 (78,784 quasi-couples used this option). Due to changes in the Income Tax Act, this option has now been discontinued [4]. The income declaration also provides data on joint housing loans (44,456 quasi-couples), because there are tax benefits on housing loan interest.

In Table 2, children is a binary yes/no variable that indicates having at least one mutual underage child. Stillbirths from 2012–2015 are also counted as children. Almost a third of the quasi-couples (166,967) had a mutual child.

2,101 quasi-couples consist of people who received subsistence benefit within the same household in 2015. A further 2,096 quasi-couples are tied by maintenance payments for children.

Table 2. Signs of partnership: data sources and prevalence among all quasi-couples and quasi-couples with lone parents.

                                                                            All quasi-couples   Quasi-couples with lone parents
Sign of partnership           Data source                                           N       %            N       %
Marriage                      Population Register                             200,382    37.4       22,143    23.0
Half-marriage                 Population Register                               1,908     0.4           99     0.1
Declaration of income         Register of Taxable Persons                      78,784    14.7       11,035    11.5
Housing loan                  Register of Taxable Persons                      44,456     8.3       10,135    10.5
Real estate                   Land Register                                    90,308    16.8       15,151    15.7
Place of residence            Population Register                             275,092    51.3        9,847    10.2
Place of residence in other   Population Register, Land Register              256,085    47.8       23,679    24.6
quasi-partner’s property
Subsistence benefit           Social Services and Benefits Registry 2015        2,101     0.4          359     0.4
Children, incl. stillbirths   Estonian Medical Birth Registry (2012–2015),    166,967    31.1       70,354    73.1
                              Population Register
Divorce                       Population Register                              86,999    16.2       15,378    16.0
Half-divorce                  Population Register                                 955     0.2           44     0.0
Maintenance                   e-File                                            2,096     0.4        1,292     1.3
Partners in survey data       Estonian Labor Force Survey 2015–2017,            8,668    45.0        1,242    36.6
                              Estonian Social Survey 2016

The distribution of SOPs is somewhat different among quasi-couples that include at least one ‘lone parent’ (in the sense of PC2016). Unsurprisingly, they tend to have children and rarely share a place of residence. Compared to all quasi-couples, there are fewer marriages and more maintenance payments, but the proportion of divorces does not differ.

Figure 1 depicts the prevalence of SOPs among survey partners and non-partners. The axes are on a logarithmic scale (which characterizes the order of magnitude). The diagonal line y = x separates positive SOPs (more common among partners, blue) from negative SOPs (more common among non-partners, red).


Figure 1. Signs of partnership among partners and non-partners. Horizontal and vertical axes are on logarithmic scale. Diagonal line marks equal proportions.

The ability of an individual SOP to distinguish partners from non-partners is weakest on the diagonal and strengthens as the distance from the diagonal grows. In fact, the length of the dashed line corresponds to the logarithm of the ratio of the two proportions, the same quantity that is used to construct weights for signs of life (SOLs) in the residency index. The strongest SOPs are declaration of income, housing loan, marriage, maintenance and divorce.
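As a rough illustration of the quantity described above, the sketch below computes the frequency ratio of one SOP among partners and non-partners, and its logarithm. The function name `sop_weights` and all counts are invented for illustration; they are not the survey figures, and zero proportions are deliberately not handled.

```python
import math


def sop_weights(n_sop_partners, n_partners, n_sop_nonpartners, n_nonpartners):
    """Return (frequency ratio, log of the ratio) for a single sign of partnership.

    The ratio compares the share of partners who have the SOP with the
    share of non-partners who have it. A ratio above 1 (log above 0) marks
    a positive SOP, a ratio below 1 (log below 0) a negative one.
    """
    share_partners = n_sop_partners / n_partners
    share_nonpartners = n_sop_nonpartners / n_nonpartners
    ratio = share_partners / share_nonpartners
    return ratio, math.log(ratio)


# Invented counts: a SOP seen in 600 of 1000 partners and 60 of 1000 non-partners.
positive_ratio, positive_log = sop_weights(600, 1000, 60, 1000)

# Invented counts: a SOP seen in 20 of 1000 partners and 200 of 1000 non-partners.
negative_ratio, negative_log = sop_weights(20, 1000, 200, 1000)

print(round(positive_log, 3), round(negative_log, 3))
```

On a log scale the two invented SOPs are equally far from the diagonal but on opposite sides, which is the symmetry the figure relies on.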

Models The article that covers our first partnership models in detail is due to be published at the end of March 2018 [1]. Here, we provide a short overview of the models described in the article.

The list of covariates includes the SOPs and

- the age difference of quasi-partners,


- the length of marriage,

- the age of the youngest mutual child,

- the variables that account for number of co-owners (if a person has common property with multiple people, the impact of SOP is lower).

The partnership status is modelled with a) logistic regression (forward stepwise selection), b) linear discriminant analysis (forward stepwise selection), c) weighted sum with frequency ratios as weights, d) weighted sum with logarithms of frequency ratios as weights.

In principle, all of the four indices are weighted sums. With the logistic regression and linear discriminant analysis, the model coefficients are used as weights. The threshold for classifying is selected to achieve same proportion of partners as in survey data.
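The weighted-sum form and the threshold rule can be sketched as follows. All weights, the fourth quasi-couple and the partner share of 0.5 are hypothetical values chosen for illustration; the function names are likewise assumptions, and the fitted coefficients of the actual models are not reproduced here.

```python
def partnership_index(sops, weights):
    """Weighted sum over the signs of partnership present for one quasi-couple."""
    return sum(weights[name] for name, present in sops.items() if present)


def threshold_for_share(scores, partner_share):
    """Cut-off such that roughly `partner_share` of the scores lie at or above it."""
    ranked = sorted(scores, reverse=True)
    k = round(partner_share * len(ranked))
    if k <= 0:
        return float("inf")  # nobody is classified as a partner
    return ranked[min(k, len(ranked)) - 1]


# Hypothetical weights (positive SOPs above zero, negative SOPs below zero).
WEIGHTS = {"married": 2.0, "children": 0.5, "housing_loan": 1.5,
           "real_estate": 0.75, "divorce": -2.0}

# First three rows follow the example table of quasi-couples; the fourth is invented.
quasi_couples = {
    ("Albert", "Betty"): {"married": True, "children": True, "housing_loan": False,
                          "real_estate": True, "divorce": False},
    ("Charles", "Betty"): {"married": False, "children": True, "housing_loan": False,
                           "real_estate": False, "divorce": True},
    ("Charles", "Doris"): {"married": False, "children": False, "housing_loan": True,
                           "real_estate": True, "divorce": False},
    ("Edgar", "Fiona"): {"married": False, "children": False, "housing_loan": False,
                         "real_estate": True, "divorce": False},
}

scores = {pair: partnership_index(sops, WEIGHTS) for pair, sops in quasi_couples.items()}
cutoff = threshold_for_share(list(scores.values()), partner_share=0.5)
partners = sorted(pair for pair, score in scores.items() if score >= cutoff)
print(cutoff, partners)
```

With tied scores at the cut-off, slightly more than the target share would be classified as partners; how such ties are resolved in practice is not stated in the text.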

The models include 16–20 explanatory variables. Their performance is similar: 84–86% of decisions are correct, 4–7% of quasi-couples are non-partners misidentified as partners and 9–10% are actually partners but are not discovered by the model.

The combination of four indices is used to predict partnership status for all quasi-couples. About 40,000 lone parents are in the quasi-couples that are classified as partners. In the PC2016, the number of lone parents is overestimated by about 50,000; the index-based approach reduces this gap significantly.

References:

[1] E.-M. Tiit, H. Visk, and V. Levenko, ‘Partnership index’, Quarterly Bulletin of Statistics Estonia, vol. 1, 2018. Accepted.

[2] H. Äär, ‘Coincidence of Actual Place of Residence with Population Register Records’, Quarterly Bulletin of Statistics Estonia, vol. 1, pp. 80–83, 2017.

[3] E.-M. Tiit and M. Vähi, ‘Indexes in demographic statistics: a methodology using nonstandard information for solving critical problems’, Pap. Anthropol., vol. 26, no. 1, p. 72, Jul. 2017.


[4] ‘Income Tax Act – Riigi Teataja’. [Online]. Available: https://www.riigiteataja.ee/en/eli/531012018001/consolide. [Accessed: 15-Feb-2018].

1.3 Deriving new variables from administrative variables and ensuring their compliance with census definitions and comparability

By Kristi Lehto, Pille Kool, Helle Visk

Revision and improvement of residency index calculations Today, all developed countries struggle with accurate determination or estimation of the population figure. People have become very mobile and are difficult to get hold of and enumerate (Tiit, Maasing 2016).

Between censuses, population size is estimated according to the methodology prepared for the Estonian register-based population and housing census (REGREL), using the concept of register activity of a person, which is called a sign of life¹. A sign of life is some kind of activity in a register in relation to a person at least once a year. Only the registers which reflect residence in Estonia are considered (Maasing 2015a,b).

Population statistics required a methodology which would facilitate annual estimations of the population figure and composition with sufficient accuracy, while also providing a means for calculating external migration. In doing so, it is reasonable to take into account a person’s status in the previous years. This problem was solved by defining residency index² as an indicator of a particular person’s likelihood of being a resident³ in a given year (Tiit 2015b).

The index is calculated annually for all persons who were included in the enlarged population⁴ and were alive at the start of the year; the calculation is based on all available administrative registers (and their independent sub-registers), and on identification of all signs of life established in the preceding year for all persons. Since 2016, Statistics Estonia has used the residency index methodology to estimate the population size, and calculates the index using signs of life with logarithmic weights (Tiit, Maasing 2016). The methodology of the residency index is amended on a continuous basis, because additional registers are created and amendments to acts may change the data composition of registers or the content or meaning of the data. The flexibility of the methodology makes it possible to add signs of life and take these changes into account.

¹ In the statistical sense, a sign of life is a binary characteristic which depends on three explanatory variables (person, register and year), with value 0 if the observed person was not active in the particular register in the reference year and with value 1 if he/she was active at least once in this register in the reference year.

² The residency index value varies between 0 and 1. The higher the index value, the higher the probability that a person is a resident of Estonia. If the residency index value is 1, the person is considered a resident. If the index value is somewhere in between, the decision is made based on threshold c: those persons whose residency index equals or exceeds the threshold are considered residents and those with a lower index value are considered non-residents.

³ Resident of Estonia – permanent resident of Estonia (Maasing 2015a).

⁴ Enlarged population – persons in the population of the latest population census in 2011 (PHC 2011) or in the population register (in 2012–2016). Each year, the enlarged population is supplemented based on the information received from the population register. The place of residence of persons in the enlarged population may be in Estonia or abroad or may be missing altogether, or these persons may have been placed in the so-called passive section of the population register. Thus, indices are calculated for 1.5 million persons. This also allows returnees to be determined as residents, if they have Estonian personal identification codes.

Many changes and revisions were made in the residency index calculations as at 01.01.2017 compared to calculations as at 01.01.2016:

1. All Statistics Estonia’s data on births (RR⁵ X-Road, RR webcom, TAI⁶) were included in the enlarged population.

2. As information about deaths, changes received from RR X-Road were used. Starting from this year, a data file on deaths received from the TAI register is also used.

3. Starting from this year, the decision about the residency of children under the age of 1 (i.e. children born in 2016) is made based on the residency of the child’s mother, as follows:

- if the mother is a resident, the child is a resident (earlier, the decision was made based on the place of residence of the child in RR);

- special case 1: if the child’s state of residence in RR is Estonia and the mother’s immigration is pending⁷, both the child and the mother are considered residents;

- special case 2: if the child’s place of residence in RR is missing or abroad and the mother’s immigration is pending, the child’s index value will be 0.5; both the child and the mother are considered non-residents in that particular year, but, provided there are signs of life, residents in the following year.

4. Revision of the sign of life pensions (register SKAIS⁸). The dataset of pensions includes information about the state of residence of the person receiving pension payments and the symbol of the state in that person’s bank account. This sign of life was not ascribed to persons for whom neither the state of residence nor the symbol of the state in the bank account is Estonia (i.e. it is ascribed only if the state of residence and/or the symbol of state of the person receiving pension payments is Estonia).

5. Two new signs of life and introduction of a new register:

1) The sign of life disability and incapacity for work (register PKR⁹) is ascribed to those with valid disability and/or incapacity for work (incl. those to whom pension for incapacity for work has not been awarded) on at least one day in 2016.

2) The sign of life persons with reduced capacity for work (register TETRIS¹⁰) is ascribed to those with valid reduced capacity for work and/or who filed an application for the assessment of work ability during 2016. This register was not used in the previous years, as it was introduced in Estonia on 01.07.2016¹¹.

⁵ RR – Rahvastikuregister (Population Register)

⁶ TAI – Tervise Arengu Instituut (National Institute for Health Development)

⁷ “Immigration pending” – in spring 2016, it was decided that persons who are immigrants according to the index calculations are put on hold for a year: if, according to the residency decision, a person was not a resident in the previous year but is a resident in the current year (i.e. R(2015)=0 and R(2016)=1) and the place of residence in both years is not Estonia, the person is put on hold. In the following year, these persons are assessed again: if R(2017)=1, immigration takes place in 2016, and if R(2017)=0, they remain non-residents.

⁸ SKAIS – Social Security Information System, from which data on pensioners are received.
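The threshold decision and the "immigration pending" rule described in the footnotes can be sketched as below. This is a minimal reading of those rules, not the production algorithm: the function names, the returned labels and the threshold value 0.7 are assumptions, since the text does not state the value of c.

```python
def resident_by_index(index_value, threshold):
    """Residency decision: resident if the index equals or exceeds threshold c."""
    return index_value >= threshold


def immigration_status(r_prev, r_cur, in_estonia_prev, in_estonia_cur):
    """Classify one person for the current year.

    r_prev and r_cur are the 0/1 residency decisions R(previous year) and
    R(current year); the two flags say whether the registered place of
    residence was in Estonia in each of those years.
    """
    if r_prev == 0 and r_cur == 1 and not in_estonia_prev and not in_estonia_cur:
        return "pending"  # put on hold for a year
    return "resident" if r_cur == 1 else "non-resident"


def resolve_pending(r_next):
    """Re-assess a pending person one year later using R(next year)."""
    return "immigrated" if r_next == 1 else "non-resident"


C = 0.7  # hypothetical threshold, for illustration only

r_2016 = int(resident_by_index(0.82, C))
status = immigration_status(r_prev=0, r_cur=r_2016,
                            in_estonia_prev=False, in_estonia_cur=False)
print(status, resolve_pending(r_next=1))
```

Keeping the pending check separate from the index comparison mirrors the text, where the hold is applied after the residency decision rather than inside the index itself.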

In the calculations of the residency index for the following year (i.e. as at 01.01.2018), the following changes are planned to be applied:

I. Revision of the signs of life teachers, students (register EHIS¹²). Until now, data files received from the registry holder as at 31 December were used. From now on, different existing files will be used, as the information in different data files (some data are given as at 31 December, others as at 10 November) coincides only partially. Another reason is that in the future we also wish to “find” those students/teachers who give up their studies/work in the spring semester, and who no longer appear in the records as at 1 October.

II. New sign of life hobby education (register EHIS), for which data are received from the registry holder once a year as at 10 November and 31 December of the previous year. This sign of life is ascribed to a person who has participated in at least one activity related to hobby education (this way, children and pensioners are included, who might otherwise get few signs of life).

⁹ PKR – SKAIS subsystem

¹⁰ TETRIS – Database of work ability (full and partial) assessment and work ability allowance

¹¹ Information from the administration system for the state information system RIHA

¹² EHIS – Eesti Hariduse Infosüsteem (Estonian Education Information System)

III. New sign of life participants in ESF¹³ activities. Only people staying in Estonia can participate in the ESF programme “Promoting adult learning and developing learning opportunities”. Two characteristics are used to find the sign of life: the date of entering the first activity and the date of exiting the last activity, which may lie in the future or be blank if the activity has not been exited. The sign of life is ascribed to those persons who have participated in ESF activities for at least one day.

IV. New sign of life change of residence in Estonia, which is found by comparing the data in the population register as at 1 January in the current year and those in the previous year. This sign of life is ascribed only to those persons whose state of residence in both years was Estonia. The Population Register includes 8 characteristics based on which the so-called complete address of place of residence is formed. We are interested only in addresses at dwelling level (i.e. at least the name of farm, house number and/or apartment number are known). In addition, the characteristic EL_ADS_OID (i.e. address object code of place of residence) in the population register is used. If the complete address of a place of residence has changed (the place of residence address must be given at dwelling level in the current year) or the value of the characteristic EL_ADS_OID is different, the person is ascribed this sign of life. A special case is when the complete address of a person’s place of residence has changed but the ADS_OID has remained the same; then the person is not given a sign of life (this includes cases where the name of a farm or street has been changed but the person lives in the same dwelling). Persons whose complete address of place of residence has changed and is no longer at dwelling level in the current year (i.e. the address of place of residence is known, for example, only at city level) are not ascribed this sign of life (e.g. a person’s registration of a place of residence has been terminated following a court judgement and he/she has been given an address at municipality level).

V. New sign of life TSD (register MKR¹⁴). From the register, payments made to resident¹⁵ natural persons in the declaration of income and social tax, unemployment insurance premiums and contributions to mandatory funded pension (TSD Annex 1A) in the year of interest are used. The sign of life is ascribed to persons who are present in the data at least once a year.

¹³ ESF – The European Social Fund

¹⁴ MKR – maksukohustuslaste register (Register of Taxable Persons)

¹⁵ The term “resident” refers to the Estonian Tax and Customs Board definition.

VI. Persons in detention facilities (register KIR¹⁶). All imprisoned persons who have been in detention facilities for at least a year are considered residents of Estonia (so-called certain event¹⁷).

VII. Supplementing the enlarged population. There are entries in registers where signs of life are found, which are not included in the enlarged population. These entries and their signs of life are added to the enlarged population.
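The rule for the sign of life change of residence in Estonia (item IV) combines several conditions, so a small sketch may help. It encodes one possible reading: the address object code decides whenever it is known in both years, and the complete address is compared only as a fallback. The `Residence` record, its field names, the country code "EE" and the sample addresses are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Residence:
    country: str             # state of residence, "EE" used here for Estonia
    full_address: str        # the so-called complete address
    ads_oid: Optional[str]   # address object code (EL_ADS_OID), may be missing
    dwelling_level: bool     # True if the address is known at dwelling level


def change_of_residence_sign(prev, cur):
    """Return 1 if the person gets the sign of life, otherwise 0."""
    # Only persons whose state of residence was Estonia in both years qualify.
    if prev.country != "EE" or cur.country != "EE":
        return 0
    # The current address must be known at dwelling level.
    if not cur.dwelling_level:
        return 0
    # Same address object code means same dwelling, even if the street
    # or farm was renamed and the complete address text differs.
    if prev.ads_oid and cur.ads_oid:
        return int(prev.ads_oid != cur.ads_oid)
    # Assumed fallback when a code is missing: compare the complete address.
    return int(prev.full_address != cur.full_address)


moved = change_of_residence_sign(
    Residence("EE", "Pikk 1-2, Tallinn", "OID-1", True),
    Residence("EE", "Lai 5-1, Tartu", "OID-2", True),
)
renamed_street = change_of_residence_sign(
    Residence("EE", "Vana 3, Tallinn", "OID-3", True),
    Residence("EE", "Uus 3, Tallinn", "OID-3", True),
)
print(moved, renamed_street)
```

Whether a differing code should also count when the current address is not at dwelling level is left open by the text; the sketch takes the stricter reading and returns 0 in that case.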

References:

Maasing, E. (2015a). Permanent residency status determination in register-based census. Master’s thesis. University of Tartu, Faculty of Mathematics and Computer Sciences, Institute of mathematical statistics. [www] http://dspace.utlib.ee/dspace;/handle/10062/47557

Maasing, E. (2015b). First results in determining permanent residency status in register-based census. [www] https://wiki.helsinki.fi/display/banocoss2015/Presentations?preview=/149296295/170626623/Maasing_Abstract.pdf (17.06.2016).

Tiit, E.-M. (2015b). Residence testing using registers – conceptual and methodological problems. Presentation at 4th Baltic-Nordic Conference on Survey Statistics. [www] https://wiki.helsinki.fi/display/banocoss2015/Presentations?preview=/149296295/17062664Ti it_Abstract.pdf (17.06.2016).

Tiit, E.-M., Maasing, E. (2016). Residency index and its applications in censuses and population statistics. Quarterly Bulletin of Statistics Estonia 3/16, pp 53–60

1.4 Improving the use of administrative data sources

Implementation area: regular population statistics, census, housing statistics

This activity covered the business processes of data collection and production in official statistics that could use administrative data sources, and the processes of transforming administrative data into data fit for producing official statistics.

¹⁶ KIR – prisoners’ register

¹⁷ Certain event – event for which value R(17)=1 is fixed before the calculation of the residency index.

1.4.1 Developing cooperation agreements with the owners of administrative data sources

By Maret Priima

Statistics Estonia (SE) has still not renewed its cooperation agreement with the Tax and Customs Board, for several reasons. Firstly, we are constantly finding new datasets to research from the Tax and Customs Board, which makes the process very long, as we have not decided on a point in time by which the agreement must be signed; it is in constant development. Secondly, because the Employment Register will start to collect occupation and local place of work, the data list sent by them will change; the automatic data transmission (X-Road) is also in development. The process should end in the summer, so this can also be included in the cooperation draft.

The Border Guard Board prefers that Statistics Estonia sends official requests every year to get the necessary datasets. We are still in discussions about the datasets that the Estonian Rescue Board can give us; the detailed data list is mostly agreed upon. The delay arose because they developed a new information system and were previously not connected to the Address Data System; without that connection, processing their data would be too time-consuming for us. Their information systems are now in place and we are in the process of getting the first data from them.

The problems with the cooperation agreements are mostly due to development of information systems or data transmission system. They take a long time and usually are delayed, which makes the planning of these activities difficult.

We achieved the objectives set within the framework of census grant, resulting in increased number of registers (24) used in census, and established partnership and placement indices that facilitate determining household composition based on register data.

The next steps are to test the indices in surveys and to carry out a register-based pilot census in 2019.


1.4.2 Register of granting international protection (RAKS)

By Pille Kool

The state register of granting international protection includes the personal data of persons who have submitted an application for a residence permit and received a residence permit on the basis of the Act on Granting International Protection to Aliens. The register processes data related to asylum proceedings or temporary protection proceedings conducted on the basis of the said Act.

For evaluating residency for census (establishing the population of permanent residents) it is very important to obtain information on persons who hold a (extended) temporary residence permit or a long-term resident’s residence permit or whose permanent right of residence has commenced or been restored during the period 01.01.2016–31.12.2016.

The 2016 data on persons who were granted a residence permit on the basis of the Act on Granting International Protection to Aliens were received by Statistics Estonia from the register of residence and work permits in two files on 13.02.2017, according to the deadline set in the contract entered into with Statistics Estonia (therefore, the data were received on time, quality 100%).

The first dataset (RAKS_KEHTIVAKS2016) The dataset includes information on persons who were granted a residence permit on the basis of the Act on Granting International Protection to Aliens or whose permit was restored during the period 01.01.2016–31.12.2016.

The dataset includes 248 records. For the residency index, the necessary indicator was the beginning date of the document validity. The dates of all the records were during the time period 01.01.2016–31.12.2016 (data quality 100%). In conclusion, the received dataset was 100% usable for analysing residency index.

The second dataset (RAKS_KEHTETUKS2016) The dataset includes information on persons whose residence permit granted on the basis of the Act on Granting International Protection to Aliens became invalid during the period 01.01.2016–31.12.2016.


The dataset includes 175 records. It contains the date on which the document became invalid; all of these dates fall within the period 01.01.2016–31.12.2016, but this indicator is not directly needed for the residency index. The indicator necessary for the residency index is the beginning date of the document validity. These dates fall within the period 1991–2015, with 1 record in 2016. Therefore, only 1 record was relevant for the analysis, and that record was also included in the first file (i.e. RAKS_KEHTIVAKS2016). The dataset did not include any record of document validity being restored in 2016. Therefore, this data file is not required for residency index calculations.

1.4.3 Building identifiers for linking administrative data

By Krista Türk

Registers are required to keep the Population Register ID codes of people who are registered there. If a register can hold information about people without a PR ID code (for example, foreigners), or about events that can occur to them, it usually keeps some other code instead of the ID code. This can be, for example, a passport number, a code constructed from the person’s birthday, or an entirely made-up code whose nature can be difficult to determine.

We have established pseudonymization rules for replacing names and ID codes with statistical codes that hold no information about that person and can only be used to link different data sources in our database. Future activities are related to automation of pseudonymization process where possible.

Data from registers is first pseudonymized and then can be used for statistical work. Pseudo codes are given based on ID code or based on combination of name, sex and birthday. Also we try to determine whether the ID code that is presented is in correct form or not, based on known rules for Estonian ID code.

If an ID code is available as pseudonymization input, the pseudo code is of good quality. Difficulties in pseudonymization can occur for different reasons. First, if ID codes are missing or of poor quality, it is important to have the person’s name, sex and birthday in order to assign a pseudo code. If any of these are missing or of poor quality, linking different sources can be difficult if not impossible.
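The two steps described above, checking the form of an ID code and choosing the basis for a pseudo code, might be sketched as follows. The checksum is the publicly documented two-stage modulo 11 rule for Estonian personal identification codes; whether it matches the exact rules applied at Statistics Estonia is an assumption. The keyed hash (HMAC-SHA256) is a swapped-in technique used only for illustration, not the pseudonymization rule of the report, and the key, the function names and the sample values are hypothetical.

```python
import hashlib
import hmac

WEIGHTS_1 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 1)
WEIGHTS_2 = (3, 4, 5, 6, 7, 8, 9, 1, 2, 3)


def is_valid_estonian_id(code):
    """Check length, digits and the modulo 11 check digit (not the birth date)."""
    if len(code) != 11 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    for weights in (WEIGHTS_1, WEIGHTS_2):
        remainder = sum(d * w for d, w in zip(digits[:10], weights)) % 11
        if remainder < 10:
            return remainder == digits[10]
    # Both stages gave remainder 10: the check digit must be 0.
    return digits[10] == 0


def pseudo_code(secret_key, id_code=None, name=None, sex=None, birthday=None):
    """Derive a pseudo code from the ID code, else from name, sex and birthday."""
    if id_code and is_valid_estonian_id(id_code):
        basis = "ID|" + id_code
    elif name and sex and birthday:
        basis = "NSB|" + "|".join([name.strip().upper(), sex, birthday])
    else:
        return None  # nothing reliable to link on
    digest = hmac.new(secret_key, basis.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


KEY = b"illustrative-secret"  # hypothetical key
sample_id = "37605030299"     # digit string used purely as a checksum example
code_from_id = pseudo_code(KEY, id_code=sample_id)
code_from_name = pseudo_code(KEY, name="Mari Maasikas", sex="F", birthday="1976-05-03")
print(is_valid_estonian_id(sample_id), code_from_id != code_from_name)
```

The example also shows the problem discussed in this section: the same person reaches two different pseudo codes depending on which input a register supplies, so the two codes have to be reconciled afterwards.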


It is problematic if the same person gets different pseudo codes because of different input from registers. In some cases, if a person with a given name, sex and birthday occurs only once in PR, there is a good chance that it is the same person. For such pseudo codes it is possible to generate a table containing both codes and use it to update the codes in all linkable data tables so that they are based on the same information.

We can also improve our own process by always requesting the full combination of ID code, name, sex and birthday from registers. Unfortunately, at the moment there are data requests to registers that return only the ID code. In these cases, if the ID code is empty or does not match the rules, we have no base information from which to assign a pseudo code, and no linking can be done with such records.

1.5 Compilation of activity status based on register data

By Kaja Sõstra and Maret Muusikus

In Estonia, a methodology is being prepared for conducting a register-based Population and Housing Census in 2021 (REGREL). Of all census characteristics, activity status requires the greatest number of databases. The Estonian Labour Force Survey (LFS) is the best reference source for checking the quality of register-based activity status. The following article, which provides an overview of the compilation of activity status and the results of the comparison, will be published (with some modifications) in the Quarterly Bulletin of Statistics Estonia No. 1/18.

Pursuant to the definition in the Census Regulation (Regulation (EU) 2017/543), current activity status refers to the current relationship of a person to economic activity, based on a reference period of one week, which may be either a specified, recent, fixed, calendar week, or the last complete calendar week, or the last seven days prior to enumeration.

Employed persons comprise all persons aged 15 years or over who during the reference week:

a) performed at least one hour of work for pay or profit, in cash or in kind, or

b) were temporarily absent from a job in which they had already worked and to which they maintained a formal attachment, or from a self-employment activity.

The unemployed comprise all persons aged 15 years or over who were:

a) “without work”, that is, were not in wage employment or self-employment during the reference week; and

b) “currently available for work”, that is, were available for wage employment or self-employment during the reference week and for two weeks after that; and

c) “seeking work”, that is, had taken specific steps to seek wage employment or self-employment within four weeks ending with the reference week.

Pursuant to the Regulation, current activity status has the following breakdowns:

1. Labour force/economically active

1.1. Employed

1.2. Unemployed

2. Outside of the labour force/economically inactive

2.1. Persons below the national minimum age for economic activity

2.2. Pension or capital income recipients

2.3. Students

2.4. Others

3. Not stated

Estonian permanent residents who couldn’t be classified as employed, unemployed, below the national minimum age for economic activity, pension recipients or students shall be categorised as ‘others’. Activity status ‘not stated’ will not be used in register-based census.

A person can fall into only one category of the activity status. In ascribing activity status, priority is given first to persons below 15 years of age. Priority is then given to the employed in preference to the unemployed, and to the unemployed in preference to the economically inactive. Among inactive persons, priority is given to pension recipients in preference to students, and to students in preference to others.

Methodology of compiling activity status

Activity status is a census characteristic that cannot be retrieved directly from a single database. A person’s activity status changes frequently over time, and no register holds up-to-date information on all activity status components. Therefore, the activity status algorithm uses many databases that allow deciding whether a person belongs to a certain activity status during the reference week. The databases used for forming register-based activity status are listed in Table 1, together with the activity status ascribed on the basis of each source.

Table 1. Data sources for register-based activity status

Database | Abbrev. | Activity status
Employment register (sub-register of Register of Taxable Persons) | TÖR | Employed
Tax declarations from Register of Taxable Persons: | MKR | Employed
. Business income of a resident natural person | FIDEK form E | Employed
. Income derived in a foreign state indicated in the income tax return for a resident natural person (separate sheet 8.1) | FIDEK form A foreign income | Employed
. Payments made to resident natural persons in the declaration of income and social tax, unemployment insurance premiums and contributions to mandatory funded pension (a) | TSD Annex 1 A | Employed
. Payments made to non-resident natural persons in the declaration of income and social tax, unemployment insurance premiums and contributions to mandatory funded pension (a) | TSD Annex 2 A | Employed
. Disclosure of recipients of dividends and other equity payments | INF1 | Employed
The Register of Persons Registered as Unemployed and Job-Seekers, and of Provision of Labour Market Services | EMPIS | Unemployed
Social Services and Benefits Registry | STAR | Unemployed
Social Security Information System | SKAIS | Pensioner
Health Insurance Database | KIRST | Pensioner
Mandatory Funded Pension Register | KOPIS | Pensioner
Estonian Education Information System | EHIS | Student
State Register of State and Local Government Agencies (incorporated in the Commercial Register since 11.01.17) | RKOARR | Supporting source for combining TÖR and TSD data

(a) The terms resident and non-resident refer to Tax and Customs Board definitions.


The activity status algorithm has a simple overall structure. The reference week is the last full working week before the census moment on 31 December. This article is based on register data of 2016 and the reference week 12.12–18.12.2016. First, separate lists are prepared for the employed, unemployed, pensioners and students. These lists partially overlap, because a person can be classified under more than one activity status (e.g. working student or working pensioner). Furthermore, the lists contain persons who are not permanent residents of Estonia. Census variables are published for permanent residents only; therefore, the activity status lists are linked with the list of permanent residents compiled using the residency index methodology, which has been used in Statistics Estonia since 2016 (Tiit, Maasing 2016; Maasing, Tiit, Vähi 2017). Ultimately, each person is ascribed one activity status. This is done by following the order of priority of activity statuses established in the Census Regulation, presented in Figure 1 along with the relevant data sources. However, for persons who meet certain criteria, the order of preference is altered to harmonise the methodologies of REGREL and LFS and ensure their comparability.

Figure 1. Order of priority of activity statuses and data sources

The status ‘persons below 15 years of age’ is ascribed directly from the list of permanent residents based on age. Finding persons who fall into the status category ‘other’ from registers is complicated, because there are no databases on homemakers, discouraged persons or those taking care of family members. As Statistics Estonia compiles the list of permanent residents based on registers, all persons who were not ascribed a higher-priority activity status are categorised as ‘other’.
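The order of priority described above can be illustrated with a minimal sketch. It covers only the basic ordering (version 1, before any harmonisation with LFS); the function name, the person identifiers and the example data are hypothetical and are not taken from the production system of Statistics Estonia.

```python
# Order of priority among the status lists for persons aged 15 or over,
# following the Census Regulation ordering described in the text.
PRIORITY = ("employed", "unemployed", "pensioner", "student")


def assign_status(age, lists_with_person):
    """Return exactly one activity status for a permanent resident.

    age: age in full years
    lists_with_person: set of status lists on which the person appears
    """
    if age < 15:
        return "below 15"          # takes priority over every other status
    for status in PRIORITY:
        if status in lists_with_person:
            return status
    return "other"                 # residual category for residents on no list


# Hypothetical permanent residents: person id -> (age, lists the person is on).
# Persons who are on a status list but not in this dictionary are dropped,
# which mirrors linking the lists with the list of permanent residents.
residents = {
    "A": (12, {"student"}),
    "B": (22, {"employed", "student"}),
    "C": (67, {"pensioner"}),
    "D": (40, set()),
}
statuses = {pid: assign_status(age, lists) for pid, (age, lists) in residents.items()}
```

In this sketch the working student "B" becomes ‘employed’, and resident "D", who appears on no list, falls into ‘other’.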

The list of persons with activity status ‘student’ is compiled on the basis of EHIS data as at the end of the year. Information is available about students in all stages of study (kindergarten, basic school, upper secondary school, and institutions of vocational and higher education) registered in EHIS as at 31 December. Among students in higher education, those on academic leave are separated. Most of them have a registered employment in TÖR and are ascribed as employed.

99.7% of the persons on the list of pension recipients originate from SKAIS pension data. The data retrieved from SKAIS covered all persons who were entitled to receive a pension on at least one day of the reference week. The data from KIRST included persons covered by pension-based health insurance during the reference week. The list was supplemented by 34 persons from KOPIS. The majority of persons receiving mandatory funded pension payments were already present in SKAIS data (Table 2).

The list of unemployed is compiled on the basis of two data sources, but majority (74%) comes from EMPIS (table 2). This involves both unemployed and job-seekers, who were registered on at least one day of the reference week. Registered unemployed, who account for 98% of the source data in EMPIS, are in the age range of 16-63 years. Job-seekers do not have upper age limit, but minimum age is 13 years. As activity status ‘persons below 15 years of age’ takes priority in terms of preference, it prevents a situation where very young persons are ascribed final activity status ‘unemployed’. STAR data were used to include persons, whose social status has, in the course of the procedure, been referred to as registered unemployed or unregistered unemployed. STAR data are not as reliable as the data from Unemployment Insurance Fund. Registered unemployed do not have a strict age limit as in EMPIS (e.g., a person aged 13 was indicated as registered unemployed in 27 cases). It is impossible to distinguish unemployment during the reference week in STAR. The information regarding social status is fixed as at the date of procedure. There can be several proceedings with different social status per person during a year. Therefore, information on the latest procedure during the reference year is used for the unemployed in STAR. When ascribing final activity status, the unemployed aged 63 and over are re-categorised as ‘pensioner’ or ‘other’.

Table 2. Data sources of pensioners and unemployed

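List | Data source | Number of persons | Percentage
List of pensioners | SKAIS | 409 591 | 96,8
List of pensioners | SKAIS + KOPIS | 11 652 | 2,8
List of pensioners | KIRST | 1 109 | 0,3
List of pensioners | SKAIS + KIRST | 538 | 0,1
List of pensioners | KOPIS | 34 | 0,0
List of pensioners | SKAIS + KIRST + KOPIS | 9 | 0,0
List of pensioners | KIRST + KOPIS | 8 | 0,0
List of pensioners | Total | 422 941 | 100,0
List of unemployed | EMPIS | 23 652 | 61,0
List of unemployed | STAR | 9 914 | 25,6
List of unemployed | EMPIS + STAR | 5 202 | 13,4
List of unemployed | Total | 38 768 | 100,0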
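The shares quoted in the text (99.7% of pensioners found in SKAIS, 74% of the unemployed found in EMPIS) can be recomputed from the counts in Table 2. The short sketch below does this; the counts are copied from the table, while the dictionary layout and the function name are illustrative assumptions.

```python
# Persons per combination of sources, copied from Table 2.
PENSIONERS = {
    ("SKAIS",): 409_591,
    ("SKAIS", "KOPIS"): 11_652,
    ("KIRST",): 1_109,
    ("SKAIS", "KIRST"): 538,
    ("KOPIS",): 34,
    ("SKAIS", "KIRST", "KOPIS"): 9,
    ("KIRST", "KOPIS"): 8,
}
UNEMPLOYED = {
    ("EMPIS",): 23_652,
    ("STAR",): 9_914,
    ("EMPIS", "STAR"): 5_202,
}


def share_involving(counts, source):
    """Percentage of persons found in `source`, alone or together with others."""
    total = sum(counts.values())
    hits = sum(n for combo, n in counts.items() if source in combo)
    return round(100 * hits / total, 1)


skais_share = share_involving(PENSIONERS, "SKAIS")   # share of pensioners in SKAIS
empis_share = share_involving(UNEMPLOYED, "EMPIS")   # share of unemployed in EMPIS
```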

One problematic issue with register-based activity status consists in unregistered unemployed. There are many unemployed persons who do not remain registered although they have not actually found a job, because that requires performing certain duties. Thus, potential unregistered unemployed are searched from EMPIS and STAR data for the last three years (the most recent entry of each person). A list is compiled of persons, who at some point before the reference week were registered as unemployed and who do not have a more recent employment entry in TÖR. Persons on this list, who would otherwise be ascribed as ‘other’ after ultimate compilation of activity statuses, are ascribed as previously registered unemployed.
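A minimal sketch of how such a list of potential unregistered unemployed could be built is given below. It assumes that, for each person, the date of the most recent unemployment entry (EMPIS or STAR) and the date of the most recent employment entry in TÖR are already available; the function name, the parameter names and the example dates are hypothetical.

```python
from datetime import date


def previously_registered_unemployed(unemployment_entries, employment_entries,
                                     window_start, ref_week_start):
    """Return ids of persons registered as unemployed before the reference week
    who have no more recent employment entry.

    unemployment_entries: person id -> date of most recent unemployment entry
    employment_entries:   person id -> date of most recent employment entry
    window_start:         start of the look-back period (three years in the text)
    ref_week_start:       first day of the reference week
    """
    result = set()
    for person, unemp_date in unemployment_entries.items():
        if not (window_start <= unemp_date < ref_week_start):
            continue                      # outside the look-back period
        emp_date = employment_entries.get(person)
        if emp_date is None or emp_date <= unemp_date:
            result.add(person)            # no more recent employment entry
    return result


# Hypothetical example data.
unemployment = {"A": date(2015, 3, 1), "B": date(2016, 6, 15), "C": date(2012, 1, 10)}
employment = {"A": date(2015, 9, 1), "B": date(2014, 2, 1)}
candidates = previously_registered_unemployed(
    unemployment, employment,
    window_start=date(2013, 12, 12), ref_week_start=date(2016, 12, 12),
)
```

As described in the text, only those persons on the resulting list who would otherwise end up as ‘other’ are ascribed as previously registered unemployed; that final step is not shown here.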

Persons on the list of employed are retrieved mainly from TÖR. This includes all employments (except cancellations) that were valid on at least one day of the reference week. The one-hour rule prescribed in the Census Regulation cannot be applied in the register-based census. Employments that were suspended for the entire reference week are separated. A single person may have as many as approximately 30 simultaneous employment registrations in TÖR; in most cases, this occurs when the employer is an apartment association. However, approximately 90% of persons have one employment registration at a time (Figure 2). Those with more than one employment registration are given just one main job for the purposes of the census, which is used for determining occupation, industry, status in employment (employees, employers, etc.) and location of place of work.


Figure 2. Number of jobs per person in TÖR during 12.12–18.12.2016

For determining a person’s main job, the Annexes of TSD (Declaration of Income and Social Tax) are used to link gross income and part-time employment rate, and INF1 to link dividend data. These datasets are not used for adding persons to the employed, because TSD data do not reflect the period of employment: TSD Annexes and INF1 record the payment month, whereas an employment in TÖR has a start and end date. Therefore, December and January payments are taken from TSD (the two-month average if a person received payments in both months) in order to find gross income for as many persons as possible.

Linking is complicated by the fact that the employer indicated in TÖR and the payer in TSD are not always the same. Linking is facilitated by the State Register of State and Local Government Agencies, which is used to link all divisions registered in TÖR with the corresponding local government or ministry. Local governments usually make the TSD payments to workers of their divisions, but they are not consistently indicated in TÖR as payers. Out of 213 local governments (prior to the administrative reform), around 100 were indicated as making payments. Ministry divisions tend to make TSD payments on their own, but there are some exceptions. For example, Harju County Court makes TSD payments to all other court workers, except for the Supreme Court. The Ministries of Culture, Finance and Social Affairs make TSD payments to their division workers themselves.

After linking, a person’s main job is ascribed first according to the greater working time rate and, for equal values, according to the greater income. If greater income were considered first instead, a different place of work would be ascribed to 6,000 persons. After choosing the main job in TÖR, the list of employed is supplemented by recipients of business income indicated in form E and recipients of income derived in a foreign state indicated in form A. Forms E and A contain annual income, which is divided by 12 to get a person’s monthly income. The result of linking all five data sources is shown in Table 3. Every employed person is ascribed his or her main source: TÖR, Form E or Form A. If a person is present both in TÖR and Form E, revenues are compared. Form A is ascribed as the main source only if the person does not occur in any other data source of employed persons.
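The main-job rule and the income calculations described above can be sketched as follows. This is an illustration only: the record layout (keys `employer`, `work_rate`, `income`) and the function names are assumptions, and the linking of TÖR employers with TSD payers is taken as already done.

```python
def tsd_gross_income(december, january):
    """Gross income from TSD: the December or January payment,
    or the two-month average if both are present (None if neither is)."""
    payments = [p for p in (december, january) if p is not None]
    return sum(payments) / len(payments) if payments else None


def monthly_from_annual(annual_income):
    """Forms E and A report annual income; divide by 12 for a monthly figure."""
    return annual_income / 12


def main_job(jobs):
    """Pick the main job: greater working time rate first, greater income on ties.

    jobs: list of dicts with the (hypothetical) keys
          'employer', 'work_rate' and 'income' (income may be None if unlinked).
    """
    return max(jobs, key=lambda j: (j["work_rate"], j["income"] or 0))


# Hypothetical person with three simultaneous employment registrations.
jobs = [
    {"employer": "X", "work_rate": 0.5, "income": 1500.0},
    {"employer": "Y", "work_rate": 1.0, "income": 900.0},
    {"employer": "Z", "work_rate": 1.0, "income": 1100.0},
]
chosen = main_job(jobs)
```

Here employer "Z" is chosen: it shares the highest working time rate with "Y" and has the greater income, even though "X" pays more. Swapping the two elements of the sort key would give the income-first variant mentioned in the text.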

Table 3. Data sources of employed

Data source | Number of persons | Percentage
TÖR + TSD Annex | 542 735 | 86,2
TÖR | 41 747 | 6,6
FORM_E | 15 067 | 2,4
TÖR + TSD Annex + INF1 | 10 422 | 1,7
TÖR + TSD Annex + FORM_E | 8 859 | 1,4
TÖR + TSD Annex + FORM_A | 5 427 | 0,9
FORM_A | 3 457 | 0,5
TÖR + FORM_E | 631 | 0,1
TÖR + INF1 | 445 | 0,1
TÖR + FORM_A | 414 | 0,1
TÖR + TSD Annex + INF1 + FORM_E | 229 | 0,0
FORM_E + FORM_A | 178 | 0,0
TÖR + TSD Annex + INF1 + FORM_A | 175 | 0,0
TÖR + TSD Annex + FORM_E + FORM_A | 124 | 0,0
TÖR + INF1 + FORM_E | 11 | 0,0
TÖR + FORM_E + FORM_A | 11 | 0,0
TÖR + INF1 + FORM_A | 9 | 0,0
TÖR + TSD Annex + INF1 + FORM_E + FORM_A | 4 | 0,0
Total | 629 945 | 100,0


Activity status based on register data 2016

Activity status has been compiled on the basis of data for the reference week 12.12–18.12.2016. Compilation uses the list of permanent residents as at 01.01.17, when the population figure published by Statistics Estonia was 1,315,635. Table 4 presents two versions of register-based activity status for mid-December 2016. Version 1 reflects only the result of the algorithm following the order of priority of activity statuses. In version 2, the algorithm also includes the determination of previously registered unemployed and the harmonisation of REGREL and LFS methodology.

Table 4. Two versions of register-based activity status during 12.12–18.12.16

Code | Activity status | Number of persons (version 1) | Percentage (version 1) | Number of persons (version 2) | Percentage (version 2) | Difference (persons)
1.1 | Employed | 625 391 | 47,5 | 609 840 | 46,4 | 15 551
1.2 | Previously registered unemployed | 0 | 0 | 10 095 | 0,8 | N/A
1.2 | Unemployed | 33 372 | 2,5 | 32 943 | 2,5 | 429
2.1 | Below 15 | 213 609 | 16,2 | 213 609 | 16,2 | 0
2.2 | Pensioner | 284 626 | 21,6 | 235 487 | 17,9 | 49 139
2.3 | Student | 61 593 | 4,7 | 67 282 | 5,1 | –5 689
2.4 | Other | 97 044 | 7,4 | 146 379 | 11,1 | –49 335
 | Total | 1 315 635 | 100,0 | 1 315 635 | 100,0 | 0

Below is a list of conditions that serve as a basis for changing the activity status of some persons in version 2. The number of persons affected by particular condition is given in brackets.


• Persons receiving pension for incapacity for work are given the status ‘other’ (42,327 persons).

• Employed persons, whose employment is suspended, are given the status ‘other’ (15,551 persons). The status ‘other’ is given to all persons on parental leave and those, whose employment has been suspended for more than three months.

• Persons with activity status ‘other’ are ascribed as previously registered unemployed, if they are on the list compiled on the basis of EMPIS and STAR data for the last three years and do not have a more recent employment registration (10,101 persons).

• During the reference week, there were approximately 2,500 persons under the age of 25 years, whose pension was suspended, provided that they are not in education. Majority of them were registered in EHIS, while being ascribed as pensioners according to the order of priority. Due to their age, these persons are ascribed activity status ‘student’. Some of them were also under the age of 15 years, but it was possible to ascribe 2,007 persons as students.

• In case of all pensioners under the age of 50 years registered in EHIS, priority is given to student status (1,878 persons).

• In case of persons under the age of 25 years receiving survivor’s pension and registered in EHIS, the priority is given to student status in preference to pensioner status (1,544 persons). The rest of the persons under the age of 25 years receiving survivor’s pension are given the status ‘other’ (206 persons).

• Pensioners, whose pension is suspended, are given the status ‘other’ (1,341 persons).

• If a person at the age of 63 or over, who is ascribed as unemployed, occurs in pension-related data sources, then priority is given to status pensioner (158 persons). The rest of old-age unemployed are given the status ‘other’ (17 persons).
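Some of the conditions listed above can be expressed as a small rule function applied to the version 1 status. The sketch below covers only a subset of the conditions. The flag names are hypothetical, and the order in which overlapping conditions are evaluated (for example, a pensioner under 50 who is registered in EHIS and also receives a pension for incapacity for work) is an assumption of this sketch, since the text does not specify it.

```python
def harmonise(status_v1, age, flags):
    """Apply a subset of the version 2 conditions to a version 1 status.

    flags: set of hypothetical indicator names derived from the registers, e.g.
           'employment_suspended', 'in_ehis', 'incapacity_pension',
           'pension_suspended', 'in_pension_source'.
    """
    if status_v1 == "employed" and "employment_suspended" in flags:
        return "other"
    if status_v1 == "pensioner":
        # Assumption: the student rule is checked before the 'other' rules.
        if age < 50 and "in_ehis" in flags:
            return "student"
        if "incapacity_pension" in flags or "pension_suspended" in flags:
            return "other"
    if status_v1 == "unemployed" and age >= 63:
        return "pensioner" if "in_pension_source" in flags else "other"
    return status_v1  # all remaining persons keep their version 1 status


# Hypothetical examples.
examples = [
    harmonise("employed", 30, {"employment_suspended"}),
    harmonise("pensioner", 55, {"incapacity_pension"}),
    harmonise("unemployed", 64, {"in_pension_source"}),
]
```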

The analysis below makes use of activity status based on version 2, in which activity status breakdowns are closer to census definitions.

As for registers, one person may have several different activity statuses at a time. Activity status combinations can be divided into three major groups:

• All data sources referring to the same activity status – 82.7% of persons aged 15 years and over (Table 5).


• Data sources refer to at least two activity statuses, combination of which is acceptable, because persons can perform several activities simultaneously (e.g. working pensioner or working student) – 16% of persons aged 15 years or over.

• Data sources refer to at least two activity statuses, combination of which can be deemed controversial (e.g. working unemployed) – 1.3% of persons aged 15 years or over.

Percentage calculation left out 97,044 permanent residents in the last column of Table 5, because those persons did not occur in any data sources used for compiling activity status during the census week.
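The three percentages can be recomputed from the Total row of Table 5. The sketch below uses the counts from that row; the dictionary layout and variable names are illustrative assumptions.

```python
# Total row of Table 5: (number of statuses the sources refer to, error?) -> persons.
TOTALS = {
    (1, False): 831_199,
    (2, False): 158_717,
    (2, True): 11_692,
    (3, False): 2_336,
    (3, True): 1_014,
    (4, True): 24,
}
NOT_IN_ANY_SOURCE = 97_044  # excluded from the percentage base

covered = sum(TOTALS.values())  # persons aged 15+ found in at least one source

same_status = TOTALS[(1, False)]
acceptable = sum(n for (k, err), n in TOTALS.items() if k > 1 and not err)
controversial = sum(n for (k, err), n in TOTALS.items() if err)

shares = {
    "same status": round(100 * same_status / covered, 1),
    "acceptable combination": round(100 * acceptable / covered, 1),
    "controversial combination": round(100 * controversial / covered, 1),
}
```

Adding `NOT_IN_ANY_SOURCE` to `covered` gives 1,102,026, the total number of persons aged 15 years or over in Table 5.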

Statistics Slovenia has published an article on register-based activity status, which presents the results as of 1 January 2014 and uses the above division into three groups of combinations of activity statuses (Dolenc 2017). In 75% of cases, Slovenian data sources referred to the same activity status. Combinations of activity status, that could be considered controversial, occurred in 20% of cases. Higher error rate is due to the fact that Slovenia uses data sources also to find persons ascribed activity status ‘other’. In most cases, the source is mandatory health insurance data, which in Slovenian article is considered to have low quality. Activity status ‘other’, in combination with statuses ‘employed’ or ‘unemployed’, are main reasons for higher error rate.

Comparison with LFS

LFS involves all working-age persons aged 15−74 years. Thus, the comparative analysis involves the same age group of persons from register-based census. LFS contains eight reasons for inactivity:

retirement age;

studies;

military service;

illness or injury;

pregnancy or childbirth leave or parental leave;

taking care of children or family members;

discouraged (lost hope to find work);

other reasons.


Table 5. Coincidence of activity statuses among persons aged 15 years or over

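Columns 1 to 4 give the number of activity statuses that the sources refer to, split by whether the combination is an error.

Code | Activity status | Number of persons (version 2) | 1, error: No | 2, error: No | 2, error: Yes | 3, error: No | 3, error: Yes | 4, error: Yes | Not in any source
1.1 | Employed | 609 840 | 451 661 | 151 073 | 3 995 | 2 302 | 786 | 23 | 0
1.2 | Previously registered unemployed | 10 095 | 0 | 0 | 0 | 0 | 0 | 0 | 10 095
1.2 | Unemployed | 32 943 | 25 522 | 0 | 7 235 | 0 | 186 | 0 | 0
2.2 | Pensioner | 235 487 | 235 289 | 46 | 151 | 0 | 1 | 0 | 0
2.3 | Student | 67 282 | 61 593 | 5 429 | 233 | 0 | 27 | 0 | 0
2.4 | Other | 146 379 | 57 134 | 2 169 | 78 | 34 | 14 | 1 | 86 949
 | Total | 1 102 026 | 831 199 | 158 717 | 11 692 | 2 336 | 1 014 | 24 | 97 044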

To compare with REGREL, the reasons are gathered into three categories: retirement age, studies and other reasons. The analysis starts by comparing the general distribution of register-based activity status with the average percentages of LFS for IV quarter 2016 published in the database on the homepage of Statistics Estonia (Figure 3). As the data of LFS 2016 are weighted by using the population figure as at 01.01.2016, the number of working-age population in LFS exceeds the 01.01.2017 number by ca 6,000 persons.


ª The percentage given for the unemployed is the unemployment rate, i.e. the percentage of unemployed among the labour force.

The lower number of persons with register-based activity status ‘employed’ compared to the corresponding number in LFS was expected, as it is no secret that there are cases of undeclared employment (Härma 2017). The search for unregistered unemployed from former registrations in EMPIS and STAR was successful, leaving just a small difference from the unemployment rate in LFS. Reclassification of persons receiving pension for incapacity for work under activity status ‘other’ resulted in more similar proportions of pensioners. LFS pensioners do not include any persons under the age of 50 years, but the youngest persons receiving pension for incapacity for work based on register information were 16 years old. The lower proportion of students is surprising. In LFS, working students should also be classified as employed. Pursuant to register-based activity status, approximately 40,000 students were ascribed as employed and approximately 90% of them received income according to TSD. Even the remaining 10% of students with the status ‘employed’ but no link to income cannot account for the difference of up to 10,000 between students indicated in LFS and register-based students.

Other countries are also heading towards a register-based Population and Housing Census. In Latvia, the 2021 census will be carried out by using register data and periodic sample surveys where necessary (Vegis, Klusa 2017). At the Conference of European Statisticians in Geneva, Statistics Latvia introduced their methodology for compiling register-based activity status and compared the results of 1 January 2015 with estimates of the Latvian Labour Force Survey. Similar to the Estonian methodology, they ascribe activity status ‘other’ to all persons who cannot be ascribed any of the other activity statuses. Comparison with the Latvian LFS revealed that the numbers of persons with activity status ‘employed’ and ‘pensioner’ match rather well in Latvia. Problems occur with unregistered unemployed. In Estonia, we attempted to solve that problem by using former unemployment registrations found in databases when compiling register-based activity status. Latvia is considering using imputation in the 2021 census to reduce the difference from the LFS. Similar to Estonian results, Latvia has fewer register-based students compared to LFS and more persons with activity status ‘other’. Latvia is in the process of establishing a register of students in higher education, which should improve the quality of activity status ‘student’.

For the purposes of comparative analysis, activity statuses indicated in REGREL and LFS were compared on the level of person records. As for the data of LFS 2016, the most recent entry for each person was included, i.e. the data closer to REGREL reference week were preferred. As a result, 42% of the records used for comparison originated from IV quarter, 22% from III quarter and the remaining 36% from the first half of 2016.
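Selecting the most recent LFS entry for each person can be sketched as below. The record layout (keys `person_id`, `survey_week`, `status`) is a hypothetical simplification of the LFS data, used here only for illustration.

```python
def latest_record_per_person(records):
    """Keep, for each person, the LFS record with the highest survey week,
    i.e. the one closest to the REGREL reference week (survey week 50)."""
    latest = {}
    for rec in records:
        pid = rec["person_id"]
        if pid not in latest or rec["survey_week"] > latest[pid]["survey_week"]:
            latest[pid] = rec
    return latest


# Hypothetical LFS records for 2016; person 1 was interviewed twice.
lfs_records = [
    {"person_id": 1, "survey_week": 11, "status": "unemployed"},
    {"person_id": 1, "survey_week": 50, "status": "employed"},
    {"person_id": 2, "survey_week": 37, "status": "student"},
]
latest = latest_record_per_person(lfs_records)
```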

By using LFS data closest to REGREL reference week, i.e. survey week 50, the total coincidence rate was 85.3% (Figure 4). In order to demonstrate the impact of extended LFS period on the coincidence rate, further three different periods were tested:

• LFS survey weeks 49−52 (December),

• LFS data for entire IV quarter,

• LFS data for entire year.

Using survey weeks 49−52 or the IV quarter had no major impact on the coincidence rate, which dropped by ca one percentage point. When all persons participating in LFS in 2016 were included in the comparison, the coincidence rate decreased more, as expected, by three percentage points. Figure 4 also shows the coincidence rates of version 1 of REGREL activity status, to show the extent of improvement after harmonisation of methodologies (ca 5 percentage points for all periods).


Figure 4. Overall coincidence of activity statuses in REGREL 12.12–18.12.2016 / LFS 2016 [Bar chart, vertical axis 0–100; series: REGREL version 1 and REGREL version 2; LFS periods: reference week 50, reference weeks 49–52, IV quarter, year.]

Table 6 presents the coincidence rate in REGREL and LFS IV quarter for each of the five activity statuses. The comparison is based on LFS IV quarter, because there were very few coinciding persons during the reference week (341). The unemployed clearly stand out from the other activity statuses with a lower coincidence of 46%. The coincidence rate of the unemployed is the most affected by increasing distance from the reference week, because unemployment episodes are generally several times shorter than other activity statuses. Those ascribed activity status ‘unemployed’ in LFS in IV quarter could have had a different status by the reference week. Employment and economic inactivity had a much better coincidence rate. Of the persons ‘employed’ according to LFS, 93% were also ‘employed’ in REGREL; the corresponding figures were 88% for ‘pensioners’ and 83% for students. Among the economically inactive, the coincidence rate was the lowest for activity status ‘other’ (56%).

Table 6. Coincidence of activity statuses in REGREL 12.12–18.12.2016 / LFS IV quarter 2016

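Rows: REGREL status. Columns: LFS IV quarter 2016 status (N and column %).

REGREL | Employed N | Employed % | Unemployed N | Unemployed % | Pensioner N | Pensioner % | Student N | Student % | Other N | Other % | Total
Employed | 2 774 | 93,1 | 45 | 22,4 | 39 | 9,0 | 52 | 10,6 | 65 | 10,7 | 2 975
Unemployed | 36 | 1,2 | 93 | 46,3 | 1 | 0,2 | 7 | 1,4 | 55 | 9,0 | 192
Pensioner | 10 | 0,3 | 4 | 2,0 | 382 | 88,0 | 0 | 0,0 | 135 | 22,1 | 531
Student | 13 | 0,4 | 8 | 4,0 | 0 | 0,0 | 406 | 82,9 | 12 | 2,0 | 439
Other | 147 | 4,9 | 51 | 25,4 | 12 | 2,8 | 25 | 5,1 | 343 | 56,2 | 578
Total | 2 980 | 100,0 | 201 | 100,0 | 434 | 100,0 | 490 | 100,0 | 610 | 100,0 | 4 715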

Quarterly Bulletin of Statistics Estonia No. 3/14 published an article introducing the results of comparative analysis of the results of 2011 Population and Housing Census and LFS (Rosenblad 2014). Using LFS 2011 IV quarter as the period of comparison resulted in the following activity status coincidence rates:

• employed in both – 92%,

• unemployed in both – 49%,

• inactive in both – 87%.

Inactive persons were not separately analysed in the article, but when including all pensioners, students and others in Table 6, REGREL and LFS would have a coincidence rate of 86%. Thus, traditional census did not have a significantly better result than the register-based census. The greatest difference in coincidence rate by three percentage points occurred in case of unemployed, but register-based ascribing of this status is more complicated due to great time variance of unemployment episodes and unregistered unemployment.
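The coincidence rates discussed above can be recomputed from the counts in Table 6. The sketch below derives the overall rate, the rate per LFS status, and the rate obtained when pensioners, students and others are combined into one inactive group; the counts come from Table 6, while the variable and function names are illustrative.

```python
STATUSES = ["employed", "unemployed", "pensioner", "student", "other"]

# Counts from Table 6: rows are REGREL statuses, columns are LFS statuses.
TABLE6 = [
    [2774, 45, 39, 52, 65],
    [36, 93, 1, 7, 55],
    [10, 4, 382, 0, 135],
    [13, 8, 0, 406, 12],
    [147, 51, 12, 25, 343],
]


def overall_rate(table):
    """Share of persons with the same status in both sources (diagonal / total)."""
    total = sum(sum(row) for row in table)
    same = sum(table[i][i] for i in range(len(table)))
    return round(100 * same / total, 1)


def rate_by_lfs_status(table):
    """Share of each LFS status (column) that has the same status in REGREL."""
    rates = {}
    for j, status in enumerate(STATUSES):
        column_total = sum(row[j] for row in table)
        rates[status] = round(100 * table[j][j] / column_total, 1)
    return rates


def inactive_rate(table):
    """Coincidence when pensioner, student and other are merged into 'inactive'."""
    inactive = range(2, 5)
    both = sum(table[i][j] for i in inactive for j in inactive)
    lfs_inactive = sum(row[j] for row in table for j in inactive)
    return round(100 * both / lfs_inactive, 1)


overall = overall_rate(TABLE6)
by_status = rate_by_lfs_status(TABLE6)
inactive = inactive_rate(TABLE6)
```

The merged inactive rate comes to about 86%, in line with the figure quoted in the text.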

Table 6 shows that the greatest difference from register-based activity status arises from persons who are ‘employed’ in LFS being given activity status ‘other’ in REGREL. These persons work, but have no records thereof in registers. A closer look at the industry of the compared persons shows that those working in the construction sector clearly stand out. The share of persons working in the construction sector was 8% of all persons employed both in LFS and in REGREL. Out of the persons classified as ‘employed’ only in LFS, 28% worked in construction. When it comes to envelope wages or undeclared employment, construction has always been the most problematic sector. According to the estimation of the Tax and Customs Board, a quarter of construction sector turnover goes to companies that do not declare wages (Ruuda 2017).

Next step was to study the coincidence rate of activity statuses in socio-demographic groups to find out the population groups with higher and groups with lower coincidence of activity statuses. In comparison by gender, the rate was better in case of women than in case of men.


This may be due to a more stable working life of women (Rosenblad 2014). The coincidence rate was 86% for women and 83% for men. Based on ethnicity, the coincidence of activity statuses was the highest in case of Estonians (86%). The coincidence rate of Russians also exceeded 80%, but the remaining ethnic groups had a rate of 75%.

Comparison of REL 2011 and LFS revealed that in terms of age groups, the coincidence rate of activity statuses increases with increasing age. The most critical group consisted in persons aged 20−24 (77%). It appeared that the best coincidence of activity statuses occurred in persons at retirement age, because their statuses are generally more homogenous – inactive in most cases. Comparison of REGREL and LFS did not detect an even increase of the coincidence rate with increasing age. The rate improves until attaining middle age and starts to decline again in subsequent age groups (Figure 5). Considering that the comparison of REL 2011 and LFS did not include a separate study of inactive persons, Figure 5 also presents coincidence rates in age groups in case of adding together the sub-categories of inactive persons in the comparison of REGREL and LFS. Similar to the comparison of REL and LFS, this allows stating that coincidence rate increases with increasing age. Age group 20−24 has one of the lowest coincidence rates in the comparison of REGREL and LFS as well (75%). At this age, people obtain professional education and commence working life, which involves frequent transition from one activity status to another. This is confirmed by the figures in Table 6, indicating that 10% of ‘students’ in LFS are ascribed as ‘employed’ in REGREL. From age 60 onwards, coincidence rate of REGREL and LFS declines significantly. This is greatly due to transitions between activity statuses ‘pensioner’ and ‘other’, which was not studied when comparing REL 2011 and LFS, because that analysis used aggregate number of inactive persons. Table 6 shows that 22% of persons with activity status ‘other’ in LFS are ascribed the status ‘pensioner’ in REGREL. According to SKAIS data, in 98% of cases these are persons receiving old-age pension. LFS focuses on working life, and thus the reasons of inactivity are not a priority. 
As a respondent of LFS, a person receiving old-age pension may choose the reason for inactivity by himself or herself. In 71% of cases, where the pensioner status in REGREL did not coincide with that in LFS, inactivity was due to either an illness or injury. Such information is not available in register-based census and based on priority order, these persons are ascribed as pensioners.


Figure 5. Coincidence of activity statuses by age groups REGREL 12.12–18.12.2016 / LFS IV quarter 2016

Summary

For the first time in the history of the Baltic States, Estonia and Latvia intend to carry out 2021 Census based on register data. Out of all census characteristics, the compilation of activity status is one of the most complicated, as it involves many data sources. Currently used 13 databases are sufficient for compilation of register-based activity status. They cover all activity statuses besides ‘other’, which contains persons who are difficult to find in registers. Considering that the structure of the databases is not set in stone, the functioning of the algorithm must be annually checked, and adjusted where necessary.

Comparison of LFS data and register data provides the best way to check the functionality of the activity status algorithm. Comparative analysis revealed an 85% overall coincidence rate with LFS, which was made possible through harmonisation of REGREL and LFS methodologies. The most problematic were the REGREL and LFS activity status pairs ‘other’-‘employed’ and ‘pensioner’-‘other’. During the compilation of register-based activity status, it was clear that the number of employed persons would be lower than that in LFS, as not all employments are registered. Comparison indicated that the greatest cause of decreased coincidence rate was the assignment of activity status ‘other’ to those classified as ‘employed’ in LFS. In view of the significantly higher share of persons working in the construction sector among the persons employed only in LFS, it is obvious that envelope wages continue to be a major problem in our society.

Another major decrease in coincidence rate (REGREL pensioner, LFS other inactive) comes from methodological differences that cannot be changed. Compilation of register-based activity status relies on the use of the order of priority. In LFS, however, a person can choose the reason for inactivity on his or her own, which places considerable number of old-age pensioners under activity status ‘other’ for the reason of illness or injury.

Among the unemployed, coincidence level was the lowest also when comparing REL 2011 and LFS. Unemployment episodes generally last for much shorter period than the other activity statuses and it is not realistic to expect very high coincidence when comparing LFS IV quarter with a fixed reference week.

The results of compilation of activity status based on register data are similar in Estonia and in Latvia. The statuses ‘employed’ and ‘students’ are underestimated and the status ‘pensioners’ overestimated. Overestimating activity status ‘other’ comes from including in this category all those persons who cannot be classified under other categories. As expected, the status ‘unemployed’ is underestimated due to unregistered unemployment, but this was successfully improved by using data from previous years when testing the algorithm applied in Estonia.

References:

Commission Implementing Regulation (EU) 2017/543 of 22 March 2017 laying down rules for the application of Regulation (EC) No 763/2008 of the European Parliament and of the Council on population and housing censuses as regards the technical specifications of the topics and of their breakdowns http://eur-lex.europa.eu/legal-content/EN/TXT/?qid=1491315145905&uri=CELEX:32017R0543%20

Tiit, E.-M., Maasing, E. (2016). Residency index and its applications in censuses and population statistics. Quarterly Bulletin of Statistics Estonia 3/16, pp 53–60

Maasing, E., Tiit, E.-M., Vähi, M. (2017). Residency index - a tool for measuring the population size. Acta et Commentationes Universitatis Tartuensis de Mathematica, 21 (1), 129−139

Härma, K. (27.11.2017). Ümbrikupalgad nõuavad üha suuremat võitlust – Äripäev [www] https://www.aripaev.ee/uudised/2017/11/27/umbrikupalgad-nouavad-uha-suuremat-voitlust

Dolenc, D. (2017). Deriving labour force characteristics from multiple sources in the Register-based Census of Slovenia http://www.stat.si/StatWeb/File/DocSysFile/9247

Vegis, P., Klusa, A. (2017). Administrative data and sample surveys’ data usage for determination of economic activity of population in register based Population and Housing Census in Latvia https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.41/2017/Meeting-Geneva-Oct/GE_41_2017_5_ENG_unofficial_correction.pdf

Rosenblad, Y. (2014). On the issues of data comparability: employment and unemployment indicators according to the labour force survey and the census. Quarterly Bulletin of Statistics Estonia 3/14, pp 84–95

Ruuda, L. (26.09.2017). Veerand ehitussektorist maksab ümbrikupalka – Postimees [www] https://majandus24.postimees.ee/4255009/veerand-ehitussektorist-maksab-umbrikupalka

1.6 Adapting quality framework for the evaluation of administrative data

By Diana Beltadze, Kaja Sõstra, Maret Muusikus, Kristi Lehto

Statistics Estonia has decided to increase the use of administrative registers for statistical purposes. The target is to prepare for a register-based census using an information technology solution to maximise automated data collection; this requires information technological interoperability.

Estonian register holders have been issued lists of actions required for implementation of a register-based census; these include deadlines and general requirements for database quality assessment, which have to be met by chief processors of databases. In order to conduct the census, the following situation must be achieved by 2021 and beyond:

1. Estonian addresses in all databases (registers) are extracted from the address data system of the Land Board – this ensures standardised address data;

2. All persons in databases are identified on the basis of personal identification code (for natural persons) or register code of the commercial register or non-profit associations and foundations register (for legal persons);

3. Data acquisition is based on the X-Road data exchange layer – this reduces the number of technical data errors;

4. The data for register-based census and other surveys are collected according to agreed data structures, which meet the statistical requirements.

It was agreed that register holders and Statistics Estonia have the following targets throughout the period of preparations for register-based census:

1. Ensure implementation of location addresses conforming to the address data system;

2. Ensure that, by 1 December 2017, classifications are administered and used in accordance with the structure registered in the Administrative System of the State Information System (RIHA);

3. Ensure data migration and transfers to Statistics Estonia via X-Road;

4. Ensure, by 1 January 2018, regular updating of data, regular data quality assessments, possibility to link data to personal identification codes of natural persons and register codes of legal persons;

5. Statistics Estonia conducts register data quality assessments according to the Official Statistics Act.

Metadata

As regards classifications, Statistics Estonia has to approve the classifications used in the databases that are part of the state information system and has to use the agreed classifications in the census round. Harmonisation of classifications and terminology used in administrative registers is one of the key actions in the preparations for the register-based census. Standardised metadata specifications will increase the efficiency of data processing.

Administrative registers currently use their established classifications and definitions, which are often different from the internationally harmonised census classifications and definitions. Furthermore, it was discovered during census preparations that some classifications used in administrative registers have not been specified. An important task is compiling metadata for census characteristics. The specifications of characteristics will enable register holders to submit data in the XML format, facilitating efficient collection of data based on the SDMS framework.

Stages of quality assessment

1. Development of a quality framework and manual

2. Presentation of the manual to register holders

3. Testing activities according to the manual (Population Register and the ADS system)

4. Analysis of feedback

5. Development of a quality assessment approach, based on the main requirements for a register-based census

6. Presenting the new quality assessment principles to register holders

7. Quality assessment

8. Presenting the results of quality assessment to the census steering group, the census committee, and researchers

9. Follow-up activities for a pilot census

Adapting quality framework for the evaluation of administrative data

It is vital to create the necessary prerequisites for a register-based census and to develop a system of organised and harmonised state registers, which requires contributions towards raising the level of data quality. A system of organised and harmonised state registers is highly valuable to Statistics Estonia and makes it possible to significantly reduce the administrative and response burden of enterprises, institutions, organisations and inhabitants, while also increasing the capacity to monitor other processes taking place in society (besides those observed in the census).

During the course of preparations for the register-based census, it has become clear that it is necessary to study the data sources which are new to Statistics Estonia, to determine and develop quality criteria for the data that is going to be captured, to assess the quality of data and give feedback on it, to develop rules for using the data in the system of statistical registers and to put in place methodologies for capturing and processing data for the register-based census. The following results were obtained.

Methodology for data source quality assessment

The data quality instructions developed for register holders include methodologies for measuring and ensuring the quality of data in the information system as a whole. The instructions specify a methodology for managing the monitoring and supervision of database quality and include recommendations for metrics to be used in data quality monitoring.

The developed framework for data quality management includes three elements:

I. Data quality model for measuring and improving the categories associated with data quality for statistical purposes.

II. Set of data quality indicators, which can be used for testing different aspects of data quality.

III. Framework for data quality management, which is a set of iteratively implementable actions to ensure data quality.

Statistics Estonia has issued requirements for register holders, which they have to meet before the start of the register-based census:

• At least 97% coverage of the population;

• Information on required census characteristics must be regularly updated and this process must be documented;

• At least 95% of entries must have an identifier in the standard format;

• At least 95% coverage of relevant census characteristics;

• The rate of material or technical errors in the values of relevant census characteristics may not exceed 1%.
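
The requirements listed above could be checked mechanically once the corresponding figures have been measured for a register. The following is a minimal sketch under that assumption; the class, the field names and the example values are hypothetical and are not part of the quality manual.

```python
from dataclasses import dataclass


@dataclass
class RegisterQuality:
    """Measured quality figures for one register (shares given as 0..1)."""
    population_coverage: float
    updates_documented: bool
    standard_identifier_share: float
    characteristic_coverage: float
    error_rate: float


def failed_requirements(q: RegisterQuality) -> list[str]:
    """Return the requirements from the list above that the register fails."""
    failures = []
    if q.population_coverage < 0.97:
        failures.append("population coverage below 97%")
    if not q.updates_documented:
        failures.append("updating not regular or not documented")
    if q.standard_identifier_share < 0.95:
        failures.append("fewer than 95% of entries have a standard identifier")
    if q.characteristic_coverage < 0.95:
        failures.append("coverage of census characteristics below 95%")
    if q.error_rate > 0.01:
        failures.append("error rate above 1%")
    return failures


# Hypothetical register, invented for illustration only.
example = RegisterQuality(
    population_coverage=0.98,
    updates_documented=True,
    standard_identifier_share=0.96,
    characteristic_coverage=0.85,
    error_rate=0.005,
)
print(failed_requirements(example))
```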

Data quality assessment

Register holders found the developed manual (see Annex 4) difficult to understand and use.

Little progress was made when working according to the manual and, therefore, a decision was made in the spring to simplify the quality assessment procedures for register holders. It was important that all register holders understand the procedures in the same way. It was decided to use the data quality requirements as a basis.

The manual will be used for those registers that are used for the first time for data acquisition.

Data Quality Results

The regular data quality assessment was performed in 2017 in cooperation between Statistics Estonia and the chief and authorised processors of registers (the deadline was 1 December 2017).

The results of the quality assessment were presented to the REGREL Working Group on Registers on 24 November 2017 and to the REGREL Steering Group on 29 November 2017.

The quality assessment was performed in the registers included in the 1st pilot census. An overview is presented in Annex 1.

Summary: The main factor that undermines the quality of census data is the difference between registered and actual residence information. This situation has a significant impact on the structure of households and families.

The quality of the data in the State Register of Construction Works is inadequate. The low quality of the Register of Construction Works is caused by under-coverage of buildings and dwellings, incomplete data on technical characteristics, and lack of updates.

Verification of classifications

An overview of the status of classification verification is presented in Annex 2. The deadline for verification of classifications was 1 Oct 2017.

Exceptions:

• Register of Residence Permits and Work Permits

• Register of Imprisoned and Detained persons and Persons Held in Custody

Summary: The work with classifications has been performed in registers, but the main problems are associated with upgrading the administration of classifications and links to the latest version. As a result, the latest version of the classification of administrative and settlement units (EHAK) has not been adopted or the classifications have not been implemented.

Due to the administrative reform, comments concerning the new version of EHAK were omitted from the review of the use of classifications in registers. However, as a singular exception, one register received a note regarding EHAK, because they did not have any integration with EHAK (which is required).

Only one information system – e-File – did not have any issues with respect to the data structure specified in RIHA. As of 24 November 2017, the only issue in several information systems (Land Register, Health Insurance Database, National Defence Obligation Register, Social Protection Information System, Mandatory Funded Pension Register, Traffic Register, Medical Birth Register, and Causes of Death Registry) was that they had not updated their links to the new version of the classification of administrative and settlement units of Estonia, which has been applicable since the end of October.

The quality was up to standard and no comments of Statistics Estonia were needed in 9 out of 22 registers. (See Annex 3)

Data exchange methods

As regards data exchange, we assessed the method of data transmission.

In order to conduct a register-based census, we need to maximise the level of automation in the data collection process, based on an IT solution. The data should be collected from registers and the collected data should meet the statistical requirements. The current time required for collection of data from registers is not acceptable; the time resources required for data acquisition need to be reduced. Problems are caused by the manner of data collection. The quality assessment revealed the following obstacles that prevent optimisation of the data acquisition process:

1. Non-standard data format (e.g., address data);

2. Different methods of data exchange used by different agencies;

3. Uneven data quality.

Conclusion

According to the results, the following improvements were achieved:

1. the address standard has been implemented in the Population Register;

2. both register holders and Statistics Estonia work on measuring the quality of data;

3. a secure data exchange platform is available;

4. registers have appointed holders and assigned tasks;

5. census data can be used for methodological work with data;

6. some registers have adopted data quality standards.

The quality assessment identified three main issues that complicate the conduct of a register-based census:

1. For nearly a quarter of the population, the registered place of residence differs from the actual place of residence (analysis covering the 1st and 2nd quarters of 2017, Eurostat grant “Improvement of the quality of EU census (2021 and post-2021)”);

2. Occupation and workplace location of residents are currently not recorded in registers;

3. The quality of the data in the State Register of Construction Works is inadequate. The low quality of the Register of Construction Works is caused by under-coverage of buildings and dwellings, incomplete data on technical characteristics, and lack of updates.

It was vital to create the necessary prerequisites for a register-based census and to develop a system of organised and harmonised state registers, which requires contributions towards raising the level of data quality. A system of organised and harmonised state registers is highly valuable to Statistics Estonia and makes it possible to significantly reduce the administrative and response burden of enterprises, institutions, organisations and inhabitants, while also increasing the capacity to monitor other processes taking place in society (besides those observed in the census).

During the course of preparations for the register-based census, it has become clear that it is necessary to study the data sources which are new to SE (the Employment Register, databases of the Border Guard Board, housing loans), to determine and develop quality criteria for the data that is going to be captured, to assess the quality of data and give feedback on it, to develop rules for using the data in the system of statistical registers and to put in place methodologies for capturing and processing data for the register-based census. For the second pilot census held in 2018, the following was done:

1. The quality framework was adapted in terms of the number and types of quality indicators.

2. All the quality aspects identified for administrative data (source, metadata, data and process) were identified and compared for the register-based census and other statistics.

3. Procedures were set up for the quality assessment of registers at the organisational level; these should be revised during the second PHC pilot.

1.7 Case study “How the quality of register was measured for Estonian Building Register (EBR)?”

By Vassili Levenko

The background

In 2008, ADS received a copy of part of EBR: buildings with their parts (apartments and non-living spaces), namely their geographic information (addresses of buildings). Since then, all „new“ houses and their parts should have identical geographic data in both registers, ADS and EBR.

At the same time, ADS received all Estonian buildings from the Estonian Topography Database (ETAK). This set is considered to include all Estonian buildings, because it is regularly updated by the orthophoto method.

Since then, the task has been to match buildings from EBR with buildings from ETAK. This task has not yet been completely solved.

Under Estonian law, when people want to register their place of residence, or to buy or sell an apartment or a non-living space in a building, the object must be in the ADS database (it must have an address and an address-object ID). In 2012, it was decided to allow parts of buildings missing from EBR to be entered into ADS only, because for bureaucratic reasons it was not possible to enter such information into EBR.

So, from 2012 onwards, parts of buildings not covered by EBR began to accumulate in ADS.

Under-coverage of EBR buildings

ADS has buildings from two databases: the Estonian Topography Database (ETAK) and EBR. The ETAK building database has a unique building ID (ETAK_ID) and consists of all Estonian buildings. The EBR part has a unique building ID (EHR_KOOD) and consists of buildings from EBR. Both parts have a unique ADS address-object ID (ADS_OID).

If an ETAK building is matched with an EBR building, one new record is formed instead of the two old records; this record has both IDs, ETAK_ID and EHR_KOOD. At any moment, the ADS database has three types of records for buildings:

- B_2: records with both IDs (matched buildings);
- B_1_ETAK: records with only ETAK_ID (not matched, ETAK);
- B_1_EBR: records with only EHR_KOOD (not matched, EBR).

The under-coverage (UC) of EBR is:

UC = B_1_ETAK – B_1_EBR.

At 01.01.2018, UC = 806 – 697 = 109 thousand buildings, which makes 15.6% of all buildings. Additionally, ADS has a special unique ID (ADS_OID), which tells us whether a building is residential or not. This gives REGREL the possibility to calculate the under-coverage of residential buildings (UC_R):

UC_R = B_1_ETAK_R – B_1_EBR_R,

where B_1_ETAK_R and B_1_EBR_R are the residential parts of B_1_ETAK and B_1_EBR, respectively. At 01.01.2018, UC_R = 287 – 271 = 16 thousand buildings, which makes 4.5% of all residential buildings.
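
The record types and the under-coverage formula can be sketched as follows. This is only an illustration: it assumes that each ADS building record is a dictionary in which a missing ETAK_ID or EHR_KOOD is stored as None, and the sample records are invented. The last two lines only re-check the subtraction of the figures reported in the text.

```python
from collections import Counter


def record_type(record):
    """Classify an ADS building record by the source IDs it carries."""
    has_etak = record.get("ETAK_ID") is not None
    has_ebr = record.get("EHR_KOOD") is not None
    if has_etak and has_ebr:
        return "B_2"
    if has_etak:
        return "B_1_ETAK"
    if has_ebr:
        return "B_1_EBR"
    raise ValueError("building record has neither ETAK_ID nor EHR_KOOD")


def under_coverage(records):
    """UC = B_1_ETAK - B_1_EBR, counted over ADS building records."""
    counts = Counter(record_type(r) for r in records)
    return counts["B_1_ETAK"] - counts["B_1_EBR"]


# Invented records for illustration.
records = [
    {"ADS_OID": "A1", "ETAK_ID": 1, "EHR_KOOD": "E1"},
    {"ADS_OID": "A2", "ETAK_ID": 2, "EHR_KOOD": None},
    {"ADS_OID": "A3", "ETAK_ID": 3, "EHR_KOOD": None},
    {"ADS_OID": "A4", "ETAK_ID": None, "EHR_KOOD": "E4"},
]
uc = under_coverage(records)

# Arithmetic check of the figures reported in the text (thousand buildings).
uc_reported = 806 - 697
uc_r_reported = 287 - 271
```

The same function would give UC_R if it were applied only to the records flagged as residential.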

Total population of dwellings (TPD)

TPD consists of two big parts – occupied living spaces and unoccupied conventional dwellings.

It was shown above that EBR has under-coverage both in buildings and in parts of buildings. On the other hand, ADS has all buildings and parts of buildings; that is why the ADS database was chosen as the source for TPD.

To get the occupied part of TPD (TPD_occup), we need the total population of persons (TPP). Each person in TPP has an address-object ID for his or her place of residence (ADS_OID); the ADS database has the same ID. To get TPD_occup, we join TPP and TPD by ADS_OID. For REGREL output purposes, we need to know whether an occupied living space is a conventional dwelling, a collective living quarter or other housing unit. To find this out, we need to join TPD_occup with EBR to see the main_purpose_of_use of buildings. This join is possible only for buildings with EHR_KOOD.

Unoccupied conventional dwellings (TPD_unoccup) are either unoccupied apartments in buildings (TPD_unoccup_A) or unoccupied buildings for one household (TPD_unoccup_B):

TPD_unoccup = TPD_unoccup_A + TPD_unoccup_B.

TPD_unoccup_A is simple to get because parts of buildings in ADS have marks of two types: apartments (ER) and non-living spaces (MR). All occupied apartments are already marked, so we take the apartments which are not marked as places of residence. To get TPD_unoccup_B, we have to join unoccupied buildings with EBR, look at the main_purpose_of_use of the buildings and choose the appropriate ones.
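
The joins described in this subsection can be sketched as follows. This is a simplified illustration, not the production algorithm: the record layout, the "kind" value "BUILDING" and the purpose label "one-household dwelling" are assumptions made for the example, while ER, MR, ADS_OID, EHR_KOOD and main_purpose_of_use come from the text.

```python
def occupied_ids(persons):
    """ADS_OID values that appear as a place of residence in TPP."""
    return {p["ADS_OID"] for p in persons if p.get("ADS_OID") is not None}


def split_tpd(dwellings, persons, ebr_purpose, one_household_purposes):
    """Split TPD into TPD_occup, TPD_unoccup_A and TPD_unoccup_B.

    ebr_purpose maps EHR_KOOD to main_purpose_of_use (taken from EBR).
    """
    occ = occupied_ids(persons)
    tpd_occup = [d for d in dwellings if d["ADS_OID"] in occ]
    # Unoccupied apartments: parts of buildings marked ER with no resident.
    unoccup_a = [d for d in dwellings
                 if d["kind"] == "ER" and d["ADS_OID"] not in occ]
    # Unoccupied one-household buildings: needs EBR, so EHR_KOOD is required.
    unoccup_b = [d for d in dwellings
                 if d["kind"] == "BUILDING"
                 and d["ADS_OID"] not in occ
                 and ebr_purpose.get(d.get("EHR_KOOD")) in one_household_purposes]
    return tpd_occup, unoccup_a, unoccup_b


# Invented data for illustration.
dwellings = [
    {"ADS_OID": "A1", "kind": "ER", "EHR_KOOD": "E1"},
    {"ADS_OID": "A2", "kind": "ER", "EHR_KOOD": "E1"},
    {"ADS_OID": "A3", "kind": "MR", "EHR_KOOD": "E1"},
    {"ADS_OID": "B1", "kind": "BUILDING", "EHR_KOOD": "E2"},
    {"ADS_OID": "B2", "kind": "BUILDING", "EHR_KOOD": None},
]
persons = [{"id": 1, "ADS_OID": "A1"}, {"id": 2, "ADS_OID": None}]
ebr_purpose = {"E2": "one-household dwelling"}

occup, unoccup_a, unoccup_b = split_tpd(
    dwellings, persons, ebr_purpose, {"one-household dwelling"})
tpd_unoccup = unoccup_a + unoccup_b
```

Building B2 has no EHR_KOOD, so its purpose of use cannot be looked up and it is left out of TPD_unoccup_B, which mirrors the limitation noted in the text.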

Currently, the latest TPD has been compiled as at 01.01.2017.

At 01.01.2018, there were 3000 buildings without main_purpose_of_use, which is less than 0.5% of all EBR buildings.

Technical characteristics for TPD

Individual technical characteristics for TPD taken from EBR are:

- Type of living quarter;
- Floor space / number of rooms;
- Water supply system;
- Toilet facilities;
- Bathroom;
- Type of heating;
- Period of building’s construction.

Definitions

The definitions of technical characteristics in EBR were assessed in terms of compliance with the definitions of the Regulation at the time of the First Trial Census in 2016. They were fully compliant in most cases.

Coverage

Coverage is one of the main quality criteria for censuses. This estimate is based on the share of missing values of individual characteristics.

Some EBR quality numbers for population of dwellings at 01.01.2017:

Dwelling population     Item                                 EBR coverage   With help of Census_2011
OCC and (CDW or OHU)    floor space                          95.5%          96.4%
OCC and (CDW or OHU)    number of rooms                      94.9%          96.3%
OCC and (CDW or OHU)    water supply system                  84.6%          99.6%
OCC and (CDW or OHU)    toilet facilities                    80.5%          99.4%
OCC and (CDW or OHU)    bathroom                             72.4%          99.6%
OCC and (CDW or OHU)    type of heating                      85.3%          99.6%
CDW                     period of building’s construction    67.8%          93.7%

Where OCC is „occupied dwelling“; CDW is „conventional dwelling“; OHU is „other housing unit“.
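
Item coverage, understood as the share of dwellings with a non-missing value for a characteristic, can be sketched as below. The optional fallback illustrates the idea of filling gaps from another source such as Census 2011; the record layout, the linking key and the data are assumptions made for this example only.

```python
def item_coverage(dwellings, item, fallback=None, key="ADS_OID"):
    """Share of dwellings with a value for `item`.

    `fallback` is an optional mapping from the linking key to a record
    from another source, used only when the primary value is missing.
    """
    if not dwellings:
        raise ValueError("empty dwelling population")
    fallback = fallback or {}
    filled = 0
    for dwelling in dwellings:
        value = dwelling.get(item)
        if value is None:
            value = fallback.get(dwelling[key], {}).get(item)
        if value is not None:
            filled += 1
    return filled / len(dwellings)


# Invented data for illustration.
dwellings = [
    {"ADS_OID": "A1", "floor_space": 54.0},
    {"ADS_OID": "A2", "floor_space": None},
    {"ADS_OID": "A3", "floor_space": 71.5},
    {"ADS_OID": "A4", "floor_space": None},
]
census_2011 = {"A2": {"floor_space": 48.0}}

register_only = item_coverage(dwellings, "floor_space")
with_census = item_coverage(dwellings, "floor_space", fallback=census_2011)
```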

1.8 Developing statistical register systems for managing statistical registers of population, businesses, agricultural holdings, buildings and dwellings.

1.8.1 Working out methodology for housing data management in statistical register system

By Svetlana Šutova

The development of methodology for housing data management in the statistical register system (SRS) has been based on the principle that all data in registers should be provided with information on the sources of data, the persons who modified data, and the dates of modifications. New classifying and quantitative indicators can be added to a register object. In addition to a constantly updating database, the register also holds copies of an annual dataset.

The basic data for housing statistics should be maintained in the SRS register of buildings and rooms, which is used to manage data that can potentially include objects found in three SRS registers – economic entities, agricultural holdings, and persons.

1. Adding new objects to the SRS register of buildings and rooms.

The original source of information on buildings and rooms is the ADS (Address Data System), which provides address data and data of address objects. Only non-residential premises, non-residential buildings and cadastral units associated with residential buildings and individuals or businesses will be selected from the ADS address objects for entry into SRS.

The ADS data will be acquired through X-Road from the database maintained by the Land Board and will be maintained in SE’s source database for address data. The acquisition of data from ADS to the source database of addresses will be performed once every 24 hours, at night.

On entry in the source database for address data, SE will generate an internal building ID and a survey area code. Before ADS data are entered in SRS, they undergo initial verification while dummy rooms are created as well. The practice of creating dummy rooms was introduced due to the requirement of having persons in a statistical register linked to an address at a room level. This requirement, in turn, is important because of the methodology of establishing households based on addresses. As the residence address data obtained from primary registers can be incomplete and sometimes the room level data is missing in case of persons living in a building with several dwellings, we decided to create a dummy room for each building, irrespective of the actual presence of a room.
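
The dummy-room practice can be illustrated with a short sketch. It assumes a simplified layout in which a room is identified by the pair (building ID, room ID) and the dummy room of a building is stored under the room ID None; these conventions and all names are hypothetical.

```python
def build_room_index(buildings):
    """Index actual rooms and add one dummy room for every building."""
    index = {}
    for building in buildings:
        for room_id in building.get("room_ids", []):
            index[(building["building_id"], room_id)] = {"is_dummy": False}
        # A dummy room is created irrespective of the presence of real rooms.
        index[(building["building_id"], None)] = {"is_dummy": True}
    return index


def room_key_for_person(person, index):
    """Link a person to an actual room if known, else to the dummy room."""
    key = (person["building_id"], person.get("room_id"))
    if key in index:
        return key
    dummy_key = (person["building_id"], None)
    if dummy_key in index:
        return dummy_key
    raise KeyError(f"unknown building {person['building_id']!r}")


# Invented data for illustration.
buildings = [
    {"building_id": "B1", "room_ids": ["1", "2"]},
    {"building_id": "B2"},  # no room-level data at all
]
index = build_room_index(buildings)

complete = room_key_for_person({"building_id": "B1", "room_id": "2"}, index)
incomplete = room_key_for_person({"building_id": "B1", "room_id": None}, index)
```

With this layout, every person with a known building can be linked at room level, which is what the address-based household methodology requires.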

A special buffer was created to transfer ADS data to SRS. The buffer is used for adding new address and object records, as well as for correcting earlier records. Data are transferred once every 24 hours.

The following steps are used for transferring data from the ADS buffer to the tables of the SRS database.

1) Address components are loaded into the SRS base table ADDR_COMP.

2) Addresses are loaded into the SRS base table ADDRESSES.

3) Invalid addresses obtain respective flags in the base table ADDRESSES.

4) Address objects are loaded into the SRS base table ADDR_OBJ.

5) Invalid address objects obtain respective flags in the base table ADDR_OBJ.

6) Links between address objects and addresses are loaded into the SRS base table ADDR_OBJ_ADDRESSES.

7) New address objects of types ‘EE’ and ‘ER’ are entered in the SRS register of buildings and rooms.

Thus new buildings and rooms are added to the SRS only with ADS address objects through the ADS buffer. This ensures 100% equivalency between datasets maintained in different applications of SE.
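
Step 7 of the transfer can be sketched as a simple selection from the buffer. The sketch assumes that buffer records are dictionaries with a "type" and an ADS_OID and that the register already knows a set of ADS_OID values; the record layout and function name are hypothetical, while the types ‘EE’ and ‘ER’ come from the text.

```python
REGISTER_TYPES = {"EE", "ER"}  # address object types entered in the register


def objects_to_enter(buffer_objects, existing_oids):
    """Select new address objects of types 'EE' and 'ER' from the buffer."""
    seen = set(existing_oids)
    selected = []
    for obj in buffer_objects:
        if obj["type"] in REGISTER_TYPES and obj["ADS_OID"] not in seen:
            selected.append(obj)
            seen.add(obj["ADS_OID"])  # guard against duplicates in the buffer
    return selected


# Invented buffer content for illustration.
buffer_objects = [
    {"ADS_OID": "EE001", "type": "EE"},
    {"ADS_OID": "ER001", "type": "ER"},
    {"ADS_OID": "MR001", "type": "MR"},
    {"ADS_OID": "EE000", "type": "EE"},
    {"ADS_OID": "ER001", "type": "ER"},
]
new_objects = objects_to_enter(buffer_objects, existing_oids={"EE000"})
```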

If a new address is obtained in ADS for an invalid address, the object link will automatically update to this address in all SRS registers. This function is required because, in addition to adding new addresses and address objects to ADS, there are also continuous efforts to improve the quality of existing data – data are specified and duplicates are eliminated.

2. Adding updates to the SRS register of buildings and rooms.

The data on construction year and technical characteristics of buildings and rooms will be obtained from the Construction Register. These data are acquired through X-Road and maintained in the source database of the Construction Register. Currently, data acquisitions are performed once per quarter.

Acquiring data updates in the SRS register of buildings and rooms is enabled through a universal SRS buffer. The required frequency of data acquisition will be established at a later date, based on the needs of data users. Logical data checks and preparation of data are performed in VAIS.

The following steps are used for transferring data from the buffer to the tables of the SRS database:

1) Data of buildings and rooms are updated

2) An error report on the data update session is issued

The following procedures are enabled in data updating: object identification, adding new value for a characteristic, adding the expiry date for a characteristic, adding only missing values, etc. The specific procedure code is added to data in VAIS (data processing information system), as part of the data update preparation package for the register of buildings and rooms, before the transfer to the SRS buffer.

In the SRS, data updates can be entered in base records, as well as different annual records. The type of target record for updated data is added to data in VAIS (data processing information system), as part of the data update preparation package for the register of buildings and rooms, before the transfer to the SRS buffer. As mentioned above, audit fields – data source, person who made the change, and date of change – are preserved when data updates are added to the register.

The procedure codes and types of records for regular data updates will be specified once, at the programming stage of the data update package. In case of non-recurrent data records, the user can specify procedure codes and record types as required.
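
How procedure codes could drive an update is sketched below for a single characteristic of one register object. The procedure codes ADD_VALUE, ADD_MISSING_ONLY and EXPIRE, the version layout and the example values are hypothetical; they only illustrate the procedures named above (adding a new value, adding only missing values, adding an expiry date) together with the audit fields.

```python
from datetime import date

PROCEDURES = {"ADD_VALUE", "ADD_MISSING_ONLY", "EXPIRE"}


def apply_update(history, value, procedure, source, user, changed):
    """Apply one update to the version history of a single characteristic.

    `history` is a list of versions; the version with valid_to=None is current.
    """
    if procedure not in PROCEDURES:
        raise ValueError(f"unknown procedure code: {procedure}")
    current = next((v for v in history if v["valid_to"] is None), None)
    if procedure == "EXPIRE":
        if current is not None:
            current["valid_to"] = changed
        return history
    if procedure == "ADD_MISSING_ONLY" and current is not None:
        return history  # a value already exists, so nothing is added
    if current is not None:
        current["valid_to"] = changed  # close the previous version
    history.append({
        "value": value,
        "valid_from": changed,
        "valid_to": None,
        # Audit fields: data source, person who made the change, date of change.
        "source": source,
        "user": user,
        "changed": changed,
    })
    return history


# Invented example: construction year of one building.
year = []
apply_update(year, 1975, "ADD_MISSING_ONLY", "EBR", "loader", date(2018, 1, 1))
apply_update(year, 1980, "ADD_MISSING_ONLY", "EBR", "loader", date(2018, 2, 1))
apply_update(year, 1980, "ADD_VALUE", "EBR", "analyst", date(2018, 4, 1))
```

Keeping closed versions instead of overwriting them is one way to preserve the audit trail; the actual SRS design may differ.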

1.9 Integration of new administrative sources

Main activities:

• Building identifiers for linking administrative data;

• Integrating a new administrative source in statistical production or making better use of a source which has already been used for the first pilot census in 2016;

• Deriving new variables from administrative variables and ensuring that they comply with the census definitions and are comparable;

• Developing register-based housing statistics;

• Developing register-based study migration.

1.9.1 Developing register-based housing statistics

By Vassili Levenko

Definitions

Dwelling – conventional dwelling, collective living quarter or other housing unit (types of dwelling).

Conventional dwellings – structurally separate and independent premises at fixed locations which are designed for permanent human habitation.

Collective living quarters – premises which are designed for habitation by large groups of individuals or several households and which are used as the usual residence by at least one person (social welfare institutions, dormitories, prisons, convents, monasteries).

Other housing units – occupied non-residential premises (huts, cabins, shacks, shanties, caravans, houseboats, barns, mills, caves or other shelters).

Occupant – Estonian permanent resident living in the premises. For more detailed information on the determination of Estonian permanent residents, please see information on website: Implementation of the residency index in demographic statistics.

Occupied dwelling – at least one resident has registered it as the place of residence in the Population Register.

Floor space of a dwelling – floor space measured inside the outer walls excluding non-habitable cellars and attics and, in multi-dwelling buildings, all common spaces.

Room – space in a housing unit enclosed by walls reaching from the floor to the ceiling or roof, of a size large enough to hold a bed for an adult (4 square metres at least) and at least 2 metres high over the major area of the ceiling.

Amenities – piped water, flush toilet, bath or shower, central heating.

Central heating – dwelling is considered as centrally heated if heating is provided either from a community heating centre or from an installation built in the building or in the housing unit, established for heating purposes, without regard to the source of energy.

Dwelling with amenities – the dwelling has piped water, flush toilet, bath or shower and central heating.

Dwelling without amenities – the dwelling lacks at least one of the four amenities.
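
The last two definitions can be expressed as a small classification rule. In this sketch the field names are hypothetical, and the treatment of missing information (returning "unknown" when no amenity is known to be lacking but not all four are confirmed) is an assumption of the example, not a rule stated in the definitions.

```python
AMENITIES = ("piped_water", "flush_toilet", "bath_or_shower", "central_heating")


def amenity_class(dwelling):
    """Classify a dwelling by the four amenities (True, False or None each)."""
    values = [dwelling.get(name) for name in AMENITIES]
    if any(value is False for value in values):
        return "without amenities"  # lacks at least one of the four
    if all(value is True for value in values):
        return "with amenities"
    return "unknown"  # assumption: missing data and no known lack


# Invented dwellings for illustration.
full = dict.fromkeys(AMENITIES, True)
no_heating = {**full, "central_heating": False}
partly_known = {"piped_water": True, "flush_toilet": True}
```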

Methodology

A. Data collection

The data for dwellings have been collected from registers. The main registers are the following:

1. Address Data System (ADS);

2. Population Register (PR);

3. Register of Construction Works.

In addition to the main registers, information from the Register of Prisoners and Probationers, the State Pension Insurance Register, etc. has been used.

B. Data processing

The total population of dwellings consists of all dwellings in the ADS as address objects as at 1 January.

The total population of dwellings consists of occupied and unoccupied dwellings.

It is not possible to assign all permanent residents of Estonia to dwellings, as some of the PR addresses are of poor quality and cannot be matched to address objects in the ADS, and some addresses are only at the municipality level. As at 1 January 2016 and 1 January 2017, the number of permanent residents not assigned to dwellings was approximately 3.5% of the total number of permanent residents.

C. Confidentiality

The dissemination of data collected for the production of official statistics is based on the requirements provided for in §§ 34 and 35 of the Official Statistics Act.

D. Classifications

Classification of Estonian Administrative Units and Settlements

Scheme for forming the dwelling processing base

[Figure: scheme with boxes labelled Land register, Building register, Population and Housing Census 2011, Owners and ownerships, Technical data, Info about ownership, Broadened set of population, Total set of dwellings, and Dwelling processing base]


The source of the characteristics of housing output is the dwelling processing base, which summarizes all the variables necessary for the formation of the characteristics of the dwelling.

Results:

1) Algorithms (Annex IV)

2) In February 2018 Statistics Estonia for the first time published 5 housing tables:

RSE01: DWELLINGS BY COUNTY

RSE02: CONVENTIONAL DWELLINGS BY YEAR OF CONSTRUCTION AND COUNTY

RSE03: OCCUPIED DWELLINGS (EXCL. COLLECTIVE LIVING QUARTERS) BY NUMBER OF ROOMS AND COUNTY

RSE04: OCCUPIED DWELLINGS (EXCL. COLLECTIVE LIVING QUARTERS) BY FLOOR SPACE AND AMENITIES

RSE05: OCCUPIED DWELLINGS (EXCL. COLLECTIVE LIVING QUARTERS) BY PRESENCE OF AMENITIES

3) Dissemination of housing statistics in 2018: „Three times more dwellings in Estonia compared to 100 years ago“ https://www.stat.ee/news-release-2018-019

1.9.2 Building identifiers for linking administrative data

By Krista Türk, Helle Visk, Pille Kool

All the databases that transmit data about persons use personal ID codes. The creation of pseudo ID codes is described in chapter 1.5. Persons who do not have a correct Estonian personal ID code are assigned a pseudo ID code based on the person’s name, sex and date of birth.
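
One possible way to derive a stable pseudo ID code from a person’s name, sex and date of birth is sketched below. It is an illustration only: the normalisation rules, the use of SHA-256 and the “P” prefix are assumptions of this example and do not describe the method actually used by SE.

```python
import hashlib
import unicodedata
from datetime import date


def _normalise(text):
    """Unify Unicode form, letter case and whitespace before hashing."""
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def pseudo_id(first_name, last_name, sex, birth_date):
    """Derive a deterministic pseudo ID from name, sex and date of birth."""
    key = "|".join([
        _normalise(first_name),
        _normalise(last_name),
        sex.strip().upper(),
        birth_date.isoformat(),
    ])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return "P" + digest[:16]  # hypothetical format: prefix plus 16 hex digits


# Invented persons for illustration.
a = pseudo_id("Mari", "Tamm", "F", date(1980, 5, 17))
b = pseudo_id("  MARI ", "tamm", "f", date(1980, 5, 17))
c = pseudo_id("Mari", "Tamm", "F", date(1980, 5, 18))
```

Because the code is derived from the same attributes every time, the same person receives the same pseudo ID code in different datasets, provided the attributes are recorded consistently.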

An exception was the SKAIS database (Estonian National Pension Insurance Register), which transmitted data with its internal ID codes and separately sent the connection between the internal ID code and the personal ID code. The reason was that the register wanted to provide the connection only once (in one table), although the register transmits more than 20 different datasets. Problems were caused because the register has two subsystems with different ID codes. It was not always clear from the variable names in the datasets which internal ID code was used. In addition, some connections between the internal ID code and personal ID code were missing, as some datasets included more than one column with an internal ID code, and these connections were not transmitted. Some datasets were transmitted repeatedly, which also increased the number of missing connections. SE repeatedly requested the missing connections and compiled the list on the basis of the received data.

As a result, SE got five different tables with internal ID code and personal ID code connections, which we combined together, after which there were still a few missing connections.
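The combining step can be illustrated with a minimal sketch. It is not the procedure SE actually used: the function names, the tuple layout of the link tables and the example codes are all hypothetical, and the sketch only shows how several link tables could be merged while flagging conflicting and still-missing connections.

```python
from collections import defaultdict


def combine_link_tables(tables):
    """Merge several (internal_id, personal_id) link tables into one mapping.

    Returns the merged mapping and the internal IDs that were linked to
    more than one personal ID code, which would need manual review.
    """
    candidates = defaultdict(set)
    for table in tables:
        for internal_id, personal_id in table:
            candidates[internal_id].add(personal_id)
    mapping = {k: next(iter(v)) for k, v in candidates.items() if len(v) == 1}
    conflicts = {k: sorted(v) for k, v in candidates.items() if len(v) > 1}
    return mapping, conflicts


def missing_connections(dataset_internal_ids, mapping):
    """Internal IDs used in a dataset that have no personal ID code yet."""
    return sorted(set(dataset_internal_ids) - set(mapping))


# Hypothetical example data: two link tables and one transmitted dataset.
table_a = [("A1", "P100"), ("A2", "P200")]
table_b = [("A2", "P200"), ("B7", "P300")]
mapping, conflicts = combine_link_tables([table_a, table_b])
still_missing = missing_connections(["A1", "B7", "B9"], mapping)
print(mapping, conflicts, still_missing)
```

Keeping conflicts separate from the merged mapping avoids silently choosing one personal ID code when two subsystems disagree.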

Lesson learned: SE is renewing the data exchange contract with the Estonian National Insurance Register. The personal ID code (from PR) will be included in each dataset to avoid missing connections between the register’s internal ID codes and personal ID codes.

1.9.3 Deriving new variables from administrative variables and ensuring that they comply with the census definitions and are comparable

By Kristi Lehto

A thorough analysis of the compilation of the variable “activity status” on the basis of register data and in comparison with Labour Force Survey data has been given in chapter 1.6.

The main difference in the definition is the length of working time. According to the regulation:

Employed persons comprise all persons aged 15 years or over who during the reference week: a) performed at least one hour of work for pay or profit, in cash or in kind, or b) were temporarily absent from a job in which they had already worked and to which they maintained a formal attachment, or from a self-employment activity.

Based on register data, it is not possible to use an hour of work as a time criterion. The smallest unit of time that can be used on the basis of register data is one day. This is possible, as the start and end dates of employment are registered in the Employment Register (established 01.07.2014). Before the creation of the Employment Register, the shortest time period in the registers regarding employment was one month, because tax returns are monthly, which increased the difference from the international definition of employment.
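A minimal sketch of the day-based criterion follows. It assumes, purely for illustration, that each employment spell is available as a start date and an optional end date; the function name and the example dates are hypothetical and do not reproduce the production algorithm.

```python
from datetime import date


def employed_in_reference_week(spells, week_start, week_end):
    """Return True if any employment spell covers at least one day of the week.

    spells: list of (start_date, end_date) tuples; end_date is None for
    employment that has not ended.
    """
    for start, end in spells:
        if start <= week_end and (end is None or end >= week_start):
            return True
    return False


# Illustrative reference week and spells (assumed values).
week_start, week_end = date(2015, 12, 14), date(2015, 12, 20)
ended_before = [(date(2015, 1, 1), date(2015, 12, 13))]
started_last_day = [(date(2015, 12, 20), None)]
print(employed_in_reference_week(ended_before, week_start, week_end))
print(employed_in_reference_week(started_last_day, week_start, week_end))
```

With monthly tax data the same test could only be made against whole months, so a person who worked on a single day outside the reference week would still count as employed.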

Data for the mandatory variable of the census, “Year of arrival in the country since” is available in the population register. According to analyses, the variable in the population register cannot directly be used as the census variable, as in the first half of the 1990s, when the register was created, 1994 or 1995 was marked as the year of arrival in the country for many persons. According to migration data, immigration numbers in the 1990s were not that large. To solve the problem, we are using the data of PHC2011 and different population register variables (entry creation date, country of birth) so that the created census variable would correspond as closely as possible to the definition in the regulation.

1.10 Developing statistical output production

By Ülle Valgma

Objectives

1) Overcoming confidentiality problems (safeguards and methods to protect confidentiality, anonymization of datasets).

2) Geocoding of address and building registers for statistical purposes: adding missing codes and updating existing codes.

3) Implementing methods for privacy and security, and working out the security measures required for grid statistics for census 2021.

Results:

1) Releasing GRID-based data with resolution 1 km x 1 km for the whole territory of Estonia and the grid map with resolution 500 m x 500 m

2) Study migration and blog

3) Study of confidentiality issues of PHC 2000 and PHC 2011 datasets for releasing grid-based data


Revision of confidentiality rules

The grant work was planned because of the regulation on the dissemination of a set of 2021 census data at the level of a 1 km² grid.

The issue of confidentiality was not discussed in greater detail earlier. During the grant implementation it became clear that the idea suggested in the regulation that the total number of population should not be confidential is not acceptable for Estonia.

We tested different methods for publishing data on a grid map: if fewer than 4 inhabitants live in a grid cell, the value of the cell is published as <4; additionally, in such cases the age-sex distribution is not published. The data on uninhabited buildings are not published either, since we do not want empty buildings to be identifiable from these data. Therefore, the displayed value of the population in these grid cells is also <4.
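The rule described above can be written as a short sketch. The function name and the dictionary layout are hypothetical, and the sketch assumes that only grid cells containing at least one building are passed in.

```python
THRESHOLD = 4  # from the text: fewer than 4 inhabitants is displayed as "<4"


def publishable_cell(population, age_sex_distribution):
    """Return the values that would be displayed for one grid cell."""
    if population < THRESHOLD:
        # Covers cells with 1-3 inhabitants and cells that contain only
        # uninhabited buildings (population 0); both are shown as "<4".
        return {"population": "<4", "age_sex": None}
    return {"population": population, "age_sex": age_sex_distribution}


# Hypothetical cells.
empty_building_cell = publishable_cell(0, {})
small_cell = publishable_cell(3, {"M 30-39": 2, "F 30-39": 1})
large_cell = publishable_cell(12, {"M": 5, "F": 7})
print(empty_building_cell, small_cell, large_cell)
```

Because uninhabited and sparsely inhabited cells receive the same label, a user cannot tell from the map whether a building in such a cell is empty.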

While the grid map data published thus far have first and foremost shown the existence of a phenomenon and its frequency – number of residents, number of people with higher education, number of unemployed persons, etc., we tried to develop a rule for publishing values instead of frequency. For instance, there is a considerable need for data on the income of residents. We have the register-based data of the Tax and Customs Board on the gross income of employees. From the Estonian National Insurance Board, we get data on allowances. These data are even more sensitive than age distribution data.

We began by grid mapping the number of persons receiving gross income. As a result, it was found that the confidentiality rule should be applied to approximately 50% of the grids populated with persons receiving gross income and the data should not be published. In the case of the publication of values, besides not publishing low values, there is a rule according to which a person’s/enterprise’s data should not exceed 90% of the total regional value, and, therefore, more grids need to be hidden. The map published according to the described confidentiality method is shown in Figure 2.
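The two conditions for publishing a value (no low counts, no dominant contributor) can be sketched as follows. The 90% share comes from the text; the minimum count of 4 is an assumption borrowed from the population rule, and the function name is hypothetical.

```python
def can_publish_value(incomes, min_count=4, max_share=0.9):
    """Check one grid cell against a frequency rule and a dominance rule.

    incomes: gross incomes of the individual recipients in the cell.
    min_count: assumed minimum number of recipients (not stated in the text).
    max_share: largest share of the cell total that one recipient may hold.
    """
    if len(incomes) < min_count:
        return False
    total = sum(incomes)
    if total <= 0:
        return False
    return max(incomes) <= max_share * total


# Hypothetical cells.
balanced = can_publish_value([1000, 1200, 900, 1100])
dominated = can_publish_value([100, 100, 100, 9000])
too_few = can_publish_value([1000, 1000])
print(balanced, dominated, too_few)
```

In the second cell one recipient holds about 97% of the total, so the cell would be hidden even though it passes the frequency rule.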

As we wished to publish the results in the map application, where data can be viewed for precise locations together with a base map/orthophoto and an address-based search can also be conducted, we tested the map on a small share of potential users. We introduced the gross income grid map at several presentations as a static map. In smaller regions, this always generated considerable interest and discussion about who the person might be who takes the gross income in the particular region to such a high level. Such discussions broke out mainly as regards higher gross income. Average and small gross income did not provoke such discussion. This led us to the conclusion that we cannot publish a gross income map according to the confidentiality rules applied. We tried to raise the approximation accuracy from 3 to 5 and 9, but were not satisfied with the result, as the map still contained separate grid cells with high average income, which could raise questions.

We therefore tested various spatial statistics pattern tools to publish statistical patterns, if they appear, instead of exact gross income values.

Figure 1. Number of persons who received gross income in 2016.


Figure 2. Average monthly gross income per employee, 2016

Figure 3. Results of Hot-Spot Analysis, average monthly gross income, 2016


The parameters used in the analysis:

• Input value: average gross income

• Conceptualization of Spatial Relationships: Fixed Distance Band

• Distance method: Euclidean Distance

• Standardization: none
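For readers unfamiliar with the statistic, the sketch below computes the Getis-Ord Gi* z-score, which is the statistic behind hot-spot analysis of this kind, with binary fixed-distance-band weights and Euclidean distance. It is a plain-Python illustration with made-up coordinates and values, not the GIS tool run used for Figure 3; the function name and the band width are assumptions.

```python
import math


def getis_ord_gi_star(points, values, band):
    """Gi* z-score per point with binary fixed-distance-band weights.

    points: list of (x, y) coordinates; values: one value per point;
    band: distance threshold in the same units as the coordinates.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for xi, yi in points:
        # Weight 1 inside the band (the point itself included), else 0.
        w = [1.0 if math.hypot(xi - xj, yi - yj) <= band else 0.0
             for xj, yj in points]
        sum_w = sum(w)
        sum_w2 = sum(wj * wj for wj in w)
        numerator = sum(wj * v for wj, v in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


# Made-up example: ten cell centres on a line, high incomes in the first three.
points = [(float(i), 0.0) for i in range(10)]
values = [100.0] * 3 + [10.0] * 7
scores = getis_ord_gi_star(points, values, band=1.5)
print([round(z, 2) for z in scores])
```

Cells with a large positive score belong to a hot spot and cells with a large negative score to a cold spot; scores close to zero are the statistically insignificant cells that make up most of the map.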

The end result is quite logical, highlighting the areas with higher average income around Tallinn and Tartu and in the coastal region. The latter is most probably the location of the residents’ second place of residence, and during the period of work they still live in Tallinn or in the surrounding areas. A Cold Spot in Narva is also quite logical: the city is characterised by a large share of persons who immigrated during the Soviet Union period, inadequate knowledge of Estonian among the residents and high unemployment, as many enterprises in the region have been closed down. Such results, however, are probably difficult to understand for the regular user. While this method guarantees confidentiality, a large part of the grid cells are statistically insignificant, which means that it is not practical to publish the results along these lines as a grid map.


Therefore we may conclude that we can publish data representing value on grid maps only about larger cities, where the number of residents is large enough and therefore there is no risk that residents may be identified. Other methods, e.g. publishing the results by way of statistical patterns may be difficult to interpret for the general public, and it is also not practical to publish them by means of grid maps.

In Estonia, we have an experience with the swapping method, when a few years ago, a GIS hobbyist published an interactive map created on the basis of grid map data, where the number of residents and their ethnic nationality on the map was building-based. The data was based on grid map data published by Statistics Estonia and data of municipalities on ethnic nationalities. The values that we had hidden were replaced randomly and the distribution into buildings was also random. The map spread quickly in the social media and got a lot of feedback. People were upset that there were too many residents in their houses or that residents had been placed in their sheds/greenhouses. Although the map came with meta-information clearly stating that the distribution of residents is random, this fact escaped the users. The Ministry of the Interior requested the removal of the map from the internet. Using the swapping method on a grid map may bring about a similar reaction, although we cannot be completely sure of that today. However, due to the above-mentioned negative experience, we will not use the swapping method, as we do not want negative attention, which may tarnish Statistics Estonia’s image.

Today we are using the perturbation method, and it has proved successful. Users have accepted it and are used to it. In the interest of consistency, we do not often want to change our methods. We also do not want to use different methods to guarantee confidentiality when we publish various data.

Challenges: It is evident for Estonia that there is a strong interest in balancing data confidentiality and data quality. We should be able to provide for and learn from the fact that data miners can combine data from diverse sources with new technology and methodological tools. There is a need to also conduct cognitive research regarding the factors influencing data privacy. Data quality varies according to user and use. When perturbation is applied, it is difficult to determine the optimum balance between confidentiality and quality.

The rules were not so stringent when the 2011 census data were published. The rules were made stricter with the publication of the web map application, which made it possible to see the grid map data simultaneously with the base map. When the 2020 population census data are published, we can provide Eurostat with the precise number of population and sex-age distribution data, but we would like some of these values to be labelled with the confidentiality flag. These data should be treated by Eurostat as confidential in the future.

Grid map resolutions

2) A grid map with resolution 1 km x 1 km has been published for the whole territory of Estonia.

3) A grid map with resolution 500 m x 500 m has been published for all cities and the adjacent areas. Until now, all settlements classified as cities according to the classification of Estonian administrative units and settlements (EHAK) have been included, irrespective of the number of inhabitants and population density. We have now reviewed this approach because many 500 m x 500 m grid cells were sparsely populated or unpopulated. The new thresholds for 500 m x 500 m grids take into account the density and number of population. We have selected areas with a population density of at least 200 people per square kilometre where the number of population in an area with that density is at least 5000. A 500 m x 500 m grid map within these limits will be published starting from 2018.

4) SE has decided that population grid maps with resolution 100 m x 100 m will be published only for five major cities: Tallinn, Tartu, Narva, Pärnu and Kohtla-Järve.

5) We also noted for future censuses that the following rules will be used for all grid resolutions: small values are not published, displaying <4 and in these cases, sex-age distribution is not published either. If a grid contains only one detached house where nobody lives, the number of population is published as <4. In addition to the number of population, we also count the number of dwellings: if the number of dwellings is smaller than two, the age-sex distribution would not be published.
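The area selection for the 500 m x 500 m grid map described in item 3) could be sketched as follows. The thresholds (200 people per square kilometre, 5000 people in total) come from the text; everything else is an assumption made for illustration: density is evaluated on 1 km x 1 km cells, an "area" is taken to be a group of dense cells that touch each other (including diagonally), and the function name and example data are hypothetical.

```python
def select_dense_areas(cells, min_density=200, min_total=5000):
    """Return the 1 km x 1 km cells that belong to qualifying areas.

    cells: dict mapping (col, row) -> population of that 1 km2 cell, so the
    population count equals the density per square kilometre.
    """
    dense = {c for c, pop in cells.items() if pop >= min_density}
    selected, seen = set(), set()
    for start in dense:
        if start in seen:
            continue
        # Flood fill collects one group of contiguous dense cells.
        cluster, stack = set(), [start]
        seen.add(start)
        while stack:
            col, row = stack.pop()
            cluster.add((col, row))
            for dc in (-1, 0, 1):
                for dr in (-1, 0, 1):
                    nb = (col + dc, row + dr)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        if sum(cells[c] for c in cluster) >= min_total:
            selected |= cluster
    return selected


# Hypothetical grid: one town large enough to qualify and one small village.
grid = {(0, 0): 3000, (1, 0): 2500, (0, 1): 50, (5, 5): 400, (6, 5): 300}
selected_cells = select_dense_areas(grid)
print(sorted(selected_cells))
```

Only the selected cells would then be subdivided into 500 m x 500 m cells, which keeps sparsely populated or unpopulated cells out of the finer map.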


Grid map data

The grant work demonstrated that problems with grid-based data currently result from the quality of registers. There are people in the registers (according to estimates, approximately 10%) who cannot be linked to building level. For some of them this is because the addresses are not correct, but for the majority the address is known only at the local municipality level. We do not know yet how this will be affected by the law amendment that will come into force at the beginning of 2019, which allows a person to have no residential address in the population register. Today, we are using a solution where people without a detailed residential address are linked to the administrative unit centre. However, this has led to overestimations in the grid cells where these centres are located.

For the future, the grant work showed that birth and death data cannot be published in grids for the whole of Estonia because these events are too few and the values 1 and 2 cannot be published. The absolute figures of births and deaths can probably be published as a grid map with resolution 1 km for major cities, but definitely not broken down further, for example by age of mother or birth order of the child.

1.10.1 Thematic studies

Study Migration and Grid Map of Daytime Population: Feasibility of Register Based Production

A major purpose of population censuses has been to get information about small regions and to identify day-to-day migration between work and place of residence. To solve various situations and reduce risks, we need to know the location of population in the daytime. PHC 2021 in Estonia is being planned to be conducted on the basis of registers. Therefore, it is necessary to evaluate whether the data in the registers are suitable for analysing the study migration and mapping of daytime population. The suitability evaluation was based on a comparison with the respective results of PHC 2011.


Study migration

By Maret Muusikus and Ülle Valgma

Methods

The place of residence data of students are from the population register as at 1 January 2016. The population register has been linked to the address data system (ADS) for a couple of years already, and the excerpts from the register therefore contain identifiers of the address objects, which make it possible to add places of residence to the map with minimal effort. ADS works very well for new place of residence entries where the registration of a new event or change of residence has required updating of the address or adjustment of the existing one in the Population Register. The problem lies in linking old addresses to ADS, i.e. those addresses which have not been needed for any procedure in the population register. Addresses that are not geocoded to the level of dwelling/building are linked to the nearest possible object: the settlement, municipality or city district centroid.

Information about students and the schools they attend corresponds to the situation in the Estonian Education Information System (EHIS) as at 1 January 2016. EHIS has also implemented ADS addresses; however, the addresses were not correctly normalised for all schools. Feedback was given to EHIS about the inaccurate addresses identified in the current analysis, and it is hoped that the data will be corrected. Within the grant project, only the study migration of secondary school students was analysed because it enabled comparison with the population census data of 2011.

To use the address data system, public geocoding services have been developed (http://xgis.maaamet.ee/adsavalik/ads). The locations of schools were geocoded based on their addresses, using the named services.

Distance to school was calculated using ArcGIS Network Analyst. For each student, both the distance by road to the school attended and the distance to the nearest school were calculated. The road information originates from the Estonian Land Board Topographic Database (ETAK). The students whose place of residence was not known to the accuracy of building were omitted from the analysis of distances. This is due to the quality of residence data in the population register, where the place of residence of some people is known only to the accuracy of municipality or settlement unit.


The ArcGIS Network Analyst extension was used to calculate the nearest school for each student in order to compare the actual and shortest distance to school.
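The idea behind the comparison can be shown with a small shortest-path sketch. This is not the Network Analyst workflow itself: it substitutes a plain Dijkstra search on a toy road graph, and the node names, edge lengths and function names are hypothetical.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra: road distance in km from source to every reachable node.

    graph: dict node -> list of (neighbour, length_km) edges.
    """
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, length in graph.get(node, []):
            nd = d + length
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(queue, (nd, neighbour))
    return dist


def compare_actual_and_nearest(graph, home, actual_school, schools):
    """Return (distance to actual school, nearest school, distance to it)."""
    dist = shortest_distances(graph, home)
    reachable = [s for s in schools if s in dist]
    nearest = min(reachable, key=lambda s: dist[s])
    return dist.get(actual_school), nearest, dist[nearest]


# Toy road network (lengths in km).
roads = {
    "home": [("a", 2.0), ("b", 5.0)],
    "a": [("home", 2.0), ("school_x", 1.5)],
    "b": [("home", 5.0), ("school_y", 1.0)],
    "school_x": [("a", 1.5)],
    "school_y": [("b", 1.0)],
}
result = compare_actual_and_nearest(roads, "home", "school_y", ["school_x", "school_y"])
print(result)
```

A student counts as attending the nearest school only when the first and third values coincide.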

Results and analysis

Most frequently, students go to a secondary school that is near home (Figure 1). One third of the secondary school students go to a school that is within 2 km from home. The median distance to school for secondary school students is 4.3 km. Their distance to school is longer mainly because there is no secondary school in every municipality, some students have special needs due to their health, a school is selected based on the elective subjects/field of study and language of instruction. Additional factors are the location of non-stationary studies /secondary schools for the adults, which are more sparsely located than ordinary secondary schools. Average distance to school is 15.4 km. Very long distances to school are very few.

A secondary school near home does not always mean the nearest secondary school. 28% of the secondary school students go to the nearest school calculated based on the road network. The actual as well as nearest secondary school is located in the same municipality for 75% of the students. The respective percentage for the county is 91%.

Figure 1. Distance to school (bar chart; x-axis: distance in km, in bands from 0-0.9 to 200.0-333; y-axis: number of students, 0 to 4500)


The students with special educational needs have the longest average distance to school: Helen school in Tallinn (142 km) and Emajõe school in Tartu (128 km). These are followed by schools with a domain-specific field of study: Noarootsi Gymnasium (119 km), Audentes Sports Gymnasium (91 km), Nõo Science Gymnasium (75 km) and Tallinn Music High School (64 km). Since there are few secondary schools for adults, the average distance to school for adults obtaining secondary education is 22.7 km.

Map 1 depicts the secondary schools with different catchment areas. For example, the catchment area of Nõo Science Gymnasium is definitely all of Estonia. At the same time, the secondary schools of Taebla and Pärnu-Jaagupi are local schools. The service area of the secondary school in Vändra is somewhat larger because many neighbouring municipalities do not have a secondary school. Tallinn Õismäe Russian Lyceum is rather a local school with a few students coming from afar. Very long distances to school may also mean that the place of residence in the Population Register is not the actual place of residence, as in Estonia there are some discrepancies between the actual and registered place of residence. However, this analysis does not make it possible to conclude that these data are inaccurate. Most probably, very long distances to school are not travelled every day and these students could have two places of residence, a temporary and a permanent one. For example, Nõo Science Gymnasium has its own dormitory, which is not the students’ permanent place of residence.

Map 1. Catchment areas of the selected secondary schools


Compared to the commuting analysis of secondary school students of 2011, the results are similar (see map 2 and map 3). The most noticeable changes are caused by some secondary schools being closed down, which leads to changes in regional study migration flows. The most vivid example is the closing of Käina secondary school, as a result of which students from all across Hiiumaa travelled to Kärdla in 2016. In 2011, Käina secondary school had a clear catchment area. At the same time, the closing down of secondary schools in Järvakandi and Turba, for example, did not involve significant changes in the student migration flows because in 2011 these schools had no distinct catchment areas.


1.10.2 Methodology for compiling daytime and night-time population, study migration

By Ülle Valgma

The daytime and night-time population methodology was applied to the population of Estonia as at 01.01.2016, when the number of permanent residents was 1 315 944.

1. For night-time population, a person’s place of residence address is used. Persons with missing place of residence were excluded from daytime and night-time population.

2. Daytime population is based on register-based activity status during 14.12– 20.12.2015.

2.1 Place of residence address is used for persons categorised as ‘unemployed’, ‘pensioner’ or ‘other’.

2.2 Location of school/kindergarten is used for students and persons who were below 15 years of age. Estonian Education Information System (EHIS) holds data about educational institutions and their addresses. Each educational institution has a unique ID, which can be used to link its address to a student. If a person below 15 years of age was not registered in EHIS at the end of 2015, their place of residence address was used instead.

2.3 Location of place of work is used for employed persons whose main source is the Employment Register (TÖR). TÖR will start collecting data about occupation and location of place of work in 2018, therefore these data were not available for compiling daytime population for this grant. The Structure of Earnings Survey (SES), a sample survey from October 2014, and the Statistical Register for Enterprises (SPI), a register managed by Statistics Estonia, were used for determining place of work. If an enterprise operates from more than one place, SES data are used to randomly sort employees into local units (i.e. parts of an enterprise) based on SES distributions. An enterprise’s registered location is taken from SPI if it has only one place from which it operates or is not sampled in SES. As a result of these conditions, 22% of the employed were assigned a place of work from SES and 77% from SPI. For 1% of the employed, the location of the place of work was unknown and the person’s place of residence address was used instead.
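The rules in items 1 to 2.3 amount to a simple decision procedure, sketched below. The record layout (a dictionary with the keys status, age, home, school and workplace) and the function name are hypothetical; the fallbacks to the place of residence follow the text.

```python
def daytime_location(person):
    """Pick the address used as a person's daytime location.

    person: dict with keys 'status', 'age' and 'home', and optionally
    'school' and 'workplace' (None or absent when unknown).
    """
    home = person.get("home")
    if home is None:
        return None  # excluded from daytime and night-time population
    if person["age"] < 15 or person["status"] == "student":
        return person.get("school") or home
    if person["status"] == "employed":
        return person.get("workplace") or home
    # 'unemployed', 'pensioner' and 'other' stay at the place of residence.
    return home


# Hypothetical records.
child = {"status": "other", "age": 6, "home": "H1", "school": "K1"}
worker = {"status": "employed", "age": 40, "home": "H2", "workplace": None}
pensioner = {"status": "pensioner", "age": 70, "home": "H3"}
print(daytime_location(child), daytime_location(worker), daytime_location(pensioner))
```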

The same methodology is used for students and persons who were below 15 years of age to compile study migration. In addition, employed persons who were registered in EHIS at the end of 2015 were included in study migration.

Difficulties

1. Educational institution addresses from EHIS needed additional processing to make them more accurate.

2. There was no information about local units in registers, therefore data from sample survey SES was used.

Daytime population

By Ülle Valgma

Methodology

Daytime population data are based on estimated locations of people in the daytime. It was assumed that persons employed are in their workplace, students in universities or schools, kindergarten children in kindergartens and other people at home (retired, unemployed, other non-working people). Since prisoners and conscripts were not separately studied and their employment status was not identified in the pilot census, most of them are registered in the daytime population data based on the registered place of residence in the Population Register. Everyday movements of people for the consumption of services and recreation were not taken into consideration.

The source for daytime population data is the pilot census data as at 1 January 2016 (1,315,944 inhabitants), excluding persons whose place of residence was unknown (1574 inhabitants). Employment status was used as a basis for determining daytime location. As employment status is not specified based on one day, the first full working week before the moment of the census was used (14 December 2015 – 20 December 2015). Based on employment status, daytime locations were divided as follows:

• Persons under 15 years of age and students – daytime location is an educational institution; the data source is the Estonian Education Information System (EHIS).

• Employed persons – daytime location is the location of the job. The data sources are:

  • the Structure of Earnings Survey. The job location was obtained at settlement level for kind-of-activity units. The Structure of Earnings Survey was the source of workplace data for those enterprises which were represented in the survey sample and operated in more than one location;

  • the Statistical Business Register, which includes the locations of economically active enterprises with the legal address. The location of the job was obtained from the Statistical Business Register for all enterprises for which the data of the Structure of Earnings Survey were not used.

• Unemployed, retired and other persons – place of residence was used as their daytime location; the data source was the Population Register.

The location of the place of residence was also used for those employed persons, students and persons under 15 years of age whose daytime location was not successfully determined.

If an enterprise had several activity locations, i.e. activity units, the employees of these enterprises were distributed into kind-of-activity units according to the proportions shown by the Structure of Earnings Survey. For example, if according to the Structure of Earnings Survey, 25% of employees of an enterprise worked in Tartu, the same percentage share of employees of this enterprise were placed in Tartu in the daytime population dataset.
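A minimal sketch of such a proportional distribution follows. It assumes that the survey proportions are given as shares summing to 1, shuffles the employees so that the choice of persons is random, and uses largest-remainder rounding so that the counts add up to the headcount; the function name, the seed and the example shares are hypothetical.

```python
import random


def distribute_employees(employee_ids, shares, seed=0):
    """Assign employees to locations so that counts follow the given shares.

    shares: dict location -> share of employees (shares sum to 1).
    Largest-remainder rounding keeps the total equal to the headcount.
    """
    n = len(employee_ids)
    exact = {loc: share * n for loc, share in shares.items()}
    counts = {loc: int(x) for loc, x in exact.items()}
    leftover = n - sum(counts.values())
    by_remainder = sorted(shares, key=lambda loc: exact[loc] - counts[loc],
                          reverse=True)
    for loc in by_remainder[:leftover]:
        counts[loc] += 1
    shuffled = list(employee_ids)
    random.Random(seed).shuffle(shuffled)  # which persons go where is random
    assignment, position = {}, 0
    for loc, count in counts.items():
        for emp in shuffled[position:position + count]:
            assignment[emp] = loc
        position += count
    return assignment


# Example from the text: 25% of an enterprise's employees work in Tartu.
assignment = distribute_employees(list(range(8)), {"Tartu": 0.25, "Tallinn": 0.75})
tartu_count = sum(1 for loc in assignment.values() if loc == "Tartu")
print(tartu_count, len(assignment))
```

Because individual persons are placed at random, the result is meaningful only for totals per location, not for any breakdown by personal characteristics.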

For the daytime population grid map, point-based locations, which were known at dwelling, building or cadastral unit level, were aggregated. The exact location was known for 86% of inhabitants (1,133,296 inhabitants). Daytime locations which were known at settlement level (4% or 56,400 inhabitants) were randomly distributed on the territory of the settlement unit, taking into account the locations of buildings in the settlement unit. The locations of jobs known at the level of Tallinn and Kohtla-Järve city districts and municipalities, totalling 10%, were proportionally distributed based on the 2011 daytime population grid map. In the beginning, attempts were made to distribute them randomly based on the known territory, but this method did not work well, mainly because it did not take into account the division of cities into business districts and residential districts, and distributed the working population evenly across the territory of the municipality.
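One way to distribute persons in proportion to a reference grid is a weighted random draw, sketched below. Whether the actual distribution was random or deterministic is not stated, so the weighted draw is an assumption, as are the function name, the seed and the example counts.

```python
import random


def place_by_reference_grid(n_persons, reference_counts, seed=0):
    """Randomly place persons into grid cells in proportion to reference counts.

    reference_counts: dict cell_id -> daytime population of the cell in the
    reference grid (e.g. 2011), restricted to one district or municipality.
    """
    cells = [c for c, count in reference_counts.items() if count > 0]
    weights = [reference_counts[c] for c in cells]
    rng = random.Random(seed)
    placed = rng.choices(cells, weights=weights, k=n_persons)
    result = {c: 0 for c in reference_counts}
    for cell in placed:
        result[cell] += 1
    return result


# Hypothetical district: a business cell, a residential cell and an empty cell.
reference_2011 = {"business": 9000, "residential": 1000, "forest": 0}
placement = place_by_reference_grid(500, reference_2011)
print(placement)
```

Unlike a uniform spread over the territory, the weights concentrate the working population in cells that already had a large daytime population in the reference year.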

As 14% of the data are randomly placed and persons have also been randomly distributed between kind-of-activity units, a grid map was drawn only for the total population.

Results and analysis

A 1 km x 1 km grid map was created for analysing daytime population data.

The map displaying distribution of daytime population (map 4) indicates concentration of population in urban settlements in the daytime. Daytime population is the densest in the city centre of Tallinn, around Estonia Puiestee, where according to estimates, 26,000 people are located per square kilometre in the daytime. Daytime population data are also published in the statistics map application.

Since approximately 100,000 people (7% of the population) whose job location was unknown or abroad were excluded from the daytime population map of 2011, the results of 2011 and 2016 are not directly comparable.

Map 5 describes changes in the nighttime and daytime population in municipalities. It shows that population concentrates in major cities in the daytime, whereas the population of their neighbouring municipalities decreases. Daytime population is bigger than nighttime population only in a few municipalities where industrial parks are located, for example, in Rae rural municipality. Map 6 depicts a similar change at a more detailed level, on a grid map. Daytime population increases the most in the city centre of Tallinn, by more than 10,000 people. A population increase is also perceptible in county centres. The increases happen at the expense of the neighbouring municipalities of these cities as well as at the expense of bedroom suburbs. The biggest daytime population decrease is at Mustamäe, in the area between Vilde, Tammsaare and Mustamäe Tee in Tallinn (5400 people). In 2011, the district with the biggest decrease was at Lasnamäe, between Muhu and Linnamäe Tee (more than 6500). The latter district lost 5000 people in the daytime in 2016. These changes are also partly due to changes in the age composition. For example, in 2011, the number of 65-69 year old people in the district of Muhu Street and Linnamäe Tee was 252, which by 2016 had more than doubled, to 619 inhabitants.

Map 7 pictures the quality of daytime population, showing the share of randomly distributed daytime population across grid cells. 14% of the daytime population was placed randomly: 4% within the settlement unit, 5% within the city district and 5% within the municipality. Although in the cities the percentage shares of the randomly distributed population in the grid are smaller, the absolute numbers are several times bigger. The number of randomly distributed persons is the biggest in the city centre of Tallinn: 3900 people, which, however, is only 15% of the daytime population of the grid cell. Even if the randomly distributed persons are deducted from the daytime population of that grid cell, it still remains the most densely populated one.


Map 5. Nighttime versus daytime population change in municipalities, 2016


Conclusions

The results of the study migration analysis conducted under the grant agreement demonstrate that it is possible to analyse study migration and daytime location of population based on the register data. The secondary school migration has not significantly changed since 2011. In general, high schools close to home are preferred.

It is possible to create a daytime population grid map. A daytime population grid map evaluates the distribution of population in the daytime. 14% of the inhabitants have randomly placed jobs.

The results have been published in Statistics Estonia blogs and on the EFGS website.


1.11 Study of confidentiality issues for renewal of PHC 2000 and PHC 2011 datasets for releasing grid-based data

By Ülle Valgma

One of the objectives of population censuses is to collect population and housing data on small areas, which are not reflected in regular annual statistics. The development of geoinformation systems has made it possible to distribute and analyse data on small areas irrespective of administrative or settlement division. This is supplemented by various applications for viewing and searching maps. This creates additional requirements for data confidentiality.

There are no specific legislative restrictions on the release of spatial data. The applicable rules are derived from other legal acts. The spatial aspect is often difficult to grasp for persons who do not work with spatial data, which is why they cannot assess potential confidentiality issues in this context. The confidentiality rules applied by Statistics Estonia to the release of spatial data have changed as a result of the general development of the geoinformation field and the increased availability of various geoinformation services. For this reason, we have traced the development of the field over time.

Confidentiality conditions for grid-based data in PHC 2000

Geoinformation systems were adopted in Estonia for the organisation of the census of 2000 and for the distribution of results. Geocoding of addresses made it possible to create a geoinformation system based on building data. This made it possible, for the first time in Estonia, to release grid-based data of the population and housing census and to extract information on various smaller regions from the census results, irrespective of administrative or settlement divisions. The data were released as a 1,000 x 1,000 metre grid covering the whole of Estonia. The release included main indicators of residents, dwellings and households. The release was in conformity with the Official Statistics Act and the Personal Data Protection Act. As the web services for viewing spatial data and making spatial queries were not widely used in Estonia at the time, the only restriction in releasing the data was the requirement to hide small values. The following measures were used to prevent indirect identification of grid-based data:

• Aggregate value of individual indicators per grid square was not released;

• Indicators were published on a standalone basis, i.e., without releasing cross-sections of different indicators;

• The values 1 and 2 were replaced with 99,999;

• Data releases were subject to contracts, which specified the purpose of data use.
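The small-value rule above can be illustrated with a minimal sketch. The square identifiers and counts are hypothetical, and the function name `mask_small_values` is an assumption made for illustration only; it is not the procedure actually used for PHC 2000.

```python
SUPPRESSED = 99_999  # placeholder that replaced the values 1 and 2 in the release


def mask_small_values(counts):
    """Return a copy of per-square counts with the values 1 and 2 masked."""
    return {
        square: (SUPPRESSED if value in (1, 2) else value)
        for square, value in counts.items()
    }


# Hypothetical grid-square identifiers and dwelling counts.
dwellings = {"sq_001": 0, "sq_002": 1, "sq_003": 2, "sq_004": 7}
masked = mask_small_values(dwellings)
```

Note that a masked cell still tells the user that the true value is 1 or 2, which is the starting point of the disclosure problem described in the following paragraphs.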

The advantage of such release was that settled and unsettled regions corresponded to the real situation. The disadvantage was that indirect identification of particular persons was still possible. However, as interactive online and smart applications for viewing spatial data and combining different data elements were not available, it was not possible to view data together with the base map or an orthophoto. This data was available only to the users of the GIS software if needed. Data protection regulations had only recently been adopted in Estonia. The confidentiality rules were included in the Official Statistics Act.

Results of PHC 2000, i.e., data that required confidentiality

The 1x1 km grid map, which was based on the results of PHC 2000, included 21,465 grid squares. This included 10,063 squares (47% of the squares) with only 1 or 2 buildings with dwellings.

Selecting a random square with a hidden total number of dwellings (value 99,999) showed only one value in the dwelling area breakdown, in the group “area over 100 m2”. This value was 99,999. This data seemed to indicate that the square includes 1 or 2 dwellings with an area over 100 m2. However, it was possible to query other grid-based data of PHC 2000 for the same square to find out that the dwelling(s) in the square have 4+ rooms, that they are private houses built in the period 1919–1945, and that the resident is a single, employed, economically active male Estonian citizen in the age group of 31–40 years. All this information is presented in different tables with the value 99,999.

Selecting another random square with a hidden total number of dwellings (value 99,999), it was also possible to see the value 99,999 in the dwelling area columns “area 40–49 m2” and “area 80–89 m2”. This data seemed to indicate that the square includes 2 dwellings with areas of 40–49 m2 and 80–89 m2. Querying other PHC indicators for the same square revealed that both dwellings have 3 rooms and are located in private houses; one was built before 1919 and the other in the period 1946–1960. There are 5 people living in these dwellings. They are all Estonian citizens. One of them is single and 4 are legally married. They belong to age groups 16–20, 41–50, 51–64 and 65+. It is likely that the single person is in the age group 16–20. Three of the persons are economically active. This information is also presented in different tables with the value 99,999, except for the number of employed persons, which is 3.

Consequently, the described method of presenting information was not suitable, and the rules for releasing grid-based data had to be supplemented with a new rule for substituting the values 1 and 2, taking into account the possibility that users may receive data on multiple indicators, which would enable them to combine different tables.
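The first example above can be reproduced with a short sketch: if a square is known to hold only one or two units and each separately released table has exactly one non-zero category, combining the tables reveals every attribute. The table names and the simplified category labels are hypothetical, and `revealed_attributes` is an illustrative helper, not part of any released tool.

```python
SUPPRESSED = 99_999  # masked value standing for "1 or 2"


def revealed_attributes(tables):
    """For one grid square, return the single non-zero category of each table.

    A table with exactly one non-zero category discloses that attribute for
    every unit in the square, even though the count itself is masked.
    """
    revealed = {}
    for table_name, row in tables.items():
        nonzero = [category for category, value in row.items() if value != 0]
        if len(nonzero) == 1:
            revealed[table_name] = nonzero[0]
    return revealed


# Hypothetical, simplified tables for one square, each released separately.
square = {
    "dwelling_area": {"under 100 m2": 0, "over 100 m2": SUPPRESSED},
    "rooms": {"1-3": 0, "4+": SUPPRESSED},
    "construction_period": {
        "before 1919": 0,
        "1919-1945": SUPPRESSED,
        "1946-1960": 0,
    },
}
profile = revealed_attributes(square)
```

Publishing indicators on a standalone basis therefore does not help when the same square identifier appears in every table.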

The Official Statistics Act was amended in 2010, and §35(7) specifies that a producer of official statistics may disseminate data that allow indirect identification of a natural person by sex, age and settlement unit without the consent of the person. In terms of size, 1 km2 grid squares are comparable to many smaller Estonian settlement units. Initially, the legal provision was interpreted as being applicable to the release of grid map data and, after the law entered into force, the PHC 2000 data on population size, sex and age distribution were released in an unhidden format but the data were still released only for specific use under a contract.

Confidentiality conditions for grid-based data in PHC 2011

In 2009, the Estonian Land Board started providing a web map service (WMS), which enabled users of GIS software to use the Land Board’s base maps. A statistical map application was developed for the release of PHC 2011 data, with functions including information views, map compilation, downloads, selection of different base maps, and address search. The map application is available free of charge. The only restriction is that data users have to refer to Statistics Estonia as the source of data. This resulted in a possibility to search for a particular address and to view the census data for the same area. Consequently, confidentiality rules had to be reviewed and made stricter. It was no longer sufficient to hide small values as required by law and the problem had to be resolved in a more comprehensive manner. It was also important to prevent the possibility of identifying vacant dwellings or single elderly persons in sparsely populated areas. As a result, new rules were developed for releasing grid-based data in the map application:

• Display a grid square value of <4 if the square includes fewer than 4 inhabitants or the square includes vacant dwellings;

• Do not display sex and age group values for squares that include fewer than 6 inhabitants or fewer than 3 conventional dwellings with fewer than 20 inhabitants in total;

• Apply the directed rounding method with a base of 3 for disseminating data on households, dwellings, economic activity and education of residents;

• Do not use the grid-based release method for sensitive data, such as religion and ethnicity.

The map application displays data in the format of a 1-kilometre grid. In addition, we created a 500-metre grid for densely populated areas (all cities with surrounding settlements) and a 100-metre grid for the five largest cities (Tallinn, Tartu, Pärnu, Narva, and Kohtla-Järve). However, we only release population size data on those grids, and no other, more detailed sections.

The map application also enables spatial queries based on building data. The queries are compiled based on original data, but the released results are rounded to the nearest 10. Consequently, the result may differ from the actual value by up to 5 in either direction.
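A minimal sketch of rounding a query result to the nearest 10 is shown below. The tie-breaking rule (values ending in 5 are rounded up) is an assumption, as the rule used in the map application is not stated here.

```python
def round_to_base(value, base=10):
    """Round a non-negative count to the nearest multiple of `base`.

    Assumption: exact halves (e.g. 15 with base 10) are rounded up.
    """
    if value < 0:
        raise ValueError("counts must be non-negative")
    return base * ((value + base // 2) // base)


# The released value never differs from the true count by more than 5.
max_error = max(abs(round_to_base(v) - v) for v in range(1000))
```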

Description of the portion of population/dwellings affected by confidentiality issues

Estonia's 1 km grid data includes 1,979 grid squares with only one inhabitant and 2,324 squares with two inhabitants. In total, these 4,303 squares have 12,900 inhabitants, i.e., 0.94% of the total population (including only residents with geocoded dwellings – 1,369,996 in total). The total number of settled squares is 21,723. Squares with only one or two inhabitants constitute 19.8% of all settled grid squares. These calculations apply to the entire Estonian territory. When developing confidentiality rules for grid-based data, it should be remembered that spatial queries are also made for small areas. In this case, the results are somewhat different depending on whether they apply to squares or inhabitants of sparsely or densely populated areas. For instance, squares with one or two inhabitants constitute 19.1% of all settled squares in Põlva county but 22.5% of all settled squares in Ida-Viru county. The respective percentages of population are 1.35% and 0.21%.

The Estonian territory includes 23,487 grid squares with no inhabitants.

Impact of data rounding on the result

The rounding method is used to conceal small values in the table by modifying the actual values up or down so that the sum would be as close to the actual value as possible (Figure 1).

Figure 1. Grid-based level of education, rounded to a base of 3, 5 and 10

Tables are rounded with a base of 3, i.e., all values are rounded to the nearest figure divisible by 3, which protects the values 1 and 2. This limits the volume of information noise generated, while still preserving the possibility to release data on small units. For instance, in grid-based data, the values of squares that are below 3 are rounded to zero or three, and the rest are rounded to the nearest figure divisible by 3. In addition, the individual breakdown values of an indicator are not displayed if the total value per square is less than 3 (i.e., in case of age groups, the number of people in different age groups is not shown). The values of the most detailed disclosed level differ from the actual values by one or two persons or dwellings. The difference can be larger in case of sums, but is always below 1%. Special tau-Argus software is used to apply this method.
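The cell-level behaviour described above can be sketched as follows. This is not the tau-Argus procedure, which also controls table totals; it only illustrates rounding single cells to base 3. The rule for choosing between 0 and 3 for the values 1 and 2 is an assumption (unbiased random rounding), and `round_base3` is a hypothetical helper name.

```python
import random


def round_base3(value, rng):
    """Round one non-negative cell value to a multiple of 3.

    Values 1 and 2 go to 0 or 3 at random (assumed: probability of 3 is
    value / 3, which keeps the expected value unchanged); values of 3 or
    more go to the nearest multiple of 3.
    """
    if value < 0:
        raise ValueError("counts must be non-negative")
    remainder = value % 3
    if remainder == 0:
        return value
    if value < 3:
        return 3 if rng.random() < value / 3 else 0
    # remainder 1 rounds down, remainder 2 rounds up
    return value - 1 if remainder == 1 else value + 1


rng = random.Random(2011)  # fixed seed only to make the example repeatable
counts = [1, 2, 4, 5, 9, 14]
rounded = [round_base3(c, rng) for c in counts]
```

Each rounded cell is divisible by 3 and differs from the true value by at most 2, in line with the statement that the most detailed values differ by one or two persons or dwellings.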

Release of centroid data

A centroid is the centre point of an enumerated residential building or a building that includes dwellings. A centroid can be linked to building data: number of inhabitants, number of dwellings, building area, type of building, and date of construction.

In response to information queries, we have released centroid data combined with building data from PHC 2000, but without data on inhabitants. These data have been released under contracts for research purposes to research institutions, the Pärnu Local Authorities Association, and the Land Board.

The centroids of PHC 2011 are not released. Centroid-based spatial queries are possible in the VKR output map application. The rule of rounding to the nearest 10 is applied to ensure confidentiality.

1.12 Exchange of experiences: study visits to the Dutch Statistics and Statistics Norway

In order to create a well-functioning system of statistical registers and to hold a register-based census successfully, Statistics Estonia needed to exchange know-how and study the experiences of other countries. We decided to make study visits to Norway and the Netherlands. We had adopted the concept of the system of statistical registers used in Norway, but we had questions about how to implement the system and how to keep it updated. We were also interested in becoming familiar with the methodological basis and systems that the Dutch Statistics has adopted to ensure quality.

At the Dutch Statistics

The last complete enumeration in the Netherlands was in 1971. The Dutch Statistics conducts a register-based census. For the 2011 Census, census experts compiled the required census tables by combining existing register data with sample survey data.

The Netherlands has a good methodological basis and systems that have been adopted to ensure quality. They have a quality framework for administrative data sources, the “Checklist for the Quality evaluation of Administrative Data Sources”.

Summary:

Administrative data quality directly affects register-based census results. Process quality is measured by best methods, cost efficiency and low response burden. Nowadays the population and enterprises are overburdened with questions, and the correctness of answers suffers as a result. Lowering the response burden has to be the goal.

When considering data collection methods (census, surveys, administrative data), the decision is made based on cost efficiency, response burden and product quality. Even though the register-based approach is lowest in cost, continuous quality checks are necessary because administrative data is collected primarily for administrative purposes, not for the census.

Use of administrative data means no additional response burden about data that can already be found in registers. Register data is based on administrative definitions that may differ from statistical definitions. When using administrative data the NSIs have to acknowledge that they don’t have full control of the data.

Register-based approach has proved to be efficient because statistics can be produced annually or even more frequently. It’s possible to produce statistics about small groups which sample surveys don’t cover.

Using administrative data lowers costs and response burden but makes the NSIs dependent on data sources that are collected and maintained by others.

Data confidentiality in the Netherlands is secured by using anonymous linkage keys. Access rights to the microdata are restricted.

The virtual census has proved to be a successful concept in the Netherlands. The quality framework is a useful tool for making data decisions in the virtual census. After the 2011 census the Education Register was improved (e.g., since 2016 it has contained information on private education institutions).

Mr E. Schulte Nordholt emphasized that EU countries can learn a lot from each other. Sharing information about what works, what doesn’t and about failures is beneficial.

At Statistics Norway

Over a long period, Norway has moved gradually from a traditional census through a combined census to a fully register-based census (the first in 2011).

Statistics Norway has always played a central role in Norwegian society. They had influence and confidence. Statistics Norway was the producer of “register ideology”.

Nordic countries’ statistical developments are the best in the world and Statistics Norway has advanced data collecting and updating.

Statistics Norway has developed quality indicators for registers. They produce them monthly and give feedback to register owners. Statistics Norway has published the occupation variable since 2003. Cooperation between employers and the statistical office works well, and steps have been taken over the years to improve it even further.

Summary

The two-day study visit was divided between two cities and three institutions. On the 27th of March we stopped by the Directorate of Taxes, the Service Center for Foreign Workers and the central office of Statistics Norway in Oslo. The first day lasted about seven and a half hours, including the time spent on transportation and lunch. On the 28th of March we travelled by train to Kongsvinger where the branch office of Statistics Norway is located. The second day lasted about nine hours. Discussions took place during travelling also.

The aim of the study visit was to find solutions to the problems discovered during the first trial census in 2016 by visiting countries and sharing experience in conducting a register-based census. Norway had its first combined census in 1980 and its first fully register-based census in 2011. The last obstacle before going fully register-based was creating a dwelling number variable, which did not exist before. Census data was used to establish registers, which were then used in the next census once their quality became sufficient.

Long-standing cooperation between Nordic countries has been beneficial to all parties when it comes to missing and outdated information (e.g., information about place of residence and education).

In Norway, people usually register their place of residence correctly because there are almost no incentives to do otherwise. Norway has published occupation data since 2003. Before 2015 there were quality issues, but cooperation with employers has improved and new agreements have been made. Since 2017, sanctions have been applied, with positive results, to employers who do not report on time or who deliberately report incorrect data.

Norway has advanced data collecting. Automatic data transmissions are conducted daily. Statistics Norway calculates quality indicators and gives regular feedback to register owners.

Statistics Norway has always played a central role in Norwegian society. Their methodology, including the register-based approach, has always been accepted by the public. Influence and confidence are something Statistics Estonia needs to strive for in society.

II. Detailed evaluation of the results of the action including assessment of the quality of data

The quality report on the assessment of register data was prepared in the framework of the grant activity “Adapting quality framework for the evaluation of administrative data”. Under this activity, quality assessment is done not only to inform users but also to enable the producers of statistics to monitor data quality for the purpose of continuous improvement. The assessment instructions were built upon two standards: the ESS Code of Practice (CoP) and the Generic Statistical Business Process Model (GSBPM). The CoP principles are a major part of the motivation to assess the quality of statistical products and to inform users about it; GSBPM was used to organise the quality measures and indicators. Statistics Estonia has decided to increase the use of administrative registers for statistical purposes. The target is to prepare for a register-based census using an information technology solution to maximise automated data collection; this also requires interoperability of information technology. The evaluation of registers was intended to build up a continuous dialogue between the data user and the data provider about quality. The data files may be exact copies of administrative registers, or they may be made specially for statistical needs. For the internal quality monitoring of specific data files in an established production process, a more targeted set of quantitative indicators should be developed in the near future.

Estonian register holders have been issued lists of actions required for implementation of a register-based census; these include deadlines and general requirements for database quality assessment, which have to be met by chief processors of databases. In order to conduct the census, the following situation should be achieved by 2021 and beyond:

• Estonian addresses in all databases (registers) are extracted from the address data system of the Land Board – this ensures standardised address data;

• All persons in databases are identified on the basis of personal identification code (for natural persons) or register code of the commercial register or non-profit associations and foundations register (for legal persons);

• Data acquisition is based on the X-Road data exchange layer – this reduces the number of technical data errors;

• The data for register-based census and other surveys are collected according to agreed data structures, which meet the statistical requirements.

Registers and Statistics Estonia have established the following targets of preparations for register-based census:

1. Ensure implementation of location addresses conforming to the address data system;

2. Ensure that, by 1 December 2017, classifications are administered and used in accordance with the structure registered in the Administrative System of the State Information System (RIHA);

3. Ensure data migration and transfers to Statistics Estonia via X-Road;

4. Ensure, by 1 January 2018, regular updating of data, regular data quality assessments, possibility to link data to personal identification codes of natural persons and register codes of legal persons;

5. Statistics Estonia conducts register data quality assessments according to the Official Statistics Act.

As regards classifications, Statistics Estonia has to approve the classifications used in the databases that are part of the state information system and has to use the agreed classifications in the census round. Harmonisation of classifications and terminology used in administrative registers is one of the key actions in the preparations for register-based census. Standardised metadata specifications will increase the efficiency of data processing.

Administrative registers currently use their established classifications and definitions, which are often different from the internationally harmonised census classifications and definitions. Furthermore, it was discovered during census preparations that some classifications used in administrative registers have not been specified. An important task is compiling metadata for census characteristics. The specifications of characteristics will enable register holders to submit data in the XML format, facilitating efficient collection of data based on the SDMX framework.

The stages of quality assessment were as follows:

1. Development of a quality framework and manual

2. Presentation of the manual to register holders

3. Testing activities according to the manual (Population Register and the address data system (ADS))

4. Analysis of feedback

5. Development of a quality assessment approach, based on the main requirements for a register-based census

6. Presenting the new quality assessment principles to register holders

7. Quality assessment

8. Presenting the results of quality assessment to the census steering group, the census committee, and researchers

9. Follow-up activities for a pilot census

III. Adapting quality framework for the evaluation of administrative data

It is vital to create the necessary prerequisites for register-based statistics and to develop a system of organised and harmonised state registers, which requires contributions towards raising the level of data quality. A system of organised and harmonised state registers is highly valuable to Statistics Estonia and makes it possible to significantly reduce the administrative and response burden of enterprises, institutions, organisations and inhabitants, while also increasing the capacity to monitor other processes taking place in society (besides those observed in the census).

It has become clear that it is necessary to study new data sources, to determine and develop quality criteria for the data that is going to be captured, to assess the quality of data and give feedback on it, to develop rules for using the data in the system of statistical registers, and to put in place methodologies for capturing and processing data for the register-based census.

The following results were obtained:

1. A methodology was worked out for data source quality assessment

The data quality instructions developed for register holders include methodologies for measuring and ensuring the quality of data in the information system as a whole. The instructions specify a methodology for managing the monitoring and supervision of database quality and include recommendations for metrics to be used in data quality monitoring.

The developed framework for data quality management includes three elements:

I. Data quality model for measuring and improving the categories associated with data quality for statistical purposes.

II. Set of data quality indicators, which can be used for testing different aspects of data quality.

III. Framework for data quality management, which is a set of iteratively implementable actions to ensure data quality.

2. Manual for data quality assessment of registers

Register holders found the developed manual (see Annex 3) difficult to understand and use.

Little progress was made when working according to the manual and, therefore, a decision was made in the spring to simplify the quality assessment procedures for register holders. It was important that all register holders understand the procedures in the same way. It was decided to use the data quality requirements as a basis. The manual will be used for those registers that are used for the first time for data acquisition.

3. Report on data quality results

The regular data quality assessment was performed in 2017 in cooperation between Statistics Estonia and the chief and authorised processors of registers (the deadline was 1 December 2017). The results of the quality assessment were presented to the REGREL Working Group on Registers on 24 November 2017 and to the REGREL Steering Group on 29 November 2017. The quality assessment was performed for the registers included in the 1st pilot census. An overview is in Annex 1.

Summary: The main factor that undermines the quality of census data is the difference between registered and actual residence information. This situation has a significant impact on the structure of households and families. The quality of the data in the State Register of Construction Works is inadequate. The low quality of the Register of Construction Works is caused by under-coverage of buildings and dwellings, incomplete data on technical characteristics, and lack of updates.

4. Report about metadata quality (verification of classifications)

An overview of the status of classification verification is in Annex 2. The deadline for verification of classifications was 1 Oct 2017. Exceptions were made for two registers:

1. Register of Residence Permits and Work Permits

2. Register of Imprisoned and Detained persons and Persons Held in Custody

Summary: The work with classifications has been performed in registers, but the main problems were associated with upgrading the administration of classifications and links to the latest version. As a result, the latest version of the classification of administrative and settlement units (EHAK) has not been adopted or the classifications have not been implemented. Due to the administrative reform, comments concerning the new version of EHAK were omitted from the review of the use of classifications in registers. However, as a single exception, one register received a note regarding EHAK, because it did not have any integration with EHAK (which is required). Only one information system – e-File – did not have any issues according to the data structure specified in RIHA. As of 24 November 2017, the only issue in several information systems (Land Register, Health Insurance Database, National Defence Obligation Register, Social Protection Information System, Mandatory Funded Pension Register, Traffic Register, Medical Birth Register, and Causes of Death Registry) was that they had not updated their links to the new version of the classification of administrative and settlement units of Estonia, which has been applicable since the end of October. The quality was up to standard and no comments from Statistics Estonia were needed in 9 out of 22 registers. (See Annex 3)

5. Report on data exchange methods

As regards data exchange, we assessed the method of data transmission. In order to conduct a register-based census, we need to maximise the level of automation in the data collection process, based on an IT solution. The data should be collected from registers and the collected data should meet the statistical requirements. The time currently required for collecting data from registers is not acceptable; the time resources required for data acquisition need to be reduced. Problems are caused by the manner of data collection. The quality assessment revealed the following obstacles that prevent optimisation of the data acquisition process:

1. Non-standard data format (e.g., address data);

2. Different methods of data exchange used by different agencies;

3. Uneven data quality.

To sum up, Statistics Estonia assessed the quality of data in databases. We worked out requirements for ensuring data quality in the census:

• Registers must follow up on the results of regular quality assessment conducted by Statistics Estonia;

• Data acquired from registers must be compliant with the characteristics as agreed with Statistics Estonia, and they must be complete and accurate;

• Registers must verify the internal consistency of their data (e.g., to detect and correct duplicate entries and errors that are identifiable by combining different characteristics).

The positive findings were that:

1. the address standard has been implemented in the Population Register;

2. both register holders and Statistics Estonia work on measuring the quality of data;

3. a secure data exchange platform is available;

4. registers have appointed holders and assigned tasks;

5. census data can be used for methodological work with data;

6. some registers have adopted data quality standards.

The quality assessment identified three main issues that complicate the conduct of a register-based census:

1. For nearly a quarter of the population, the registered place of residence differs from the actual place of residence; this is based on an analysis covering the 1st and 2nd quarters of 2017 (Eurostat grant Improvement of the quality of EU census (2021 and post-2021));

2. Occupation and workplace location of residents are currently not recorded in registers;

3. The quality of the data in the State Register of Construction Works is inadequate. The low quality of the Register of Construction Works is caused by under-coverage of buildings and dwellings, incomplete data on technical characteristics, and lack of updates.

Statistics Estonia has emphasised that the decision to adopt the method of register-based censuses implies the following:

• virtually all mandatory census characteristics are covered by registers;

• there is a functioning system to enable identification of all census objects;

• the address data system is used;

• the difference between de jure and de facto data.

According to the grant results, the pilot census of 2019 will be based on two possible scenarios for the next population and housing census, with relevant empirical data collected for two options.

Option A.

Piloting of a full-scale register-based census where nearly all EUROSTAT’s mandatory output characteristics are calculated on the basis of register information.

This option facilitates testing of:

• Availability of information in registers and transportability of the data;

• Quality and coverage of the register information in relation to the total population;

• Performance and accuracy of the algorithms developed for the calculation of census characteristics;

• Capacity of model-based indexes (residency index, partnership and location index) to generate estimates that reflect the actual situation.

The outputs of the pilot census include the most relevant hypercubes, which will be compared with the respective cubes determined on the basis of data from 2011. This helps to highlight any developments in the recent period and to test the adequacy of detailed information obtained from register data.

The quality of the results of the pilot census will be assessed according to developed rules and norms both with regard to individual characteristics as well as sets of characteristics (cubes and marginal cubes).

If the pilot census produces adequate results and outputs that meet the international requirements, it means that a register-based census is feasible in Estonia.

If the results indicate that some census characteristics

• cannot be calculated on the basis of registers, or

• have coverage or quality that does not meet international requirements,

then option B will be implemented.

Option B will be implemented if registers cannot guarantee the required level of accuracy for some census characteristics (international practice indicates that this number can be between 3 and 5). The methodological work performed so far indicates that the number of non-conforming characteristics will certainly not be higher than that. For the remaining characteristics, the register-based census methodology described under Option A will be applied. The characteristics that cannot be estimated on the basis of registers with sufficient accuracy will be handled separately.

For such characteristics, the data collected with large-scale surveys of Statistics Estonia (e.g., the LFS) in the past year(s) will be used to impute the values of missing characteristics according to a suitable statistical imputation (prediction) procedure. Imputation ensures consistency between all hypercubes and marginal cubes, unlike, for instance, the table matching method used in the Netherlands, where the necessary information cannot be obtained on small groups of people.

This combined methodology conforms to international requirements. It also retains the current person-based approach to population statistics.

The main body of work in the pilot census in 2019 will be based on Option A. In addition, the pilot census will be used to identify suitable imputation methods and to test imputation performance with at least one characteristic. The result will be compared with the register-based data distribution for the same characteristic.

IV. Assessment of sustainability over time and plan for implementing further changes in the statistical system as a result of the action

4.1 Two possible strategies of using index-based methodology in population statistics

1) Improving existing data

When using the register-based household data (household = persons living in the same dwelling), the distribution of household types may appear shifted or biased when compared with earlier data or survey data.

In this case, it is not necessary to recalculate all families, households and their dwellings; it is enough to correct the seemingly biased results.

This happened when the register-based data on household structure from the first pilot of the 2021 census were compared with the corresponding data of the 2011 census. It turned out that the group of single parents had increased dramatically in the course of five years from 6.6% to 10.9%, i.e. by about 57,000 persons. This change was too big to be explained by the changing social situation and the difference in definitions, and so it was necessary to analyse the possibility that these persons had not registered their living quarters accurately.
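The figures above can be cross-checked with a short calculation. This is only a sanity check, assuming that both percentages refer to the same base population (the variable names are ours): an increase of 4.3 percentage points corresponding to about 57,000 persons implies a base of roughly 1.3 million persons, which is in line with the size of the population of Estonia.

```python
# Figures quoted in the text.
share_2011 = 0.066       # share of single parents, 2011 census
share_pilot = 0.109      # share of single parents, first pilot of the 2021 census
extra_persons = 57_000   # reported absolute increase

# Increase in percentage points and the base population it implies,
# assuming both shares are measured against the same base.
diff_pp = (share_pilot - share_2011) * 100
implied_base = extra_persons / (share_pilot - share_2011)

print(f"increase: {diff_pp:.1f} percentage points")
print(f"implied base population: {implied_base:,.0f}")
```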

To correct this bias, the partnership index was used only to find partners for single parents. The following rule was applied: the potential partners could not be partners in existing families; they had to be either single parents or single persons. It was possible to find a partner and a family for more than 40,000 single parents. This step, in general, solved the problem, as the remaining difference between the census and pilot census data was explained by other reasons (change of definition, changes in society).

Now, the next step will be finding suitable common living quarters for all these 40,000 families.

2) Recalculation of all household and placement data

Another approach is to calculate partnership indexes for all adult persons. As a result, it may happen that some couples registered as a family will be broken and new families will be formed. However, at the moment, we do not see this step as an aim of population statisticians.

4.2 The work to develop partnership index continues

In the current version, 10% of actual couples did not have any SOPs. The only way to detect these couples is to introduce new SOPs. We plan to include data on shared vehicles, paternity leave, fathers using parental benefits, and the single-parent child allowance.

For some people, the choice of partner is not uniquely defined, as they may appear in multiple quasi-couples. We must create a selection mechanism—for example, highest index value or time since last SOP—and test its performance.
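A minimal sketch of one such selection mechanism is given below. It is an illustration under our own assumptions, not the mechanism adopted by Statistics Estonia: quasi-couples are ranked by index value, ties are broken in favour of the more recent SOP, and each person is assigned to at most one couple. The class `QuasiCouple`, its fields and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuasiCouple:
    person_a: str
    person_b: str
    index_value: float    # partnership index of the pair (hypothetical scale)
    years_since_sop: int  # years since the last recorded sign of partnership


def select_partners(candidates):
    """Greedy selection: highest index first, ties broken by most recent SOP."""
    ranked = sorted(candidates, key=lambda c: (-c.index_value, c.years_since_sop))
    taken, chosen = set(), []
    for couple in ranked:
        if couple.person_a in taken or couple.person_b in taken:
            continue  # one of the persons already has a partner
        chosen.append(couple)
        taken.update((couple.person_a, couple.person_b))
    return chosen


candidates = [
    QuasiCouple("A", "B", 0.9, 1),
    QuasiCouple("A", "C", 0.7, 0),  # A also appears here
    QuasiCouple("C", "D", 0.6, 2),
]
selected = select_partners(candidates)
for couple in selected:
    print(couple.person_a, couple.person_b, couple.index_value)
```

A greedy rule is simple to test against a control survey, but it does not guarantee the best overall assignment; that trade-off would need to be examined when the mechanism is evaluated.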

A control survey is scheduled for 2018 to test the performance of new census methodologies, including the partnership index.

Finding partners for people is only one step in household formation. Each family must also be assigned to a suitable dwelling. Since we cannot rely only on place of residence in PR, additional sources are considered.

4.3 Register’s data quality assessment

Statistics Estonia assessed the quality of data in databases. We worked out requirements for register holders to ensure data quality in the census. The requirements were approved by the census steering committee.

Follow-up activities:

• Register holders should follow up on the results of regular quality assessment conducted by Statistics Estonia;

• Data acquired from registers must be compliant with the characteristics as agreed with Statistics Estonia, and must be complete and accurate;

• Registers must verify the internal consistency of their data (e.g. to detect and correct duplicate entries and errors that are identifiable by combining different characteristics, etc.).
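The kind of internal consistency check meant in the last point could look like the following sketch. The record layout, the field names and the single cross-characteristic rule (a marriage date earlier than the birth date) are assumptions made for illustration only.

```python
from collections import Counter
from datetime import date

# Hypothetical register extract.
records = [
    {"id": "P1", "birth": date(1980, 5, 1), "marriage": date(2005, 6, 1)},
    {"id": "P2", "birth": date(1990, 1, 1), "marriage": date(1985, 1, 1)},  # married before birth
    {"id": "P1", "birth": date(1980, 5, 1), "marriage": date(2005, 6, 1)},  # duplicate entry
]


def duplicate_ids(rows):
    """Identifiers that occur more than once in the extract."""
    counts = Counter(row["id"] for row in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def inconsistent_ids(rows):
    """Identifiers whose characteristics contradict each other."""
    return sorted({
        row["id"]
        for row in rows
        if row["marriage"] is not None and row["marriage"] < row["birth"]
    })


print("duplicates:", duplicate_ids(records))
print("inconsistent:", inconsistent_ids(records))
```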

4.4 Revision of parameters of indexes

The main concern in a register-based census pertains to the poor quality of registers, which results from incorrect data submitted by the population. The greatest problem in this respect is the inaccuracy of residence data in the Population Register.

This has forced Statistics Estonia to develop an ‘index methodology’ to verify and specify the register data on the basis of a large number of other registers and data sources. In the pilot census in 2019, this methodology will be tested in three particular cases – residency index, partnership index, and placement index.

All these indexes use Estonia’s administrative databases as sources of information, which can be combined to form an interoperable data system with common identifiers. Assuming that, in the present day, a person living in Estonia inevitably leaves certain traces of activity in the form of records in different databases, it is possible to verify the person's residence in the country, as well as connections between persons and their locations, on an annual basis. Such verification is based on signs of life, signs of partnership and signs of placement that are recorded in registers every year. The annual indexes are established as linear combinations of the respective signs, which makes it possible to trace the change in a person’s status in different years.
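The idea of an annual index as a linear combination of signs can be sketched as follows. The sign names and weights are hypothetical and serve only to show the form of the calculation; they are not the parameters used by Statistics Estonia.

```python
def annual_index(signs, weights):
    """Weighted sum of the binary signs recorded for one person in one year."""
    return sum(weights[name] * value for name, value in signs.items())


# Hypothetical weights for three signs of life.
weights = {"employment": 0.4, "health_insurance": 0.3, "vehicle": 0.3}

# Signs recorded for one person in two consecutive years (1 = sign present).
signs_by_year = {
    2016: {"employment": 1, "health_insurance": 1, "vehicle": 0},
    2017: {"employment": 0, "health_insurance": 0, "vehicle": 0},
}

index_by_year = {
    year: annual_index(signs, weights) for year, signs in signs_by_year.items()
}
for year, value in sorted(index_by_year.items()):
    print(year, round(value, 2))
```

Computing the index separately for each year is what allows a change in status to be traced: in this toy example the index falls from one year to the next because no signs were recorded in the second year.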

The indexes are calculated for all persons who have received an Estonian personal identification code. This makes it possible to monitor transnational persons who have left Estonia, including detecting whether they have returned and how cross-border commuters move between their homeland and other countries.

Even though the general indexing principles have been established and model parameters have undergone empirical assessment, the methodology itself is still developing and new signs can be added depending on new information (incl. big data) becoming available. The accuracy of the index-based estimates is assessed through use and additional surveys, and the results are provided with potential estimation error values. Addition of new information (further signs) will result in consistent improvement of the accuracy of index-based estimates.

The statistical database should be supplemented with the address of the shared dwelling of the household, which was identified on the basis of the partnership index (if it is different from the person’s current address entry).

The person’s household status – either a partner in a private household or distant household, determined algorithmically – should also be added.

These data are dynamic in nature in the same way that family and household relations between persons can change. Consequently, these are not fixed data, which are not suitable for a register-based database (unlike data obtained from a single survey).

It is likely that future statistics will include distant relationships as a result of the increasing number of secondary dwellings. Family sociologists of the Nordic countries have been doing this for several decades. Significantly, this is associated with transnationalism and commuting.

Estimation of parameters for indexes

The best option for estimating the values of parameters (and weights) is to use the methods of multidimensional analysis (linear and logistic regression and discriminant analysis).
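As an illustration of how weights could be estimated with logistic regression, the sketch below fits a small model by plain gradient descent. Everything in it is assumed for the example: the two binary signs, the toy data and the idea that the true status (1 = resident) comes from a control survey. In practice a statistical package would be used instead of hand-written code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b):
    """Predicted status: 1 if the estimated probability is at least 0.5."""
    return int(sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5)


# Toy data: two binary signs per person and the status from a control survey.
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 1, 0, 1, 0]

w, b = fit_logistic(X, y)
print("weights:", [round(v, 2) for v in w], "intercept:", round(b, 2))
```

The fitted coefficients play the role of the weights in the linear combination of signs; with real data they would be re-estimated whenever a sign is added or discontinued.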

V Assessment of the possible applicability of the methodology/procedure to other contexts

The use of administrative data sources and the index-based approach can be applied in regular population statistics, censuses and housing statistics. This activity covered the business processes of data collection and production in official statistics that could use administrative data sources, and the processes of transforming administrative data into data fit for producing official statistics. This included outputs based only on administrative data. The methodology of census statistics develops alongside the data, i.e. new possibilities will emerge for processing data.

New data categories and data formats require improvements in methodology and new methodological approaches.

The data analysis methodology is significantly affected by calculation possibilities as well as opportunities to apply more and more complex and resource-demanding calculations.

VI. Summary of the problems encountered

During the grant activities, a few complications occurred in gaining access to data from the new data sources.

Also, with regard to the European General Data Protection Regulation, which became applicable on 25 May 2018, we did not know what kind of changes should be examined for the activity of developing statistical output production.

While working out the methodology for the register-based census, it became obvious that the list of data sources is subject to change. The lists of both signs of partnership and signs of life are never final. We keep looking for new data sources and we have to be ready when some sources are discontinued. For example, the Database of Work Ability Assessment and Work Ability Allowance started collecting data in July 2016; the data on work ability assessment and partial work ability were added as a sign of life to the residency index. Also, due to changes in the Income Tax Act, married couples cannot submit a joint income tax return starting from 2018, which eliminates a good predictor of partnership status.

Obtaining data from new sources may require patience. Even if the needed data source exists and the data holder has agreed to share the data, the process can be time-consuming. For example, we have intended to add shared vehicles to the partnership model for over a year. Since the beginning of 2017, we have received four extracts from the Estonian Traffic Register, but none of them was in the form necessary for indicating partnership status. We keep working to explain our needs. Defining a suitable query is complicated by the fact that vehicle data are also used in the residency index and for the Household Finance and Consumption Survey.

This also reflects a wider issue with data queries. Our queries represent our needs, but they may be ambiguous to the data holder. Database systems are often complex, and Statistics Estonia does not always know their nuances. Inconsistency in interpreting a query may lead to inappropriate and/or incomplete datasets. For instance, Statistics Estonia has experience with incomplete extracts from the Population Register. Two supposedly equivalent queries for obtaining birth records gave different results; each returned records that were not captured by the other. Also, when asking for the mother and father of each person, we did not get data on adoptions by same-sex couples. The fact that data on same-sex parents even exist in the Population Register came to our attention somewhat by chance while discussing the partnership index methodology with the representative of the Population Register.
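A discrepancy of this kind between two extracts can be detected by comparing the sets of record identifiers they contain. The sketch below assumes that each extract can be reduced to a list of identifiers; the function name and the example identifiers are hypothetical.

```python
def compare_extracts(ids_a, ids_b):
    """Summarise how two extracts of record identifiers differ."""
    a, b = set(ids_a), set(ids_b)
    return {
        "only_in_a": sorted(a - b),  # records captured by the first query only
        "only_in_b": sorted(b - a),  # records captured by the second query only
        "in_both": len(a & b),
    }


# Hypothetical identifiers returned by two supposedly equivalent queries.
query_1 = ["B001", "B002", "B003", "B005"]
query_2 = ["B001", "B002", "B004", "B005"]

report = compare_extracts(query_1, query_2)
print(report)
```

If both difference lists are non-empty, as in the case described above, neither extract can be treated as complete and the query definitions need to be clarified with the data holder.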

Checking data quality beyond the obvious (e.g. missing values, unique identifiers) requires creativity. The index-based methodology gives a different view of known datasets and allows multiple sources to be compared. For the partnership index, the object of analysis is a (quasi-)couple, rather than a person. When constructing couples from Population Register data, several inconsistencies in marital records were discovered. However, getting a grasp of systematically missing data when no comparison is available remains challenging.

Poor data quality wastes data. Although Estonia has unique codes for addresses via ADS, they have not been implemented in all databases, e.g. the State Human Resources Database and the Social Services and Benefits Registry. Instead, the addresses are represented as plain text of varying quality. Since it is not always possible to link these with ADS, some records become useless. For example, in the housing loan data, we did not find an address code for 7% of records.
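The share of records lost in this way can be measured when plain-text addresses are linked to address codes. The following sketch assumes a simple lookup on normalised address strings; the addresses, the codes and the normalisation rule are invented for illustration and do not describe the actual ADS linking procedure.

```python
def normalise(address):
    """Lower-case, drop commas and collapse whitespace."""
    return " ".join(address.lower().replace(",", " ").split())


def link_addresses(records, lookup):
    """Split records into those with and without a matching address code."""
    linked, unlinked = [], []
    for rec in records:
        code = lookup.get(normalise(rec["address"]))
        target = linked if code is not None else unlinked
        target.append({**rec, "address_code": code})
    return linked, unlinked


# Hypothetical reference table: normalised address text -> address code.
lookup = {
    normalise("Pikk 1, Tallinn"): "X1",
    normalise("Aia 2, Tartu"): "X2",
}

records = [
    {"id": 1, "address": "PIKK 1   Tallinn"},
    {"id": 2, "address": "Aia 2, Tartu"},
    {"id": 3, "address": "Aia tn 2 Tartu"},  # spelling variant, no match
]

linked, unlinked = link_addresses(records, lookup)
unlinked_share = len(unlinked) / len(records)
print(f"unlinked: {unlinked_share:.0%}")
```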

VII Description of how further actions could improve the use of administrative sources

Statistics Estonia has emphasised that the decision to adopt the method of register-based censuses implies the following:

• virtually all mandatory census characteristics are covered by registers;

• there is a functioning system to enable identification of all census objects;

• the address data system is used;

• a difference exists between de jure and de facto data.

According to the grant results, the pilot census of 2019 will be based on two possible scenarios for the next population and housing census, with relevant empirical data collected for both options.

Option A.

Piloting of a full-scale register-based census where nearly all EUROSTAT’s mandatory output characteristics are calculated on the basis of register information.

This option facilitates testing of:

• Availability of information in registers and transportability of the data;

• Quality and coverage of the register information in relation to the total population;

• Performance and accuracy of the algorithms developed for the calculation of census characteristics;

• Capacity of model-based indexes (residency index, partnership and location index) to generate estimates that reflect the actual situation.

The outputs of the pilot census include the most relevant hypercubes, which will be compared with the respective cubes determined on the basis of data from 2011. This helps to highlight any developments in the recent period and to test the adequacy of detailed information obtained from register data.

The quality of the results of the pilot census will be assessed according to developed rules and norms both with regard to individual characteristics as well as sets of characteristics (cubes and marginal cubes).

If the pilot census produces adequate results and outputs that meet the international requirements, it means that a register-based census is feasible in Estonia.

If the results indicate that, for some census characteristics,

• the characteristic cannot be calculated on the basis of registers, or

• coverage or quality does not meet international requirements,

then Option B will be implemented.

Option B will be implemented if registers cannot guarantee the required level of accuracy for some census characteristics (international practice indicates that this number can be between 3 and 5). The methodological work performed so far indicates that the number of non-conforming characteristics will certainly not be higher than that. For the remaining characteristics, the register-based census methodology described under Option A will be applied. The characteristics that cannot be estimated on the basis of registers with sufficient accuracy will be handled separately.

For such characteristics, the data collected with large-scale surveys of Statistics Estonia (e.g. the LFS) in the past year(s) will be used to impute the values of missing characteristics according to a suitable statistical imputation (prediction) procedure. Imputation ensures consistency between all hypercubes and marginal cubes, unlike, for instance, the table matching method used in the Netherlands, where the necessary information cannot be obtained on small groups of people.

This combined methodology conforms to international requirements. It also retains the current person-based approach to population statistics.

The main body of work in the pilot census in 2019 will be based on Option A. In addition, the pilot census will be used to identify suitable imputation methods and to test imputation performance with at least one characteristic. The result will be compared with the register-based data distribution for the same characteristic.
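The imputation method itself is still to be identified in the pilot census. Purely as an illustration of the general idea, the sketch below uses one very simple candidate, class-based mode imputation: within each class defined by auxiliary variables, the most frequent survey value is assigned to register records where the characteristic is missing. The variable names, classes and values are hypothetical.

```python
from collections import Counter, defaultdict


def fit_class_modes(survey_rows, class_keys, target):
    """Most frequent value of the target characteristic in each class."""
    by_class = defaultdict(Counter)
    for row in survey_rows:
        by_class[tuple(row[k] for k in class_keys)][row[target]] += 1
    return {cls: counts.most_common(1)[0][0] for cls, counts in by_class.items()}


def impute(register_rows, modes, class_keys, target):
    """Fill missing target values in register records from the class modes."""
    completed = []
    for row in register_rows:
        if row.get(target) is None:
            cls = tuple(row[k] for k in class_keys)
            row = {**row, target: modes.get(cls)}
        completed.append(row)
    return completed


# Hypothetical survey records (donors) and register records (recipients).
survey = [
    {"sex": "F", "age_group": "25-49", "status": "employed"},
    {"sex": "F", "age_group": "25-49", "status": "employed"},
    {"sex": "F", "age_group": "25-49", "status": "unemployed"},
    {"sex": "M", "age_group": "65+", "status": "retired"},
]
register = [
    {"id": 1, "sex": "F", "age_group": "25-49", "status": None},
    {"id": 2, "sex": "M", "age_group": "65+", "status": "retired"},
    {"id": 3, "sex": "M", "age_group": "65+", "status": None},
]

keys = ("sex", "age_group")
modes = fit_class_modes(survey, keys, "status")
completed = impute(register, modes, keys, "status")
print(Counter(row["status"] for row in completed))
```

Because every person receives a single value for the characteristic, all tables compiled from the completed file are mutually consistent. The resulting distribution can then be compared with the register-based distribution for the same characteristic, as planned for the pilot.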

Follow-up work

The data quality assessment showed that the methodology used for forming households needs to be improved to meet the statistical requirements; the issue is related to the difference between actual and registered places of residence, which concerns ca 20% of the population. For this reason, Statistics Estonia needs to explore additional data sources to identify potential ‘markers of life’ and ‘markers of partnership’ in order to compile and test partnership and location indexes. The adopted strategy for verifying the Estonian household and dwelling data entails the following:

1. Determining the number of Estonian residents and limiting searches to residents only;

2. Identifying potential partners for all single parents / single individuals;

3. Establishing pairs based on identified partners, following specific matching criteria;

4. Establishing family nuclei by adding all of their children to pairs;

5. Identifying the best-matching dwelling for each family nucleus. It is possible that household members not included in the family nucleus are added to the nucleus in the course of the matching process.
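Steps 1, 3, 4 and 5 of this strategy can be outlined in a few lines, taking the pairs from step 2 as given. This is a simplified sketch under our own assumptions: the set of residents, the list of pairs, the parent-child links and the dwelling scores are taken as given inputs, and all names and values are hypothetical.

```python
def form_family_nuclei(residents, pairs, children_of):
    """Steps 1, 3 and 4: keep pairs of residents and attach their children."""
    nuclei = []
    for a, b in pairs:
        if a not in residents or b not in residents:
            continue  # step 1: searches are limited to residents
        children = set(children_of.get(a, ())) | set(children_of.get(b, ()))
        nuclei.append({"partners": (a, b), "children": sorted(children & residents)})
    return nuclei


def best_dwelling(scores):
    """Step 5: dwelling with the highest placement score for a nucleus."""
    return max(scores, key=scores.get)


residents = {"A", "B", "C", "K1", "K2"}
pairs = [("A", "B"), ("C", "Z")]           # Z is not a resident
children_of = {"A": ["K1"], "B": ["K1", "K2"]}

nuclei = form_family_nuclei(residents, pairs, children_of)
dwelling = best_dwelling({"D1": 0.2, "D2": 0.7})  # hypothetical scores
print(nuclei, dwelling)
```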

Challenges

The strategic risk that the results of the register-based census will not meet the needs of Estonian or international users is still relevant, because it is difficult to find a workable solution to the problem of data quality in registers. The main problems are still the disparity between registered and actual places of residence and the poor quality of the data required for census characteristics in the Register of Construction Works.

We have identified a new operational risk for the 2021 census preparations. As of 1 November 2017, Statistics Estonia is unable to:

1. Obtain an overview of established and mandatory classifications;

2. Adopt established classifications and link a database/information system to the classifications used (no overview of the classifications that use the database/information system);

3. Submit new classifications for approval;

4. Submit proposals for change of classifications;

5. Notify database holders of changes in classifications.

6. Correct, supplement or change the information that was entered in the old RIHA (this information can only be viewed).

In order to resolve the situation, SE started negotiations with the RIA RIHA product owner; a solution is currently pending.

Conclusion

According to the grant results, the strategic risk for the register-based census is that its results will not meet the needs of Estonian users, because it is difficult to find a workable solution to the poor quality of place-of-residence data in registers and of the data required for census characteristics in the Register of Construction Works.

The data quality assessment showed that the methodology used for forming households needs to be improved to meet the statistical requirements; the issue is related to the difference between actual and registered places of residence, which concerns ca 20% of the population.

For this reason, Statistics Estonia needs to explore additional data sources to identify potential ‘markers of life’ and ‘markers of partnership’ in order to compile and test partnership and location indexes.

The adopted strategy for verifying the Estonian household and dwelling data entails the following:

1. Determining the number of Estonian residents and limiting searches to residents only;

2. Identifying potential partners for all single parents / single individuals;

3. Establishing pairs based on identified partners, following specific matching criteria;

4. Establishing family nuclei by adding all of their children to pairs;

5. Identifying the best-matching dwelling for each family nucleus. It is possible that household members not included in the family nucleus are added to the nucleus in the course of the matching process.

The main concern in a register-based census pertains to the poor quality of registers, which results from incorrect data submitted by the population. The greatest problem in this respect is the inaccuracy of residence data in the Population Register. We have agreed on the need for standardized methods for measuring the quality of input data from administrative sources. The agreement also includes the understanding that a standardized method should consist mainly of quantitative indicators and should not be too comprehensive. The definition of indicators has to be clear.

This has forced Statistics Estonia to develop an ‘index methodology’ to verify and specify the register data on the basis of a large number of other registers and data sources. In the pilot census in 2019, this methodology will be tested in three particular cases – residency index, partnership index, and placement index.

All these indexes use Estonia’s administrative databases as sources of information, which can be combined to form an interoperable data system with common identifiers. Assuming that, in the present day, a person living in Estonia inevitably leaves certain traces of activity in the form of records in different databases, it is possible to verify the person's residence in the country, as well as connections between persons and their locations, on an annual basis. Such verification is based on signs of life, signs of partnership and signs of placement that are recorded in registers every year. The annual indexes are established as linear combinations of the respective signs, which makes it possible to trace the change in a person’s status in different years.

The indexes are calculated for all persons who have received an Estonian personal identification code. This makes it possible to monitor transnational persons who have left Estonia, including detecting whether they have returned and how cross-border commuters move between their homeland and other countries.

Even though the general indexing principles have been established and model parameters have undergone empirical assessment, the methodology itself is still developing and new signs can be added depending on new information (incl. big data) becoming available. The accuracy of the index-based estimates is assessed through use and additional surveys, and the results are provided with potential estimation error values. Addition of new information (further signs) will result in consistent improvement of the accuracy of index-based estimates.

Results:

4. Production and output of housing statistics in February 2018.

5. Development of the partnership index, which will be tested during the census pilot in 2019.

6. Description of the metadata of the registers used, stored in the metadata repository for the second census pilot in 2019.

7. A methodology was worked out based on the data source quality assessment, with instructions for data quality assessment for register holders.

8. The data quality instructions developed for register holders include methodologies for measuring and ensuring the quality of data in the information system as a whole.

9. The framework developed for data quality management includes a set of data quality indicators, which can be used for testing different aspects of data quality.

10. Quality assessment report of the registers, which will be used in the second pilot census in 2019.

11. New data sources were incorporated into the index-based methodology, and a methodology for the register-based census using 24 different registers was presented.

12. A methodology was worked out for determining household composition for the register-based census in Estonia.

13. Methodologies were developed for adopting new data sources for the census and for population and housing statistics.

14. Revision of rules and production of statistical output using register data according to the new rules. We are aware of all the grid map indicators and are able to provide them after the 2021 census. Once the 2021 population census data are published, Estonia can disseminate the precise population number and sex-age distribution data.

15. Blog post about disseminated georeferenced data:

https://blog.stat.ee/2017/12/21/gumnasistide-koolitee-pikkus/

16. Tested methodology for overcoming confidentiality problems for the next census round.

17. Manual for data quality assessment of registers was created.

ANNEXES

ANNEX I Grant activities timetable

ANNEX II Data quality assessment

ANNEX III Use of classifications

ANNEX IV Manual

ANNEX V Algorithms for housing statistics
