Subtitle “Promoting the usage of administrative data in Statistics by describing and harmonising metadata”

Final grant report

Table of Contents

Executive Summary
List of acronyms
Introduction
1. Obtaining knowledge about best practices of administrative data and metadata management system from another Member State (study visit)
1.1. Summary and difficulties encountered
2. Analysing and compiling data about current agreements, data sources and data structure descriptions
2.1. Summary and difficulties encountered
3. Analysing the questionnaires and finding variables that could be replaced by administrative data
3.1. Summary and encountered difficulties
4. Mapping management processes of administrative data and metadata in Statistics Estonia
4.1. Summary and difficulties encountered
5. Creating vision document on how to give feedback to the data owners about data transmission deadlines and agreed data structures
5.1. Summary and encountered difficulties
6. Describing metadata for the data sources whose cooperation agreements are renewed in the metadata system
6.1. Summary and difficulties encountered
7. Renewing cooperation agreements made with data owners before the year 2010
7.1. Summary and difficulties encountered
References

Executive Summary

According to Statistics Estonia’s strategy, our goal is to produce high quality statistics with as low administrative burden and as high efficiency as possible. In order to achieve this, we need to improve the use of administrative data and describe the related metadata in our metadata management system.

At the moment, Statistics Estonia uses over 100 different administrative data sources (state registries) in the statistical production process. Managing, describing and improving the related information and metadata of those sources is a challenging and ongoing process.

In this project we have described and standardised metadata for the data sources whose cooperation agreements needed updating. During the process we also had the chance to develop and strengthen our partnership with the data owners, which is the key to using data from administrative sources.

Our project started with learning from Statistics Austria’s experience, and we were able to analyse and work through all the information related to our administrative data management in order to start managing it more efficiently.

Efficient data management is only possible if we have optimized management processes. During the project we were able to map the as-is and to-be processes of administrative data and metadata management.

The volume of administrative data and metadata is growing fast, so it is now clear that we need to move towards more automated processes. For that reason we have created a vision document for developing a new information system, Administrative Data Gate, which will make it possible to send automated feedback and reminders to the data owners and to automate the data checking processes.

The grant project enabled us to analyse our current questionnaires domain by domain and make suggestions to use additional administrative data sources to lower the response burden. This analysis was a new approach for us, because usually the statisticians are responsible for their own statistical activities. This time we analysed different questionnaires together centrally and had the opportunity to give the statisticians new ideas about which sources to use and how to improve the usage of administrative data in our organisation.

Statistics Estonia is very grateful for being able to carry out the activities in this grant project. It was very helpful to have a temporary employee who worked through a large amount of information, and we were able to start moving towards more automated processes for managing administrative data.

List of acronyms

ATAO – Statistics Design Department

BORA – The Beneficial Owners Register Act

EAS – Enterprise Estonia

ECAA – Estonian Civil Aviation Authority

EMDE – Electronic Maritime Information System

GSBPM – Generic Statistical Business Process Model

MUIS – System of (Estonian) Museums

RIHA – Administration system for the state information system

SE – Statistics Estonia

sDWH – Statistical Datawarehouse

TÖR – Working register

Introduction

The main objective of this grant project is to improve the use of administrative data sources in Statistics Estonia. According to Statistics Estonia’s strategy, our goal is to produce high quality statistics with as low administrative burden and as high efficiency as possible.

Producing high quality statistics is possible only when we have standardised metadata and efficient production processes. Harmonised metadata is usable across all statistical domains, which means if one data source is used in different statistical activities the metadata will be described only once. The described metadata will be available directly for the users and for the systems in the live production environment.

Statistics Estonia has set the goal of reducing the administrative and response burden for respondents. This is possible only if we use more administrative sources and either stop using some of the questionnaires or prefill some values on the questionnaires to help respondents answer.

Improving the use of administrative data for statistical production has one key element: close cooperation and partnership with the data owners. At the moment Statistics Estonia uses over 100 different administrative sources, and our goal is to build closer cooperation with the data owners in order to ensure efficient negotiations and high quality data delivery. One important part of the cooperation is having valid and up-to-date data delivery contracts: it is important that both parties to the contract know their responsibilities, and that data owners know why the data is needed and that it is securely processed and stored in Statistics Estonia.

We have planned several activities during the project that will help us to re-use information, produce statistics more efficiently and reduce the administrative and response burden.

To perform the tasks of the grant project we have held weekly project team meetings, where we discuss and agree on the tasks of the upcoming week. We use a web application (JIRA) for assigning and monitoring planned activities. All team members report weekly on the progress of their tasks and on possible difficulties, to ensure that the project stays on schedule.

An overview of the progress on the tasks of the grant project is given below. The overview is written task by task, and the level of progress is also evaluated. For every task, the difficulties encountered and an overall summary are described briefly.

1. Obtaining knowledge about best practices of administrative data and metadata management system from another Member State (study visit)

In order to use the best practices available in other European statistical offices, we planned a study visit at the beginning of the project. To choose possible destinations, we gathered information from colleagues who have attended different working groups. We learned that Austria and Finland both have advanced systems for managing metadata and administrative data. Both countries also conduct a register-based population and household census.

Statistics Austria was able to welcome us in October to share their knowledge and experience. As Austria is considered one of the leaders in the European Statistical System in the use of administrative data, we were happy to plan the agenda for the 1.5-day study visit.

The study visit took place from 17 to 19 October, with a very full agenda on 18 and 19 October. Four people from Statistics Estonia attended the visit:

• two Leading Methodologists from the Statistics Design Department (responsible for negotiations with data owners, describing metadata of administrative data and preparing the contracts);
• a Developer from the Data Service Department (responsible for the data warehouse and for developing the new IT system for administrative data management and automated controls);
• the Head of Data Description from the Statistics Design Department (responsible for the processes of managing administrative data and describing metadata in Statistics Estonia).

The agenda of the study visit was full of very useful and interesting topics for us. The overview of the study visit by agenda topics is given below.

• Coordination and guideline for administrative data

The first topic was an introduction to how Statistics Austria manages its administrative data and related information. Statistics Austria uses over 500 different data sources, of which over 50 are used for the register-based census. They do not have data delivery contracts with all the sources, because managing the contracts would be too burdensome and their national statistical law says that Statistics Austria can obtain the data from the data owners free of charge.

Statistics Austria currently has a separate metadata database for administrative data, an Access database that has been in use since 2008. The aim of this database is to give an overview of all administrative data available in Statistics Austria, to list the projects that use administrative data and to provide a search function for choosing from the available data. Information about external and internal contact persons and organisation details is also stored in this database. Statistics Austria plans to integrate the current metadata database into its new centralised data and metadata management system, the Statistical Datawarehouse (sDWH). The metadata for administrative data will then be extended so that data structures, attributes, classification lists, quality indicators, data formats, statistical units, reference dates, keywords and legal basis are also available.

• Statistical Datawarehouse (sDWH) – Motivation

Statistics Austria has developed the statistical datawarehouse to provide an internal, house-wide, easily accessible data and metadata platform. The sDWH project was established in 2014 and the technical solution was fixed in 2015. In April 2017 Statistics Austria started the implementation phase, proceeding role by role and department by department.

Metadata is described in the sDWH and has to be defined before it is possible to incorporate a new dataset. This supports the housewide harmonisation of concepts and other metadata.

There are different roles for sDWH users, which helps to manage the workflow of data and metadata management. For example, the Administrator can define represented variables and data sets and load the data, but the Quality Manager has to approve or disapprove the represented variables and data sets.
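The division of roles described above can be sketched as a small approval workflow. This is only an illustration and not Statistics Austria's implementation: the class, the role constants and the status names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical role names, chosen for this sketch only.
ADMINISTRATOR = "administrator"
QUALITY_MANAGER = "quality_manager"


@dataclass
class RepresentedVariable:
    name: str
    status: str = "draft"  # draft -> approved or disapproved


def define_variable(role: str, name: str) -> RepresentedVariable:
    """Only the Administrator may define a represented variable."""
    if role != ADMINISTRATOR:
        raise PermissionError("only the Administrator may define variables")
    return RepresentedVariable(name)


def review_variable(role: str, variable: RepresentedVariable, approve: bool) -> RepresentedVariable:
    """Only the Quality Manager may approve or disapprove a variable."""
    if role != QUALITY_MANAGER:
        raise PermissionError("only the Quality Manager may review variables")
    variable.status = "approved" if approve else "disapproved"
    return variable
```

Separating the two functions by role means a variable can never be defined and approved by the same role in this sketch.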

• Statistical Datawarehouse (sDWH) – Application (handling of metadata)

Statistics Austria also gave us a live demo of their sDWH. For us it was most impressive to see that data and metadata are stored in one application and that it is also possible to link different datasets in the system and visualise the results. In the sDWH all links are also visualised; for example, it can be seen which data set is used in which project. The system also shows possible joining options, and the data descriptions at the variable level are only one click away. The sDWH makes it possible to mark some variables and data sets as protected, in which case the in-house data owner has to grant permission to use the data set. The process of requesting and granting permission is also part of the sDWH: all the permissions and explanations are stored in one system, so users do not have to send separate e-mails.

• Register-based Census - an Overview

The last traditional census in Austria was conducted in 2001 and cost 72 million euros. In 2011 the register-based census cost only 10 million euros.

Statistics Austria has more than 50 data sources for the register-based census. In 2006 they also ran a register-based census test, in which methods, data procedures and the use of registers were successfully tested.

• Workflow of a Register-based Census

As Statistics Estonia is also planning to conduct a register-based census in 2021, it was interesting to hear about the workflow of Statistics Austria’s register-based census team. They have 13-15 persons permanently on the census team, and for the 2011 census they also had about 20 temporary team members responsible for different tasks.

Every team member is responsible for capturing some of the data sources and they remind the data owner one month in advance about the need to deliver the data.
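The one-month advance reminder described above could be computed as in the following minimal sketch. The function names and the example data sets are hypothetical, and "one month" is assumed here to mean one calendar month, with the day clamped for shorter months.

```python
import calendar
from datetime import date


def reminder_date(deadline: date) -> date:
    """Return the date one calendar month before the delivery deadline."""
    if deadline.month == 1:
        year, month = deadline.year - 1, 12
    else:
        year, month = deadline.year, deadline.month - 1
    # Clamp the day for shorter months (for example 31 March -> end of February).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(deadline.day, last_day))


def reminders_due(deadlines: dict[str, date], today: date) -> list[str]:
    """List data sets whose reminder date has arrived and whose deadline has not passed."""
    return sorted(
        name
        for name, deadline in deadlines.items()
        if reminder_date(deadline) <= today <= deadline
    )
```

A call such as `reminders_due({"data set A": date(2021, 5, 1)}, date(2021, 4, 10))` would list the data set, because its reminder date has already been reached.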

Statistics Austria has a process documentation system for management reporting, timetables and the production schedule.

The process documentation is available in the ADAM/EVA database; for example, the timetable/calendar for the execution of monthly, quarterly and yearly processes is kept there.

• ADAM/EVA database and documentation (handling of metadata)

The census data is currently stored in the ADAM/EVA database, but the plan is to incorporate it into the sDWH in the future. The ADAM/EVA database is also used for metadata documentation. It has a search option for tables, variables, attributes and variable values.

• Other projects based on ADAM/EVA database

The ADAM/EVA database is also used for other projects in Statistics Austria, for example the labour force survey, national accounts, the rich frame for social statistics, monitoring of education-related employment, tracking of graduates and register-based labour market careers.

The rich frame is used for calibration/post-stratification, non-response analysis and substituting survey questions with administrative data.

• Statistical Datawarehouse (sDWH) – Future plans

Statistics Austria shared with us the future plans for the sDWH. They are planning to integrate all administrative data and metadata into the warehouse. They will then have a fully integrated and harmonised metadata management system.

Statistics Austria is also planning to create GeoWizard for the automatic creation of working maps for internal and external use, with all the necessary data, metadata and information stored in a standardised way in the sDWH.

For the visualisation of statistical information, Statistics Austria is currently evaluating the Tableau software. Internally, visualisation is important for the heads of departments to create reports about data usage and availability. Externally, the plan is to develop dashboards for disseminating statistics in a more user-friendly format.

• Quality assessment for Register-based statistics / metadata of administrative data

Statistics Austria has developed a three-stage quality evaluation system for administrative data. First, data quality is evaluated at the raw data stage, when the registers provide the data. Second, the data is combined and linked in the central database and evaluated again. Third, after combinations and imputations the data is available in the final data pool, where its quality is evaluated once more.

• Census - Analysis of Residence

Statistics Austria explained to us how they avoid overcoverage of residents. If a person has only one record in the Central Persons Register, they have to confirm their residence by answering an official letter. About 69 thousand letters were sent out last time to confirm residence. If the residence is not confirmed by answering the official letter, the person becomes a candidate for deletion. However, the local authorities have the opportunity to oppose the deletions by proving that the person is still a resident of their municipality.
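The residence check described above can be read as a simple decision rule. The sketch below is an illustration under assumptions, not Statistics Austria's procedure: the record structure, field names and status labels are hypothetical, and the case of a person with no record at all is not covered by the description, so it is rejected here.

```python
from dataclasses import dataclass


@dataclass
class PersonRecord:
    person_id: str
    register_records: int  # number of records found for the person
    letter_answered: bool = False
    municipality_objected: bool = False


def residence_status(person: PersonRecord) -> str:
    """Classify a person following the rule described in the text."""
    if person.register_records < 1:
        # Not covered by the description; treated as invalid input in this sketch.
        raise ValueError("person has no register record")
    if person.register_records > 1:
        return "no_letter_needed"
    if person.letter_answered:
        return "confirmed"
    if person.municipality_objected:
        # The local authority proved the person is still a resident.
        return "kept_after_objection"
    return "deletion_candidate"
```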

For conducting the census successfully, Statistics Austria has annual quality evaluation for the residence data, all sources and outputs are analysed and evaluated.

• Business Register for Administrative Purposes and Beneficial Owner Register

The last topic on the agenda was the introduction of two Austrian registers.

Every entity taking part in e-government processes needs to be registered in one of the state registers. The business register combines different registers and is the basis for the statistical registers.

The automatic data transmission intervals differ: some registers transmit data to the business register weekly, while others have an online connection, so their data is always up to date.

The Austrian Beneficial Owner Register Act (BORA) obliges legal entities to register their owners. This should equip financial supervisors with a tool to fight money laundering and terrorism financing.

Because of the business register for administrative purposes, Statistics Austria is an optimal partner to implement the Beneficial Owner Register technically for the Austrian Ministry of Finance.

The BO register is a great business case and the BORA explicitly allows Statistics Austria the usage of data for statistical purposes.

1.1. Summary and difficulties encountered

In conclusion, the study visit was very successful for us: we had the opportunity to learn from Statistics Austria’s experience and best practices. Although we are at a different stage of using and managing administrative data than Statistics Austria, we got new ideas about how to optimise the processes of documenting the metadata of administrative data.

Firstly, we were surprised to hear that Statistics Austria does not have formal written contracts with all the data owners. As our national statistical law also says that data from the registries can be obtained free of charge for the purposes of official statistics, we are considering how to make the data transmission agreements more flexible. Right now we mostly have written contracts with the data owners or, if the data transmission is done for piloting the data usage, we send a data request to get the data. We are currently developing a form of data transmission agreement that would describe the required data structures and deadlines, but would be flexible and less burdensome to change and keep up to date.

Secondly, we really appreciated the workflow management of the sDWH. In our current metadata management system iMeta, metadata can be described only by members of the metadata team, and to get the metadata right we ask the statistical departments for input using Excel forms. However, we are currently piloting a new metadata information system, Colectica, in which a workflow service is also integrated. In the future we want to implement a system similar to Statistics Austria’s, in which analysts can also enter metadata, but an administrator from the metadata team needs to approve it before it is published for use.

Thirdly, after the visit we are convinced that in the future the metadata and administrative data should be integrated into one information system in order to use the data more efficiently in the statistical production process. The Data Service Department has started piloting the data virtualisation tool Denodo, in which data catalogues can be created that integrate data and metadata into one system. Implementing this application would be most useful for the statistical departments, because they would then no longer have to link data and metadata themselves.

The main difficulty in performing this task was finding a suitable time for the study visit for both Statistics Estonia and Statistics Austria. It was in our interest to hold the study visit at the beginning of the project so that we could use the knowledge gained in our further actions.

When we planned and attended the study visit in October, we were not sure whether Eurostat would approve the grant project application. There was therefore a risk of not being refunded for the study visit.

2. Analysing and compiling data about current agreements, data sources and data structure descriptions

In order to be able to start the tasks of renewing cooperation agreements made before 2010 and analysing questionnaires to find variables that can be substituted with administrative data, we started by analysing and compiling information about current agreements, data sources etc. As Statistics Estonia is currently struggling to manage the information related to administrative data, we began systematising and visualising the information we need to manage. This was the first task of our temporary staff member.

An introductory task for the new employee was to create an overview Excel table of the data that Statistics Estonia captures from administrative sources. The information in the table is presented by data sets of different data sources. For each data set the table contains information about the data structure, the transmission channel, the format and the deadline for the data to be transmitted. In addition, a brief description of the contract or data request is provided, together with the purpose of using the data in Statistics Estonia. This task helped our new employee to understand what kind of data Statistics Estonia receives from different data sources.

The basis for the overview table had already been created and consisted of a list of all the registries from which Statistics Estonia gets data. The first task was to add, for each data set, the data structure, the transmission channel, the format, the transmission deadlines, a brief description of the contract or data request and the purpose of using the data. All the necessary information was collected by searching through different documents and information systems: it was stored in the document management system, the metadata management system, shared computer folders, Outlook mailboxes and JIRA tasks. Because the information had not been stored or managed systematically, it was difficult for the new employee to find and compile everything that was needed.

The stored contracts and data requests had not always been marked correctly as valid or not, so the hardest part of the task was to establish which of the contracts and data requests are still valid. We have a new annex for every data set we capture from the data owners, and a new annex very often, but not always, invalidates the former annex. It was therefore challenging to go through all the annexes and find the currently valid ones.

In Statistics Estonia the web platform Confluence is used to manage internal information and make it accessible to colleagues. Every team has its own space or page in Confluence, where different overviews and guidelines can be stored and shared. We decided that the overview table of different data sources also needed to be visualised better, and that was the next task for the new employee.
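The difficulty with annexes described above comes from validity not being recorded explicitly. A minimal sketch of how it could be recorded is shown below, assuming each annex stores which earlier annex, if any, it invalidates. The structure, field names and identifiers are hypothetical and do not describe an existing system in Statistics Estonia; the fields mirror the columns of the overview table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annex:
    annex_id: str
    data_set: str
    transmission_channel: str
    data_format: str
    deadline: str
    replaces: Optional[str] = None  # id of the annex this one invalidates, if any


def valid_annexes(annexes: list[Annex]) -> list[Annex]:
    """Return the annexes that no other annex has invalidated."""
    replaced = {a.replaces for a in annexes if a.replaces is not None}
    return [a for a in annexes if a.annex_id not in replaced]


# Illustrative data only.
annexes = [
    Annex("A1", "data set X", "e-mail", "xlsx", "yearly"),
    Annex("A2", "data set X", "X-Road", "xml", "yearly", replaces="A1"),
    Annex("A3", "data set Y", "X-Road", "xml", "monthly"),
]
current = [a.annex_id for a in valid_annexes(annexes)]
```

Because a new annex only sometimes invalidates an earlier one, the link is stored per annex rather than inferred from dates.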

A summary table of contracts and requests for administrative metadata was compiled in Confluence. The overview is on the Metadata team page, where a sub-page for administrative data was created. The table contains a list of institutions and their registries with which Statistics Estonia has a contract or from which data is obtained through data requests. For a contract, the dates of signing and completion are included. In addition, each data source has information about the contact persons of the institution who can be approached with data transfer issues. Compiling the summary table gave an overview of which existing contracts were signed before 2010 and should be updated.

The previous task with the Excel table helped us to get started with this one. The list of institutions and their registries was taken from the Excel table and added to the Confluence table. Our employee started collecting information about contracts and data requests from local disks and the document management system Livelink. As in the previous task, the most important part was to establish which contracts and data requests are still valid and which are the latest. Statistics Estonia keeps all the documents about each data source, even those that are no longer valid. The fact that contracts and data requests were stored in different places and were not in order made this task time-consuming. All the information about the contracts and data requests had to be taken from the documents themselves, so the employee had to read through every contract or data request file she found in order to find the right information for the table.

After compiling the overview table in Confluence, we decided that we needed a sub-page for every data source. The main reason was that Statistics Estonia captures many different data sets with different deadlines from a single data source or registry. There can also be different contact persons for different data sets, and different users in Statistics Estonia. Our new employee therefore linked new sub-pages to the Confluence overview table; these sub-pages give users more detailed information about the data source. Each sub-page has a description of the captured data sets, deadlines, contact person information, user information and a link to the metadata management system, where the metadata of the data source is stored. In the future we also plan to link information about the data warehouse tables in which the administrative data is stored and accessible to the analysts of Statistics Estonia. This would give any colleague in Statistics Estonia full information about each data set that is available for use.

2.1. Summary and difficulties encountered

Performing this task was crucial for getting a better overview of the administrative data related information that Statistics Estonia needs to manage. It also gave our temporary employee the knowledge about the available data that she needed to perform the analysis of questionnaires.

The main difficulties have already been described above, but it is important to highlight the large amount of information that our temporary employee had to work through and systematise. It was quite time-consuming, because different documents have been stored in different places for historical reasons, and all this information had to be compiled to visualise the existing situation. Completing this task is a big step forward for Statistics Estonia, because we now understand our needs for an information management system for administrative data.

3. Analysing the questionnaires and finding variables that could be replaced by administrative data

Statistics Estonia has 127 different questionnaires that respondents have to fill out so that statistics can be produced. Our aim is to reduce the administrative and response burden by improving the use of administrative sources. Although Statistics Estonia already uses about a hundred different data sources, we were convinced that there are still variables on the questionnaires that can be replaced by administrative data. We already have about 35 questionnaires in which we prefill some variables for the respondents to make answering more convenient and less time-consuming.

When the grant application was written, we chose some of the domains to be analysed during the grant project, and our temporary employee started the process as soon as she had an overview of our questionnaires and the available administrative data. Between October 2018 and March 2019 we found new data sources for our agricultural statistics domain, which is a really important domain in Estonia, so we decided to include it in the grant project and analyse it more thoroughly. The two domains whose analysis we had finished by the submission of the intermediate report are culture and agriculture.

The first step of the analysis was to get an overview of the questionnaires and collected variables in the culture and agriculture domains of statistical activities, and of the administrative data in use.

The second step was to compare the variables of the questionnaires with the administrative data already in use, in order to find possible new sources to replace questionnaire variables. To store the new information and give a better overview of which variables collected by questionnaires can be replaced with administrative data, an Excel table was created. The Excel table contains the questionnaire code, the specific number of the statistical activity, the name of the statistical activity and the individual questions in the questionnaire, with suggestions for replacing them with administrative data.
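The comparison described above was an analytical task, but its bookkeeping can be illustrated with a small sketch that produces rows with the same columns as the Excel table. Everything here is an assumption made for illustration: the function names, the questionnaire code and activity number are hypothetical, and matching variables by normalised name is a strong simplification of the real comparison.

```python
def normalise(name: str) -> str:
    """Lower-case a variable name and collapse whitespace."""
    return " ".join(name.lower().split())


def suggest_replacements(questionnaire_rows: list[dict], admin_variables: dict[str, str]) -> list[dict]:
    """Add a suggested administrative source to every question that has a match.

    questionnaire_rows: dicts with the keys questionnaire_code, activity_number,
    activity_name and question.
    admin_variables: mapping from variable name to the name of its data source.
    """
    lookup = {normalise(name): source for name, source in admin_variables.items()}
    suggestions = []
    for row in questionnaire_rows:
        source = lookup.get(normalise(row["question"]))
        if source is not None:
            suggestions.append({**row, "suggested_source": source})
    return suggestions


# Illustrative data only; the code and activity number are made up.
rows = [
    {"questionnaire_code": "Q-001", "activity_number": "00000",
     "activity_name": "Museum", "question": "Number of  employees"},
    {"questionnaire_code": "Q-001", "activity_number": "00000",
     "activity_name": "Museum", "question": "Opening hours"},
]
admin = {"number of employees": "Working register (TÖR)"}
result = suggest_replacements(rows, admin)
```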

In Estonia we have a state-level administration system for the state information system, called RIHA. Every state information system needs to be registered in RIHA. RIHA is thus the catalogue of the state’s information system, storing information about which data are collected and processed in which information systems, which services, including X-Road services, are provided, and who is using them.

X-Road is the backbone of e-Estonia: it is the data exchange layer that allows various public and private sector e-service information systems to link up and function in harmony. X-Road has developed into a tool that can write to multiple information systems, transmit large datasets and perform searches across several information systems simultaneously. Today, X-Road is implemented in Finland, Kyrgyzstan, Namibia, Faroe Islands, Iceland, Ukraine and other countries. (e-Estonia, 2019)

The next logical step in finding new data sources was to search RIHA. If the information system owner has registered and entered all the necessary information in RIHA, it is a very good source of information for Statistics Estonia. Unfortunately, at the moment quite a large part of the information in RIHA is outdated, because it needs to be updated manually by the data owners. There are development plans that will hopefully resolve this problem, so that keeping the information in RIHA up to date can be automated in the future.

Additionally, we searched the Internet to find data that is already public and could be used through web-combing or other methodologies.

The third step was proposing to replace variables collected by questionnaires with administrative data. This step included face-to-face meetings with the people who work in the fields of culture and agriculture in Statistics Estonia.

The last step was planning future activities on the basis of the meetings held with the analysts of cultural and agricultural statistics. In some cases we also managed to hold negotiations and meetings with the data owners to agree on new data deliveries.

In the domain of culture we have 6 different questionnaires, which are divided among the following statistical activities: Movie, Museum, Music, Radio and Television. We proposed substituting some variables with new data sources in five different questionnaires. Our proposals and the results are compiled in the table below.

| Suggestions | Outcomes |
|---|---|
| 1. Data about all the Estonian movies (movie type, name, duration) from Estonian Film Database. | This suggestion was accepted and the next step is to negotiate with the Estonian Film Database manager. |
| 2. The number of museals in each Estonian museum from Information System of (Estonian) Museums (MUIS). 3. The number of employees in Estonian museums from Working register (TÖR). | These suggestions were accepted, but there is a plan to rearrange some parts of the museums’ questionnaires, so there is no full overview yet of what kind of data will be needed after the questionnaires are redesigned. |
| 4. Music event names, types, number of concerts, number of tickets sold, ticket sales revenue and number of visitors from sites that are officially selling tickets online in Estonia (for example Piletimaailm and Piletilevi). | This suggestion was accepted partly, because there are multiple sites that sell tickets online. In addition to these online selling companies, there are non-official sellers and also a chance to buy concert tickets on site. So there is no accurate overview of how many people visited a concert and what the ticket sales revenue was. However, we have signed a contract with one of the sellers, Piletimaailm, and will be receiving the first dataset soon. Then our analysts can pilot the data usability. The information about music event names, types and number of concerts can be found on the site http://kultuur.info. |
| 5. The number of employees with their job titles in radio broadcasting stations from Working register (TÖR). | This suggestion was accepted and, as we are capturing data from the Working register already, the analysts just have to take the data into use. |
| 6. The number of employees with their job titles in television broadcasting stations from Working register (TÖR). | This suggestion was accepted and, as we are capturing data from the Working register already, the analysts just have to take the data into use. |

In the domain of agricultural statistics we have 14 different questionnaires, which are divided into the following statistical activities: Sown area of field crops, Purchase of livestock and poultry, Livestock farming and meat production, Quarterly statistics of livestock farming, Purchase and use of milk, Economic accounts for agriculture, Farm Structure Survey, Agricultural products, Yields, Crop farming, Cereals, Dairy products, Organic farming, Supply balance sheets of agricultural products and Agricultural products. Agricultural statistics is one of the most important statistical domains in Estonia and in Europe, but collecting data by questionnaires has always been burdensome for respondents in this field. That is why we decided to include agriculture as one of the domains in our grant project. We started analysing the domain in the fall, and our initial analysis showed that there are still some data sources that Statistics Estonia is not capturing and using for agricultural statistics.

The Veterinary and Food Board was the data owner we started negotiations with, and as a first step we asked them to send us some data sets for piloting the data usage. The data sets were about slaughtered animals, production of honey and the number of pigs slaughtered at home. Our analysts piloted the usability of the data and we compiled our data needs in order to start negotiations with the Veterinary and Food Board. Statistics Estonia’s data need was broad: we wanted to capture several data sets with different data delivery deadlines, and the work involved different analysts from our side and different departments from the Veterinary and Food Board side. To keep the discussions effective, we held several meetings to agree on the composition of the different data sets and on the data delivery deadlines. We managed to agree on all the data sets, and we now receive monthly and yearly data sets about slaughtered animals. The monthly data set was immediately used for prefilling the questionnaires. We now also receive yearly data sets about the production of honey and the number of pigs slaughtered at home.

We also had meetings with two other data owners, the Estonian Land Board and the Agricultural Board. Both sources are already in use in Statistics Estonia, but our data needs have widened and the composition of data in those registries has changed, so we need to work on new agreements and on getting access to the available data. Our proposals for agricultural statistics and the outcomes are compiled in the table below.

Suggestions and outcomes

1. Suggestion: The number of slaughtered animals and the weight of edible/inedible meat from the Veterinary and Food Board.
   Outcome: This suggestion was accepted and the questionnaires are prefilled with the data from the monthly data set.

2. Suggestion: Information about honey production in Estonia from the Veterinary and Food Board.
   Outcome: This suggestion was accepted and we have received the yearly data set for 2018, which was used for pre-filling the questionnaire. The quality of the data is very good, and next year the data will no longer be asked for in the questionnaire: the statistics of honey production will be based on administrative data only.

3. Suggestion: The number of pigs slaughtered at home from the Veterinary and Food Board.
   Outcome: This suggestion was accepted and we have already received the yearly data set for 2018, which was used as an additional data source for validating questionnaire data. In the future the data will be used to substitute the collected variables.

4. Suggestion: The number of people employed in the agriculture field with their job titles from the Working register (TÖR).
   Outcome: This suggestion was accepted, but it needs further methodological analysis. The data from the Working register are captured monthly, so if the analysis shows that the data are compatible, they can be used for pre-filling the questionnaires.

5. Suggestion: The prices of land from the Estonian Land Board, according to the new methodology.
   Outcome: Negotiations with the Estonian Land Board to receive the land price data are still ongoing. They have promised to carry out a spatial analysis that takes into account the land use data from the Estonian Agricultural Registers and Information Board. We are now waiting for the new spatial analysis by the Estonian Land Board to see whether it is sufficient for our data needs.

6. Suggestion: Organic farming data from the Agricultural Board.
   Outcome: The negotiations are still ongoing. The Agricultural Board is a very important data source for organic farming statistics. The information system of the Agricultural Board is in development and we have had several meetings to explain Statistics Estonia’s expanding data needs. We need more detailed data about organic farming and we are negotiating to have our data needs considered in the new information system.

7. Suggestion: The number of fur animals, number of animals slaughtered for fur, number of skins sold etc. from the Veterinary and Food Board.
   Outcome: We recently received information that the Estonian Veterinary and Food Board will start collecting information about fur animals. The negotiations are now at the stage of getting to know the composition of the data and the possibilities of getting access to them.

Obligations:

• Regulation (EC) No 1165/2008 of the European Parliament and of the Council (number of bovine animals, pigs, sheep, goats and poultry slaughtered in slaughterhouses)

• Regulation (EC) No 1165/2008 of the European Parliament and of the Council (carcass weight of bovine animals, pigs, sheep, goats and poultry slaughtered in slaughterhouses)

• Regulation (EC) No 138/2004 of the European Parliament and of the Council (Production account: Other animal products: others)

• Regulation (EC) No 1165/2008 of the European Parliament and of the Council (slaughtering carried out other than in slaughterhouses: pigs)

• ESS Agreement on statistics of agricultural land prices and rents

• Council Regulation (EC) No 834/2007 of 28 June 2007 on organic production and labelling of organic products and repealing Regulation (EEC) No 2092/91 and Commission Regulation (EC) No 889/2008 of 5 September 2008 laying down detailed rules for the implementation of Council Regulation (EC) No 834/2007 on organic production and labelling of organic products with regard to organic production, labelling and control

• Regulation (EC) No 138/2004 of the European Parliament and of the Council (Production account: Other animal products: others)

In the domain of accommodation statistics we have 2 different questionnaires, which are divided into the following statistical activities: Tourism and Accommodation activities. We made proposals to substitute some variables with new data sources in only one questionnaire, because our Tourism questionnaire consists only of personal questions that cannot be replaced by administrative data. Our proposals and the results are compiled in the table below.

Suggestions and outcomes

1. Suggestion: The number of beds in accommodation facilities from the Enterprise Estonia (EAS) database.
   Outcome: The next step for us was to check the definition of “the number of beds” that is used in the EAS database. Is it the total number of beds, or the number of beds that had been used?

2. Suggestion: Wheelchair access in accommodation facilities from Enterprise Estonia (EAS).
   Outcome: Another important step for us was to find out how EAS manages its database. The main question is: do the enterprises themselves add information to the database voluntarily?

Obligations:

• Regulation (EU) No 692/2011 of the European Parliament and of the Council of 6 July 2011 concerning European statistics on tourism and repealing Council Directive 95/57/EC

• Commission Implementing Regulation (EU) No 1051/2011 of 20 October 2011 implementing Regulation (EU) No 692/2011 of the European Parliament and of the Council concerning European statistics on tourism, as regards the structure of the quality reports and the transmission of the data

In the domain of energy statistics we have 4 different questionnaires, which are divided into the following statistical activities: Electric power stations; Energy; Energy production, sales and fuel consumption; Consumption of fuel and energy. We made proposals to substitute some variables with new data sources in only one questionnaire, which is “Energy”. Our proposals and the results are compiled in the table below.

Suggestions and outcomes

1. Suggestion: Data on produced, purchased and sold electricity in Estonia from Elering.
   Outcome: Statistics Estonia is already receiving some data from Elering. Our next step is to check whether Elering can give us the necessary data monthly.

2. Suggestion: Data on the fuel used for freight transport from the Estonian Road Administration.
   Outcome: SE is already using some of the data from the Estonian Road Administration. The next step is to check whether we could also use the data from the yearly car reviews. That would enable us to find out the fuel usage of freight transport.

Obligation:

• Regulation (EC) No 1099/2008 of the European Parliament and of the Council of 22 October 2008 on energy statistics

In the domain of transportation statistics we have 23 different questionnaires, which are divided into the following statistical activities: Gas pipelines, Freight transport through ports, Freight transport on the road, Ships in the harbor, Ship traffic, Ship-based economic and social indicators, Ship registers, Marine accidents, Shipping-unloading, Air traffic, Flight accidents, Traffic Register, Road transport, Sea transportation, International travel through ports, Railway and rolling stock, Rail transport, Inland waterway transport, Vehicle registration, Tram-troll, Tram and trolley transport, Aircraft Register, Air transport. We made proposals to substitute some variables with new data sources in 3 questionnaires. Our proposals and the results are compiled in the table below.


Suggestions and outcomes

1. Suggestion: The number of air passengers, goods and mail transported by air from the Tallinn Airport website “Air Traffic Review”.
   Outcome: Our next step is to check whether the Airport is willing to give us microdata about the passengers, goods and mail.

2. Suggestion: The number of civil aircraft from the Estonian Civil Aviation Authority’s (ECAA) website.
   Outcome: Our next step is to find out how the website is updated and by whom, and how to ensure that the website has relevant data.

3. Suggestion: Data about trucks (total weight, number of axles of the truck, type of bodywork, type of engine) from the Estonian Road Administration.
   Outcome: SE is already using some of the data from the Estonian Road Administration. The next step is to check whether we could also use the data from the yearly car reviews.

Obligations:

• Regulation (EU) No 70/2012 of the European Parliament and of the Council of 18 January 2012 on statistical returns in respect of the carriage of goods by road

• Commission Regulation (EU) No 202/2010 of 10 March 2010 amending Regulation (EC) No 6/2003 concerning the dissemination of statistics on the carriage of goods by road

• Commission Regulation (EC) No 1304/2007 of 7 November 2007 amending Council Directive 95/64/EC, Council Regulation (EC) No 1172/98, Regulations (EC) No 91/2003 and (EC) No 1365/2006 of the European Parliament and of the Council with respect to the establishment of NST 2007 as the unique classification for transported goods in certain transport modes

• Commission Regulation (EC) No 833/2007 of 16 July 2007 ending the transitional period provided for in Council Regulation (EC) No 1172/98 on statistical returns in respect of the carriage of goods by road

• Commission Regulation (EC) No 642/2004 of 6 April 2004 on precision requirements for data collected in accordance with Council Regulation (EC) No 1172/98 on statistical returns in respect of the carriage of goods by road

• Commission Regulation (EC) No 6/2003 of 30 December 2002 concerning the dissemination of statistics on the carriage of goods by road

• Commission Regulation (EC) No 2163/2001 of 7 November 2001 concerning the technical arrangements for data transmission for statistics on the carriage of goods by road

• Commission Regulation (EU) No 520/2010 of 16 June 2010 amending Regulation (EC) No 831/2002 concerning access to confidential data for scientific purposes as regards the available surveys and statistical data sources

• Directive 2009/42/EC of the European Parliament and of the Council of 6 May 2009 on statistical returns in respect of carriage of goods and passengers by sea (Recast)

• Commission Regulation (EC) No 1304/2007 of 7 November 2007 amending Council Directive 95/64/EC, Council Regulation (EC) No 1172/98, Regulations (EC) No 91/2003 and (EC) No 1365/2006 of the European Parliament and of the Council with respect to the establishment of NST 2007 as the unique classification for transported goods in certain transport modes

• 2010/216/: Commission Decision of 14 April 2010 amending Directive 2009/42/EC of the European Parliament and of the Council on statistical returns in respect of carriage of goods and passengers by sea

• Commission delegated decision of 3 February 2012 amending Directive 2009/42/EC of the European Parliament and of the Council on statistical returns in respect of carriage of goods and passengers by sea

• Regulation (EC) No 437/2003 of the European Parliament and of the Council of 27 February 2003 on statistical returns in respect of the carriage of passengers, freight and mail by air

• Commission Regulation (EC) No 158/2007 of 16 February 2007 amending Commission Regulation (EC) No 1358/2003 as regards the list of Community airports

• UNECE, ITF and Eurostat Common Questionnaire for Transport Statistics Gentlemen's Agreement

• Commission Regulation (EC) No 546/2005 of 8 April 2005 adapting Regulation (EC) No 437/2003 of the European Parliament and of the Council as regards the allocation of reporting-country codes and amending Commission Regulation (EC) No 1358/2003 as regards the updating of the list of Community airports

• Commission Regulation (EC) No 1358/2003 of 31 July 2003 implementing Regulation (EC) No 437/2003 of the European Parliament and of the Council on statistical returns in respect of the carriage of passengers, freight and mail by air and amending Annexes I and II thereto

In the domain of IT, research and development statistics we have 5 different questionnaires, which are divided into the following statistical activities: IT in the company, IT in the household, Business Innovation Survey, Research and development, Research and Development (in the company). We made proposals to substitute some variables with new data sources in 2 questionnaires. Our proposals and the results are compiled in the table below.

Suggestions and outcomes

1. Suggestion: The number of employees in the research and development field with their scientific field, age and gender from the Working Register (TÖR).
   Outcome: This suggestion was accepted partly. The information that TÖR has about employees in the research and development field does not match the definitions used in the specific questionnaires. However, TÖR can be used for checking the data collected by questionnaire.

2. Suggestion: The number of Information and Communication Technology specialists in a company from the Working Register (TÖR).
   Outcome: This suggestion was accepted partly. Initially, TÖR could be used for checking the data collected by questionnaire, and if TÖR’s quality gets better, we might be able to use it fully.

Obligations:

• Regulation (EC) No 808/2004 of the European Parliament and of the Council of 21 April 2004 concerning Community statistics on the information society

• Commission Regulation (EC) No 753/2004 of 22 April 2004 implementing Decision No 1608/2003/EC of the European Parliament and of the Council as regards statistics on science and technology

• Commission Implementing Regulation (EU) No 995/2012 of 26 October 2012 laying down detailed rules for the implementation of Decision No 1608/2003/EC of the European Parliament and of the Council concerning the production and development of Community statistics on science and technology


3.1. Summary and encountered difficulties

Completing this task was really challenging for us, because our temporary employee had to work through a lot of information. However, we managed to analyse the questionnaires and available data sources of culture, agriculture, accommodation, energy, transportation, and IT, research and development statistics. We now have an overview of the step-by-step processes that need to be carried out in order to find new sources or new use cases for the administrative data already in use.

Some of our proposals were easily applicable, but some of the suggestions need further analysis from the statistical domain experts.

In the field of culture we had six proposals. Proposals 2 and 3 are waiting for the redesign of the Museum questionnaire, and the redesign process will not be finished before 2021.

Regarding proposal 1, to use data from the Estonian Film Database, we have already started the negotiation process and drawn up a draft cooperation agreement. Hopefully it will be signed this year, so that we can start using the data next year.

Proposal 4 is already partly in production. We are currently receiving data from Piletimaailm, but this company is not the only seller of tickets to cultural events in Estonia. For more complete data, we have started negotiations with another company, Piletilevi. However, negotiations with private sector companies are time-consuming and we are not sure when we will be able to receive data from Piletilevi.

Proposals 5 and 6 are about using the Working register (TÖR) data. As the Working register is quite a new register in Estonia, the data are still quite incomplete as regards job titles. However, we expect the completeness to improve by the end of this year, after which it will be possible to use the data across all statistical domains.

In the field of agriculture we had seven proposals. Proposals 1, 2 and 3 are already in production. Proposal 4 was also about using the Working register, and it has to wait for better data completeness and for analysis from statistical domain experts.

Regarding proposal 5, we have already received the first dataset from the Estonian Land Board according to the new methodology, but its usability has to be analysed further, and we may still need to process the data more before it can be used directly in our statistical production process.


Proposal 6 to receive further information on organic farming is still in the draft agreement format. We have compiled our data needs and explained them to the Agricultural Board, but as their information system is still in development, we have not been able to receive the data or sign the new agreement yet. Hopefully we will be able to sign the agreement and get first datasets at the beginning of 2020.

Proposal 7 is not in production yet, because we have not received the confirmation from Veterinary and Food Board that they have data about fur animals. Our next step is to arrange the meeting with the data owner and clarify our data needs.

In the field of accommodation statistics we had two proposals to start using data from Enterprise Estonia. Our next step is to find out how reliable the information in this database is. According to our current information, the enterprises enter the information there themselves on a voluntary basis, which means that the data completeness may not be very good.

In the field of energy statistics we also had two proposals. Proposal 1 is about using monthly data from Elering. Currently we receive data from Elering once a year, and since Elering is a private sector company, the negotiations for more frequent data capturing will take time. We have planned a meeting with them to discuss whether it would be possible to start capturing monthly data in an automated way, for example using X-Road.

Proposal 2 is about using more data from the Estonian Road Administration. We are negotiating to renew our data delivery agreement and to automate the data capturing from the Estonian Road Administration. However, the negotiations are taking some time, because the information systems of the Estonian Road Administration are under development. We are finalising the draft agreement with our data needs and we hope to renew the agreement during next year.

In the field of transport statistics we had three proposals. The proposal number 1 involves getting microdata from Tallinn Airport. Unfortunately, their first answer was negative, because they consider giving microdata to third parties as a security risk. At the moment it is still unclear whether we would be able to justify our data needs legally and prove our data protection rules will ensure that it is safe to send data to Statistics Estonia.

Proposal number 2 was about using data from the Estonian Civil Aviation Authority’s website. Our next step is to find out how the updating of the website is organised. For that we have to contact the authority responsible for the website. Hopefully we will get some answers by the end of this year and can then decide whether the proposal can be realised in the production process.

Proposal number 3 was also about using additional data from Estonian Road Administration and that will have to wait until the negotiations and renewal of the data delivery agreement have been finished.

In the field of IT, research and development we had two proposals, both about using the data from the Working register. We will wait until the end of this year to analyse the completeness and quality of the register data and can then decide how different statistical domains can use the data in their statistical activities.

The main difficulty in performing this task was going through a huge amount of information while trying to find new solutions and sources for the questionnaire-based statistics. Statistics Estonia is aiming to use more administrative data, and analysing questionnaires domain by domain is an innovative approach for us that had not been tried before because of the lack of human resources.

4. Mapping management processes of administrative data and metadata in Statistics Estonia

Statistics Estonia’s goal is to produce high quality statistics as efficiently as possible. Efficient production is possible if we improve and widen our administrative data use. Wider use of administrative data also reduces administrative and response burden. Statistics Estonia is already using over one hundred administrative sources. However, it has become challenging to manage all the information related to administrative data sources, for example information about cooperation agreements, deadlines, process phases etc.

During the project we have started to analyse and map the processes of managing administrative data and metadata in Statistics Estonia. The first task was to map the “as is” process. Below is the result of the mapping of “as is” processes.


Figure 1. As-is process of managing administrative data and metadata in Statistics Estonia

This process map covers the process of using and managing administrative data, from the first phase, where the data need is identified, to the actual usage of the data in statistical production. The process map involves five different departments of Statistics Estonia and the process goes through the GSBPM phases Specify Needs, Design, Build, Collect, Process and Metadata Management/Quality Management.

The Statistics Design Department (ATAO) has the central role in this process map. The department was created in 2017 and has since had the central role in managing administrative data and metadata. Metadata management had been centralised in Statistics Estonia’s Methodology Department since 2004, and managing and capturing administrative data was formerly the responsibility of the Data Warehouse Department. But as Statistics Estonia has started using administrative data more and aims to create and develop closer partnerships with the data owners, it was decided to centralise the management of metadata and administrative data in the Metadata team of the Statistics Design Department.

The process map above describes the processes after the creation of the Statistics Design Department. We are working on optimising the processes of managing administrative data, which means that we want to provide the data for statistical production more efficiently and in a more standardised way.

Below is the result of mapping the “to be” processes. For better understanding we split the administrative data management process into separate process maps. Our aim is to simplify the usage and analysis of administrative data for the statistical departments and also to shorten the time it takes to get access to new data.

Figure 2 shows the process of managing a new or changed data need. The Metadata team has the central role in this process, and the process goes through the Specify Needs (1) and Design (2) phases of GSBPM. After getting input from Analysts, the Design phase is carried out by the Methodologists in the Metadata team. The Design phase for administrative data includes defining the variables that need to be captured from the data source, compiling information for the data request or contract and preparing the data requests and contracts. Most of the communication and negotiation with data owners takes place in this phase. The administrative data manager’s role in this phase is similar to that of an intermediary or a “translator”: it is important to define the data needs as clearly as possible.

Administrative data management in the Design phase includes describing metadata for administrative data centrally, in cooperation with the owners of registers and statistical domain departments. The wide use of administrative data in SE has produced a lot of information related to data sources, for example information about cooperation agreements, data requests, data delivery deadlines, data structures, formats, additional information about data, communication with data owners, process phases, etc.

The deadlines for data transmission in SE are currently managed and visualised in the web application JIRA. JIRA makes it possible to monitor the process of data deliveries, data loading, processing, etc. There are different tasks for every data delivery, and every task and subtask can be assigned to a different person. Whenever problems or obstacles arise in some process phase, the questions and answers are inserted in JIRA as comments. This gives an overview of the workflow related to a specific dataset.


Figure 2. To-be process of agreements with data owners and managing administrative data and metadata

Figure 3 shows the data capturing process, which ends with making the data available for analysts. This process goes through the Build (3) and Collect (4) phases of GSBPM.

Build and Collect phases for administrative data are the responsibility of the Data Service Department. In these phases, pre-processing the data and making them available to the NSI’s in-house applications is the role of administrative data managers. It is ensured through these procedures that there are no duplicate data and that the data are ready for statistical analysis.

Administrative data are captured through different channels:

1) encrypted .csv or .xls(x) files by e-mail, FTP or cloud services;

2) X-Road services that are divided into:

• pull services – the data owner has developed an X-Road service the content of which is suitable for SE. The data are pulled to SE through the X-Road service.

• push services to xGate – the data are pushed to SE through our xGate service. This is the preferred channel for data capture, because SE validates the received data against XSD, and the data delivery process is controlled by SE.

When administrative data have been captured through different channels, the loading processes begin. The first step is loading the data to the Initial Observation Registry (IOR). When the data are sent by .csv or .xls(x) files, the data will be loaded to Oracle database as they arrive. Loading and processing the data that has been sent with files is time-consuming for us, because there are constant problems with agreed data structures and wrong data formats.

When data are captured by X-Road pull services, the XML file is parsed to the IOR by Oracle tools. When data are captured by xGate, the file is parsed and validated against the XSD file generated in the iMeta system. After loading the data to IOR, it is possible to give the first feedback about the received data. The captured data cannot be loaded if the formats are incorrect or there are missing variables.
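The two reasons given above for a delivery not being loadable, missing variables and incorrect formats, can be illustrated with a minimal sketch. It is not Statistics Estonia's loading procedure, which relies on Oracle tools and on structures described in iMeta. The variable names, the format labels and the function `check_delivery` are hypothetical, and a .csv delivery with a header row and ISO dates is assumed.

```python
import csv
import io
from datetime import datetime

# Hypothetical agreed structure: variable name -> expected format.
# In practice such descriptions would come from the metadata system.
AGREED_STRUCTURE = {
    "animal_id": "integer",
    "slaughter_date": "date",        # ISO format YYYY-MM-DD is assumed
    "carcass_weight_kg": "decimal",
}


def _is_valid(value, expected):
    """Return True if the text value can be read in the expected format."""
    try:
        if expected == "integer":
            int(value)
        elif expected == "decimal":
            float(value)
        elif expected == "date":
            datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def check_delivery(csv_text, structure):
    """Compare a delivered CSV file with the agreed structure."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    # Variables that were agreed on but are absent from the file.
    missing = [name for name in structure if name not in header]
    format_errors = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name, expected in structure.items():
            if name in missing:
                continue
            value = row[name]
            if value is None or not _is_valid(value, expected):
                format_errors.append((line_no, name, value))
    return {
        "missing_variables": missing,
        "format_errors": format_errors,
        "loadable": not missing and not format_errors,
    }
```

The returned dictionary could serve as the basis of the first feedback to the data owner, since it names each missing variable and each value that does not match the agreed format.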

The next step is Data Staging Area (DSA), where data structure checks and conversions to correct formats take place. These checks and conversions are done according to the metadata descriptions in iMeta. It is also possible to develop more contextual checks, but for this, the input for the rules is needed from statistical domain departments. After DSA, it is possible to automatically generate a quality report about the delivered dataset. The last step is to make the data available for users, which means that the data are loaded to Final Observation Registry (FOR) and are pseudonymised if the data include personal data. The process of pseudonymisation involves removing personal identification numbers, names and contacts from the data. PIN-numbers are replaced with unique identifiers that allow the data to be joined. The unique in-house identifiers are not derived from PIN-numbers, which means that it is not possible to convert the unique identifiers mathematically to PIN-numbers.
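The pseudonymisation step described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually implemented at Statistics Estonia. The field names (`pin`, `name`, `phone`, `email`, `person_id`) and the class `Pseudonymiser` are hypothetical. The sketch keeps a lookup table from PIN to a randomly generated identifier, so the identifier is not derived from the PIN and cannot be converted back to it mathematically, while the same person still receives the same identifier in every dataset.

```python
import secrets

# Hypothetical names of fields that identify a person directly.
DIRECT_IDENTIFIERS = ("name", "phone", "email")


class Pseudonymiser:
    """Replaces PINs with random in-house identifiers via a lookup table."""

    def __init__(self):
        self._pin_to_id = {}   # must be stored securely and separately
        self._used_ids = set()

    def identifier_for(self, pin):
        """Return the identifier for a PIN, creating one on first use."""
        if pin not in self._pin_to_id:
            new_id = secrets.token_hex(8)      # random, not computed from the PIN
            while new_id in self._used_ids:    # guard against a rare collision
                new_id = secrets.token_hex(8)
            self._used_ids.add(new_id)
            self._pin_to_id[pin] = new_id
        return self._pin_to_id[pin]

    def pseudonymise(self, record):
        """Return a copy of the record without PIN, names and contacts."""
        cleaned = {
            key: value
            for key, value in record.items()
            if key != "pin" and key not in DIRECT_IDENTIFIERS
        }
        cleaned["person_id"] = self.identifier_for(record["pin"])
        return cleaned
```

Because the identifier is stable per person, records from different registers can still be joined on `person_id` after pseudonymisation. The design choice of a stored lookup table rather than a hash of the PIN reflects the statement that the identifiers are not derived from PIN numbers.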

The data are stored and versioned in Oracle databases, which are available for use to statistical domain departments through SAS or R.


Figure 3. To-be process of data capturing and making it available to users

4.1. Summary and difficulties encountered

The process of managing administrative data had already changed after the creation of the Statistics Design Department. However, our goal is to redesign the processes so that administrative data are provided to statistical departments more efficiently and in a standardised way.

The main difficulty of mapping the current process was related to the fact that many departments are involved in this process. This also makes optimizing the processes challenging, because every step of the process has to be analysed thoroughly in order to find the solutions of how to simplify the process and shorten the time used for different project steps.

To understand better how to make our administrative data management more efficient, it was very helpful to read the document “Good practices in accessing, using and contributing to the management of administrative data” (Eurostat, 2018). The main advantage of this document is that it compiles the experiences of different NSIs. It is reassuring to know that other statistical offices are on the same path and that we are all moving towards better partnerships and administrative data management processes. The document also gives an idea of which countries we could learn from and ask for guidance.

To-be processes were mapped with as much detail as possible. That enables us to monitor the processes and make changes, if necessary.

Our next step is to create a description of each process step and to document what is done, how and by whom in every stage of the process. The goal is to create written instructions in order to make the workflow smoother and to help new team members learn what to do more easily.

5. Creating vision document on how to give feedback to the data owners about data transmission deadlines and agreed data structures One part of optimising and standardising the processes of managing administrative data related information, is the automation of different notifications and feedbacks. Currently we are sending e-mails prior to data transmission deadlines manually and only to those data owners, who tend to forget their data deliveries. At the moment Statistics Estonia does not have an information system for automated data structure checks and for monitoring data transmission deadlines of administrative sources. We are in the progress of working out the vision document on how to give feedback to the data owners about data transmission deadlines and agreed data structures. We have analysed what type of information we need to manage in the information system – this includes the deadlines of data deliveries, related contacts and contract information and also the information about data structures, formats and metadata. We have also analysed different information systems that are already in use in our statistical production process and there are some information systems that could be developed further to provide some of the functionality needed for managing different information and send out automatic notifications. If the compliance with the agreed data structures and metadata would be checked automatically, then we also could generate the quality report for sending the feedback to the data owners. The analysis of our current information systems showed that we would need to develop new information system to enable automated checks and feedback. SE has created a vision document to develop new information system Administrative Data Gate. It will help automate the administrative data management in Design, Build and Collect process phases. 
The main functionalities of the Administrative Data Gate are:
• monitoring data deliveries and sending automated feedback and reminders to data holders;
• reading metadata from SE’s metadata management system and checking delivered data against the agreed structures and content;
• converting data to the formats or structures needed by the statistical domain departments;
• logging and monitoring every procedure performed on a specific dataset;
• a dashboard with the main operations visible to users.

The Administrative Data Gate would become the single channel through which all administrative data passes, as shown in Figure 4. The input data can come in different formats (csv, txt, xls, ods, xml, json) or through different channels (X-Road push/pull services, e-mails), but all data is guided through the Administrative Data Gate, where automated data checking and corrections are done.


After the data checking, the quality feedback report is generated and sent to the data owner. The content of the quality feedback report is not yet settled, but it will definitely contain information about compliance with the agreed data structures and data formats.

Figure 4. Dataflow through Administrative Data Gate
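The check-then-report flow described above could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the planned implementation: the agreed structure is hard-coded here rather than read from the metadata management system, only CSV input is handled, and the column names, `check_structure` and `quality_report` are hypothetical.

```python
import csv
import io

# Hypothetical agreed structure for one dataset: column name -> expected type.
AGREED_STRUCTURE = {"person_id": int, "birth_date": str, "county_code": str}


def check_structure(csv_text, agreed):
    """Compare delivered CSV text with the agreed structure.

    Returns a list of findings; an empty list means the delivery complies.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    delivered = reader.fieldnames or []
    findings = []
    for name in agreed:
        if name not in delivered:
            findings.append(f"missing column: {name}")
    for name in delivered:
        if name not in agreed:
            findings.append(f"unexpected column: {name}")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name, expected in agreed.items():
            value = row.get(name)
            if value is None or expected is not int:
                continue  # only integer columns are format-checked in this sketch
            try:
                int(value)
            except ValueError:
                findings.append(f"line {line_no}: {name}={value!r} is not an integer")
    return findings


def quality_report(dataset, findings):
    """Format the findings as a plain-text feedback report for the data owner."""
    status = "compliant" if not findings else f"{len(findings)} issue(s) found"
    lines = [f"Quality feedback for {dataset}: {status}"]
    lines += [f"- {finding}" for finding in findings]
    return "\n".join(lines)


# Hypothetical delivery with one missing column, one extra column and one bad value.
delivered_csv = "person_id,birth_date,extra\nabc,2019-01-01,x\n"
print(quality_report("births", check_structure(delivered_csv, AGREED_STRUCTURE)))
```

Returning findings as data and formatting them in a separate step would let the same result be logged, shown on a dashboard or sent to the data owner.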

5.1. Summary and encountered difficulties

We have analysed our needs and have an overview of the functionality needed to manage information related to administrative data efficiently and to run automated controls on delivered data sets.

However, it has been difficult to decide whether we need to develop a new information system to provide the needed functionality or whether some of the applications already in use could be developed further to meet the needs. The analysis showed that we need to develop a new information system.

Now the challenge is to find financial and human resources to start the development process of the Administrative Data Gate. Statistics Estonia has already applied for financial support from the SF funds, but the feedback for the application has not arrived yet. So the timeline for the development process is still unknown.


6. Describing metadata for the data sources whose cooperation agreements are renewed in the metadata system

Statistics Estonia uses about one hundred administrative data sources in its statistical production process. Describing and harmonising the metadata for administrative data is time-consuming, because several metaobjects in our metadata management system iMeta have to be defined in order to fully document the captured data.

We are in the process of describing all the metadata for received administrative data, but during this grant project we will concentrate on describing and standardising the metadata of those data sources whose cooperation agreements were signed before 2010.

We have made preparations for renewing the data delivery agreements, and some of the metadata is already described in our metadata management system. The metadata description process also involves the data owners and analysts from the statistical departments. The steps for describing the metadata for administrative data are the following:
• analysing already received data and adding variable descriptions, classifications and code lists to our metadata management system;
• describing the rest of the metadata related to the first sub-task according to the Neuchâtel terminology model (conceptual variables, statistical characteristics, statistical unit types);
• cooperating with the leaders of the statistical activities to describe and harmonise metadata efficiently;
• describing metadata in the metadata system for additional data needs and giving input to the cooperation agreement renewal process.

The Neuchâtel terminology model (Neuchâtel Group, 2004) has been used for describing the variables in our metadata management system. In this model, variables are described at three levels: conceptual variable, statistical characteristic (object variable) and contextual variable. A statistical unit type is an entity for which information is sought and for which statistics are ultimately compiled. A statistical characteristic is a characteristic of a statistical unit type. A conceptual variable (concept) provides a general description of the meaning of the statistical characteristic without explicit reference to any particular statistical unit type. A contextual variable describes the variable in the context of a statistical activity. Contextual variables can be defined as register variables or cube variables.
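The relationships between these metadata objects can be pictured with a small data-structure sketch. It only mirrors the definitions given above and is not the actual schema of iMeta; the class and field names and the sample variable are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatisticalUnitType:
    """Entity for which information is sought, e.g. a person or an enterprise."""
    name: str


@dataclass(frozen=True)
class ConceptualVariable:
    """General meaning of a variable, without reference to a unit type."""
    name: str
    definition: str


@dataclass(frozen=True)
class StatisticalCharacteristic:
    """A concept applied to one statistical unit type (object variable)."""
    concept: ConceptualVariable
    unit_type: StatisticalUnitType


@dataclass(frozen=True)
class ContextualVariable:
    """A characteristic as used in the context of one statistical activity."""
    name: str
    characteristic: StatisticalCharacteristic
    statistical_activity: str
    kind: str  # "register" or "cube"

    def __post_init__(self):
        if self.kind not in ("register", "cube"):
            raise ValueError(f"kind must be 'register' or 'cube', got {self.kind!r}")


# Hypothetical example: a date-of-birth variable captured from a register.
person = StatisticalUnitType("Person")
date_of_birth = ConceptualVariable("Date of birth", "The date on which the unit was born")
person_birth_date = StatisticalCharacteristic(date_of_birth, person)
birth_date = ContextualVariable(
    name="BIRTH_DATE",
    characteristic=person_birth_date,
    statistical_activity="Births",
    kind="register",
)
print(birth_date.characteristic.concept.name, "of", birth_date.characteristic.unit_type.name)
```

In this layout one conceptual variable can be reused by several statistical characteristics, and one characteristic by several contextual variables.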


Our goal was to describe and harmonise all the metadata of those administrative data sources whose cooperation agreement was signed before 2010. So we started out by describing and harmonising all necessary metadata objects for:
• Estonian Tax and Customs Board
• National Institute for Health Development
• Estonian Land Board
• Agricultural Board
• Agricultural Research Center

The Estonian Tax and Customs Board is a very important data source for us. They are the owners of several state registers, and SE captures 80 different datasets from them every year. The frequency of data capture varies from once a day to once a year. For this source we had to describe and harmonise 483 different contextual variables and all the corresponding metadata objects. Quite a few variables had to be clarified with the data owners, because the forms of tax and customs declarations change constantly and, for the contextual description of metadata, we had to be sure that we understood each variable thoroughly.

The National Institute for Health Development is the source for death and birth statistics for SE. From this source we capture 147 different variables. The content of those variables was quite clear to us, and it was not too troublesome to describe them in our metadata management system. Unfortunately, we found out that the National Institute for Health Development is starting major developments in their information systems in order to unite several smaller registers into one big register. That means we have to be ready for changes in data content and will have to revise our metadata descriptions once the development has taken place.

The Estonian Land Board has always been a good cooperation partner for Statistics Estonia. They are the owners of the Address Data System, which enables all the registers to exchange address data in a harmonised way. For this source we had to describe 162 variables and the corresponding metadata objects. As we started to prepare the new data delivery contract and review our current data needs, we also discovered that, due to some changes in Estonian legislation, from the start of 2019 the Estonian Land Board no longer collects some of the variables that are needed in our statistical production process. That means our analysts have to change the methodology of their statistical activities.


For agricultural statistics, one of the most important sources is the Agricultural Board. At the moment we capture data once a year, but with the signing of the new data delivery agreement we would like to start capturing data twice a year. We described 88 variables and the corresponding metadata objects for the Agricultural Board. Some of those variables remain in draft version until our negotiations to renew the agreement are finalised. However, we have had several very useful meetings with the source and were also able to incorporate the available data more efficiently in our statistical production process.

The Agricultural Research Center is also an important source for agricultural statistics. Hopefully, we will start receiving twenty data sets and 63 variables from that source. As the negotiations for the new data delivery agreement are still in progress, the metadata description is also in draft version. We are ready to change or supplement our current metadata descriptions when the agreement is finalised.

6.1. Summary and difficulties encountered

The main difficulty in performing this task is understanding the conceptual meaning of the data correctly. To standardise and harmonise the metadata of administrative data and document it in our metadata management system iMeta, we needed to involve the data owners and the data users from our statistical departments. For two sources, the Agricultural Board and the Agricultural Research Center, the metadata descriptions are partly in draft version. That means we have made all the necessary preparations for describing them, but they are not yet published in our metadata management system. Once the negotiations to renew the data delivery agreements are finalised, we can publish the metadata descriptions as well. So although the data descriptions are made and managed centrally in Statistics Estonia, there are other parties to the process whose knowledge has to be considered. This makes the process time-consuming, and meetings to agree on data definitions have to be held.


7. Renewing cooperation agreements made with data owners before the year 2010

During the grant project we plan to renew the cooperation agreements that are in force and were signed before 2010. This is important because before 2010 Statistics Estonia used a different contract format, which did not specify, for example, the structure of the delivered data. We are now moving towards automated data capturing and controlling systems, so it is really important to agree on specific data structures, formats and metadata.

Our analysis of data delivery contracts showed that we need to renew our contracts with five different institutions. Almost all of those institutions own several registries from which Statistics Estonia captures different data sets. We have started preparing new agreements with:
• Estonian Tax and Customs Board
• National Institute for Health Development
• Estonian Land Board
• Agricultural Board
• Agricultural Research Center

It is important to use the new contract format, in which both the main part of the contract and the annex for the detailed data composition are updated. The main part of the new contract format consists of:

1. General information (details of the parties and the purpose of the contract);
2. List of the contract’s documents (annexes to the agreement, if any);
3. Object of the contract (content of the contract, explanation of the concept “data” and the method of transmission);
4. Rights and obligations of the parties (a list of rights and obligations that all parties need to follow);
5. Confidentiality (the confidentiality obligation of the parties is stated);
6. Contract performance obligations (data transmission is at no cost, but the costs of performance of the contract shall be borne by each party from its budget);
7. Force majeure (a list of situations that obstruct the continuation or lawful existence of the contract between the parties);
8. Modification, completion and termination of the contract (the rules for modification, completion and termination of the contract for all parties);
9. Resolving disputes (how disputes arising from performance of the contract shall be resolved);
10. Other terms;
11. Contact information.

New annex(es) include the composition of the data at the variable level and contact persons for the transmission of data.

The renewal process included describing metadata for the captured datasets, because in the annexes we always define the data composition at a detailed level.

The Estonian Tax and Customs Board is a major data source for us. The data delivery agreement with them has been in force since 2007. Since then SE’s data needs have grown, and quite a few changes have taken place in the registers of the Tax and Customs Board. It was absolutely essential to renew the cooperation agreement. We started the preparations by mapping the actual data needs of SE. For every data set we had meetings with the analysts who need the data and specified the data content. From those meetings we gathered questions and information that needed to be negotiated with the data source.

Some of the negotiations with the Tax and Customs Board took place via e-mails and phone calls. However, it is always more efficient to have the necessary persons around one table to agree on something.

The data content negotiations needed the involvement of the subject matter experts from both sides. As Statistics Estonia is using 80 different data sets from Tax and Customs Board, we had to arrange several meetings to specify the data content.

There were also separate meetings with the lawyers of both parties. Statistics Estonia has worked out a standard data delivery agreement. However, the Tax and Customs Board has its own standard agreements for data exchange. So it was necessary to address legal issues and work out an agreement that suits both institutions. The legal negotiations were also successful, and we managed to sign the new data delivery agreement with the Tax and Customs Board in May 2019.

The National Institute for Health Development is an important source for population and social statistics. With that institution we have two separate data delivery agreements, one for each register. Our goal is to have only one agreement with the National Institute for Health Development that covers the birth and death data. At the moment we have specified the data need within SE and also with the data source and prepared the draft data delivery agreement. However, we have not been able to complete the agreement, because the National Institute for Health Development is currently developing their information systems, which will surely change the data composition of the registers. The new information system will unite all the registers, and we are waiting for information on the data content to complete the data delivery contract renewal process.

The Estonian Land Board is a very important cooperation partner for us as regards address data. Harmonised addresses are essential to every step of the statistical production process. In addition, the Estonian Land Board is a data source for macrostatistics and agricultural statistics. With this source we have signed several data delivery agreements at different points in time, and these are no longer up to date. The main deficiency is that the data content is not defined in the annexes. To renew our existing contracts, we started by mapping the data need in SE and describing the variables already captured. Based on that, we prepared a new draft agreement with an annex that covers all our data needs. Then we held several meetings with the Land Board specialists to specify the content and structure of our datasets. Now the details of the data content are clear and the draft data delivery agreement is ready to be signed. However, the specialists of the Land Board stated that they do not need a signed version of the data delivery agreement and that they are ready to send the data in the agreed structures and formats by the agreed deadlines. At the moment the prepared new data delivery agreement is not signed, but works as a gentlemen’s agreement between the two institutions.

The Agricultural Board is a very important source for agricultural statistics. At the moment we have a data delivery agreement in force with them, but the annexes do not contain the needed data content descriptions, and some of the datasets are not mentioned in the agreement. In cooperation with our analysts we prepared a draft data delivery agreement, and after that we had several meetings with the Agricultural Board’s specialists. The data content for the data deliveries is agreed, but there are some issues with the structures of the datasets and the data delivery channel. The Agricultural Board is in the process of developing a new information system, and they would like to deliver data through an X-Road service from their new information system. Unfortunately, the development process is taking longer than expected, and that means we have not signed the data delivery agreement yet.


The data delivery agreement with the Agricultural Research Center is in draft version and negotiations are still ongoing. SE needs microdata from that source, but so far we have received only aggregated data. There are some legal issues that have to be cleared before we can start capturing microdata, and the negotiation process will begin again in the autumn.

7.1. Summary and difficulties encountered

We have analysed the information related to data delivery contracts and identified the institutions with whom the contracts need to be renewed.

Our temporary employee prepared the agreements in the new format according to our ongoing data deliveries. Then we had several meetings within SE to specify the data needs with analysts from different statistical departments. That was time-consuming, because we had to bring different interested specialists together and agree on the data content of the draft data delivery agreements.

It was also challenging to negotiate with the data sources. As mentioned above, some of the sources are developing their information systems, which means they are not sure of their ability to send us the datasets we need.

In conclusion, keeping the data delivery agreements up to date is challenging and time-consuming, mainly because it involves several interested parties, for example lawyers and subject matter experts.

References

E-estonia (2019), E-estonia’s webpage https://e-estonia.com/solutions/interoperability-services/x-road/

Eurostat (2018), Good practices in accessing, using and contributing to the management of administrative data https://ec.europa.eu/eurostat/cros/system/files/admin-wp1.2_good_practices_final.pdf

Neuchâtel Group (2004), Neuchâtel Terminology Model for classifications (version 2.1) and variables (version 1.0)
