Conference on New Techniques and Technologies for Official Statistics (NTTS 2019) Brussels, 12–14 March 2019 (incl. satellite events on 11 and 15 March 2019)

Book of Abstracts

Population Statistics without a Census or Register 180 Owen Abbott, (Email) Office for National Statistics, Fareham, United Kingdom The Office for National Statistics (ONS) is exploring models for the future of population statistics. The UK does not have a population register or a set of coherent identifiers across administrative datasets held by government. The current population statistics system is underpinned by the decennial Census, which is expensive and is arguably becoming increasingly unwieldy as a source of data in a rapidly evolving society and with ever-increasing demands for more timely, relevant statistics. The system is also highly reliant on a port-based survey to measure migrant flows to and from the UK, with the result that the intercensal population size estimates tend to have an increasing element of bias. The ONS is therefore researching how it can transform its population statistics system within that context. This paper outlines the current plans for this, and focuses on some of the methodological challenges underpinning the transformation.

The Effect of Using New Technology and Geographic Information System on the Quality of Official Statistics: the Implementation of the General Palestinian Census 2017 as a Case Study 074 Aya Amro, (Email), Mosab Abualhayja, (Email) Palestinian Central Bureau of Statistics (PCBS), Ramallah, Palestine One of the most vital requirements for planning and decision making in any field of human endeavor is the quality of information available on its human resources. In this sense, the Palestinian Central Bureau of Statistics (PCBS) has always sought to offer the most objective and accurate statistical figures, relying on the latest techniques and methodologies, in order to be in line with the international recommendations regarding data collection and statistics production. As Geographic Information System (GIS) technology is nowadays used in a wide spectrum of official statistics activities, from data collection to statistics compilation and data dissemination, PCBS is very keen to keep pace with the revolution in GIS technology. PCBS has implemented the third Population, Housing and Establishments Census 2017 using GIS technology in all phases of the census process, which improved the overall quality of census activities compared with the previous census of 2007. Using this technology was also of great benefit to data coverage, providing access to many remote areas and buffer zones in the Gaza Strip and the West Bank. The contribution of this paper is twofold. Firstly, it describes in detail the procedures and methodologies used in conducting the census through its technical and field operations, and it shows the main resulting benefits of using GIS applications as well as the challenges and obstacles arising during each phase. Secondly, the paper focuses on the main differences between this census and the 2007 census, in which traditional techniques were used, and their effect on data quality and coverage. This paper aims to provide new recommendations regarding the use of GIS technology in future statistical surveys and censuses, in order to achieve better data quality and a higher coverage rate.

Alternative "optimal" calibration weights using a modified distance measure 132 Per Gösta Andersson, (Email) Department of Statistics, Stockholm University, Stockholm, Sweden When dealing with nonresponse in survey sampling, calibration has proved to be a useful technique for reducing bias in estimators of population totals using auxiliary information. The setup is that we take a random sample from a finite population, but due to nonresponse we only observe study variable values in the response set, which is a subset of the sample. The auxiliary information can be known either at the sample level or the population level, or both. Linear calibration, as suggested in (1) and (2), is now widely used in national statistical offices throughout the world. This type of calibration is akin to GREG estimation and has proved to be efficient especially in combination with simple random sampling. The distance function (measure) to be minimized corresponding to the resulting calibration weights under full response is of a simple chi-square type. It turns out that under nonresponse a similar function generates the weights which are presented in (1). This is shown in detail in (3), where it is also pointed out that a problem with the function whose value we want to minimize, given our observations in the response set, is that we are still comparing the calibration weights with the original design weights. The latter weights should not be used under nonresponse. However, as is also shown in (3), there is an invariance property involved for many important cases for this type of calibration. Specifically, this means that if we, for example, multiply the design weights by a constant larger than one, to compensate for the nonresponse, the resulting weights will be the same. Furthermore, we get the same effect if we group the observations and allow for a unique multiplicative constant in each group.
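
As background to the modified distance measure discussed above, the following is a minimal Python sketch of standard linear (chi-square distance) calibration under full response, which is the baseline the abstract starts from; the design weights, auxiliary variables and population totals are invented for illustration and this is not the author's method.

    # Linear calibration: weights w close to design weights d (chi-square distance)
    # that reproduce known population totals of the auxiliary variables.
    import numpy as np

    rng = np.random.default_rng(0)
    n_resp = 100                              # units in the (response) set
    d = np.full(n_resp, 50.0)                 # design weights (e.g. N/n under SRS)
    X = np.column_stack([np.ones(n_resp),     # auxiliaries: intercept + one x-variable
                         rng.gamma(2.0, 10.0, n_resp)])
    t_x = np.array([5000.0, 101000.0])        # assumed known population totals of the auxiliaries

    # Minimize sum_i (w_i - d_i)^2 / d_i subject to X'w = t_x,
    # giving w_i = d_i * (1 + lambda' x_i).
    T = X.T @ (d[:, None] * X)                # sum_i d_i x_i x_i'
    lam = np.linalg.solve(T, t_x - X.T @ d)   # Lagrange multipliers
    w = d * (1.0 + X @ lam)

    print(np.allclose(X.T @ w, t_x))          # calibration constraints are satisfied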

Reducing response burden by using administrative cash machine data in Hungarian retail trade statistics 025 Agnes Andics, (Email), István Macsári, (Email), Zsolt Takács, (Email) HCSO, Budapest, Hungary New data sources play a key role in official statistics, since more and higher-quality data are needed in parallel with reducing the reporting burden. Due to new legislation introduced in 2014, enterprises operating cash machines involved in the online cash register system are obliged to send online information about sales in the retail chain to the Hungarian tax office. There was an obvious demand to change the methodology of retail trade statistics in a way that exploits the potential of these data. Cash machine data contribute to reducing the reporting burden in retail trade. Since retail sales are estimated at shop level, matching shops and cash machines was necessary. Mistakes made by cashiers and some missing-data problems needed to be handled. Since some of the retailers are not obliged to use the online system, part of retail trade still has to be estimated without these data.

BIG DATA: EUSTAT EXPERIENCE, DEVELOPMENT OF THE PILOT PROJECT TOWARDS PRODUCTION 086 Jorge Aramendi, (Email), Elena Goni, (Email), Marina Ayestaran, (Email), Javier San Vicente, (Email) EUSTAT - Basque Statistical Office, Vitoria-Gasteiz, Spain “Big Data” represents a paradigmatic change for Official Statistics and Eustat, aware of this, has created a cross-sector group for all the institutions involved, relying on the collaboration of the University of the Basque Country. In 2015, a data-capture pilot to establish hotel prices on the Internet was proposed, allowing Eustat to develop its own online data-capture programme, to examine data purging tasks in greater detail and to confront the challenge of storing large volumes of data. We are currently dealing with the analysis of data gathered from August 2017 to the present. The analysis includes three new products and we are using Machine Learning techniques to carry it out. The three new products are: an alternative to the monthly ADR (Average Daily Rate) estimates by region that Eustat publishes using data from the Tourism Survey; a spatial analysis of prices; and hotel patterns depending on seasonal price variations. Python is the software that has been used for data collection and for the analysis of the results. The following sources of information were used to analyse hotel prices in the Basque Country: prices scraped from Booking, the Eustat Survey on Tourist Establishments and the Tourism Directory.

Adaptation of Winsorization caused by weight share method 116 Arnaud Fizzala, (Email) INSEE, Paris, France The French Structural Business Statistics (SBS) production system, known as ESANE, has two main uses: - production of statistics based on the European SBS regulation; - estimation of businesses’ contributions to GDP for the national accounts. ESANE is currently changing to produce estimates based on profiled units or enterprises and no longer on legal units. Several methodological studies have been conducted to support this change, and the following study concerns the adaptation of the treatment of influential values by winsorization for the dissemination of results at the enterprise level. To manage the update of the set of legal units belonging to each enterprise, the generalized weight share method is used. The aim of this study is thus to adapt Kokic and Bell's winsorization method to weight sharing. We compare four scenarios with a simulation study based on 2016 ESANE data.
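
For context, here is a minimal Python sketch of one-sided winsorization of influential values in the spirit of Kokic and Bell; the values, weights and threshold are invented, and the weight-sharing adaptation studied in the paper is not shown.

    # Values above a threshold K keep only a 1/w share of the excess, which
    # bounds their weighted contribution to the estimated total.
    import numpy as np

    y = np.array([120.0, 95.0, 15000.0, 210.0, 80.0])   # reported turnover of sampled units
    w = np.array([30.0, 30.0, 30.0, 30.0, 30.0])        # (possibly shared) sampling weights
    K = 5000.0                                          # assumed winsorization threshold

    y_win = np.where(y <= K, y, K + (y - K) / w)

    print("Raw weighted total:       ", np.sum(w * y))
    print("Winsorized weighted total:", np.sum(w * y_win))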

Cycle Extraction: Should the Hamilton Regression Filter be Preferred to the Hodrick-Prescott Filter? 040 Roberto ASTOLFI, (Email) 1, Philip CHAN, (Email) 1, Matthew DEQUELJOE, (Email) 1, Luigi FALASCONI, (Email) 1, 2 1 OECD, Paris, France 2 Universitat Pompeu Fabra (UPF), Barcelona, Spain This paper investigates whether the use of the Hamilton regression (HR) filter significantly modifies the business growth cycles of GDP, as compared to those extracted with the double Hodrick-Prescott (HP) filter in the OECD System of Composite Leading Indicators (CLIs). Hamilton (2017) recommends using a regression filter to overcome some of the drawbacks of the HP filter, which include the presence of spurious cycles, the end-of-sample bias and ad hoc assumptions on the smoothing parameters. In our analysis, we assess whether the use of the HR filter significantly modifies the number and the dating of turning points. We then measure the degree of synchronisation of the resulting cycles as well as the main features of their phases. Finally, we compare the behaviour of the two filters at the end of the sample in a quasi real-time framework. Results suggest that, for most of the OECD and BRIICS countries, the reduction in the number of turning points, and hence in the number of cycles, is actually negligible. Moreover, our analysis reveals that the chronology of turning points and the end-of-sample direction of the HR filter are significantly more volatile than those of the HP filter. We conclude that the use of the HR filter for recurrent forecasting of turning points and of the end-of-sample direction of the growth cycle may be affected by large revisions, lessening the credibility of the message to policymakers in the long run.
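
To make the two filters concrete, here is a minimal Python sketch contrasting the HP filter with Hamilton's regression filter on a simulated quarterly log-GDP series; the series, the smoothing parameter 1600 and the (h = 8, p = 4) choice are standard textbook settings, not the OECD's actual specification.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.tsa.filters.hp_filter import hpfilter

    rng = np.random.default_rng(0)
    T = 160
    y = np.cumsum(0.005 + 0.01 * rng.standard_normal(T))   # toy log-GDP series

    # Hodrick-Prescott filter (quarterly smoothing parameter 1600)
    hp_cycle, hp_trend = hpfilter(y, lamb=1600)

    # Hamilton (2017) regression filter: regress y_t on y_{t-8}, ..., y_{t-11};
    # the OLS residual is the cyclical component.
    h, p = 8, 4
    Y = y[h + p - 1:]
    X = sm.add_constant(np.column_stack([y[p - 1 - j: T - h - j] for j in range(p)]))
    ham_cycle = Y - sm.OLS(Y, X).fit().fittedvalues

    print(len(hp_cycle), len(ham_cycle))   # the HR cycle loses the first h+p-1 observations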

Data(trans)forming 137 Roberto Barcellan, (Email) European Commission, Luxembourg, Luxembourg Data-driven organisations process and use ever more data to improve and speed up their decision-making. The goal is to gain superior insights through analytics. In data-driven organisations, decisions are supported by evidence (data). Smarter analytics technologies now enable every company to become more data-driven. Data-driven organisations perform better and are operationally more predictable, thanks to the insights they get from data and the advanced predictions modern technologies can make. Data(trans)forming describes the process of transforming the European Commission into a data-driven organisation.

Dwelling and Building Register – Identification of rented apartments 196 Nofar Ben-Haim, (Email) Central Bureau of Statistics, Jerusalem, Israel In recent years, real estate has been a central issue in Israel. The shortage of housing and, as a result, the high prices of housing for purchase or rent have made it essential to provide policy makers with updated and relevant statistics and data. The Dwelling and Building Register (DBR) has been built in order to provide building and dwelling statistics as well as to create the infrastructure for future register-based censuses. The Israeli DBR is based on municipal taxation lists, which were found to be the most reliable and near-complete data source. We collect data from 202 urban municipalities and 55 Regional Councils. Each municipality organizes the tax list according to its needs and defines the variables differently; therefore, standardization and harmonization are needed for comparability purposes. Each record in the register carries information about the payer, the holder and the owner of the dwelling unit, and provides the opportunity to link the register data to other databases and registers via geography and population. Geo-coding the dwelling unit to the building level and to its aggregates serves for small area statistics, and hence for effective decision making. Personal Identification Numbers (PINs) generated in the Central Population Register (CPR) serve to anchor the population to the same units, so that we can estimate the household size and the relations between the owner and the holder of the dwelling unit. Due to the heterogeneity and under-coverage of the ownership variable in the municipalities' files, the ICBS calculates the ownership variable based on calculated family relations between the registered owners of the dwelling and the registered holder of the dwelling, as found in the Central Population Register (spouses, siblings, et cetera). In comparison with the Labor Force Survey (LFS), this calculation was found to be relatively reliable. However, the under-coverage of the DBR is high, 32.3%, and its over-coverage is 9.9%. Model-based imputation is employed in order to increase the coverage of the ownership variable, based on additional information provided by the municipalities and from the LFS. Developing the imputation model will involve testing the statistical correlation between the reported status in a specific municipality and the reported status in the LFS, as well as a regression tree based on the LFS, mainly for municipalities where the currently available information is not suitable for the statistical correlation mechanism. The paper will discuss the decisions which lead to the new imputed ownership status, to the subsequent improvement in the quality of the DBR, and to its possible uses for addressing the challenges that the current housing situation in Israel is presenting.
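
As a rough illustration of tree-based imputation of ownership status of the kind mentioned above, here is a minimal Python sketch; the covariates, labels and tree settings are invented assumptions, not the ICBS model.

    # A classification tree trained on labelled survey-style records is used to
    # impute ownership status (owner-occupied vs rented) for register records.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    # Toy "LFS" training data: household size, age group of holder, number of rooms
    X_lfs = rng.integers(1, 8, size=(500, 3))
    y_lfs = (X_lfs[:, 0] + X_lfs[:, 1] + rng.integers(0, 3, 500) > 8).astype(int)

    tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=25)
    tree.fit(X_lfs, y_lfs)

    # Register records with a missing ownership variable receive an imputed status
    X_dbr = rng.integers(1, 8, size=(10, 3))
    print(tree.predict(X_dbr))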

Satellite-Net: Automatic Extraction of Land Cover Indicators from Satellite Imagery by Deep Learning 045 Eleonora Bernasconi, (Email), Francesco Pugliese, (Email), Diego Zardetto, (Email), Monica Scannapieco, (Email) Istat, Rome, Italy In this paper we address the challenge of land cover classification for satellite images via Deep Learning (DL). Land cover analysis aims to detect the physical characteristics of the territory and estimate the percentage of land occupied by a certain category of entities: vegetation, residential buildings, industrial areas, forest areas, rivers, lakes, etc. DL is a new paradigm for Big Data analytics and in particular for Computer Vision. The application of DL to image classification for land cover purposes has great potential owing to the high degree of automation and computing performance. In particular, the invention of Convolutional Neural Networks (CNNs) was fundamental to the advancements in this field. In [1], the Satellite Task Team of the UN Global Working Group describes the results achieved so far with respect to the use of earth observation for Official Statistics. However, in that study, CNNs have not yet been explored for the automatic classification of imagery. This work investigates the usage of CNNs for the estimation of land cover indicators, providing evidence of first promising results. In particular, the paper proposes a customized model, called Satellite-Net, able to reach an accuracy level of up to 98% on test sets.
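
For readers unfamiliar with CNN classifiers, here is a minimal, generic Python/Keras sketch of a CNN for classifying satellite image tiles into land cover categories; the tile size, layer sizes and class list are assumptions, and this is not the Satellite-Net architecture itself.

    import tensorflow as tf

    NUM_CLASSES = 6    # e.g. vegetation, buildings, industry, forest, river, lake
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(64, 64, 3)),            # assumed RGB tile size
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()
    # model.fit(train_tiles, train_labels, validation_split=0.2, epochs=10)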

Social indicators and Big Data: a case study on social indicators and active citizenship 192 Silvia Biffignandi, (Email), Camilla Salvatore, (Email), Annamaria Bianchi, (Email) University of Bergamo, Bergamo, Italy Big Data is one of the most discussed topics in Official Statistics. The potentialities of this new data source are relevant: Big Data can offer new macroeconomic now-casting opportunities for policy-makers, providing complementary and faster information on the state of the economy and its development. In particular, the combination of data from multiple sources can provide a better overview of the economic phenomena. Furthermore, in Official Statistics the integration of Big Data with traditional data sources is a challenging opportunity for the construction of social and economic indicators. Actually, it is unlikely that Big Data will completely replace survey-based activities: they can provide complementary and specific information about a topic or they can help to assess unmeasured or partially measured socioeconomic phenomena. At international level, the discussion about social indicators and in particular quality of life, well-being and beyond-GDP activities is under constant debate. The measurement of the quality of life and wellbeing from an individual-level perspective has become very important with the rise of the “Social Indicators Movement”, and social media represents a promising data source to study new topics and aspects. Within the European Statistical System, the “Quality of life indicators framework” has been developed to measure the quality of life considering not only the GDP, but also other complementary and subjective aspects. However, it is a static measure, and the opportunity deriving from Big Data and, in particular, from social media analysis is that we can obtain dynamic indicators that show the changes over time and the reaction of people to particular events. On the other hand, new issues are arising. For example, social Big Data indicators “usually do not correspond to any sampling scheme and they are often representative of particular segments of the population”. The purpose of this paper is to use Twitter data to study social interactions and to provide an indicator of active citizenship. This is on-going research, composed of two phases. The first one, which is already concluded [6], focuses on evaluating the overall quality of an analysis based on social media. To this purpose, we develop a case study focused on sentiment analysis of Twitter data, we discuss the possible sources of errors and how to get evidence of them as well as the users’ behaviour. The second phase focuses on the development of an active citizenship indicator, “contact with politicians”, based on the framework proposed by Sánchez et al. Then, we perform an in-depth analysis to study the network relationships among users and the topics discussed. More information is provided in the next section.

The EuroGroups Register 018 Agne Bikauskaite, (Email), August Götzfried, (Email) Eurostat, Luxembourg, Luxembourg Globalisation presents significant statistical challenges, particularly for small and open economies, in terms of measuring statistical indicators and communicating the results to users. The European Statistical System has allocated high priority to better measuring globalisation in statistical processes and output, both in business and macro-economic statistics. One of the concrete actions already undertaken is the setting up of the EuroGroups Register (EGR) of multinational enterprise groups. The EGR is the central statistical business register of Eurostat and the EU and EFTA countries' statistical authorities. The EGR is part of the EU statistical infrastructure and has been built up to better capture globalisation effects as well as to improve the consistency of national data on enterprise groups. The EGR covers multinational enterprise groups operating in Europe. It provides the statistical authorities of the EU and EFTA countries with yearly population frames of multinational groups. The Register's main function is to provide the statistical authorities with a harmonised picture of multinational groups for their national statistics. The EGR has been growing in terms of quality over the years and now covers more than 110 000 multinational enterprise groups in the EU. When at least one legal unit of a multinational enterprise group is registered in an EU or EFTA country, the group is in the scope of the EGR.

Integration of volatile online prices into the consumer price index 070 Christian Blaudow, (Email), Daniel Seeger, (Email) Federal Statistical Office of Germany, Wiesbaden, Germany The online market is increasingly gaining in importance. Consumers buy more and more goods on the online market due to the great variety of product offers, time saving and independence from the closing hours of physical shops. For the German Consumer Price Statistics, which comprises the National Consumer Price Index (CPI) and the Harmonised Index of Consumer Prices (HICP), the Federal Statistical Office (FSO) collects approximately 10,000 individual prices for products on websites of online retailers. The share of these products in the overall basket of goods and services amounts to approximately five per cent and will probably rise in the forthcoming years. Since prices on the internet are fairly easy to adjust, online retailers are able to react to market conditions or consumer behaviour by adjusting prices automatically at short intervals, applying algorithms that take into account different parameters. This phenomenon is known as dynamic pricing. First studies investigating dynamic pricing in Germany have shown that different variants of dynamic pricing exist, which are very heterogeneous and not transparent. Dynamic pricing by online retailers may lead to a bias in the index calculation, since the traditional way of collecting prices via the internet is generally done at one point during the month and therefore cannot capture rapidly changing prices. Therefore, in order to display reliable price developments in the CPI/HICP, consumer price statistics needs to constantly monitor pricing behaviour on the internet and apply methods to evaluate the large amount of data and integrate very volatile price developments into price indices. The FSO has gained considerable experience through former studies on web scraping and has also conducted a study investigating the extent of dynamic pricing on the German online market. The present paper deals with the techniques applied to monitor pricing behaviour on the internet, includes research on the handling of dynamic pricing within price collection for the CPI/HICP and gives an overview of suitable methods for calculating indices.
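
The following minimal Python sketch illustrates why volatile online prices matter for index compilation: a single monthly price observation can differ noticeably from the average over daily scraped prices. The price path is simulated, not scraped data, and the comparison is purely illustrative.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    days = pd.date_range("2019-03-01", "2019-03-31", freq="D")
    # A product whose online price is adjusted algorithmically around EUR 100
    prices = pd.Series((100 + rng.normal(0, 8, len(days))).round(2), index=days)

    single_observation = prices.loc["2019-03-15"]   # traditional one-day collection
    monthly_average = prices.mean()                 # using all scraped daily prices

    print(f"price on the 15th: {single_observation:.2f}")
    print(f"monthly average:   {monthly_average:.2f}")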

Forecasting tourist arrivals with online data: An application to the Valencian Community 103 Desamparados Blazquez, (Email) 1, Fernando Reis, (Email) 2, Josep Domenech, (Email) 1 1 Universitat Politècnica de València, Valencia, Spain 2 Eurostat, Luxembourg, Luxembourg Tourism trips are increasingly planned and organised online. This generates digital traces correlated with tourist movements, which are thus potentially useful for improving the accuracy and timeliness of forecasts. This hypothesis is based on the fact that before booking a trip, travellers and tourists look for information about the destination on the Internet. Therefore, we expect a significant relation between the online popularity of a destination and the real, physical visits it receives. It is therefore possible to take advantage of these digital traces with the aim of improving tourism forecasting. Previous studies have shown the capacity of online data to improve the forecasts of tourism-related variables in different regions. Assessing the potential and applicability of online behaviour data sources to support the production of official statistics is a new line in which statistical offices are working. Eurostat is performing some pilot studies on different big data sources applied to different fields, including tourism. The study we present in this paper is developed as a pilot to assess to what extent online sources, which are providers of big data, can help to predict real tourist movements. The aims of this on-going study are twofold: firstly, to check whether online data from Google Trends and Wikipedia pageviews can help to improve the accuracy of forecasting models for tourist arrivals; secondly, to compare the two online sources in order to assess not only which one performs better, but also whether they are complements or, on the contrary, substitutes. This is fundamental for official statistics offices in deciding which online sources are worth further investigation and introduction into official statistics production, e.g. to develop flash estimates and forecasting models based on these new sources of massive, timely and granular data.
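
As an illustration of the general approach (not the authors' model), here is a minimal Python sketch of a seasonal ARIMA forecast of monthly arrivals with a lagged online-interest series as exogenous regressor; the arrivals and the "trends" series are simulated stand-ins for official data and a Google Trends or Wikipedia pageview indicator, and the model order is an arbitrary choice.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    rng = np.random.default_rng(7)
    idx = pd.date_range("2012-01-01", periods=84, freq="MS")
    trends = pd.Series(50 + 20 * np.sin(2 * np.pi * idx.month / 12)
                       + rng.normal(0, 3, len(idx)), index=idx)

    exog = trends.shift(1).fillna(50)                       # last month's online interest
    arrivals = 1000 + 15 * exog + rng.normal(0, 40, len(idx))

    model = SARIMAX(arrivals, exog=exog, order=(1, 0, 0),
                    seasonal_order=(1, 1, 0, 12)).fit(disp=False)
    # One-step-ahead forecast: the exogenous value for next month is this month's interest
    print(model.forecast(steps=1, exog=[[trends.iloc[-1]]]))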

Reproducible analysis with Renku 175 Andreas Bleuler, (Email), Roskar Rok, (Email) Swiss Data Science Center - ETH, Zürich, Switzerland RENKU is an open-source platform for data analytics designed with reproducibility, reusability and the ability to collaborate as its main concerns. As researchers perform their analysis, data lineage is automatically recorded, seamlessly capturing the workflow both within and across projects and allowing any derived data to be unambiguously traced back to the original raw data sources in a fully transparent manner. The results are verified by an automated build, allowing scientists to detect problems with reproducibility early in the process. To simplify the adoption of Renku and to ensure that the data, metadata and code captured by the platform are interoperable with other systems, we rely on frequently used open-source tools like git, GitLab, the Common Workflow Language, Docker and JSON-LD schemas. The platform allows users to perform analysis either offline, using the RENKU Python client, or in a (self-)hosted cloud environment. While each project is a self-contained repository, workflows may seamlessly reference workflow steps from other projects and even projects on other RENKU instances.

Multi-source data integration with SBR 164 Steen Joergensen, (Email), Christian Bendtsen, (Email) Statistics Denmark, Copenhagen, Denmark Statistics Denmark’s Business Register (SBR) is very closely linked to an Administrative Business Register (Tax ABR) as well as a Central Business Register (CBR) administered by the Danish Business Authority. Together, the three institutions own and develop CBR. In the past few years, CBR has been developed with more information from other business registers that have become digital and often with the enterprise in charge of the update.

Design of a roadmap for implementing the CoP for the ENP south countries at ONS-Algeria. 015 Tarik Bourezgue, (Email) Office National des Statistiques, Algiers, Algeria The rapid changes in the country have had and continue to have a profound impact on infrastructure, economic agents and the population as a whole. These transformations have already turned the statistical landscape upside down and will continue to do so. In fact, the need for statistical data has changed and evolved in terms of the nature of the statistics, on the one hand, and requirements regarding availability, quality and timeliness, on the other hand. Starting from the context described above, the development of our quality approach is essentially based on the following three principles: - building on cooperation initiatives with Eurostat concerning the principles of the European Quality Assurance Framework for Official Statistics (QAF); - the choice of a participatory and transparent process to enrich this approach and facilitate our ownership of it; - and lastly, conducting the process step by step, in order to optimize its steering.

Metadata driven monitoring of electronic data capture 085 Mauro Bruno, (Email) 1, Joshua Handley, (Email) 2, Aaron Whitesell, (Email) 2, Guido Drovandi, (Email) 1, Milena Grassia, (Email) 1, Paolo Giacomi, (Email) 1 1 Istat, Rome, Italy 2 United States Census Bureau, Washington, United States As mobile technology is becoming more widely available, many statistical agencies have considered using mobile data capture for the 2020 round of population and housing census data collection. The introduction of electronic data capture into the business process of the census provides cost and time savings, but also allows users to take advantage of added features that can be programmed into mobile devices or linked to the data collection process. These features include, among others, integrated maps and Global Positioning System (GPS) capabilities, and real-time monitoring of fieldwork. This paper shows how Istat has designed such a monitoring system, integrated with the census data collection process supported on Android devices by CSPro, the public domain software package developed by the U.S. Census Bureau. The proposed architecture, implemented to support the Ethiopian population census within a project financed by the Italian Cooperation, is generalised and available to other potential users free of charge. It provides a simple solution for monitoring electronic data collection operations, particularly in cases where the technical and financial resources to implement such a system from the ground up are lacking.

On the design of a reference architecture in Istat 064 Mauro Bruno, (Email), Giuseppina Ruocco, (Email), Monica Scannapieco, (Email) Istat, Rome, Italy In recent years, the National Statistical Institutes (NSIs) in the most advanced countries have been carrying out an in-depth analysis of their cultural, organisational, and technological context. The need for such an analysis is connected to the new challenges NSIs are experiencing as a result of considerable changes in the external (e.g. new statistical needs, new data sources) and internal context (e.g. budget cuts, self-contained organizational structures). To face these challenges, NSIs are adopting a holistic approach to managing their business activities, and one of the key elements of the new approach is the Enterprise Architecture. At international level, Eurostat’s Vision 2020 has defined a roadmap based on the development of an Enterprise Architecture to support the modernisation process and the sharing of information and experiences between different countries. The paper will provide a high-level description of Istat’s Enterprise Architecture, implemented according to international standards, and of the application of EA principles to a strategic project launched in Istat.

optimStrat: An R package for assisting the choice of the sampling strategy 162 Edgar Bueno, (Email) Stockholm University, Department of Statistics, Stockholm, Sweden The sampling strategy that couples probability proportional-to-size sampling with the GREG estimator has sometimes been called "optimal", as it minimizes the anticipated variance. This optimality, however, relies on the assumption that the finite population of interest can be seen as a realization of a superpopulation model that is known to the statistician. Making use of the same model, the strategy that couples model-based stratification with the GREG estimator is an alternative that, although theoretically less efficient, has sometimes been shown empirically to be more efficient than the so-called optimal strategy. We compare the two strategies from both analytical and simulation standpoints and show that optimality is not robust towards misspecifications of the model. In fact, gross errors may be observed when a misspecified model is used. The comparison is made using approximations to the anticipated variances, obtained both analytically and by simulation. Finally, optimStrat, an R package that allows users to perform their own simulations, is introduced.
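
To give a flavour of the kind of comparison described (this is not the optimStrat package itself, which is written in R), here is a minimal Monte Carlo sketch in Python; the superpopulation model, Poisson pps sampling, the four equal-count strata and the single-auxiliary GREG without intercept are all simplifying assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    N, n, R = 2000, 200, 500                       # population size, sample size, replicates
    x = rng.gamma(shape=2.0, scale=10.0, size=N)   # auxiliary (size) variable

    def draw_y():                                  # assumed working model
        return 5.0 + 2.0 * x + rng.normal(0.0, 2.0 * np.sqrt(x))

    def greg_total(y, xs, d, tx_pop):
        """Simple GREG total with one auxiliary and no intercept."""
        B = np.sum(d * xs * y) / np.sum(d * xs * xs)
        return np.sum(d * y) + B * (tx_pop - np.sum(d * xs))

    pi_pps = np.minimum(1.0, n * x / x.sum())      # Poisson pps(x) inclusion probabilities
    stratum = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))   # 4 x-strata
    nh = n // 4

    err_pps, err_str = [], []
    for _ in range(R):
        y = draw_y()
        ty = y.sum()
        # (a) Poisson pps + GREG
        s = rng.random(N) < pi_pps
        err_pps.append(greg_total(y[s], x[s], 1.0 / pi_pps[s], x.sum()) - ty)
        # (b) stratified SRS (equal allocation) + GREG
        idx = np.concatenate([rng.choice(np.where(stratum == h)[0], nh, replace=False)
                              for h in range(4)])
        d = np.array([(stratum == h).sum() / nh for h in stratum[idx]])
        err_str.append(greg_total(y[idx], x[idx], d, x.sum()) - ty)

    print("Monte Carlo variance, pps + GREG :", round(np.var(err_pps), 1))
    print("Monte Carlo variance, STSI + GREG:", round(np.var(err_str), 1))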

Efficient screen real estate management: improving data visualization for small screens 191 Jorge Camoes, (Email) Wisevis, Lisbon, Portugal Even the simplest data set can be visualized from multiple perspectives. Not only are our options situation-dependent, but the criteria we use to evaluate them are also subject to discussion. More than a set of algorithms to be applied in each situation, data visualization is a flexible visual language with multiple purposes, where “speakers” can develop their unique style. When paired with technology, this openness has somewhat blurred the traditional divide between exploration and explanation/communication in data visualization. Other fault lines have emerged in recent years: tools (programming languages vs. point-and-click applications), interaction (dynamic vs. static), medium (analog vs. digital). In a context of an ever-increasing volume of data, and more types of data that could be visualized, more people are entering the field. Some have a less traditional profile (graphic designers, journalists), while others feel the need to add visualization and design to their core skills (statisticians).

Citizen to Government Geospatial Data partnerships. What can we learn from and recommend to those working in the official statistics domain? 159 Javier Andres Carranza Torres, (Email) 1, 2 1 GeoCensos Foundation, Bogotá, Colombia 2 Twente University ITC faculty, Enschede, Netherlands The Cape Town Global Action Plan for Sustainable Development Data, subscribed in 2017 at the first World Data Forum, highlights the need for National Statistical Offices (NSOs) to adapt to evolving demands. This need is triggered by all kinds of decision-makers, especially from governments under constant pressure to deliver focused, tailored and timely solutions. Based on a classification framework adapted from the e-governance concept, this document takes stock of various international, regional and local initiatives designed to aid or monitor government actions and that grant access to and use of non-traditional data sources. These are citizen-to-government data partnerships that use technologies like open geospatial information platforms and can complement surveys and traditional censuses in the official statistics data stream, with a special focus on demographic and social statistics. Recommendations will address the question of how civil society NGOs can strengthen their general capacities inside the statistical production processes to effectively support NSOs through collaboration projects at national and international levels.

MARS: A method for linking barcodes and stratifying products for price index calculation 161 Antonio Chessa, (Email) Statistics Netherlands, CPI department, The Hague, Netherlands The increased availability of electronic transaction data for the consumer price index (CPI) offers possibilities to national statistical institutes (NSIs) to enhance the quality of index numbers. More refined methods can be applied that deal with the dynamics of consumption patterns in a more appropriate way than traditional fixed-basket methods. For instance, multilateral methods can be used to specify sales-based weights at the most detailed product level, and new products can be directly included in index calculations. Electronic transaction or scanner data sets contain expenditures and quantities sold of items purchased by consumers at physical or online sales points of a retail chain. The sales data are often aggregated by retailers to a weekly level and are specified by the barcode or Global Trade Item Number (GTIN) of each individual item. Transaction data sets also contain characteristics, such as brand and package volume, of the items sold. While traditional price collection methods typically record prices of several tens of products in shops, electronic transaction data sets may contain several tens of thousands of items at the GTIN level for a single retail chain. GTINs represent the most detailed product level in electronic transaction data sets. Each item has a unique barcode. In principle, this means that NSIs are given a set of tightly defined products. The ratio of monthly expenditure and quantity sold yields a transaction price, which can be followed for each product/GTIN from month to month. However, items may be removed from the market and reintroduced with a modified packaging, for instance, in order to fit within a retailer’s new product line. Quality characteristics of such “relaunch” items may remain the same, but the barcodes may change after reintroduction, and so may the prices compared with those under the previous GTINs. The barcodes of the old and new, reintroduced items have to be linked in order to capture price changes under such relaunches. Typical market segments that are characterised by relaunches are pharmacy products, clothing and electronics. Rates of item churn may reach such high levels that each year new product lines are introduced that replace the former ones. The GTIN level is not appropriate as product level in such situations. GTINs of relaunch items have to be linked, which means that broader product concepts are needed. The problem of identifying “suitable” levels of product stratification has to be resolved before applying index methods to calculate price movements from period to period. The key question is what constitutes a suitable level of product stratification and how this notion can be formalised and operationalised. In addition, the size of electronic data sets calls for a method that enables statistical agencies to automate the stratification process to a high degree. This paper presents the method MARS, which stratifies products by balancing product homogeneity and the extent to which products can be followed over time. The inclusion of the latter measure makes it possible to identify relaunches and the corresponding price changes. Results for televisions and hair care are shown.
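
The starting point described above (monthly unit-value prices per GTIN, followed over time) can be made concrete with a minimal Python sketch; the transaction records are invented and the MARS stratification itself is not reproduced here.

    # Monthly unit-value price per GTIN = expenditure / quantity, which can then
    # be followed from month to month before any further product stratification.
    import pandas as pd

    tx = pd.DataFrame({
        "month":    ["2019-01", "2019-01", "2019-02", "2019-02", "2019-02"],
        "gtin":     ["8710001", "8710002", "8710001", "8710002", "8710003"],
        "turnover": [450.0, 300.0, 480.0, 150.0, 200.0],
        "quantity": [100,   60,    120,   30,    40],
    })

    unit_values = (tx.groupby(["month", "gtin"])[["turnover", "quantity"]].sum()
                     .assign(price=lambda d: d["turnover"] / d["quantity"]))
    print(unit_values["price"].unstack("gtin"))   # price per GTIN, tracked over months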

Transition to JDemetra+ in a centralised system for seasonal adjustment: issues and benefits 167 Giancarlo Bruno, (Email), Anna Ciammola, (Email), Francesca Tuzi, (Email) Istat, Rome, Italy Seasonal adjustment is a very demanding activity for National Statistical Institutes (NSIs). The required resources, in fact, involve highly skilled statisticians and econometricians, efficient IT procedures and a continuous updating of seasonal adjustment methods and tools. In order to conduct the annual campaign, as suggested by the European guidelines on seasonal adjustment, and to assure coherence in the treatment of peculiar situations (such as crises, short series, seasonal heteroskedasticity, etc.), NSIs generally work through a well-structured organisational system. A central unit deals with time series and seasonal adjustment, while several other local units seasonally adjust data routinely. When a problem arises, the latter ask the former for help. Given this organisation, and once a common seasonal adjustment method has been chosen (usually the choice is between the ARIMA-model-based method and the moving-average-based method), there may be several different ways to approach seasonal adjustment, ranging from the simplest one, where common procedures are run in each domain using personal computers, to the most complex one, where a unique centralised system is used. The latter approach is implemented in the Italian National Institute of Statistics (Istat) through an informative system hereafter called SITIC. It is aimed at storing, seasonally adjusting and disseminating short-term statistics coming from all business surveys. The aim of this paper is twofold. Firstly, the main features of SITIC are described, stressing the solutions implemented to cope with some methodological issues concerning the aggregation of chain-linked indices and drawing attention to the benefits implied by the management of a centralised system. Secondly, the project aimed at introducing JDemetra+ into SITIC is presented. Issues, costs and advantages are listed and the peculiarities of the project are highlighted. In fact, the transition to JDemetra+ in SITIC requires neither the use of the graphical interface nor of batch procedures like the JDemetra+ Cruncher, but is based on the integration of Java modules, avoiding or minimising any changes to the SITIC interface.

A live multi-source statistical (Business) Register 138 Barry Coenen, (Email), John Hacking, (Email) CBS, Heerlen, Netherlands In a country, a lot of administrative sources are available, but each of these sources provides its own view of the world. Understanding the content and the quality of the sources is essential in processing them and combining them into one statistical view. We receive sources with different timings, in different forms and with different contents, but together they provide a complete view of the national economy. We have a metadata system which allows us to understand our sources and to be clear about where they overlap or are the only source of information. These different administrative worlds are combined into one statistical world. For our users we have set up a separate environment where they can access the statistical data on their own. They are free to select just the data they need, when they need it.

Equitable and Sustainable Well-being indicators for small areas 005 Cecilia Colasanti, (Email) 1, 2, Flavia Marzano, (Email) 1, Clementina Villani, (Email) 1 1 Roma Capitale, Rome, Italy 2 Istat, Rome, Italy

The aim of this paper is to describe methods and results concerning the evaluation of Equitable and Sustainable Well-being indicators for the Municipality of Rome. Since 2010, the Italian National Institute of Statistics (Istat) has developed a set of indicators, called Equitable and Sustainable Well-being (ESW), to measure the progress of Italian society not only from an economic point of view, but also from a social and environmental perspective. This year, for the first time, Roma Capitale compiled 75 indicators related to well-being conditions in Rome. The main data sources are administrative data taken from the National Institute of Statistics (Istat), the Municipality of Rome, the Italian National Institute of Insurance for Accidents at Work (INAIL), the Ministry of Economy and Finance, the Ministry of the Interior and the Ministry of Justice. This report will be updated yearly. The local indicators were calculated using small area estimation methods. This work is framed within the discussion about sustainability at global level and should be considered a tool that gives policy makers the opportunity to take decisions based on the available data and information.

How to enlarge our audience: attract and explain 188 Louise Corselli-Nordblad, (Email) Eurostat, Luxembourg, Luxembourg It is increasingly important for citizens to understand statistical figures presented in newspapers, on the internet and elsewhere. Due to social media and the internet, it is easy to rapidly communicate and spread statistical figures of both good and bad quality. In order to help people to understand the figures, it is vital to increase statistical literacy by explaining what they mean. It is equally important to make statistics easily accessible and more attractive in order to appeal to people who are not familiar with statistics. In a fast-moving world, there is also an increasing demand among users of statistics for short and catchy texts as well as sharable and interactive products. These demands can be summarised by two keywords: attract and explain! Over the past years, Eurostat has produced a wide range of dissemination and visualisation tools. These tools not only explain statistics in easy and understandable language, but also visualise statistics in a clear and explanatory way and can be shared via social media. These efforts have been undertaken to attract those not so familiar with statistics and to increase their statistical literacy. This abstract presents examples of such Eurostat dissemination tools.

Creating a synthetic database for research in migration and subjective well-being: Statistical Matching techniques for combining the basic and complementary questionnaires of the Hungarian Microcensus 2016 020 Zoltán Csányi, (Email), Gergely Bagó, (Email), Anna Ligeti, (Email), Zita Ináncsi, (Email), Ferenc Urbán, (Email), Zoltán Vereczkei, (Email) Hungarian Central Statistical Office, Budapest, Hungary In 2016, with the aim of tracking social trends between full-scope censuses, the Hungarian Central Statistical Office (HCSO) carried out the Microcensus, a population survey based on an unusually large sample covering 10 percent of Hungarian households. Apart from the basic questionnaires on dwellings and personal information, selected households were asked to fill in one of the following complementary surveys on a) international migration, b) subjective well-being, c) social stratification, d) occupational prestige, and e) health problems. From a methodological point of view, the Microcensus dataset with the above described structure of basic and complementary questionnaires invites a statistical – or synthetic – matching exercise. This method, in accordance with the UNECE Data Integration Guide [1], “involves the integration of data sources with usually distinct samples from the same target population, in order to study and provide information on the relationship of variables not jointly observed in the data sets”. That is, the statistical matching exercise resembles an “imputation problem of the target variables from a donor to a recipient survey” on the basis of common variables [2]. HCSO methodologists and experts in population and migration statistics embarked on creating such a synthetic Microcensus database of the complementary modules on the basis of the variables from the basic questionnaires. The resulting dataset – unique in terms of sample size – will contain, apart from the information obtained in the basic questionnaires, the estimated/imputed data from each of the complementary sets of questions. Thus, it will serve to develop analytic richness by making it possible to study the diverse relationships between a wide range of variables never observed together in Hungary. In their joint effort, as a first step, the multidisciplinary project team is currently working on identifying the best solutions to combine the basic and two of the complementary questionnaires: the one on migration and the one on aspects of subjective well-being. This presentation – instead of entering into the details of the analytical results – focuses on the methodological questions of creating such a synthetic database and the evaluation of the usability and potential of the output dataset. It also gives an overview of the first phase of the ongoing experiment, the lessons already learnt, and the next steps.
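
As a generic illustration of the matching idea (not the HCSO procedure), here is a minimal Python sketch of distance hot-deck statistical matching: each recipient record is matched to the donor record that is closest on the common variables, and the donor's target value is imputed. All data and variable names are invented.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    n_common = 3                                          # e.g. age, household size, employment
    donors = rng.normal(size=(300, n_common))             # sample with the well-being module
    recipients = rng.normal(size=(500, n_common))         # sample with the migration module
    donor_target = rng.integers(0, 11, size=300)          # e.g. life-satisfaction score (0-10)

    nn = NearestNeighbors(n_neighbors=1).fit(donors)
    _, match_idx = nn.kneighbors(recipients)
    imputed = donor_target[match_idx.ravel()]             # target imputed into recipient records
    print(imputed[:10])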

Designing Urban Experience with Rhythm and Data: Lessons from the City Rhythm Project 194 Caroline Nevejan, (Email) 1, Scott Cunningham, (Email) 2 1 Chief Science Officer, Municipality of Amsterdam, The Netherlands, Amsterdam, Netherlands 2 Faculty of Technology, Policy and Management; Delft University of Technology, Delft, Netherlands City Rhythm investigates social cohesion in selected Dutch cities. The project incorporated senior researchers, policy-makers and students in a series of cooperation, creation and planning exercises. Although citizens grow safer year by year, the perception in many Dutch neighborhoods is one of anxiety and a feeling that the neighborhood is disorderly. The key to creating social cohesion involves creating shared experiences using rhythm. Investigations include six cases of social cohesion in neighborhoods. The cases are coupled with comprehensive analyses of urban populations using open administrative and microdata.

Targeting a wider public – storytelling with statistical data 008 Zsolt Czinkos, (Email) Hungarian Central Statistical Office, Budapest, Hungary

Reaching a wider public with statistical data is difficult. Smartphones and increasing mobile bandwidth are changing user expectations. Statistical visualizations should meet the standards of current user experience. The Hungarian Central Statistical Office has started to create interactive storytelling infographics and data visualizations to highlight interesting facts, to explain terms, to show results – to improve statistical literacy. Creating customised story visualizations is challenging. It requires cooperation between people from different domains: software development, statistics, communication, management, visualization. It also needs software tools. Publishing to both mobile and desktop environments requires responsive design and cross-browser compatibility. Tools exist, but development is expensive. Is it worth it? The number of visitors should be measured and feedback collected. This presentation offers insight into the development process of a published interactive storytelling visualization, highlighting technical details.

Measuring poverty and social exclusion by small area estimation 098 Michele D'Alò, (Email) 1, Gaia Bertarelli, (Email) 2, Loredana Di Consiglio, (Email) 1, Andrea Fasulo, (Email) 1, Maria Giovanna Ranalli, (Email) 3, Fabrizio Solari, (Email) 1, Alessio Guandalini, (Email) 1, Stefano Daddi, (Email) 1 1 Istituto Nazionale di Statistica, Rome, Italy 2 University of Pisa, Pisa, Italy 3 University of Perugia, Perugia, Italy The objective of this work is to provide a statistical tool that can drive local policies on the basis of urban specificities. For this purpose, very detailed and updated statistical information at a fine geographic level is necessary. Typically, the former aspect has always been assured by the census, which until now had the limitation of providing data only on a decennial basis. Such a temporal discrepancy is no longer acceptable nowadays. The timeliness of the information is, on the other hand, assured by sample surveys, which, unfortunately, have limitations in terms of territorial dissemination: the estimates are, in fact, usually produced at regional level. From these considerations emerges the need to provide solutions that exploit the availability of new sources of information, such as administrative data. The integration of this information with survey data can overcome the lack of information at a more detailed territorial level, assuring timely and accurate estimates at the same time. NSIs have started to produce social and economic indicators using administrative data at local level. However, due to a different taxonomy, these indicators do not coincide with those usually computed by means of sample surveys. Therefore, the information from administrative data is often not consistent with the information officially produced at the regional level with sample surveys. The aim of this work is, first of all, to compare the indicators computed by the two sources of information, for all the metropolitan cities in Italy, for some large municipalities and for functional aggregations of small municipalities. The following step is to use the administrative data as an auxiliary source for model-based estimation or for projection-type estimators. The output of this step allows us to evaluate the results obtained on important indicators of social exclusion and well-being, typically produced with the EU-SILC (European Union Statistics on Income and Living Conditions) survey. In particular, we focus on small area estimates of the poverty rate, low work intensity and quantile share ratio indicators, computed at the provincial and metropolitan municipality level.
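
A projection-type estimator of the kind mentioned above can be sketched in a few lines of Python: a model fitted on survey data is used to predict the indicator for every register unit, and the predictions are then aggregated by small area. All data, variables and model choices below are invented and do not reflect the actual estimators used in the paper.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # Survey units: covariates also available in admin data, plus observed poverty status
    survey = pd.DataFrame({"income_admin": rng.gamma(2.0, 9.0, 2000),   # assumed, in kEUR
                           "employed": rng.integers(0, 2, 2000)})
    survey["poor"] = ((survey["income_admin"] + rng.normal(0, 4.0, 2000)) < 12.0).astype(int)

    model = LogisticRegression().fit(survey[["income_admin", "employed"]], survey["poor"])

    # Register (admin) units covering every small area
    register = pd.DataFrame({"area": rng.integers(0, 10, 50000),
                             "income_admin": rng.gamma(2.0, 9.0, 50000),
                             "employed": rng.integers(0, 2, 50000)})
    register["p_poor"] = model.predict_proba(register[["income_admin", "employed"]])[:, 1]
    print(register.groupby("area")["p_poor"].mean())   # projected poverty rate by area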

Statistical learning in official statistics: the case of statistical matching 063 Marcello D'Orazio, (Email) Office of Chief Statistician, Food and Agriculture Organization (FAO) of the United Nations, Rome, Italy National statistical offices are facing the challenge of modernizing their statistical production processes, beyond traditional sample surveys and censuses, so as to exploit all available data provided by administrative registers and big data. Taking advantage of large data sources requires the adoption of modern statistical methods, such as those based on machine learning. In addition, the availability of different data sources on the same phenomena poses the challenge of integrating them to produce a wider set of statistical outputs so as to satisfy users’ requests. This work will show how statistical learning methods can be beneficial in integrating data. Statistical learning (SL) is a relatively recent area of statistics (see e.g. [1] and [2]) that includes a wide set of techniques that “learn from the data”. They have become very popular in marketing, finance, and other domains, because they allow the analysis of large data sources with many variables and observations. Under the SL umbrella fall many recent methods related to classification, regression and clustering (generalized additive models, classification and regression trees, neural networks, etc.). Integration is at the core of new statistical production processes aimed at providing a richer set of statistical outputs by taking advantage of already existing data, avoiding setting up new surveys. The focus here is on statistical matching (SM, also known as data fusion), whose objective is the integration of data sources (mainly from sample surveys) lacking unit identifiers, in order to investigate the relationship between variables not jointly observed in the same survey (see e.g. [3]). These methods are frequently applied to integrate the survey on household income with the one on expenditures to get a thorough picture of people's well-being [4]. SM methods include a variety of well-known methods developed to impute missing values in a dataset (predictive mean matching, hot-deck imputation, etc.), but adapted to the specific SM setting.
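
One of the imputation techniques named above, predictive mean matching adapted to the SM setting, can be sketched as follows in Python; the data, the linear working model and the single-nearest-donor rule are simplifying assumptions for illustration only.

    # A regression fitted on the donor survey predicts the target for both surveys;
    # each recipient then receives the observed value of the donor whose prediction
    # is closest to its own.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    Xd = rng.normal(size=(400, 3))                                    # common variables, donor survey
    yd = Xd @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1, 400)      # e.g. household expenditure
    Xr = rng.normal(size=(600, 3))                                    # common variables, recipient survey

    reg = LinearRegression().fit(Xd, yd)
    pred_d, pred_r = reg.predict(Xd), reg.predict(Xr)

    nearest = np.abs(pred_r[:, None] - pred_d[None, :]).argmin(axis=1)
    y_imputed = yd[nearest]                                           # live donor values, not model predictions
    print(y_imputed[:5])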

Network analysis on Persons for Official Statistics 112 Edwin de Jonge, (Email), Jan van der Laan, (Email) Statistics Netherlands (CBS), Den Haag, Netherlands The census, which is as old as civilization, is the origin of Official Statistics. It measures various sociodemographic properties of the population which were, are and will be of importance for social scientists, historians, policy makers and government. A census is a very valuable source of information, but its description of the society formed by the connections between people is very limited. It records people living at the same address, but it fails to capture the broader network of relationships: family, friends, neighbors, co-workers and acquaintances. Most demographic statistics describe (aggregates of) properties of inhabitants. If Official Statistics strives to measure society, describing the network of relations between people that forms the fabric of society can be a source of interesting demographic statistics. Is the strength of family ties regionally correlated? How diverse are personal networks? Given the current demographic trend in most countries that the average age is increasing, do parents live close to their children or is this distance increasing? At Statistics Netherlands a research program was formed that tries to use complexity science and network analysis to derive new and additional official statistics. Several projects have been started, including one deriving an enterprise-to-enterprise network for describing economic networks, and a project to derive a social network of the Netherlands. This abstract describes the derivation of a directed network with family, dwelling, neighbor, school-going children and coworker relationships for the 17 million inhabitants of the Netherlands and some of its potential uses for producing official statistics.
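
A toy Python sketch of the data structure involved: a typed, directed person network built from register-style edge lists, from which simple statistics such as the diversity of personal networks can be derived. The edges, relation types and the diversity measure are illustrative assumptions, not CBS definitions.

    import networkx as nx

    edges = [                              # (person, person, relation type)
        (1, 2, "family"), (2, 1, "family"),
        (1, 3, "neighbour"), (3, 1, "neighbour"),
        (1, 4, "colleague"), (4, 1, "colleague"),
        (2, 3, "school"), (3, 2, "school"),
    ]
    G = nx.MultiDiGraph()
    G.add_edges_from((u, v, {"relation": r}) for u, v, r in edges)

    # Diversity of a personal network: number of distinct relation types per person
    for person in G.nodes:
        types = {d["relation"] for _, _, d in G.out_edges(person, data=True)}
        print(person, len(types), sorted(types))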

Assessing and adjusting bias due to mixed mode in Aspect of Daily Life Survey 088 Claudia De Vitiis, (Email), Francesca Inglese, (Email), Alessio Guandalini, (Email), Marco Dionisio Terribili, (Email) Italian National Statistical Institute, Rome, Italy The mixed mode (MM), i.e. the use of different collection techniques in the same survey, is a relatively new approach that ISTAT, as well as other NSIs, is adopting especially for social surveys. Its use is spreading both to counter declining response rates and to reduce the total cost of surveys. Using different data collection techniques, in fact, helps in contacting different types of respondents in the most suitable way for each of them, allowing a gain in population coverage and response rate. However, it introduces a bias, known as the mode effect, which must be faced at different levels: in the design phase, by defining the best collection instruments to contain the measurement error; in the estimation phase, by assessing and treating the bias effects due to the use of MM, in order to ensure the accuracy of the estimates. Surveys based on MM must be designed, in fact, considering the accuracy of the produced estimates, which must be consistent and comparable with the analogous ones obtained in previous survey editions, to ensure that changes in the time series are exclusively due to real changes in the observed phenomenon. The focus of this work is the experience in the evaluation and treatment of the MM effect in the experimental situation of the ISTAT survey “Aspects of daily life – 2017”, a sequential web/PAPI survey for which an independent single mode (SM) PAPI control sample was planned to assess the introduction of the mixed mode. The aim of the analyses presented in this work is to evaluate first the impact on the survey estimates of the introduction of the MM design, with respect to the previous single mode design, and subsequently to analyse in depth the reasons that determine significant differences in the estimates obtained with the two samples. The study is developed on several levels of analysis: the first level makes the comparison between the SM and MM samples; the second level assesses the mode effect (selection and measurement) in the MM sample.

Measuring the Quality of Multisource Statistics 019 Ton de Waal, (Email) 1, 2, Arnout van Delden, (Email) 1, Sander Scholtus, (Email) 1 1 Statistics Netherlands, The Hague, Netherlands 2 Tilburg University, Tilburg, Netherlands The ESSnet on Quality of Multisource Statistics – also referred to as Komuso – is part of the ESS.VIP Admin Project. The main objectives of that latter project are (i) to improve the use of administrative data sources and (ii) to support the quality assurance of the output produced using administrative sources. The aim of the ESSnet is to produce quality guidelines for National Statistics Institutes (NSIs) that are specific enough to be used in statistical production at those NSIs. The guidelines are expected to take the entire production chain into account (input, process, and output). They also aim to cover the diversity of situations in which NSIs work as well as restrictions on data availability. The guidelines will list a variety of potential indicators/measures, indicate for each of them their applicability and in what situation it is preferred or not, and provide an ample set of examples of specific cases and decision-making processes. The first Specific Grant Agreement (SGA) of the ESSnet lasted from January 2016 until April 2017. The second SGA started in May 2017 and lasts until mid-October 2018. A third and final SGA is planned to start mid-October 2018 and end mid-October 2019. Work Package (WP) 3 of the ESSnet focuses on developing and testing quantitative measures and indicators for measuring the quality of output based on multiple data sources and on methods to compute such measures and indicators. Examples of such quality measures and indicators are bias and variance of the estimated output. Methods for computing these and other quality measures and indicators often depend on the specific situation at hand. Many different situations can arise when multiple sources are used to produce statistical output, depending on the nature of the data sources and on the kind of output produced. Therefore we have identified several basic data configurations for the use of administrative data sources in combination with other sources, for which we propose, revise and test quantitative measures and indicators for the accuracy, timeliness and coherence of the output. In this paper we discuss WP 3 of Komuso and some of the results obtained. Section 2 describes the approach taken in WP 3. Section 3 gives some examples of quality measures and methods to compute them. Section 4 concludes this paper with a brief discussion.

Perturbative methods for ESS census tables 016 Peter-Paul de Wolf, (Email) Statistics Netherlands (CBS), The Hague, Netherlands The population census has been an important output of official statistics for a long time. In the current global situation, researchers are increasingly interested in combining census information from different countries. In Europe, Eurostat sought to facilitate this by developing the Census Hub: a software system where census tables from the participating EU member states can be found. These member states each filled the Census Hub with their own census tables. The census tables concern personal information and provide information at quite a detailed level: crossings of many descriptive variables, often with detailed categories. Hence, even though the data are tabulated, statistical disclosure control (SDC) methods are needed to protect the individual privacy of the people in the census tables of those member states. Up until the 2011 census, all member states defined and used their own 'proper' SDC method(s). This led to the undesired situation that, although the tables from the member states were available through a single portal, combining the information across different countries in a useful way was in some cases difficult if not impossible. Consequently, a project was launched in which the European Centre of Excellence on SDC was asked to harmonise the SDC approaches of the different countries. As a result, two SDC methods were proposed: targeted record swapping (TRS) and the cell-key method (CKM). These methods were tested by the partners of the Centre of Excellence on SDC that took part in the project, using SAS software obtained from the ONS. Once one or both of these methods were suggested as a harmonised approach for the EU member states, it became apparent that a more general implementation in (existing) open-source SDC software would be needed. Hence, Eurostat asked the Centre of Excellence on SDC to provide these implementations. In the current extended abstract, we briefly describe the suggested methods and introduce the implementations that have been developed.
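
To make the cell-key idea concrete, the sketch below shows its core mechanics in a toy form: each record carries a fixed random key, a cell's key is the fractional part of the sum of its records' keys, and that key deterministically selects the noise added to the cell count, so the same cell always receives the same perturbation. The perturbation table used here is an invented toy, not the harmonised ESS parameterisation or the production implementation.

```python
# Toy sketch of the cell-key method (CKM); parameters are illustrative only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)

def assign_record_keys(records: pd.DataFrame) -> pd.DataFrame:
    """Attach a fixed uniform(0,1) record key to every microdata record."""
    records = records.copy()
    records["rkey"] = rng.uniform(0.0, 1.0, size=len(records))
    return records

def perturbed_table(records: pd.DataFrame, by) -> pd.DataFrame:
    """Tabulate counts by the variables in `by` and perturb them via cell keys."""
    cells = records.groupby(by).agg(count=("rkey", "size"), cell_key=("rkey", "sum"))
    cells["cell_key"] = cells["cell_key"] % 1.0          # fractional part
    # toy lookup: cell key maps to noise in {-1, 0, +1}
    noise = np.select([cells["cell_key"] < 0.15, cells["cell_key"] > 0.85],
                      [-1, 1], default=0)
    cells["perturbed_count"] = np.maximum(cells["count"] + noise, 0)
    return cells.reset_index()
```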

CYSTAT: The road to modernisation 073 Costas Diamantides, (Email), Charoulla Charalambous, (Email) Statistical Service of Cyprus, NICOSIA, Cyprus The overall organization of statistical activities in the Statistical Service of Cyprus (CYSTAT) is based on the traditional stove-pipe model, in which all the production processes are decentralized and carried out by the subject-matter Divisions/Sections. Nowadays, however, there is an increasing number of National Statistical Institutes in which the production processes are carried out through standardised procedures regardless of the content of the production output, resulting in increased efficiency and improved output quality. A valuable tool for specifying the standard production steps is the Generic Statistical Business Process Model (GSBPM), which CYSTAT started to implement in 2018, aiming to standardise the processes of statistical production and abandon the stove-pipe model. At the same time, the current technical infrastructure of CYSTAT faces several challenges. Thus, there is a need for the development of a statistical data warehouse to ensure central storage of, and easy access to, statistical data and metadata. CYSTAT is a small office with limited human and financial resources. Any reorganisation requires a significant amount of time to be invested in setting up the new structures and processes, while at the same time business continuity should not be hampered. The development of a statistical data warehouse requires expertise which is not available in-house. Thus, an opportunity for CYSTAT to modernise arose within the framework of the expansion of the Government Data Warehouse (GDW).

On data literacy in the context of rational ignorance – some evidence from the Eurobarometer survey 044 Lyubomira Dimitrova, (Email) Sofia University, Department of Public Administration, Sofia, Bulgaria As the amount of data we generate increases exponentially, so does the demand for the ability to analyse it. As Wolff et al. [1] describe it, ensuring that every citizen possesses the skills required to interpret data, to understand its limitations and to be able to use it is a must in this context. Schield [2] adds to this argument that data literacy, along with information and statistical literacy, is one of the three pillars of critical thinking. Data literacy is therefore an important factor for active political participation and for an appropriate reaction to propaganda and fake news, which rely mainly on emotions and subjective interpretations. Despite the existence of various programmes aiming to promote data and information literacy, poll results still show a rather unsatisfactory state, and there is a popular explanation for that. According to rational ignorance theory, when the cost of acquiring information is greater than the benefits to be derived from the information, it is rational to be ignorant [3]. Usually this approach is used for analysing voting behaviour, and the general assumption is that being informed about the political agenda of each and every candidate requires too much effort; the voter will therefore choose her candidate based on other, less time-consuming criteria.

Measuring MNEs using Big Data: The OECD Analytical Database on Individual Multinationals and their Affiliates (ADIMA) 117 Nadim Ahmad, (Email), Diana Doyle, (Email), Graham Pilgrim, (Email) OECD, Paris, France Despite their significant and growing importance, with implications across a range of policy areas, information on Multinational Enterprises (MNEs) remains at best patchy. This is partly a function of complexity: by their very nature, MNEs are large, with a multitude of activities across a number of jurisdictions. However, for firms engaging in fiscal optimisation at least, it is also partly a function of design: some firms for example create elaborate chains of affiliates, holding companies and special purpose entities, designed to minimise taxes, but the consequence is also to obfuscate. Another factor that complicates the measurement of MNEs is the limited possibility for National Statistical Institutes (NSIs) to obtain a holistic view of their activities, reflecting legislation that typically restricts data collections to activities within their economy or (and only very rarely) to the global activities of firms headquartered in the economy (and even in these cases it is not clear that the coverage of the MNE’s activities is exhaustive). The sharing of data across countries could provide a window to provide this holistic view but legal constraints aimed at preserving confidentiality and privacy of respondents within national borders in most countries mean that this is not, at least for now, possible. To begin to address these challenges, the OECD has begun to develop an Analytical Database of Individual MNEs and their Affiliates (ADIMA), by compiling publicly available statistics on the scale and scope of the international activities of MNEs, thus providing a unique ‘whole of the MNE’ view.

Application and quality assessment of simulated geo-coordinates for regional analysis of the parliamentary elections for the Bundestag 2017 201 Kerstin Erfurth, (Email) EMOS Master Thesis Competition, Berlin, Germany Data collection and data interpretation have become main issues in the modern information society. In this context, preparation and visualisation of data are essential aspects of any analysis. With an appropriate illustration, one gets easy access to the information underlying the data; the chosen illustration method can therefore have a major influence on the interpretation of the data. For data containing geographic references, map presentations are frequently used, as they are the best method to illustrate spatial relations with the help of coloured symbols, borders and areas. The data discretisation and the colour codes chosen for the categorisation may result in huge differences in visual impact. When dealing with aggregated data, the approach to pre-processing becomes a key issue for obtaining informative maps. To obtain comprehensive information for all geo-coordinates of the region of interest, a new non-parametric approach to density estimation named “kernelheaping” is applied and evaluated. Based on the election results for the German Bundestag in 2017, the new technique is compared against standard choropleth maps and some modifications, such as normalisation. The iterative kernelheaping procedure is also compared to non-iterative variants of kernel density estimators. In addition, the new approach is evaluated statistically with an underlying known real-world density at different aggregation levels. The Master thesis was written in cooperation with the Statistical Office of Berlin-Brandenburg and the statistics department of the economics faculty of Freie Universität Berlin. Beforehand, the author completed an internship at the office, during which the kernelheaping method was set up and tested on preliminary election data and subsequently put into practice on election night using real-time data. The results of the regional analysis were published in the official election report and in the journal of the Statistical Office of Berlin-Brandenburg.
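
The iterative idea behind kernelheaping can be illustrated in one dimension as follows: values are only observed as interval counts, pseudo-samples are drawn inside each interval, a kernel density is fitted, and the pseudo-samples are redrawn proportionally to the current density estimate until the estimate stabilises. This is a simplified sketch of the principle only; the thesis used the dedicated R implementation on two-dimensional geo-coordinates.

```python
# Illustrative 1-D sketch of the iterative kernelheaping idea.
import numpy as np
from scipy.stats import gaussian_kde

def kernelheaping_1d(bin_edges, bin_counts, n_iter=20, grid_size=400, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(bin_edges[0], bin_edges[-1], grid_size)
    # start from pseudo-samples drawn uniformly inside each interval
    samples = np.concatenate([
        rng.uniform(lo, hi, size=count)
        for lo, hi, count in zip(bin_edges[:-1], bin_edges[1:], bin_counts)
    ])
    for _ in range(n_iter):
        dens = gaussian_kde(samples)(grid)
        new_samples = []
        for lo, hi, count in zip(bin_edges[:-1], bin_edges[1:], bin_counts):
            mask = (grid >= lo) & (grid < hi)
            weights = dens[mask] / dens[mask].sum()
            # redraw pseudo-samples proportionally to the current density estimate
            new_samples.append(rng.choice(grid[mask], size=count, p=weights))
        samples = np.concatenate(new_samples)
    return grid, gaussian_kde(samples)(grid)
```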

The Nautile project of the new French Master Sample 126 Sébastien FAIVRE, (Email), Thomas MERLY-ALP, (Email), Ludovic VINCENT, (Email), Clément GUILLO, (Email), Laurent COSTA, (Email) INSEE, Paris, France Household surveys carried out by INSEE are mainly face-to-face surveys. In order to reduce field costs, it appears necessary to concentrate surveys in specific areas called primary units. In order to keep a stable interviewer network, INSEE uses a Master Sample system: primary units are drawn once at the beginning of the system and are used for all surveys drawn in the Master Sample (with the exception of the LFS). According to international good practice, primary units should be renewed every ten years. The current Master Sample started in 2009, and the primary units are to be renewed at the end of 2019. Renewing the primary units is also an opportunity to redesign the Master Sample, taking advantage of new sampling frames based on taxpayers' files. The current INSEE Master Sample is in fact based on the French rolling census (with the yearly census samples used as sampling frames), but the census rotation groups that were drawn in 2003 are becoming less and less representative. It was therefore necessary to undertake a specific reflection on the sampling frames used by INSEE for social surveys. The Labour Force Survey sample is specific, as it is composed of clusters of 20 dwellings, as close together as possible, due to the short period after the reference week in which to carry out the survey. A cluster is surveyed quarterly, for six quarters. Groups of six quarters (sectors) are built: when a cluster has been fully surveyed, another one from the sector replaces it in the sample. The LFS sample is therefore calibrated to last nine years. As the current LFS sample started in summer 2010, the next LFS sample will enter into force in summer 2019 and must therefore be redesigned at that time.

Statistics Norway and implementing ModernStats models 168 Trygve Falch, (Email) Statistics Norway, Oslo, Norway As part of the modernization effort in Statistics Norway, there has been a coordinated and focused effort to base the modernization on the ModernStats standards, including GSBPM, GSIM and CSPA. Operationalizing these standards in a systematic way has always been a challenge, but through the use of Open Source tools, the creation of Open Source components, and the adoption of modern system development techniques and best practices, we are closer to being able to put all of these standards into production-wide use on the new production platform Statistics Norway is creating as part of its modernization efforts.

First step towards a digital National Statistical Institute 075 Concetta Ferruzzi, (Email), Maria Assunta Del Santo, (Email), Daniela Carbone, (Email) Italian Statistical Institute, Rome, Italy Digitalisation is a genuinely innovative project that has an impact on the organisational structure of a National Statistical Institute, in terms of support services and statistical production competencies, results and impact, and should bring clear benefits to the core process: efficiency, speed, new capacity in customer service and stakeholder satisfaction. This work is based on a qualitative case-study methodology focused on the Italian Statistical Institute (ISTAT), which has defined a programme for digital transformation and implemented a digital platform to support document flows. In 2016 ISTAT identified digitalisation as one of the seven strategic objectives contributing to the realisation of its modernisation programme. The ongoing digital transformation process represents a real revolution at the organisational level and has two priorities: the strengthening of administrative capacity, and data and process digitalisation. At the operational level, the digital transformation programme is based on the digitalisation of documents, processes and data streams. This complex programme has benefited from the use of the portfolio management transformation methodology from the very beginning of the planning phase. This model was implemented through the following phases: (1) the translation of strategies into initiatives; (2) the identification of projects and activities; (3) the prioritisation, evaluation and balancing of the portfolio. The first result achieved was the implementation of the digital document management system. The system is unique for the whole of Istat, and all personnel, managers and non-managers alike, are authorised and able to use it. The system integrates different components (official digital document register, start-up in July 2016; integrated certified emails, start-up in March 2017; workflows and electronic signature, start-up on 1 January 2018; digital preservation, start-up in January 2018). During 2018 further IT capabilities have been added to the digital document management system. This first result has contributed to the implementation of a single document database. It makes it possible to overcome the old practice of private silos in the different organisational structures, and allows the sedimentation of the Institute's current document archive, which is part of the documentary and archival heritage of the State Archives system. One of the results of the project was the promotion of several initiatives to implement document workflow digitalisation also in relation to production processes and statistical dissemination, allowing integration with the overarching processes. An important step in the digitalisation process is the integration of the IT component for massive certified emails within the digital document management system. A further impact of the project is the redesign and simplification, in digital form, of administrative procedures and services that did not exist before, in accordance with privacy rules and full technological security. The main lessons learned concern different aspects: clear objectives, output definition, change management, team management, the role of the project leader, and organisational culture.

Supervised Learning as a Method to Reduce Clerical Effort 095 Joerg Feuerhake, (Email) Federal Statistical Office Germany, Wiesbaden, Germany With the availability of more and more computing power, machine learning methods are becoming more relevant in the production of statistics. One important field of application is the classification of statistical units based on models trained with units for which the classification is known. In this paper an approach is presented for classifying units from a business database based on prior clerical review. The goal is to substantially reduce clerical effort in the statistic's production process. Consider the case where a share of roughly 2% of a population of about 600,000 units is not relevant for the results of a certain annual statistic. There are several reasons for a unit to become irrelevant for the statistic, and the reasons depend on items such as size, economic activity and other ratio-scaled and nominally scaled variables. Additionally, assume that a unit's relevance in recent periods was checked by clerical review, so each year all units entering the population or changing in an important variable have to be checked manually. On average, 40,000 units each year have previously needed clerical review. The staff bound by these reviews was thus considerable, let alone the training needed to enable staff members to review cases correctly and the time needed to do the reviews. In the presented project, methods of supervised learning are applied to achieve the above-mentioned goals. Random forests and support vector machines (SVMs) are trained in a combined approach, based on the populations of prior years, to obtain models able to predict the relevance of units that enter the population or change in important variables.
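
A minimal sketch of such a combined approach is given below: both a random forest and an SVM are trained on prior-year units whose relevance was established by clerical review, and a new unit is routed to clerical review only when the two models disagree or are insufficiently confident. This is not the Destatis production code; the feature columns, the boolean label column "relevant" and the confidence threshold are assumptions.

```python
# Hedged sketch of a combined RF + SVM relevance classifier; names are assumed.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

NUMERIC = ["turnover", "employees"]                 # assumed ratio-scaled variables
CATEGORICAL = ["economic_activity", "legal_form"]   # assumed nominal variables

def make_model(estimator):
    prep = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return make_pipeline(prep, estimator)

rf = make_model(RandomForestClassifier(n_estimators=500, random_state=1))
svm = make_model(SVC(probability=True, random_state=1))

def fit(train: pd.DataFrame):
    X, y = train[NUMERIC + CATEGORICAL], train["relevant"]  # boolean label assumed
    rf.fit(X, y)
    svm.fit(X, y)

def route(new_units: pd.DataFrame, threshold=0.9):
    """Label units 'relevant'/'irrelevant' when both models agree; else review."""
    X = new_units[NUMERIC + CATEGORICAL]
    p_rf = rf.predict_proba(X)[:, 1]    # P(relevant), class order: [False, True]
    p_svm = svm.predict_proba(X)[:, 1]
    out = new_units.copy()
    out["decision"] = "clerical_review"
    out.loc[(p_rf > threshold) & (p_svm > threshold), "decision"] = "relevant"
    out.loc[(p_rf < 1 - threshold) & (p_svm < 1 - threshold), "decision"] = "irrelevant"
    return out
```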

Research into using alternative data sources in the production of consumer price indices, ONS 130 Tanya Flower, (Email) Office for National Statistics, Newport, United Kingdom Alternative data sources such as web-scraped and point-of-sale scanner price datasets are becoming more commonly available, providing large sources of price data from which measures of consumer inflation could potentially be calculated. The ONS has been carrying out research into these data sources since 2014. ONS has recently acquired a robust source of web-scraped data from a third-party supplier and is continuing to pursue scanner data. Given this progress with acquiring alternative data sources, ONS has started a new stage of research to sketch out a proposed end-to-end pipeline, comprising the individual modules required to process the data, for example 'classification'. For each module, we have looked at the different methods that could be used and how they may differ between the data sources. In practice, this means that we need a pipeline that takes the raw input data, processes it, and outputs item-level indices, which are required as inputs into a final production platform. One of the major obstacles in this pipeline is product churn (the volume of products entering and leaving the sample). Methods to define suitable clusters of homogeneous products are seen as a way of solving this problem; however, they remain an open question in the international research at the moment. This presentation will touch on the modules required to create item-level indices from big datasets, before focusing on the clustering large datasets into price indices (CLIP) approach developed by ONS as a way of solving the issues associated with high product churn.
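
The sketch below illustrates the general idea of using clusters of broadly homogeneous products as the elementary aggregate, so that products churning in and out of the sample do not break the index; it is not the ONS CLIP implementation. Products are grouped here with k-means on numeric attributes, and a Jevons-type elementary index is compiled per cluster as the ratio of geometric mean prices to the base period. Column names (product_id, period, price and the attribute columns) are assumptions.

```python
# Illustrative clustering + per-cluster elementary index; not the CLIP method.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_products(attributes: pd.DataFrame, n_clusters=50, seed=0) -> pd.Series:
    X = StandardScaler().fit_transform(attributes.drop(columns=["product_id"]))
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    return pd.Series(labels, index=attributes["product_id"], name="cluster")

def index_by_cluster(prices: pd.DataFrame, clusters: pd.Series, base_period):
    prices = prices.merge(clusters, left_on="product_id", right_index=True)
    # geometric mean price per cluster and period
    gmean = (prices.assign(log_p=np.log(prices["price"]))
                   .groupby(["cluster", "period"])["log_p"].mean())
    index = np.exp(gmean).rename("gmean_price").reset_index()
    base = index[index["period"] == base_period][["cluster", "gmean_price"]]
    index = index.merge(base, on="cluster", suffixes=("", "_base"))
    index["index"] = 100 * index["gmean_price"] / index["gmean_price_base"]
    return index[["cluster", "period", "index"]]
```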

Paving the way forward for a modern and responsive national statistical office 148 Susie Fortier, (Email) Statistics Canada, Ottawa, Canada Faced with today’s data revolution, many national statistical offices (NSOs) are modernising their programs to be more efficient and responsive to users’ needs. A strong statistical research and development (R&D) program is one way to fuel evidence-based decision-making and a key enabler of modernisation. Our agency’s methodology R&D program is geared towards providing solutions to current issues and challenges, identifying or developing sound theoretical frameworks, and exploring new areas which could be beneficial in the medium- to long-term future. This paper will present a high-level view of our current research priorities and recent achievements. In particular, concrete examples in the following five specific areas will be showcased: (1) use of non-probabilistic data sources in a scientifically rigorous framework; (2) defining, measuring and communicating quality in a multi-source environment; (3) access, privacy and confidentiality; (4) high-definition data and small area estimation; and (5) use of machine learning to improve the statistical production process. The paper will also briefly discuss how the R&D program is evolving to address the ever-changing demand landscape, and how this impacts the forthcoming research needs.

Using web scraped data to verify Egyptian consumer price indices 136 Mina Gerges, (Email) CAPMAS, National Statistics Office of Egypt, Cairo, Egypt; Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt The purpose of this paper is to present alternative ways of data collection for NSOs; it also covers the manipulation and analysis of web-scraped data by tracking online prices across market websites and cities in near real time. Recently, many companies in Egypt have launched e-commerce websites, one of them being souq.com, owned by Amazon, Inc., which has made data scraping more feasible and has given rise to web scrapers: software tools for extracting data from web pages. The growth of online markets over recent years means that many products and their associated price information can be found online and can potentially be scraped. The consumer price index is one of the official statistics estimated using the prices of a sample of representative items collected periodically, so it is one of the best examples in this sense: scraping e-commerce websites and websites that publish current product prices makes it possible to collect prices for some products and services automatically, rather than physically visiting stores to collect prices manually. This offers a range of benefits, including reducing data collection costs, increasing the frequency of collection and the number of products in the basket, and improving our understanding of price behaviour. This paper introduces a generic tool that automatically collects online prices ("scraped data"), based on multiple search engines, to crawl the newest prices and e-commerce websites. The developed tool aims to reduce the costs of the data collection process, relying on big data analytics. Finally, the methodology of this paper is based on machine learning methods that support the crawling of market data on the web, automatic price scraping and the evaluation of the scraped data.
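
A minimal sketch of a generic price scraper of the kind described is shown below. The URL and CSS selectors are placeholders, not the structure of any real e-commerce site, and a production crawler would also respect robots.txt, rate limits and the site's terms of use.

```python
# Hedged sketch of a generic price scraper; URL and selectors are placeholders.
import csv
import datetime as dt
import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://example-market.test/category/food?page={page}"  # placeholder

def scrape_prices(pages=3, out_path="prices.csv"):
    rows = []
    for page in range(1, pages + 1):
        html = requests.get(LISTING_URL.format(page=page), timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.select("div.product"):                 # placeholder selector
            name = item.select_one("span.name").get_text(strip=True)
            price = item.select_one("span.price").get_text(strip=True)
            rows.append({"date": dt.date.today().isoformat(),
                         "product": name, "raw_price": price})
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "product", "raw_price"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```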

On the Coefficient of Variation and its Inverse 081 Georgia Giamloglou, (Email), Athina Maleganou, (Email), Myrto Papageorgiou, (Email), Nikolaos Farmakis, (Email) Aristotle University of Thessaloniki, Thessaloniki, Greece In this paper we study the coefficient of variation of a continuous random variable and some related concepts, such as its inverse (denoted ICv or Cv^-1) and its squared inverse (denoted ICv^2 = q), etc. Basically, we try to develop the asymptotic sampling distribution of the inverse coefficient of variation ICV = CV^(-1). This distribution is used to draw statistically significant inferences about the coefficient of variation, or the inverse coefficient of variation, of a random variable X without making an assumption about the population distribution of X. We focus on some cases of random variables following specific distributions, dealing with the related parameters of those distributions; for example, our attention is mostly devoted to extracting results (confidence intervals, hypothesis tests) for the parameters of the Gamma distribution, the Weibull distribution, etc. Some examples are given in order to illustrate the behaviour of ICv and ICv^2 in relation to the above-mentioned concepts: confidence intervals, hypothesis testing, etc. for the random variable X.
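
One standard route to such a distribution-free asymptotic result is the delta method; the block below is a generic sketch of that argument under finite fourth moments, not necessarily the authors' exact derivation or notation.

```latex
% Generic delta-method sketch; finite fourth moments assumed.
Let $X_1,\dots,X_n$ be i.i.d.\ with mean $\mu$, variance $\sigma^2>0$ and central
moments $\mu_3$, $\mu_4$, and write $\widehat{\mathrm{ICV}} = \bar X_n / S_n$ for the
sample inverse coefficient of variation. The bivariate central limit theorem for
$(\bar X_n, S_n^2)$ together with the delta method applied to $g(m,v)=m\,v^{-1/2}$ gives
\[
  \sqrt{n}\,\bigl(\widehat{\mathrm{ICV}} - \mu/\sigma\bigr)
  \;\xrightarrow{\;d\;}\;
  N\!\left(0,\;
     1 - \frac{\mu\,\mu_3}{\sigma^4}
       + \frac{\mu^2\,\bigl(\mu_4-\sigma^4\bigr)}{4\,\sigma^6}\right),
\]
so approximate confidence intervals and tests for the inverse coefficient of
variation (and, by inversion, for the coefficient of variation itself) follow
without assuming a particular population distribution, with the central moments
replaced by their sample counterparts.
```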

ModernStats World: modernising official statistics through standards and sharing 076 Taeke Gjaltema, (Email) United Nations Economic Commission for Europe, International Organisation, Other Under the High-Level Group on the Modernisation of Official Statistics (HLG-MOS) of the United Nations Economic Commission for Europe, several standards and models have been developed. They are collectively known as the ModernStats standards. The aim of these models is to provide a common language and a common model to structure the modernisation of the production of official statistics within organisations. The standards and models can be used to identify, resolve and prevent duplication within an office, but also to facilitate the sharing of information and statistical services between offices. ModernStats is, moreover, first and foremost a sharing platform for producers of official statistics to share and collaborate.

Mobile device tracking and transportation mode detection 146 Yvonne Gootzen, (Email), Marco Puts, (Email) Statistics Netherlands, Heerlen, Netherlands In cooperation with a Dutch telecom provider, Statistics Netherlands has been developing a model for the tracking of mobile devices. Using multiple observations of an employee's device, a probability distribution of the travelled path can be computed. This article serves as a proof of concept for the proposed methods. The available data contain records of connections between antennas and the device. A separate data source contains properties of all antennas in the network. A signal strength model translates the properties of each antenna into a probability distribution of the position of a device over a grid of 100 m by 100 m cells. A Markov Chain Monte Carlo model uses these probability distributions as input in the form of observations. The proposed approach not only results in an estimated travel path of the device, but also provides an estimate of the transportation mode for each time step.
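
In heavily simplified form, the estimation idea can be sketched as follows: each antenna observation is turned into a likelihood over grid cells, and a discrete hidden-Markov forward pass combines successive observations with a movement model. The distance-decay likelihood and the movement kernel below are assumptions for illustration; the paper's actual approach uses a signal strength model feeding an MCMC sampler.

```python
# Simplified grid-cell localisation sketch; not the CBS model.
import numpy as np

def antenna_likelihood(cell_coords, antenna_xy, decay=0.002):
    """P(observation | device in cell), decaying with distance to the antenna."""
    d = np.linalg.norm(cell_coords - antenna_xy, axis=1)
    lik = np.exp(-decay * d)
    return lik / lik.sum()

def forward_pass(cell_coords, antenna_sequence, max_step=2000.0):
    n = len(cell_coords)
    # movement model: uniform over cells reachable within max_step metres
    dists = np.linalg.norm(cell_coords[:, None, :] - cell_coords[None, :, :], axis=2)
    trans = (dists <= max_step).astype(float)
    trans /= trans.sum(axis=1, keepdims=True)
    belief = np.full(n, 1.0 / n)              # uniform prior over grid cells
    beliefs = []
    for antenna_xy in antenna_sequence:
        belief = belief @ trans                                    # predict
        belief = belief * antenna_likelihood(cell_coords, antenna_xy)  # update
        belief /= belief.sum()
        beliefs.append(belief.copy())
    return np.array(beliefs)  # one posterior over cells per observation
```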

Promoting reproducibility-by-design in statistical offices 173 Sybille Luhmann, (Email), Jacopo Grazzini, (Email), Fabio Ricciato, (Email), Matyas Meszaros, (Email), Konstantinos Giannakouris, (Email), Jean-Marc Museux, (Email), Martina Hahn, (Email) European Commission - Eurostat, Luxembourg, Luxembourg Because policy advice is becoming increasingly supported by data resources, National Statistical Offices (NSOs) need to leverage the use of new sources of (small or big) data to inform policy decisions. At a time when citizens’ demands for more trust in public institutions are growing, this underpins the movement towards more open, transparent and auditable (verifiable) decision-making systems. Bearing in mind the reproducibility movement towards Open Science and best practices in the Open Source Software (OSS) community, it is expected that greater openness, transparency and auditability in designing statistical production processes will result in improved quality of the analysis involved in decision-making as well as increased trust in the NSOs. Drawing upon the Transparency and Openness Promotion guidelines and the Reproducibility Enhancement Principles in computational science, as well as the recommendations to funding agencies for supporting reproducible research and the various calls for open and transparent (data and) algorithms in the field of statistics, we advocate the following principles: Shared, Transparent, Auditable, Trusted, Participative, Reproducible, and Open (or, in short, STATPRO), to also be adopted by NSOs. These principles build not only on existing and new sources of data, but also on new methodologies and emerging technologies, and advance thanks to innovative initiatives. With the increased availability of open data, new developments in open technologies and open algorithms, as well as recent breakthroughs in data science, it is believed that they can help improve current governance processes by enabling data-informed, evidence-based decision-making and potentially reduce the bias, costs and risks of policy decisions. In this regard, Official Statistics should be accompanied, whenever possible, by access to the data analysed, the detailed metadata information, the underlying assumptions (models and methods) and also the tools (software) used to generate them. In the context of a “post-truth” society, the STATPRO principles hold substantial promise for Citizen Statistics and e-Official Statistics, e.g. for the interaction between NSOs, data users and data producers. Indeed, they make it possible to provide the public with the ability to both perform the analysis and repeat it with different hypotheses, parameters, or data, hence translating policy questions into a series of well-understood computational methods and scrutinising the final decision. In this contribution, we further emphasise the need for Official Statistics to go beyond current practice and exceed the limits of the NSOs and the European Statistical System so as to reach and engage with produsers – e.g. statisticians, scientists and also citizens. Through the adoption of some best practices derived from the OSS community and the integration of modern technological solutions, the STATPRO principles can help create new participatory models of knowledge and information production. We illustrate this trend through recent Eurostat initiatives.

Delivering Official Statistics as Do-It-Yourself services to foster produsers’ engagement with Eurostat open data 179 Jacopo Grazzini, (Email), Julien Gaffuri, (Email), Jean-Marc Museux, (Email) European Commission - Eurostat, Luxembourg, Luxembourg This contribution aims at promoting new forms of production and dissemination for Official Statistics. We propose to align with current best practices in other domains, e.g. the reproducibility movement in computational science and the Open Source Software (OSS) community, so as to support new modes of production of e-Official Statistics. In doing so, we further highlight the importance of capturing and sharing data as well as algorithms and software. In practice, today's technological solutions – e.g. flexible Application Programming Interfaces, lightweight virtualised container platforms, versatile interactive notebooks, etc. – make the development and deployment of reproducible statistical workflows easy by supporting an approach where data and algorithms are delivered as portable, interactive, reusable and reproducible computing services. In a constantly evolving data ecosystem, the proposed approach supports new modes of production of e-Official Statistics, since perfectly configured and ready-to-use computing environments can be distributed with any newly published Official Statistics to the public. The dissemination of reproducible (and reusable) computational statistics platforms and services offers the prospect of actively and durably engaging the public with the Statistical Offices in the co-creation, design, implementation, testing and validation of statistical products. Indeed, not only does it provide users with the data, tools and methods to fully reproduce experiments, by rerunning or tweaking previous data analyses, it also allows them to “judge for themselves if they agree with the analytical choices, possibly identify innocent mistakes and try other routes”. We illustrate the discussion with practical tools and examples for accessing Eurostat data and metadata.

Breaking data silos and closing the semantic gap with Linked Open Data: an example with Eurostat data and metadata and statistical concepts 177 Aleksander Skibinski, (Email), Sybille LUHMANN, (Email), Jacopo GRAZZINI, (Email), Jean-Marc MUSEUX, (Email) European Commission - Eurostat, Luxembourg, Luxembourg Linked Open Data (LOD) is a term used to identify a set of principles for publishing and interlinking structured data. In recent years, several EU initiatives have sought to encourage the development of LOD technology and the publication of data as LOD (e.g. the European Open Data Portal initiative). As the major provider of Official Statistics on Europe, Eurostat is also currently improving its agility in responding to new user needs by making it easier for Eurostat's statisticians and data analysts to search and integrate Eurostat data. One of the main obstacles to achieving these goals is the fragmented nature of Eurostat's current data architecture. Eurostat's data and metadata remain largely confined in separate data silos with a low degree of interoperability, which makes finding and combining data from different domains cumbersome. Through a few use-case applications, the benefit of LOD is demonstrated in enabling: (i) statisticians and data analysts within Eurostat to find data that can be used to answer questions or analytical requests from users, and (ii) users to exploit the links to other data sources to enrich their analysis of Eurostat’s data or discover new facts about these data. Besides modelling Eurostat’s data and metadata as linked data using well-known ontologies, this paper also explores new models for knowledge representation, derived from statistical domain expertise, to allow a harmonised (and “linkable”) description of statistical concepts and other ontological constructs.
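
As an illustration of what "modelling statistical data as linked data with well-known ontologies" can look like, the sketch below expresses a single observation with the W3C RDF Data Cube vocabulary using rdflib. The dataset URI, the dimension properties and the value are invented placeholders, not actual Eurostat linked-data resources.

```python
# Illustrative RDF Data Cube sketch; URIs under example.org are placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
SDMX_MEASURE = Namespace("http://purl.org/linked-data/sdmx/2009/measure#")
EX = Namespace("http://example.org/def/")            # placeholder namespace

g = Graph()
g.bind("qb", QB)
g.bind("sdmx-measure", SDMX_MEASURE)

obs = URIRef("http://example.org/obs/unemp-BE-2018")  # placeholder observation
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, URIRef("http://example.org/dataset/unemployment")))
g.add((obs, EX.refArea, Literal("BE")))
g.add((obs, EX.refPeriod, Literal("2018", datatype=XSD.gYear)))
g.add((obs, SDMX_MEASURE.obsValue, Literal(6.0, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```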

Seasonal and calendar adjustment of daily time series 197 Sylwia Grudkowska, (Email) National Bank of Poland, Warsaw, Poland Although high-frequency data, i.e. data observed at infra-monthly intervals, could provide valuable information to official statistics, they are rarely modelled, due to numerous estimation problems. One of the most crucial is the proper identification of the various periodicities. High-frequency series often include multiple types of seasonality and many other effects that make the distinction between the various periodic components, and also between the trend-cycle frequencies and the annual frequencies, troublesome. Another challenge is the high volatility of such data, which influences the identification and modelling of outliers, breaks and calendar effects. As the availability of high-frequency data is growing rapidly and no officially recommended method for the seasonal and calendar adjustment of high-frequency time series exists, there is growing pressure to develop efficient procedures that estimate all seasonal patterns with different periodicities. This paper focuses on modelling the daily time series of currency in circulation in Poland. This series is an important factor influencing the level of banking sector liquidity and is vital for conducting monetary policy; there is therefore a need for its proper modelling and forecasting. The aim of this paper is to improve the models currently used by the National Bank of Poland for currency in circulation. For this purpose the experimental R routines developed by Jean Palate (National Bank of Belgium) were used. These algorithms allow the estimation of a model that contains any number of periodicities, as well as automatic outlier detection and the generation of regression variables corresponding to holidays. The decomposition of the series is performed in an iterative way using canonical decomposition (SEATS), STL or X11.
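
A toy sketch of one way to capture several periodicities at once in a daily series is given below: trigonometric regressors for the weekly and annual cycles plus a linear trend, fitted by least squares. The number of harmonics is an arbitrary illustrative choice, and this is not the experimental routines used in the paper.

```python
# Toy multi-periodicity regression for a daily series; not the NBB routines.
import numpy as np

def seasonal_design(n_days, periods=(7.0, 365.25), harmonics=(3, 5)):
    """Intercept, linear trend, and sine/cosine terms for each periodicity."""
    t = np.arange(n_days)
    cols = [np.ones(n_days), t]
    for period, n_harm in zip(periods, harmonics):
        for k in range(1, n_harm + 1):
            cols.append(np.sin(2 * np.pi * k * t / period))
            cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

def fit_and_decompose(y):
    X = seasonal_design(len(y))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted  # deterministic trend+seasonal fit and residual
```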

Migrants and European welfare systems: what data tell us on their well-being and impact on health systems 184 Caterina Francesca Guidi, (Email), Gaby Umbach, (Email) European University Institute, Firenze, Italy Today, migration is one of the key issues in the international and European political as well as public debate. One of the most compelling challenges concerning the integration of migrants into receiving societies consists in the adaptation of national healthcare systems to migrants’ needs. Within the European Union (EU), member states (MS) differ hugely in terms of their healthcare provision models, contribution systems, and the integration policies adopted towards foreigners. Differences in access to and use of healthcare systems by migrants from within the EU and those from outside the EU (i.e. third-country nationals) are still considerable and are further diversified based on migrants’ legal status. To analyse these differences between the traditional types of healthcare systems within the EU, it is necessary to establish and measure the systematic relationship between the costs and performance of healthcare systems, migratory care demand, and the migrants’ contribution to the MS’ political and economic systems. As recently reinforced globally by the United Nations in the development of the Global Compacts on Migration and Refugees (2018), as well as in initiatives like the Sustainable Development Goals (2015) or the Global Consultations (2010, 2017) led by the International Organization for Migration and the World Health Organization, migrant health has been one of the strategic areas. Especially in Europe, it has been underlined that the response of MS’ health systems, and their sensitivity to the immediate and longer-term health needs of migrants and refugees, should be central. At the same time, the debates about the pressures on MS’ healthcare systems are overshadowed by the conflict between ‘factual’, i.e. measurable, and ‘post-factual’, i.e. perceived, realities. While the former might paint a not too negative picture of the performance of national healthcare systems in reaction to migration, the latter might depict a doomsday scenario that is to end in the collapse of national healthcare systems. To provide evidence-based insights into the topic for our case study, the United Kingdom (UK), we analyse the role of diverging data narratives and contrast ‘alternative truth scenarios’ in ‘Brexit-UK’ with the ‘real’, i.e. measurable, impact of migrants on the British healthcare sectors. The key aim of the paper is to juxtapose factual and perceived evidence in the given case and to elaborate on key methodological strategies for how best to disentangle the two. Furthermore, we analyse the consequences for migrants’ well-being and their health status in the UK.

App for the “Time Budget” Survey 080 CARMEN GUINEA LETE, (Email), TERESA IBARROLA, (Email) BASQUE STATISTICS OFFICE - EUSTAT (SPAIN), VITORIA-GASTEIZ, Spain Eustat has been using browser-based electronic questionnaires to gather the information for the statistical operation survey on time budgets since 2008, when the questionnaire was primarily used as IT support for pollsters in management and data collection control tasks. However, Eustat saw the need to extend the use of these online questionnaires to survey respondents, foreseeing that data collection via the Internet would become the preferred method in the near future. Therefore, in the 2013 campaign, the necessary changes were made to the questionnaire and those being surveyed were offered the opportunity to complete the questionnaire directly on the Internet. This experience was very positive, as analysis of the quality of this data compared to the data gathered through other methods, such as visits and phone calls, revealed that there were no significant differences between the two; the downside was that the percentage of people who used this method was small, at 5.27%. During the 2018 campaign, measures to increase the direct completion of questionnaires and to boost the percentage of those using the internet to complete them were reinforced. The main measures included:
• Improving the usability of the Web questionnaire, changing its design completely compared to the previous one, based on past experience and on the new resources that technological developments provide.
• Automation of all the processes that can be automated, to lighten the burden of answering questionnaires.
• The incorporation of multimedia information with explanations of how to fill in online questionnaires.
• Implementation of a single point of access for both survey respondents and pollsters, called SARW, to improve access security and to guarantee data protection.
• Development of a mobile phone app for both Android and iOS platforms, as a new experimental method of collecting data directly from the survey respondent, including interconnection with the web survey, so that both methods can be used interchangeably to complete it.
• Implementation of an app that manages the survey process from beginning to end, including “hot” codification and validation, known as GEDW.
The purpose of the proposed paper is to showcase both the design and performance of the app developed at Eustat, which was created on 9 April this year for the Survey on “Time Budgets” as a pilot project for the short-term standardisation of the new method of survey collection using mobile devices (smartphones and iPhones). This app has been set up to work on both Android and Apple smartphones. Furthermore, the design used in the web surveys and the design used in the app surveys will be compared, owing to the differences between the electronic devices used in the two media.

Integration of European data about globalization into the French Statistical Business Register 022 Olivier Haag, (Email), Isabelle Collet, (Email) INSEE, Paris, France What is important for Statistical Business Registers (SBRs)? Their main purpose is to provide a solid and reliable infrastructure, a genuine basis for business statistics. To improve the quality of business statistics, the European Statistical System (ESS) set up the European System of Interoperable Statistical Business Registers (ESBRs) project, to tackle issues such as the inconsistencies in SBR processes due to the absence of a common infrastructure for linking and sharing SBRs’ information. The ESBRs project has already delivered important outputs, such as the upgraded EuroGroups Register (EGR) 2.0, recording the multinational groups present in Europe, and the setting up of the Interactive Profiling Tool (IPT) prototype for collaborative European profiling. The French SBR consists of a Business Register (BR) network. In practice, the core French SBR, called SIRUS, is fed by three so-called “authentic source BRs”, each dealing with one type of statistical unit: a BR called SIRENE handles the legal units, another called LIFI handles the groups, and finally BCE handles the enterprises. Currently, the French SBR provides groups’ data to the EGR according to the EGR production cycle; EGR production is managed by the Eurostat team. The French SBR also provides group and enterprise data to the IPT according to a Grant Agreement (GA) between the French National Statistical Institute (NSI) and Eurostat. In light of the progress of the EGR and the IPT, the French NSI intends to integrate data from the EGR and from the IPT into the French SBR. The aim is to increase the quality of the French SBR. More specifically, the integration of data from the EGR will provide a better view of the foreign structure and nationality of multinational groups present in France. Moreover, the IPT will provide valuable information about French enterprises belonging to foreign groups. In particular, it will allow the French NSI to improve the delineation of enterprises within groups that are automatically profiled. The article will first set out the French SBR network production process, and will then explain the quality improvement due to the input of EGR and IPT data into the French SBR.

Functional geographies through the R package LabourMarketAreas 156 Daniela Ichim, (Email), Luisa Franconi, (Email) Istituto Nazionale di Statistica, Rome, Italy National Statistical Institutes are requested to provide increasingly detailed information reflecting the underlying structure of the society at which policy decisions need to be targeted. There is a growing request not only for higher geographical detail related to administrative boundaries, but also for meaningful statistics for the same geographical areas. Labour Market Areas (LMAs) have long been recognised as relevant for assessing the effectiveness of local policy decisions in labour-related matters. An LMA is a functional geographic area defined for the purposes of compiling, reporting and evaluating employment, unemployment, workforce availability and related topics. LMAs are generally intended as areas built on the basis of commuting-to-work data, so that the majority of the labour force lives and works within their boundaries. The LMA delineation is a complex process that deals with commuting data availability, clustering algorithms and spatial statistics concepts. Istat has developed the R package LabourMarketAreas, Ichim et al. (2018), to ease the LMA production process. This paper illustrates the main features of the R package LabourMarketAreas.
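
The self-containment idea behind such delineations can be sketched very simply: given a commuting matrix between municipalities and a candidate grouping into areas, compute the share of employed residents who work inside their area (supply side) and the share of local jobs held by residents (demand side). The LabourMarketAreas package implements a much richer iterative algorithm; the sketch below, with assumed column names (home, work, flow), only illustrates the basic quantities.

```python
# Simplified self-containment calculation; not the package's algorithm.
import pandas as pd

def self_containment(commuting: pd.DataFrame, area_of: dict) -> pd.DataFrame:
    """commuting has columns home, work, flow; area_of maps municipality -> area."""
    c = commuting.assign(home_area=commuting["home"].map(area_of),
                         work_area=commuting["work"].map(area_of))
    live = c.groupby("home_area")["flow"].sum()          # employed residents per area
    work = c.groupby("work_area")["flow"].sum()          # jobs located in each area
    inside = (c[c["home_area"] == c["work_area"]]
                .groupby("home_area")["flow"].sum())     # live and work in same area
    return pd.DataFrame({
        "supply_side": inside / live,   # share of residents working locally
        "demand_side": inside / work,   # share of local jobs held by residents
    })
```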

Urban big data as innovation platform in smart city context – Case Helsinki 185 Laitinen Ilpo, (Email) City of Helsinki, Helsinki, Finland We are in the midst of a new economic age, a complex competitive landscape defined largely by globalization and digitalization. That means that the utilization and production of knowledge and innovativeness have become critical to organizational survival (Uhl-Bien – Marion – McKelvey, 2007). That development has also had an impact on urban development: urban development, performance and competitiveness are seen as depending on the availability and quality of knowledge (Caragliu – Del Bo – Nijkamp 2011). The smart city has been discussed and became fashionable especially after 2010, but it is still somewhat fuzzy and not clearly defined. The term has even been used to refer to cities which do not have clear strategies or processes supporting it (Dameri – Cocchia, 2013). The smart city concept originated from that of the ‘information city’, but now has a much broader and deeper scope. As in the case of the City of Helsinki, the term ‘ubiquitous’ was widely in use in the early phase of the smart city concept. That term was derived from ‘ubiquitous computing’, and thus at that very early stage of smart city concept development the dominant thinking was about defining services via the integration of IT (Lee – Phaal – Lee, 2013). What has changed is the exponential rise in the volume of raw data available to us. The big data phenomenon within the urban context is growing exponentially among public sector organisations. There is no precise and exact definition of big data. There is no numerical or quantified definition; big data has to be seen rather as massive and typically very complex sets of information. The concept has referred to massive, voluminous amounts of data and, secondly, to the analysis of those data. Despite the variations in the definitions, they all have at least one of the following assertions in common: the massive size of the dataset, the structural complexity of the data, and the technologies needed and used to analyse the datasets. The quantitative and computational techniques side of big data has over 40 years of history, but what is now changing and creating the real revolution is how we are using big data (Mayer-Schönberger & Cukier 2013, 6-9; Ward & Barker 2013; Barnes 2013). As Michael Batty noted, what is needed is a new theory, since data without theory are not sufficient (Batty, 2013). The need for theory calls for an understanding of how to use analytics to improve e.g. public services, and thus it will also have a significant impact on how we collect, use and interpret data and on how professionals and experts work. In a highly connected information society, learning and the process of acquiring new skills and knowledge are of fundamental importance and define whether we are active participants or passive observers in the digitalisation process. Importantly, it is not just about learning new skills but about unlearning old ones. The challenge for us is to develop new adaptive communities and working methods and to ensure that people embrace change while maintaining and gradually transitioning away from old practices (Laitinen et al., 2017).

Data integration methods and tools in the ESBRs 035 Razvan Cristian IONESCU, (Email), Enrica MORGANTI, (Email) Eurostat, Luxembourg, Luxembourg The European system of interoperable Statistical Business Registers (ESBRs) project is one of the ESS 2020 Vision Implementation Projects aimed at improving the quality of statistics in the EU. In the ESBRs, Eurostat and the European Statistical System (ESS) partners cooperate by exchanging and integrating micro data on legal units, control relationships between legal units and enterprises, to achieve a complete view of the structure and activities of multinational groups operating in the EU. The need to exchange statistical information on multinational groups comes from the fact that each national statistical office alone is unable to derive a complete and correct picture from its national administrative sources: it can observe only a 'truncated' view of the multinational groups, covering the legal units that are resident on its territory and some cross-border relationships, while information about non-resident legal units and the control chain outside its territory is usually not reachable. The integration process is managed centrally at Eurostat and takes place in the EuroGroups Register system (EGR 2.0 [1]). The national statistical business registers send their input data covering EU legal units, while other, commercial sources are used to cover non-EU legal units. In addition, national statisticians in different NSIs can improve the results automatically generated by the EGR. In particular, they further integrate statistical information obtained by manually profiling some of the largest and most relevant multinational groups in the EU. The result of European profiling is subsequently integrated into the EGR and constitutes another important pillar of the whole integration process in the ESBRs. This step of the ESBRs data integration process generates additional challenges because the view adopted in European profiling is top-down, while the EGR integration process works bottom-up. The output of this integration process is a statistical frame, called the EGR global frame, containing the consolidated legal structure of the multinational groups in the EU and their statistical units. The EGR global frame is sent back to the national statistical institutes (NSIs). This feedback can be used at national level and integrated back into the national statistical business registers to improve and complete their partial view of the multinational groups. Ultimately, the EGR global frame should function as the coordination tool for all ESS statisticians to improve the quality and consistency of data measuring the activities of multinational groups across the EU.

Crowd-sourcing: Integrated data collection systems: Palestinian statistical business register, obstacles and methodologies 078 Osaid Ismail, (Email) Palestinian Central Bureau of Statistics, Ramallah, Other
KEYWORDS: Administrative Business Register, data integration, coverage, quality, statistical business register, unique ID, matching process
1. INTRODUCTION The Administrative Business Register is considered one of the most important examples of integrating data from several sources containing different forms of data, in terms of coverage and quality, because of the differences in registration methodologies and in the main purpose of recording for each source. In the Palestinian case, PCBS built a preliminary version of the register by collecting data from all the sources that register economic establishments in Palestine (Ministry of National Economy, Chambers of Commerce, Municipalities and Ministry of Finance), regardless of their registration methodology. By matching these data with the census data collected by the National Statistical System, PCBS obtains the so-called statistical business register, which is used as a sampling frame for surveys. The integration of all sources is carried out by matching the data of each source to a "backbone" file, which in the Palestinian case is the Ministry of National Economy file, because this file provides the unique ID required in the matching process.
2. METHODS Matching all the source files leads to the Administrative Business Register and forms the first core of a standardised model for all the partners that contribute to the register.
3. METHOD RISKS The matching process faces several obstacles and difficulties, such as poor data quality, the fact that most sources contain unreliable data, and the absence of a unified identification number on which to match.
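
The matching step could, in very simplified form, look like the sketch below: a deterministic match on the Ministry of National Economy identifier where it exists, with a crude name-and-locality fallback for records lacking the unique ID. This is not the PCBS system; the column names (mne_id, name, locality) are assumptions.

```python
# Hedged sketch of source-to-backbone matching; column names are assumed.
import pandas as pd

def normalise(s: pd.Series) -> pd.Series:
    return s.fillna("").astype(str).str.strip().str.lower()

def match_to_backbone(source: pd.DataFrame, backbone: pd.DataFrame) -> pd.DataFrame:
    # 1) deterministic match on the unique Ministry of National Economy ID
    merged = source.merge(backbone[["mne_id"]].drop_duplicates(),
                          on="mne_id", how="left", indicator=True)
    exact = merged[merged["_merge"] == "both"].drop(columns="_merge").copy()
    exact["match_type"] = "unique_id"
    # 2) crude fallback for the remainder: normalised name plus locality
    rest = merged[merged["_merge"] == "left_only"].drop(columns="_merge").copy()
    rest["key"] = normalise(rest["name"]) + "|" + normalise(rest["locality"])
    bb = backbone.copy()
    bb["key"] = normalise(bb["name"]) + "|" + normalise(bb["locality"])
    fuzzy = rest.merge(bb[["key", "mne_id"]], on="key", how="inner",
                       suffixes=("", "_backbone"))
    fuzzy["match_type"] = "name_locality"
    return pd.concat([exact, fuzzy], ignore_index=True)
```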

Looking for immigrants in the European Labour Force Survey and the EU census: a comparison based on the 2011 figures 028 Georgiana Ivan, (Email), Mihaela Agafitei, (Email) Eurostat, Luxembourg, Luxembourg An increase in labour mobility in the EU, coming both from other Member States and from countries outside its territory, creates a need for detailed and up-to-date statistical information to feed current political discussions on labour market policies in general and on labour mobility in particular. The two main data sources for obtaining information on the topic are the EU Labour Force Survey and the Population and Housing Census, supplemented by yearly administrative population data. The objective of this paper is to compare the data obtained from the EU-LFS and the EU Census regarding the characteristics of the foreign-born population residing in the EU. Combining these two data sources can bring new opportunities for analysing this group. The purpose is therefore to provide a clear overview of the differences and similarities when comparing the results coming from the two data sources for the same reference year, 2011. Potential explanations for the differences identified are also put forward. The main question is whether or not it would be possible to update the more structural information in the EU Census with the more timely data available in the EU-LFS when publishing statistics about the foreign-born population. The conclusion, after analysing the data, is that the current situation is not favourable to combining the two data sources to obtain more frequent estimates for the foreign-born population. Despite the high power of the EU-LFS to capture the foreign-born population, important weaknesses arise when this population is broken down by some of the dimensions analysed in this paper, in particular those that are of high relevance for looking at their labour market situation. The recommendation is to look for solutions to substantially increase the power of capturing the foreign-born population in a sample so that it can be analysed in its multiple facets, and to collaboratively look for best practices for better collecting data on this population, which is of very high relevance from a policy perspective.

GDP Flash Estimates for Germany 107 Xaver Dickopf, (Email), Christian Janz, (Email), Tanja Mucha, (Email) Federal Statistical Office Germany (Destatis), Wiesbaden, Germany In spring 2016, Eurostat began publishing a first so-called preliminary flash estimate for the quarterly GDP of the European Union and the euro area within 30 days after the end of the quarter. This European GDP flash is currently based on national data of 17 member states that cover 94% of the GDP of the euro area and 90% of the GDP of the European Union (as of May 2018). The data for Germany are provided by the Federal Statistical Office, which has more than 15 years of experience in conducting GDP flash estimates. This paper presents the German contribution to the European preliminary flash estimate at t+30. It traces the development from the first study on the feasibility of a quarterly German flash estimate, begun in 2002, to the current role of the flash estimate within the calculations of the German and the European GDP. The German GDP flash estimate can be characterised as a three-pillar approach that consists of • an econometric calculation, • an experts’ calculation, • the reconciliation of the econometric and the experts’ calculation. All three pillars of the German GDP flash estimate are discussed with a focus on their properties and their risks and opportunities. The paper concludes with the quality of the German flash estimates at t+30 days and the way forward.

Instant Access to Microdata 187 Johan Heldal, (Email) 1, Ørnulf Risnes, (Email) 2, Svein Johansen, (Email) 1 1 Statistics Norway, Oslo, Norway 2 NSD - Norwegian Centre for Research Data, Bergen, Norway Norway has a large number of registers on individuals that have been established for administrative and statistical purposes, covering the entire population or significant subpopulations. The merged registers are used for the production of statistics and represent a valuable source of data for research. Trusted researchers in approved research institutions have been able to apply for access to the data at their own site. The approval procedure is complicated as well as time- and resource-demanding. The entry into force of the GDPR this May makes the process even more comprehensive. There has been a desire to simplify the procedure and at the same time make it safer through remote access and other measures. In 2012 the Norwegian Research Council funded a (then approx. four million €) project, Remote Access Infrastructure for Register Data (RAIRD), aiming at creating an analysis server for easier and safer remote access to register data. The project is a joint venture in which the grant was divided equally between NSD – Norwegian Centre for Research Data and Statistics Norway. Among the conditions for the project were: 1. Online Remote Access (RA). 2. Microdata are invisible; only statistical output is shown. 3. Users should be allowed to combine data from different sources. 4. All statistical results should be safe from a confidentiality perspective. In March this year the RAIRD technology was made operational in the research data service microdata.no.

Multi-source data integration with SBR 207 Steen Eiberg Jørgensen, (Email) Statistics Denmark, Copenhagen, Denmark Statistics Denmark’s (SD) Statistical Business Register (SBR) is very closely linked to an Administrative Business Register (Tax ABR) as well as a Central Administrative Business Register (CBR) administered by the Danish Business Authority. Together, the three institutions own and develop the CBR. SD is mentioned as a data partner in the Central Business Register Act. The CBR and SBR cover units from all institutional sectors.

IST - Data integration metadata driven concept 055 Branko Josipović, (Email), Mira Nikić, (Email), Siniša Cimbaljević, (Email) Statistical Office of the Republic of Serbia, Belgrade, Serbia Data processing and process automation, in NSIs and elsewhere, always consist of three main phases: data entry, data editing and, finally, data processing with report generation. Our idea for a new concept and approach was based on the need for a development environment in which there would be no programming, in which applications would be uniform, standardized, easy to create and independent of constant technological change (new development tools, programming languages, platforms). Furthermore, all data needed to be kept on the same platform, in the same type of format or the same type of relational database. This was our way to improve the IT support to statistical production, and also to develop a system that can be adapted for all potential users. The concept is worth sharing, and by sharing it creates a value chain for its users.

Study on using different modes as new techniques in CAPMAS's economic census data quality 140 Said Kamal, (Email) Central Agency for Public Mobilization And Statistics CAPMAS, Cairo, Egypt 1. INTRODUCTION: The growing use of smart phones is transforming how people communicate. It is now ordinary for people to interact while they are mobile and multitasking, using whatever mode best suits their current purposes: voice, text messaging, email, video calling or social media. People can no longer be assumed to be at home or at their workplace when they are talking on the phone, if they are willing to talk on the phone at all as opposed to texting or using another asynchronous mode of communication. And they may well be doing other things while communicating, more than they would have been even a few years ago. On the other hand, smart phones have proven successful worldwide for data collection. On that basis, the Egyptian government decided to use smart phones as a new tool in the 2018 economic census, and longstanding quality assurance practices based on telephone-interview pre-tests are being challenged. 2. METHODS & MATERIALS: In the study reported here, 634 owners or managers of establishments who had agreed to participate in an interview on their phones were randomly assigned to answer 32 questions from the main questionnaire of the Egypt economic census 2018, by text messaging or speech, administered either by a human interviewer or by an automated interviewing system. Ten interviewers from CAPMAS's quality control department administered the voice and text interviews; automated systems launched parallel text and voice interviews at the same time as the human interviews. 3. EXPERIMENTAL DESIGN: The experimental design contrasted two factors, interviewing medium (voice vs. text) and interviewing agent (human vs. automated), creating four modes in a 2x2 design (see Fig 1). The four modes were implemented to be as comparable to each other as possible so as to allow clean comparisons between voice and text as well as between human and automated interviewing agents. 4. RESULTS: Texting led to higher quality data than voice interviews, both with human and automated interviewers: fewer rounded numerical answers (see Fig 2), more differentiated answers to a battery of questions, and more disclosure of sensitive information. Text respondents also reported a strong preference for future interviews by text. 5. CONCLUSIONS: Our data do not lead us to argue that all interviews should now be carried out via text messaging or by automated systems. There are likely to be subgroups of the population who would rather not text and who prefer to speak to a human. Good automated systems have serious development costs (particularly speech systems), which may make them better suited to longitudinal studies where the development costs are amortized, as opposed to one-off or underfunded surveys. 6. REFERENCES: • Duggan M (19 Sept 2013) Cell phone activities 2013. Pew Internet and American Life Project. • Conrad FG, Brown NR, Dashen M (2003) Estimating the frequency of events from unnatural categories
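A stylised illustration of how data quality could be compared across the four modes of the 2x2 design, using rounding of numerical answers as the quality indicator; all counts below are invented for illustration and are not the study's results:

```r
# Illustrative comparison of rounded numerical answers across the four modes
# (human/automated x voice/text). Counts are invented, not the study's results.
counts <- matrix(c(120,  80,    # human voice:     rounded / not rounded
                    70, 130,    # human text
                   115,  85,    # automated voice
                    65, 135),   # automated text
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("human_voice", "human_text",
                                   "automated_voice", "automated_text"),
                                 c("rounded", "not_rounded")))

prop.table(counts, margin = 1)  # share of rounded answers per mode
chisq.test(counts)              # association between mode and rounding behaviour
```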

Developer and method services in the UN Global Platform 102 Joni Karanka, (Email) Office for National Statistics, UK, Newport, United Kingdom Governments and their statistical offices face global challenges. A good example of a challenge at this scale is the development of indicators to support the Sustainable Development Goals in an age in which data is fast, big and heterogeneous. Similarly, most official statistics are published to agreed international standards. This situation has led to statistical offices making overlapping and, at a global scale, redundant investments, while not leveraging their combined resource pool of knowledge, infrastructure and capital. We present two services of the United Nations Global Platform: the developer service and the methods service. Both are live, accessible services that make use of modern architectural patterns and technologies in order to facilitate the work of the global statistical community. The developer service allows data collaboratives to explore data and develop code and algorithms, while the method service allows users to easily publish and consume statistical methods and algorithms using a microservice pattern. In this session we discuss: a) how these services meet the needs of the global statistical community and make initiatives such as CSPA easier, b) what the services have achieved and demonstrated thus far, and c) what we have learned by deploying, managing and using the services.

Transparency and reproducibility of models and algorithms: examples from the UN Global Platform 176 Joni Karanka, (Email), Joe Peskett, (Email) Office for National Statistics, UK, Newport, United Kingdom Public administrations, organisations and citizens depend on evidence from the statistical community to take informed decisions. The impact of the evidence and analytics that the community provides depends deeply on the public trust that we are able to generate, and is in direct competition with other sources of information. Currently we face three immediate threats to that trust: a) a media environment in which unchecked facts are widespread in social media (‘fake news’), b) a data science and technology evolution towards large, unstructured datasets and less transparent algorithms and models, and c) the ‘closeness’ of many statistical datasets due to personal or commercial concerns. Shortcomings in the transparency of algorithms and data led to the ‘replicability crisis’ in the social sciences (see the Reinhart & Rogoff error in economics), a crisis to which, we argue, the official statistics community is not immune. Two of the main drivers of the UN Global Platform are the provision of trusted methods and trusted data. The provision of trusted methods, algorithms and models depends on a number of factors, such as: a) openness of the code base, b) openness of the data (and training data), c) description of the logic of the algorithm, and d) reproducibility of the algorithm by other researchers. We argue that the provision of a description of the methodology and the availability of the code are the minimum expected from national statistical organisations, but that efforts should be made to provide fully reproducible algorithms. To further enhance trust, we will provide examples of how UN Global Platform algorithms have been made more transparent by: a) provision of a public endpoint and execution for the algorithm, so that researchers can apply it to data they are familiar with, b) synthetic datasets to explore the workings of the algorithm / model, and c) notebooks with demonstrators of the algorithm in publicly available environments so that citizen users can understand the workings and implications of the algorithm.

A method for minimizing the residual term of the decomposable Gini index. 127 Eleni Ketzaki, (Email), Nikolaos Farmakis, (Email) Department of Mathematics, Aristotle University of Thessaloniki, Thessaloniki, Greece Abstract The Gini index of income inequality for grouped data is decomposed into the between-groups inequality and the within-groups inequality of the subgroups. In case the subgroup income ranges overlap, the decomposition of the Gini index is obtained as the sum of three terms: the between-groups index, the within-groups index and the residual term. In this study we propose a method for reducing the value of the Gini index's residual term in the case of overlapping. The proposed method concerns the representation of the Gini index as a matrix product. We study both the case of large subgroups and the case of small samples. The need to accurately calculate the Gini index from data grouped by categories commonly arises with income data, which are usually grouped for confidentiality purposes. 1. INTRODUCTION A decomposable inequality measure is defined as a measure such that the total inequality of a population can be broken down into the inequality existing within subgroups of the population and the inequality existing between subgroups. The Gini index of inequality can be expressed as a decomposable measure, but in case the subgroup income ranges overlap, the decomposition of the Gini index is obtained as the sum of three terms: the between-groups index, the within-groups index and the residual term. Our main goal in this study is to propose a correction that reduces the value of the residual term. The correction concerns the inequality between groups and the inequality within groups. The impact of the methodology is examined for both large and small subgroups. The second section describes the decomposition of the Gini index obtained as a matrix product and the proposed methodology that leads to the correction of the Gini index while reducing the value of the residual term and the bias in small samples. The third section contains the simulation results of the proposed method as well as results based on official data. We calculate the residual term before and after the correction for small and large subgroups and we compare the results regarding the value of the residual term. We also calculate the standard error and the corresponding confidence intervals. The last section describes the conclusions and the contribution of the proposed method to the simulation data and to the official data.
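For reference, the three-term decomposition referred to above can be written in standard notation (not taken verbatim from the paper) as:

```latex
% Gini decomposition for a population split into K groups: G_B is the
% between-groups Gini computed on group means, G_k the within-group Gini of
% group k, p_k its population share, s_k its income share, and R the residual
% term that appears when the group income ranges overlap.
\[
  G \;=\; G_B \;+\; \sum_{k=1}^{K} p_k\, s_k\, G_k \;+\; R .
\]
```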

Implementing Big Data in Official Statistics: Capture-recapture Techniques to Adjust for Underreporting in Transport Surveys Using Sensor Data 059 Jonas Klingwort, (Email) 1, 2, Bart Buelens, (Email) 3, Rainer Schnell, (Email) 2 1 University of Duisburg-Essen, Duisburg, Germany 2 Statistics Netherlands (CBS), Heerlen, Netherlands 3 Vlaamse Instelling voor Technologisch Onderzoek (VITO), Genk, Belgium Producing unbiased estimates in official statistics based on survey data is becoming more difficult and expensive. Accordingly, research on methods using big data for the production of official statistics is currently increasing. Up to now, big data has rarely been used in statistical production due to its unknown data generating process. However, in the long term, using big data in official statistics is unavoidable. Therefore, instead of using single big data sources, research on combining different probability and non-probability based datasets is a promising approach to using big data in official statistics. More specifically, the different problems of surveys and big data might be minimized if the survey and the sensors measure the same target variable and the resulting micro data can be combined with a unique identifier. Using this principle, we link survey, sensor, and administrative data for transport statistics. Using the linked dataset, we apply capture-recapture techniques to validate, estimate and adjust for bias due to underreporting in the target variables of the survey.
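In the simplest two-source setting, with survey and sensor records linked by a unique identifier, the classical Lincoln-Petersen estimator illustrates the idea (generic textbook form, not necessarily the exact estimator used by the authors):

```latex
% n_S: events captured by the survey, n_R: events captured by the sensors,
% m: events captured by both sources (identified through record linkage).
% The estimated total number of events, including those missed by both, is
\[
  \widehat{N} \;=\; \frac{n_S \, n_R}{m},
\]
% so survey underreporting can be gauged by comparing n_S with \widehat{N}.
```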

Smart Meter Data as a Source for Official Statistics 109 Karin Kraft, (Email) 1, Ingegerd Jansson, (Email) 2 1 Statistics Sweden, Örebro, Sweden 2 Statistics Sweden, Stockholm, Sweden In recent years, electricity smart meters have generated a lot of interest from producers of official statistics. The smart meters read electricity consumption and production from a distance and record data at short intervals. Large amounts of data, not only measurements but also attributes of metering points, customers, and providers, are collected in a common database (a hub). The data have an obvious attraction for official statistics producers. A Swedish data hub is under development and is planned for the end of 2020. The hub will contain data for more than 5 million metering points. For a few years now, Statistics Sweden has been recognized as a stakeholder. In addition, Statistics Sweden participated in the recent ESSnet Big Data project. In our paper, we describe the work we have done so far to prepare for the electricity hub as a possible source for relevant statistics on a number of topics, such as electricity use, environment and buildings. During the preparation for the hub, we have access to a small set of test data and present some results based on these data.

Estimating regional wealth in Germany: How different are west and east really? 100 Ann-Kristin Kreutzmann, (Email) 1, Philipp Marek, (Email) 2, Nicola Salvati, (Email) 3, Timo Schmid, (Email) 1, Sylvia Harmening, (Email) 1 1 Freie Universität Berlin, Berlin, Germany 2 Deutsche Bundesbank, Frankfurt am Main, Germany 3 University of Pisa, Pisa, Italy The increasing inequality of private income and wealth requires the redistribution of financial resources. Thus, several financial support schemes allocate budget across countries or regions. One compelling example in this context is the promoted catching-up process of East Germany after the German reunification. However, it is questionable whether, 25 years after the reunification, differences occur only between the East and the West, or whether an analysis at a lower regional level reveals a more diverse picture. In order to provide a data source for the estimation of private wealth, the European Central Bank launched the Household Finance and Consumption Survey (HFCS) for all euro area countries in 2010. This work shows how to obtain estimates based on the HFCS for low regional levels in Germany by means of a modified Fay-Herriot approach that deals with a) the skewness of the wealth distribution, using a transformation, b) unit and item non-response, especially the multiple imputation used, and c) inconsistencies of the regional estimates with the national direct estimate. Although the paper focuses particularly on Germany, the approach proposed is applicable to the other countries participating in the HFCS as well as to other surveys with a similar data structure.
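The area-level model underlying this approach is the Fay-Herriot model; in its standard form (before the modifications for skewness, multiple imputation and benchmarking listed above) it reads:

```latex
% Fay-Herriot area-level model for areas d = 1, ..., D: \hat{\theta}_d is the
% direct HFCS estimate for area d, x_d a vector of auxiliary variables,
% u_d a random area effect and e_d the sampling error with known variance \psi_d.
\[
  \hat{\theta}_d \;=\; x_d^{\top}\beta + u_d + e_d,
  \qquad u_d \sim N(0, \sigma_u^2), \quad e_d \sim N(0, \psi_d).
\]
```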

Small area estimation for the Dutch investment survey 050 Sabine Krieg, (Email), Joep Burger, (Email) Statistics Netherlands, Heerlen, Netherlands Traditionally, national statistical institutes prefer design-based estimation methods like the general regression estimator. However, there is increasing demand for detailed figures, for example on a regional breakdown, where the sample size is too small to compute reliable figures using design-based methods. Model-based small area estimation (SAE) can be a good alternative. We investigate this for the Dutch investment survey, with investments in tangible fixed assets as target variables and municipalities as areas. As usual for business surveys, a stratified sample design is applied; the strata are based on economic activity and size class, the latter based on the number of working persons. There are two interesting aspects in these data. The target variable is zero for almost 50% of the sample, as many enterprises do not have investments every year. Furthermore, the distribution of the non-zero values is right-skewed, as a few enterprises have very large investments. We compare SAE methods which take the properties of the data into account with SAE methods which ignore these properties, and with a design-based method. It is found that SAE can indeed improve the accuracy of the estimates. However, not all model specifications work well. The selection of the optimal specification is a subject for further research.
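A minimal sketch of one way to accommodate the two data properties mentioned (many zeros, right-skewed positive values) with a two-part model; this is an illustration on simulated data, not the authors' actual specification:

```r
# Two-part model sketch: a logistic model for having any investment at all,
# and a log-linear model for the amount, given that it is positive.
# Data are simulated for illustration only.
set.seed(1)
n <- 500
size_class <- sample(1:5, n, replace = TRUE)
has_inv <- rbinom(n, 1, plogis(-1 + 0.5 * size_class))
amount  <- ifelse(has_inv == 1, exp(8 + 0.6 * size_class + rnorm(n)), 0)
d <- data.frame(size_class, has_inv, amount)

part1 <- glm(has_inv ~ size_class, family = binomial, data = d)      # zero vs non-zero
part2 <- lm(log(amount) ~ size_class, data = subset(d, amount > 0))  # skewed positive part

# Expected investment combines the two parts
# (lognormal back-transformation with a simple variance correction)
pred <- predict(part1, type = "response") *
        exp(predict(part2, newdata = d) + 0.5 * summary(part2)$sigma^2)
head(pred)
```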

Using Online Job Vacancies to Create Labour Market Intelligence 181 Vladimir Kvetan, (Email) Cedefop - European Centre for the Development of Vocational Training, Thessaloniki, Greece Over the last decade, the use of the internet for posting job vacancies has increased significantly. Although initially the job vacancies posted online were predominantly for highly skilled ICT workers, widespread internet access and the increased ICT literacy of citizens mean that these platforms now contain job offers of different types and levels. Online job vacancies (OJV) have therefore become a rich source of information about skills and other requirements of employers which can hardly be gathered via traditional methods. This information has a strong potential to augment labour market actors' understanding of the actual dynamics of skills demand, allowing them to make better career and policy choices. Ultimately, the data collected from OJVs allow identifying the skills and job requirements typically requested across occupations as well as new and emerging jobs and skills. This information has the potential to fill an important gap in the EU evidence on the skills demand of employers. The information retrieved can complement conventional sources and provide policy-makers with more detailed and timely knowledge on labour market trends. It can also help employment services, guidance counsellors and learning providers to better target their services, as well as individuals who decide on their careers and skills development. This abstract presents the key elements of a pan-EU system for gathering and analysing the data contained in OJVs. Aiming to investigate the full potential of this new information source, while being aware of its limitations, it describes the key steps undertaken to develop the system, including the key questions which the data are able to answer.

Using the Business Process Model and Notation (BPMN) standard for the automatisation of survey fieldwork within an integrated data collection system for social surveys 110 Josef Kytir, (Email) Statistics Austria, Vienna, Austria Although administrative data and registers are used more and more, data provided directly by respondents remain important in social statistics. As in other areas of statistical production, the technical, organisational and social environment for survey statistics has changed a lot over the last decades. Quality standards increase, whereas a considerable decrease in participation rates has been observed in many countries. One strategy to deal with these challenges is to use more complex survey designs, including mixed-mode designs, sophisticated contact strategies and adaptive approaches. However, combining these strategies and approaches in a single survey results in rather complex business processes during the data collection phase of a survey. Running several such surveys simultaneously poses a considerable management burden on data collection units in statistical offices. During the last decades much time and effort has been spent on developing electronic tools for designing survey questionnaires. Despite their growing importance, much less attention has been paid to developing adequate solutions for case management systems. However, automated case management systems are becoming essential to run surveys in an efficient and cost-effective way. In addition to efficiency and cost effectiveness, National Statistical Institutes have to consider other quality dimensions too. Therefore, one ends up with the demand for a tool combining, as far as possible, the automatisation of the data collection process with the possibility of non-automated, individual case management. This non-automated case management should allow fieldwork management staff to overrule or complement the automated processes, in order to be able to react to individual preferences of respondents as well as of interviewers in the case of surveys including CAPI as a data collection mode. In 2013, Statistics Austria started developing a new integrated service infrastructure called STATsurv for running social surveys. Since the beginning of 2018, all social surveys, including the Labour Force Survey and EU-SILC, have been carried out successfully using the new service infrastructure. An essential element of STATsurv is the integration of automated and non-automated case management in a comprehensive survey management tool. The presentation will focus on this specific aspect.

Weighting adjustments using micro and macro auxiliary variables 056 Seppo Laaksonen, (Email) University of Helsinki, Helsinki, Finland Weighting is necessary for sample survey estimation. This paper presents the basic principles of simple and more advanced weights, together with some examples. All weights require auxiliary data. We present these here in two main categories, macro and micro. Our recommendation is to use both, which means that both types need to be obtained. Two obstacles are met: one is a general understanding of their importance; the second relates to confidentiality, if the data owner does not release such data. The latter might also happen due to the new EU regulation, the GDPR, if not handled correctly.
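As a minimal illustration of the use of macro auxiliary information, the sketch below post-stratifies design weights so that they reproduce known population counts by age group; the data, group labels and totals are simulated and illustrative only, not the paper's examples:

```r
# Post-stratification sketch: adjust design weights so that weighted counts
# match known population totals (macro auxiliary data) by age group.
# All numbers are made up for illustration.
set.seed(2)
resp <- data.frame(age_group = sample(c("15-34", "35-64", "65+"), 800, replace = TRUE),
                   d_weight  = 50)                       # equal design weights for simplicity

pop_totals <- c("15-34" = 18000, "35-64" = 20000, "65+" = 12000)  # known macro totals

# Adjustment factor per group: population total / weighted respondent total
wsum <- tapply(resp$d_weight, resp$age_group, sum)
resp$cal_weight <- resp$d_weight * pop_totals[resp$age_group] / wsum[resp$age_group]

tapply(resp$cal_weight, resp$age_group, sum)  # now equals pop_totals
```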

Estimation of monthly business turnover using administrative data in the UK 058 Paul Labonne, (Email) 1, 2, Martin Weale, (Email) 1, 2, 3 1 King's College London, London, United Kingdom 2 Economic Statistics Centre of Excellence, London, United Kingdom 3 Centre For Macroeconomics, London, United Kingdom This paper derives monthly estimates of turnover for small and medium size businesses in the UK from rolling quarterly VAT-based turnover data. We develop a state space approach for filtering, cleaning and temporally disaggregating the VAT figures, which are noisy and exhibit dynamic unobserved components. We notably derive multivariate and nonlinear methods to make use of indicator series and data in logarithms respectively. After illustrating our temporal disaggregation method and estimation strategy using an example industry, we estimate monthly seasonally adjusted figures for the seventy-five industries for which the data are available. We thus produce an aggregate series representing approximately 60% of gross value added in the economy. We compare our estimates with those derived from the Monthly Business Survey and find that the VAT-based estimates show a different time profile and are smoother.
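A stylised version of the type of state space formulation used for this kind of temporal disaggregation (a generic local-level setup with an aggregation constraint, shown only to fix ideas; it is not the authors' exact model):

```latex
% Latent monthly turnover y_t with a simple unobserved-components structure:
\[
  y_t = \mu_t + \varepsilon_t, \qquad \mu_t = \mu_{t-1} + \eta_t,
\]
% while each observed rolling quarterly VAT figure is the sum of three months:
\[
  Y_q = y_{3q-2} + y_{3q-1} + y_{3q} .
\]
% The Kalman filter and smoother then deliver monthly estimates consistent with
% the quarterly VAT totals; indicator series (and data in logarithms) enter through
% additional measurement equations, which is where the multivariate and nonlinear
% extensions mentioned in the abstract come in.
```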

Use of Sentinel 1 and 2 data to assess crop area in Poland 062 Artur Łączyński, (Email) Statistics Poland, Warsaw, Poland In recent years, the Copernicus programme has enabled free access to vast inventories of high resolution data. The Sentinel 1 and 2 satellites (S1 & S2) provide optical and radar data every 4-5 days at 10-20 m spatial resolution across the whole globe. The availability of such a massive data source strengthens efforts to reduce respondent burden and to obtain statistics faster and at a lower aggregation level. Since the beginning of the Sentinel mission, Statistics Poland has carried out thorough work on crop area estimation using satellite data. Images in time series were processed and classified according to crop areas. The pilot survey included several major crops such as cereals, rape and potato. The crop recognition method was based on machine learning (the support vector machine, SVM). The satellite data classification was assisted by in situ data collection and administrative data (the Land Parcel Identification System). The whole process of crop area estimation included several steps: satellite data acquisition, preprocessing (segmentation, delimitation of objects with uniform areas), classification and validation, and generalization to territorial units. The in situ information was used as a training sample in the classification and validation stage for the SVM, while the administrative data served for the preprocessing segmentation. The S1 & S2 data supported each other, with S2 playing an ancillary role in the segmentation. The obtained accuracy of crop recognition was high, reaching 96%. The pilot survey was carried out by the Agriculture Department of Statistics Poland in cooperation with the Space Research Centre (CBK PAN) in the framework of annual projects funded by Statistics Poland.
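A minimal sketch of the classification step with a support vector machine, using the e1071 package on made-up features; the band and backscatter variables and the crop labels are simulated stand-ins, whereas the real pipeline uses Sentinel 1/2 time series per segment and in situ training data:

```r
# SVM crop classification sketch with simulated "pixel"/segment features.
# In the real application the features come from Sentinel 1/2 time series and
# the training labels from in situ data collection.
library(e1071)

set.seed(3)
n <- 300
train <- data.frame(
  band_red = rnorm(n), band_nir = rnorm(n), vv_backscatter = rnorm(n),
  crop = factor(sample(c("cereals", "rape", "potato"), n, replace = TRUE))
)

fit <- svm(crop ~ ., data = train, kernel = "radial")

# Confusion matrix on the training data (in practice a held-out validation set is used)
table(predicted = predict(fit, train), observed = train$crop)
```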

(In)Stability of Reg-ARIMA Models for Seasonal Adjustment 091 Dominique Ladiray, (Email), Alain Quartier-la-Tente, (Email) INSEE - SACE, Montrouge, France Regression models with ARIMA errors (Reg-ARIMA) are nowadays commonly used in seasonal adjustment to remove the main deterministic effects (outliers, ruptures, calendar effects) from the raw data before decomposing the corrected series into trend-cycle, seasonality and irregular components. The main seasonal adjustment programs (X-13Arima-Seats, Tramo-Seats, JDemetra+) implement these models in an automatic and very user-friendly way. This convenience in fact hides real complexity and, in certain cases, a lack of robustness which can escape the user. In the presentation, we draw attention to the real difficulties of implementing these models through concrete cases: the estimation of a leap year effect, the estimation of breaks and even the estimation of an ARIMA model.
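For reference, the Reg-ARIMA model estimated by these programs can be written in its usual form (standard notation):

```latex
% Regression with ARIMA errors: y_t is the (possibly log-transformed) raw series,
% x_t holds the deterministic regressors (outliers, breaks, calendar and leap-year
% effects), and the error z_t follows a seasonal ARIMA model in the backshift
% operator B with seasonal period s.
\[
  y_t = x_t^{\top}\beta + z_t, \qquad
  \phi(B)\,\Phi(B^{s})\,(1-B)^{d}(1-B^{s})^{D} z_t = \theta(B)\,\Theta(B^{s})\,\varepsilon_t .
\]
```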

ScattR – A Shiny App for Exploratory Data Analysis 010 Philipp Leppert, (Email) Federal Statistical Office Germany (DESTATIS), Wiesbaden, Germany Any statistical data analysis aiming to answer scientific questions, derive solutions to given problems, or make predictions about unknown outcomes should be preceded by an intensive exploration of the (initially unknown) data structure. However, there is always a trade-off between the time available to examine the data and the time needed to develop sophisticated modeling approaches for an in-depth analysis. This applies particularly when datasets are large, carry many features of interest and have a hierarchical structure. Using the software R and the package shiny, we created ScattR, which offers a user-friendly interface for interactive data exploration with scatterplots, reducing the time usually consumed by coding in statistical software.
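A minimal sketch of the kind of shiny application described, built on a standard example dataset; this is a generic scatterplot explorer for illustration only, not the ScattR code itself:

```r
# Minimal shiny scatterplot explorer, illustrative only (not the ScattR code).
library(shiny)

vars <- names(mtcars)

ui <- fluidPage(
  titlePanel("Scatterplot explorer (sketch)"),
  sidebarLayout(
    sidebarPanel(
      selectInput("x", "X variable", choices = vars, selected = "wt"),
      selectInput("y", "Y variable", choices = vars, selected = "mpg")
    ),
    mainPanel(plotOutput("scatter"))
  )
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(mtcars[[input$x]], mtcars[[input$y]],
         xlab = input$x, ylab = input$y, pch = 19)
  })
}

shinyApp(ui, server)
```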

An improved multiobjective approach of temporal disaggregation, a special case of multivariate Denton 024 Gábor Lovics, (Email) Hungarian Central Statistical Office, Budapest, Hungary Reconciliation is a problem where more than one time series needs to be temporally disaggregated, while temporal and cross-sectional constraints should hold. One of the well-known solutions of the problem is the Multivariate Denton Method, which disaggregates the low frequency time series while using high frequency auxiliary indicators for each of the time series. This method is a convex quadratic optimization problem, where the objective function is equal to the sum of functions, one belonging to each time series. The problem can be formulated as a multiobjective optimization problem, in which all these functions are minimized simultaneously instead of minimizing their sum. The aim of this paper is to describe how some results from multiobjective optimization can be used to develop the Multivariate Denton Method further. The optimal solution of a multiobjective optimization problem is called a Pareto-optimal solution, in which none of the objective functions can be decreased without increasing some of the others. One way to find a Pareto-optimal solution is to start from a feasible solution and decrease all the objective functions step by step until a Pareto-optimal solution is reached. In this paper a new method to solve the reconciliation problem, using the results mentioned above, is described. This method is similar to the Multivariate Denton Method in that it also finds a Pareto-optimal solution of the same multiobjective optimization problem, but typically the two methods find different solutions.
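For orientation, the per-series criterion of the (proportional) Denton method and the constraints involved can be written in standard notation as follows; the multivariate version adds cross-sectional constraints linking the series and classically minimizes the sum of these criteria:

```latex
% Proportional Denton criterion for one series: x_t is the unknown high-frequency
% series, z_t its high-frequency indicator, and the temporal constraints force
% the x_t to add up to the observed low-frequency totals X_q.
\[
  \min_{x}\; \sum_{t=2}^{T}\left(\frac{x_t}{z_t}-\frac{x_{t-1}}{z_{t-1}}\right)^{2}
  \qquad \text{s.t.} \qquad \sum_{t \in q} x_t = X_q \ \ \text{for all } q .
\]
```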

Refugees in undeclared employment - A case study in Turkey 031 Melina Ludolph, (Email) 2, 3, Till Koebe, (Email) 1, 3, Fabian Bruckschen, (Email) 1, 3, Timo Schmid, (Email) 1, Maria Francesca Marino, (Email) 4 1 Freie Universitaet Berlin, Berlin, Germany 2 Humboldt Universität Berlin, Berlin, Germany 3 Knuper, Berlin, Germany 4 Università degli Studi di Firenze, Firenze, Italy Exploitation of vulnerable groups such as refugees for cheap labour in agriculture and in the construction sector is a notorious phenomenon in Turkey. Up to 2017, only 1.3% of the around 3 mn Syrian refugees registered in Turkey have been granted a work permit, leaving the overwhelming majority dependent on undeclared employment with all its negative implications: high-risk jobs, pay below minimum wage, lack of access to social security. Mobile phone metadata allow for a detailed view on commuting routines and migration, possibly unearthing employment situations which are not captured otherwise. This study proposes a methodological framework for identifying potentially undeclared employment among refugees in Turkey within the current situation. To do so, it includes an early proof-of-concept based on a Difference-in-Differences approach by analyzing seasonal migration and commuting patterns in two specific cases: during the late-summer hazelnut harvest in the province of Ordu and at the construction site of the Istanbul Grand Airport. The study finds clear indication for work-related migration and commuting patterns among refugees hinting at undeclared employment. The proposed framework therefore provides an analytical instrument to make targeted interventions such as controls more effective by detecting small areas where undeclared work likely takes place.
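A stylised illustration of the Difference-in-Differences idea applied to mobility indicators; the data are simulated and the variable names (mobility, harvest_province, harvest_period) are hypothetical, not those used in the study:

```r
# Difference-in-Differences sketch on simulated mobility data.
# "Treatment" group: devices observed in the harvest province;
# "post" period: the late-summer harvest weeks. All data are simulated.
set.seed(4)
n <- 2000
harvest_province <- rbinom(n, 1, 0.5)
harvest_period   <- rbinom(n, 1, 0.5)
mobility <- 10 + 2 * harvest_province + 1 * harvest_period +
            3 * harvest_province * harvest_period + rnorm(n)   # true DiD effect = 3

fit <- lm(mobility ~ harvest_province * harvest_period)
coef(summary(fit))["harvest_province:harvest_period", ]  # the DiD estimate
```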

Linking Open Data in the European Statistical System 166 Eoin MacCuirc, (Email) Central Statistics Office, Cork City, Ireland The aim of this paper is to outline the European Statistical System (ESS) Linked Open Statistical Data (LOSD) project vision, to look at progress in the project to date, to share key insights learned and to explore the publication pipeline and toolkit that has been developed. The LOSD project is being delivered in the context of the European Statistical System's (ESS) DIGICOM project. DIGICOM contributes to two key areas of the ESS vision: (i) focus on users, and (ii) improve dissemination and communication. One of the four DIGICOM work packages is Open Data Dissemination. The ultimate objective of the Open Data Dissemination work package is to facilitate automated access to European aggregate data for heavy users or re-disseminators and to improve access to microdata. The ESSnet on Linked Open Statistical Data was launched in November 2017. In September 2018 the ESSnet LOSD publication pipeline was unveiled at the Paris Hackathon. The paper will outline the steps of the LOSD publishing pipeline built by the ESSnet consortium: data conversion, data publishing, data linking, and data visualisation and analysis. The toolkit will be explored in the context of the use cases selected at the Dublin Hackathon in February 2018. A number of data and metadata challenges which have been encountered will be highlighted. The possibilities of what LOSD can achieve will also be considered.

Regional analysis of business surveys: methods and applications in the context of Small Area Statistics 199 Julia Manecke, (Email) Trier University, Trier, Germany In recent years, the demand for results of business surveys broken down by region and content has increased significantly. Often, however, the sampling design of a survey is only designed for a reliable design-based estimation at the state or federal level due to maximum permissible sample sizes. If additional estimated values are to be determined for regional or content-related subpopulations, insufficient sample sizes might lead to unacceptably high variances of the estimates. A subpopulation in which the sample size is not large enough for a direct design-based estimation of sufficient precision is called a small area. So-called small area estimation methods can be used to calculate precise estimates for the respective subpopulations. These mostly model-based approaches rest on the supportive inclusion of additional auxiliary information from other subpopulations using a statistical model. However, the lack of timeliness of the business registers usually used as a sampling frame for the design of business surveys, or the time lag between the sampling design and the data collection itself, leads to inconsistencies also referred to as frame errors. As a result, the variables contained in the business register might be obsolete. In addition, the register and the target population differ in terms of their composition and the number of businesses. Nevertheless, the register is a potential source of auxiliary variables for small area estimation methods. This, however, may create problems, as erroneous or obsolete auxiliary variables may cause small area estimation methods to perform even worse than classic design-based estimators. Furthermore, the sampling design of business statistics usually includes a stratification by industry groups and size classes. Due to the lack of topicality of the frame population and the strong dynamics of business populations, however, industry-specific and size-specific stratum jumpers may result. These are businesses that would have been assigned to a different design stratum if the correct design information had been available at the design stage. Accordingly, the assumptions under which the original design weights were determined within the sampling design are no longer applicable. Building on the challenges of business surveys elaborated above, the potential to improve the estimations for small areas in business statistics is analysed. In this context, the inconsistencies between the available frame population and the target population, referred to as frame errors, are considered in particular. On the one hand, various small area estimation methods are implemented and compared with one another with regard to their ability to improve the estimation quality despite outdated auxiliary information. On the other hand, as the assumptions under which the original design weights were determined within the sampling design no longer apply, various reweighting approaches are developed and evaluated. The comparison is made taking into account different frame error scenarios. Therefore, we examine the extent to which small area estimation approaches using outdated auxiliary information and various reweighting approaches can achieve an improvement compared to the classic design-based Horvitz-Thompson estimator in various realistic scenarios of stratum jumpers.
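The design-based benchmark mentioned at the end is the classic Horvitz-Thompson estimator of a total, which in standard notation reads:

```latex
% Horvitz-Thompson estimator of a population total: s is the sample, y_i the
% study variable and \pi_i the inclusion probability of unit i, so the design
% weights 1/\pi_i are exactly the weights affected by stratum jumpers.
\[
  \widehat{Y}_{\mathrm{HT}} \;=\; \sum_{i \in s} \frac{y_i}{\pi_i} .
\]
```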

Big Data on Vessel Traffic: An Innovative Approach to Nowcast Trade Flows in Real-Time 104 Marco Marini, (Email), Serkan Arslanalp, (Email), Patrizia Tumbarello, (Email) International Monetary Fund, Washington, United States In this paper, we show that vessel traffic data based on the Automatic Identification System (AIS) can be used to nowcast trade activity. We develop indicators on maritime traffic and trade based on port calls. We use Malta as a benchmark case. Data on port calls track arrivals and departures of vessels based on their positions at ports. We find that an AIS-based indicator tracks Malta's official volume index of imports well. Provided that the challenges associated with port calls data can be overcome through appropriate filtering techniques, these emerging “big data” on vessel positions could allow statistical agencies to supplement existing data sources on trade and introduce new statistics that are more granular (port-by-port) and more timely (practically real-time), offering an innovative way to nowcast trade activity in volume.

A Diagnostic for Seasonality Based Upon Autoregressive Roots 032 Tucker McElroy, (Email) US Census Bureau, Washington, DC, United States The problem of identifying seasonality in published time series is of enduring importance. Many official time series -- such as gross domestic product (GDP) and unemployment rate data -- have an enormous impact on public policy, and are heavily scrutinized by economists and journalists. Obscuring the debate is the lack of universally agreed-upon criteria for detecting seasonality, as well as the different behavior of seasonal patterns in raw versus seasonally adjusted data. We propose the following verbal definition of seasonality: persistency in a time series over seasonal periods that is not explainable by intervening time periods. For a monthly series with a seasonal period equal to twelve, seasonality is indicated by persistency from year to year that is not explained by month-to-month changes. Note that both parts of this definition are crucial: without seasonal persistency from year to year, no seasonal pattern will be apparent, so this facet is clearly necessary; however, any trending time series also has persistency from year to year, which comes through the intervening months -- we need to screen out such cases. If a time series is covariance stationary, it is natural to parse persistency in terms of autocorrelation. The paper shows that we can adapt persistency to non-integer lags of the autocorrelation function via its decomposition in terms of autoregressive (AR) roots, and examine seasonality of arbitrary frequency through the modulus and phase of the root. Whereas under-adjustment would be indicated by the presence of AR roots of near-unit magnitude and seasonal phase, over-adjustment corresponds to a negative form of persistency (i.e., negative seasonal autocorrelations) termed anti-persistency, and can be measured through moving average (MA) roots computed from the inverse autocorrelations, i.e., the autocorrelations of the reciprocal of the spectral density.
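To make the root-based measure concrete, a complex conjugate pair of (inverse) autoregressive roots with modulus rho and phase omega contributes the following factor to the AR polynomial (standard representation, not a formula quoted from the paper):

```latex
% Factor of the AR polynomial generated by the inverse-root pair \rho e^{\pm i\omega}:
\[
  \bigl(1 - \rho e^{i\omega}B\bigr)\bigl(1 - \rho e^{-i\omega}B\bigr)
  \;=\; 1 - 2\rho\cos(\omega)\,B + \rho^{2}B^{2}.
\]
% For monthly data, seasonal persistency corresponds to a modulus \rho close to one
% at a seasonal frequency \omega = 2\pi k/12, k = 1, ..., 6, whereas trend
% persistency corresponds to \omega = 0; anti-persistency (over-adjustment) is
% assessed analogously through moving average roots.
```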

ARMA models with time dependent coefficients 099 Guy Mélard, (Email) Université libre de Bruxelles, ECARES CP114/04, Brussels, Belgium Two decades ago, effective methods for dealing with time series models that vary with time appeared in the statistical literature. Except in a case of marginal heteroscedasticity [1], they have never been used for official statistics. In this paper, we consider autoregressive integrated moving average (ARIMA) models with time-dependent coefficients applied to very long U.S. industrial production series. There was an earlier attempt to handle time-dependent integrated autoregressive (AR) models [2], but the case study was small. Here, we investigate the case of ARIMA models on the basis of [3, 4, 5]. As an illustration, we consider a big dataset of U.S. industrial production time series already used in [6]. We employ the software package Tramo in [7] to obtain linearized series and we build both ARIMA models with constant coefficients (cARIMA) and ARIMA models with time-dependent coefficients (tdARIMA). In these tdARIMA models we use the simplest specification: a regression with respect to time. Surprisingly, for a large part of the series, there are statistically significant slopes, indicating that the tdARIMA models fit the series better than the cARIMA models.
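The "simplest specification" mentioned above, a regression of each coefficient on time, can be illustrated for an AR(1) coefficient as follows (illustrative notation; the exact rescaling of time used by the authors is an assumption here):

```latex
% AR(1) part with a time-dependent coefficient: the coefficient is a linear
% function of rescaled time t* (e.g. t* running over a fixed interval), so that
% a zero slope phi'_1 recovers the constant-coefficient (cARIMA) model.
\[
  y_t = \phi_{1,t}\, y_{t-1} + \varepsilon_t,
  \qquad
  \phi_{1,t} = \phi_1 + \phi_1'\, t^{*}_t ,
\]
% and testing H_0: \phi_1' = 0 indicates whether the tdARIMA specification
% improves on its cARIMA counterpart.
```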

Aggregating flags – a standardised and rational approach 090 Matyas Meszaros, (Email) Eurostat, Luxembourg, Luxembourg A flag is an attribute of a cell in a data set that provides additional qualitative information about the statistical value of that cell. Flags can indicate a wide range of information, for example that a given value is estimated, confidential or represents a break in the time series. Currently different sets of flags are in use in the European Statistical System (ESS). Some statistical domains use the SDMX code lists for observation status and confidentiality status, the OECD uses a simplified version of the SDMX code lists, and Eurostat uses a short list of flags for dissemination which combines the observation and confidentiality status. While in most cases it is well defined how a flag shall be assigned to an individual value, it is not straightforward to decide which flag shall be propagated to aggregated values like a sum, an average, quantiles, etc. This topic is important for Eurostat, as the European aggregates are derived from national data points, so the information contained in the individual flags needs to be summarised in a flag for the aggregate. This issue is not unique to Eurostat but can occur for any aggregated data. For example, a national statistical institute may derive the national aggregate from regional data sets. In addition, the dissemination process adds a further peculiarity: only a limited set of flags, compared to the set of flags used in the production process, can be applied, in order to keep them easily understandable to users. In the scientific community there is a wide range of research about the consequences of data aggregation, but it concentrates only on the information loss during aggregation, and there is no scientific guidance on how to aggregate flags. This paper is an attempt to provide a picture of the current situation and to give some systematic guidance on how to aggregate flags in a coherent way. Eurostat is testing various approaches with a view to balancing well the transparency and clarity of the information made available to users in a flag. From several options, three methods (hierarchical, frequency and weighted frequency) are implemented in an R package for assigning a flag to an aggregate based on the underlying flags and values. Since the topic has relevance outside of Eurostat as well, it was decided to publish the respective code with documentation with a view to fostering re-use within the European Statistical System and to stimulating discussion, including with the user community.
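A minimal sketch of two of the approaches mentioned (hierarchical and weighted frequency), written as plain R functions for illustration; this is not the code of the R package referred to in the abstract, and the flag precedence order is an assumption:

```r
# Illustrative flag aggregation, not the published package implementation.
# flags: character vector of flags of the component series (e.g. "e" estimated,
#        "p" provisional, "c" confidential); weights: e.g. the component values.

# Hierarchical method: return the most "important" flag present,
# according to a predefined precedence order (assumed here for illustration).
aggregate_hierarchical <- function(flags, precedence = c("c", "e", "p")) {
  present <- precedence[precedence %in% flags]
  if (length(present) == 0) NA_character_ else present[1]
}

# Weighted frequency method: return the flag carrying the largest total weight.
aggregate_weighted <- function(flags, weights) {
  totals <- tapply(weights, flags, sum)
  names(totals)[which.max(totals)]
}

aggregate_hierarchical(c("p", "e", ""))           # "e"
aggregate_weighted(c("p", "e", "e"), c(5, 2, 1))  # "p"
```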

Publishing georeferenced statistical data using linked open data technologies 034 Mirosław Migacz, (Email) Statistics Poland, Warsaw, Poland In January 2018 Statistics Poland concluded the “Development of guidelines for publishing statistical data as linked open data” project. The aim of the project was to perform a thorough inventory of data sources and investigate technologies which could be used to publish georeferenced statistics as linked open data. Data samples from statistical databases and geospatial datasets were selected for transformation to linked open data RDF triples, and a dataset catalogue was set up and encoded in RDF. A pilot triple store was established with a SPARQL endpoint – a query interface. Aside from the pilot’s results being machine readable, all data created in the pilot were also internally published as human-readable webpages using linked open data frontend software. The pilot linked open data implementation was a valuable exercise which provided a lot of answers but at the same time raised a lot of new questions: Is there a reference implementation for statistical data? Which vocabularies to use? What should we link to? How to encode geospatial data to make them most usable? Most implementations are technically correct but are they of good quality?

Modelling enterprises responses to the ICT (Information Communication Technology) survey 198 Noémie Morénillas, (Email) Insee, Montrouge, France Ensai, Bruz, Rennes, France In 2016, among the 28 countries of the European Union, more than three quarters (77%) of enterprises with 10 or more persons employed answered that they are concerned with online visibility and had a website or a home page. This result comes from the community survey on information and communication technologies (ICT) usage and e-commerce in enterprises. It is a mandatory survey conducted annually since 2002 on enterprises with 10 or more persons employed. The aim is to collect data on ICT in enterprises to supply indicators about Internet activities, connection used, e-commerce or ICT skills. In France, this survey is conducted by the National Institute of Statistics and Economic Studies (Insee). Reducing respondent burden and using administrative data or innovative techniques are becoming increasingly important issues: “non-excessive burden on respondents” (principle 9, European Code of Practice), “harness new data sources” (European Statistical System’s Vision 2020), “yet official statistics are an area that is constantly innovating, in order to reduce the response burden on individuals and enterprises who are surveyed, to make full use of administrative data” (Insee Horizon 2025). This research proposes modelling to reduce respondent burden using auxiliary information such as administrative data, which is one of the objectives of National Statistical Institutes. The aim is to model enterprises’ responses to items asked each year and to remove them from the French ICT survey while still providing the national aggregates required by Eurostat. To do this, suitable, available and high-quality auxiliary information is needed. Six dummy variables to be modelled were selected from two modules: ICT specialists and skills, and access and use of the internet. To model the questions about ICT specialists, the auxiliary information from the annual declaration of social data (DADS) is very useful. In this document, employers supply information such as the description of the job, using the French classification “PCS-ESE”, for each employee. In 2016, for the question about the employment of ICT specialists, the variable constructed from the DADS matches the survey responses for 80% of enterprises. If we improve the ICT specialists classification, it may be possible to remove this question. To model website characteristics, the answer to the question “during the last year, did your enterprise receive orders for goods or services placed via a website or “apps”?” is used. Also, when available, using the answers given by an enterprise in the previous edition of the survey improves the quality of the estimated models. But for this module, the lack of auxiliary information and the instability of answers between survey editions make the estimations difficult. This research may be a starting point to improve the jobs classification or to use innovative techniques like web scraping to obtain website characteristics (cost and efficiency remain to be evaluated). Moreover, it shows why it is difficult to model survey responses: do we want to model reality or enterprises’ responses? Finally, modelling responses to a community survey opens up a range of possibilities: it would be interesting to try to reproduce this research in other countries, using their own data sources.
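A minimal sketch of modelling one of the dummy variables from auxiliary information with a logistic regression; the data are simulated and the variable names has_ict_specialist and dads_ict_share are hypothetical stand-ins for the survey item and a DADS-based covariate:

```r
# Logistic model sketch for one ICT survey dummy, using an auxiliary covariate
# in the spirit of the DADS-based indicator. Data are simulated for illustration.
set.seed(5)
n <- 1000
dads_ict_share <- runif(n)                      # share of ICT jobs from admin data (hypothetical)
size           <- sample(c("small", "medium", "large"), n, replace = TRUE)
has_ict_specialist <- rbinom(n, 1, plogis(-2 + 3 * dads_ict_share + (size == "large")))

fit <- glm(has_ict_specialist ~ dads_ict_share + size, family = binomial)

# Agreement between the model-based classification and the (simulated) survey answer
pred <- as.integer(fitted(fit) > 0.5)
mean(pred == has_ict_specialist)
```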

The definition of the final disposition codes in the household surveys in the context of the new integrated management system in Istat 106 Manuela Morricone, (Email), Novella Cecconi, (Email), Rita Ranaldi, (Email) ISTAT (National Statistics Institute of Italy), Rome, Italy The new Istat Directorate for Data Collection, from its constitution in April 2016 to now, has been focusing on some strategic projects aimed at harmonizing the data collection operations for all the surveys conducted by the Institute and making them more efficient. Among these projects is the integrated management system, already being implemented for the 2018 Permanent Population and Housing Census. The IT development of the system requires a reflection, in particular from a theoretical point of view, in order to identify and implement all the functionalities necessary for the management of each survey within a single conceptual framework. A fundamental aspect of the design of this integrated management system is certainly the definition of the status and of the final disposition codes for the survey units. Accompanied by an appropriate set of assignment rules, they are the basis of some important functions connected to the implementation, monitoring and control of the survey. To overcome the difficulties related to the peculiarities of the single surveys, characterized by different sampling designs and survey modes, we made an in-depth comparison between the outcome systems currently used, which led us to: • reduce redundancies; • respect the peculiarities without loss of information; • harmonize the final disposition codes of the survey units at a higher level of synthesis. The status and the outcomes of the survey units thus make it possible to monitor the phases of their initial assignment, of the implementation of the field work and of the final validation. Furthermore, the final outcomes (after validation) allow the calculation of indicators on survey quality (mainly coverage and non-response rates), in compliance with national (SIDI) and international (AAPOR) standards.

Open data sources for retrieving information on multinational enterprise groups 021 Dimitar Nenkov, (Email) 1, Sebastian Hellmann, (Email) 2, Johannes Frey, (Email) 2 1 European Commission, Luxembourg, Luxembourg 2 Leipzig University, Leipzig, Germany Eurostat governs the EuroGroups Register (EGR), the statistical business register of multinational enterprise groups (MNEs) in the European Union and EFTA countries. Sources of information such as crowdsourcing platforms, web crawling and different open data projects are seen as further opportunities to increase the quality of the EGR, namely its completeness and accuracy for the units outside of the EU and EFTA as well as at the whole-group level. Under the umbrella of the Eurostat BIG DATA project, the EGR Team is investigating these additional data sources. Eurostat is collaborating with Leipzig University to explore the possibility of using DBpedia as a new additional source of data on multinational enterprise groups. The EGR Team and Leipzig University carried out a proof of concept (PoC) to automate, to a large extent, the collection of aggregated group figures using the names of the enterprise groups as input.

Estimation of measurement errors in social surveys 200 Nhu Tho Nguyen, (Email) KU Leuven, Leuven, Belgium Social survey data are often collected through human interaction, which is prone to error. The data are collected under the assumption that the characteristics and concepts being measured can be precisely defined, can be obtained through a set of well-defined procedures, and have true scores independent of the survey. Measurement error is then the difference between the value of a characteristic provided by the respondent and the true but unobserved value of that characteristic (Groves, 1989; Kasprzyk, 2005). To capture the true score of a characteristic, a survey question must be “valid” and “reliable”. The validity of a question is an evaluation of whether it measures the construct or variable that it is intended to measure (Carmines & Zeller, 1979). Reliability reflects the amount of error inherent in any measurement and hence the extent to which a replication of the administration would give a different result (Streiner & Norman, 1995). If a survey variable is unreliable, the statistical inferences obtained using that variable are in turn also untrustworthy. Before we can use a survey variable, we must first be able to quantify how reliable it is.
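In the classical test theory framework behind these notions, the observed score splits into a true score and a measurement error, and reliability is the share of true-score variance (standard definitions):

```latex
% Classical test theory: observed score X, true score T, measurement error e.
\[
  X = T + e, \qquad \operatorname{Cov}(T, e) = 0 ,
\]
% Reliability is the proportion of the observed variance due to the true score:
\[
  \rho_{XX'} \;=\; \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}
  \;=\; \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(e)} .
\]
```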

Exploring the use of scanner data in the Norwegian CPI for products with high churn 037 Ragnhild Nygaard, (Email) Statistics Norway, Oslo, Norway Statistics Norway has a clear strategy of making increasing use of new data sources in official statistics. The national statistical institute (NSI) has a long history of using scanner data in price statistics and in the Consumer Price Index (CPI) in particular. By scanner data we mean aggregated transaction data that provide information on turnover and quantity sold by article code or barcode. By increasing the use of scanner data in the price index, the use of more traditional data sources, like web questionnaires filled out manually by retailers, can be reduced correspondingly, thereby lowering the response burden and increasing index quality. The calculation method mostly applied to scanner data in the Norwegian CPI is a matched-model approach at article code level, aggregated by a monthly chained unweighted geometric mean index (Jevons index), referred to as the “dynamic method” in Eurostat’s practical guide on scanner data. The method works well for relatively stable article codes, as for most supermarket data, but is not appropriate for products with high churn, i.e. more frequent changes in article codes, such as clothing and consumer electronics. The aim of the present scanner data development work in Statistics Norway is to implement a more generic method that is able to handle frequent changes in article codes. Secondly, a new calculation method should also preferably use both prices and quantities without causing index bias. New calculation methods, such as the multilateral index methods presented internationally during the last couple of years, do not by themselves solve the problem of product replacements. This issue must therefore be addressed separately. One crucial step is product definition. Defining the product at article code level may in many cases be too detailed, especially for products with high churn, as the match between new and old article codes is lacking. A more appropriate approach could be to apply a broader definition of a product, for instance combining different article codes with similar attributes between which consumers are indifferent. A practical solution is to create these homogeneous products (HPs) by clustering together homogeneous article codes and calculating a unit value. By calculating a unit value across homogeneous article codes we allow for comparisons between new codes entering and old ones disappearing from the market. This paper presents work done in an ongoing grant agreement project and discusses challenges related to HP definition and formation as well as effects on different multilateral index formulas.
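For reference, the monthly chained Jevons index and the unit value over a homogeneous product can be written as (standard formulas):

```latex
% Monthly chained Jevons index over the set S_t of article codes matched between
% months t-1 and t, and the resulting chained index from the base month 0 to t:
\[
  P^{J}_{t-1,t} \;=\; \prod_{i \in S_t}\left(\frac{p_{i,t}}{p_{i,t-1}}\right)^{1/|S_t|},
  \qquad
  P_{0,t} \;=\; \prod_{m=1}^{t} P^{J}_{m-1,m} .
\]
% Unit value of a homogeneous product h formed by clustering article codes i in h:
\[
  p^{UV}_{h,t} \;=\; \frac{\sum_{i \in h} p_{i,t}\, q_{i,t}}{\sum_{i \in h} q_{i,t}} .
\]
```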

Joint distribution of income, consumption and wealth 033 Friderike Oehler, (Email) Eurostat, Luxembourg, Luxembourg Statistical matching relies on strong assumptions, so products based on this method can only qualify as experimental. However, it is currently the only available way to produce joint micro datasets of income, consumption and wealth for households in all EU countries. The analysis of a joint distribution of income, consumption and wealth data complements classic poverty indicators that rely on only one of the three dimensions of economic well-being. Indicators based on the joint micro dataset also enhance the knowledge on economic trends gained through macroeconomic analysis. Most importantly, they shed light on the household perspective at different levels of the distribution.
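
The sketch below illustrates the general idea of statistical matching with a simple distance hot deck: consumption observed in a donor dataset is attached to income records in a recipient dataset using covariates common to both. The data and matching variables are invented, and this is not Eurostat's actual procedure.

```r
# Distance hot-deck statistical matching (toy example).
set.seed(1)
recipient <- data.frame(id = 1:5, income = c(18, 25, 31, 44, 60) * 1000,
                        age = c(25, 38, 45, 52, 67), hh_size = c(1, 2, 3, 4, 2))
donor     <- data.frame(age = c(24, 40, 47, 55, 65, 30),
                        hh_size = c(1, 2, 3, 4, 2, 1),
                        consumption = c(15, 22, 27, 35, 40, 19) * 1000)

# Standardise the common variables (using donor moments) and, for each
# recipient, take the consumption of the nearest donor.
z  <- function(x, ref) (x - mean(ref)) / sd(ref)
ra <- z(recipient$age, donor$age);         da <- z(donor$age, donor$age)
rh <- z(recipient$hh_size, donor$hh_size); dh <- z(donor$hh_size, donor$hh_size)
d  <- outer(ra, da, "-")^2 + outer(rh, dh, "-")^2      # recipient x donor distances
recipient$consumption <- donor$consumption[apply(d, 1, which.min)]
recipient   # joint (income, consumption) records under the matching assumptions
```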

Comparing sectoral productivity of European countries with purchasing power parities 165 Laurent Olislager, (Email), Marjanca Gasic, (Email), Paul Konijn, (Email) European Commission - Eurostat, Luxembourg, Luxembourg Purchasing power parities (PPPs) are indicators of price level differences across countries. As listed in [1], they have a wide range of applications. Importantly, they allow international comparisons of the size of economies, which would be biased without adjusting for price level differences. The Eurostat-OECD PPP Programme makes use of price surveys and a well-established methodology, presented in [2], to estimate PPPs from the expenditure side of the gross domestic product (GDP). This approach presents a number of advantages but does not identify individual sectors of the economy. Therefore, productivity comparisons can be made only at the level of the whole economy. In this work, we use an alternative calculation of PPPs, described in [3]. We estimate PPPs from the production side of GDP, which allows identifying the output of specific sectors of the economy. From them, we derive experimental PPP-adjusted productivity measures at the industry level. This is a key indicator for competitiveness analysis, in high demand from users who lack official datasets for designing policies towards increasing productivity and growth. We build on the sources and methods of [3], where output PPPs were estimated for the year 2014 for 31 European countries, and extend the results and refine the analyses thereof. First, we obtain a time series of output PPPs for the period 2008-2017. Second, we assess its quality in terms of coverage, reliability and plausibility with numerical criteria and comparisons to other data sources. Third, we investigate the possibilities to extend the coverage of activities to include those that are mainly intermediate consumption, particularly for services. In principle, this allows calculating value-added PPPs instead of output PPPs, strengthening their use for productivity comparisons.
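
A stylised numerical illustration of the PPP-adjusted productivity measure described above (all figures invented): output at national prices is converted with an output PPP and divided by hours worked, giving levels that can be compared across countries.

```r
# PPP-adjusted productivity at industry level for two fictitious countries.
ind <- data.frame(
  country    = c("AA", "BB"),
  output_nat = c(500e9, 120e9),   # gross output in national currency
  output_ppp = c(1.00, 7.50),     # output PPP vs. the reference currency
  hours      = c(9.0e9, 2.5e9)    # hours worked in the industry
)
ind$productivity <- (ind$output_nat / ind$output_ppp) / ind$hours
ind[, c("country", "productivity")]   # comparable real output per hour worked
```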

The Innovation Lab 115 Enrico Orsini, (Email), Gerarda Grippo, (Email), Mario Magarò, (Email) ISTAT - Italian National Institute of Statistics, Rome, Italy The evolution of the demand for statistical information presents the National Statistical Institute (NSI) with an important challenge: to improve its ability to innovate in processes and products in order to respond more effectively to new and growing needs. The Laboratory for Innovation is one of the infrastructures adopted to respond to this new challenge, with the aim of facilitating the development of innovation and reinforcing the role of research as a founding value and a tool for the strategic growth of the Institute and its staff. The Innovation Lab is an “environment” to strengthen the research role and to develop innovative ideas. It is a place to test new solutions, new processes and new products. It enables the Institute to invest in innovative projects to improve statistical information, processes and products, and to strengthen internal and external relations through partnerships with universities, institutions and public and private research institutions. Thanks to the work done in collaboration with various departments of our Institute, the Innovation Lab was inaugurated in March 2018 and activities have started on the projects selected in the first call of the innovation laboratory. The first call, completed in September 2017, was a success, with 27 projects submitted, proposed by 33 researchers. Of these, 6 ideas were selected, mainly focused on the production of an experimental statistical output in accordance with the areas of the innovation programme, such as big data, machine learning and data integration. These results confirm the role defined for the innovation laboratory, promoting it within the Institute as a tool for research and innovation, as a channel to promote collaborations and partnerships, and as an infrastructure to encourage colleagues and researchers to develop innovation.

Using the state-space framework of JDemetra+ in R 149 Jean Palate, (Email) National Bank of Belgium, Brussels, Belgium State-space models provide a unified approach to a wide range of problems in the time series domain. Since its creation, the software JDemetra+, which has been officially recommended by Eurostat and the ECB for seasonal and calendar adjustment of official statistics, makes extensive use of such models. The state-space framework of JDemetra+ is based on an original object-oriented design and on advanced algorithms that make it especially powerful. Even though it constitutes the kernel of many high-level routines, the state-space framework remains largely underutilized. This is mainly due to the complexity of the matter and to the barrier of the programming language – Java – for accessing its functionalities and/or extending them. To increase the use of the routines, we have built new modules that greatly simplify, in Java, the creation and estimation of univariate or multivariate state-space models, and a companion R package for using them transparently from that well-known environment. We illustrate this new tool with some examples in R. The first one is a seasonal-specific structural time series model, which models series with high volatility in some periods; in the seasonal adjustment domain this is the model-based equivalent of the seasonal-specific filters of X11. The second example is a multivariate model corresponding to a complex survey design with rotating panels (the Labour Force Survey in Australia). The model can be used, for instance, to measure the impact of changes in the survey. Many other problems could be handled in a similar way.
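
For readers unfamiliar with structural time series models, the following base R sketch fits a basic structural model and derives a crude seasonally adjusted series; it uses stats::StructTS as a conceptual stand-in and does not call the JDemetra+ state-space framework or its companion R package.

```r
# Basic structural model (level + slope + seasonal) as a simple state-space example.
fit <- StructTS(log(AirPassengers), type = "BSM")
fit$coef                                          # estimated disturbance variances
states <- fitted(fit)                             # filtered level, slope and seasonal
sa <- exp(log(AirPassengers) - states[, "sea"])   # crude seasonally adjusted series
plot(cbind(raw = AirPassengers, adjusted = sa))   # compare raw and adjusted series
```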

Gustave - An R package for variance estimation in surveys 133 Nicolas Paliod, (Email), Lionel Delta, (Email), Thomas Deroyon, (Email), Martin Chevalier, (Email) Insee, Paris, France Estimating the variance of survey estimates is an important but often difficult step in their evaluation and dissemination process. Precision estimates play a key role in European surveys quality reporting, as defined in European regulations, for instance the Integrated European Social Statistics framework regulation currently under negotiation. They help users in their analysis of disseminated aggregates, especially when computed on domains (industries for business statistics, regions for household statistics). Errors in surveys may have multiple sources and causes, which are for the most part difficult to evaluate quantitatively. In survey data, sampling and nonresponse errors represent a significant part of the total error, which official handbooks strongly recommend taking into account and estimating. In France, the precision estimations performed by the National Statistical Institute Insee for its surveys incorporate the following elements: sampling design, unit nonresponse usually treated by reweighting methods, treatment of influential units, and calibration. Variance estimation for household surveys uses analytical formulas, taking into account the real sampling design according to which the sample has been selected. These analytical formulas are implemented in tailor-made precision estimation programs developed in R. Due to the complexity of sampling designs and survey data treatments, especially for household surveys, estimating variance – that is, defining the precise analytical variance formula to be used and implementing it in statistical software – is a demanding and time-consuming task, usually performed by members of the methodological staff. However, the computation of precision estimates should also be made available to data users such as subject matter experts, so that they can compute, as easily as possible, the variance of the variables they wish to comment on and disseminate. The Gustave package has been conceived as a tool to facilitate the computation of precision estimates and to change the way it is organised. Its goal is to lessen the burden on the methodological staff and limit it to the more technical tasks for which their expertise is needed. We will limit our presentation here to the general principles according to which Gustave is organised. The final paper will include an example of variance computation with Gustave on the French Labour Force Survey.
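
As a minimal illustration of the kind of analytical formula that such a tool wraps, the sketch below estimates the variance of an expanded total under simple random sampling without replacement; it does not use the Gustave package itself and ignores nonresponse treatment and calibration.

```r
# Horvitz-Thompson total and its analytical variance under SRSWOR (toy data).
set.seed(42)
N <- 10000                       # population size
n <- 400                         # sample size
y <- rlnorm(n, meanlog = 3)      # study variable observed in the sample
total_hat <- N * mean(y)                         # expanded total
var_hat   <- N^2 * (1 - n / N) * var(y) / n      # SRSWOR variance estimator
c(total = total_hat, cv = sqrt(var_hat) / total_hat)
```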

The use of metadata to manage data processing processes and the definition of data validation rules 152 Marek Panfiłow, (Email) Statistical Office, Olsztyn, Poland Conducting statistical surveys requires performing a number of specific steps, from collecting data, through processing them, to preparing results. Each step is a process. The processes depend on each other and must be performed in a specific sequence. The large number of such processes and their interconnectedness leads to difficulties in determining what processing step we are currently in and what the next step should be. The solution is to describe these dependencies in the form of metadata and to manage them appropriately. One such process is data set validation, which consists of data validation rules. Rules are created by experts in a given topic and then implemented in the data processing system by technical staff. Such implementation is not always trivial. The solution is to describe the rules in the form of notations that can be automatically translated into an executable form. Such records may be stored in the form of metadata and used by the data processing system.
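
The sketch below illustrates the idea of rules stored as metadata and translated automatically into an executable form: each rule is a plain text expression evaluated against the data. It is a toy example, not the office's actual processing system.

```r
# Validation rules stored as metadata (plain text expressions).
rules <- data.frame(
  rule_id    = c("R1", "R2"),
  expression = c("age >= 0 & age <= 120", "turnover >= costs"),
  stringsAsFactors = FALSE
)
records <- data.frame(age = c(34, 150), turnover = c(200, 80), costs = c(120, 95))

# Translate each rule into an executable check and apply it to the records.
check <- function(data, rules) {
  sapply(rules$expression, function(e) eval(parse(text = e), envir = data))
}
check(records, rules)   # TRUE/FALSE per record and rule
```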

Centralised data collection: process innovations and main results in business surveys 120 Pasquale Papa, (Email), Giampaola Bellini, (Email), Francesca Monetti, (Email) Istat - Italian National Statistical Institute, Rome, Italy In recent years several National Statistical Institutes started a reorganization process whose main objective was to enrich the supply and the quality of the information produced, improving the effectiveness and efficiency of the statistical processes. The resulting new organizational set-ups were characterized by the centralisation of all the support services, which were clearly separated from statistical production. The introduction of a structure characterized by production sectors on the one hand and service-providing sectors on the other pushed towards the "transversalization" of many services, which are now managed in a specialized manner. The result was a strong standardization and harmonization of all these services, and in particular of data collection. The introduction of specialist data collection led to a review of the organizational structure of data collection processes and the redesign of many of the management procedures adopted. Process innovations introduced in structural and short-term business surveys are based, on the one hand, on the implementation of infrastructural solutions, such as the single access point to data acquisition systems – the “Business statistical portal” – and the centralized Contact center for inbound and outbound services; both solutions are managed through a detailed calendar of each data collection activity. On the other hand, process innovations are based on the standardization and generalization of each phase of the data collection process and on the specialization of personnel devoted to specific transversal activities. The main results were a significant increase in average response rates and a reduction of the data collection periods. A detailed description of the innovations and standardization adopted and of the consequent results will be provided.

Process innovations, integrated approach and development perspectives in the implementation of Data Collection of agricultural surveys 114 Loredana De Gaetano, (Email), Pasquale Papa, (Email) Istat - Italian National Statistical Institute, Rome, Italy The introduction of centralized and integrated Data Collection (DC) models has brought important process and product innovations, improving data quality and response rates for most surveys. The new set-up and innovations have had a significant impact on the surveys of the agriculture sector, which received a particular impulse in terms of efficiency. In general, the centralization of Data Collection aims to enrich the offer and quality of information produced, improving the efficiency of statistical processes. The resulting organizational structure is characterized by a clear separation between support services and thematic ones, the latter managed by statistical production. The new model limits the role of production structures to the thematic aspects only, while the transversal skills are assigned to specialized sectors. The introduction of an organizational set-up characterized by thematic sectors on the one hand and service sectors on the other prompted the "transversalization" of many services, which are thus managed in a very specialized way. The result is a standardization and harmonization of all the processes, and in particular of those related to data collection. The introduction of specialized data collection also leads to the revision of the organizational structure devoted to Data Collection processes and the redesign of many of the management procedures adopted. Before the adoption of these centralized models, the statistical processes were organized according to the classical "stovepipe" model, which involved independent, non-integrated statistical processes, each including all the necessary skills: statisticians, information technology experts, thematic experts, methodologists. This choice, although characterized by a high probability of achieving the set objectives in terms of compliance with the Regulations and with national dissemination plans, implied a very low overall efficiency level, due to redundant overlaps and a lack of integration between the processes. The main trends underlying the centralization needs are the decreasing number of human resources assigned to the National Statistical Institutes, the greater degree of training and specialization of the available human resources, the development of communication and information technologies, the computerization of the main survey units from which the data are collected (companies, institutions, individuals and households), and the need for greater consistency between the statistical indicators produced, in particular at the level of national accounts indicators. The new centralized management model for Data Collection pushes to the maximum the possibilities of standardizing the processes, notably the DC field implementation processes, with the only limitation being respect for the sector specificities related to the types of units involved (companies, households, individuals, institutions, farms) and the collection technique used (for example, use of a territorial network, intermediate bodies, etc.). The positive results obtained following the introduction of centralized data collection originate mainly in the standardization of data collection processes.

Extraction of occupation, competences and qualifications from Internet job offers for official statistics 121 Robert Pater, (Email) 1, 2, Maciej Beręsewicz, (Email) 3, Łukasz Cywiński, (Email) 1 1 University of Information Technology and Management, Rzeszów, Poland 2 Educational Research Institute, Warsaw, Poland 3 Poznań University of Economics and Business, Poznań, Poland There is an increasing need for analysing companies’ detailed demand for workers, that is, for occupations, skills or competences, and qualifications. Current surveys conducted by statistical offices do not contain information on companies’ demand for future workers’ competences or qualifications. One might consider online job offers as a supplement or alternative to surveys of the vacancy market. However, these data sources are unstructured, and the relevant information must be extracted before quality and representativeness assessment and estimation can be carried out. Our main contribution is the proposal of an efficient method for analysing Internet job offers with respect to the detailed information they contain. The method is based on gathering Internet job offers and analysing them with text mining and machine learning tools. We present the caveats of such research and propose solutions to them. We apply this method to the Polish vacancy market and compare the results with ongoing representative surveys to correct for the non-probability character of these data. The results are especially important for economists, the education sector, and labour market institutions, e.g. for shaping the OECD Skills Strategy. Such detailed information may be used to adjust labour market and education policy, especially policies directed at reducing structural unemployment.
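
As a toy illustration of the extraction step, the sketch below detects competences in job offer texts with a small keyword dictionary; the real pipeline described in the paper relies on text mining and machine learning rather than a fixed dictionary.

```r
# Dictionary-based competence extraction from job offer texts (toy example).
offers <- c("Data analyst needed: SQL, R and communication skills required.",
            "Warehouse worker, forklift licence, teamwork.")
dictionary <- c(sql = "\\bsql\\b", r = "\\bR\\b",
                communication = "communication", teamwork = "teamwork")

extract <- function(text) {
  hits <- sapply(dictionary, grepl, x = text, ignore.case = TRUE)
  names(dictionary)[hits]          # competences whose pattern matches the text
}
lapply(offers, extract)            # list of competences detected per offer
```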

Africa’s Participation in International Statistical Literacy Project 017 Elieza Paul, (Email) International Statistical Literacy Project (ISLP) - Tanzania, Mwanza, Tanzania Although global and continental progress on strengthening statistical development in Africa is growing, especially in official statistics, the International Statistical Literacy Project (ISLP), a tool intended to add value to official statistics in Africa, remains limited and taken for granted. The proportion of African countries that have participated in promoting statistical literacy through the international statistical literacy poster competitions is significantly lower than in other regions of the world. To date, 18 African countries have either shown interest in or participated in the ISLP since it started; 9 of these 18 countries have only shown interest but have not participated at either national or international level. Drawing on the literature and on data collected by interviewing 53 ISLP country coordinators across Africa by email, this paper explores the challenges that hinder this important initiative in their respective countries. A lack of volunteering spirit, limited political commitment, language barriers and limited effort by National Statistical Offices were among the challenges identified. Closer cooperation between all actors dealing with statistics, especially between the ISLP, NSOs and academia, is suggested as a way to improve participation, working towards a statistically literate society through the International Statistical Literacy Project (ISLP).

Integration of inconsistent data sources using Hidden Markov Models 153 Paulina Pankowska, (Email) 1, Bart Bakker, (Email) 1, 2, Daniel Oberski, (Email) 3, Dimitris Pavlopoulos, (Email) 1 1 Vrije Universiteit Amsterdam, Amsterdam, Netherlands 2 Statistics Netherlands, The Hague, Netherlands 3 Utrecht University, Utrecht, Netherlands National Statistical Institutes (NSIs) increasingly obtain information on the same phenomena from different sources. These sources, however, despite official statisticians’ best efforts, often provide inconsistent estimates. These inconsistencies occur primarily as a result of measurement error. An attractive solution to this problem, in the context of categorical data, is latent class modelling (LCM). In this method, the problems of data reconciliation and measurement error are solved simultaneously by linking two or more sources and modelling them as conditionally independent measures of an underlying true value. A specific group of latent class models for longitudinal data are hidden Markov models (HMMs). While HMMs serve as an attractive solution to the problem of discrepancies across longitudinal data sources, several issues need to be considered before they can be utilized in the production of official statistics. First, the procedures involved in applying and estimating HMMs are very complicated, time-consuming and expensive and, therefore, cannot be applied regularly. Thus, it is desirable to re-use HMM estimates from previous time points with more recent data. This procedure may produce accurate estimates only if measurement error is time-invariant. Second, as the method requires data linkage, it might lead to linkage error – a new potential source of bias. Therefore, there is also a need to examine the sensitivity of HMM estimates to linkage error. In our research, we test the feasibility of using HMMs as a way to reconcile different sources which measure the same phenomenon, given the challenges outlined above. In doing so we apply an extended, two-indicator HMM to Dutch data on transitions from temporary to permanent employment coming from the Labour Force Survey (LFS) and the Employment Register (ER). Our results cast a positive light on the feasibility of using HMMs in official statistics production. Namely, we show that it is possible to re-use parameter estimates in more recent data, provided that the error parameters are time-invariant. We also demonstrate that the sensitivity of the method to linkage error is rather low. Finally, we also illustrate how HMMs can be used to evaluate the effectiveness of various data collection techniques in producing accurate statistics.
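
The sketch below shows the forward recursion of a simple two-state hidden Markov model with a single error-prone categorical indicator; the parameters are invented and the authors' two-indicator model for the LFS and the Employment Register is considerably richer.

```r
# Two-state HMM: true contract type (temporary/permanent) observed with error.
pi0 <- c(temp = 0.7, perm = 0.3)                   # initial distribution of true state
A   <- matrix(c(0.90, 0.10,                        # transition probabilities of the
                0.05, 0.95), 2, 2, byrow = TRUE)   # true state between waves
B   <- matrix(c(0.95, 0.05,                        # P(observed | true); off-diagonal
                0.10, 0.90), 2, 2, byrow = TRUE)   # entries are measurement error
obs <- c(1, 1, 2, 2, 2)                            # observed contract type over 5 waves

alpha <- pi0 * B[, obs[1]]                         # forward recursion
for (t in 2:length(obs)) alpha <- (alpha %*% A) * B[, obs[t]]
sum(alpha)                                         # likelihood of the observed sequence
```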

Plutus – A new tool to handle metadata of seasonal adjustment 027 Mária Pécs, (Email) Hungarian Central Statistical Office (HCSO), Budapest, Hungary Seasonal adjustment is an everyday step of the statistical business process in official statistics, therefore a proper system that guarantees high-quality results is important for every National Statistical Institute (NSI). The Hungarian Central Statistical Office (hereinafter referred to as HCSO) operates a centralized internal system for seasonal adjustment. The aim of Plutus (the newly developed system described here) is to help the collaboration of colleagues from different departments and to cope with the growing set of data, especially during the annual model revisions. Plutus is a system of databases with a user-friendly interface, fully metadata-driven. The biggest advantage of Plutus is that the whole system is more transparent than before and the collected metainformation can be linked to any other metadata object. The tool is now under construction and the planned implementation is the next annual revision period in 2019.

Youth cultures: a text-driven automatic approach. 203 Felicia Pelagalli, (Email) 1. Culture srl, 2. Sapienza University, Rome, Italy Introduction In 2017, young people aged between 15 and 29 represented 15.1% of the population in Italy (16.7% in the Euro Area). Unemployment among those aged 15-24 reached 34.7% (18.8% Euro Area). The ratio of NEETs (young people who are Neither Employed nor completing their Education and Training) aged between 15 and 24 rose to 20.1% (11.2% Euro Area). Finally, the most alarming figure: 34.7% of young Italians aged between 15 and 24 are at risk of poverty or social exclusion (28.4% Euro Area). [Source: Eurostat] Who are they? What do they think? What are their cultural models? In this paper I present an investigation of the different youth cultures, concerning the relationship between youth and the world of work, carried out with a text-analysis algorithm based on the co-occurrence of words. Methodology I analysed the presentations of a hackathon day (“TipoHack”) which took place in Rome on 18 October 2018, involving 80 young men and women aged between 18 and 29. The presentations of the ideas elaborated by the eight teams were recorded and transcribed in full, and they constitute the corpus under analysis. The corpus was treated using an algorithm that analyses similarity matrices in order to provide a visual representation of the relationships among the data within a space of reduced dimensions, and to interpret both the relationships between the "objects" and the dimensions that organize the space in which they are represented. Results On the horizontal plane we find a sharp contrast between, on the one hand, a motivation towards affiliation (the search for the other’s acceptance, for a sense of belonging, for a reassuring way of depending on others) and, on the other hand, a motivation towards success (a strong drive toward success, the assumption of responsibilities and entrepreneurship). On the vertical plane, the necessity to define an identity (who I am) prevails on one side, and the need to find an occupation (what I do) on the other. Along these two axes, the different youth cultures are articulated in the word map (Fig. 1). The analysis defines a complex and articulated picture of the youth world. There are clusters of “waiting and expectations” regarding what the market may offer (work offer, offer, advertisement, path, professional, problem, subject), looking for guarantees and for support within a world of adults and disappointment; and there are clusters ready to “undertake”, moving towards the construction of a project (succeed, passion, formation, know, skills, possibilities, people). The Net and the peer-to-peer culture open up possible exchanges of models and skills. It is important to bring young people into contact with the world of work, but also with each other (beyond “bubbles” and neighbourhoods), to go beyond the classic barriers between different worlds (segments, tribes, clusters) that do not meet, and to work on common projects. It is important to begin to think about new platforms that will be able to shape the future of work.
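
As a toy illustration of the word co-occurrence and dimensional reduction idea (not the actual algorithm used in the study), the sketch below builds a term-by-term co-occurrence matrix from short texts and maps the terms into two dimensions with classical multidimensional scaling.

```r
# Toy corpus of short texts standing in for the transcribed presentations.
docs <- c("passion skills project success",
          "support belonging guarantees offer",
          "skills project success responsibility",
          "offer path support belonging")
tokens <- strsplit(tolower(docs), "\\s+")
terms  <- sort(unique(unlist(tokens)))

dtm <- t(sapply(tokens, function(tk) as.integer(terms %in% tk)))  # document-term matrix
colnames(dtm) <- terms
cooc   <- crossprod(dtm)              # term-by-term co-occurrence counts
coords <- cmdscale(dist(cooc), k = 2) # 2-D map of the terms (classical MDS)
plot(coords, type = "n", xlab = "dim 1", ylab = "dim 2")
text(coords, labels = terms)          # a rough "word map" of the toy corpus
```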

Transforming Health and Social Care Publications in Scotland 068 Anna Price, (Email) Information Services Division, NHS Scotland, Edinburgh, United Kingdom Introduction The Information Services Division (ISD) of the National Health Service Scotland produces around 200 health and social care publications each year, which are designated either official or national statistics. Most publications are produced using SPSS and published as static PDF documents with accompanying Excel tables. Feedback has shown that our data can be challenging to find and digest in this format. Furthermore, production is time-consuming, involving extensive manual formatting and checking. The transforming publications programme aims to modernise how ISD produces and releases data. Methods We have been using a combination of data science, DevOps and user-driven design principles. We first identified our most common customer types and developed a set of personas. Customers were engaged with directly through interviews and focus groups to identify whether our perceptions of customers’ needs reflected reality, and the findings were translated into features for development. Once the team had an informed understanding of customer needs, we designed a new method of publishing data, focusing on one publication as a proof of concept. The team worked iteratively, involving customers to test and provide feedback on the new platform as it underwent development. To build this platform and make the publishing process more efficient, robust and reliable, we transferred data production from SPSS to R, using modern data wrangling code from the tidyverse suite of packages. To build the new publication platform we used a combination of RShiny dashboards and D3 charts. We used git and GitHub for version control and have published the code behind the RShiny dashboard. We also developed an R Style Guide, GitHub best practice and a suite of R resources in order to facilitate learning, development and collaborative working within ISD and across the wider public sector in Scotland. Results A prototype for statistical publications was released in December 2017. We have encouraged continual feedback on the new publication, allowing the development of additional features which will help refine the product further. In September 2018, a second publication was released using this new publication format. We are now working with several teams within ISD to transform their publications into this new design. We are also developing the first Reproducible Analytical Pipeline for an official statistics publication in Scotland in order to streamline the production process further. Conclusions By co-designing a new model of presenting data, we can provide customers with the data they need in a way that they can understand. Furthermore, the new automated method of producing the publication has created time savings and reduced the risk of manual errors.

IT Infrastructure for a Data Science Campus 113 Craig Pritchard, (Email) Data Science Campus, Office for National Statistics, Newport, United Kingdom The modern economy provides both the challenge of measuring fast-evolving forms of economic activity and the opportunity to exploit huge amounts of new data and information to help policymakers, researchers and businesses. The Data Science Campus (DSC) was created to respond to this challenge and acts as a hub for the whole of the UK public and private sectors to gain practical advantage from the increased investment in data science research and capability building. Our goal at the Data Science Campus is to explore how new data sources and data science techniques can improve our understanding of the UK’s economy, communities and people. To circumvent the user restrictions of a standard corporate network, the Data Science Campus has created an isolated Data Science Campus Network (DSCN) providing data scientists with the infrastructure, IT services and development tool sets required to research and develop the next generation of statistics. The DSCN provides a mechanism to ingest and process both structured and unstructured partner data by means of a Trusted Data Zone (TDZ) and as a result has led to various innovative projects.

IT infrastructure for Big Data and Data Science: Challenges at Statistics Netherlands 094 Marco Puts, (Email) CBS/Statistics Netherlands, Heerlen, Netherlands Until a couple of years ago, processing data at national statistical institutes (NSIs) did not differ much from processing data in an administrative environment. In the case of surveys, one needed a database comprising all companies or persons in the country, took a sample of this population, and, when the questionnaires were sent back, processed the (relatively) small amounts of data to calculate the desired estimates. In the case of register data, the amount of data was larger, but still manageable for relational database management systems (RDBMS) or even in the form of comma-separated values. Most of the data sets would fit in the memory of modestly dimensioned desktops. The role of statistical methodology was simple: given one or several datasets, define an algorithm that executes a certain method for estimating the target variables. Processes such as data cleaning could be rule-based or could encompass an iterative process to find the “right” values. Most of the time, these algorithms did not need to be efficient, since an inefficient algorithm would at worst run for a couple of hours once every three months or once a year. In the age of big data and data science, this is no longer the case. Nowadays, many questions can be asked and answered by using more timely data, and modern statisticians are expected to be data scientists who use modern technology to answer current questions quickly. The transition of the systems used is not an easy task. The administrative landscape is based on low data throughput, and relatively small latencies can be realized relatively easily. For data science infrastructures, this is another story. Small latencies are not easy to realize, and in certain cases a high data throughput needs to be achieved, which cannot be done with the more traditional infrastructures. Whereas in traditional infrastructures vertical scaling, or scaling up, was normally applied when more compute power was needed, horizontal scaling, or scaling out, is used in data-intensive infrastructures: more servers are coupled to create clusters, which scale easily with the data [1]. In this paper, we will discuss what the transition from “classical” statistics to modern data science means and what challenges lie ahead when implementing it.

RJDemetra: an R interface to JDemetra+ 053 Alain Quartier-la-Tente, (Email) 1, Anna Michalek, (Email) 2 1 Insee, Paris, France 2 European Central Bank, Frankfurt am Main, Germany RJDemetra is an R interface to JDemetra+, the seasonal adjustment software officially recommended by Eurostat and the European Central Bank (ECB) to the members of the European Statistical System (ESS) and the European System of Central Banks (ESCB). JDemetra+ is developed by the National Bank of Belgium (NBB) in cooperation with the Deutsche Bundesbank, with the support of Eurostat, in accordance with the ESS Guidelines on Seasonal Adjustment. It implements the two leading seasonal adjustment methods, TRAMO/SEATS+ and X-12ARIMA/X-13ARIMA-SEATS. RJDemetra is an R package that offers full access to all JDemetra+ options and outputs. It also offers many possibilities for users to implement new tools for the production of seasonally adjusted series thanks to all the libraries already available in R.
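
A hedged usage sketch is given below; the function name x13() and the predefined specification "RSA5c" follow the RJDemetra documentation as we understand it and should be checked against the installed package version.

```r
# Seasonal adjustment of a standard example series with RJDemetra
# (illustrative call; exact function and specification names are assumptions).
library(RJDemetra)
sa <- x13(AirPassengers, spec = "RSA5c")   # X-13ARIMA-SEATS with a predefined spec
sa$regarima                                # pre-adjustment (RegARIMA) results
sa$decomposition                           # X-11 decomposition of the series
```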

Predictive performance of a hybrid technique for the multiple imputation of survey data 011 Humera Razzak, (Email) 1, Prof. Dr. Christian Heumann, (Email) 2 1 Ludwig-Maximilians-Universität, Munich, Germany 2 Institute of Statistics, LMU, Munich, Germany Analysis of data for scientific investigations becomes complicated, biased and less efficient in the presence of missing information. In recent decades, a lot of effort has been made in the development of statistical methods to handle missing data. In many survey-based studies, the logistic regression model is used to investigate the effect of various background characteristics (e.g. demographics, age, education, motherhood and recent births, etc.) on a binary outcome variable such as breastfeeding practices. This model can be difficult to apply when the confounding variables are missing. A popular chained-equations MI approach called Multivariate Imputation by Chained Equations (MICE) sometimes fails to perform well due to computational inefficiency, complex dependency structures among categorical variables and a high percentage of missing information in large-scale survey data. We develop a Hybrid Multiple-Imputation (HMI) approach for handling data with the problem described above. The proposed missing data imputation approach is a 3-stage approach. The relationship between the binary response (ever breastfed) and explanatory variables is modelled using a generalized linear model (GLM). The accuracy of the predictive distributional model is assessed by the area under the receiver operating characteristic (ROC) curve, known as the AUROC, and the results obtained under the proposed and existing MI methods for a large spectrum of data characteristics are compared. Better predictive performance with minimal computational time compared to the existing methods is partly achieved in simulation studies.
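
For orientation, the sketch below shows the standard MICE workflow in R with a binary outcome modelled by a GLM, using the mice package on invented data; the paper's hybrid three-stage HMI approach is not reproduced here.

```r
# Multiple imputation by chained equations followed by a pooled logistic GLM.
library(mice)
set.seed(11)
df <- data.frame(
  breastfed = rbinom(500, 1, 0.7),
  age       = rnorm(500, 28, 6),
  education = factor(sample(c("primary", "secondary", "higher"), 500, replace = TRUE))
)
df$education[sample(500, 100)] <- NA            # missingness in a confounder

imp  <- mice(df, m = 5, printFlag = FALSE)      # chained-equations imputation
fits <- with(imp, glm(breastfed ~ age + education, family = binomial))
summary(pool(fits))                             # estimates pooled across imputations
```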

GSBPM Next Level - a proposed evolution of the model 143 Laurie Reedman, (Email), Jackey Mayda, (Email), Paul Holness, (Email), Alice Born, (Email) Statistics Canada, Ottawa, Canada The current version 5.0 (2014) of the GSBPM has been adopted by over 50 jurisdictions, including Statistics Canada. The first version 1.0 of the model was introduced in 2008. Since that time the model has undergone significant changes in terms of its strategic focus and design. However, over the past 10 years, we have witnessed a tremendous increase in the development, consumption and integration of data. The demand for real-time access to detailed data is growing at an enormous rate. Along with the increased volume of data, there has been a noted increase in the variety of input data sources, including the internet of things, web scraping, scanner data, electronic questionnaires and a further increase in the use of administrative data. The increased volume and variety of data have been further accelerated by the evolution and proliferation of a host of new technologies which have increased the velocity at which we generate data. These changes are driving the need for improved data management, particularly in the preparation of official statistics.

Experimental statistics from an unlikely source 142 Laurie Reedman, (Email), Andrew Brennan, (Email) Statistics Canada, Ottawa, Canada In March 2018 Statistics Canada began data collection on an entirely new frontier: we are sampling wastewater. A literature review of wastewater epidemiology yielded the methodology for measuring the concentrations of metabolites in municipal wastewater systems and back-calculating the consumption of the parent drugs by the human population contributing to the wastewater catchment area. A pilot project was undertaken, primarily to test the feasibility of calculating the consumption of cannabis in five Canadian cities. This work explores the potential for passive monitoring of cannabis use to reduce dependence on self-reporting, and ultimately reduce response burden. The same methodology can be used for estimating the consumption of other drugs, further improving our ability to monitor health risks and illicit economic activity.
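
A stylised version of the back-calculation is sketched below, with invented figures and correction factor rather than Statistics Canada's parameters: the metabolite concentration and wastewater flow give a daily load, which is scaled to parent drug consumed per 1000 inhabitants.

```r
# Wastewater-based back-calculation of drug consumption (illustrative figures).
concentration_ng_L <- 950       # metabolite concentration in wastewater (ng/L)
flow_L_day         <- 2.0e8     # daily wastewater flow at the treatment plant (L/day)
correction_factor  <- 2.3       # excretion and metabolite-to-parent conversion (assumed)
population         <- 450000    # population of the catchment area

load_mg_day <- concentration_ng_L * flow_L_day / 1e6            # metabolite load (mg/day)
consumption_mg_per_1000 <- load_mg_day * correction_factor / (population / 1000)
consumption_mg_per_1000   # estimated parent-drug consumption, mg per day per 1000 inhabitants
```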

Deep Data and Shared Computation: shaping the future Trusted Smart Statistics 135 Fabio Ricciato, (Email), Albrecht Wirthmann, (Email), Michail Skaliotis, (Email), Fernando Reis, (Email), Kostantinos Giannakouris, (Email) EUROSTAT, Luxembourg, Luxembourg In this position paper we discuss the potential evolution of official statistics from an operating model based on data concentration towards an alternative model based on computation spreading. The latter involves a certain degree of participation by the input data sources in the definition of the desired output information (statistics) as well as in the design and execution of the processing method. This approach reinforces the protection of the confidentiality of the input data and distributes the physical and logical control over the whole process across multiple actors. This model can be naturally combined with the principles of public transparency, algorithmic transparency and auditability. Furthermore, the “shared control” spirit underlying this model fits well in scenarios where independent authorities are called upon to validate and certify the adherence of a computation instance to the applicable legal provisions (e.g., privacy regulations) and ethical standards. The collective effect of such measures is to increase transparency, collective control, public trust and acceptance of the use of increasingly pervasive deep data for public interest purposes.

Privacy and data confidentiality for Official Statistics: new challenges and new tools 190 Fabio Ricciato, (Email), Aleksandra BUJNOWSKA, (Email) EUROSTAT, Luxembourg, Luxembourg Modern society is undergoing a process of massive datafication [1]. The availability of new digital data sources represents an opportunity for Statistical Offices (SO) to complement traditional statistics as well as to produce novel statistical products with improved timeliness and relevance. However, such opportunities come with important challenges in almost every aspect – methodological, business models, data governance, regulation, organizational and others. The new scenario calls for an evolution of the modus operandi adopted by SO also with respect to privacy and data confidentiality, which is the focus of the present contribution. We propose here a discussion framework focused on the prospective combination of advanced Statistical Disclosure Control (SDC) methods with Secure Multi-Party Computation (SMC) techniques.
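
As a toy illustration of the secure multi-party computation idea (additive secret sharing, not a production SMC protocol), the sketch below splits each confidential input into random shares so that only the aggregate total can be reconstructed.

```r
# Additive secret sharing: each data holder splits its confidential value into
# random shares; the parties sum the shares they receive, and only the total of
# the partial sums equals the aggregate of the original values.
set.seed(3)
split_shares <- function(value, n_parties) {
  r <- runif(n_parties - 1, -1e6, 1e6)
  c(r, value - sum(r))                      # shares sum exactly to the value
}
values <- c(holderA = 1200, holderB = 850, holderC = 430)   # confidential inputs
shares <- sapply(values, split_shares, n_parties = 3)       # one column per holder
partial_sums <- rowSums(shares)             # what each computing party can see and sum
sum(partial_sums)                           # equals sum(values); no individual value revealed
```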

Nowcasting GDP for the Baltic States: A comparative approach in support of official quarterly GDP forecasts 204 Dan A. Rieser, (Email) Directorate-General for Economic & Financial Affairs (DG ECFIN) ECFIN.E.2 - Economies of the Member States III: Estonia, Latvia, Lithuania, Netherlands CHAR 10/162 Rue de la Loi 170 BE - 1049 Brussels, Bruxelles, Belgium The purpose of this analysis is to make a contribution to GDP nowcasting and thereby to support DG ECFIN’s quarterly forecasting process for key macroeconomic variables, most notably GDP. This will be done by utilising what are known as “nowcasting procedures” and by enhancing existing forecasting techniques. Building on the work carried out by Rieser (2017) and outlined in EUROSTAT (2017) as well as EUROSTAT (2018, which summarises in a succinct manner the key advances made in the area of nowcasting to date), different econometric techniques will be used for nowcasting purposes. These techniques are: univariate procedures, ARIMA, structural VARs and behavioural econometric models (the latter making use of a parsimonious set of explanatory variables that allow forecasting GDP). The work done by Poissonnier (2017) will be enhanced by choosing a quarterly approach, which adds more granularity to the nowcast albeit increasing its complexity at the same time. JDemetra+ can be used to examine existing patterns of seasonality in the data. This nowcasting approach for GDP is envisaged to support the quarterly forecasting process for GDP in DG ECFIN. The aforementioned econometric techniques will be applied to the Baltic States: Estonia, Latvia and Lithuania. Despite their similarities, structural differences exist among the three states. All three are highly dependent on external financial flows, mainly from EU structural funds. GDP in the Baltic States is hence significantly affected by a long-term structural trend, seasonality and cyclicality, as well as by the decline, deceleration and rebound following the 2007/08 Global Financial Crisis. In a further stage, the analysis can be extended by analysing the interdependences between the three countries, including lead-lag relationships. The work is expected to be completed in Q3/2019. R will be used for the implementation of the econometric models.
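
As a minimal example of the univariate benchmarks listed above, the sketch below fits an AR(1) model to simulated quarterly GDP growth with base R and produces a one-step-ahead nowcast; it is illustrative only and not the DG ECFIN model suite.

```r
# Univariate ARIMA-type nowcast of quarterly GDP growth (simulated data).
set.seed(5)
gdp_growth <- ts(arima.sim(list(ar = 0.5), n = 60, sd = 0.8) + 0.6,
                 start = c(2004, 1), frequency = 4)   # quarterly growth rates
fit <- arima(gdp_growth, order = c(1, 0, 0))          # AR(1) with mean
predict(fit, n.ahead = 1)                             # nowcast and its standard error
```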

Quantifying the development of Hungarian counties with the LISREL estimation procedure during the period 1990-2016 089 Ildiko Ritzlne Kazimir, (Email), Klaudia Matene Bella, (Email) Corvinus University Budapest, Budapest, Hungary The development of Hungarian counties is generally described using the indicator GDP per capita. This is a very simple approach, because the phenomenon is more complex: it is a multidimensional problem. We argue that the development of counties is a latent variable. A linear regression cannot be fitted because the dependent variable is unknown, but a special factor-analytic method is able to provide a solution to this problem. The unobserved dependent variable is influenced by determinants and in turn has an effect on the indicators. Using the LISREL estimation procedure, it is possible to quantify the relative development level of counties. This method is used in the calculation of the hidden economy, but we can show that it is also a useful method for quantifying other latent variables. Having a measure of development, we can analyse the assumption that Hungary's integration into the global value chain has had a significant impact on the development of counties during the period reviewed.
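
A hedged sketch of the latent variable idea is given below, using a single-factor model estimated with base R's factanal rather than LISREL: county development is treated as an unobserved factor behind several observed indicators, with simulated data.

```r
# Single-factor model: "development" is latent and drives observed indicators.
set.seed(2)
n <- 100                                       # county-year observations (simulated)
development <- rnorm(n)                        # latent development level (unobserved)
indicators <- data.frame(
  gdp_pc     = 0.9 * development + rnorm(n, sd = 0.4),
  wages      = 0.8 * development + rnorm(n, sd = 0.5),
  employment = 0.7 * development + rnorm(n, sd = 0.6),
  patents    = 0.5 * development + rnorm(n, sd = 0.8)
)
fa <- factanal(indicators, factors = 1, scores = "regression")
fa$loadings                                    # how each indicator reflects the factor
head(fa$scores)                                # estimated relative development levels
```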

Applying Machine Learning for Automatic Product Categorization 012 Andrea Roberson, (Email) U.S. Census Bureau, Suitland, United States The North American Product Classification System (NAPCS) is a comprehensive, hierarchical classification system for products (goods and services) that is consistent across the three North American countries and promotes improvements in the identification and classification of service products across international classification systems, such as the Central Product Classification System of the United Nations. Every five years, the U.S. Census Bureau conducts an economic census, providing official benchmark measures of American business and the economy. Beginning in 2017, the economic census will use NAPCS to produce economy-wide product tabulations. Respondents are asked to report data from a long, pre-specified list of potential products in a given industry, with some lists containing more than 50 potential products. Many of the more than 1,200 NAPCS codes can be very complex and ambiguous. Businesses have expressed the desire to alternatively supply Universal Product Codes (UPC) to the U.S. Census Bureau, as this is something they already store in their databases. This research considers the text classification problem of predicting NAPCS classification codes given UPC product descriptions. We present a method for automating the Economic Census by using supervised learning.
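
As a toy illustration of the supervised learning task (not the Census Bureau's model), the sketch below classifies product descriptions with bag-of-words features and a multinomial naive Bayes classifier with Laplace smoothing, written in base R.

```r
# Tiny training set of product descriptions with known classes, plus one new item.
train_x <- c("organic whole milk 1 gallon", "skim milk half gallon",
             "mens cotton t shirt", "womens wool sweater")
train_y <- c("dairy", "dairy", "apparel", "apparel")
test_x  <- "chocolate milk quart"

tokenize <- function(s) unlist(strsplit(tolower(s), "[^a-z0-9]+"))
vocab <- unique(tokenize(paste(train_x, collapse = " ")))

# Log-posterior score of one class for a new description (Laplace smoothing).
nb_score <- function(text, class) {
  toks   <- tokenize(text)
  counts <- table(factor(tokenize(paste(train_x[train_y == class], collapse = " ")),
                         levels = vocab))
  prior  <- log(mean(train_y == class))
  loglik <- sum(log((counts[toks[toks %in% vocab]] + 1) / (sum(counts) + length(vocab))))
  prior + loglik
}
scores <- sapply(unique(train_y), nb_score, text = test_x)
scores                       # the class with the highest score is the prediction
names(which.max(scores))     # "dairy" for this toy example
```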

Enhancing Official Statistics with Remote Sensing Data 105 Natalie Rosenski, (Email) Federal Statistical Office of Germany (Destatis), Wiesbaden, Germany In order to further enhance official statistics, especially regarding timeliness, accuracy, relevance and response burden, the exploration of the use of new digital data (also known as big data) for official statistics is essential. For this reason, different data sources are examined, such as mobile phone and remote sensing data, as well as different techniques such as web scraping. The goal of all these studies is to get a more comprehensive picture of society and the economy, ideally through the combination of survey, administrative and new digital data. Remote sensing data have a huge potential for the production of official statistics. Several international projects are ongoing that explore the use of remote sensing data, especially satellite data, for the determination of different indicators through the detection of different objects. The focus of this abstract, however, is on two ongoing projects which combine remote sensing data with traditional data sources to determine social and economic indicators.

Improving Data Validation using Machine Learning 170 Christian Ruiz, (Email), Christine Ammann Tschopp, (Email), Elisabeth Kuhn, (Email), Laurent Inversin, (Email), Mehmet Aksözen, (Email), Stefan Rueber, (Email) Swiss Federal Statistical Office, Neuchâtel, Switzerland The aim of this project is to extend and speed up data validation at the Swiss Federal Statistical Office (FSO) by means of machine learning algorithms and to improve data quality. Statistical offices carry out data validation (DV) to check the quality and reliability of administrative data and survey data. Data that are likely to be incorrect are sent back to data suppliers with a correction request. Until now, such DV has mainly been carried out at two different levels: either through manual checks or through automated processes using threshold values and logical tests. This process of two-way “plausibility checks” involves a great deal of work. In some cases, staff are required to manually check the data again; in other cases, rules are applied that often require additional checks. This rule-based approach has developed from previous experience but is not necessarily exhaustive or always precise. Machine learning has the potential to ensure faster and more accurate checks. This project is one of the five (pilot) projects currently being developed in line with the FSO’s data innovation strategy (FSO, 2017), with the goal of augmenting and/or complementing the existing basic official statistical production at the FSO.

Sustainability, consumption, resource productivity: regional material flow accounts 128 M. Carme Saborit, (Email), Jordi Galter, (Email), Cristina Rovira, (Email) The Statistical Institute of Catalonia (Idescat), Barcelona, Spain The United Nations defines sustainable development as 'development that meets the needs of the present without compromising the ability of future generations to meet their own needs'. Sustainable development necessarily entails decoupling the consumption of resources from economic growth. Material flow accounts (MFA) show the physical inputs of materials which enter the economic system and the outputs generated, in physical units. These accounts enable us to obtain a set of aggregate indicators on the use of natural resources, from which productivity indicators can be derived. The Statistical Institute of (---) is conducting a statistical project on material flow accounts with the aim of facilitating a detailed description of the interactions between the economy and the environment, providing information on the sustainability of our economic model in accordance with the harmonized methodology defined by Eurostat, adapted to the regional scope (NUTS 2). This project is also justified by the need for data and indicators deriving from circular economy policies, for which the MFA is a fundamental pillar. The Regional Government’s 2017 approval of the drawing up of the National Plan for the implementation of the 2030 Agenda for sustainable development and the National Pact for Industry, which has an axis devoted to sustainability and the circular economy, have helped to promote the project. The Statistical Institute of (---) also forms part of the working group of the CITE (Interterritorial Statistics Committee) on indicators of the 2030 Agenda for sustainable development, with the purpose of exchanging methodological experiences on the preparation of the SDGs, and is establishing synergies with a technology centre in relation to the UrbanWINS project and promoting the integration of these results into an articulated information system, in cooperation with entities linked to the management of environmental and sustainability policies. This paper addresses the methodological adaptations which have been necessary and the limitations currently detected. It also outlines the guidelines for the future development of this module for environmental accounting in official regional statistics.
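
A stylised calculation of the core MFA-derived indicators mentioned above, with invented figures: domestic material consumption (DMC) and resource productivity for a region.

```r
# Core material flow indicators for a fictitious region (figures invented).
domestic_extraction <- 48.0      # million tonnes
imports             <- 21.5      # million tonnes
exports             <- 12.3      # million tonnes
gdp                 <- 230e3     # million euro

dmc <- domestic_extraction + imports - exports     # domestic material consumption (Mt)
resource_productivity <- gdp / dmc                 # euro of GDP per tonne of material
c(DMC = dmc, resource_productivity = resource_productivity)
```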

Attributes for Big Data for Official Statistics – an Application to Scanner Data in Luxembourg 205 Ibtissame SAHIR, (Email), Florabela Carausu, (Email), Botir Radjabov, (Email) GOPA Luxembourg, Luxembourg, Luxembourg Big Data is one of the key topics around which the so-called data revolution has embarked. The increased usage of Big Data in the private sector has raised expectations in the public sector as well. In turn, statistical offices have also embarked on innovative projects to benefit from the opportunities which Big Data can offer, while simultaneously feeling increased pressure from users to digitize their production systems. In order to balance users’ expectations with the real possibilities of making good use of Big Data in official statistics, a series of attributes which Big Data should have in order to be a good fit for the purposes of official statistics is proposed. Importantly, the classic attributes of Big Data – the 4 Vs: volume, variety, velocity and veracity – are insufficient to explore Big Data suitability for official statistics. As a first distinctive attribute, the scope for which official statistics needs or may use Big Data is different from the scope in the private sector: the former is decision-oriented analysis, whereas the latter is action-oriented analysis. Starting from this argument, attributes for Big Data for official statistics are proposed and exemplified through an application of scanner data in Luxembourg. Nowadays, Luxembourg (STATEC) is using a bilateral dynamic index compilation approach for scanner data for the calculation of the «Food» and «Non-Alcoholic Beverages» COICOP groups, which implies that prices of goods of two consecutive months are taken into account, with the basket of goods resampled every month. STATEC is utilizing machine learning algorithms in the item classification process to make classification faster and almost automatic. Future goals regarding scanner data integration in CPI/HICP estimation are to include the COICOP groups «Fresh Fruits», «Fresh Vegetables» and «Alcoholic Beverages» in the STATEC production system, as well as to integrate a multilateral index compilation approach for scanner data for the above-mentioned COICOP groups.

Inference with mobile network data 163 David Salgado, (Email) 1, Bogdan Oancea, (Email) 2, Luis Sanguiao, (Email) 1 1 Statistics Spain (INE), Madrid, Spain 2 Statistics Romania (INS), Bucharest, Romania Mobile network data, also known as mobile phone data, stand as a promising data source for the production of official statistics. Several results already show their potential. However, the configuration of an end-to-end industrialised statistical production process using this source in combination with other data still needs further work to achieve the usual quality standards in official statistics. The statistical production process is a complex process, which entails the need to deal with many different, highly interrelated aspects. Some of these have been approached in the first ESSnet on Big Data (2016-2018) and are currently under further research in the second ESSnet on Big Data (2018-2020), such as the geolocation of network events, the creation of a data model for statistical exploitation, the inference to the target population and the proposal of quality indicators. Here we present ongoing work on the use of hierarchical models for the inference from mobile phone data sets to the target population under analysis, in combination with auxiliary information. Our work adapts existing proposals with administrative data to estimate the population size. We illustrate the use of these models with synthetic aggregate mobile network data about a closed population, providing a proof of concept of this framework for approaching the representativity issue with this new data source.

Big Data Architectures @ Istat 097 Monica Scannapieco, (Email), Natale Renato Fazio, (Email) Istat, Roma, Italy Modern organizations have recognized the importance of having a defined and standard architecture that is able to guide the implementation of the organization’s vision and to harness external drivers and changes. National Statistical Organizations (NSOs) are part of this trend and are increasingly investing in defining architectures (business, application/information, IT) for their business. In the Big Data field, the need for defining appropriate architectures is, if possible, even more urgent, given the recent adoption of Big Data sources as additional sources for the production of official statistics. The Italian National Institute of Statistics (Istat) has been investing in Big Data as new sources for official statistics since 2013, when the “Scheveningen memorandum” acknowledged that Big Data represent new opportunities and challenges for official statistics for the European Statistical System and its partners. In this paper, with respect to the current situation at Istat, we will: • summarize the major investments made in the field of Big Data architectures until now, and • highlight the major open challenges that we see as important to solve in the near future for a full-fledged production of Big Data-based statistics.

Building the Italian Integrated System of Statistical Registers: Methodological and Architectural Solutions 182 Piero Falorsi, (Email) 1, Giorgio Alleva, (Email) 2, Orietta Luzi, (Email) 1 1 Istat, Roma, Italy 2 University of Rome La Sapienza, Roma, Italy In this paper, we will provide some insights on solutions adopted for solving some major issues encountered in building the Italian Integrated System of Statistical Registers, namely: (i) the integration between surveys and registers, exemplified through the case of the relationships between the Continuous Population Census and the Population Register; (ii) the problem of stock-flows harmonization; (iii) the measure of accuracy of register aggregates; (iv) the data architecture design and implementation, including specific architectural solutions for privacy. We conclude by highlighting the major challenges that are still open.

Smart Business Cycle Statistics 082 Clara Schartner, (Email), Markus Zwick, (Email) Federal Statistical Office (Destatis), Wiesbaden, Germany The Eurostat project ‘Smart Statistics’ started in February 2018 and will be finished in March 2019. At the end of January 2019, Eurostat will organise a workshop on ‘Trusted Smart Statistics: policymaking in the age of the IoT’. The project includes three Proofs of Concept (PoCs): ‘Smart Mobility Statistics’ (PoC1), ‘Smart Business Cycle Statistics’ (PoC2) and ‘Smart Labour Market Statistics’ (PoC3). The focus of this paper is PoC2, which explores how economic indicators can be derived from satellite imagery. Business cycles are important economic phenomena. The Gross Domestic Product (GDP) of developed countries grows in cycles around a positive trend. These cycles have an enormous influence on society’s welfare and well-being. Essentially, business cycles reflect the workload of the economic production factors labour and capital. The workload of the production factor labour is highly correlated with the employment rate and, through this, with the income of most households. Fluctuations in the use of the factor capital have an influence on investments or the income of capital owners, for example. All these effects can reinforce or stabilize business cycles and thereby influence the growth of GDP. Because of the influence of business cycles on income and wealth, the economic parameters that are responsible for the cycles are of core interest to politicians. Their goal is to stabilize the growth of GDP through economic policy. For this purpose they need information about the state of the business cycle. This can be obtained, for example, by forming indicators of the business cycle and combining them into a system [1]. Furthermore, it is important that this information is of high quality and up to date. Traditional methods of reporting GDP in official statistics work very well and have a high accuracy. However, the reporting process is complex and introduces a time lag of several weeks to publication. The goal of ‘Smart Statistics’ is to reduce this time lag by deriving indicators from economic activities which are visible in satellite images. Satellite images are available with a short delay of only a few hours. The processing of the data and the detection of economic activities can also be done comparatively fast, thus allowing a publication of economic indicators with a delay of only a few days. While these indicators are based on auxiliary data and cannot be expected to have the same accuracy as traditional methods of determining GDP, they can help to determine the state of the business cycle in almost real time.

A bootstrap method for estimators based on combined administrative and survey data 169 Sander Scholtus, (Email) Statistics Netherlands, The Hague, Netherlands Administrative data are being used ever more frequently in the production of official statistics. In many cases, available registers cannot meet all demands made by users of official statistics and therefore have to be supplemented by other data sources, most notably sample surveys. To ensure that the resulting statistics are of sufficient quality, it is necessary to evaluate their accuracy – in particular, their variances. In general, estimating the variance of an estimator based on combined administrative and survey data is not a trivial task. In this paper, a generic bootstrap method is described for this purpose. As a running example, we consider variance estimation for frequency tables in the Dutch virtual population Census, where mass imputation may be used to estimate missing values of educational attainment.
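To make the idea concrete, here is a generic Python sketch, not the paper's exact algorithm: a register provides an auxiliary class for every unit, a survey observes the target variable for a subsample, missing values are mass-imputed from class means, and the survey units are resampled with replacement (the register is held fixed) to obtain a bootstrap standard error. All data are synthetic.

```python
# Illustrative bootstrap for an estimator combining a full register with a
# survey-based mass imputation (a generic sketch, not the paper's algorithm).
import numpy as np

rng = np.random.default_rng(42)

def estimate(register_x, survey_idx, survey_y):
    """Mass-impute y for non-sampled units from survey class means,
    then return the estimated population share of y == 1."""
    y_full = np.full(register_x.shape[0], np.nan)
    y_full[survey_idx] = survey_y
    for c in np.unique(register_x):
        in_c = register_x == c
        observed = in_c[survey_idx]
        p_c = survey_y[observed].mean() if observed.any() else survey_y.mean()
        y_full[in_c & np.isnan(y_full)] = p_c        # impute the class mean
    return y_full.mean()

# Synthetic register of 10,000 units and a simple random survey of 500.
register_x = rng.integers(0, 4, size=10_000)
survey_idx = rng.choice(10_000, size=500, replace=False)
survey_y = rng.binomial(1, 0.2 + 0.1 * register_x[survey_idx])

point = estimate(register_x, survey_idx, survey_y)

# Bootstrap: resample survey units; the register is treated as fixed.
reps = []
for _ in range(500):
    b = rng.choice(len(survey_idx), size=len(survey_idx), replace=True)
    reps.append(estimate(register_x, survey_idx[b], survey_y[b]))
print(point, np.var(reps, ddof=1) ** 0.5)            # estimate and bootstrap s.e.
```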

Towards an integrated view of modernisation models 077 Marina SIGNORE, (Email) Istat, Rome, Italy Several reference models have been developed in order to support the modernisation of official statistics: the Generic Activity Model for Statistical Organisations (GAMSO), the Generic Statistical Business Process Model (GSBPM), the Generic Statistical Information Model (GSIM), and the Common Statistical Production Architecture (CSPA). These reference models are maintained by the UNECE and endorsed by the High-Level Group for the Modernisation of Official Statistics (HLG-MOS). The paper reports on the implementation of the models in different countries and the benefits and challenges faced in this process. It discusses the need for an integrated view of the modernisation models as a tool to overcome some of the difficulties experienced by countries. Finally, it presents new activities that the Supporting Standards Group will undertake in 2019 in order to facilitate the understanding of the interrelationships between the different models and of the benefits that could be expected from implementing two or more models in a coordinated way.

Wikipedia online activity data for temporal disaggregation of tourism indicators 134 Serena Signorelli, (Email) 1, Fernando Reis, (Email) 2 1 Independent researcher, Bergamo, Italy 2 EUROSTAT, Luxembourg, Luxembourg Wikipedia is a widely known on-line encyclopedia used by people all over the world. People leave digital traces of their interactions with Wikipedia in several forms, such as content contributed to the articles, the history of edits to articles (when and what type of edits), discussions (each article has a discussion page) and the history of access to the articles. There has been previous work on the assessment of the potential use of these digital traces for the production of relevant statistics. The research presented in this paper builds on those previous attempts and consists of the use of Wikipedia page views data for the temporal disaggregation of tourism indicators. These indicators are available at Eurostat in tourism statistics broken down by NUTS 2 region at annual level and at monthly level for the whole country. There is a policy need for having these indicators broken down simultaneously in space (NUTS 2) and in time (monthly), in order to obtain values at monthly level for each NUTS 2 region. This spatio-temporal disaggregation could be done simply by assuming independence between space and time; however, that is not a reasonable assumption as regions tend to have different tourism profiles. One way to do a more accurate spatio-temporal disaggregation of tourism indicators is to use an auxiliary variable which captures these different temporal profiles between regions. Data on the consultation of Wikipedia articles related to tourism points of interest (e.g. monuments, cultural sites) by tourists when planning their trips have the potential to capture the differing temporal profiles of the various regions. The aim of the research is to synthesise the relevant signal in Wikipedia page views data into one (or a few) indicator(s) and then use it (them) to perform the spatio-temporal disaggregation of tourism indicators.
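The pro-rata disaggregation described above can be sketched in a few lines of Python; the annual regional total and the monthly page-view profile below are hypothetical:

```python
# Minimal sketch of pro-rata temporal disaggregation: an annual regional tourism
# total is split into months in proportion to a monthly auxiliary series (here,
# hypothetical Wikipedia page views for the region's points of interest).
import numpy as np

annual_nights_region = 1_200_000                    # annual figure for one NUTS 2 region
monthly_page_views = np.array([40, 35, 50, 60, 80, 120,
                               180, 200, 110, 70, 45, 40], dtype=float)  # thousands of views

monthly_shares = monthly_page_views / monthly_page_views.sum()
monthly_nights_estimate = annual_nights_region * monthly_shares

print(monthly_nights_estimate.round())              # sums back to the annual total
```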

A data-driven approach to urban digitalization 189 Giuseppe Sindoni, (Email), Silvia Castellanza, (Email) Comune di Milano, Milano, Italy The Milan Municipality’s digital plan is inspired by the strategic goals of the Council and of the National Agency for the digitalization of Public Administration (AGID). The strategic goals include technical innovation, transparency, public involvement, development, liveability and sustainability. The strategic vision for IT that supports the above activities is based on the following principles: • Information sharing in “open data” mode • “Mobile first” approach • Safe, scalable and harmonized IT architectures • Well-defined application interfaces • Full compliance with cyber-security standards • Adoption of enabling platforms

AnalevR: An Interactive R-Based Analysis Platform as a Service for Utilizing Official Statistics Data in Indonesia 049 Erika Siregar, (Email), Aris Prawisudatama, (Email) BPS-Statistics Indonesia, Jakarta, Indonesia The process of making decisions, drawing conclusions, and estimating outcomes requires fast and easy access to up-to-date and reliable information. As Indonesia’s national statistical agency, BPS-Statistics Indonesia produces a massive amount of wide-ranging strategic data every year. The use of these official statistics data has expanded to non-government groups such as researchers, students, and businesses. However, these data are still underutilized by the public due to technical limitations (lack of skilled and experienced employees) and raw data exclusivity and locality (distributed and separately stored microdata). Other issues such as bureaucratic procedures, long waiting times, and the cost of purchasing the data have also contributed to worsening the situation. In line with Indonesia’s National Bureaucratic Reform Program, BPS-Statistics Indonesia aims to create innovations in public services by modernizing the way people benefit from the data. We introduce AnalevR, an online R-based analysis platform for accessing, analyzing, and visualizing official statistics data for free, without having to own the original microdata. Data and analysis modules are put in cloud storage and can be explored via the menu provided. The analysis is performed inside the workspace, either using Graphical User Interface (GUI) mode (menu and dialog) or non-GUI mode (syntax editor). AnalevR executes the R-based code remotely and displays the result in the output container. We use R as the underlying engine because it has a more complete collection of libraries for analysis and visualization than other languages. All user-defined variables and functions are automatically saved in the workspace for future use. AnalevR comes with many benefits: 1) an all-in-one concept - it supports acquiring, wrangling, analyzing, and visualizing data; 2) increased efficiency, by reducing the time lag between data request and analysis; 3) flexibility - free access, easy to operate, multiple workspaces, support for both R users and non-R users; 4) sustainability - support for user collaboration. AnalevR is built upon a variety of technologies: R for the server, ReactJS and PHP for the client, and Redis and Webdis as the message broker. These technologies give AnalevR the ability to provide better service and performance than similar tools that only use a single technology. AnalevR is currently at an experimental stage and the prototype is up and running on http://simpeg.bps.go.id/analev-r. This project is open source with code available on https://github.com/erikaris/analev-r. We believe that AnalevR will also be of interest to other countries that face similar data underutilization problems related to the bureaucratic and regulatory barriers to obtaining data. For future work, we plan to make it compatible with other languages such as Python and Java. This innovation will raise user involvement in employing BPS’ data, promote the use of R, and ultimately increase statistical quality in Indonesia.

Statistics' dissemination metadata 195 Josep Sort, (Email), Josep Jimenez, (Email), Estela Tonzan, (Email) Institut d'Estadística de Catalunya (Idescat), Barcelona, Spain The paper proposes a way of transforming an official statistical website (which provides a means of access to statistical results designed to be primarily understood by expert users) into a new website model that satisfies the information needs of all kinds of users. The innovation adopted has been to create an additional tier of metadata on top of the existing metadata system. In other words, new specific metadata have been added to the statistical data for publishing the results on the website (dissemination metadata). These metadata assist in managing the distribution of the statistical data over different tiers of use, in different products, displayed on the website in accordance with the users' needs. The dissemination metadata, which are automated, guarantee better management and better web usability and increase user satisfaction. Search engine tools solve data retrieval well mainly when the user already knows what kind of information they are looking for; the problem addressed here is how to guide a user who does not know what data exist on a specific topic. Statistical websites have not solved this issue because of its complexity. We believe that classic methods of documentation and archiving applied to metadata can be used to address it. In addition, websites must be organized with different levels of information so that users can quickly find what they are looking for, from a specific indicator to the whole statistical database. Creating good statistics is as important as facilitating their use by ordinary citizens, and this is our main objective. Keywords: metadata, statistical dissemination, website management, API, web usability, users' statistical needs.

Well-being indicators for national and local policies in Italy 039 Maria Pia Sorvillo, (Email) 1, Stefania Taralli, (Email) 2 1 Istat, Rome, Italy 2 Istat, Ancona, Italy The paper aims to present the state of the art of the well-being indicators used in connection with Italian economic policy, and possible extensions to the programming documents of local governments. The starting point is the BES project (from the Italian acronym Benessere Equo e Sostenibile - Equitable and Sustainable Well-being), developed at Istat since 2010, which aims to measure well-being in Italy and its regions. It takes into account not only well-being levels, but also their distributional aspects and equity, in addition to the conditions needed to preserve at least the same level of well-being for future generations, i.e. sustainability. The project is in line with international experiences and with the recommendations of the Stiglitz, Sen, Fitoussi Report on the Measurement of Economic Performance and Social Progress. Since the beginning, the BES project has had two aims: to inform all stakeholders about the state and the evolution of well-being at the national and regional level, and to build a set of measures, with the quality level granted by official statistics, to support the policy cycle. The first goal was achieved by setting up a measurement framework including 12 domains, covering material well-being and other aspects of quality of life, illustrated by means of about 130 indicators, and an annual report that is now in its fifth edition. The second found a first implementation in 2016 with the approval of a law reforming the Budget Law, establishing that well-being indicators have to be considered in the economic policy process, with an analysis of recent trends and simulations of the expected evolution under a trend and a policy scenario. Starting from the theoretical framework defined to measure well-being at the national level, and consistently with it, in 2011-2012 Istat launched two projects to measure BES at the local level: “Provinces’ BES” and “UrBES”. In both cases, the goal was to identify and measure the most suitable indicators to address some specific issues concerning well-being assessment at the sub-regional and local level, as a support for policymaking.

Outlier Detection Methods for mixed-type and large-scale data like Census 131 Frantisek Hajnovic, (Email), Alessandra Sozzi, (Email) Office for National Statistics, London, United Kingdom Outlier detection (OD) refers to the problem of finding patterns in data that do not conform to expected normal behaviour. OD has been a widely researched problem and finds immense use in a wide variety of application domains. In this paper we consider the domain of building automated OD methods for quality assuring the 2021 UK Census. The scale and nature of such a dataset pose computational challenges to traditional OD methods. In general, the scale of the full Census is too large for a sequential execution of the OD methods. Most of the methods scale super-linearly with the size of the dataset and need either a distributed implementation or separate runs of the algorithm on chunks of the dataset. Additionally, Census questions are of mixed type (numeric, categorical, ordinal, free-text and date), and detecting outliers in this multi-dimensional space is an open area of research with no optimal solution yet. Experience from previous census processing shows that it is easy to be overwhelmed with data quickly, and a mechanism for pointing in the right direction will save huge amounts of time and improve quality where it is needed most. It will also help minimize the risk of serious errors by identifying them earlier. This work is being carried out and will culminate in the development of a set of lightweight tools ready to be tested on the mid-2019 UK Census rehearsal. Ultimately, these could be run against the full-scale 2021 Census data in a distributed fashion to automatically flag anomalous observations in the dataset. The latest results of experiments with such methods will be the main focus of the presentation.
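One common way of coping with the mixed types and the scale, sketched here purely as an illustration and not necessarily the method the ONS will adopt, is to encode categorical columns by their category frequencies and score the records chunk by chunk with an Isolation Forest:

```python
# Hedged sketch of one possible approach (not the ONS method): outlier scoring of
# mixed-type records by mapping categorical columns to category frequencies and
# running an Isolation Forest on manageable chunks so the computation scales.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def encode_mixed(df: pd.DataFrame) -> pd.DataFrame:
    """Map categorical columns to their relative frequency; leave numerics as-is."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            out[col] = df[col].fillna(df[col].median())
        else:
            freq = df[col].value_counts(normalize=True)
            out[col] = df[col].map(freq).fillna(0.0)
    return out

def score_in_chunks(df: pd.DataFrame, chunk_size: int = 100_000) -> np.ndarray:
    """Fit an Isolation Forest per chunk; lower scores are more anomalous."""
    scores = np.empty(len(df))
    for start in range(0, len(df), chunk_size):
        chunk = encode_mixed(df.iloc[start:start + chunk_size])
        model = IsolationForest(n_estimators=100, random_state=0)
        scores[start:start + chunk_size] = model.fit(chunk).decision_function(chunk)
    return scores

# Toy usage mixing a numeric and a categorical variable.
toy = pd.DataFrame({"age": [34, 36, 35, 120], "occupation": ["nurse", "nurse", "teacher", "nurse"]})
print(score_in_chunks(toy, chunk_size=4))
```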

On the Operational Definition of Homogeneous Products in Transaction Data 154 Olivia Ståhl, (Email) Statistics Sweden, Stockholm, Sweden Statistics Sweden has been using transaction data in its monthly production of the Consumer Price Index since 2012. Since then, more transaction data of various forms have been introduced, and are continuously being introduced, into the Swedish CPI. As a result, it has become increasingly evident that methodologies need to evolve in order to make the best use of this new type of data. One of the more urgent matters is the need for a unified treatment of the concept of homogeneous products. The impact that a particular price change in transaction data has on an index is often highly dependent on how the distinction between homogeneous and heterogeneous products has been made in practice. The method used for partitioning transaction data into homogeneous products should therefore, in our opinion, be considered one of the most important features when discussing comparability between countries. In this paper we will report on preliminary results from work currently being done at Statistics Sweden in this area.

Estimating Enterprise Characteristics from Web Data: Achievements and Future Developments 048 Monica Scannapieco, (Email) 1, Peter Struijs, (Email) 2, Galya Stateva, (Email) 3 1 ISTAT, Rome, Italy 2 CBS, Hague, Netherlands 3 BNSI, Sofia, Bulgaria The Internet is one of the most interesting Big Data sources for Official Statistics. Indeed, while for other sources, like mobile phone data or smart meters, there is the need to engage in partnerships with their providers, Internet data are publicly accessible. Internet as a Data Source (IaD) data can be used in substitution for or in combination with data collected by means of traditional survey-based instruments. In the case of substitution the aim is to reduce respondent burden; in the case of integration the main goal is to increase the accuracy of the estimates. Among the possible uses of IaD, data from enterprise websites are particularly relevant for Official Statistics. During the last few years, the vast majority of enterprises acquired an Internet domain in order to set up an official website, thus making available (almost) for free much information that was previously available only via traditional collection systems. Hence, it is recognized as an opportunity for National Statistical Institutes to collect and to mine the publicly available information on these websites to describe a wide range of phenomena in near real time. Given this context, the ESSnet Big Data Pilots I was launched by Eurostat in early 2016 and concluded in July 2018. Within this project, the purpose of the workpackage “Web Scraping of Enterprise Web Sites” was to investigate whether web scraping, text mining and inference techniques could be used to collect, process and improve general information about enterprises. The project will have a follow-up, namely the ESSnet Big Data Pilots II, which again includes a specific workpackage, “Enterprise Characteristics”, aiming at bringing the piloting activities carried out within the first ESSnet project to an implementation stage. In this paper, we first summarize the results achieved within the first project; we then highlight the main developments foreseen for the future project activities.

Using Big Data for Official Statistics: Web Scraping as a Data Source for Statistical Business Registers (SBRs) 072 Donato Summa, (Email), Gianpiero Bianchi, (Email), Monica Consalvi, (Email), Barbara Gentili, (Email), Flavio Pancella, (Email), Francesco Scalfati, (Email) Istat, Rome, Italy The approach of the Italian National Institute of Statistics (Istat) with respect to the new complexity of both phenomena and data has been to adopt new strategies to integrate data from traditional surveys, administrative bodies and innovative sources such as Big Data. The aim is to reduce the statistical burden on respondents while enriching the offer, the quality and the timeliness of the information produced, always bearing in mind that statisticians working in an NSI should be researchers and producers at the same time, and should always guarantee the quality of official statistics. Accordingly, a project was launched for the enlargement of the informative content of the SBR to provide concrete support for statistical production, taking advantage of the opening of the new Istat Laboratory for Innovation (LabInn), which provides useful infrastructure to strategic research projects in a dedicated physical and technological space. To proceed in a structured and integrated manner, a register-based approach to Big Data was chosen, placing the register at the centre. The main idea was to use Big Data as an additional source in the SBR updating process, through web scraping and text mining technologies, with the aim of integrating the ‘structured’ business data with the ‘unstructured’ data coming from web pages. Furthermore, the new information on enterprises will be used to start a more detailed statistical analysis, finding new classifications and new taxonomies to support a better interpretation of new emerging economic phenomena.
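As an illustration of the kind of scraping step described above (a hedged sketch only; the URL and keyword lists are hypothetical and this is not Istat's production code), an enterprise website can be fetched and reduced to a few indicator variables for the SBR:

```python
# Illustrative web-scraping step: fetch an enterprise website and derive simple
# indicator variables (e-commerce, job adverts, social-media presence) from its
# text and links. The URL and keyword lists are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

KEYWORDS = {
    "ecommerce": ["cart", "checkout", "add to basket", "webshop"],
    "job_adverts": ["careers", "vacancies", "join our team"],
}

def scrape_indicators(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ").lower()
    links = " ".join(a.get("href", "") for a in soup.find_all("a")).lower()
    indicators = {k: any(w in text for w in words) for k, words in KEYWORDS.items()}
    indicators["social_media"] = any(s in links for s in ("facebook.com", "linkedin.com", "twitter.com"))
    return indicators

# print(scrape_indicators("https://www.example-enterprise.eu"))  # hypothetical URL
```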

Implementation of the Statistical Territory Register 083 Eduard Suñé, (Email) 1, Anna Bernaus, (Email) 1, Daniel Ibáñez, (Email) 1, Roser Condal, (Email) 2, Mireia Farré, (Email) 1, Cristina Rovira, (Email) 1 1 Statistical Institute of Catalonia (Idescat). General Sub-Directorate of Production and Coordination, Barcelona, Spain 2 Statistical Institute of Catalonia (Idescat). General Sub-Directorate of Information and Communication, Barcelona, Spain The Statistical Institute of Catalonia (Idescat) is constructing a production system, mainly based on administrative registers, made up of three different subsystems: the Statistical Population Register (REP), the Statistical Territory Register (RET) and the Statistical Entity Register (REE). The REP and REE subsystems manage all the events which affect the microdata studied (population in the case of the REP, and companies and entities in the case of the REE). These microdata contain information on postal addresses, allowing their geolocation. The Statistical Territory Register (RET) is a spatial information subsystem whose main purpose is the geolocation and validation of the postal addresses which appear in the other subsystems. In this paper, we outline the geocoding and validation processes performed by the RET and describe the methods for the quality control of the basic information permitting the validation of postal addresses.

Estimation of the number of post-Soviet foreigners in Poland in 2015 and 2016 using capture-recapture methods 151 Marcin Szymkowiak, (Email) 1, 2, Maciej Beresewicz, (Email) 1, 2 1 Poznan University of Economics and Business, Poznan, Poland 2 Statistical Office in Poznan, Poznan, Poland Abstract text is uploaded in pdf file.

Small area estimation of the LFS-based monthly unemployment rate in Poland 158 Marcin Szymkowiak, (Email) 1, 2, Maciej Beręsewicz, (Email) 1, 2, Tomasz Józefowski, (Email) 2, Kamil Wilak, (Email) 1, 2 1 Poznan University of Economics and Business, Poznan, Poland 2 Statistical Office in Poznan, Poznan, Poland Abstract was uploaded as pdf file.

VAT Tax Gap prediction: a 2-steps Gradient Boosting approach 202 Giovanna Tagliaferri, (Email) University of Rome, Rome, Italy Tax evasion represents one of the main problems in modern economies because it results in a loss of State revenue. The aim of this project is to provide an estimate of the Italian VAT Tax Gap for the year 2011. The analysis is performed using a bottom-up approach based on compliance controls, and estimation is pursued via machine learning techniques. The observed data have been taken from two sources: the register of Irpef, VAT and Irap declarations (available for all units, actual tax revenue due unknown) and the compliance control papers, performed only on a non-random sample of units ([1]) (assessed units, actual tax revenue due known). One of the main problems of this analysis is the non-randomness of the compliance controls, which induces a selection bias in the observed sample. The final target of the analysis is to obtain trustworthy estimates of the undeclared tax base of the unassessed units. However, our model will focus on the estimation of the potential tax base (BIT), and the undeclared part (BIND) will be derived as the difference between the potential and the declared tax base.
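A generic two-step gradient-boosting sketch in Python, shown only as an illustration of the approach and not as the paper's exact specification (which must also address the selection bias of the compliance controls), first models the probability of a positive undeclared base and then its size among positives, on synthetic data:

```python
# Generic two-step gradient boosting (illustration only): step 1 models the
# probability that a taxpayer has a positive undeclared base, step 2 models its
# size among positives; the two predictions are combined for unassessed units.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 6))                                    # declared-return features (synthetic)
has_gap = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))           # any undeclared base?
gap_size = np.where(has_gap == 1, np.exp(1 + 0.5 * X[:, 1] + rng.normal(0, 0.3, n)), 0.0)

clf = GradientBoostingClassifier().fit(X, has_gap)              # step 1: P(gap > 0)
reg = GradientBoostingRegressor().fit(X[has_gap == 1], np.log(gap_size[has_gap == 1]))  # step 2: log size

X_new = rng.normal(size=(10, 6))                                # "unassessed" units
# Expected gap = P(gap > 0) * E[size | gap > 0] (retransformation bias ignored in this sketch).
expected_gap = clf.predict_proba(X_new)[:, 1] * np.exp(reg.predict(X_new))
print(expected_gap.round(2))
```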

Mining Big Data for Finite Population Inferences 007 Siu-Ming Tam, (Email) Australian Bureau of Statistics, ACT, Australia

In this paper, it is shown that Big Data sources, which are generally known to suffer from coverage bias, can be integrated with data from probability samples for efficient finite population inferences. The methods developed in this paper can be extended to address situations where the variables in the Big Data, or in the probability sample, suffer from measurement errors, and also when there is unit non-response in the probability sample. We will demonstrate the efficacy of the methods using simulation.

ClairCity: official statistics as an enabler in a citizen-led European air quality project 042 Olav ten Bosch, (Email), Dick Windmeijer, (Email), Alex Priem, (Email), Wiet Koren, (Email), Martijn Tennekes, (Email) Statistics Netherlands, The Hague, Netherlands ClairCity is a four-year European project (2016-2020) working directly with citizens and local authorities in six countries around Europe with the aim to improve air quality. Nine research institutes, six municipalities and one national statistical institute - Statistics Netherlands - work together on models, scenarios and policies to help cities decide on the best local options for a future with clean air and lower carbon emissions. This paper explains the ClairCity project in brief and highlights the enabling role of official statistics in this international, multi-disciplinary project. Six cities are partners in the project: Amsterdam in the Netherlands, Bristol in the UK, Ljubljana in Slovenia, Sosnowiec in Poland, the Aveiro region in Portugal and the Liguria region around Genoa in Italy. Each city faces different issues and causes of air pollution, but all of them are working to improve their air quality. From an official statistics point of view this creates an interesting challenge. Although local circumstances differ, a generic model-based approach is applied, which requires good-quality data on demography, citizens' behaviour, traffic patterns, energy (both production and consumption) and other aspects that may influence local air quality now and in the future. A key element of the ClairCity approach is to use existing data to drive the modelling activities. It uses data that is already collected by local, national and European institutes to develop new models of urban air pollution and carbon emissions, and scenarios to reduce emissions in the future. Although initially developed for the six participating pilot cities and regions, the outputs will be generic so that they can be reused in every European city with over 50,000 residents. This will make it easier for cities to identify changes that they can make to reduce emissions and make a positive change in people's lives. A consequence of this approach is that the project has a very broad and extensive data demand. Moreover, the data needed are usually scattered over multiple data sources and usually not very standardised. That is where official statistics come in. The identification of data, the management of multiple (possibly incompatible) data sources and their metadata, as well as the visualization of scenario effects, are fields where official statistics can, and should, add their knowledge and experience.

Reproducible maps for everyone 150 Martijn Tennekes, (Email) Statistics Netherlands (CBS), Heerlen, Netherlands Maps are important in official statistics. Statistical researchers and data scientists use maps to explore and analyse spatial data, to share their findings, and to communicate statistical output to the general public. Traditionally, maps are made with specialized desktop GIS tools, such as ArcGIS, QGIS, and GRASS. Although the set of features and functionalities of these tools is impressive, there are some main drawbacks, especially when it comes to reproducing maps. During the last decade, R has become a very powerful and mature alternative for spatial data analysis and visualisation, with the advantage of reproducible scripts. Moreover, working with spatial data has become much easier; you don't need a master's degree in computer science to create maps with R. In this presentation, we illustrate how easy mapping has become using some state-of-the-art R packages.

Semantic Modeling of Official Statistics - The case of the Greek Statistics 144 Stamatios Theocharis, (Email) Ministry Of Interior, Athens, Greece University Of Piraeus, Piraeus, Greece The institutional and legal framework of the Hellenic Statistical System - ELSS, in combination with the huge number and complexity of the statistics produced by its stakeholders, requires full support from computing systems through the automation of information retrieval and knowledge management. To this end, we examine in this paper the possibilities provided by semantic web software tools. In particular, in order to model the field of the ELSS, we developed an OWL ontology using Protégé, and we present search results on the data of the corresponding knowledge base that we created in parallel with the ELSS ontology.

Big Data sentiment analysis in the European Bond Markets 155 Luca Tiozzo, (Email), Sergio CONSOLI, (Email) JRC - Joint Research Center - European Commission, Ispra, Italy The surge in Euro area yield spreads has fueled an intense debate about their determinants and the sources of risk. Over the last months, financial markets have been concerned about the possibility that the new Italian government will not be able to undertake important economic reforms. Similarly, in 2017, financial investors suffered from unclear economic policy perspectives during the French presidential campaign. In both cases, investor sentiment about the countries' economic prospects deteriorated, producing an increase in national interest rates with respect to their German counterpart. Moreover, during the French election period, the negative mood propagated to other European countries without solid fiscal fundamentals (i.e. Spain and Italy). Therefore, it is clear that financial investors' sentiment plays an important role in determining interest rate dynamics within a country but also across countries. The main contribution of this paper is threefold. First, we capture financial investors' sentiment by using textual information in newspapers. We exploit the new Big Data dataset, the Global Database of Events, Language and Tone (GDELT), and construct news-based sentiment and uncertainty indexes for the different Euro area countries. The aim is to test the hypothesis that an increase in political uncertainty, i.e. unclear political guidelines, may cause a deterioration in domestic investors' sentiment with a consequent rise in national interest rates. Second, we introduce a new approach to account for possible spillovers of political risks across different economies of the Euro area. In contrast to the existing literature, where uncertainty measures assess domestic media perception of domestic news (Bloom, 2014), here we also study how domestic investors, through the lens of domestic media coverage, perceive facts happening in other European countries. The objective is to understand when worries in a country may be transmitted to another European country and how this mechanism could shape the behaviour of investors and the dynamics of yield spreads in the European bond market. Finally, following Andritzy (2012) and Gennaioli, Martin and Rossi (2014), we evaluate to what extent interest rate dynamics determined by our new sources of risk would affect the composition of banks' sovereign bond portfolios in the Eurozone.

On robustness of the supervised multiclass classifier for autocoding system 065 Yukako Toko, (Email) 1, Shinya Iijima, (Email) 1, Mika Sato-Ilic, (Email) 1, 2 1 National Statistics Center, Tokyo, Japan 2 University of Tsukuba, Ibaraki, Japan We developed a supervised multiclass classifier for autocoding in our previous studies. The purpose of this paper is to investigate the robustness of our classifier in order to apply it in practice to coding tasks in the field of official statistics. Text response fields, such as fields for occupation, industry, and household income and expenditure, are sometimes found on survey forms in the field of official statistics. The text descriptions provided by respondents are usually translated into corresponding classification codes for efficient data processing. Originally, coding tasks were performed manually, but the importance of automated coding has been increasing with the improvement of computer technology in recent years. Therefore, studies focused on developing algorithms for autocoding have appeared in the field of official statistics. For example, Hacking and Willenborg (2012) introduced coding methods including autocoding techniques. Gweon et al. (2017) illustrated some methods for automated occupation coding based on statistical learning. As mentioned above, we also developed a supervised multiclass classifier and applied it to the coding task of the Family Income and Expenditure Survey in Japan. Originally, our classifier was developed based on a simple machine learning technique, and it performs exclusive classification (Toko et al., 2017; Tsubaki et al., 2017; Shimono et al., 2018). However, the classifier incorrectly assigns classification codes for some objects with ambiguous information because of semantic problems, interpretation problems, and insufficiently detailed input information. As we found that the main reason for these problems is the unrealistic restriction that one object is classified to a single class, we developed a new classifier that allows the assignment of one object to multiple classification codes, with the calculation of newly defined reliability scores utilising the idea of our previously proposed algorithm based on partition entropy (Toko et al., 2018 (a); Toko et al., 2018 (b)). Although we improved the classification accuracy of our classifier in our previous studies, to apply the classifier in a practical situation we should consider not only the classification accuracy but also the robustness of classification. A classifier for an autocoding system requires robustness for stable code assignment, since the style of text descriptions is not always stable even within the same survey, as it depends on respondents. This study investigates the robustness of our classifier with a numerical example using noise-added survey data.
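A toy illustration of an entropy-based reliability score, not the authors' exact definition: the classifier's scores over candidate codes are normalised, their entropy is compared with its maximum, and close runner-up codes are returned alongside the best one when the assignment is ambiguous:

```python
# Toy entropy-based reliability score for autocoding (illustrative definition only):
# low entropy over candidate codes means a confident single assignment, high entropy
# suggests returning several candidate codes instead of forcing one.
import numpy as np
from scipy.stats import entropy

def reliability(scores: dict) -> tuple:
    codes = list(scores)
    p = np.array([scores[c] for c in codes], dtype=float)
    p = p / p.sum()
    rel = 1.0 - entropy(p) / np.log(len(p)) if len(p) > 1 else 1.0
    top = [c for c, pi in zip(codes, p) if pi >= 0.8 * p.max()]   # candidate codes worth returning
    return rel, top

print(reliability({"0111": 0.70, "0112": 0.20, "0113": 0.10}))   # confident: one code
print(reliability({"0111": 0.35, "0112": 0.33, "0113": 0.32}))   # ambiguous: several codes
```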

Re-identification risk in mobile phone data 093 Fabrizio De Fausti, (Email), Roberta Radini, (Email), Luca Valentino, (Email), Tiziana Tuoto, (Email) Istat, Roma, Italy The use of mobile phone data (MPD) for official statistical purposes is increasing. MPD are very useful for statistics related to population, migration and mobility. For instance, in developing countries, MPD can provide updated estimates of population density in the absence of other sources. In developed countries, population registers are often available; however, MPD provide more timely and detailed information, describing habits and behaviours that are not reported in the registers but are also important for policy-making, e.g. human mobility or population density at a given time and place. Moreover, MPD can be used as auxiliary information for topics like poverty or SDG indicators. Finally, MPD allow evaluating the coverage of population registers. However, the utility of MPD should be balanced against the risk of violating the privacy of personal data. In fact, even if MPD are provided without direct identifiers (e.g. name, surname, date of birth, address, personal tax code), we cannot state that they are anonymous. Several works have claimed that it is possible to isolate a subject in an MPD database, to link the MPD to subjects in different databases, or to deduce, with significant probability, a characteristic of a subject from the MPD. So, MPD should be considered personal data according to the GDPR, with an evaluation of the risk of re-identifying a person even if personal data have been de-identified, encrypted or pseudonymised. Hence, to allow the use of MPD in a privacy-preserving framework, a data protection impact assessment should be carried out. This means describing the planned processing operations, assessing the risks to privacy and planning the measures to address those risks. In this work, we investigate the privacy risk and provide statistical measures. In cooperation with a mobile phone provider, we apply our investigation to real data. We focus on the usage of MPD in an NSI and on the privacy attacks and privacy risks that are likely to occur even when MPD are provided without identifiers. In particular, we devote our attention to the case in which the external knowledge comes from statistical population registers and employee-employer databases, and these can be compared to the MPD in order to identify a single user. Hence, we consider privacy attacks such as the “Home and Work Attack”, where an intruder knows the two most frequent locations of an individual and their frequencies. We provide measures of the privacy risk, which are the first step in preparing the use of MPD in official statistics in a privacy-protected environment. Risk assessment is one of the fundamental elements in defining the processing of data and its integration with security policies and privacy protection: this is the essence of the principle of privacy by design.
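The risk measure for the “Home and Work Attack” can be sketched as follows: pseudonymised users are grouped by their two most frequent locations and the share of users who are unique, or hidden in a group of at least k, is computed (the table below is a hypothetical toy example):

```python
# Sketch of a re-identification risk measure for the "Home and Work Attack":
# group pseudonymised users by their (home cell, work cell) pair and measure how
# many are unique, i.e. identifiable by an intruder who knows those two locations.
import pandas as pd

# Hypothetical table: one row per user with the two most frequent antenna cells.
users = pd.DataFrame({
    "user": ["a", "b", "c", "d", "e"],
    "home_cell": [101, 101, 102, 103, 103],
    "work_cell": [201, 201, 205, 209, 209],
})

group_sizes = users.groupby(["home_cell", "work_cell"])["user"].transform("size")
share_unique = (group_sizes == 1).mean()           # users alone in their (home, work) pair
share_k_anonymous = (group_sizes >= 3).mean()      # users hidden in a group of at least k = 3
print(share_unique, share_k_anonymous)
```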

A generic data-api for implementing GSIM: Linked Data Store 160 Brynjar Ulva, (Email) Statistics Norway, Oslo, Norway A national statistical office is modernizing its systems for statistical production. A key standard in this field is GSIM (the Generic Statistical Information Model) (1). In the modernization of a national statistical office, the information platform is a key component. The platform is intended to logically store all metadata and data; however, we failed to discover a reusable implementation that fits the envisioned microservice architecture. In this paper we introduce the Linked Data Store (LDS) (2). The Linked Data Store presented in this paper is designed and developed to support the storage of structured metadata and data based on the GSIM standard, which forms the basis for the Logical Data Model, the concrete domain-specific model to be used in the national statistical office. In Chapter 2 we discuss the drivers, motivation and main requirements for developing the LDS. In Chapter 3 we present the Linked Data Store. In Chapter 4 we present the conclusion and further work.

Engaging with users to modernise the dissemination of European statistics 183 Julia Urhausen, (Email), Maja Islam, (Email) Eurostat, Luxembourg, Luxembourg Modernising the Eurostat website and dissemination products is driven by the objective to better respond to users’ needs and to facilitate access to official statistics. Following current trends, the aim is to be more visual and attractive and also to provide more structured and precise texts replying to the most common user questions. Thus, engaging with users serves as the foundation and impetus for any changes in this modernisation process. In 2017, several user research activities were launched at Eurostat as part of the DIGICOM project – an ESS project aiming to modernise the dissemination and communication of European statistics. The aim of these user research activities was to learn more about our users and their needs, and to get recommendations on what we can do to modernise the dissemination of European statistics. Two qualitative research methods were used: field studies and usability tests. During the field studies, five personas of users of European statistics were identified and will be presented. These personas are valuable in the whole process of developing products for statistical dissemination. To demonstrate this, results of the usability studies of different dissemination products will be shown, accompanied by concrete examples of user feedback and its translation into improved dissemination services.

Estimating unmetered photovoltaic power consumption using causal models 051 Jan van den Brakel, (Email) 1, 2, Bart Buelens, (Email) 1 1 Statistics Netherlands, Heerlen, Netherlands 2 Maastricht University, Gangelt, Germany Energy accounting encompasses the compilation of coherent statistics on energy-related issues in countries, including the production and consumption of electricity. A complete picture of demand and supply of electricity must include data on electricity production outside the energy industries, such as electricity produced by domestic photovoltaic (PV) installations. These PV installations are rarely metered by distribution network operators; hence their production remains invisible to statistical agencies responsible for the energy accounts. Consequently, renewable electricity production is difficult to estimate, while monitoring it is crucially important for climate policy evaluation. In the Netherlands, the country studied in this article, an incomplete register of PV installations is available. Such registers can be used to estimate the power produced by PV installations using a modelling approach relating installed capacity to produced electricity. In the present article we propose inferring solar power production from causal relations between solar irradiance and consumption of grid power. Since the production of solar power by domestic PV installations results in a reduced consumption of electricity from the high-voltage grid, the combination of time series of electricity exchange on the high-voltage grid and series of solar irradiance contains a hidden signal of unmetered solar power produced by domestic PV installations. In this paper a causal model for these time series to estimate unmetered solar power production is developed at a daily frequency. The final analysis is based on ARIMAX models. Our estimates are compared with the official statistics on produced solar power published by Statistics Netherlands. These estimates are based on an incomplete register of PV installations and assumptions about their average capacity. We conclude that our model estimates are in line with these official statistics. While official statistics are at annual level, our modelling approach produces daily estimates. In contrast with the regular official statistics, no administrative or survey data on PV installations in the country were required. Hence, the proposed model can be applied easily, quickly and widely, and could be particularly useful in countries where no good estimates of unmetered PV electricity are available yet.
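A compressed sketch of the modelling idea, under assumptions much simpler than the paper's (synthetic data, a plain SARIMAX fit, and recovery of the PV signal as the irradiance coefficient times irradiance):

```python
# Simplified ARIMAX sketch: regress daily grid consumption on solar irradiance;
# the irradiance coefficient is expected to be negative, and its product with
# irradiance gives an estimate of the unmetered PV production hidden in the load.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(3)
days = pd.date_range("2018-01-01", periods=365, freq="D")
irradiance = pd.Series(3 + 2 * np.sin(2 * np.pi * (np.arange(365) - 80) / 365) + rng.normal(0, 0.3, 365),
                       index=days, name="irradiance")                       # synthetic kWh/m2/day
true_pv = 1.5 * irradiance                                                   # unmetered production (GWh)
grid_load = 30 + rng.normal(0, 1, 365).cumsum() * 0.05 - true_pv             # observed grid consumption

model = SARIMAX(pd.Series(grid_load.to_numpy(), index=days), exog=irradiance, order=(1, 1, 1))
res = model.fit(disp=False)
estimated_pv = -res.params["irradiance"] * irradiance                        # daily PV estimate
print(round(float(res.params["irradiance"]), 2), round(float(estimated_pv.sum()), 1))
```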

Systematic data cleaning using R 071 Mark van der Loo, (Email), Edwin de Jonge, (Email) Statistics Netherlands (CBS), Den Haag, Netherlands Over the last decades there have been several attempts to set up frameworks for statistical data processing and statistical data cleaning. One of the key notions is that a data cleaning procedure can be decomposed into a sequence of fundamental steps, where each step is controlled by external information defined by experts. In this model, some imperfect data set is input for a processing step. The processing step is generally parameterized by two types of metadata. First, a set of validation rules describes the desired ultimate state of the data set. Second, there are parameters that control the details of the process. For example, if the processing step concerns an imputation procedure, an imputation model specification may enter as a parameter. The process then yields an improved dataset while keeping a log of its activities that can be used for monitoring. In this presentation we demonstrate how a set of tools built in R can be flexibly combined to follow precisely this model.

Exploring and integrating data from digital farm reporting, smart surveying, crowdsourcing and Sentinels 178 Marijn van der Velde, (Email) European Commission, Joint Research Centre (JRC), Ispra, Italy In the coming years, several new and non-traditional data collection approaches will evolve to become complementary to traditional sources of data. In this presentation we will highlight some of these developments and give examples relevant to the agricultural domain. We will focus on exploring and integrating data from digital farm reporting, smart surveying, crowdsourcing, and Sentinels. In rural areas, digital agriculture will develop further and benefit from Copernicus Sentinels data streams, GNSS, rural digital networks, integration with information collected from farm management systems, machine sensors, and third-party collected information. New approaches to collect in-situ data complementary to the high spatial and temporal resolution of Copernicus Sentinel satellite observations are needed. For instance, increasingly open access to digital agricultural parcel registration systems and targeted smart ground-surveying can provide high-quality and timely in-situ data for training and validation. Non-traditional approaches such as active or opportunistic crowdsourcing have the potential to become sampling tools that are complementary to traditional approaches such as LUCAS. At the same time, citizen science approaches have been very successful in gathering relevant data, but also in raising wider public awareness and inspiring successful participatory approaches to governance and decision making. Assessing the robustness and reliability of such non-traditional sources is now key. To provide useful datasets, citizen science and crowdsourcing activities require the implementation of unambiguous collection and quality assurance procedures. Fostering knowledge exchange, innovation, and digitalization in rural areas is crucial to improve the environmental and climatic performance of European farms. Digitalization can change farming for the better, by making better use of inputs, by deploying autonomous and precise machinery, and can even result in changes in the supply chain. Whereas reporting for CAP and greening requirements is often seen as a considerable burden to farmers, there is considerable, and largely underexploited, potential in deriving relevant indicators from digital farm management tools. The drive towards simplification in this context opens a window of opportunity for both farmers and authorities to consider novel ways of information exchange.

Evaluating multilateral index methods on scanner data 101 Ken Van Loon, (Email) Statistics Belgium, Brussels, Belgium Statistics Belgium has been using scanner data from supermarkets in the calculation of the CPI since 2015. The applied method is a version of the so-called “dynamic method” using an unweighted chained Jevons index. Unweighted means that the turnover information at the product level is currently not explicitly used. Incorporating the available turnover information explicitly into chained monthly index calculation (e.g. superlative formulae such as Törnqvist) leads to chain drift. However, using such turnover information could lead to a more representative index calculation; therefore, methods have been proposed that make it possible to use this information while calculating drift-free indices. These methods are called multilateral methods, because they use information from more than two periods. These multilateral methods (GEKS-Törnqvist, Time Product Dummy, Geary-Khamis and the augmented Lehr index) are evaluated and compared with the dynamic method. These comparisons and evaluations will be presented. It will be shown that the differences between the methods are not that large, apart from the augmented Lehr index. To calculate non-revisable indices, rolling windows have to be used together with various splicing and extension options. These are also evaluated by applying them to scanner data. It will be shown that some of these options might still cause some drift. A final issue that will be highlighted is how product relaunches (e.g. the same product with smaller content) are dealt with. We currently combine text mining with manual verification; more efficient ways of creating homogeneous product groups will be examined. The results will show that good metadata are necessary to handle relaunches.
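For readers unfamiliar with multilateral methods, the following is a textbook-style GEKS-Törnqvist sketch on a tiny price/turnover matrix; it is not Statistics Belgium's production code, and missing products are simply dropped from each bilateral comparison:

```python
# GEKS-Törnqvist sketch: bilateral Törnqvist indices between all pairs of periods,
# combined into a transitive (drift-free) multilateral index by a geometric mean
# over all possible link periods.
import numpy as np
import pandas as pd

def tornqvist(prices, turnover, s, t):
    """Bilateral Törnqvist index between periods s and t on matched products."""
    common = prices.columns[prices.loc[[s, t]].notna().all() & turnover.loc[[s, t]].notna().all()]
    w_s = turnover.loc[s, common] / turnover.loc[s, common].sum()
    w_t = turnover.loc[t, common] / turnover.loc[t, common].sum()
    return float(np.exp((0.5 * (w_s + w_t) * np.log(prices.loc[t, common] / prices.loc[s, common])).sum()))

def geks(prices, turnover, base, t):
    """GEKS index for period t relative to base: geometric mean over all link periods."""
    links = [tornqvist(prices, turnover, base, l) * tornqvist(prices, turnover, l, t) for l in prices.index]
    return float(np.exp(np.mean(np.log(links))))

periods = [0, 1, 2]
prices = pd.DataFrame([[1.0, 2.0], [1.1, 1.8], [1.2, 2.1]], index=periods, columns=["A", "B"])
turnover = pd.DataFrame([[50, 50], [60, 30], [55, 45]], index=periods, columns=["A", "B"])
print([round(geks(prices, turnover, 0, t), 4) for t in periods])
```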

New weighting methods, including explicit correction of sampling weights for non-response and attrition, in the reformed Belgian Labour Force Survey 057 Camille Vanderhoeft, (Email), Astrid Depickere, (Email), Anja Termote, (Email) Statistics Belgium, Brussels, Belgium In 2017, Statbel (Statistics Belgium) introduced a major reform of the Labour Force Survey (LFS): after 18 years of working with a continuous survey, the switch was made to a panel survey. The most important aspects of this reform are: • The transfer to an infra-annual (“quarterly” for the Belgian case) rotating panel design. A sample (called a rotation group (RG)) of private households is drawn each quarter, independently of previously drawn samples. The sample is rotating in the sense that any specific rotation group stays in the survey for 18 months (6 quarters), after which it is replaced by a new rotation group. In the panel survey, each member aged at least 15 in a selected household is asked to complete a questionnaire four times, i.e. during four waves, according to a 2(2)2 scenario: a selected household/individual is asked to complete a questionnaire during two consecutive quarters (wave 1 and wave 2), is then not in the survey during the next two quarters, and is again asked to complete a questionnaire during the next two quarters (wave 3 and wave 4). • The introduction of mixed-mode data collection techniques. In the first wave and after an introductory letter, the selected households are contacted by an interviewer and CAPI is used to collect the data. In the three follow-up waves, data can be delivered through CAWI or CATI, according to the household’s preference. • Application of the wave approach. Information on structural variables is gathered in the first wave only; information on core variables is collected in all four waves. • A revision of the weighting methods. More attention is paid to the correction of the effects of non-response (in the first wave) and panel attrition (in the follow-up waves). This resulted in a two-step weighting approach: in step 1, response probabilities are estimated through a mixed-effects logistic regression model, and aggregates of the estimated probabilities are used to correct the sampling weights; in step 2, the corrected weights from step 1 are calibrated to the population of interest (see the sketch below). The present text focuses on the latter aspect of the reform, i.e. the weighting methods. We show the effect of changing the weighting method by comparing estimates for various LFS indicators based on the new 2-step approach with estimates based on the old 1-step approach, which was used for the continuous LFS from 1999 to 2016. We argue that the new method better corrects for non-response and attrition bias. Furthermore, it will be shown that different changes in the new methodology are, to some extent, cancelling each other out, causing only moderate breaks in time series for some major indicators.
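A minimal sketch of the two-step logic under simplified assumptions (an ordinary logistic propensity model instead of the mixed-effects model, hypothetical population margins, and synthetic data):

```python
# Two-step weighting sketch: step 1 estimates response propensities, aggregates
# them into classes and corrects the design weights; step 2 rakes the corrected
# weights to known population margins. Simplified relative to the Statbel approach.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 2_000
sample = pd.DataFrame({
    "design_weight": np.full(n, 50.0),
    "age_group": rng.integers(0, 3, n),
    "region": rng.integers(0, 2, n),
})
p_true = 1 / (1 + np.exp(-(0.2 + 0.4 * sample["age_group"] - 0.3 * sample["region"])))
sample["responded"] = rng.binomial(1, p_true)

# Step 1: propensity model, aggregated into propensity classes.
X = sample[["age_group", "region"]]
sample["propensity"] = LogisticRegression().fit(X, sample["responded"]).predict_proba(X)[:, 1]
sample["prop_class"] = pd.qcut(sample["propensity"], 5, labels=False, duplicates="drop")
class_rate = sample.groupby("prop_class")["responded"].transform("mean")
resp = sample[sample["responded"] == 1].copy()
resp["corrected_weight"] = resp["design_weight"] / class_rate[resp.index]

# Step 2: rake the corrected weights to hypothetical population margins.
margins = {"age_group": {0: 40_000, 1: 35_000, 2: 25_000}, "region": {0: 55_000, 1: 45_000}}
w = resp["corrected_weight"].to_numpy()
for _ in range(20):                                   # iterative proportional fitting
    for var, totals in margins.items():
        current = pd.Series(w).groupby(resp[var].to_numpy()).sum()
        w *= resp[var].map({k: totals[k] / current[k] for k in totals}).to_numpy()
resp["final_weight"] = w
print(resp["final_weight"].describe())
```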

Hidden Markov Models to Estimate Italian Employment Status 141 Roberta Varriale, (Email), Danila Filipponi, (Email), Ugo Guarnera, (Email) Italian National Institute of Statistics - ISTAT, Rome, Italy The increased availability of large amounts of administrative information at the Italian National Institute of Statistics (Istat) makes it necessary to investigate new methodological approaches for the production of estimates, based on combining administrative data with statistical survey data. Traditionally, administrative data have been used as auxiliary sources of information in different phases of the production process, such as sampling, calibration and imputation. In order to take into account deficiencies in the measurement process of both survey and administrative sources, a more symmetric approach with respect to the available sources can be adopted. A natural strategy, according to this approach, is to consider the target variables as latent (unobserved) variables, and to model the measurement processes through the distributions of the observed variables conditional on the latent variables. In this context, Latent Class Analysis (LCA) is traditionally considered a method to identify a categorical latent variable using categorical observed variables, whose longitudinal extension is the Hidden Markov Model (HMM). Examples of the use of latent models in Official Statistics are becoming common and several applications can be found in the field of employment research. In this paper we show the use of a latent model for estimating employment rates in Italy using both Labour Force Survey (LFS) and administrative data. Here, the use of HMMs is particularly suitable since, as in many (European) countries, administrative employment data are collected on a monthly basis, while the LFS data have a rotating panel structure. The aim is both to show the potential of this methodology and to discuss the possible problems and research topics.
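A compact sketch of the HMM machinery with illustrative parameters (not Istat's fitted model): a latent employment state is measured imperfectly by both the LFS and an administrative source, and the forward algorithm returns the filtered probability of being employed:

```python
# Forward algorithm for a two-state HMM with two imperfect measurements per month
# (LFS and administrative), assuming local independence of the two measurements
# given the latent state. Parameters are purely illustrative.
import numpy as np

transition = np.array([[0.95, 0.05],      # P(state_t | state_{t-1}); states: not employed, employed
                       [0.03, 0.97]])
initial = np.array([0.40, 0.60])          # state distribution before the first month
p_lfs_says_emp = np.array([0.08, 0.93])   # P(LFS = "employed" | latent state)
p_adm_says_emp = np.array([0.15, 0.85])   # P(admin = "employed" | latent state)

def forward(lfs_obs, adm_obs):
    """Filtered P(latent state | observations up to t) for one individual."""
    alpha = initial.copy()
    out = []
    for lfs, adm in zip(lfs_obs, adm_obs):
        emis = (p_lfs_says_emp if lfs else 1 - p_lfs_says_emp) * \
               (p_adm_says_emp if adm else 1 - p_adm_says_emp)
        alpha = emis * (alpha @ transition)   # predict, then update with the emissions
        alpha /= alpha.sum()
        out.append(alpha.copy())
    return np.array(out)

# One person observed for six months by both sources (1 = classified as employed).
print(forward(lfs_obs=[1, 1, 0, 1, 1, 1], adm_obs=[0, 1, 1, 1, 1, 1]).round(3))
```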

Assessing multi-source processes: the new Total Process Error framework 079 Roberta Varriale, (Email), Fabiana Rocci, (Email), Orietta Luzi, (Email) Italian National Institute of Statistics - ISTAT, Roma, Italy The production of Official Statistics based on a combination of data from different sources has spread in recent years, and all National Statistical Institutes (NSIs) are developing new strategies for producing the required outputs. The challenge is to move towards processes where the combination of the available administrative data (AD hereafter) represents as far as possible the primary source, delivering strong and extensive information about the phenomena under study. Many new experiences have delivered important results, which can be considered the basis of the modernization that NSIs are undertaking. In this view, new processes are taking place and new methodological issues are arising. Among others, a key issue to be considered relates to the development of a new quality framework to assess the quality of Official Statistics based on a multi-source process including AD. This paper focuses on this issue, with the final aim of proposing a new evaluation framework, the Total Process Error framework, based on a system of indicators, useful to: (i) support practical decisions about the statistical design and monitor its development; (ii) define quality measures of the new processes and of their statistical outputs.

Mixed mode data collection and adaptive survey design for Structure of Earnings Statistics 014 Guy Vekeman, (Email), Pieter Vermeulen, (Email) Statistics Belgium, Bruxelles, Belgium The Structure of Earnings Statistics (SES) aims to report on the earnings, the occupation and the attained education level of employees. Micro enterprises, employing fewer than 10 persons, are excluded from the reference frame. A primary sample of local units of enterprises is drawn using a stratified sampling scheme. Finally, employees are sampled within the selected establishments. The individual inclusion probability for employees is kept within bounds by tuning the required number of employees to the selection probability of the local unit. Very small establishments are drawn with a low probability, yet all their employees are selected. An exhaustive survey of large establishments requires a relatively limited sample of employees. Individual employee remuneration and data on working time will now be retrieved from administrative social security data, recently made available. In this mixed-mode data collection, establishments only need to report on some personal characteristics of their workers, like their occupation and the education level attained. This information is available at the human resources unit of the employer, which does not need to source financial data on remuneration. Since the payroll is often managed by an external social secretariat, the reform implies that the involvement of a third party, and the corresponding additional cost for the respondent, is no longer required. Both the significantly reduced response burden and the simplified survey questionnaire (from 25 down to 5 variables) should contribute to a much lower non-response.

City users and daytime population. An approach with administrative data. 092 Roberta Vivio, (Email), Sara Casacci, (Email), Stefania Di Domenico, (Email), Maria Liria Ferraro, (Email) Istat, Rome, Italy This paper presents the first results of a prototypal statistical register on city users and the daytime population of a territory. Data on the resident population are no longer sufficient to govern the present complexity of territories, in particular attractive ones, such as university sites or territories with an economic vocation. Incoming and outgoing flows, daily and periodic mobility, short and long migrations, etc. exert considerable anthropic pressure, which requires services and produces consumption of energy and land. Local government must provide transport, energy, housing for temporary residents, etc. Population movements also affect the quality of the social fabric in both the territories of departure and of arrival. This information is strategic for the optimal sizing of collective services and the quantification of the housing needs of cities and their hinterlands, but also for prevention and intervention plans in the event of natural disasters. The questions we are trying to answer with this register are: "How many of these people are present each day? Who are they? Where do they come from?"

Login on Smartphones: A triviality? 030 Johannes Volk, (Email) Destatis - Federal Statistical Office Germany, Wiesbaden, Germany There is an ongoing discussion about transforming online questionnaires into an appropriate mobile device design in order to avoid mode effects and safeguard data quality. However, a crucial point in implementing online questionnaires is often neglected or rarely discussed: the dropout of potential participants due to problems in accessing the online instrument caused by a faulty or inconvenient login process. In our qualitative study, concrete evidence on the design of an easy login on smartphones was iteratively developed to meet users' requirements. Results suggest using catchy passwords, displaying passwords rather than masking them, and implementing proportional rather than non-proportional fonts. Moreover, we analysed the duration of the login process, the layouts of different keyboards, the design of input fields, the layout of the login information in the invitation letter, and the subjective ratings of lead probands regarding effort and security issues. In the end, we developed a combination of login design elements which better suits the users' perspective. However, further investigation of data security requirements is needed before a new design can be implemented.

Digital process data from the truck toll collection as a new building block of official short-term statistics 111 Michael Cox, (Email) 1, Stefan Linz, (Email) 2, Claudia Fries, (Email) 2, Julia Völker, (Email) 2 1 Federal Office for Goods Transport, Cologne, Germany 2 Federal Statistical Office, Wiesbaden, Germany Economic activity generates and requires transport services; there is a close connection between the economic development in Germany and the freight traffic on German roads. Since the beginning of 2005, a toll has been charged on heavy goods vehicles on federal motorways and later also on trunk roads in Germany. As part of toll collection, digital process data is generated, among other things, on the mileage of trucks subject to the toll. The data is generated by a combination of mobile technology and satellite positioning (GPS). The German authority responsible for freight traffic has used this data to develop a truck-toll-mileage index, which indicates the mileage for comparable basic characteristics and excludes structural changes as far as possible. Due to its early availability and economic meaningfulness, this index has been included in the publication programme of official statistics. This article describes the truck-toll-mileage index as a new element in business cycle statistics and explains its relation to existing short-term statistics. Further, it discusses whether the truck-toll-mileage index can be used as a nowcast indicator for official production statistics.
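As a rough illustration of how such an index can be derived from the toll process data, a minimal sketch is given below; the input columns and the restriction to comparable characteristics are assumptions, and the published index additionally controls for structural change and is calendar and seasonally adjusted:

```python
import pandas as pd

# Sketch of a truck-toll-mileage index from hypothetical daily toll records
# with columns: date (datetime), axle_class, emission_class, km.

def mileage_index(toll_records: pd.DataFrame, base_year: int = 2015) -> pd.Series:
    # Restrict to a fixed set of comparable basic characteristics so that
    # structural changes in the fleet affect the index as little as possible.
    comparable = toll_records[
        (toll_records["axle_class"] == "4+") &
        (toll_records["emission_class"].isin(["Euro5", "Euro6"]))
    ]
    # Aggregate toll mileage to calendar months.
    monthly_km = comparable.set_index("date")["km"].resample("M").sum()
    # Reference the monthly mileage to the base-year average (= 100).
    base = monthly_km[monthly_km.index.year == base_year].mean()
    return 100 * monthly_km / base
```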

Sentinel-1 coherence for agricultural statistics 043 Kaupo Voormansik, (Email), Karlis Zalite, (Email) OÜ KappaZeta, Tartu, Estonia The first Copernicus satellite was launched in 2014, and by 2018 the Sentinel-1 and -2 systems had reached operational status with nominal data production volumes. Despite numerous outreach and promotional events, the data is still underused in the public sector and remarkable potential remains untapped. The main factors limiting its use are: i) lack of awareness of the capabilities and limitations among end users; ii) lack of knowledge of how to process the data and retrieve reliable information from it; iii) insufficient applied research and piloting projects to test academic research results under large-scale (e.g. country-level) real-life conditions, in order to discover and fix problems not evident in small-scale research projects. Agricultural statistics is one of the areas where Copernicus data can help to raise the current methods to the next level. Instead of indirect, sampling-based estimates from field surveys, it is now possible to measure the whole geographical coverage directly and obtain virtually complete representation. Although high-resolution satellite data has been available for decades, its usage has been limited by high prices and sparse temporal and spatial coverage. Copernicus addresses both shortcomings with a free and open data policy and unprecedented spatial and temporal coverage. The largest improvements are expected for applications that need dense time series, fast updates and monitoring of temporal processes, where static once-a-year imaging has not been sufficient. Agricultural statistics seems to be a model example here. Thanks to the long tradition and the large, established user community of optical remote sensing, the uptake of Sentinel-2 data has been relatively rapid. The fact that optical satellite imagery is easy and intuitive to interpret should also not be underestimated. The usefulness of Sentinel-1 data has so far been undeservedly underestimated, which is best illustrated by the fact that large satellite imagery processing cloud environments like Google Earth Engine and Amazon Web Services do not even provide Sentinel-1 SLC format data, even though these are the most information-rich data products of Sentinel-1. The reasons behind the very limited usage of Sentinel-1 SLC data are likely its higher technical complexity, less intuitive data interpretation and the smaller community of radar remote sensing experts among universities and companies. Still, Sentinel-1 is a very valuable complement to Sentinel-2, as it is virtually weather independent (no gaps in the time series due to cloud cover) and it is directly sensitive to the water content of the soil and vegetation cover, which is very important for describing the state of agricultural landscapes. This abstract introduces Sentinel-1 repeat-pass coherence as an important parameter for describing agricultural landscapes. The Methods section describes the coherence computation and its meaning for interpreting the resulting coherence images. The Results section describes coherence-based grassland mowing detection and discusses its usage and limitations for other agricultural applications and statistics computation. The Conclusion underlines the main benefits and potential applications and gives recommendations for applied research projects to pave the way for operational use.
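For readers unfamiliar with the quantity, repeat-pass coherence is the magnitude of the local complex cross-correlation of two co-registered SLC acquisitions. A minimal NumPy estimator over a boxcar window might look as follows; this is a sketch only, and the authors' actual processing chain additionally involves co-registration and removal of the topographic phase:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def coherence(slc1: np.ndarray, slc2: np.ndarray, window: int = 5) -> np.ndarray:
    """Repeat-pass coherence of two co-registered complex SLC images.

    gamma = |<s1 conj(s2)>| / sqrt(<|s1|^2> <|s2|^2>), estimated over a local
    boxcar window. Values near 1 indicate a surface that stayed stable between
    the acquisitions; mowing, tillage or rain lower the coherence.
    """
    def boxcar(x: np.ndarray) -> np.ndarray:
        return uniform_filter(x, size=window)

    cross_raw = slc1 * np.conj(slc2)
    # uniform_filter works on real arrays, so filter real and imaginary parts.
    cross = boxcar(cross_raw.real) + 1j * boxcar(cross_raw.imag)
    power = boxcar(np.abs(slc1) ** 2) * boxcar(np.abs(slc2) ** 2)
    return np.abs(cross) / np.sqrt(power + 1e-12)
```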

Time-varying end-of-month effects in German currency in circulation 038 Karsten Webel, (Email), Andreas Dietrich, (Email) Deutsche Bundesbank, Frankfurt, Germany The increasing availability of long economic time series poses a challenge for seasonal and calendar adjustment in practice, as seasonal and calendar movements may now exhibit changes which are barely identifiable over shorter time periods. The popular X-11 and ARIMA model-based seasonal adjustment methods are capable of dealing with a fair amount of moving seasonality. The respective pretreatment regression models, however, are based on the assumption of constant calendar effects and thus do not allow a direct estimation of "moving" calendar effects. As a compromise, official statistics usually follow a rather pragmatic approach based on dividing the entire observation span into several (potentially overlapping) sub-spans and performing separate seasonal and calendar adjustments on each of them. Still, the key question remains: can calendar effects generally be assumed to stay constant over at least several years? Applying structural time series models, this paper studies, as an example, monthly currency in circulation for Germany from January 1980 to February 2018. Since the data are reported on the last banking day of a month, the series is likely to be affected by the particular weekday on which the last day of a given month falls. We refer to this effect as the end-of-month effect and find both smooth transitions and sudden changes in these effects over time, providing empirical evidence against the assumption of constant end-of-month effects.
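A much simplified way to check whether such an effect drifts over time (not the structural time series approach used in the paper) is to regress month-on-month log changes on dummies for the weekday of the reporting date within rolling sub-spans; the sketch below assumes a monthly series indexed by the last banking day of each month:

```python
import numpy as np
import pandas as pd

# Rolling re-estimation of end-of-month weekday effects (simplified check,
# not the paper's structural time series model). `series` is assumed to be
# monthly currency in circulation indexed by the reporting date.

def end_of_month_effects(series: pd.Series, window: int = 120) -> pd.DataFrame:
    growth = np.log(series).diff().dropna()
    # One dummy per weekday of the last banking day of the month.
    weekday = pd.get_dummies(growth.index.dayofweek, prefix="wd").set_index(growth.index)
    results = {}
    for end in range(window, len(growth) + 1):
        y = growth.iloc[end - window:end].values
        X = weekday.iloc[end - window:end].astype(float).values
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        results[growth.index[end - 1]] = coef
    # Rows: end of each rolling 10-year sub-span; columns: weekday effects.
    return pd.DataFrame(results, index=weekday.columns).T
```

Plotting the resulting coefficient paths gives a first impression of whether the end-of-month effects can reasonably be treated as constant.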

Administrative Data and Statistical Matching (EU-SILC and Micro-census Environment) 029 Alexandra Wegscheider-Pichler, (Email) Statistics Austria, Vienna, Austria The micro-census special programme "Environmental conditions and environmental behaviour" by Statistics Austria contains extensive data material concerning ecological issues. The influence of income on the collected environmental characteristics is commonly assumed but could not previously be confirmed, because the variable "income" is not part of the micro-census survey. In 2018, a study with data from the 2015 micro-census environment module was conducted. Most of the income information was generated by linking with administrative data, which increased data validity. For data missing in administrative sources, statistical matching with income components of EU-SILC was conducted. A mixed use of both methods (administrative linkage and statistical matching) seemed to be the most useful way to generate a variable such as total household income for an existing data set. In this way, the advantages of both data-generation approaches complement each other and improve data quality.
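The statistical matching step can be illustrated with a minimal distance hot-deck sketch on common variables; the variable names are placeholders and the common variables are assumed to be numerically encoded, so this is not the exact matching procedure used in the study:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Distance hot-deck statistical matching: donate an EU-SILC income component
# to micro-census records for which administrative sources have no value.
# Placeholder common variables, assumed to be numerically encoded.
COMMON = ["age", "sex", "household_size", "employment_status"]

def match_income(recipients: pd.DataFrame, donors: pd.DataFrame,
                 income_var: str = "self_employment_income") -> pd.Series:
    nn = NearestNeighbors(n_neighbors=1).fit(donors[COMMON])
    _, idx = nn.kneighbors(recipients[COMMON])
    # Each recipient record receives the income component of its closest donor.
    return donors[income_var].iloc[idx.ravel()].set_axis(recipients.index)
```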

New Experimental Statistics at Istat: the Social Mood on Economy Index 125 Diego Zardetto, (Email), Cristina Fabbri, (Email), Pasquale Testa, (Email), Luca Valentino, (Email) Istat - The Italian National Institute of Statistics, ROMA, Italy Nowadays millions of people all over the world use social media platforms to keep up with the news, to express their feelings and ideas, and to share or debate opinions on virtually every conceivable topic. This justifies the interest of National Statistical Institutes (NSIs) in social media as a means of "measuring" the public mood. In recent years, the Italian National Institute of Statistics (Istat) has been investigating whether social media messages may be successfully exploited to develop domain-specific sentiment indices, namely statistical instruments meant to assess the Italian mood about specific topics or aspects of life, such as the economic situation, the European Union, migration, the terrorist threat, and so on. To this end, Istat researchers have developed procedures to collect and process only social media messages containing at least one keyword belonging to a specific filter, namely a definite set of relevant Italian words. Domain-specific filters have been designed by subject-matter experts with the aim of filtering out from the start messages that would very likely turn out to be off-topic for the intended statistical production goal. Istat has recently released a new experimental statistic, based on Twitter data: the Social Mood on Economy Index. The index provides daily measures of the Italian sentiment on the state of the economy. These measures are derived from samples of public tweets in Italian, which are captured in real time. Similar initiatives have been put in place by other NSIs in recent times, notably Statistics Netherlands' attempts to "mimic" the time evolution of the Dutch Consumer Confidence Index by means of a sentiment index based on social media, and to derive daily measures of social tension from Twitter. This paper provides an overview of the production pipeline of Istat's Social Mood on Economy Index.
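The keyword-filtering step that precedes sentiment scoring can be sketched as follows; the keyword list and tokenisation are illustrative placeholders, not Istat's curated economics filter:

```python
import re

# Keep only tweets containing at least one keyword of the domain filter
# (placeholder Italian keywords, not the actual filter).
ECONOMY_FILTER = {"spread", "inflazione", "disoccupazione", "tasse", "pil"}

def on_topic(tweet_text: str) -> bool:
    tokens = set(re.findall(r"\w+", tweet_text.lower()))
    return bool(tokens & ECONOMY_FILTER)

tweets = [
    "L'inflazione continua a salire",     # kept: on-topic for the economy
    "Che bella giornata di sole oggi!",   # discarded as off-topic
]
print([t for t in tweets if on_topic(t)])
```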

Human-centred Machine Learning through Interactive Visualizations: Reflections on the Design of a Visual Analytics Tool for Criminal Intelligence Analysis 013 Leishi Zhang, (Email), William Wong, (Email), Neesha Kodagoda, (Email) Middlesex University, London, United Kingdom With the increasing popularity of Machine Learning (ML) in application fields such as health care, environmental protection and criminal investigation, more and more application analysts are keen to incorporate ML into their existing analytical approaches in order to improve efficiency and effectiveness [1]. Given an application problem, a tailor-made ML software product is often developed to help address the existing challenges in data processing and knowledge extraction. Because most application analysts are not ML experts, the challenge is to design the tool in such a way that the analyst can easily understand the underlying principles behind the algorithms and feel comfortable using it. In this abstract we introduce our journey through the design of a visual analytics tool as part of the EU-funded project "Visual Analytics for Sense-making and Criminal Intelligence Analysis (VALCRI)", with the aim of making ML more approachable for domain analysts.

A Data Integration System for RIAD Bundesbank 193 Katja Ziprik, (Email) Deutsche Bundesbank, Frankfurt Main, Germany The Register of Institutions and Affiliates Data (RIAD) is a shared register maintained by the European System of Central Banks (ESCB). Each National Central Bank of the Eurozone provides input for RIAD in its own area of competence, and RIAD Bundesbank is the corresponding German register. It contains the reference data on legal and other statistical institutional units and facilitates the integration of several databases, in particular by allocating common identifiers, attributes and metadata. It will provide a data-hub function by integrating and processing multiple internal and external databases, e.g. the Analytical Credit Database (AnaCredit) and the Deutsche Bundesbank Prudential Database. It enables high flexibility with regard to analysis, since the collected data can be used for statistical and non-statistical purposes, both across institutions and across different user groups within institutions. At the moment, over 1,700 reporting agents from the AnaCredit primary reporting and three internal registers are continuously integrated into the system. These data sources differ significantly in their quality and data structures. In the near future, further commercial and official data sources will be gradually integrated. No stable national identifier with full coverage is available in Germany to describe the respective reporting units. The aim of the integration system of RIAD Bundesbank is to classify ("match"/"non-match") and consolidate every received reporting unit on a highly automated basis, since the data throughput is too high to handle manually. To fulfil the requirements formulated in the corresponding regulation (ECB/2016/13) and guideline (ECB/2018/16), the produced data quality needs to be very high; the implemented algorithms are therefore precision-oriented. For the task of record linkage, several deterministic stages have already been implemented, and a machine-learning-based record linkage prototype has been developed to complement them. For the consolidation of the data pairs, RIAD Bundesbank contains a highly automated block-compounding algorithm which takes the veracity, velocity and modus of the reported attributes into account.
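The overall matching logic, deterministic stages followed by a precision-oriented machine-learning fallback, can be sketched as follows; the identifiers, thresholds and classifier are assumptions for illustration, not the production implementation:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Unit:
    lei: Optional[str]   # Legal Entity Identifier, if reported
    name: str
    postcode: str

def link(reported: Unit, candidate: Unit,
         ml_score: Callable[[Unit, Unit], float]) -> str:
    """Classify a candidate pair as 'match', 'non-match' or 'clerical review'."""
    # Deterministic stage 1: exact agreement on a strong identifier.
    if reported.lei and candidate.lei:
        return "match" if reported.lei == candidate.lei else "non-match"
    # Deterministic stage 2: exact agreement on normalised name and postcode.
    if (reported.name.casefold(), reported.postcode) == \
       (candidate.name.casefold(), candidate.postcode):
        return "match"
    # ML-based fallback: a trained classifier scores the remaining pairs;
    # thresholds are set conservatively because precision is prioritised.
    score = ml_score(reported, candidate)
    if score >= 0.95:
        return "match"
    if score <= 0.20:
        return "non-match"
    return "clerical review"
```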