Data analytics Workshop for official statistics (daWos)

Technical Report

Table of Contents

- Introduction
- Workshop objective
- Report structure
- Key conclusions and Recommendations from the Workshop
  - Summary conclusions
  - Recommendations
- Annex 1 – Description of Workshop sessions
  - 1.1 Opening session
  - 1.2 General Sessions: Expert lectures
    - 1.2.1 Organisation of the general sessions
    - 1.2.2 Expert lecture: Use cases and best practices in data analytics
    - 1.2.3 Expert lecture: Trends in data analytics architectures
    - 1.2.4 Expert lecture: Future developments in data analytics and data science
    - 1.2.5 Expert lecture: Engaging users and policymakers for data analytics
  - 1.3 General session – Mapping of existing data analytics initiatives in the ESS as well as in comparable sectors
    - 1.3.1 Organisation of the session
    - 1.3.2 Expert lecture: The Big Data Platform initiative of the EC Joint Research Centre
    - 1.3.3 Expert lecture: The Logical Statistical Data warehouse of the Centre of Excellence on Data Warehouse
    - 1.3.4 Expert lecture: Methods and algorithms in the UN Global Platform
  - 1.4 Parallel group session 1 – Data analytics in practice - real examples from the ESS
    - 1.4.1 Objective of the session
    - 1.4.2 Summary of the session
  - 1.5 Parallel group session 2 – Statistical models and methods for data analytics
    - 1.5.1 Objective of the session
    - 1.5.2 Summary of the session
  - 1.6 Parallel group session 3 – Engaging with external stakeholders
    - 1.6.1 Objective of the session
    - 1.6.2 Summary of the session
  - 1.7 Parallel group session 4 – What are the success factors for data analytics?
    - 1.7.1 Objective of the session
    - 1.7.2 Summary of the session
  - 1.8 Parallel group session 5 – Emerging data analytics tools and techniques
    - 1.8.1 Objective of the session
    - 1.8.2 Summary of the session
  - 1.9 Parallel group session 6 – Facilitating the data analytics of others
    - 1.9.1 Objective of the session
    - 1.9.2 Summary of the session
  - 1.10 Parallel group session 7 – Joint ESS data analysis undertakings
    - 1.10.1 Objective of the session
    - 1.10.2 Summary of the session
  - 1.11 Parallel group session 8 – Infrastructure and architecture as a solid basis for data analytics
    - 1.11.1 Objective of the session
    - 1.11.2 Summary of the session
  - 1.12 Parallel group session 9 – Data analytics in a changing environment
    - 1.12.1 Objective of the session
    - 1.12.2 Summary of the session
- Annex 2 – Programme of the Workshop
  - Day 1 (10 September 2018)
  - Day 2 (11 September 2018)
  - Day 3 (12 September 2018)
  - Workshop programme flyer

Introduction

Inspired by the outcomes of a series of thematic workshops dedicated to the modernisation of official statistics1, the "data analytics Workshop for official statistics" (daWos) was a two-day event organised by Eurostat and held in Amsterdam (10-11 September 2018). It aimed at addressing the challenges and opportunities for the National Statistical Institutes (NSIs), the European Statistical System (ESS) and the broader community of Official Statistics in the area of data analytics.

Data analytics (DA) is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialised systems and software. Data analytics methodologies include exploratory data analysis (EDA), which aims to find patterns and relationships in data, and confirmatory data analysis (CDA), which applies statistical techniques to determine whether hypotheses about a data set are true or false. EDA is often compared to detective work, while CDA is akin to the work of a judge or jury during a court trial -- a distinction first drawn by Tukey in his 1977 book on EDA2.
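The EDA/CDA distinction can be made concrete with a minimal sketch using only the Python standard library. The data below are invented for illustration; the exploratory step simply summarises, while the confirmatory step checks a hypothesis stated in advance.

```python
import statistics

# Toy data set: a monthly indicator observed for two groups (invented figures).
group_a = [100.0, 101.2, 102.1, 103.5, 104.0, 105.8]
group_b = [100.0, 100.9, 102.4, 103.1, 104.6, 105.2]

# --- Exploratory analysis (EDA): look for patterns without a fixed hypothesis ---
mean_a = statistics.mean(group_a)       # central tendency
spread_a = statistics.pstdev(group_a)   # dispersion

# --- Confirmatory analysis (CDA): test a hypothesis stated in advance ---
# Hypothesis: both groups grow at the same average monthly rate.
diffs_a = [b - a for a, b in zip(group_a, group_a[1:])]
diffs_b = [b - a for a, b in zip(group_b, group_b[1:])]
gap = statistics.mean(diffs_a) - statistics.mean(diffs_b)

print(round(mean_a, 2), round(gap, 2))  # 102.77 0.12
```

A full CDA would attach a significance test to `gap`; the point here is only the difference in intent between the two modes of analysis.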

1 This includes among others: the 2014 and 2016 ESS Big Data Workshops, the 2016 ESS Visualisation Workshop and the ESS Workshop on dissemination of Official Statistics as open data.

2 J.W.Tukey (1977): Exploratory Data Analysis, Pearson.

More advanced types of DA, sometimes also referred to as Data science3, include data mining, which involves sorting through large data sets to identify trends, patterns and relationships; predictive analytics, which seeks to predict customer behaviour, equipment failures and other future events; and machine learning, an artificial intelligence technique that uses automated algorithms to churn through data sets more quickly than data scientists can do via conventional analytical modelling. Big data analytics applies data mining, predictive analytics and machine learning tools to sets of big data that often contain unstructured and semi-structured data. Text mining provides a means of analysing documents, emails and other text-based content.

DA applications involve more than just analysing data. Particularly on advanced analytics projects, much of the required work takes place upfront, in collecting, integrating and preparing data and then developing, testing and revising analytical models to ensure that they produce accurate results. Altogether, DA technologies and techniques are widely used in commercial industries to enable organisations to make more-informed business decisions, and by scientists and researchers to verify or disprove scientific models, theories and hypotheses.

Workshop objective

The approach to the daWos event was thematically focused. Beyond sharing, from different perspectives (e.g., in terms of methodology, technology, or policy), the actual experience and practice in DA (e.g., related to data access and sharing, applications and services, tools and techniques, architecture and technologies, or skills and know-how development), it also had the goal of building capabilities in this area to use data in a better way, or at least to provide new insights for Official Statistics.

The workshop aimed at increasing awareness of the issue of DA; identifying emerging best practices so as to identify synergies and options for joint development efforts; and presenting and sharing best practices, original ideas, new tools, and past (successful or failed) experiences. Overall, these objectives were intended to contribute to the broader objective of facilitating cooperation within the ESS in the development and implementation of common methodological solutions and practical applications, aligned with the ESS Vision 2020 flagship strategy.

To achieve all these objectives, the workshop included:

- Presentations from experts in different areas related to DA. These lectures aimed at exposing the audience to current trends and state-of-the-art developments in the field of DA – not necessarily connected to Official Statistics and not only referring to experiences in the ESS, but also to other statistical systems and the private sector – so as to provide useful insights into DA and related issues.
- The workshop's pillar was a set of parallel group sessions devoted to the identification and exchange of relevant experiences on specific issues of DA, within and outside the ESS.

In this setting, participants could share the results achieved through existing DA activities, i.e., in-house use cases and projects. In addition, they were also invited to further detail the mainstreaming of these activities, e.g., how they were implemented in practice and whether they were actually deployed in production. This way, the needs and requirements regarding methodological development and technological investment could be addressed while describing all other issues met, in terms of management, financing and human capital for instance.

3 D.Donoho (2017): 50 Years of Data Science, doi:10.1080/10618600.2017.1384734.

Report structure

In order to enhance its reading and use, this report first presents the key conclusions from the Workshop (next section).

The Workshop sessions are then described in Annex 1, summarising the presentations and discussions that took place. Annex 2 includes the Workshop programme. All presentations are available on the dedicated webpage on the CROS portal.

Key conclusions and Recommendations from the Workshop

This section recalls the main conclusions and recommendations issued from the parallel group and plenary sessions, which are presented in detail in Annex 1.

Summary conclusions

In recent years, the opportunity to deliver, in various creative ways, information about the impact of certain policies has emerged together with new sources of data. Still, whereas the statistical community has made good progress on using these new sources, many questions and challenges remain when it comes to applying DA so as to extract the relevant information. Advanced, state-of-the-art DA tools and techniques are increasingly necessary to harness new data sources. The availability of brand-new techniques and technologies makes a significant change in leveraging today's accessible computing power to enable the processing of large amounts and various types of data into relevant information through statistical analysis and modelling. By creating faster, more agile and more tailored insights into data and content, it is – sometimes prophetically – believed they can help answer increasingly complex questions that were previously considered beyond reach, from description ("what has happened?") to prescription ("what should we do?"), through diagnosis ("why did it happen?") and prediction ("what will happen?").
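The ladder from description to prescription can be illustrated on a toy time series. Everything below is invented for illustration (figures, threshold and decision rule alike); real diagnostic and predictive analytics would of course use proper statistical models rather than these naive stand-ins.

```python
# Toy monthly counts for some indicator (hypothetical figures).
counts = [250, 255, 262, 270, 281, 290]

# Description: what has happened?
total_change = counts[-1] - counts[0]

# Diagnosis: why did it happen? Inspect month-on-month changes.
changes = [b - a for a, b in zip(counts, counts[1:])]
steadily_accelerating = all(y >= x for x, y in zip(changes, changes[1:]))

# Prediction: what will happen? Naive linear extrapolation of the average change.
avg_change = sum(changes) / len(changes)
forecast_next = counts[-1] + avg_change

# Prescription: what should we do? A simple (made-up) threshold rule.
action = "flag for review" if forecast_next > 295 else "continue monitoring"

print(total_change, steadily_accelerating, forecast_next, action)
```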

Although the guiding principles of evidence-based policymaking seem not to change (as regards the requirements in terms of transparency, privacy and ethics, quality and robustness, and timeliness), the rise of DA – together with new technologies and trends in sharing, handling, processing and analysing data – certainly calls for an upgrade of evidence-based policymaking practices in the NSIs. One of the main challenges facing NSIs is not only to build the data analytics capabilities required to harness the new data sources (e.g., machine learning, data mining, pattern recognition, natural language processing, etc.), but also to understand how these capabilities can be deployed to help satisfy the demand from policymakers and policy users for faster, more agile and more tailored insight. Beyond the need for new approaches and technologies, there is also a need to adopt new job profiles and skill sets for an effective use of data analytics services. There seems to be a general understanding that new practices will require dynamic and innovative partnerships, including public-private ones. This also represents further challenges for NSIs, since their role may decrease while that of independent data institutes might grow.

Furthermore, many users of Official Statistics are not willing to compromise at all on the quality and trustworthiness of information for the promise of some improvements in timeliness, frequency and granularity. In the context of a "post-truth" society, not only quality and trust, but also openness, reproducibility, reliability and sustainability of the statistical products and services are essential to ensure that evidence-based policymaking is transparent and defensible. All such considerations further raise a number of significant issues that need to be addressed by NSIs before deciding whether to embrace DA in the actual and regular production of Official Statistics.

Recommendations

Following the various discussions that took place at the daWos event, including the sharing of best practices, practical experience as well as lessons learned by the participants, useful recommendations emerged to address DA from different angles:

- Applications and services: A catalogue that identifies the different use cases using DA as implemented in the NSIs could be useful. The applications and services should be mapped against existing standards for comparability. This should contribute to the identification of emerging best practices and possible synergies and options for joint development efforts, addressing policy issues at cross-border level.
- Tools and techniques: A catalogue of the different DA software, algorithms and models that have been used or are still used in the NSIs – including the wealth of assets made freely available on the market – could be established to provide new users with some guarantee regarding the quality of the tools and techniques. This software, and these algorithms and models, could be considered for further – possibly centralised – validation/certification by the NSIs and the community of Official Statistics. When engaging in new software development, NSIs should start small and quickly build software blocks/components that are modular, interoperable and reusable, to be shared within the community and help enrich an ecosystem of DA tools and techniques. Prior to using black-box DA systems – e.g. based on artificial intelligence – the issues of interpretability and/or explainability to the final user of the product should be carefully considered. In general, the reproducibility of production workflows should also be addressed to ensure transparency and trust in the products.
- Architecture and technologies: The data and process architecture needed to support future DA capabilities is still under consideration. Besides traditional databases or new big data infrastructures, logical data warehouses are worth exploring since, through data virtualisation, they make it possible to deal with legacy infrastructure without huge investment. Curation of metadata and abstract knowledge is essential in that perspective. Virtualised containerised environments are also worth studying for serving applications.
- Governance: Existing (good) governance regarding data should be extended to (good) governance of methods and algorithms. Shared protocols towards a code of good practice for using software resources could be developed, or, more simply, already existing protocols, e.g. best practices from the open-source community, could be adopted. FAIR principles (findable, accessible, interoperable, and reusable) could be enforced for tools and techniques, and a good level of documentation should be adopted for methods and algorithms.
- Skills and expertise: DA is an exciting field that combines scientific inquiry, statistical knowledge, subject-matter expertise, and programming skills. The necessary adaptation of job profiles and skill sets to emerging roles for an effective use of DA services within the ESS needs to be addressed (e.g., through the design of appropriate trainings).
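The recommendation on reproducibility of production workflows can be sketched in a few lines: seed every source of randomness and fingerprint the output so that any rerun can be verified byte-for-byte. The pipeline step below is a deliberately simplified stand-in for a real production step.

```python
import hashlib
import json
import random

def run_pipeline(seed: int) -> dict:
    """A stand-in for a DA production step: sample, then aggregate.

    Seeding an isolated RNG is what makes the run repeatable.
    """
    rng = random.Random(seed)
    sample = [rng.randint(0, 100) for _ in range(1000)]
    return {"n": len(sample), "mean": sum(sample) / len(sample)}

def fingerprint(result: dict) -> str:
    """Hash a canonical serialisation of the output, e.g. for an audit trail."""
    payload = json.dumps(result, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# Two runs with the same seed must produce identical fingerprints.
print(fingerprint(run_pipeline(42)) == fingerprint(run_pipeline(42)))  # True
```

Storing such fingerprints alongside published outputs is one lightweight way to make a workflow auditable without exposing the underlying microdata.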

Annex 1 – Description of Workshop sessions

This section describes first the contents of the opening session, then the general sessions and finally the group sessions (followed by wrap-up sessions aiming at summarising the discussions that took place during the group sessions and highlighting the main topics/concerns/decisions of interest). Sessions were organised around specific topics selected by Eurostat as follows (detailed descriptions are given in the corresponding sections):

- Opening session

- The general sessions included presentations on use cases and best practices in data analytics, trends in data analytics architectures, future developments in data analytics and data science, and engaging users and policymakers for data analytics. Moreover, a general session containing three expert presentations addressing other ongoing initiatives concluded the first day.

- Group session 1: Data analytics in practice - real examples from the ESS

- Group session 2: Statistical models and methods for data analytics

- Group session 3: Engaging external stakeholders

- Group session 4: What are the success factors for data analytics?

- Group session 5: Emerging data analytics tools and techniques

- Group session 6: Facilitating the data analytics of others

- Group session 7: Joint ESS data analysis undertakings

- Group session 8: Infrastructure and architecture as a solid basis for data analytics

- Group session 9: Data analytics in a changing environment

1.1 Opening session

Participants were welcomed, on behalf of Eurostat, by Ms Martina Hahn, Head of Unit Methodology and Corporate Architecture in Eurostat. At first, the participants were reminded that the event was organised to contribute to the goals defined in the ESS Vision 2020. The event allowed participants to discuss how well the ESS is equipped with the necessary DA skills and technology for Official Statistics, how much this really helps users, and how efficient the investment by NSIs and Eurostat in this "new" activity is.

Ms Hahn stressed that the presence in the event of a mix of statisticians, IT experts, as well as the presence of several international institutions (among others OECD and JRC) was a good sign of interest in this topic. The objectives of the event were also recalled:

- Identify and discuss the different use cases for DA by Official Statistics producers, in particular the impact and opportunities created by external trends and developments (e.g. new user demands, the emergence of a rich DA market).
- Take stock of the development of DA in the ESS and identify the main challenges ahead for fulfilling the use cases for DA, in particular the challenges related to data confidentiality.
- Discuss the data and process architecture needed to support these new use cases and the future DA capabilities in the NSIs and other ESS members, identifying in particular emerging best practices and possible synergies and options for joint development efforts.

1.2 General Sessions: Expert lectures

1.2.1 Organisation of the general sessions

The general sessions were organised as "expert lectures" delivered by four selected experts from different fields (NSIs, academia and consultancy), showing diverse viewpoints and covering general issues in order to give a fairly complete idea of the potential of DA. In addition, at the end of the second day, a plenary session on the existing DA initiatives in the ESS was organised. The sessions were followed by discussions with the participants. The four expert lectures are presented in the following.

1.2.2 Expert lecture: Use cases and best practices in data analytics

Speaker: Mr Cédric Archambeau | Principal Applied Scientist, Amazon

This presentation exposed a few of the data-driven approaches Amazon has adopted to make product recommendations to its customers. The speaker showed how the expertise built over the years in machine learning and data science helps Amazon scale in a wide range of other domains and create better customer experiences. The application of machine learning to problems in natural language processing, computer vision, search and forecasting was also discussed.
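For readers unfamiliar with how product recommendation works, a toy item-to-item collaborative filter is sketched below. This is purely illustrative and bears no relation to Amazon's actual systems: the purchase data, item names and scoring rule are all invented.

```python
import math

# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "u1": {"book", "kindle", "cover"},
    "u2": {"book", "cover"},
    "u3": {"kindle", "charger"},
    "u4": {"book", "kindle"},
}

def cosine(item_a: str, item_b: str) -> float:
    """Cosine similarity between two items' binary user vectors."""
    users_a = {u for u, items in purchases.items() if item_a in items}
    users_b = {u for u, items in purchases.items() if item_b in items}
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))

def recommend(user: str) -> str:
    """Best-scoring item the user has not bought yet."""
    owned = purchases[user]
    candidates = {i for items in purchases.values() for i in items} - owned
    return max(candidates, key=lambda c: sum(cosine(c, o) for o in owned))

print(recommend("u2"))  # 'kindle': frequently co-purchased with book and cover
```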

In the first part of the talk, an overview of a number of machine learning applications was given. The speaker explained how these applications fit within the Amazon ecosystem to address the challenges that are faced and how they help scale. While machine learning is routinely used in recommendation, fraud detection and ad allocation, it also plays a key role in devices such as the Kindle or the Echo, as well as in the automation of Kiva-enabled fulfilment centres, statistical machine translation and automated Fresh produce inspection. In the second part, the speaker discussed how machine learning is actually made more accessible within the company by, for example, automating the fine-tuning of machine learning and, in particular, deep learning algorithms. Applying complex predictive systems, such as machine learning-based systems, in the wild requires manually tuning and adjusting knobs, broadly referred to as system parameters or hyper-parameters. Black-box optimisation, and in particular Bayesian optimisation, provides a natural framework for addressing this problem by taking the human expert out of the fine-tuning loop. Bayesian optimisation was also briefly introduced in this context.

1.2.3 Expert lecture: Trends in data analytics architectures

Speaker: Mr Rick van der Lans | Founder of R20/Consultancy BV, ambassador of Kadenza

To deliver data in support of their needs, most organisations have developed a classic data warehouse. New groups of business users with data science and data investigative needs have also developed the data lake, which is most often a standalone system with almost no relationship to the existing data warehouse. On top of this, some organisations have already started to develop a third data delivery system for delivering data to specific business users – the data marketplace. Again, this third data delivery system is being developed as an analytical island, not to mention the still-newer world of streaming analytics.

Developing all these data delivery systems independently is far from ideal. Development-wise, wheels are reinvented, resulting in low productivity, metadata replication, and inconsistencies across reports and analyses. It is crucial for organisations to somehow bring these systems together. One solution is deploying a unified data delivery architecture based on data virtualization technology. Such an architecture can support a wide range of business users, from those demanding a highly agile environment such as the marketplace to those requiring governable and auditable reports. This session discussed how these environments can be merged into a unified architecture.
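The core idea of data virtualization can be sketched with a SQL view that presents several physical stores as one logical table. The example below is a minimal sketch with invented table names; it keeps both "sources" in one SQLite engine, whereas a real virtualization layer would federate queries to remote systems.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE warehouse_sales (region TEXT, amount REAL);  -- classic DW
    CREATE TABLE lake_events    (region TEXT, amount REAL);   -- data lake
    INSERT INTO warehouse_sales VALUES ('north', 100.0), ('south', 50.0);
    INSERT INTO lake_events    VALUES ('north', 20.0),  ('east', 30.0);

    -- The 'virtual' unified table: consumers query one logical view and
    -- never need to know which physical store each row came from.
    CREATE VIEW unified_sales AS
        SELECT region, amount, 'warehouse' AS source FROM warehouse_sales
        UNION ALL
        SELECT region, amount, 'lake' AS source FROM lake_events;
""")

rows = con.execute(
    "SELECT region, SUM(amount) FROM unified_sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 30.0), ('north', 120.0), ('south', 50.0)]
```

Because consumers only see `unified_sales`, the physical systems behind it can be reorganised without breaking reports, which is the property the unified architecture is after.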


Figure 1 – Data virtualization overview.

1.2.4 Expert lecture: Future developments in data analytics and data science

Speaker: Mr Fernando Perez-Cruz | Chief Data Scientist at the Swiss Data Science Center

Implicit generative modelling has recently scratched the surface of how deep learning can be used as a universal simulator. Until recently, deep learning had been used quite successfully to solve long-standing discriminative problems in computer vision, speech and natural language processing, basically showing that hand-coded, human-engineered features are suboptimal given (i) complex problems in which humans only have a basic understanding of the variability of the data, and (ii) the availability of large labelled data sets.

Figure 2 – Generative Adversarial Networks (GANs) in the bigger AI “big picture”. Flowcharts show how the different parts of an AI system relate to each other within different AI disciplines; shaded boxes indicate components that are able to learn from data.

Recently, Variational Auto-Encoders and Generative Adversarial Networks (GANs) have shown that the same representation learning can be used for generative modelling. These implicit generative models do not provide an interpretable model for the available data, but a universal simulator that is able to generate data similar to the data used for training. These tools can be used to simplify complex simulations (e.g. climate models) or limited observations (e.g. cosmology or particle physics), opening the door to Artificial Intelligence-powered advances in many different fields of science. In this talk, the speaker first presented the general approaches and methods as well as their potential use and current shortcomings. In the second part of the talk, the speaker described a recent application of GANs for password guessing. This is an ideal application for understanding the need for GANs, why they work, and what their limitations are.

1.2.5 Expert lecture: Engaging users and policymakers for data analytics

Speaker: Mr Franco Accordino | DG CONNECT

There is a growing need to improve forward thinking in policymaking practices and to enable citizens and policymakers to co-create ideas and share evidence in order to feed policy reflections, and eventually inform policy decisions. New policies are often thought up on the basis of current trends rather than by capturing future opportunities offered, for instance, by long-term advances in science and technology.

The need to focus on short-term measures often prevents governments and businesses from orientating their policy choices towards future possibilities, partly because they have been elected to come up with tangible responses to current challenges that matter to citizens and partly because long-term investment decisions may be too risky. This may make it difficult to put in place sustainable solutions to structural problems.

Figure 3 – Building blocks for users', policymakers and citizens' engagement.

The challenges can be articulated along two main axes, highlighting typical tensions between different policymaking mindsets: (i) evidence about the status of the real world vs. inspiration from longer-term thinking; (ii) delegated leadership vs. participatory leadership. Policy Making 3.0 is a participatory and evidence-based model designed to provide an answer to the above challenges. The essential elements of the Policy Making 3.0 process are the following:

1. The implementation of policies co-developed by policymakers and stakeholders has an impact on the real world (individuals, society, economy, environment, etc.).

2. The real world is monitored and data are gathered, measured and analysed through knowledge mining and statistical tools, which makes it possible to identify trends, issues and challenges and to elicit scientific evidence.

3. The scientific evidence provides information that stakeholders and policymakers can use to reshape policies.

4. Stakeholders and policymakers interact in social networks where factors other than evidence emerge, such as personal opinions, corporate interests, lobbying, ideological values and other 'non-measurable' factors (i.e. that cannot be easily sensed and automatically captured). Such factors often prevail over the scientific evidence. There are also boundary constraints that come in the form of values and laws (e.g. constitutional rules).

5. Policies may also be inspired by desirable visions and aspirations that are not necessarily in line with current, short-term trends and can also be considered as part of the 'emotional' and intuitive factors that influence decisions.

The Policy Making 3.0 model is implemented by Futurium, an online laboratory set up to co-develop futures and policy ideas. This laboratory combines the informal nature of social networks with a methodological approach of foresight to engage stakeholders in the co-creation of the futures that they all want. The architecture consists of the following components: front-end participatory tools, knowledge harvesting tools for both policymakers and stakeholders, data-crawling tools to extract knowledge from popular social networks and embed it into the Futurium, and data-gathering tools to fetch real-world data and input it into the Futurium.

DORIS (Data Oriented Services) is a pool of services enabling policy and support departments to fulfil their daily job in an easy and cost-effective manner. DORIS is largely based on algorithms and data from various sources and providers, meaningfully integrated and made accessible through tailored interfaces (dashboards). It is customisable to specific use cases and available as a general-purpose service ("drive-in") via a web app or an Application Programming Interface (API).

Figure 4 – Example of data service: DORIS.

1.3 General session – Mapping of existing data analytics initiatives in the ESS as well as in comparable sectors

1.3.1 Organisation of the session

The general session at the end of the first day was devoted to mapping the existing DA initiatives in the ESS as well as in comparable sectors. The presentations of the three experts discussed:

- The Big Data Platform initiative of the EC Joint Research Centre;

- The Logical Statistical Data warehouse of the Centre of Excellence on Data Warehouse;

- Methods and algorithms in the UN Global Platform.

1.3.2 Expert lecture: The Big Data Platform initiative of the EC Joint Research Centre

Speaker: Mr Pierre Soille | EC – DG JRC

The project background is the explosion of digital data sources that led to the big data paradigm (Volume, Velocity, and Variety of data streams), with Earth observation (EO) entering big data thanks to the Copernicus Sentinel satellites (full, free, and open data). Indeed, the increasing amount of free and open geospatial data of interest to major societal questions calls for the development of innovative data-intensive computing platforms for the efficient and effective extraction of information from these data. Consequently, the big data task force of the Joint Research Centre (DG JRC) of the European Commission recommended in late 2014 to start a big data pilot project on EO and Social Sensing. The development of the JRC Earth Observation Data and Processing Platform (JEODPP) started in 2016.

The JEODPP platform is versatile in the sense that it accommodates different service levels to satisfy the needs of a variety of users: batch processing, provision of legacy environments, and interactive visualisation and processing. All services are accessed through a web browser, so that no dedicated client software needs to be installed on the devices accessing the platform. A simplified representation of the JEODPP architecture is shown in the figure below in the form of a three-layer stack, with the resources layer at its base, followed by the service layer, and the client layer at the top. The platform already supports a variety of projects serving policy areas in agriculture, forestry, environment, disaster risk management, development, health, and energy.

Figure 5 – JEODPP platform architecture: simplified view with its main layers and components.

The exponential growth of data and data sources is a matter of fact, and the big data paradigm is permeating all fields. FAIR data principles also apply to data analysis. The challenge of turning data into insights is facilitated by platforms where data are co-located with processing. Jupyter notebooks contribute to reproducible analysis as well as knowledge sharing and collaborative working. The importance of interactive analysis and visualisation should also be taken into account. Open standards, including open APIs, are needed to avoid platform lock-in.

The project will evolve into Big Data Analytics (2019-2020), with innovative approaches (AI/machine learning) for combining large amounts of data originating from different sources. The project is supported by the JEODPP, and the initial focus will be on geospatial data and their combination with other data sources. It will be the key enabler of data and knowledge sharing across the JRC and towards partners, and the link with the Copernicus Data and Information Access Services.

1.3.3 Expert lecture: The Logical Statistical Data warehouse of the Centre of Excellence on Data Warehouse

Speaker: Ms Sonia Quaresma | INE, PT

A Statistical Data Warehouse (S-DWH) can be defined as a single corporate Data Warehouse fully based on metadata. An S-DWH is specialised in supporting the production of multi-purpose statistical information. With an S-DWH, different aggregate data on different topics should not be produced independently from each other, but as integrated parts of a comprehensive information system where statistical concepts, micro data, macro data and infrastructures are shared. The Information Systems connect the business to the infrastructure; this is represented by a conceptual organisation of the effective S-DWH which is able to support tactical demands.

In the layered architecture, in terms of data system, different assets are identified:

- the staging data are usually of a temporary nature, and their contents can be erased, or archived, after the DW has been loaded successfully;
- the operational data store is a database designed to integrate data from multiple sources for additional operations on the data. The data is then passed back to operational systems for further operations and to the data warehouse for reporting;
- the Data Warehouse is the central repository of data, created by integrating data from one or more disparate sources, which stores current as well as historical data;
- data marts are kept in the access layer and are used to get data out to the users. Data marts are derived from the primary information of a data warehouse and are usually oriented to specific business lines.

Therefore, micro data, macro data and infrastructure are shared.
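The flow through these layers can be sketched end to end. The records, variable names and NACE codes below are invented; the point is only how data moves from raw staging extracts, through a harmonised integration step, into the warehouse, and finally out through a business-line data mart.

```python
# Staging layer: raw extracts, kept temporarily in source-specific shape.
staging = [
    {"src": "survey", "nace": "C", "turnover": "120"},  # untyped strings
    {"src": "admin",  "nace": "C", "turnover": "80"},
    {"src": "admin",  "nace": "G", "turnover": "55"},
]

# Operational/integration layer: typed, harmonised records from all sources.
integrated = [
    {"nace": r["nace"], "turnover": float(r["turnover"])} for r in staging
]

# Warehouse layer: central repository holding current and historical data.
warehouse = {"reference_year": 2018, "records": integrated}

# Access layer: a data mart derived for one business line (industry, NACE C).
mart_industry = sum(
    r["turnover"] for r in warehouse["records"] if r["nace"] == "C"
)
print(mart_industry)  # 200.0
```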

Figure 6 – Logical Statistical Data Warehouse.

The management of the metadata used and produced in all the different layers of the warehouse is specifically defined in the metadata framework and the micro data linking. Metadata are used for the description, identification and retrieval of information, and they link the various layers of the S-DWH through the mapping of the different metadata description schemes. The repository contains all statistical actions, all classifiers in use, input and output variables, selected data sources, descriptions of output tables, questionnaires and so on. All these meta-objects are collected during the design phase into a single metadata repository. The result is a metadata-driven system that is also well suited to managing actions or IT modules in generic workflows.
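The metadata-driven idea can be illustrated with a small sketch: the processing steps are not hard-coded but read from a single metadata repository, so changing the repository reconfigures the workflow. The repository contents, variable names and step functions below are all hypothetical:

```python
# Hypothetical metadata repository: variable descriptions plus a generic,
# configurable workflow (the step order is data, not code).
metadata_repository = {
    "variables": {"turnover": {"unit": "kEUR", "classifier": "NACE"}},
    "workflow": ["validate", "derive"],
}

def validate(record, repo):
    """Accept only variables described in the metadata repository."""
    return all(v in repo["variables"] or v == "id" for v in record)

def derive(record, repo):
    """Example derivation step: add a converted-unit variable."""
    record = dict(record)
    record["turnover_meur"] = record["turnover"] / 1000
    return record

STEPS = {"validate": validate, "derive": derive}

def run(record, repo):
    """Drive processing entirely from the metadata repository."""
    for step_name in repo["workflow"]:
        out = STEPS[step_name](record, repo)
        if out is False:
            raise ValueError(f"step {step_name!r} rejected the record")
        if isinstance(out, dict):
            record = out
    return record

result = run({"id": 1, "turnover": 1500}, metadata_repository)
print(result["turnover_meur"])  # prints 1.5
```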

A distributed computing platform leads to the Logical Statistical Data Warehouse (LSDW) of the future. Moving to an LSDW means adding semantic data abstraction: the abstraction layer allows a conceptual generalisation of the sources and of the integration outputs. Based on a semantic middleware that supports context integration, it describes each data taxonomy and relates it to each use-case ontology in place. Moreover, a virtualisation interface allows access to all corporate data in different contexts, as well as the creation of new contexts.
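The virtualisation idea can be sketched as a logical view that answers queries by delegating to heterogeneous sources at query time, translating each source's fields into one shared schema instead of physically copying the data. The `VirtualView` API, source names and field mappings below are invented for illustration:

```python
class Source:
    """Stand-in for a physical data store (survey database, admin register, ...)."""
    def __init__(self, rows):
        self.rows = rows

class VirtualView:
    """Abstraction layer: maps one logical schema onto many sources."""
    def __init__(self):
        self.sources = {}  # source name -> (source, field mapping)

    def register(self, name, source, mapping):
        """mapping translates source field names to logical field names."""
        self.sources[name] = (source, mapping)

    def query(self, predicate):
        """Evaluate a query against all sources, in the logical schema."""
        results = []
        for source, mapping in self.sources.values():
            logical = [{mapping[k]: v for k, v in row.items() if k in mapping}
                       for row in source.rows]
            results.extend(r for r in logical if predicate(r))
        return results

view = VirtualView()
view.register("survey", Source([{"ent_id": 1, "emp": 12}]),
              {"ent_id": "id", "emp": "employees"})
view.register("admin", Source([{"firm": 2, "staff": 30}]),
              {"firm": "id", "staff": "employees"})
big = view.query(lambda r: r["employees"] > 10)
print(len(big))  # prints 2
```

The design point is that new sources (or new contexts) are added by registering a mapping, not by rebuilding the warehouse.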

1.3.4 Expert lecture: Methods and algorithms in the UN Global Platform
Speaker: Mr Joni Karanka | ONS (UK)

The UN Global Working Group (GWG) on Big Data for official statistics was created in 2014 by the UN Statistical Commission to explore the benefits and challenges of the use of new data sources and technologies for official statistics and SDG indicators. The GWG addresses issues pertaining to methodology, quality, technology, data access, legislation, privacy, management and finance, and provides adequate cost-benefit analyses. The main assets of statistical offices are data and algorithms - for exploration and data science.

The UN Global Platform is envisaged as a marketplace for sharing and developing core catalogues of services, data, metadata, methods, APIs, information technology tools and training materials. The global network operates as a federated network of platforms at the national, regional and global levels, which ensures interoperability and information-sharing among the platforms in the network through agreed and defined interfaces. The platform is intended for use by its trusted partners and should meet the requirements for research and development in the use of multisource data. Transparent partnership agreements will need to be developed with private- and public-sector organizations so that the platform partners contribute and derive value through a business model that is individually sustainable for all stakeholders. The platform offers technology infrastructure and constitutes a network for data innovation to facilitate global collaboration of the official statistical community.

Figure 7 – Technology components and main features for methods and algorithms sharing.

The presentation discussed algorithms in official statistics, methods as assets, and the methods service of the UN Global Working Group on Big Data, using Algorithmia (hosted here).

1.4 Parallel group session 1 – Data analytics in practice - real examples from the ESS
Chair: Mr Mervyn Ó'Lúing | Central Statistics Office (IE)

1.4.1 Objective of the session
While intuitively understandable, the concept of DA can mean different things to different stakeholders.

At this session, examples from ESS members illustrated what DA can actually achieve for a statistical institute. A discussion on the lessons learned from these examples was also initiated.

1.4.2 Summary of the session
Different use cases were discussed, e.g. some prototypes/pilots and experimental products. DA may truly disrupt the way NSIs process data, from data collection through indicator estimation to decision-making. However, there are many limitations to adopting and running those pilots in production, and these were discussed. It was stressed that, prior to adopting a "full operational stack" in production, DA tools/methods (or parts of them), whose selection is often ad hoc, need to be made truly reusable and applicable. Beyond development and deployment, the operational integration of DA is challenged by cultural/corporate considerations, since DA is often seen as "yet another tool" with little demonstrated added value by production units.

1.5 Parallel group session 2 – Statistical models and methods for data analytics
Chair: Susie Fortier | StatCan (CA)

1.5.1 Objective of the session
Advanced methods (e.g. model-based estimation, multivariate methods, forecasting/nowcasting and microsimulation) are already being used by statistical institutes - sometimes in regular production, other times in supporting processes or in analytical projects.

At this session, the use of advanced statistical methods in DA for official statistics was discussed. The session also included a presentation from Statistics Iceland on rare events over time (outliers as extreme values) and on models for population projections taking into account fertility, mortality and migration.

1.5.2 Summary of the session
The presentation introduced the probabilistic methods used for the demographic projections of the Icelandic population. The key points made during the presentation concerned the description of the data and the formulation of the mathematical problem, recalling the definition of rare events (outliers and extreme values) and introducing methods and models such as autoregressive distributed lag (ARDL) models (short term) and ARIMA/exponential smoothing models (long term). Finally, functional data modelling with time-series coefficient functions of orthonormal function expansions was presented.

1.6 Parallel group session 3 – Engaging with external stakeholders
Chair: Laust Hvas Mortensen | Statistics Denmark (DK)

1.6.1 Objective of the session
External users of DA may come from many different stakeholder groups (policymakers, policy analysts, NGOs, journalists, researchers, etc.). Moreover, requests for DA may arrive at a central contact point of a statistical institute, or directly at the entity concerned. This raises the question of how to ensure an equitable treatment of stakeholders and a sound use of resources. At this session, different approaches taken to the interaction with external DA stakeholders were discussed.

1.6.2 Summary of the session
Some of the topics discussed during this session were:

- how to communicate probability concepts, in particular uncertainty. The way we communicate uncertainty and change for different types of statistics needs to be tailored to suit the audience. We should offer users the opportunity to "zoom in" according to their needs and be able to find detailed information when they require it;
- transparency (the assumptions made to produce official statistics). Good statistical practice is fundamentally based on transparent assumptions, reproducible results and valid interpretations. In some situations, guiding principles may conflict, requiring individuals to prioritise principles according to context. However, in all cases, stakeholders have an obligation to act in good faith, to act in a manner consistent with these guidelines, and to encourage others to do the same. Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical;
- presenting data using infographics. These graphics present complex information quickly and clearly. Infographics are increasingly popular because they can convey a great amount of complex information succinctly, using visually appealing elements that draw attention and facilitate retention;

- the importance of confidentiality in maintaining trust in official statistics. With Big Data, data subjects may be unaware that they are generating data and of what it can be used for, despite the efforts of the social media platforms in this respect. One of the challenges is to manage the acceptance of data re-use and data linkage, which would not necessarily be expected by data subjects;
- identifying the users and providing specific products. Understanding how our statistics are used is essential to maximising the public value of official statistics and ensuring that users are able to make sound and informed judgements from official statistics.

1.7 Parallel group session 4 – What are the success factors for data analytics?
Chair: Mervyn Ó'Lúing | Central Statistics Office (IE)

1.7.1 Objective of the session
Due to their non-traditional nature, DA initiatives might face obstacles during development as well as during subsequent deployment.

At this session, based on practical experience, the different aspects (organisational and infrastructural as well as technical) that ensure the success of DA initiatives in NSIs were explored.

1.7.2 Summary of the session
A presentation from Statistics Poland on what are not success factors for DA was discussed. The key question concerned the use of advanced DA and the issues around the publication of the results. Two cases were described. The first case concerned human capital composite indicators and the risk of government authorities misinterpreting the data. The second case concerned a big data application within the MapReduce paradigm, which raises questions regarding the quality of new data sources.

1.8 Parallel group session 5 – Emerging data analytics tools and techniques
Chair: Joni Karanka | ONS (UK)

1.8.1 Objective of the session
Countless advanced tools for DA are being developed. Some of them have huge potential, whereas others appear not to be applicable to official statistics (for instance because of their "black box" or proprietary nature) - and yet others look impressive but are in reality just statistical methods relabelled as "data science".

At this session, it was discussed how to cut through the hype to find and deploy the modern data analysis tools and techniques that are genuinely useful for official statistics.

1.8.2 Summary of the session
The state-of-the-art and advanced DA tools and techniques already in use in the ESS (i.e. automatic data retrieval, web scraping, data cleaning, natural language processing, machine learning, pattern recognition) and some examples (CBS-NL) were presented. Moreover, technical solutions (tools and software) to share data and enable users to analyse them, and the need for a common "playground" platform (e.g. like the UNECE sandbox), were discussed.

1.9 Parallel group session 6 – Facilitating the data analytics of others
Chair: Fabio Ricciato | Eurostat

1.9.1 Objective of the session
NSIs – regardless of whether they focus on providing complete DA services – might wish to provide standardised components (data access, analytical tools), either for internal use or for external users of data. By providing small components rather than complete deliverables, a community of data analysts could be fostered. At the same time, while the reputational risks are mitigated by not providing any analyses, the mere act of providing certain analysis tools might be construed as non-objectivity. At this session, the nature of and the approach to providing and sharing data analysis components were discussed.

1.9.2 Summary of the session
NSIs should improve the accessibility of their data for analysts. The improvements involve the further development of APIs, metadata and open data standards. Access to microdata for external data analysts should generally also be improved. Beyond data, NSIs should provide analytical tools that take into account the typology of the user (everyday users vs. expert users) and the cost of the engagement. In the end, DA services should be made available to all users. A collaborative/participative approach for users consists in sharing the source code. Computing/testing platforms help to further engage external users with methods and tools, but more resources are needed.

1.10 Parallel group session 7 – Joint ESS data analysis undertakings
Chair: Matyas Meszaros | Eurostat

1.10.1 Objective of the session
Many statistical institutes face similar data analysis requests. As for many other areas, there might be potential for ESS members to collaborate on certain aspects of DA: sharing tools, approaches and methods. All the more so as both the ESS Vision 2020 and the ESS priorities beyond 2020 set out needs that could be tackled by DA initiatives.

At this session, the potential for ESS collaboration on DA was explored.

1.10.2 Summary of the session
This session discussed some tangible actions and initiatives to launch in order to truly engage NSIs in DA and to foster (or enhance, where it already exists) cooperation. Examples from past collaborations run within the ESS framework were regarded as best practices (e.g. JDemetra+, VTL, mu-ARGUS and tau-ARGUS). Some of the suggested solutions were: the definition of an ESS-wide catalogue of ongoing projects and their use, the adoption of a common repository for sharing resources, the use of generic and concise documentation with examples, the running of basic-level training courses/online courses, the creation of a helpdesk/community support, and the sharing of small building blocks with multilingual support.

1.11 Parallel group session 8 – Infrastructure and architecture as a solid basis for data analytics
Chair: Sonia Quaresma | INE (PT)

1.11.1 Objective of the session
Support for systematic DA (beyond ad hoc one-shot exercises) could imply new architectural and infrastructural requirements for a statistical institute. In some cases, this might be best tackled in a general overhaul of legacy systems.

At this session, the landscape of present and future infrastructure to support DA was discussed.

1.11.2 Summary of the session
The session included a presentation on data architecture by ISTAT, which illustrated how the legacy architecture is taken into account in the new model (e.g. adopting virtual layers within a logical warehouse). Existing statistical data warehouses (SDWs) should be modernised if they are to stay relevant, and the Logical Statistical Data Warehouse (LSDW) is the next evolutionary step up from the SDW. New sources increase the complexity of IT components and move DWH architectures towards logical architectures. The logical DWH is a new management architecture combining the strengths of traditional repository warehouses with alternative data management and access strategies; it is an evolution and augmentation of DWH practices, not a replacement. Data virtualisation enables the logical DWH. The LSDW is a virtual central statistical data store based on logical layers for managing all available data of interest, making it easier to produce the necessary information, (re)use data to create new data and new outputs, perform DA, execute analyses, produce reports, and support tools. Case study: SBS-ICT by web mining. The case study focuses on the use of survey data as ground truth to create a classification model enabling the prediction of variables of the Enterprises ICT Survey.
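The idea behind the SBS-ICT case study - survey answers as ground truth for a classifier that predicts an ICT-survey variable from enterprise website text - can be sketched with a minimal, stdlib-only Naive Bayes classifier. The training texts, the "sells_online" target variable, and the model choice are all invented for illustration; the actual case study's data and model are not described here:

```python
from collections import Counter
import math

def train(samples):
    """Fit a multinomial Naive Bayes model (add-one smoothing) on
    (text, label) pairs, where labels come from survey ground truth."""
    counts, priors = {}, Counter()
    for text, label in samples:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(text.split())
    vocab = {word for c in counts.values() for word in c}
    return counts, priors, vocab

def predict(model, text):
    """Return the most probable label for a piece of website text."""
    counts, priors, vocab = model
    total = sum(priors.values())
    best_label, best_logprob = None, -math.inf
    for label, prior in priors.items():
        logprob = math.log(prior / total)
        denom = sum(counts[label].values()) + len(vocab)
        for word in text.split():
            logprob += math.log((counts[label][word] + 1) / denom)
        if logprob > best_logprob:
            best_label, best_logprob = label, logprob
    return best_label

# Ground truth from (invented) survey responses paired with website text.
model = train([("webshop checkout basket", "sells_online"),
               ("online shop delivery", "sells_online"),
               ("company history contact address", "no_ecommerce"),
               ("about us opening hours", "no_ecommerce")])
print(predict(model, "visit our webshop basket"))  # prints sells_online
```

In production such a model would be trained on scraped enterprise websites linked to actual survey microdata, and the predictions could then extend survey coverage to non-sampled enterprises.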

1.12 Parallel group session 9 – Data analytics in a changing environment
Chair: Jacopo Grazzini | Eurostat

1.12.1 Objective of the session
New challenges (such as the General Data Protection Regulation) and opportunities (such as the Third Data Package) are entering the DA scene. Further initiatives are also appearing on the horizon.

At this session, how to position DA in this changing environment was discussed.

1.12.2 Summary of the session
Though it is recognised that DA products are mature enough, the investment required from NSIs to integrate these solutions into production was discussed. Beyond feasibility, the need for NSIs to adapt was also acknowledged. In acknowledging the potential of DA, it was also mentioned that transformations are needed for NSIs to adopt new roles/capabilities. The impact of algorithmic decision-making, and how NSIs can communicate about it, e.g. to comply with the requirements for transparency, accountability and traceability, was further discussed. In particular, the need for humans to be involved in the decision-making process was stressed, since the adoption of DA methods and tools may leave little room for control. Beyond the question of the data themselves, the ethical/legal issues preventing the adoption of DA in production, as well as possible ways to address them, were raised.

Annex 2 – Programme of the Workshop

Day 1 (10 September 2018)

09:00–12:00 daWos satellite meeting: Implementing and showcasing data analytics

12:00–13:00 Welcome and registration of participants

13:00–14:00 lunch

14:00–14:30 Opening session – Ice breaker

14:30–15:30 Expert lecture on use cases and best practices in data analytics – Cédric Archambeau, Principal Applied Scientist, Amazon

Group sessions I

15:30–16:30 Data analytics in practice - real examples from the ESS | Statistical models and methods for data analytics | Engaging with external stakeholders

16:30–17:00 coffee break

17:00–17:30 Plenary wrap-up session reporting on group sessions I

17:30–19:00 Report and mapping of existing initiatives in the ESS and in comparable sectors

20:00 dinner

Day 2 (11 September 2018)

9:00–10:00 Expert lecture on trends in data analytics architectures – Rick van der Lans, Founder of R20/Consultancy BV

Group sessions II

10:00–11:00 What are the success factors for data analytics? | Emerging data analytics tools and techniques | Facilitating the data analytics of others

11:00–11:30 coffee break

11:30–12:00 Plenary wrap-up session reporting on group sessions II

12:00–13:00 Expert lecture on future developments in data analytics and data science – Fernando Perez-Cruz, Chief Data Scientist at the Swiss Data Science Center

13:00–14:00 lunch

Group sessions III

14:00–15:00 Joint ESS data analysis undertakings | Infrastructure and architecture as a solid basis for data analytics | Data analytics in a changing environment

15:00–15:30 Plenary wrap-up session reporting on group sessions III

15:30–16:00 coffee break

16:00–17:00 Expert lecture on engaging users (and policymakers) for data analytics – Franco Accordino, Directorate-General for Communications Networks, Content and Technology (European Commission)

17:00–18:00 Closing session

Day 3 (12 September 2018)

09:00–13:00 daWos satellite meetings: The geographical dimension of data analytics | Balancing confidentiality and utility in data analytics

13:00–14:00 lunch

Workshop programme flyer