Use of R in Official Statistics

6th International Conference 2018

Welcome

The global community of R users is growing, and the number of National and International Statistical Offices that are adopting R is growing as well. About five years ago, when this conference was organized as an international conference for the first time in Romania, we felt a bit like outlaws using Free and Open Source Software (FOSS) in an area where commercial packages rule the land. How times have changed: in the meantime FOSS, and in particular R, is considered a driving force of innovation in academia, industry and government. The popularity of R is demonstrated by the hundreds of local R user groups, the thousands of R packages, and the R Consortium.

The current conference, at Statistics Netherlands, marks the first occasion outside of the place where it was conceived: Romania. We are therefore especially pleased that our keynote speakers have roots in both countries. Alina Matei is a professor of statistics in Switzerland with Romanian roots. She will talk about optimal sample coordination using R, an important topic in times where the reduction of response burden and increasing nonresponse rates force us to use smaller, more complex sampling methods. Not many R users are aware that there is a ‘touch of Dutch’ in R. Since 2017, Jeroen Ooms (UC Berkeley) has been the maintainer of both Rtools and R for Windows. He will tell us about what it takes to compile, release, and modernize a system on which more than 12,500 R packages and millions of users rely every day.

For the first time this year we have a full day of tutorials with topics including sample stratification, data cleaning and processing, and geospatial modeling. Make sure to take full advantage of the experts that came here to share their knowledge.

With about fifty contributed talks and around one hundred conference attendees, this uRos is the largest in its history. We are grateful to the speakers, tutorial organizers and attendees for making this conference such a growing success.

We wish you an conference. Welcome to uRos2018!

Organizing partners

Statistics Netherlands

Statistics Romania

Statistics Austria

University of Bucharest

Ecological University of Bucharest

Special Journal Issues

Romanian Statistical Review: http://www.revistadestatistica.ro/
Austrian Journal of Statistics: https://www.ajs.or.at/

Contents

Welcome

Practical information

Program overview

Session overview

Tutorials
  Fast & efficient data manipulation with data.table (Jaap Walhout)
  Plotting spatial data in R (Martijn Tennekes)
  So your Data is Tidy. But is it Clean? (Edwin de Jonge and Mark van der Loo)
  Spatial Analysis in R with Open Geodata (Egge-Jan Pollé and Willy Tadema)
  Use of SamplingStrata for the Optimal Stratification of Sampling Frames for Multipurpose Sampling Surveys (Giulio Barcaroli)

Keynotes
  Sample coordination and R (Alina Matei)
  The R infrastructure and Windows Build System (Jeroen Ooms)

Conference presentations
  A Corporate Design Toolbox for R (Thomas Lo Russo)
  A First Step towards Statistical Disclosure Control on Multiple Tables Under the Presence of Differential Attacks (Kazuhiro Minami and Yutaka Abe)
  Alternative to LaTeX for high quality report generation with rmarkdown (Romain Lesur)
  An all-in-one R application for validating model assumptions in linear regression analysis with visualizations (Joy Chioma Nwabueze and Chisimkwuo John)
  An internal package for automated metadata documentation (Matthias Gomolka)
  Canadian Consumer Price Index (CPI) Dashboard built using R Shiny (Manolo Malaver-Vojvodic)
  coder: An R-package for fast classification of item data into groups (Erik Bülow)
  Combining JDemetra+ and R for Analysing and Visualising Time Series in Official Statistics (Atanaska Nikolova)
  Comparison of multivariate outlier detection methods for nearly elliptically distributed data (Kazumi Wada, Mariko Kawano and Hiroe Tsubaki)
  Development of R Shiny Dashboard on Pattern and Characteristics of Tuberculosis Incidence in Malaysia (Kamarul Ariffin Mansor, Nurhuda Ismail, Asmahani Nayan and Abd Razak Ahmad)
  Easily translatable Shiny applications (Matjaž Jeran)
  Easy Bootstrapping for Rotational Surveys with ’surveysd’ (Johannes Gussenbauer, Alexander Kowarik and Matthias Till)
  Errorlocate: finding errors in data (Edwin de Jonge)
  Estimating Differential Mortality from EU-SILC UDB Longitudinal Data (Tobias Göllner, Johannes Klotz)
  Evaluation of estimation methods for a new survey of the UK’s Office for National Statistics (ONS) using R (Konstantinos Soulanis)
  Evidence for the use of alternative data sources to track consumer and business confidence within emerging markets using sentiment based techniques (Hanjo Odendaal)
  Experiences in the migration to RStudio-Server in Statistics Austria (Bernhard Meindl and Alexander Kowarik)
  From challenges to opportunities: The Romanian Case of Use R in Official Statistics (Nicoleta Caragea, Ana-Maria Ciuhu and Raluca Mariana Dragoescu)
  How R is improving the dissemination of statistics within the Department for Work and Pensions (Aoife O’Neill)
  How the Scottish Government is moving towards R (Victoria Avila)
  Interactive data visualization web-based application using R-Shiny (Houssam Hachimi)
  Introducing R at Statistics Denmark – a not entirely completed how-to (Peter Tibert Stoltze & other co-authors (to be added))
  Introduction to ’flagr’ (Salvati Matteo, Eurostat and Mészáros Mátyás)
  Investigating Chaos in Time Series: Evidence from the Cryptocurrency Market (Sami Diaf)
  Lack-of-fit testing without replicates available – a modern clustering approach (Tyler George)
  Macroeconomic Statistical Forecasting for Engine Demand (Ankit Kamboj, Debojyoti Samadder and Ambica Rajagopal)
  Optimal Boundary Value for Creating Anonymized Microdata: Empirical Analysis based on Economic Survey Data (Yutaka Abe, Kiyomi Shirakawa and Hitotsubashi Ryota Chiba)
  Overlapping classification for autocoding system (Yukako Toko, Shinya Iijima and Mika Sato-Ilic)
  R packages for optimal stratified sampling: a review and compared evaluation (Marco Ballin and Giulio Barcaroli)
  pestim - an R package to compute population estimations using mobile phone data (Bogdan Oancea, David Salgado, Luis Sanguiao, and Antoniade Ciprian Alexandru)
  reclin: a package for record linkage and deduplication (Jan van der Laan (Statistics Netherlands))
  Responsive, web-based graphical user interfaces to R (Adrian Dușa)
  (R)evolution of generalized systems and statistical tools at Statistics Canada (Susie Forer and Steven Thomas)
  R’s Shiny package and Survey Solutions for (Active) Survey Management (single slot) (Michael Wild)
  rtrim – an R implementation of Trends and Indices in Monitoring data (Patrick Bogaart, Mark van der Loo, Jeroen Pannekoek)
  SelEdit… - a collection of R packages to implement some optimization-based selective editing techniques (Elisa Esteban, Soledad Saldaña, and David Salgado)
  The Use of R Shiny at the U.S. Bureau of Labor Statistics (Brandon Kopp)
  Transforming Health and Social Care Publications in Scotland (Anna Price, David Caldwell, Ewout Jaspers, Maighread Simpson)
  Two main uses of R in Statistics Portugal: sampling and confidentiality (Pedro Sousa, Conceição Ferreira, Inês Rodrigues, Pedro Campos)
  Use of Choropleth Maps for Regional Statistics (Jillian Delaney)
  Using R for analysis and production of Price Indices for the Production and Services sector of the economy (Matt Mayhew)
  Using R for data cleaning, integration and estimation challenges in Statistics Poland - some conclusions after VIP.ADMIN project (Beręsewicz Maciej and Pawlikowski Dawid)
  Using R for variance estimation in social surveys (Eleanor Law, Vahé Nafilyan, Ria Sanderson)
  Using R to access official data from the Guatemalan National Institute of Statistics (Oscar de León)
  Variance estimation for annual point estimates and net changes for LFS using R package vardpoor (Juris Breidaks)

Practical information

Public wifi

SSID: CBS-Public
User name: uros2018
Password: uros2018

Guests understand and acknowledge that we exercise no control over the nature, content, or reliability of the information and/or data passing through our network.

Social dinner

The social dinner will take place on Thursday 13 September at 19:00.

Restaurant Luden
Plein 6–7
2511CR Den Haag

The easiest way to get there from CBS is to take the light rail (tram) number 3 or 4 in the direction of Den Haag. Get out at stop ‘Spui’, the first stop after the Central Station. From there it is a two-minute walk.

Program overview

12 September: tutorials

Rooms: Tinbergen, Methorst, Idenburg

08:30  Registration (foyer)
09:00  Tinbergen: Spatial Analysis in R with open Geodata
       Methorst: So your data is Tidy, but is it Clean?
       Idenburg: Fast and efficient data manipulation with data.table
10:30  Coffee break (foyer)
11:00  Tinbergen: Spatial Analysis in R with open Geodata
       Methorst: So your data is Tidy, but is it Clean?
       Idenburg: Fast and efficient data manipulation with data.table
12:30  Lunch break (foyer)
13:30  Plotting spatial data in R
       Using SamplingStrata for optimal stratification in multipurpose sampling surveys
15:00  Coffee break (foyer)
15:30  Plotting spatial data in R
       Using SamplingStrata for optimal stratification in multipurpose sampling surveys
17:00  Program ends

13 September: conference day 1

Rooms: Tinbergen, Methorst

08:30  Registration (foyer)
09:00  Opening
09:30  Keynote I
10:30  Coffee break (foyer)
11:00  Tinbergen: Sampling and estimation
       Methorst: R in organization
12:30  Lunch break (foyer)
13:30  Tinbergen: Data Cleaning
       Methorst: R in production: data analysis
15:00  Coffee break (foyer)
15:30  Tinbergen: Methods for official statistics
       Methorst: Shiny applications
17:00  Sessions end
19:00  Social dinner at Luden

14 September: conference day 2

Rooms: Tinbergen, Methorst, Idenburg

08:30  Registration (foyer)
09:00  Keynote II
10:00  unconfUROS results
10:30  Coffee break (foyer)
11:00  Tinbergen: Time series
       Methorst: Reports and GUI programming
       Idenburg: R in production: automation
12:30  Lunch break (foyer)
13:30  Big data
       Dissemination and visualization
15:00  Closing the conference

Session overview

Keynote I, 13 Sept. 09:30, Tinbergen
  Sample coordination and R (Alina Matei)
Chair: Nicoleta Caragea

Sampling and estimation, 13 Sept. 11:00, Tinbergen
  Using R for variance estimation in social surveys (Eleanor Law; Vahé Nafilyan; Ria Sanderson)
  Evaluation of estimation methods for a new survey of the UK’s Office for National Statistics (ONS) using R (Konstantinos Soulanis)
  Easy Bootstrapping for Rotational Surveys with ’surveysd’ (Johannes Gussenbauer; Alexander Kowarik; Matthias Till)
  Variance estimation for annual point estimates and net changes for LFS using R package vardpoor (Juris Breidaks)
  Lack-of-fit Testing Without Replicates Available – A Modern Clustering Approach (Tyler George)
Chair: David Salgado

R in organization, 13 Sept. 11:00, Methorst
  From challenges to opportunities: The Romanian Case of Use R in Official Statistics (Nicoleta Caragea; Ana-Maria Ciuhu; Raluca Mariana Dragoescu)
  Experiences in the migration to RStudio-Server in Statistics Austria (Bernhard Meindl; Alexander Kowarik)
  (R)evolution of generalized systems and statistical tools at Statistics Canada (Susie Forer; Steven Thomas)
  How the Scottish Government is moving towards R (Victoria Avila)
  Introducing R at Statistics Denmark – a not entirely completed how-to (Peter Tibert Stoltze)
Chair: Alexander Kowarik

Data Cleaning, 13 Sept. 13:30, Tinbergen
  Overlapping classification for autocoding system (Yukako Toko; Shinya Iijima; Mika Sato-Ilic)
  SelEdit... - a collection of R packages to implement some optimization-based selective editing techniques (Elisa Esteban; Soledad Saldaña; David Salgado)
  Comparison of multivariate outlier detection methods for nearly elliptically distributed data (Kazumi Wada; Mariko Kawano; Hiroe Tsubaki)
  Errorlocate: finding errors in data (Edwin de Jonge)
Chair: Giulio Barcaroli

R in production: data analysis, 13 Sept. 13:30, Methorst
  Using R for analysis and production of Price Indices for the Production and Services sector of the economy (Matt Mayhew)
  Transforming Health and Social Care Publications in Scotland (Anna Price; David Caldwell; Ewout Jaspers; Maighread Simpson)
  Macroeconomic Statistical Forecasting for Engine Demand (Ankit Kamboj; Debojyoti Samadder; Ambica Rajagopal)
  Two main uses of R in Statistics Portugal: sampling and confidentiality (Pedro Sousa; Conceição Ferreira; Inês Rodrigues; Pedro Campos)
Chair: Ciprian Alexandru

Methods for official statistics, 13 Sept. 15:30, Tinbergen
  Optimal Boundary Value for Creating Anonymized Microdata: Empirical Analysis based on Economic Survey Data (Kiyomi Shirakawa; Ryota Chiba; Yutaka Abe)
  Reclin: a package for record linkage and deduplication (Jan van der Laan)
  R packages for optimal stratified sampling: a review and compared evaluation (Marco Ballin; Giulio Barcaroli)
  A First Step towards Statistical Disclosure Control on Multiple Tables Under the Presence of Differential Attacks (Kazuhiro Minami; Yutaka Abe)
  An All-In-One R Application For Validating Model Assumptions In Linear Regression Analysis With Visualizations (Joy Chioma Nwabueze; Chisimkwuo John)
Chair: Bernhard Meindl

Shiny applications, 13 Sept. 15:30, Methorst
  Rumble, a shiny application for Modelling Bayesian Linear Estimation using R-Inla Package (Ferdian Fadly)
  The Use of R Shiny at the U.S. Bureau of Labor Statistics (Brandon Kopp)
  Development of R Shiny Dashboard on Pattern and Characteristics of Tuberculosis (Kamarul Ariffin Mansor; Nurhuda Ismail; Asmahani Nayan; Abd Razak Ahmad)
  Interactive data visualization web-based application using R-Shiny (Houssam Hachimi)
Chair: Guido Schulz

Keynote II, 14 Sept. 09:00, Tinbergen
  The R infrastructure and Windows build system (Jeroen Ooms)
Chair: Edwin de Jonge

Time Series, 14 Sept. 11:00, Tinbergen
  Estimating Differential Mortality from EU-SILC UDB Longitudinal Data (Tobias Göllner; Johannes Klotz)
  Investigating Chaos in Time Series: Evidence from the Cryptocurrency Market (Sami Diaf)
  Combining JDemetra+ and R for Analysing and Visualising Time Series in Official Statistics (Atanaska Nikolova)
  rtrim – an R implementation of Trends and Indices in Monitoring data (Patrick Bogaart; Mark van der Loo; Jeroen Pannekoek)
Chair: Kazumi Wada

Report and GUI programming, 14 Sept. 11:00, Methorst
  A Corporate Design Toolbox for R (Thomas Lo Russo)
  Alternative to LaTeX for high quality report generation with rmarkdown (Romain Lesur)
  Easily translatable Shiny applications (Matjaž Jeran)
  Responsive, web-based graphical user interfaces to R (Adrian Dușa)
Chair: Ana-Maria Ciuhu

R in production: automation, 14 Sept. 11:00, Idenburg
  Introduction to ’flagr’ (Salvati Matteo; Mészáros Mátyás)
  An internal package for automated metadata documentation (Matthias Gomolka)
  Using R for data cleaning, integration and estimation challenges in Statistics Poland - some conclusions after VIP.ADMIN project (Beręsewicz Maciej; Pawlikowski Dawid)
  R’s Shiny package and Survey Solutions for (Active) Survey Management (Michael Wild)
Chair: Jan van der Laan

Big data, 14 Sept. 13:30, Tinbergen
  pestim - an R package to compute population estimations using mobile phone data (Bogdan Oancea; David Salgado; Luis Sanguiao; Antoniade Ciprian Alexandru)
  Evidence for the use of alternative data sources to track consumer and business confidence within emerging markets using sentiment based techniques (Hanjo Odendaal)
  coder: An R-package for fast classification of item data into groups (Erik Bülow)
Chair: Michael Wild

Dissemination and visualisation, 14 Sept. 13:30, Tinbergen
  Use of Choropleth Maps for Regional Statistics (Jillian Delaney)
  How R is improving the dissemination of statistics within the Department for Work and Pensions (Aoife O’Neill)
  Using R to access official data from the Guatemalan National Institute of Statistics (Oscar F. de León)
  Canadian Consumer Price Index (CPI) Dashboard built using R Shiny (Manolo Malaver-Vojvodic)
Chair: Patrick Bogaart

Tutorials

Fast & efficient data manipulation with data.table

By Jaap Walhout

Statistics Netherlands

Description: The data.table package is known for its speed and memory efficiency. It also has a bit of a learning curve. In this task-oriented and hands-on tutorial you will get a head start with data.table and learn the beauty of its syntax. After each explanation, you will be asked to solve a few exercises so you can internalize each concept. Subjects treated include: fast input and output of data, an introduction to data.table’s syntax (with a comparison to SQL syntax), aggregation, updating / adding variables by reference, manipulating multiple variables at once, chaining operations, powerful joins, and flexible reshaping.

Background knowledge: Participants are expected to be able to solve (simple) data manipulation tasks with base R and/or tidyverse packages. Familiarity with SQL is useful, but not necessary.

Requirements: Participants should bring their own laptops with a recent version of R (3.4+) and a recent version of data.table (1.10.4+).
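
As a taste of three of the subjects listed in the description (aggregation, updating by reference, and chaining), here is a minimal sketch. It is not taken from the tutorial material; the table `sales` and its columns are invented for illustration only.

```r
library(data.table)

# Hypothetical toy data, invented for this illustration
sales <- data.table(
  region = c("north", "north", "south", "south", "south"),
  year   = c(2017L, 2018L, 2017L, 2018L, 2018L),
  amount = c(10, 12, 7, 9, 4)
)

# Aggregation with DT[i, j, by]; comparable to the SQL statement
# SELECT region, SUM(amount) AS total FROM sales GROUP BY region
totals <- sales[, .(total = sum(amount)), by = region]

# Adding a variable by reference with := (the table is modified in place)
sales[, share := amount / sum(amount), by = region]

# Chaining: keep 2018, aggregate per region, then sort by descending total
top2018 <- sales[year == 2018L, .(total = sum(amount)), by = region][order(-total)]

print(totals)
print(top2018)
```

The general form `DT[i, j, by]` reads as "take rows `i`, compute `j`, grouped by `by`", which is where the comparison with SQL mentioned in the description comes from.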

Plotting spatial data in R

By Martijn Tennekes

Statistics Netherlands

In this workshop you will learn how to plot spatial data in R by using the tmap package. This package is an implementation of the grammar of graphics for thematic maps, and resembles the syntax of ggplot2. This package is useful for both exploration and publication of spatial data, and offers both static and interactive plotting. For those of you who are unfamiliar with spatial data in R, we will briefly introduce the fundamental packages for spatial data, which are sf and raster. With demonstrations and exercises, you will learn how to process spatial objects of various types (polygons, points, lines, rasters, and simple features), and how to plot them. Feel free to bring your own spatial data.

Besides plotting spatial data, we will also discuss the possibilities of publication. Maps created with tmap can be exported as static images or html files, but they can also be embedded in rmarkdown documents and shiny apps.

Tennekes, M., 2018, tmap: Thematic Maps in R, Journal of Statistical Software, 84(6), 1-39

So your Data is Tidy. But is it Clean?

By Edwin de Jonge and Mark van der Loo

Statistics Netherlands

Data cleaning consists of all activities necessary to make data fit for its intended statistical purpose. Typically data cleaning starts with variable extraction and structuring of data. Once data is well-structured, a sufficient level of validity and completeness must be ensured so that the outcome of statistical computations can be trusted. Validity means that the data values correspond, with reasonable certainty, to the actual values of the real-world properties they describe.

The workshop will give a systematic overview of data cleaning and data validation methods with practical use cases and exercises in R. The following topics will be treated in the workshop:

  Structured thinking about data cleaning: data quality and the statistical value chain
  Checking for errors: rule-based and reproducible data validation
  Principles of error localization and correction of errors
  Handling missing data: imputation methods in R
  Tracking changes in data and monitoring the effect of data cleaning

Literature: Mark van der Loo and Edwin de Jonge (2018). Statistical Data Cleaning with Applications in R. John Wiley & Sons.

R packages: deductive, dcmodify, errorlocate, lumberjack, validate, rspa, simputation, stringdist, tidyverse, VIM
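
To make the idea of rule-based, reproducible data validation concrete, the sketch below uses the validate package from the list above. It is an illustration only and not part of the workshop material; the data frame `persons` and the three rules are invented.

```r
library(validate)

# Hypothetical records, invented for this illustration
persons <- data.frame(
  age    = c(34, -2, 51, NA),
  income = c(2500, 1800, -100, 3000)
)

# Validation rules are written as ordinary R expressions
rules <- validator(
  age >= 0,
  age <= 120,
  income >= 0
)

# Confront the data with the rules, then count passes, fails and missings per rule
checked <- confront(persons, rules)
result  <- summary(checked)
print(result[, c("name", "items", "passes", "fails", "nNA")])
```

Because the rules live in an object separate from the data, the same rule set can be re-applied after each cleaning step to monitor its effect.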

Spatial Analysis in R with Open Geodata

By Egge-Jan Pollé and Willy Tadema

Tensing GIS Consultants, Provincie Groningen

This session will consist of a series of hands-on exercises. After short theoretical introductions the attendees will immediately be able to try it themselves. Most of the data used during this training course are available as Open (Geo)data and thus will be downloaded during the exercises themselves.

The goal of this tutorial session is to highlight two topics:

  The revolution caused by the recent development of the package sf. Today it is easier than ever before to use R as your command line GIS (Geographic Information System).
  The increasing availability of Open Data, also data with a spatial component, so: Open Geodata.

Some two or three weeks before the start of the conference all training material will be published in our GitHub repository, where it will remain available afterwards. The URL of this repository: https://github.com/TWIAV/Spatial_Analysis_in_R_with_Open_Geodata

Use of R package SamplingStrata for the Optimal Stratification of Sampling Frames for Multipurpose Sampling Surveys

By Giulio Barcaroli

ISTAT

The aim of this tutorial is to enable the participants to learn how to use the R package “SamplingStrata” in order to optimize the design of stratified samples. The package offers an approach for the determination of the best stratification of a sampling frame, the one that ensures the minimum sample cost under the condition of satisfying precision constraints in a multivariate and multi-domain case. This approach is based on the use of the genetic algorithm: each solution (i.e. a particular partition in strata of the sampling frame) is considered as an individual in a population; the fitness of all individuals is evaluated by applying the Bethel algorithm to calculate the sample size satisfying precision constraints on the target estimates.

Functions in the package allow the user to: (a) prepare necessary inputs and check their validity; (b) perform the optimization step, choosing the values of the most important parameters; (c) assign the optimized strata labels to the sampling frame; (d) select a sample from the new frame according to the best allocation; (e) test the compliance of the design with precision constraints.

The package also allows the user to consider the anticipated variance when the survey target variables are not available in the frame, but only proxy ones. A comparison to the package “stratification” (valid for univariate designs) will be illustrated. Exercises will be proposed to participants, who are expected to be acquainted with the basics of sampling theory.

Ballin M. and Barcaroli G. 2013. Joint Determination of Optimal Stratification and Sample Allocation Using Genetic Algorithm. Survey Methodology, 39: 369-393.

Barcaroli G. 2014. SamplingStrata: An R Package for the Optimization of Stratified Sampling. Journal of Statistical Software, 61(4), 1-24. (http://www.jstatsoft.org/v61/i04/)

Barcaroli G., Ballin M., Pagliuca D., Willighagen E. and Zardetto D. 2018. SamplingStrata: Optimal Stratification of Sampling Frames for Multipurpose Sampling Surveys. R package version 1.2 (https://CRAN.R-project.org/package=SamplingStrata)

Keynotes

Sample coordination and R

By Alina Matei

Institute of Statistics, University of Neuchâtel

Sample coordination seeks to create a probabilistic dependence between the selection of two or more samples drawn from the same population or from overlapping populations. There are numerous applications of sample coordination with varying objectives (for example, estimating a difference when updating a sample over time into a repeated sample, if the population changes). First, we provide an overview of selected sample coordination methods. Next, we focus on special cases of sample coordination based on the use of permanent random numbers. We show coordination for maximum entropy samples and spatially balanced samples. Links to R packages useful to implement such cases are shown.
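
The basic idea behind permanent random numbers (PRNs) can be sketched in a few lines of base R. This is a deliberately simple illustration using equal-probability sequential sampling, not the maximum entropy or spatially balanced designs covered in the talk; the population size, sample sizes, starting points and the helper `prn_sample` are all assumptions made up for this example.

```r
set.seed(2018)

N   <- 1000        # population size (made up)
prn <- runif(N)    # one permanent random number per unit, kept across occasions

# Hypothetical helper: select the n units whose PRNs come first at or after
# `start`, wrapping around at 1
prn_sample <- function(prn, n, start = 0) {
  shifted <- (prn - start) %% 1
  order(shifted)[seq_len(n)]
}

s1     <- prn_sample(prn, n = 100)               # first survey
s2_pos <- prn_sample(prn, n = 120)               # same start: positive coordination
s2_neg <- prn_sample(prn, n = 120, start = 0.5)  # shifted start: negative coordination

length(intersect(s1, s2_pos))  # 100: the first sample is contained in the second
length(intersect(s1, s2_neg))  # overlap occurs only if the two PRN ranges meet
```

Reusing the same PRNs with the same starting point maximises the overlap between the samples, while shifting the starting point pushes the second sample to a different part of the PRN range, which spreads the response burden over different units.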

The R infrastructure and Windows Build System

By Jeroen Ooms

rOpenSci group at UC Berkeley

Jeroen has written over 40 CRAN packages, many of which have become important pieces of the R ecosystem. As of last year he also builds and maintains the official R toolchain and binaries for Windows, and provides builds for dozens of C/C++ libraries through the rwinlib organization on GitHub.

This talk gives an overview of the R infrastructure and explains what is involved with building everything for Windows, as well as other platforms. We highlight some of the powerful open source C libraries that provide the foundation for the critical functionality in base R and many CRAN packages, and the challenges of keeping the rapidly evolving ecosystem current. Finally we get a preview of the new version of Rtools, which is planned to ship with the new major release of R in 2019 and to provide a modern, scalable build system for R on Windows.

Conference presentations

A Corporate Design Toolbox for R

By Thomas Lo Russo

Statistical Office of the Canton of Zurich

The Statistical Office of the Canton of Zurich, the regional statistical institute of the most populous Swiss canton, is a relatively small organization (30 people). We do not rely on a centralized professional publishing and layout process. We are a group of ‘self-publishers’ creating infographics, reports and a broad variety of publications on a regular basis.

To achieve visual uniformity of our products, we have created a ‘Do it yourself’ Corporate Design Toolbox, which allows us to generate nicely formatted charts, Excel tables and .pdf reports straight out of R. Thanks to this toolbox we have become much more efficient in generating ad-hoc as well as automated outputs ready for publication. As a next step we are planning to include templates for html pages created with .Rmd as well as Shiny.

GitHub repo: https://github.com/statistikZH/statR

A First Step towards Statistical Disclosure Control on Multiple Tables Under the Presence of Differential Attacks

By Kazuhiro Minami and Yutaka Abe

Institute of Statistical Mathematics / National Statistics Center, National Statistics Center

Performing statistical disclosure control (SDC) on multiple tables is a challenging task because sensitive information can be revealed from intersections of multiple tables involving a common set of variables. This task is particularly difficult when each table contains a subset of the common variables because the intersection of those tables corresponds to a subspace of any shape in the multi-dimensional variable domain.

To address this issue, we extend our SDC tool for solving the cell suppression problem of two-dimensional tables to support multi-dimensional ones. Our approach is to take multiple tables as inputs and construct a single common fine-grained multi-dimensional table so that we can represent the constraints of each input table in an integrated way. Our tool detects possibly sensitive cells, which could be overlooked when the tables are examined separately, by solving a cell suppression problem on the multi-dimensional table.

In this paper, we describe the design and implementation of the new SDC tool and show preliminary experimental results.

Keywords: statistical disclosure control, cell suppression problems, differential attacks

Alternative to LaTeX for high quality report generation with rmarkdown

By Romain Lesur

Ministère de la Justice

Reproducible document generation has become an easy task thanks to tools such as rmarkdown. In order to generate paged documents (pdf), the usual technology stack relies on LaTeX. While standard LaTeX reports have high quality typesetting, their academic look and feel is not always well suited to public communication. Moreover, if statistical reports have to comply with a corporate design, the required customisations can become intractable: customising a LaTeX document is a highly technical task that needs very specific skills.

This communication introduces alternative rendering tools that can be used to produce high quality paged documents with R. These tools (wkhtmltopdf, weasyprint, prince) convert HTML documents to pdf based on a W3C standard named CSS Paged Media (this can be understood as an extension of CSS for print purposes). Since pandoc 2.0, these pdf engines can seamlessly replace LaTeX in a reproducible workflow with R.

This communication presents the basics of HTML paged document customisation with CSS and shows that the required skills are the same as the ones needed for website customisation. Therefore, an organization possessing HTML styling skills can develop high quality report templates for rmarkdown.

In order to facilitate the use of this new technology stack, a package in development (https://github.com/RLesur/weasydoc) and a docker image (https://hub.docker.com/r/rlesur/weasydoc) are presented.

Keywords (IEEE taxonomy): Reproducibility of results, Publishing

An all-in-one R application for validating model assumptions in linear regression analysis with visualizations

By Joy Chioma Nwabueze and Chisimkwuo John

Michael Okpara University of Agriculture

Linear regression models are less credible and rarely applicable when assumptions underlying their usage are violated. In this paper, we developed an interactive, easy-to-use and teachable all-in-one R package to validate the linear regression assumptions for general users. Our approach was based on the application of R in the dissemination of statistics. A merit of this approach was that guidelines with proper explanations regarding each of the justifications were added as footnotes on the validated results, making it possible for both statisticians and non-statisticians to use the R package. Other benefits were that the application reported explanations on how to rectify failed assumptions, and also added the necessary graphical justifications alongside the test of hypothesis results. It therefore served as a handy R tool for fast and free usage in validating linear regression model assumptions and applications. Use of empirical data showed the viability of the all-in-one R application.

Keywords: Model assumptions; Linear regression analysis; R statistical package; R graphs; Failed assumptions

An internal package for automated metadata documentation

By Matthias Gomolka

Deutsche Bundesbank

At the Research Data and Service Centre (RDSC) of the Deutsche Bundesbank, we provide microdata for research purposes. In order to make those datasets intellectually accessible for researchers, each dataset is accompanied by a document containing relevant metadata. This document, which we call ’Data Report’, is a pdf file generated by LuaLaTeX using a custom document class. The contents of a Data Report are created within the RDSC as well as in other departments. All information from other departments comes in a standardized spreadsheet. This is where the internal package ‘DataReportR’ comes into play.

Using DataReportR, we can (1) populate a whole LaTeX template with the qualitative information from the other departments, (2) create tons of small LaTeX tables containing information on each variable, such as a detailed description, the type of the variable, the range of availability and others, in no time, and (3) easily generate code lists as LaTeX tables. This dramatically reduces the amount of time spent copying and pasting text or fiddling with LaTeX tables. Also, this makes sure that each Data Report sticks to the corporate design rules. To wrap it up, DataReportR automates all repetitive tasks in the generation of a Data Report.

Canadian Consumer Price Index (CPI) Dashboard built using R Shiny

By Manolo Malaver-Vojvodic

Statistics Canada

One of Statistics Canada’s most important programs is the Consumer Price Index (CPI), an indicator that measures the changes in consumer prices experienced by Canadians. I would like to present an interactive data visualization application for the Canadian CPI at the 6th Conference on the Use of R in Official Statistics. My presentation will have two main goals. Firstly, I will highlight how we use R and RStudio at Statistics Canada to develop a dynamic dashboard that enhances the utility of the publicly available datasets of the Canadian CPI by giving users access to a variety of tools within an easily navigable user interface. Secondly, I will aim to showcase how the shiny, shinydashboard and plotly packages allow for a personalized customization of the application’s user interface, providing a professional interactive environment for displaying adjustable user-input controls through hover plots. I invite you to try our application, which is being temporarily hosted at https://kunov.shinyapps.io/(*), and to explore our complete code and datasets, which are available at https://github.com/manolo20/. We hope that the presentation of our CPI dashboard at the 6th Conference on the Use of R in Official Statistics will encourage other statistical agencies to continue to implement and experiment with new methods and technologies in the development of innovative statistical tools in order to improve the capacity of their socio-economic analysis.

(*) Note: Your national statistical office might restrict the access to this website. Users can also try our application at http://18.191.111.211:3838/cpi_dashboard_StatCan/ or contact me if there are any issues accessing the web application (manolo. [email protected]).

coder: An R-package for fast classification of item data into groups

By Erik Bülow

The Swedish Hip Arthroplasty Register, Department of Orthopaedics, Institute of clinical sciences at the Sahlgrenska academy of University of Gothenburg, Sweden

The coder package is used to classify items from one dataset using code data from a secondary source. It was first developed for medical comorbidity data based on hospital visits recorded in a national patient register. Medical conditions were registered using standardized codes (ICD-10), and individual codes were summarized by weighting all individual comorbidities for each patient (the Charlson and Elixhauser comorbidity indices). Only hospital visits recorded within a specified time frame, relative to individual reference dates from the primary data source, were recognized as relevant.

The large data sets, as well as the complexity of the classification schemes, make those calculations time consuming. A naïve approach using bare code comparisons and for-loops in R took approximately 16 hours to run on a laptop computer with 16 GB of RAM. We were then able to reformulate the coding schemes using regular expressions, and we optimized our package using the data.table package. The classification time was then reduced to a number of seconds. We also compared the coder package to two packages on CRAN with similar scope, icd and comorbidities.icd10. Our package was 6 and 180 times faster than those, respectively.

The coder package does not only include classification schemes for comorbidity data. It incorporates a general framework for any case where items can be classified using standardized codes. It might therefore be relevant for many tasks involving data management in official statistics using big data.
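
The combination of regular-expression coding schemes and a time window around a reference date can be sketched as follows. This is a Python illustration of the general idea only, not the coder package's interface; the two-group scheme, the one-year look-back window and the toy data are assumptions made for the example and do not form a validated comorbidity index.

```python
import re
from datetime import date, timedelta

# Hypothetical scheme: group name -> regular expression on ICD-10 codes.
SCHEME = {
    "myocardial infarction": re.compile(r"^I2[12]"),
    "diabetes": re.compile(r"^E1[0-4]"),
}


def classify(patients, visits, window_days=365):
    """Return {patient_id: set of groups}.

    patients: {patient_id: reference_date} from the primary data source.
    visits:   iterable of (patient_id, visit_date, icd10_code).
    Only visits in the window_days before the reference date are used
    (the direction and length of the window are assumptions).
    """
    result = {pid: set() for pid in patients}
    for pid, visit_date, code in visits:
        reference = patients.get(pid)
        if reference is None:
            continue  # visit belongs to nobody in the primary data
        if not (reference - timedelta(days=window_days) <= visit_date <= reference):
            continue  # outside the relevant time frame
        for group, pattern in SCHEME.items():
            if pattern.match(code):
                result[pid].add(group)
    return result


patients = {1: date(2018, 1, 1), 2: date(2018, 6, 1)}
visits = [
    (1, date(2017, 6, 15), "I219"),  # inside the window, matches a group
    (1, date(2015, 1, 1), "E119"),   # matches a group but is too old
    (2, date(2018, 5, 1), "E109"),   # inside the window, matches a group
    (2, date(2018, 5, 2), "J189"),   # inside the window, matches no group
]
groups = classify(patients, visits)
print(groups)
```

One pattern per group replaces a long list of explicit code comparisons, which is the kind of reformulation the abstract credits, together with data.table, for the speed-up.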

Combining JDemetra+ and R for Analysing and Visualising Time Series in Official Statistics

By Atanaska Nikolova

Office for National Statistics

JDemetra+ is free, platform-independent and open source software with various capabilities for time series analysis. The software has been developed by the National Bank of Belgium in cooperation with Deutsche Bundesbank and Eurostat, and is officially recommended by the European Central Bank and Eurostat for the seasonal adjustment of time series in official statistics. The software is written in Java and can be called through R with the “rJava” package, and soon with the newly developed package “rJDemetra”. Subsequently, various plots can be generated and integrated into dynamic documents or interactive applications.

This presentation will show examples of seasonal adjustment of official time series performed in R by calling JDemetra+. The outputs from the analysis can then be used in an interactive Shiny application where users can choose to explore various elements of the time series, such as seasonal patterns or holiday effects, across different time spans. Other capabilities, such as time series benchmarking, temporal disaggregation and outlier detection, are also available.

Comparison of multivariate outlier detection methods for nearly elliptically distributed data

By Kazumi Wada, Mariko Kawano and Hiroe Tsubaki

National Statistics Center

Multivariate outlier detection methods have been discussed in the field of official statistics for more than a decade, but they may not be widely used yet. This is because multivariate methods are often computationally burdensome, and thus it is more difficult to evaluate their outcome. Furthermore, the outliers detected may vary between methods, and we may not be able to determine which ones are absolute outliers.

We evaluate a few promising methods, such as blocked adaptive computationally efficient outlier nominators (BACON), nearest-neighbor variance estimation (NNVE), and modified Stahel-Donoho (MSD) estimators, using a variety of asymmetrically contaminated datasets. These methods were selected for their diversity of methodology, since there is no best method for all situations. The purpose of the simulation study is to illustrate the differences in their tolerance in various situations. Application in survey data processing is also discussed.

Development of R Shiny Dashboard on Pattern and Characteristics of Tuberculosis Incidence in Malaysia

By Kamarul Ariffin Mansor, Nurhuda Ismail, Asmahani Nayan and Abd Razak Ahmad

Universiti Teknologi MARA

Tuberculosis (TB) is an infectious disease which normally spreads from one person to another through the air. It is usually caused by a bacterium, Mycobacterium tuberculosis, that normally affects a person's lungs. Nowadays, TB is one of the top 10 causes of death and remains a major problem in low- and middle-income countries, including developing countries like Malaysia. Even though TB is curable and preventable, the reported incidence and the number of deaths caused by TB are still growing worldwide. The same scenario is happening in Malaysia, which calls for proactive action to foresee the pattern and characteristics of TB in Malaysia so that action can be taken to prevent and cure the disease before it spreads on an even bigger scale. Thus, this paper describes the development of an R Shiny dashboard for TB incidence in Malaysia. The dashboard will have features with which users can not only visualize interactively the patterns and characteristics of TB cases in Malaysia, but also run a simple guided logistic regression and survival analysis on TB incidence cases in Malaysia. The dashboard will also be able to automatically update the information and analysis when new data is fed to the original data file. The dashboard will help those who work directly on TB prevention and cure by providing a complete picture of TB in Malaysia, which will help guide them in planning and executing appropriate preventive action.

Easily translatable Shiny applications

By Matjaž Jeran

Bank of Slovenia

The Euro system of central banks is in transformation from one official language only to a multilingual system. Staff in the central bank are now recruited not only from the member country but from other countries as well. A software product of one central bank or statistical institute may be used in other similar institutions in Europe, but it is expected to be translated into the local language. These two reasons motivate developers to build software products that are multilingual, including Shiny applications.

This paper gives a set of simple recipes for making Shiny applications multilingual and easily translatable. The paper starts with a simple program (Hello world) and its transformation from a version completely dependent on one natural language into one that can be translated with a simple translation tool, a text editor, readily available on any platform.

The paper also deals with input and output data, determining other cultural features such as support for specific characters and collating sequence. Support for these features is included in the operating system and R. How to display numeric values within a Shiny application is a task left to the Shiny application developer.

The paper also shows the most important snippets of code of some easily translatable Shiny applications used within the Bank of Slovenia.

Keywords: R software, Shiny, GUI, internationalization, localization
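
A common recipe for this kind of internationalization is to replace every hard-coded string with a lookup key and to keep the texts in a per-language table that a translator can edit. The sketch below shows that idea in Python; it is not the paper's code, and the keys, the Slovenian texts and the separator table are illustrative assumptions.

```python
# Hypothetical translation table; in practice it could live in a plain
# text file so that a translator only needs a text editor.
TRANSLATIONS = {
    "en": {"greeting": "Hello world", "total": "Total"},
    "sl": {"greeting": "Pozdravljen svet", "total": "Skupaj"},
}

# Thousands and decimal separators per language (assumed conventions).
SEPARATORS = {"en": (",", "."), "sl": (".", ",")}


def tr(key, lang, fallback="en"):
    """Look up a user-interface text, falling back to the default language."""
    table = TRANSLATIONS.get(lang, {})
    if key in table:
        return table[key]
    return TRANSLATIONS[fallback][key]


def format_number(value, lang):
    """Format a number with two decimals using the language's separators."""
    thousands, decimal = SEPARATORS[lang]
    text = f"{value:,.2f}"  # e.g. "1,234.50"
    # Swap separators via a placeholder so the two replacements do not collide.
    return text.replace(",", "#").replace(".", decimal).replace("#", thousands)


print(tr("greeting", "sl"))
print(tr("total", "sl"), format_number(1234.5, "sl"))
```

Adding a language then means adding one entry to each table, with no change to the application logic.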

Easy Bootstrapping for Rotational Surveys with ’surveysd’

By Johannes Gussenbauer, Alexander Kowarik and Matthias Till

Statistics Austria

With the R package surveysd, we present a package for estimating standard errors for yearly surveys with and without rotating panel design. Complex survey designs, e.g. multistage sampling, are supported and can be freely defined. The implemented method for standard error estimation uses bootstrapping techniques for multiple consecutive waves of the survey. It is possible to compute an estimate based on several years, which leads to a reduction in the standard error, especially for estimates on a subgroup of the survey data. The package enables the user to compute point estimates as well as their standard errors on arbitrary subgroups of his/her data. The applied point estimate can also be chosen freely, with some predefined point estimates ready to use. Finally, the results can be visualized in two different ways to gain quick insight into the quality of the estimated results.

Errorlocate: finding errors in data

By Edwin de Jonge

Statistics Netherlands

An important but undervalued activity in statistical analysis is data cleaning. No measurement is perfect, so data often contain errors. Obvious errors, e.g. negative ages, are easily detected, but errors in observations whose variables are logically related, e.g. marital status and age, are trickier to find. The R package errorlocate allows for pinpointing errors in observations using the Fellegi-Holt algorithm and validation rules from the R package validate. The errors can automatically be removed using a pipeline syntax.
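
The Fellegi-Holt principle is to find the smallest set of fields that can be changed so that a record satisfies all validation rules. A brute-force Python sketch of the unweighted principle on a toy categorical record is given below; it is not how errorlocate is implemented or called, and the domains, rules and record are hypothetical.

```python
from itertools import combinations, product

# Hypothetical categorical domains for a toy record.
DOMAINS = {
    "age_group": ["child", "adult"],
    "marital_status": ["single", "married"],
    "employed": ["no", "yes"],
}

# Hypothetical validation rules; each returns True when the record is valid.
RULES = [
    lambda r: not (r["age_group"] == "child" and r["marital_status"] == "married"),
    lambda r: not (r["age_group"] == "child" and r["employed"] == "yes"),
]


def satisfies_all(record):
    return all(rule(record) for rule in RULES)


def locate_errors(record):
    """Return a smallest set of fields whose values can be replaced so that
    every rule holds (exhaustive search, feasible only for tiny examples)."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            # Try every combination of domain values for the chosen fields.
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if satisfies_all(candidate):
                    return set(subset)
    return None  # the rules cannot be satisfied at all


record = {"age_group": "child", "marital_status": "married", "employed": "yes"}
suspect = locate_errors(record)
print(suspect)
```

In this record both rules fail, and changing the age group alone repairs both, whereas changing marital status or employment alone repairs only one. This is why the age group is flagged as the most likely error.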

Estimating Differential Mortality from EU-SILC UDB Longitudinal Data

By Tobias Göllner, Johannes Klotz

Statistics Austria

Socio-economic differences in mortality have become increasingly important in an era of pension reforms. Some European countries cannot provide any figures on the subject, and available figures are not easily comparable between countries because of different data sources, time periods and stratification variables. As part of the FACTAGE project, we developed a new and relatively easy approach to obtain comparative European figures based on harmonized survey sample data. Longitudinal information on persons and households from the European Survey on Income and Living Conditions (EU-SILC) is extracted from Eurostat’s User Database (UDB), which is available to researchers carrying out statistical analyses for scientific purposes.

Initially this method was implemented in SAS, but it was subsequently translated into R to increase the potential user base. The R code consists of functions and raw code that form a crude pipeline of work. The method allows for differential mortality estimations by different variables, as long as they are part of EU-SILC’s harmonized target variables. While our analyses are from a comparative European perspective, the method is in principle also usable for single-country analyses or small groups of countries.

The presentation will show results of differential mortality estimations using our method and then focus on the implementation in R. This will open the discussion on how the implementation can be improved from both an R programmer's and a social scientist's point of view.

Evaluation of estimation methods for a new survey of the UK’s Office for National Statistics (ONS) using R

By Konstantinos Soulanis

Office for National Statistics

The Annual Survey of Goods and Services (ASGS) is a new survey measuring the detailed annual product turnover of UK service sector businesses. Originally it was envisioned that ASGS estimates would be calculated using a traditional design-based expansion estimator. But once the data had been collected, a model-based conditional ratio estimator, as used for the ProdCom survey in the UK, was also considered. This presentation compares these two methods and describes how they were evaluated to determine which is more efficient for ASGS. As the data cover about 2,000 service products from 40,000 businesses across 282 different industry classes, R was considered an appropriate tool to deal with the complex calculations and visualisation of the results. This involved the use of several well-known R libraries, especially the “tidyverse”. The outcome of the investigation was that the model-based estimator gave superior results compared to the expansion estimator, especially for larger sample sizes.

Evidence for the use of alternative data sources to track consumer and business confidence within emerging markets using sentiment based techniques

By Hanjo Odenaal

Bureau of Economic Research of South Africa

Recent case studies on the construction of consumer confidence indices using online media data have started appearing in the literature. These aptly named ‘sentiment indices’ are constructed using text-based analysis. The obvious advantage of text-based measures of economic tracking is the coverage and cost aspect of these alternative surveying methods. We attempt to gauge the feasibility of constructing online sentiment indices using large amounts of text data as an alternative to the conventional survey method.

This paper adopts a quantitative framework to provide an indication of candidate sentiment indices that best reflect the traditional survey-based confidence indices conducted by the BER. The results indicate that for the consumer confidence index (CCI), the best candidate indices are constructed from News24 data using specialised financial dictionaries, while for the business confidence index (BCI), the Financial Mail data provided good candidate indices. Finally, two composite indices are constructed using PCA. The resulting indices provide strong evidence for the use of alternative data sources to track consumer and business confidence within an emerging market such as South Africa using sentiment-based techniques.

Experiences in the migration to RStudio-Server in Statistics Austria

By Bernhard Meindl and Alexander Kowarik

Statistics Austria

In this contribution we discuss the experiences we have made in the last year in migrating the R users within Statistics Austria. In 2018 we started to move selected users from RStudio Desktop to RStudio Server. To minimize the impact of changes, we set up a two-phase infrastructure consisting of a testing and a production environment. In order to properly scale the servers, we aimed for a smooth transition rather than trying to migrate all users at once. Thus, monitoring the workload on the servers was an issue that we kept in mind. Furthermore, we created a new training course aimed specifically at our existing R users who wanted to move to the server infrastructure. In the process, we have developed several new R packages for internal use (for example to connect to databases or to mount shares) that help our staff in the transition phase.

We give an overview of the issues that we have encountered and how we tried to tackle them, give insights into the remaining problems that need to be solved, and outline future plans that include using RStudio Connect, which may change the way we operate, collaborate and distribute our work.

From challenges to opportunities: The Romanian Case of Use R in Official Statistics

By Nicoleta Caragea, Ana-Maria Ciuhu and Raluca Mariana Dragoescu

National Institute of Statistics, Ecological University of Bucharest, Institute of National Economy, Romanian Academy

This presentation is two-fold. One part concerns the organizational and technical aspects of introducing R to the statistical office at the Romanian NIS, and the other focuses on teaching R to users in the office.

At the Romanian NIS, R is used mostly in social statistics. Current developments in the use of R include, but are not limited to: managing databases from statistical and administrative sources (data cleaning, outlier detection and treatment, data validation), data matching, record linkage, imputation, data hashing and anonymization techniques, small area estimation (international migration and poverty indicators), and Big Data (Romania is part of the ESSnet big data project).

Training of statisticians in the office was initially delivered through one-week courses. Currently, teaching R to users in the office consists of six months of on-the-job training for groups of ten people. This permits targeted training, with applications drawn from daily activities.

We conclude with a series of proposals on future research opportunities and other potential analysis procedures with R in social statistics.

How R is improving the dissemination of statistics within the Department for Work and Pensions

By Aoife O’Neill

Department for Work and Pensions

We would like to present how R is improving the production of statistics within the Department for Work and Pensions (DWP), the UK’s biggest government department. DWP administers pension and working-age benefits, as well as disability and health-related benefits, to around 20 million customers, amounting to over €200 billion in expenditure. The statistics published by DWP are used by a range of customers, such as local government, who rely on our statistics when planning how they provide services. With such a large responsibility, it is important that accurate and reliable statistics are produced by the Department.

This talk will cover how R Markdown has improved the way we produce statistics and how we have benefited from work conducted in other UK government departments. Using R to automate the creation of our statistical summaries means that more resource can be given to improving and developing our statistics, which is serendipitous given the changes being made to DWP’s services. Matt Upson, a former UK civil servant, highlighted how, unlike in academia where there are numerous journal articles on a topic, official statistics are often “the single source of truth”. This is especially true in DWP, where the department is responsible for deploying its resources as well as developing policy, meaning that it is the source of statistics on the services it provides. This work makes use of the R package “govstyle”, which applies the formats used in our statistical summaries, improving the efficiency of re-creating them in R across government.

How the Scottish Government is moving towards R

By Victoria Avila

National Records of Scotland

In this talk I will present the steps the Scottish Government and other Scottish public sector organisations are taking to adopt R for their analytical needs, driven by the promise of reproducible data analysis pipelines, powerful visualisations and lower costs. I will outline the main barriers to introducing R in government organisations, from IT culture to the career progression of statisticians, and how the Scottish Government is overcoming them. I will present some tangible results, how far we have got and our future plans.

Interactive data visualization web-based application using R-Shiny

By Houssam Hachimi

High Commission for Planning-National Accounts Department

Official statistics data are commonly considered to be of high quality. In order to enhance the utility of these statistics, we have to present them in forms that are easy for the user to interpret by using the power of data visualization and its tools. For this purpose, this paper presents the implementation of an interactive data visualization web application entirely in R using Shiny, a web application framework for R, which makes it easy to adapt R scripts into user-friendly Shiny applications and allows users to access data through an interactive web browser interface. The High Commission for Planning (HCP) in Morocco is one of many National Statistical Organizations that have been researching and developing different tools and methods to present their data in a more attractive format. Thus, the National Accounts department of the HCP has led this initiative and has developed this data visualization project using National Accounts data.

Keywords: Official Statistics, Data Visualization, R Shiny, Web Application.

Introducing R at Statistics Denmark – a not entirely completed how-to

By Peter Tibert Stoltze & other co-authors (to be added)

Statistics Denmark

For many years, the production of official statistics at Statistics Denmark has been based on a mix of different technologies, predominantly SAS and Excel. The development of the statistical production system has been managed locally, and technology has often been a matter of personal preference.

A partial migration of our production systems to R is desirable for several reasons: many initiatives within the European Statistical System have provided high-quality and easy-to-use R packages; new employees are typically well-versed in R rather than SAS; improved efficiency is necessary in order to invest in other areas, e.g. Big Data; and the current code legacy is not sustainable.

The introduction of R will be accompanied by standardization initiatives based on the Generic Statistical Business Process Model. Among other things, this means that the decomposition of the entire production process into sub-processes will be reflected in the organization of the source code. This will allow greater reuse of code for specific processes across statistical domains and thus increase efficiency, e.g. in the methodology department.

We are currently developing a strategy for introducing R, which has consequences beyond the choice of a preferred technology. Among other things, the strategy will make use of a version management system (e.g. Subversion or Git) mandatory for all source code. We also plan to distinguish between development or experimental code and peer-reviewed production-grade code.

Finally, we put forward the hypothesis that introducing what at first seems like a rigid framework actually allows for quicker adaptation to changes, e.g. because the expected input and the generated output are both well defined. It also allows for an appropriate choice of tools for each process, e.g. Python rather than R, or even reuse of existing SAS code.

Introduction to ’flagr’

By Salvati Matteo, Eurostat and Mészáros Mátyás

Eurostat

The object of this paper is to present the R package ’flagr’, which is in development at Eurostat to facilitate the internal revision of the use of flags and the flagging of aggregates in dissemination. The ’flagr’ package provides general functions following the methodological guidelines suggested by the SDMX for the aggregate. The package provides three different methods by which the individual flags can be transferred to the aggregate. The first one is the hierarchy of the SDMX flags suggested by the implementation guidelines. This method compares all flags of a given dataset and keeps for the aggregate the flag with the highest score in the SDMX hierarchy or in a personally specified order. The second method counts the occurrences of the flags in the underlying data, and the flag for the aggregate will be the flag that has the highest count. The last method not only counts how frequently a flag occurs in the dataset, but also takes into account the weight of the individual values, as the contribution of the corresponding individual value to the aggregate. The flag that has the highest summed weight is used as the flag of the aggregate if it is above a certain threshold.
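
The three ways of deriving an aggregate's flag can be sketched in a few lines. The Python functions below illustrate the logic described above and are not the flagr interface; the function names, the flag order, the threshold and the example values are assumptions made for the illustration.

```python
from collections import Counter, defaultdict


def flag_by_hierarchy(flags, order):
    """Keep the flag ranked highest in `order` (earlier means higher)."""
    rank = {flag: position for position, flag in enumerate(order)}
    return min(flags, key=lambda flag: rank[flag])


def flag_by_frequency(flags):
    """Keep the flag that occurs most often among the components."""
    return Counter(flags).most_common(1)[0][0]


def flag_by_weight(flags, weights, threshold=0.5):
    """Keep the flag with the largest summed weight, provided its share of
    the total weight exceeds `threshold`; otherwise return None."""
    totals = defaultdict(float)
    for flag, weight in zip(flags, weights):
        totals[flag] += weight
    best = max(totals, key=totals.get)
    share = totals[best] / sum(weights)
    return best if share > threshold else None


# Four component values of one aggregate with their flags; the weights are
# each value's contribution to the aggregate (hypothetical numbers).
flags = ["e", "p", "e", "b"]
weights = [10.0, 70.0, 5.0, 15.0]
order = ["b", "p", "e"]  # hypothetical hierarchy, highest first

print(flag_by_hierarchy(flags, order))
print(flag_by_frequency(flags))
print(flag_by_weight(flags, weights))
```

On this toy input the three methods disagree: the hierarchy picks "b", the frequency count picks "e", and the weighted method picks "p" because it accounts for 70% of the total. This shows why the choice of method matters.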

Investigating Chaos in Time Series: Evidence from the Cryptocurrency Market

By Sami Diaf

Information Systems and Machine Learning Lab (ISMLL), Universität Hildesheim

This paper investigates the statistical properties of the five main cryptocurrencies by market capitalization, since 2013, to determine the nature of their embedded patterns. Irrespective of the current debate surrounding the cryptocurrency market and the Fintech trends, the results clearly indicate a higher variability of these series coupled with the presence of chaotic patterns. Hence, the use of advanced statistical techniques proved to be crucial to shed light on the underlying behavior and the randomness of the data-generating process. In particular, this work explores the nature of the memory affecting the data using fractal analysis, and inspects its predictability by computing the maximum Lyapunov exponent to identify possible chaotic patterns. This analysis informs central bankers and helps financial regulators gauge the risks surrounding the development of cryptocurrencies.

R packages used: crypto, fractal, pracma, nonlinearTseries, tseriesChaos, coinmarketcapr, , ggplot2.

Lack-of-fit testing without replicates available – a modern clustering approach

By Tyler George

Department of Mathematics, Central Michigan University

The classical lack-of-fit (LOF) test for checking the linearity assumption of a linear regression model requires replicates in the predictors. Replicates are often not available in high-dimensional data or when data collection is unstructured, a common occurrence with big data. A new R package called LOFnorep is presented to test for LOF in a linear regression model when replicates are not available. This package utilizes a data-grouping methodology entailing three algorithmic steps: first choosing a grouping method to group the data, then fitting linear regression models to each group, and finally comparing these fits. The developed R package uses clustering functions, including those found in the mclust R package, to form its groups. The new algorithm performs better than some existing tests in simulated power studies. It is also more robust when testing a wide variety of data structures, such as sinusoidal, quadratic, and cubic. The results indicate that the new algorithm has higher or equal power in all these data configurations. Additionally, it is applicable to high-dimensional and big data.

Macroeconomic Statistical Forecasting for Engine Demand

By Ankit Kamboj, Debojyoti Samadder and Ambica Rajagopal

Cummins India Ltd.

Forecasting demand is a critical issue for driving efficient operations in a manufacturing firm. For this reason, firms are concerned with planning their operations and strive to improve their forecasting methods to gain an edge over competitors in the market. The purpose of this paper is to evaluate various shrinkage methods for data containing large numbers of features. Here we focus on the Class 8 Group 2 North America Heavy Duty (NAHD) market and macroeconomic indicators from the ACT Research economic database to forecast engine shipments a full 3 months out. Various pre-processing techniques were applied to all the variables, which were then decomposed by applying Seasonal and Trend decomposition using Loess (STL) into their components (trend, seasonality and remainder). Then, for each pre-processing technique, the decomposition was analysed visually. After this, the relative significance of the variance associated with each decomposed component was used to select the appropriate pre-processing technique for all the variables in order to ensure their stationarity for reliable forecasting accuracy. We applied several statistical as well as machine learning methods and built an ensemble of them to minimize forecasting error. It was also noticed that there was hardly any increase in accuracy when the number of features was increased beyond 25-30. The following are a few important R packages that were used in our analysis: forecast, forecastHybrid, tseries, readxl, xts, quantmod, e1071, lars.

Keywords: Box-Cox Transformation, Stationarity, STL Decomposition, Least Angle Regression, Shrinkage, Lasso, Support Vector Regression

Optimal Boundary Value for Creating Anonymized Microdata: Empirical Analysis based on Economic Survey Data

By Yutaka Abe, Kiyomi Shirakawa and Hitotsubashi Ryota Chiba

Hitotsubashi University / National Statistics Center, Hitotsubashi University, National Statistics Center

When a ministry or agency of Japan creates anonymized data from official statistics, it often uses top (or bottom) coding to anonymize variables in questionnaire information. Because the ministry or agency does not know in advance how researchers will use the anonymized data, it does not apply data transformations such as a logarithmic transformation before the top coding. Researchers, on the other hand, often transform data before their analysis, especially when analyzing economic survey data. Therefore, we examined the influence of top coding on linear regression in order to find a suitable top code (upper limit value) for a variable.

We used questionnaire information from the “2015 Survey of Research and Development”. The survey is designed to provide basic materials for promoting science and technology in Japan by studying the research and development (R&D) activities carried out in Japan.

We developed an R program to estimate the lower bound of the suitable top code, to provide guidance on top coding for official anonymized data. First, this program searches for borders between groups in a given dataset using a decision tree method and sets them as possible options for the top code’s lower bound; it then selects the suitable bound for which the AIC of the Chow test is the smallest. We stratified the questionnaire information and the data which, after the top coding, fulfilled the selected bound, and transformed each stratum. Thereafter, we verified the information loss from the top coding in each stratum by comparing an estimated linear regression on the distribution truncated by top coding with a linear regression on the original distribution.
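
A small numerical sketch shows why top coding interacts with a later logarithmic transformation. It is written in Python with synthetic data and does not reproduce the authors' program (no decision tree, AIC or Chow test is involved). The exactly log-linear data, the limit of 20 and the choice to top-code the response variable are assumptions.

```python
import math


def top_code(values, limit):
    """Replace every value above `limit` by `limit`."""
    return [min(value, limit) for value in values]


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Synthetic data: log(y) is exactly linear in x with slope 0.5.
x = list(range(1, 9))
y = [math.exp(0.5 * value) for value in x]

coded = top_code(y, 20.0)  # the three largest values are capped

slope_original = ols_slope(x, [math.log(value) for value in y])
slope_coded = ols_slope(x, [math.log(value) for value in coded])
print(slope_original, slope_coded)
```

The regression on the log of the top-coded values has a visibly flatter slope than the one on the original values. A gap of this kind is what a comparison of regressions before and after top coding is meant to quantify.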

Overlapping classification for autocoding system

By Yukako Toko, Shinya Iijima and Mika Sato-Ilic

National Statistics Center, Research and Development Division, Faculty of Engineering, Information and Systems, University of Tsukuba

Coding is the classification of objects based on given classification codes, and it is often required in the field of official statistics. In our previous study, we developed a supervised multiclass classifier for autocoding that was implemented in R. The purpose of our study is to apply this classifier efficiently to the coding task of the Family Income and Expenditure Survey in Japan. Although the developed classifier has high accuracy for the autocoding task, some objects with ambiguous input information are still assigned incorrect codes.

In analyzing the incorrectly classified data, we found that a classification code could not be uniquely determined for some objects because of semantic problems, interpretation problems, and insufficiently detailed input information. These issues suggest that there is a need for a classifier that lists multiple candidates for coding tasks. We propose a new classifier that lists multiple candidates in descending order of the degree of reliability in the output data and assists experts in selecting a correct code from the listed candidates. Also, by use of a new reliability score based on weights of entropy, the accuracy and practicability of the proposed classifier are improved, while the advantages of structural simplicity of the algorithm and practical calculation time remain unchanged. The proposed algorithm is implemented in R to improve its versatility.

During our presentation, we will present the details of the algorithm of the new classifier, including an illustrative numerical example with survey data.

Keywords: Coding, Machine learning, Overlapping classification
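
The idea of returning several candidate codes in descending order of a reliability score can be illustrated with a deliberately simple sketch. The abstract does not give the authors' scoring formula, so the entropy-based word weight used here (1 minus normalized entropy) is an assumption, as are the function names and the toy training data; the code is in Python rather than R.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: iterable of (list_of_words, code).
    Returns a mapping word -> Counter of the codes it was seen with."""
    counts = defaultdict(Counter)
    for words, code in examples:
        for word in words:
            counts[word][code] += 1
    return counts


def entropy_weight(code_counts, n_codes):
    """1.0 for a word seen with a single code, 0.0 for a word spread
    evenly over all codes (assumed weighting, not the authors' score)."""
    if n_codes < 2:
        return 1.0
    total = sum(code_counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in code_counts.values())
    return 1.0 - entropy / math.log(n_codes)


def candidates(words, counts, n_codes, top=3):
    """Return up to `top` (code, score) pairs, best first."""
    scores = Counter()
    for word in words:
        if word not in counts:
            continue  # unseen words carry no information
        total = sum(counts[word].values())
        weight = entropy_weight(counts[word], n_codes)
        for code, count in counts[word].items():
            scores[code] += weight * count / total
    return scores.most_common(top)


# Toy training data with three hypothetical codes.
examples = [
    (["fresh", "milk"], "dairy"),
    (["milk", "chocolate"], "sweets"),
    (["chocolate", "bar"], "sweets"),
    (["fresh", "bread"], "bakery"),
]
counts = train(examples)
ranked = candidates(["milk", "chocolate"], counts, n_codes=3)
print(ranked)
```

For the ambiguous input "milk chocolate" the sketch lists "sweets" first and "dairy" second instead of forcing a single answer, leaving the final choice to an expert.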


R packages for optimal stratified sampling: a review and compared evaluation

By Marco Ballin and Giulio Barcaroli

Italian National Institute of Statistics

“SamplingStrata” and “stratification” are two R packages for determining the optimal stratification, ensuring the best combination of expected precision levels and sample size. In this presentation they are compared, in particular with respect to the efficiency of their solutions, but also with respect to their coverage of the overall sampling process.

Package “stratification” operates only in univariate cases, where one continuous variable Y is considered as the target. Various algorithms are available to optimize stratification and allocation, minimizing the total sample size while satisfying a given precision constraint under a given number of strata. Y can coincide with the X variable available in the frame; otherwise the package allows a model between X and Y to be introduced so that the related anticipated variance is taken into account.

“SamplingStrata” handles a more general problem, with no limits on the number of target (Y) variables and auxiliary (X) frame variables; moreover, the optimization problem can be solved by considering different domains. In this case, too, it is possible to introduce models (one for each Y variable) in order to take the anticipated variance into account.

A complete efficiency comparison can be performed in a straightforward way in the univariate case, where the packages produce the same results (but “stratification” in less time), both in the simple case of Y as target variable and in the case where a model between X and Y is considered. The multivariate comparison can be carried out only by introducing a procedure in which “stratification” is applied separately to the different Y’s and the results are combined so that they can be compared with those of “SamplingStrata”.

Baillargeon S. and Rivest L.-P. 2011. The Construction of Stratified Designs in R with the Package stratification. Survey Methodology, 37: 53-65.
Baillargeon S. and Rivest L.-P. 2014. stratification: Univariate Stratification of Survey Populations. R package version 2.2-5. (http://CRAN.R-project.org/package=stratification)
Ballin M. and Barcaroli G. 2013. Joint Determination of Optimal Stratification and Sample Allocation Using Genetic Algorithm. Survey Methodology, 39: 369-393.
Barcaroli G. 2014. SamplingStrata: An R Package for the Optimization of Stratified Sampling. Journal of Statistical Software, 61(4), 1-24. (http://www.jstatsoft.org/v61/i04/)
Barcaroli G., Ballin M., Pagliuca D., Willighagen E. and Zardetto D. 2018. SamplingStrata: Optimal Stratification of Sampling Frames for Multipurpose Sampling Surveys. R package version 1.2. (https://CRAN.R-project.org/package=SamplingStrata)
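To make the phrase "minimizing the total sample while satisfying a given precision constraint" in the stratification abstract above concrete, here is a minimal sketch of the allocation step only, for strata that are already fixed. It uses textbook Neyman allocation under simple random sampling without replacement in each stratum; it is not the algorithm of either package (both also optimize the strata boundaries), and the function names and stratum figures are hypothetical.

```python
def min_sample_size(N, S, Ybar, target_cv):
    """Real-valued total sample size that meets a CV target on the estimated
    population total when the sample is allocated by Neyman allocation."""
    total = sum(n_h * m_h for n_h, m_h in zip(N, Ybar))
    target_var = (target_cv * total) ** 2
    sum_ns = sum(n_h * s_h for n_h, s_h in zip(N, S))
    sum_ns2 = sum(n_h * s_h ** 2 for n_h, s_h in zip(N, S))
    return sum_ns ** 2 / (target_var + sum_ns2)

def neyman_allocation(N, S, n):
    """Split n across strata proportionally to N_h * S_h."""
    sum_ns = sum(n_h * s_h for n_h, s_h in zip(N, S))
    return [n * n_h * s_h / sum_ns for n_h, s_h in zip(N, S)]

def variance_of_total(N, S, n_alloc):
    """Variance of the stratified estimator of the total (SRSWOR per stratum)."""
    return sum(n_h ** 2 * s_h ** 2 * (1 / m_h - 1 / n_h)
               for n_h, s_h, m_h in zip(N, S, n_alloc))

# Hypothetical frame: stratum sizes, standard deviations and means of Y.
N = [1000, 300, 100]
S = [5.0, 20.0, 50.0]
Ybar = [10.0, 60.0, 400.0]

n_total = min_sample_size(N, S, Ybar, target_cv=0.02)
allocation = neyman_allocation(N, S, n_total)
achieved_var = variance_of_total(N, S, allocation)
```

The sketch ignores rounding to integers and the case where the allocation for a stratum exceeds its size; a production tool has to handle both.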


pestim - an R package to compute population estimations using mobile phone data

By Bogdan Oancea, David Salgado, Luis Sanguiao, and Antoniade Ciprian Alexandru

University of Bucharest and National Statistics Institute of Romania, Dept. Methodology and Development of Statistical Production, INE, Ecological University of Bucharest and National Statistics Institute of Romania

Integration of mobile phone data into the production of official statistics was one of the main goals of the ESSnet Big Data project. In this regard, we developed an R package to compute population estimates following a methodology inspired by ecological sampling techniques. This methodology uses a Bayesian approach that is computationally intensive but allows straightforward code parallelization. Since some functions of our package are very demanding from a computational point of view, we implemented them in C++ and integrated them with the rest of the package using Rcpp. Moreover, the C++ code is also parallelized, for which we chose the RcppParallel package.

The estimation procedure combines mobile phone data sets with another data source, which can be a population register, and produces estimates for each territorial division of a geographical area and along a sequence of time instants for which we have data from Mobile Network Operators. One of the hypotheses of the underlying mathematical model was the independence of the estimates between different cells, which allowed us to use a parallel procedure to perform the computations for each cell. Besides population estimates, the package provides a set of accuracy indicators. pestim was developed with an eye on portability and can be used on both Windows and Unix-like operating systems. We tested our package using synthetically generated data and showed that it scales well, which is an essential characteristic for real mobile phone data.


reclin: a package for record linkage and deduplication

By Jan van der Laan (Statistics Netherlands)

Statistics Netherlands

The goal of record linkage and deduplication is to determine which records belong to the same entities. When the records belong to different sources the term record linkage is used; when the records come from the same source the term deduplication is used. When a unique key that identifies the entities is available for all records, this process is simple and there are plenty of routines in R that can be used (e.g. merge or dplyr::left_join). However, when the linkage key consists of multiple variables that can contain errors and are not necessarily unique (such as name, address, date of birth), the process becomes more complex. The reclin package has been developed for this situation. It implements the Fellegi-Sunter method of probabilistic record linkage, but it is set up as a toolbox that can also be used to assist in using other methods such as machine learning. The general process of probabilistic record linkage will be presented, together with how the reclin package assists this process.
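The core of the Fellegi-Sunter method mentioned above is a per-pair weight built from m-probabilities (the chance that a field agrees for a true match) and u-probabilities (the chance that it agrees for a non-match). The sketch below shows only that weight calculation. It does not use or mirror the reclin API, and the probabilities and the name `match_weight` are invented for illustration; in practice m and u are estimated from the data.

```python
import math

# Assumed (hypothetical) probabilities that a field agrees,
# given a true match (m) and given a non-match (u).
M_PROB = {"name": 0.95, "address": 0.85, "birth_date": 0.98}
U_PROB = {"name": 0.01, "address": 0.05, "birth_date": 0.003}

def match_weight(agreement):
    """Sum of log-likelihood-ratio weights over the compared fields.

    `agreement` maps a field name to True (values agree) or False.
    """
    weight = 0.0
    for field, agrees in agreement.items():
        m, u = M_PROB[field], U_PROB[field]
        if agrees:
            weight += math.log(m / u)
        else:
            weight += math.log((1 - m) / (1 - u))
    return weight

all_agree = match_weight({"name": True, "address": True, "birth_date": True})
none_agree = match_weight({"name": False, "address": False, "birth_date": False})
name_only = match_weight({"name": True, "address": False, "birth_date": False})
```

Pairs with a high total weight are classified as links and pairs with a low weight as non-links; where the thresholds are set is a separate decision that this sketch leaves out.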


Responsive, web-based graphical user interfaces to R

By Adrian Dușa

University of Bucharest

There are a fair number of graphical user interfaces in and for R, in various programming languages such as Java (JGR, Deducer, Glotaran) or Python (OpenMeta-Analyst), either as standalone software (such as Rattle) or available as internal R packages (for instance ). Lawrence and Verzani (2012) present no fewer than four different frameworks to develop GUIs in and for R: gWidgets, RGtk2, Qt and Tcl/Tk, but interestingly they did not touch on yet another, currently fashionable environment: JavaScript. The well-known RStudio is essentially a web page, and modern interfaces to R can be written using the shiny package, yet another tool to make R available via a web page. This presentation introduces a different kind of web-based graphical user interface. Starting from an initial application based on the shiny package, it is now evolving into self-contained, installable software using the Electron.js framework.


(R)evolution of generalized systems and statistical tools at Statistics Canada

By Susie Fortier and Steven Thomas

Statistics Canada

As part of its modernization initiative, Statistics Canada is moving beyond a survey-first approach, with leading-edge methods to integrate and transform a variety of data sources into statistical information to support evidence-based decision-making. To go along with the development of new methods that address emerging data availability and data needs, new computer software options must be studied, developed and implemented within Statistics Canada’s environment. A large part of this work concentrates on the adoption, use and support of open-source solutions such as R. At the same time, continuous support must be provided for the set of robust methodological tools developed over the years to apply methods said to be “traditional”. This talk will present Statistics Canada’s experience with the exploration of R as a research and production tool. Issues related to governance, risk assessment and capacity building will be covered, as will other opportunities and challenges.


R’s Shiny package and Survey Solutions for (Active) Survey Management (single slot)

By Michael Wild

World Bank

This tutorial will give an introduction on how to combine R’s powerful shiny package and the Survey Solutions REST API to manage surveys in different survey modes. The focus will mainly be on face-to-face and web surveys; however, the interfaces we create can also be used for telephone surveys. We will use remote sensing data to determine frame imperfection, as well as survey data and other survey paradata to monitor data quality. With this data at hand, we will create different dashboards which facilitate the work of survey managers and allow the quality of incoming data to be monitored almost in real time.

Shiny is an open source R package that provides an elegant and powerful web framework for building web applications using R. Shiny helps you turn your analyses into interactive web applications without requiring HTML, CSS, or JavaScript knowledge.

Survey Solutions is a powerful Computer Assisted Survey System, which includes data and survey management functionalities plus a highly versatile REST API. This REST API makes it possible to connect fully customized survey and data management tools, which take the active survey management paradigm to the next level.

Besides shiny we will also make use of the following packages: data.table (Dowle et al., 2018), dplyr (Wickham, 2017), httr (Wickham, 2017), raster (Hijmans et al., 2017), and the new sf (Pebesma et al., 2018) package.


rtrim – an R implementation of Trends and Indices in Monitoring data

By Patrick Bogaart, Mark van der Loo, Jeroen Pannekoek

Statistics Netherlands

In the Netherlands, ecological monitoring data for birds, butterflies, and other faunal groups are collected by NGOs such as Dutch Butterfly Conservation and analysed by Statistics Netherlands to compute national or regional trends in annual species abundance. The methodology used is an extended form of Poisson regression, implemented as a GLM, but allowing for e.g. overdispersion, serial correlation and upscaling from multiple subregions. The specific nature of the monitoring data, which often involve many monitoring sites and therefore a large number of model parameters, calls for dedicated numerical methods to build a fast and stable estimation algorithm.

The original implementation of this method, TRIM, written in Pascal/Delphi, has been in use at Statistics Netherlands and in other European countries for more than two decades. However, because of the diminishing support for Pascal/Delphi, the desire to extend the methodology and, more generally, the wish for a more versatile and better maintainable system, a need emerged to re-implement the method, preferably as open source, in a modern and well-supported language. These requirements naturally suggest R as the preferred tool. Additional benefits include the opportunity to link to the rich set of libraries provided by the R environment, allowing easy integration in the statistical workflow.

After sketching this historical background, we will highlight key characteristics of the re-implementation process. For example, continuity of the statistical process imposed strong requirements on the validation of the new R implementation against the former Pascal code.

We will demonstrate both the core functionality of rtrim and many of the novel extensions. Examples include monthly count data and various types of visualization, ranging from dedicated composite plots and heatmaps to methods to derive uncertainty intervals or detect outliers in the data.


SelEdit… - a collection of R packages to implement some optimization-based selective editing techniques

By Elisa Esteban, Soledad Saldaña, and David Salgado

Dept. Methodology and Development of Statistical Production, INE

In the quest to industrialise statistical production at Statistics Spain (INE), we have recently proposed to follow the principles of functional modularity to implement a modern production system. We conceive of such a modern system as a sequence of layers: from the definition of the statistical needs/problem, over the statistical method(s) that provide a solution, to the software implementation following both the functional and object-oriented paradigms, to the software implementation at a scripting level, and finally to a user interface.

We have developed a collection of R packages providing a first proof of concept for an optimization-based approach to selective editing proposed and followed at Statistics Spain (INE) in a number of monthly Short-Term Business Statistics. This approach has been complemented with a computationally intensive proposal for the construction of validation intervals for each time period, each statistical unit, and each variable. All our R packages make intensive use of the S4 object-oriented system, pursuing a direct implementation of functional modularity principles with a strong methodological foundation for the design of modules. We present the current status of the project, focusing on the implementation of the statistical methods, going from small functions in diverse R packages to survey-specific scripts in actual production.


The Use of R Shiny at the U.S. Bureau of Labor Statistics

By Brandon Kopp

U.S. Bureau of Labor Statistics

Economists and statisticians at the U.S. Bureau of Labor Statistics (BLS) have developed several R Shiny applications that are or will soon be used at various points throughout the survey lifecycle. The National Compensation Survey is nearing completion of an app that will allow field supervisors to monitor the progress of data collection at the national, regional, and state levels. The Producer Price Index uses an app which provides time series and other visualizations to aid in data review. The Consumer Expenditure Survey has developed an app that allows stakeholders to view key data quality metrics such as response rates and imputation rates over time. The Office of Publications is working on a system that will automatically generate news release statements from structured data tables. In this presentation, I will show examples of how these applications are or will be used and discuss challenges we have encountered in developing and deploying these applications. I will discuss the use of R-Portable as a workaround for deploying applications within an organization.


Transforming Health and Social Care Publications in Scotland

By Anna Price, David Caldwell, Ewout Jaspers, Maighread Simpson

NHS National Services Scotland

Objectives
The Information Services Division (ISD) of the National Health Service Scotland produces around 200 health and social care publications each year which are designated either official or national statistics. Most publications are produced using SPSS, and published as static PDF documents with accompanying Excel tables. Feedback has shown that our data can be challenging to find and digest in this format. Furthermore, production is time-consuming, involving extensive manual formatting and checking. The transforming publications programme aims to modernise how ISD produces and releases data.

Methods
The team initially focused on one publication as a proof of concept. To make the process more efficient, robust and reliable, we transferred data production from SPSS to R, using modern data wrangling code from the tidyverse suite of packages. To build the new publication platform we used a combination of RShiny dashboards and D3 charts. We used git and GitHub for version control and have published the code behind the RShiny dashboard. We also developed an R Style Guide, GitHub best practice and a suite of R resources in order to facilitate learning, development and collaborative working within ISD and across the wider public sector in Scotland.

Results
A prototype for statistical publications was released in December 2017, providing customers with the data they need in a way that they can understand. Furthermore, the new method of production has created time savings and reduced the risk of manual errors. We are now working with several teams within ISD to transform their publications into this new design. We are also developing the first Reproducible Analytical Pipeline for an official statistics publication in Scotland in order to streamline the production process further.


Two main uses of R in Statistics Portugal: sampling and confidentiality

By Pedro Sousa, Conceição Ferreira, Inês Rodrigues, Pedro Campos

Statistics Portugal, Statistical Methods Unit

R has been used in Statistics Portugal for more than 15 years and its use is currently widespread throughout the organization. In this presentation, we focus on the Statistical Methods Unit, where there are two main areas of R usage: sampling and disclosure control.

For many of our sampling procedures, R is applied as a primary tool: we make use of packages such as RODBC for database access and survey for data analysis on complex survey samples.

With regard to statistical disclosure control, the use of R in Statistics Portugal is strong, given the recent developments concerning packages for protecting the confidentiality of microdata and tabular data. The R package sdcMicro has been a valuable tool for estimating disclosure risk under different intruder scenarios, in a quick and friendly manner. R has also played a central role in studying techniques for producing Public Use Files for the Household Budget Survey: parametric and non-parametric methods have been compared regarding their capacity to generate safe and useful synthetic data. With respect to Census data, perturbative methods for table protection have been studied, which included writing R functions to check for two priorities when analyzing usefulness: table consistency and additivity.

Besides applying R at the Unit, we encourage its use in Statistics Portugal through systematic four-day courses covering some basic commands. These allow ever more users to manage, analyze or visualize data using R.


Use of Choropleth Maps for Regional Statistics

By Jillian Delaney

Central Statistics Office, Quality Frameworks and Statistical Methods

One of the strategic aims of the Irish Central Statistics Office is to provide greater insight from our statistical outputs through improved communications, including the use of visualisation. Another key priority of the CSO is to address the demands of Irish users for more detail in regional statistics. To effectively communicate this extra detail, choropleth maps have proved invaluable. These maps use differences in shading, colouring, or the placing of symbols within predefined areas to indicate the average values of a particular quantity in those areas. To implement these maps in R, cartographic boundary shapefiles are required. These files are simplified representations of selected geographic areas. In this paper, an application of choropleth maps developed in R is described for a project which developed a new approach to measuring and understanding the activities of the tourism industries in Ireland. In this case, choropleth maps were used to illustrate a measure known as a Tourism Dependency Ratio (TDR) at county level.

Keywords: visualisation, R, choropleth maps, regional statistics, boundary files


Using R for analysis and production of Price Indices for the Production and Services sector of the economy

By Matt Mayhew

Office for National Statistics

At the Office for National Statistics (ONS), R has been used for the analysis and production of data used in the calculation of the Producer Price Index (PPI) and its Services (SPPI), Exports (EPI) and Import (IPI) equivalents. This presentation will cover two uses of R, one in production and the second in analysis.

Since 2015 R has been used to draw the samples for the SPPI, EPI, and IPI. This was done to allow greater harmonisation of sample creation through a set of common code, and harmonisation of the methodology for creating the sample. This would have been more difficult to do in the existing SAS system for the SPPI. It also allows for greater transparency of the methodology.

Recently R has been used to analyse the effect of changing the methodology in the PPI and the SPPI to comply with Eurostat regulation, moving from five-yearly rebasing to annual chain-linking. R was used because of the high number of strata in the indices; it allowed for automation of the analysis and clear visualisation of the results. The use of RMarkdown aided the dissemination of the results.

Using R in both of these processes improved the processing time and increased the amount of analysis done. It has also promoted the use of R for other parts of the production process.
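Annual chain-linking can be illustrated with a generic sketch: each year's movement is a weighted average of price relatives using that year's weights, and the yearly links are multiplied together. This is a textbook illustration, not the ONS methodology or code; the product names, relatives and weights are invented, and `laspeyres_link` and `chain_index` are hypothetical helper names.

```python
def laspeyres_link(relatives, weights):
    """Weighted mean of price relatives (price now / price in the link period)."""
    total_weight = sum(weights.values())
    return sum(weights[k] * relatives[k] for k in relatives) / total_weight

def chain_index(yearly_relatives, yearly_weights, base=100.0):
    """Multiply successive yearly links, each computed with its own weights."""
    index = [base]
    for relatives, weights in zip(yearly_relatives, yearly_weights):
        index.append(index[-1] * laspeyres_link(relatives, weights))
    return index

# Hypothetical two-product example with weights updated every year.
yearly_relatives = [
    {"A": 1.10, "B": 1.00},  # year 1 prices relative to year 0
    {"A": 1.00, "B": 1.20},  # year 2 prices relative to year 1
]
yearly_weights = [
    {"A": 0.5, "B": 0.5},    # weights used for the year 1 link
    {"A": 0.6, "B": 0.4},    # updated weights used for the year 2 link
]

index = chain_index(yearly_relatives, yearly_weights)
```

Under five-yearly rebasing the same weights would be reused for every link within the five-year period; comparing the two series stratum by stratum is the kind of analysis the abstract describes automating.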


Using R for data cleaning, integration and estimation challenges in Statistics Poland - some conclusions after VIP.ADMIN project

By Beręsewicz Maciej and Pawlikowski Dawid

Centre for Small Area Estimation, Statistical Office in Poznań, Centre for Urban Statistics, Statistical Office in Poznań

The main goal of the VIP.ADMIN project was to improve the use of administrative sources. This could be achieved either by linking available registers or by linking survey data with administrative records.

The Statistical Office in Poznań took part in the ESS.VIP ADMIN WP6 pilot studies and applications, which covered linking several registers and sample surveys in order to provide information about both marital statuses (de facto and de jure). However, not all registers contain the Person Identification Number (PESEL), and sample surveys by definition do not contain such information due to privacy issues. Therefore, there was a need for probabilistic record linkage of these sources.

In the presentation we would like to focus on how the problems of data cleaning, probabilistic record linkage and estimation based on multiple data sources were tackled with the use of R. The presentation covers packages to process the data (e.g. tidyverse, data.table, stringi), to link surveys and registers (e.g. RecordLinkage, fastLink) and for estimation (e.g. survey, laeken).


Using R for variance estimation in social surveys

By Eleanor Law, Vahé Nafilyan, Ria Sanderson

Office for National Statistics

Variance estimation is an important part of the production of official statistics and their accuracy, accessibility and comparability. Currently the primary tool used in sample design and estimation for social surveys at the UK Office for National Statistics is a SAS implementation of the generalised linearised jackknife method of variance estimation.

The UK Government Statistical Service is increasingly focussed on innovation and cost effectiveness, supported by increased use of open source software. To aid re-use and transparency of methods, we have implemented this method of variance estimation for complex survey designs in R.

We have used the Wealth and Assets survey, a longitudinal survey of household wealth in Great Britain, as a case study to test this R implementation. This survey features cluster sampling, stratification, and calibration, so it offers a useful comparison between the R method and the existing approach.

In this presentation, we will give an overview of the method, how it is implemented in R and the tests that we have carried out. We will discuss next steps, where we are looking to implement more flexibility, and the ability to estimate the variance of change between different time points.
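For readers unfamiliar with jackknife variance estimation, the sketch below shows the plain delete-one jackknife for an unstratified sample. It is deliberately not the generalised linearised jackknife the abstract refers to, which handles stratification, clustering and calibration; the function name and the data are hypothetical.

```python
import statistics

def jackknife_variance(values, estimator):
    """Delete-one jackknife variance of `estimator` applied to `values`."""
    n = len(values)
    # Re-compute the estimate n times, each time leaving one unit out.
    replicates = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    centre = sum(replicates) / n
    return (n - 1) / n * sum((r - centre) ** 2 for r in replicates)

# Hypothetical household wealth values (in thousands).
sample = [12.0, 35.5, 8.2, 150.0, 47.3, 22.1, 5.0, 64.8]

var_of_mean = jackknife_variance(sample, statistics.mean)
```

For the sample mean, the delete-one jackknife reproduces the familiar estimate s²/n, which is a convenient check on an implementation before moving to more complex estimators.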


Using R to access official data from the Guatemalan National Institute of Statistics

By Oscar de León

Centro de Estudios en Salud, Universidad del Valle de Guatemala

Although several technical standards are available regarding the preparation and dissemination of official statistics, entities producing these data will sometimes export them in formats native to the variety of tools they use for data management and analysis, and then publish these datasets in ad hoc web portals not necessarily well suited for data consumption by statistical tools like R. Even if the data are available in one of the more common file formats (e.g. SAS, SPSS, or Stata data files, and even Excel spreadsheets), the web portals can prove difficult even to navigate in a web browser, let alone to use as repositories for more automated data access.

The Guatemalan National Institute of Statistics (INE) provides several official data resources through a web portal roughly arranged by categories and time periods (https://www.ine.gob.gt/index.php/estadisticas/fuente-base-de-datos, website in Spanish), but the available datasets are not clearly documented and are typically discovered through manual browsing of the site. Once a user finds a dataset of interest to her, the format used to export the data is only implied by the context, and no additional instructions on how to use the data are provided.

This work presents the overall structure of the INE data portal and showcases the nsgtm (National Statistics - Guatemala) R package, which allows users to explore the available INE datasets within R and download them directly for use in an R session. As a motivating example, the package is used to find and load the official birth records from 2010-2015, which are spread across several pages.


Variance estimation for annual point estimates and net changes for LFS using R package vardpoor

By Juris Breidaks

Central Statistical Bureau of Latvia

The presentation is devoted to the function vardannual from the R package vardpoor. The Central Statistical Bureau of Latvia developed the function vardannual in 2017, and it is included in the R package vardpoor. The paper describes the variance estimation of quarterly estimates and the correlation estimation of two quarter change estimates, and finally explains how to extend the approach to deal with variance estimation for annual point estimates and net changes for Labour Force Survey (LFS) indicators. Variance estimates for annual point estimates and net changes were computed for LFS indicators using the function “vardannual”. This function was tested on simulated and real data. The function “vardannual” is important for assessing the quality of LFS estimates and the statistical significance of the estimates. The annual net changes of all indicators are calculated with a confidence interval, and if the confidence interval for the difference does not contain 0, we are able to conclude that the difference is statistically significant. Looking at the results, the confidence interval with calibration is narrower than the confidence interval without calibration. The function “vardannual” in the R package “vardpoor” has been put into practice for the LFS in Latvia.

References
BERGER, Y., OSIER, G., GOEDEMÉ, T. (2017). Standard error estimation and related sampling issue, Monitoring social inclusion in Europe (Eurostat), pp. 465 – 480.
BERGER, Y. G. and PRIAM, R. (2016). “A simple variance estimator of change for rotating repeated surveys: an application to the EU-SILC household surveys”, University of Southampton, Statistical Sciences Research Institute. Available at http://eprints.soton.ac.uk/347142
BREIDAKS, J., LIBERTS, M., IVANOVA, S. (2018). vardpoor: Variance Estimation for Sample Surveys by the Ultimate Cluster, R package version 0.10.0. URL http://cran.r-project.org/web/packages/vardpoor/index.html
OSIER, G., RAYMOND, V. (2015). Development of methodology for the estimate of variance of annual net changes for LFS-based indicators. Deliverable 1 - Short document with derivation of the methodology (FINAL), SOGETI.
OSIER, G., PERRAY, P. (2016). Variance estimators of annual levels and net changes for a defined set of LFS-based indicators.
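The significance check described in the vardpoor abstract above (a net change is significant when its confidence interval does not contain 0) can be sketched as follows. This is a generic normal-approximation illustration and not the vardannual function or its interface; the estimates, variances and correlation are hypothetical inputs, and the function names are invented.

```python
import math

def net_change_ci(est1, est2, var1, var2, corr, z=1.96):
    """Confidence interval for est2 - est1 when the two estimates are correlated."""
    diff = est2 - est1
    var_diff = var1 + var2 - 2 * corr * math.sqrt(var1 * var2)
    half_width = z * math.sqrt(var_diff)
    return diff - half_width, diff + half_width

def is_significant(ci):
    """True when the interval lies entirely above or entirely below 0."""
    lower, upper = ci
    return lower > 0 or upper < 0

# Same hypothetical estimates, with and without correlation between the years.
ci_correlated = net_change_ci(10.0, 12.0, var1=1.0, var2=1.0, corr=0.5)
ci_independent = net_change_ci(10.0, 12.0, var1=1.0, var2=1.0, corr=0.0)
```

With overlapping samples the positive correlation shrinks the variance of the difference, so the same net change can be significant when the correlation is taken into account and not significant when it is ignored.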


We hope you enjoyed your time at uRos2018. See you next time!
