Use of R in Official Statistics
6th International Conference 2018
Welcome
The global community of R users is growing, and the number of National and International Statistical Offices that are adopting R is growing as well. About five years ago, when this conference was organized as an international conference for the first time in Romania, we felt a bit like outlaws using Free and Open Source Software (FOSS) in an area where commercial packages rule the land. How times have changed: in the meantime FOSS, and in particular R, is considered a driving force of innovation in academia, industry and government. The popularity of R is demonstrated by the hundreds of local R user groups, the thousands of R packages, and the R Consortium.
The current conference, at Statistics Netherlands, marks the first occasion outside of the place where it was conceived: Romania. We are therefore especially pleased that our keynote speakers have roots in both countries. Alina Matei is a professor of statistics in Switzerland with Romanian roots. She will talk about optimal sample coordination using R, an important topic in times where the reduction of response burden and increasing nonresponse rates force us to use smaller, more complex sampling methods. Not many R users are aware that there is a ‘touch of Dutch’ in R. Since 2017, Jeroen Ooms (UC Berkeley) has been the maintainer of both Rtools and R for Windows. He will tell us about what it takes to compile, release, and modernize a system on which more than 12,500 R packages and millions of users rely every day.
For the first time this year we have a full day of tutorials with topics including sample stratification, data cleaning and processing, and geospatial modeling. Make sure to take full advantage of the experts that came here to share their knowledge.
With about fifty contributed talks and around one hundred conference attendees, this uRos is the largest in its history. We are grateful to the speakers, tutorial organizers and attendees for making this conference such a growing success.
We wish you an enjoyable conference. Welcome to uRos2018!
Organizing partners
Statistics Netherlands
Statistics Romania
Statistics Austria
University of Bucharest
Ecological University of Bucharest
Special Journal Issues
Romanian Statistical Review: http://www.revistadestatistica.ro/
Austrian Journal of Statistics: https://www.ajs.or.at/
Contents
Welcome
Practical information
Program overview
Session overview
Tutorials
  Fast & efficient data manipulation with data.table (Jaap Walhout)
  Plotting spatial data in R (Martijn Tennekes)
  So your Data is Tidy. But is it Clean? (Edwin de Jonge and Mark van der Loo)
  Spatial Analysis in R with Open Geodata (Egge-Jan Pollé and Willy Tadema)
  Use of R package SamplingStrata for the Optimal Stratification of Sampling Frames for Multipurpose Sampling Surveys (Giulio Barcaroli)
Keynotes
  Sample coordination and R (Alina Matei)
  The R infrastructure and Windows Build System (Jeroen Ooms)
Conference presentations
  A Corporate Design Toolbox for R (Thomas Lo Russo)
  A First Step towards Statistical Disclosure Control on Multiple Tables Under the Presence of Differential Attacks (Kazuhiro Minami and Yutaka Abe)
  Alternative to LaTeX for high quality report generation with rmarkdown (Romain Lesur)
  An all-in-one R application for validating model assumptions in linear regression analysis with visualizations (Joy Chioma Nwabueze and Chisimkwuo John)
  An internal package for automated metadata documentation (Matthias Gomolka)
  Canadian Consumer Price Index (CPI) Dashboard built using R Shiny (Manolo Malaver-Vojvodic)
  coder: An R-package for fast classification of item data into groups (Erik Bülow)
  Combining JDemetra+ and R for Analysing and Visualising Time Series in Official Statistics (Atanaska Nikolova)
  Comparison of multivariate outlier detection methods for nearly elliptically distributed data (Kazumi Wada, Mariko Kawano and Hiroe Tsubaki)
  Development of R Shiny Dashboard on Pattern and Characteristics of Tuberculosis Incidence in Malaysia (Kamarul Ariffin Mansor, Nurhuda Ismail, Asmahani Nayan and Abd Razak Ahmad)
  Easily translatable Shiny applications (Matjaž Jeran)
  Easy Bootstrapping for Rotational Surveys with ‘surveysd’ (Johannes Gussenbauer, Alexander Kowarik and Matthias Till)
  Errorlocate: finding errors in data (Edwin de Jonge)
  Estimating Differential Mortality from EU-SILC UDB Longitudinal Data (Tobias Göllner, Johannes Klotz)
  Evaluation of estimation methods for a new survey of the UK’s Office for National Statistics (ONS) using R (Konstantinos Soulanis)
  Evidence for the use of alternative data sources to track consumer and business confidence within emerging markets using sentiment based techniques (Hanjo Odendaal)
  Experiences in the migration to RStudio-Server in Statistics Austria (Bernhard Meindl and Alexander Kowarik)
  From challenges to opportunities: The Romanian Case of Use R in Official Statistics (Nicoleta Caragea, Ana-Maria Ciuhu and Raluca Mariana Dragoescu)
  How R is improving the dissemination of statistics within the Department for Work and Pensions (Aoife O’Neill)
  How the Scottish Government is moving towards R (Victoria Avila)
  Interactive data visualization web-based application using R-Shiny (Houssam Hachimi)
  Introducing R at Statistics Denmark – a not entirely completed how-to (Peter Tibert Stoltze & other co-authors (to be added))
  Introduction to ‘flagr’ (Salvati Matteo, Eurostat and Mészáros Mátyás)
  Investigating Chaos in Time Series: Evidence from the Cryptocurrency Market (Sami Diaf)
  Lack-of-fit testing without replicates available – a modern clustering approach (Tyler George)
  Macroeconomic Statistical Forecasting for Engine Demand (Ankit Kamboj, Debojyoti Samadder and Ambica Rajagopal)
  Optimal Boundary Value for Creating Anonymized Microdata: Empirical Analysis based on Economic Survey Data (Yutaka Abe, Kiyomi Shirakawa and Hitotsubashi Ryota Chiba)
  Overlapping classification for autocoding system (Yukako Toko, Shinya Iijima and Mika Sato-Ilic)
  R packages for optimal stratified sampling: a review and compared evaluation (Marco Ballin and Giulio Barcaroli)
  pestim - an R package to compute population estimations using mobile phone data (Bogdan Oancea, David Salgado, Luis Sanguiao, and Antoniade Ciprian Alexandru)
  reclin: a package for record linkage and deduplication (Jan van der Laan (Statistics Netherlands))
  Responsive, web-based graphical user interfaces to R (Adrian Dușa)
  (R)evolution of generalized systems and statistical tools at Statistics Canada (Susie Fortier and Steven Thomas)
  R’s Shiny package and Survey Solutions for (Active) Survey Management (single slot) (Michael Wild)
  rtrim – an R implementation of Trends and Indices in Monitoring data (Patrick Bogaart, Mark van der Loo, Jeroen Pannekoek)
  SelEdit… - a collection of R packages to implement some optimization-based selective editing techniques (Elisa Esteban, Soledad Saldaña, and David Salgado)
  The Use of R Shiny at the U.S. Bureau of Labor Statistics (Brandon Kopp)
  Transforming Health and Social Care Publications in Scotland (Anna Price, David Caldwell, Ewout Jaspers, Maighread Simpson)
  Two main uses of R in Statistics Portugal: sampling and confidentiality (Pedro Sousa, Conceição Ferreira, Inês Rodrigues, Pedro Campos)
  Use of Choropleth Maps for Regional Statistics (Jillian Delaney)
  Using R for analysis and production of Price Indices for the Production and Services sector of the economy (Matt Mayhew)
  Using R for data cleaning, integration and estimation challenges in Statistics Poland - some conclusions after VIP.ADMIN project (Beręsewicz Maciej and Pawlikowski Dawid)
  Using R for variance estimation in social surveys (Eleanor Law, Vahé Nafilyan, Ria Sanderson)
  Using R to access official data from the Guatemalan National Institute of Statistics (Oscar de León)
  Variance estimation for annual point estimates and net changes for LFS using R package vardpoor (Juris Breidaks)
Practical information
Public wifi
SSID: CBS-Public
User name: uros2018
Password: uros2018
Guests understand and acknowledge that we exercise no control over the nature, content, or reliability of the information and/or data passing through our network.
Social dinner
The social dinner will take place on Thursday 13 September at 19:00.
Restaurant Luden
Plein 6–7
2511CR Den Haag
The easiest way to get there from CBS is to take the light rail (tram) number 3 or 4 in the direction of Den Haag. Get off at the stop ‘Spui’, which is the first stop after the Central Station. From there it is a two-minute walk.
Program overview
12 September: tutorials
Time | Tinbergen | Methorst | Idenburg
08:30 | Registration (foyer)
09:00 | Spatial Analysis in R with open Geodata | So your data is Tidy, but is it Clean? | Fast and efficient data manipulation with data.table
10:30 | Coffee break (foyer)
11:00 | Spatial Analysis in R with open Geodata | So your data is Tidy, but is it Clean? | Fast and efficient data manipulation with data.table
12:30 | Lunch break (foyer)
13:30 | Plotting spatial data in R | Using SamplingStrata for optimal stratification in multipurpose sampling surveys |
15:00 | Coffee break (foyer)
15:30 | Plotting spatial data in R | Using SamplingStrata for optimal stratification in multipurpose sampling surveys |
17:00 | Program ends
13 September: conference day 1
Time | Tinbergen | Methorst
08:30 | Registration (foyer)
09:00 | Opening
09:30 | Keynote I
10:30 | Coffee break (foyer)
11:00 | Sampling and estimation | R in organization
12:30 | Lunch break (foyer)
13:30 | Data Cleaning | R in production: data analysis
15:00 | Coffee break (foyer)
15:30 | Methods for official statistics | Shiny applications
17:00 | Sessions end
19:00 | Social dinner at Luden
14 September: conference day 2
Time | Tinbergen | Methorst | Idenburg
08:30 | Registration (foyer)
09:00 | Keynote II
10:00 | unconfUROS results
10:30 | Coffee break (foyer)
11:00 | Time series | Reports and GUI programming | R in production: automation
12:30 | Lunch break (foyer)
13:30 | Big data | Dissemination and visualization |
15:00 | Closing the conference
Session overview
Keynote I, 13 Sept. 09:30 Tinbergen
  Sample coordination and R (Alina Matei)
  Chair: Nicoleta Caragea

Sampling and estimation, 13 Sept. 11:00 Tinbergen
  Using R for variance estimation in social surveys (Eleanor Law; Vahé Nafilyan; Ria Sanderson)
  Evaluation of estimation methods for a new survey of the UK’s Office for National Statistics (ONS) using R (Konstantinos Soulanis)
  Easy Bootstrapping for Rotational Surveys with ‘surveysd’ (Johannes Gussenbauer; Alexander Kowarik; Matthias Till)
  Variance estimation for annual point estimates and net changes for LFS using R package vardpoor (Juris Breidaks)
  Lack-of-fit Testing Without Replicates Available – A Modern Clustering Approach (Tyler George)
  Chair: David Salgado

R in organization, 13 Sept. 11:00 Methorst
  From challenges to opportunities: The Romanian Case of Use R in Official Statistics (Nicoleta Caragea; Ana-Maria Ciuhu; Raluca Mariana Dragoescu)
  Experiences in the migration to RStudio-Server in Statistics Austria (Bernhard Meindl; Alexander Kowarik)
  (R)evolution of generalized systems and statistical tools at Statistics Canada (Susie Fortier; Steven Thomas)
  How the Scottish Government is moving towards R (Victoria Avila)
  Introducing R at Statistics Denmark – a not entirely completed how-to (Peter Tibert Stoltze)
  Chair: Alexander Kowarik

Data Cleaning, 13 Sept. 13:30 Tinbergen
  Overlapping classification for autocoding system (Yukako Toko; Shinya Iijima; Mika Sato-Ilic)
  SelEdit... - a collection of R packages to implement some optimization-based selective editing techniques (Elisa Esteban; Soledad Saldaña; David Salgado)
  Comparison of multivariate outlier detection methods for nearly elliptically distributed data (Kazumi Wada; Mariko Kawano; Hiroe Tsubaki)
  Errorlocate: finding errors in data (Edwin de Jonge)
  Chair: Giulio Barcaroli

R in production: data analysis, 13 Sept. 13:30 Methorst
  Using R for analysis and production of Price Indices for the Production and Services sector of the economy (Matt Mayhew)
  Transforming Health and Social Care Publications in Scotland (Anna Price; David Caldwell; Ewout Jaspers; Maighread Simpson)
  Macroeconomic Statistical Forecasting for Engine Demand (Ankit Kamboj; Debojyoti Samadder; Ambica Rajagopal)
  Two main uses of R in Statistics Portugal: sampling and confidentiality (Pedro Sousa; Conceição Ferreira; Inês Rodrigues; Pedro Campos)
  Chair: Ciprian Alexandru

Methods for official statistics, 13 Sept. 15:30 Tinbergen
  Optimal Boundary Value for Creating Anonymized Microdata: Empirical Analysis based on Economic Survey Data (Kiyomi Shirakawa; Ryota Chiba; Yutaka Abe)
  Reclin: a package for record linkage and deduplication (Jan van der Laan)
  R packages for optimal stratified sampling: a review and compared evaluation (Marco Ballin; Giulio Barcaroli)
  A First Step towards Statistical Disclosure Control on Multiple Tables Under the Presence of Differential Attacks (Kazuhiro Minami; Yutaka Abe)
  An All-In-One R Application For Validating Model Assumptions In Linear Regression Analysis With Visualizations (Joy Chioma Nwabueze; Chisimkwuo John)
  Chair: Bernhard Meindl

Shiny applications, 13 Sept. 15:30 Methorst
  Rumble, a shiny application for Modelling Bayesian Linear Estimation using R-Inla Package (Ferdian Fadly)
  The Use of R Shiny at the U.S. Bureau of Labor Statistics (Brandon Kopp)
  Development of R Shiny Dashboard on Pattern and Characteristics of Tuberculosis (Kamarul Ariffin Mansor; Nurhuda Ismail; Asmahani Nayan; Abd Razak Ahmad)
  Interactive data visualization web-based application using R-Shiny (Houssam Hachimi)
  Chair: Guido Schulz

Keynote II, 14 Sept. 09:00 Tinbergen
  The R infrastructure and Windows build system (Jeroen Ooms)
  Chair: Edwin de Jonge

Time Series, 14 Sept. 11:00 Tinbergen
  Estimating Differential Mortality from EU-SILC UDB Longitudinal Data (Tobias Göllner; Johannes Klotz)
  Investigating Chaos in Time Series: Evidence from the Cryptocurrency Market (Sami Diaf)
  Combining JDemetra+ and R for Analysing and Visualising Time Series in Official Statistics (Atanaska Nikolova)
  rtrim – an R implementation of Trends and Indices in Monitoring data (Patrick Bogaart; Mark van der Loo; Jeroen Pannekoek)
  Chair: Kazumi Wada

Report and GUI programming, 14 Sept. 11:00 Methorst
  A Corporate Design Toolbox for R (Thomas Lo Russo)
  Alternative to LaTeX for high quality report generation with rmarkdown (Romain Lesur)
  Easily translatable Shiny applications (Matjaž Jeran)
  Responsive, web-based graphical user interfaces to R (Adrian Dușa)
  Chair: Ana Maria Ciuhu

R in production: automation, 14 Sept. 11:00 Idenburg
  Introduction to ‘flagr’ (Salvati Matteo; Mészáros Mátyás)
  An internal package for automated metadata documentation (Matthias Gomolka)
  Using R for data cleaning, integration and estimation challenges in Statistics Poland - some conclusions after VIP.ADMIN project (Beręsewicz Maciej; Pawlikowski Dawid)
  R’s Shiny package and Survey Solutions for (Active) Survey Management (Michael Wild)
  Chair: Jan van der Laan

Big data, 14 Sept. 13:30 Tinbergen
  pestim - an R package to compute population estimations using mobile phone data (Bogdan Oancea; David Salgado; Luis Sanguiao; Antoniade Ciprian Alexandru)
  Evidence for the use of alternative data sources to track consumer and business confidence within emerging markets using sentiment based techniques (Hanjo Odendaal)
  coder: An R-package for fast classification of item data into groups (Erik Bülow)
  Chair: Michael Wild

Dissemination and visualisation, 14 Sept. 13:30 Tinbergen
  Use of Choropleth Maps for Regional Statistics (Jillian Delaney)
  How R is improving the dissemination of statistics within the Department for Work and Pensions (Aoife O’Neill)
  Using R to access official data from the Guatemalan National Institute of Statistics (Oscar F. de León)
  Canadian Consumer Price Index (CPI) Dashboard built using R Shiny (Manolo Malaver-Vojvodic)
  Chair: Patrick Bogaart
Tutorials
Fast & efficient data manipulation with data.table
By Jaap Walhout
Statistics Netherlands
Description: The data.table package is known for its speed and memory efficiency. It also has a bit of a learning curve. In this task-oriented and hands-on tutorial you will get a head start with data.table and learn the beauty of its syntax. After each explanation, you will be asked to solve a few exercises so you can internalize each concept.
Subjects treated include: fast input and output of data; introduction to data.table’s syntax (with a comparison to SQL syntax); aggregation; updating / adding variables by reference; manipulating multiple variables at once; chaining operations; powerful joins; and flexible reshaping.
Background knowledge: Participants are expected to be able to solve (simple) data manipulation tasks with base R and/or tidyverse packages. Familiarity with SQL is useful, but not necessary.
Requirements: Participants should bring their own laptops with a recent version of R (3.4+) and a recent version of data.table (1.10.4+).
Plotting spatial data in R
By Martijn Tennekes
Statistics Netherlands
In this workshop you will learn how to plot spatial data in R by using the tmap package. This package is an implementation of the grammar of graphics for thematic maps, and resembles the syntax of ggplot2. This package is useful for both exploration and publication of spatial data, and offers both static and interactive plotting.
For those of you who are unfamiliar with spatial data in R, we will briefly introduce the fundamental packages for spatial data, which are sf and raster. With demonstrations and exercises, you will learn how to process spatial objects of various types (polygons, points, lines, rasters, and simple features), and how to plot them. Feel free to bring your own spatial data.
Besides plotting spatial data, we will also discuss the possibilities of publication. Maps created with tmap can be exported as static images or html files, but they can also be embedded in rmarkdown documents and shiny apps.
Tennekes, M., 2018, tmap: Thematic Maps in R, Journal of Statistical Software, 84(6), 1-39
So your Data is Tidy. But is it Clean?
By Edwin de Jonge and Mark van der Loo
Statistics Netherlands
Data cleaning consists of all activities necessary to make data fit for its intended statistical purpose. Typically data cleaning starts with variable extraction and structuring of data. Once data is well-structured, a sufficient level of validity and completeness must be ensured so that the outcome of statistical computations can be trusted. Validity means that the data values correspond, with reasonable certainty, to the actual values of the real-world properties they describe.
The workshop will give a systematic overview of data cleaning and data validation methods with practical use cases and exercises in R. The following topics will be treated in the workshop:
- Structured thinking about data cleaning: data quality and the statistical value chain
- Checking for errors: rule-based and reproducible data validation
- Principles of error localization and correction of errors
- Handling missing data: imputation methods in R
- Tracking changes in data and monitoring the effect of data cleaning
Literature: Mark van der Loo and Edwin de Jonge (2018) Statistical Data Cleaning with Applications in R. John Wiley & Sons.
R packages: deductive, dcmodify, errorlocate, lumberjack, validate, rspa, simputation, stringdist, tidyverse, VIM
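As a rough illustration of what "rule-based and reproducible data validation" means, the sketch below confronts a few toy records with a set of named rules and reports which rules each record violates. It is written in plain Python as a language-neutral sketch; it does not reproduce the API of the R packages listed above, and the rule names, the record fields and the `confront` helper are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Rule = Callable[[Record], bool]

# Hypothetical validation rules for a toy survey record (assumed, not from the tutorial).
RULES: Dict[str, Rule] = {
    "age_nonnegative": lambda r: r["age"] >= 0,
    "income_nonnegative": lambda r: r["income"] >= 0,
    "minors_have_no_income": lambda r: r["age"] >= 15 or r["income"] == 0,
}


def confront(records: List[Record], rules: Dict[str, Rule]) -> List[List[str]]:
    """Return, for every record, the names of the rules it violates."""
    return [
        [name for name, rule in rules.items() if not rule(record)]
        for record in records
    ]


records = [
    {"age": 34, "income": 2500.0},  # satisfies every rule
    {"age": 12, "income": 300.0},   # a minor with income
    {"age": -1, "income": 0.0},     # impossible age
]

violations = confront(records, RULES)
for record, failed in zip(records, violations):
    print(record, "->", failed or "valid")
```

Keeping the rules as named, separate objects is what makes the check reproducible: the same rule set can be stored, versioned and re-applied after each cleaning step.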
Spatial Analysis in R with Open Geodata
By Egge-Jan Pollé and Willy Tadema
Tensing GIS Consultants, Provincie Groningen
This session will consist of a series of hands-on exercises. After short theoretical introductions the attendees will immediately be able to try it themselves. Most of the data used during this training course are available as Open (Geo)data and thus will be downloaded during the exercises themselves.
The goal of this tutorial session is to highlight two topics:
- The revolution caused by the recent development of the package sf. Today it is easier than ever before to use R as your command line GIS (Geographic Information System).
- The increasing availability of Open Data, including data with a spatial component: Open Geodata.
Some two or three weeks before the start of the conference all training material will be published in our GitHub repository, where it will remain available afterwards. The URL of this repository: https://github.com/TWIAV/Spatial_Analysis_in_R_with_Open_Geodata
Use of R package SamplingStrata for the Optimal Stratification of Sampling Frames for Multipurpose Sampling Surveys
By Giulio Barcaroli
ISTAT
The aim of this tutorial is to enable the participants to learn how to use the R package “SamplingStrata” in order to optimize the design of stratified samples. The package offers an approach for the determination of the best stratification of a sampling frame, the one that ensures the minimum sample cost under the condition of satisfying precision constraints in a multivariate and multi-domain case. This approach is based on the use of the genetic algorithm: each solution (i.e. a particular partition in strata of the sampling frame) is considered as an individual in a population; the fitness of all individuals is evaluated by applying the Bethel algorithm to calculate the sample size satisfying precision constraints on the target estimates.
Functions in the package allow the user to: (a) prepare necessary inputs and check their validity; (b) perform the optimization step choosing the values of the most important parameters; (c) assign the optimized strata labels to the sampling frame; (d) select a sample from the new frame according to the best allocation; (e) test the compliance of the design with the precision constraints. The package also allows the anticipated variance to be considered when the survey target variables are not available in the frame, but only proxy ones. A comparison to package “stratification” (valid for univariate designs) will be illustrated.
Exercises will be proposed to participants, who are expected to be acquainted with the basics of sampling theory.
Ballin M. and Barcaroli G. 2013. Joint Determination of Optimal Stratification and Sample Allocation Using Genetic Algorithm. Survey Methodology, 39: 369-393.
Barcaroli G. 2014. SamplingStrata: An R Package for the Optimization of Stratified Sampling. Journal of Statistical Software, 61(4), 1-24. (http://www.jstatsoft.org/v61/i04/)
Barcaroli G., Ballin M., Pagliuca D., Willighagen E. and Zardetto D. 2018. SamplingStrata: Optimal Stratification of Sampling Frames for Multipurpose Sampling Surveys. R package version 1.2 (https://CRAN.R-project.org/package=SamplingStrata)
Keynotes
Sample coordination and R
By Alina Matei
Institute of Statistics, University of Neuchâtel
Sample coordination seeks to create a probabilistic dependence between the selection of two or more samples drawn from the same population or from overlapping populations. There are numerous applications of sample coordination with varying objectives (for example, estimating a difference when updating a sample over time into a repeated sample, if the population changes). First, we provide an overview of selected sample coordination methods. Next, we focus on special cases of sample coordination based on the use of permanent random numbers. We show coordination for maximum entropy samples and spatially balanced samples. Links to R packages useful to implement such cases are shown.
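To make the idea of permanent random numbers (PRNs) concrete, here is a minimal sketch using Poisson sampling, one textbook setting for PRN-based coordination. It is not the maximum entropy or spatially balanced method of the talk, and it is written in Python as a language-neutral illustration rather than with the R packages the talk refers to. The population size, the inclusion probabilities and the helper `prn_poisson_sample` are assumptions made for the example.

```python
import random
from typing import Dict, Set


def prn_poisson_sample(prn: Dict[int, float], pi: Dict[int, float],
                       start: float = 0.0) -> Set[int]:
    """Select unit k when its PRN lies in [start, start + pi[k]) modulo 1."""
    return {k for k, u in prn.items() if (u - start) % 1.0 < pi[k]}


rng = random.Random(2018)
units = range(1000)

# Each unit receives one uniform random number that is kept across occasions.
prn = {k: rng.random() for k in units}

pi_first = {k: 0.10 for k in units}   # inclusion probabilities, occasion 1
pi_second = {k: 0.15 for k in units}  # inclusion probabilities, occasion 2

s1 = prn_poisson_sample(prn, pi_first)

# Positive coordination: same starting point, so the samples overlap maximally.
s2_positive = prn_poisson_sample(prn, pi_second)

# Negative coordination: shifted starting point, so the selection intervals
# [0, 0.10) and [0.5, 0.65) do not intersect and the overlap is empty.
s2_negative = prn_poisson_sample(prn, pi_second, start=0.5)

print(len(s1), len(s1 & s2_positive), len(s1 & s2_negative))
```

Because the random numbers are fixed per unit, the dependence between the two samples is controlled entirely by where the selection interval starts on each occasion.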
The R infrastructure and Windows Build System
By Jeroen Ooms
rOpenSci group at UC Berkeley
Jeroen has written over 40 CRAN packages, many of which have become important pieces of the R ecosystem. As of last year he also builds and maintains the official R toolchain and binaries for Windows, and provides builds for dozens of C/C++ libraries through the rwinlib organization on GitHub. This talk gives an overview of the R infrastructure and explains what is involved with building everything for Windows, as well as other platforms.
We highlight some of the powerful open source C libraries that provide the foundation for the critical functionality in base R and many CRAN packages, and the challenges of keeping the rapidly evolving ecosystem current. Finally we get a preview of the new version of Rtools, which is planned to ship with the new major release of R in 2019 and to provide a modern, scalable build system for R on Windows.
Conference presentations
A Corporate Design Toolbox for R
By Thomas Lo Russo
Statistical Office of the Canton of Zurich
The Statistical Office of the Canton of Zurich, the regional statistical institute of the most populous Swiss canton, is a relatively small organization (30 people). We do not rely on a centralized professional publishing and layout process. We are a group of ‘self-publishers’ creating infographics, reports and a broad variety of publications on a regular basis.
To achieve visual uniformity of our products, we have created a ‘Do it yourself’ Corporate Design Toolbox, which allows us to generate nicely formatted charts, Excel tables and .pdf reports straight out of R. Thanks to this toolbox we have become much more efficient in generating ad-hoc as well as automated outputs ready for publication. In a next step we are planning to include templates for html pages created with .Rmd as well as Shiny.
GitHub repo: https://github.com/statistikZH/statR
A First Step towards Statistical Disclosure Control on Multiple Tables Under the Presence of Differential Attacks
By Kazuhiro Minami and Yutaka Abe
Institute of Statistical Mathematics / National Statistics Center, National Statistics Center
To perform statistical disclosure control (SDC) on multiple tables is a challenging task because sensitive information can be revealed from intersections of multiple tables involving a common set of variables. This task is particularly difficult when each table contains a subset of the common variables because the intersection of those tables corresponds to a subspace of any shape in the multi-dimensional variable domain.
To address this issue, we extend our SDC tool for solving the cell suppression problem of two-dimensional tables to support multi-dimensional ones. Our approach is to take multiple tables as inputs and construct a single common fine-grained multi-dimensional table so that we can represent the constraints of each input table in an integrated way. Our tool detects possibly sensitive cells, which could be overlooked when the tables are examined separately, by solving a cell suppression problem on the multi-dimensional table.
In this paper, we describe the design and implementation of the new SDC tool and show preliminary experimental results.
Keywords: statistical disclosure control, cell suppression problems, differential attacks
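The kind of risk described here can be seen in a deliberately tiny example: two published frequency tables that each pass a minimum-frequency check on their own, while their difference implies a cell with a single contributor. The sketch below, in plain Python, only illustrates such a differencing attack; it is not the authors' SDC tool or their cell suppression algorithm, and the tables, the threshold and the helper names are invented for the example.

```python
from typing import Dict

Table = Dict[str, int]

# Hypothetical published tables: number of firms per region.
table_all_firms: Table = {"North": 12, "South": 7}    # every firm
table_small_firms: Table = {"North": 11, "South": 7}  # firms under 250 employees

THRESHOLD = 3  # assumed minimum frequency rule: cells with 1 or 2 units are unsafe


def unsafe_cells(table: Table, threshold: int) -> Table:
    """Cells whose count is positive but below the threshold."""
    return {k: v for k, v in table.items() if 0 < v < threshold}


def difference(a: Table, b: Table) -> Table:
    """Cell-wise difference of two tables defined on the same categories."""
    return {k: a[k] - b[k] for k in a}


# Checked one at a time, both tables look safe.
print(unsafe_cells(table_all_firms, THRESHOLD))    # {}
print(unsafe_cells(table_small_firms, THRESHOLD))  # {}

# Subtracting them yields the implied table of large firms,
# which exposes a single firm in the North.
implied_large_firms = difference(table_all_firms, table_small_firms)
print(unsafe_cells(implied_large_firms, THRESHOLD))  # {'North': 1}
```

Checking the implied table alongside the published ones is the simplest version of examining tables jointly rather than separately.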
Alternative to LaTeX for high quality report generation with rmarkdown
By Romain Lesur
Ministère de la Justice
Reproducible document generation has become an easy task thanks to knitr and rmarkdown. In order to generate paged documents (pdf), the usual technology stack relies on LaTeX. While standard LaTeX reports have high quality typesetting, their academic look and feel is not always well suited to public communication. Moreover, if statistical reports have to comply with a corporate design, the required customisations can become intractable: customising a LaTeX document is a highly technical task that needs very specific skills.
This talk introduces alternative rendering tools that can be used to produce high quality paged documents with R. These tools (wkhtmltopdf, weasyprint, prince) convert HTML documents to pdf based on a W3C standard named CSS Paged Media (this can be understood as an extension of CSS for print purposes). Since pandoc 2.0, these pdf engines can seamlessly replace LaTeX in a reproducible workflow with R.
This talk presents the basics of HTML paged document customisation with CSS and shows that the required skills are the same as the ones needed for website customisation. Therefore, an organization possessing HTML styling skills can develop high quality report templates for rmarkdown.
In order to facilitate the use of this new technology stack, a package in development (https://github.com/RLesur/weasydoc) and a docker image (https://hub.docker.com/r/rlesur/weasydoc) are presented.
Keywords (IEEE taxonomy): Reproducibility of results, Publishing
An all-in-one R application for validating model assumptions in linear regression analysis with visualizations
By Joy Chioma Nwabueze and Chisimkwuo John
Michael Okpara University of Agriculture
Linear regression models are less credible and rarely applicable when the assumptions underlying their usage are violated. In this paper, we developed an interactive, easy-to-use and teachable all-in-one R package that validates the linear regression assumptions for general users. Our approach was based on the application of R in the dissemination of statistics. A merit of this approach was that guidelines with proper explanations regarding each of the justifications were added as footnotes on the validated results, making it possible for both statisticians and non-statisticians to use the R package. Other benefits were that the application reported explanations on how to rectify failed assumptions, and also added the necessary graphical justifications alongside the hypothesis test results. It therefore served as a handy R tool for fast and free usage in validating linear regression model assumptions and applications. Use of empirical data showed the viability of the all-in-one R application.
Keywords: Model assumptions; Linear regression analysis; R statistical package; R graphs; Failed assumptions
An internal package for automated metadata documentation
By Matthias Gomolka
Deutsche Bundesbank
At the Research Data and Service Centre (RDSC) of the Deutsche Bundesbank, we provide microdata for research purposes. In order to make those datasets intellectually accessible to researchers, each dataset is accompanied by a document containing relevant metadata. This document, which we call the ‘Data Report’, is a PDF file generated by LuaLaTeX using a custom document class. The contents of a Data Report are created within the RDSC as well as in other departments. All information from other departments comes in a standardized spreadsheet. This is where the internal package ‘DataReportR’ comes into play. Using DataReportR, we can (1) populate a whole LaTeX template with the qualitative information from the other departments; (2) create, in no time, large numbers of small LaTeX tables containing information on each variable, such as a detailed description, the type of the variable and the range of availability; and (3) easily generate code lists as LaTeX tables. This dramatically reduces the amount of time spent copying and pasting text or fiddling with LaTeX tables. It also makes sure that each Data Report sticks to the corporate design rules. In short, DataReportR automates all repetitive tasks in the generation of a Data Report.
Canadian Consumer Price Index (CPI) Dashboard built using R Shiny
By Manolo Malaver-Vojvodic
Statistics Canada
One of Statistics Canada’s most important programs is the Consumer Price Index (CPI), an indicator that measures the changes in consumer prices experienced by Canadians. I would like to present an interactive data visualization application for the Canadian CPI at the 6th Conference on the Use of R in Official Statistics. My presentation will have two main goals. Firstly, I will highlight how we use R and RStudio at Statistics Canada to develop a dynamic dashboard that enhances the utility of the publicly available datasets of the Canadian CPI by giving users access to a variety of tools within an easily navigable user interface. Secondly, I will aim to showcase how the shiny, shinydashboard and plotly packages allow for a personalized customization of the application’s user interface, providing a professional interactive environment for displaying adjustable user-input controls through hover plots. I invite you to try our application, which is being temporarily hosted at https://kunov.shinyapps.io/(*), and to explore our complete code and datasets, which are available at https://github.com/manolo20/. We hope that the presentation of our CPI dashboard at the 6th Conference on the Use of R in Official Statistics will encourage other statistical agencies to continue to implement and experiment with new methods and technologies in the development of innovative statistical tools in order to improve the capacity of their socio-economic analysis.
(*) Note: Your national statistical office might restrict access to this website. Users can also try our application at http://18.191.111.211:3838/cpi_dashboard_StatCan/ or contact me if there are any issues accessing the web application (manolo. [email protected]).
coder: An R-package for fast classification of item data into groups
By Erik Bülow
The Swedish Hip Arthroplasty Register, Department of Orthopaedics, Institute of clinical sciences at the Sahlgrenska academy of University of Gothenburg, Sweden
The coder package is used to classify items from one dataset using code data from a secondary source. It was first developed for medical comorbidity data based on hospital visits recorded in a national patient register. Medical conditions were registered using standardized codes (ICD-10), and individual codes were summarized by weighting all individual comorbidities for each patient (the Charlson and Elixhauser comorbidity indices). Only hospital visits recorded within a specified time frame, relative to individual reference dates from the primary data source, were recognized as relevant. The large data sets, as well as the complexity of the classification schemes, make those calculations time-consuming. A naïve approach using bare code comparisons and for-loops in R took approximately 16 hours to run on a laptop computer with 16 GB of RAM. We were then able to reformulate the coding schemes using regular expressions, and we optimized our package using the data.table package. The classification time was then reduced to a number of seconds. We also compared the coder package to two packages on CRAN with similar scope, icd and comorbidities.icd10. Our package was 6 and 180 times faster than those, respectively. The coder package does not only include classification schemes for comorbidity data. It incorporates a general framework for any case where items can be classified using standardized codes. It might therefore be relevant for many tasks involving data management in official statistics using big data.
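The idea described in this abstract (match codes against regular expressions, keep only visits inside a time window before a reference date, then sum class weights) can be sketched in a few lines. The following is a minimal, language-neutral illustration in Python, not the coder package's R interface. The function names, the three patterns and the weights are hypothetical and are not the published Charlson or Elixhauser definitions.

```python
import re
from datetime import date

# Hypothetical scheme: class name -> (regex over ICD-10 codes, weight).
# Patterns and weights are illustrative assumptions only.
SCHEME = {
    "myocardial_infarction": (re.compile(r"^I2[12]"), 1),
    "diabetes": (re.compile(r"^E1[0-4]"), 1),
    "metastatic_tumour": (re.compile(r"^C7[7-9]"), 6),
}

def classify(visits, reference_date, window_days):
    """Return the classes matched by codes recorded at most
    `window_days` days before `reference_date`."""
    classes = set()
    for visit_date, code in visits:
        age_in_days = (reference_date - visit_date).days
        if 0 <= age_in_days <= window_days:
            for name, (pattern, _weight) in SCHEME.items():
                if pattern.match(code):
                    classes.add(name)
    return classes

def weighted_index(classes):
    """Sum the weights of the matched classes (each class counted once)."""
    return sum(SCHEME[name][1] for name in classes)

# One patient's visits as (date, ICD-10 code) pairs; made-up data.
visits = [
    (date(2017, 3, 1), "I219"),
    (date(2017, 6, 15), "E119"),
    (date(2010, 1, 1), "C780"),  # too old: outside the one-year window
]
found = classify(visits, date(2018, 1, 1), window_days=365)
index = weighted_index(found)
```

With these made-up inputs, `found` contains the two classes recorded within the year before the reference date, and `index` is 2. Compiling each pattern once and matching by prefix is what replaces the pairwise code comparisons of the naïve approach.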
Combining JDemetra+ and R for Analysing and Visualising Time Series in Official Statistics
By Atanaska Nikolova
Office for National Statistics
JDemetra+ is a free, platform-independent and open source software with various capabilities for time series analysis. The software has been developed by the National Bank of Belgium in cooperation with Deutsche Bundesbank and Eurostat, and is officially recommended by the European Central Bank and Eurostat for the seasonal adjustment of time series in official statistics. The software is written in Java and can be called through R with the “rJava” package, and soon with the newly developed package “rJDemetra”. Subsequently, various plots can be generated and integrated into dynamic documents or interactive applications. This presentation will show examples of seasonal adjustment of official time series performed in R by calling JDemetra+. The outputs from the analysis can then be used in an interactive Shiny application where users can choose to explore various elements of the time series, such as seasonal patterns or holiday effects, across different time spans. Other capabilities, such as time series benchmarking, temporal disaggregation and outlier detection, are also available.
Comparison of multivariate outlier detection methods for nearly elliptically distributed data
By Kazumi Wada, Mariko Kawano and Hiroe Tsubaki
National Statistics Center
Multivariate outlier detection methods have been discussed in the field of official statistics for more than a decade, but they may not be widely used yet. This is because multivariate methods are often computationally burdensome, and thus it is more difficult to evaluate their outcome. Furthermore, the outliers detected by multivariate methods may vary with different methods, and we may not be able to determine which ones are absolute outliers. We evaluate a few promising methods, such as blocked adaptive computationally efficient outlier nominators (BACON), nearest-neighbor variance estimation (NNVE), and modified Stahel-Donoho (MSD) estimators, using a variety of asymmetrically contaminated datasets. These methods are selected for their diversity of methodology, since there is no best method for all situations. The purpose of the simulation study is to illustrate the differences in their tolerances in various situations. Application in survey data processing is also discussed.
Development of R Shiny Dashboard on Pattern and Characteristics of Tuberculosis Incidence in Malaysia
By Kamarul Ariffin Mansor, Nurhuda Ismail, Asmahani Nayan and Abd Razak Ahmad
Universiti Teknologi MARA
Tuberculosis (TB) is an infectious disease that normally spreads from one person to another through the air. It is usually caused by a bacterium, Mycobacterium tuberculosis, that normally affects a person’s lungs. Nowadays, TB is one of the top 10 causes of death and remains a major problem in low- and middle-income countries, including developing countries like Malaysia. Even though TB is curable and preventable, the number of reported cases and the number of deaths caused by TB are still growing worldwide. The same is happening in Malaysia, which calls for proactive action to foresee the pattern and characteristics of TB in Malaysia so that action can be taken to cure the disease and prevent it from spreading on an even bigger scale. Thus, this paper describes the development of an R Shiny dashboard for TB incidence in Malaysia. The dashboard will have features that let users not only visualize interactively the patterns and characteristics of TB cases in Malaysia, but also run a simple guided logistic regression and survival analysis on TB incidence cases in Malaysia. The dashboard will also be able to automatically update the information and analysis when new data are fed into the original data file. The dashboard will help those who work directly on TB prevention and cure by providing a complete picture of TB in Malaysia, which will help guide them in planning and executing appropriate preventive action.
Easily translatable Shiny applications
By Matjaž Jeran
Bank of Slovenia
The Eurosystem of central banks is in transformation from a single official language to a multilingual system. Staff in a central bank are now recruited not only from the member country but from other countries as well. A software product of one central bank or statistical institute may be used in other similar institutions in Europe, but it is expected to be translated into the local language. These two reasons motivate developers to produce software products that are multilingual, including Shiny applications. This paper gives a set of simple recipes for making Shiny applications multilingual and easily translatable. The paper starts with a simple program (Hello world) and its transformation from a version completely dependent on one natural language into one that can be translated with a simple translation tool, a text editor, readily available on any platform. The paper also deals with the input data and output data, determining other cultural features such as support for specific characters and collating sequence. The support for these features is included in the operating system and R. How to display numeric values within a Shiny application is a task left to the Shiny application developer. The paper also shows the most important snippets of code of some easily translatable Shiny applications used within the Bank of Slovenia.
Keywords: R software, Shiny, GUI, internationalization, localization
Easy Bootstrapping for Rotational Surveys with ‘surveysd’
By Johannes Gussenbauer, Alexander Kowarik and Matthias Till
Statistics Austria
With the R package surveysd, we present a package for estimating standard errors for yearly surveys with and without rotating panel design. Complex survey designs, e.g. multistage sampling, are supported and can be freely defined. The implemented method for standard error estimation uses bootstrapping techniques for multiple consecutive waves of the survey. It is possible to compute an estimate based on several years, which leads to a reduction in the standard error, especially for estimates done on a subgroup of the survey data. The package enables users to compute point estimates as well as their standard errors on arbitrary subgroups of their data. The point estimator applied can also be chosen freely, with some predefined point estimators ready to use. Finally, the results can be visualized in two different ways to gain quick insight into the quality of the estimated results.
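To make the notion of a bootstrapped standard error concrete, here is a minimal sketch in Python. It assumes a plain with-replacement bootstrap of a weighted mean over independent units. It is not the method implemented in surveysd, which additionally handles the survey design and the rotating panel across waves. All names and the data are hypothetical.

```python
import random
import statistics

def weighted_mean(sample):
    """Point estimator: weighted mean of (value, weight) pairs."""
    total_weight = sum(weight for _, weight in sample)
    return sum(value * weight for value, weight in sample) / total_weight

def bootstrap_se(sample, estimator, replicates, seed):
    """Standard deviation of the estimator over resamples drawn with
    replacement (simple bootstrap; ignores strata, clusters and panels)."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(replicates):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

# Made-up (income, survey weight) pairs for one subgroup.
sample = [(1200.0, 1.5), (800.0, 2.0), (1500.0, 1.0),
          (950.0, 2.5), (1100.0, 1.2), (700.0, 3.0)]
point = weighted_mean(sample)
se = bootstrap_se(sample, weighted_mean, replicates=500, seed=1)
```

Passing the estimator as an argument mirrors the abstract's point that the point estimator can be chosen freely. Any function of the resampled data can be plugged in.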
Errorlocate: finding errors in data
By Edwin de Jonge
Statistics Netherlands
An important but undervalued activity in statistical analysis is data cleaning. No measurement is perfect, so data often contain errors. Obvious errors, e.g. negative ages, are easily detected, but errors in observations whose variables are logically related, e.g. marital status and age, are trickier. The R package errorlocate allows for pinpointing errors in observations using the Fellegi-Holt algorithm and validation rules from the R package validate. The errors can automatically be removed using a pipeline syntax.
Estimating Differential Mortality from EU-SILC UDB Longitudinal Data
By Tobias Göllner, Johannes Klotz
Statistics Austria
Socio-economic differences in mortality have become increasingly important in an era of pension reforms. Some European countries cannot provide any figures on the subject, and available figures are not easily comparable between countries because of different data sources, time periods and stratification variables. As part of the FACTAGE project, we developed a new and relatively easy approach to obtain comparative European figures based on harmonized survey sample data. Longitudinal information on persons and households from the European Survey on Income and Living Conditions (EU-SILC) is extracted from Eurostat’s User Database (UDB), which is available to researchers carrying out statistical analyses for scientific purposes. Initially this method was implemented in SAS, but it was subsequently translated into R to increase the potential user base. The R code consists of functions and raw code that form a crude pipeline of work. The method allows for differential mortality estimations by different variables, as long as they are part of EU-SILC’s harmonized target variables. While our analyses are from a comparative European perspective, the method is in principle also usable for single-country analyses or small groups of countries. The presentation will show results of differential mortality estimations using our method and then focus on the implementation in R. This will open the discussion on how the implementation can be improved from both an R programmer’s and a social scientist’s point of view.
Evaluation of estimation methods for a new survey of the UK’s Office for National Statistics (ONS) using R
By Konstantinos Soulanis
Office for National Statistics
The Annual Survey of Goods and Services (ASGS) is a new survey measuring the detailed annual product turnover of UK service sector businesses. Originally it was envisioned that ASGS estimates would be calculated using a traditional design-based expansion estimator. But once the data had been collected, a model-based conditional ratio estimator, as used for the ProdCom survey in the UK, was also considered. This presentation compares these two methods and describes how they were evaluated to determine which is more efficient for ASGS. As the data cover about 2,000 service products from 40,000 businesses across 282 different industry classes, R was considered an appropriate tool to deal with the complex calculations and visualisation of the results. This involved the use of several well-known R libraries, especially the “tidyverse”. The outcome of the investigation was that the model-based estimator gave superior results compared to the expansion estimator, especially for larger sample sizes.
Evidence for the use of alternative data sources to track consumer and business confidence within emerging markets using sentiment based techniques
By Hanjo Odenaal
Bureau of Economic Research of South Africa
Recent case studies on the construction of consumer confidence indices using online media data have started appearing in the literature. These aptly named ‘sentiment indices’ are constructed using text-based analysis. The obvious advantages of text-based measures for economic tracking are the coverage and cost of these alternative surveying methods. We attempt to gauge the feasibility of constructing online sentiment indices using large amounts of text data as an alternative to the conventional survey method. This paper adopts a quantitative framework to identify the candidate sentiment indices that best reflect the traditional survey-based confidence indices produced by the BER. The results indicate that for the consumer confidence index (CCI), the best candidate indices are constructed from News24 data using specialised financial dictionaries, while for the business confidence index (BCI), the Financial Mail data provided good candidate indices. Finally, two composite indices are constructed using PCA. The resulting indices provide strong evidence for the use of alternative data sources to track consumer and business confidence within an emerging market such as South Africa using sentiment-based techniques.
Experiences in the migration to RStudio-Server in Statistics Austria
By Bernhard Meindl and Alexander Kowarik
Statistics Austria
In this contribution we discuss the experiences we have gained over the last year in migrating the R users within Statistics Austria. In 2018 we started to move selected users from RStudio Desktop to RStudio Server. To minimize the impact of changes, we set up a two-phase infrastructure consisting of a testing and a production environment. In order to properly scale the servers, we aimed for a smooth transition rather than trying to migrate all users at once. Thus, monitoring the workload on the servers was an issue that we kept in mind. Furthermore, we also created a new training course aimed specifically at our existing R users who wanted to move to the server infrastructure. In the process, we have developed several new R packages for internal use (for example to connect to databases or to mount shares) that help our staff in the transition phase. We give an overview of issues that we have encountered and how we tried to tackle them, give insights into the remaining problems that need to be solved, and outline future plans that include using RStudio Connect, which may change the way we operate, collaborate and distribute our work.
From challenges to opportunities: The Romanian Case of Use R in Official Statistics
By Nicoleta Caragea, Ana-Maria Ciuhu and Raluca Mariana Dragoescu
National Institute of Statistics, Ecological University of Bucharest, Institute of National Economy, Romanian Academy
This presentation is two-fold. One part concerns the organizational and technical aspects of introducing R to the statistical office at the Romanian NIS, and the other focuses on teaching R to users in the office. At the Romanian NIS, R is most commonly used in social statistics. Current developments in the use of R include, but are not limited to: managing databases from statistical and administrative sources (data cleaning, outlier detection and treatment, data validation), data matching, record linkage, imputation, data hashing and anonymization techniques, small area estimation (international migration and poverty indicators), and big data (Romania is part of the ESSnet Big Data project). Training of statisticians in the office was initially carried out through one-week courses. Currently, teaching R to users in the office consists of six months of on-the-job training for groups of ten people. This permits targeted training, with applications drawn from daily activity. We conclude with a series of proposals on future research opportunities and other potential analysis procedures using R in social statistics.
How R is improving the dissemination of statistics within the Department for Work and Pensions
By Aoife O’Neill
Department for Work and Pensions
We would like to present how R is improving the production of statistics within the Department for Work and Pensions (DWP), the UK’s biggest government department. DWP administers pension and working-age benefits, as well as disability and health-related benefits, to around 20 million customers, amounting to over €200 billion in expenditure. The statistics published by DWP are used by a range of customers such as local government, who rely on our statistics when planning how they provide services. With such a large responsibility, it is important that the Department produces accurate and reliable statistics. This talk will cover how R Markdown has improved the way we produce statistics and how we have benefited from work conducted in other UK government departments. Using R to automate the creation of our statistical summaries means that more resource can be given to improving and developing our statistics, which is serendipitous given the changes being made to DWP’s services. Matt Upson, a former UK civil servant, highlighted how, unlike in academia where there are numerous journal articles on a topic, official statistics are often “the single source of truth”. This is especially true in DWP, where the department is responsible for deploying its resources as well as developing policy, meaning that it is the source of statistics on the services it provides. This work makes use of the R package “govstyle”, which applies the formats used in our statistical summaries, improving the efficiency of re-creating them in R across government.
How the Scottish Government is moving towards R
By Victoria Avila
National Records of Scotland
In this talk I will present the steps the Scottish Government and other Scottish public sector organisations are taking to adopt R for their analytical needs, driven by the promise of reproducible data analysis pipelines, powerful visualisations and lower costs. I will outline the main barriers to introducing R in government organisations, from IT culture to the career progression of statisticians, and how the Scottish Government is overcoming them. I will present some tangible results, how far we have got and our future plans.
Interactive data visualization web-based application using R-Shiny
By Houssam Hachimi
High Commission for Planning-National Accounts Department
Official statistics data are commonly considered to be of high quality. In order to enhance the utility of these statistics, we have to present them in forms that are easy for the user to interpret by using the power of data visualization and its tools. For this purpose, this paper presents the implementation of an interactive data visualization web application built entirely in R using Shiny, a web application framework for R that makes it easy to adapt R scripts into user-friendly Shiny applications and gives access to the data through an interactive web browser interface. The High Commission for Planning (HCP) in Morocco is one of many National Statistical Organizations that have been researching and developing different tools and methods to present their data in a more attractive format. Thus, the National Accounts Department of the HCP has led this initiative and developed this data visualization project using National Accounts data.
Keywords: Official Statistics, Data Visualization, R Shiny, Web Application.
Introducing R at Statistics Denmark – a not entirely completed how-to
By Peter Tibert Stoltze & other co-authors (to be added)
Statistics Denmark
For many years, the production of official statistics at Statistics Denmark has been based on a mix of different technologies, predominantly SAS and Excel. The development of the statistical production system has been managed locally, and technology has often been a matter of personal preferences. A partial migration of our production systems to R is desirable for several reasons: many initiatives within the European Statistical System have provided high-quality and easy-to-use R packages; new employees are typically well-versed in R rather than SAS; improved efficiency is necessary in order to invest in other areas, e.g. Big Data; and the current code legacy is not sustainable. The introduction of R will be accompanied by standardization initiatives based on the Generic Statistical Business Process Model. Among other things, this means that the decomposition of the entire production process into sub-processes will be reflected in the organization of the source code. This will allow greater reuse of code for specific processes across statistical domains and thus increase efficiency, e.g. in the methodology department. We are currently developing a strategy for introducing R, which has consequences beyond the choice of a preferred technology. Among other things, the strategy will make use of a version management system (e.g. Subversion or Git) mandatory for all source code. We also plan to distinguish between development or experimental code and peer-reviewed production-grade code. Finally, we put forward the hypothesis that introducing what at first seems like a rigid framework actually allows for quicker adaptation to changes, e.g. because the expected input and the generated output are both well defined. It also allows for an appropriate choice of tools for each process, e.g. Python rather than R, or even reuse of existing SAS code.
Introduction to ‘flagr’
By Salvati Matteo, Eurostat and Mészáros Mátyás
Eurostat
The object of this paper is to present the R package ‘flagr’, which is in development at Eurostat to facilitate the internal revision of the use of flags and the flagging of aggregates in dissemination. The ‘flagr’ package provides general functions that follow the methodological guidelines suggested by SDMX for aggregates. The package provides three different methods for transferring the individual flags to the aggregate. The first uses the hierarchy of the SDMX flags suggested by the implementation guidelines. This method compares all flags of a given dataset and keeps, as the flag for the aggregate, the flag with the highest rank in the SDMX hierarchy or in a user-specified order. The second method counts the occurrences of the flags in the underlying data, and the flag for the aggregate is the flag with the highest count. The last method not only counts how often a flag occurs in the dataset, but also takes into account the weight of the individual values, i.e. the contribution of the corresponding individual value to the aggregate. The flag with the highest summed weight is used as the flag of the aggregate if that weight is above a certain threshold.
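The three aggregation rules described above (hierarchy, frequency, weighted share) are simple enough to sketch. The following Python illustration is written under stated assumptions and does not reproduce the flagr package's R functions. The function names, the flag ordering and the example data are hypothetical, and the ordering shown is not the official SDMX hierarchy.

```python
from collections import Counter, defaultdict

# Hypothetical ranking, highest priority first; NOT the official SDMX order.
ORDER = ["b", "e", "p"]

def flag_by_hierarchy(flags, order=ORDER):
    """Method 1: keep the present flag that ranks highest in `order`."""
    present = [f for f in flags if f]
    # order.index raises ValueError for a flag that is missing from `order`.
    return min(present, key=order.index) if present else ""

def flag_by_frequency(flags):
    """Method 2: keep the flag that occurs most often (ties: first seen)."""
    present = [f for f in flags if f]
    return Counter(present).most_common(1)[0][0] if present else ""

def flag_by_weight(flags, weights, threshold):
    """Method 3: keep the flag with the largest share of the total weight,
    but only if that share exceeds `threshold`; otherwise no flag."""
    totals = defaultdict(float)
    for flag, weight in zip(flags, weights):
        totals[flag] += weight  # unflagged values ("") count toward the total
    total = sum(totals.values())
    flagged = {f: w for f, w in totals.items() if f}
    if not flagged or total == 0:
        return ""
    best = max(flagged, key=flagged.get)
    return best if flagged[best] / total > threshold else ""

# Five component values of one aggregate: flags and contributions (made up).
flags = ["p", "e", "e", "", "b"]
weights = [50.0, 10.0, 15.0, 20.0, 5.0]

by_hierarchy = flag_by_hierarchy(flags)
by_frequency = flag_by_frequency(flags)
by_weight = flag_by_weight(flags, weights, threshold=0.4)
by_weight_strict = flag_by_weight(flags, weights, threshold=0.6)
```

On this made-up input the three rules disagree: the hierarchy rule gives "b", the frequency rule gives "e", and the weighted rule gives "p" at a threshold of 0.4 but no flag at 0.6. This shows why the choice of method matters for the disseminated aggregate.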
Investigating Chaos in Time Series: Evidence from the Cryptocurrency Market
By Sami Diaf
Information Systems and Machine Learning Lab (ISMLL), Universität Hildesheim
This paper investigates the statistical properties of the five main cryptocurrencies by market capitalization, since 2013, to determine the nature of their embedded patterns. Irrespective of the current debate surrounding the cryptocurrency market and the Fintech trends, the results clearly indicate a higher variability of these series coupled with the presence of chaotic patterns. Hence, the use of advanced statistical techniques proved to be crucial to shed light on the underlying behavior and the randomness of the data-generating process. In particular, this work explores the nature of the memory affecting the data using fractal analysis, and inspects its predictability by computing the maximum Lyapunov exponent to identify possible chaotic patterns. This analysis can inform central bankers and help financial regulators gauge the risks surrounding the development of cryptocurrencies. R packages used: crypto, fractal, pracma, nonlinearTseries, tseriesChaos, coinmarketcapr, dplyr, ggplot2.
Lack-of-fit testing without replicates available – a modern clustering approach
By Tyler George
Department of Mathematics, Central Michigan University
The classical lack-of-fit (LOF) test for checking the linearity assumption of a linear regression model requires replicates in the predictors. Replicates are often not available in high-dimensional data or when data collection is unstructured, a common occurrence with big data. A new R package called LOFnorep is presented to test for LOF in a linear regression model when replicates are not available. This package utilizes a data-grouping methodology entailing three algorithmic steps: first choosing a grouping method to group the data, then fitting linear regression models to each group, and finally comparing these fits. The developed R package uses clustering functions, including those found in the mclust R package, to form its groups. The new algorithm described performs better than some existing tests in simulated power studies. It is also more robust when testing a wide variety of data structures, such as sinusoidal, quadratic, and cubic. The results indicate the new algorithm has higher or equal power in all these data configurations. Additionally, it is applicable to high-dimensional and big data.
Macroeconomic Statistical Forecasting for Engine Demand
By Ankit Kamboj, Debojyoti Samadder and Ambica Rajagopal
Cummins India Ltd.
Forecasting demand is a critical issue for driving efficient operations in a manufacturing firm. For this reason, firms take care to plan their operations and strive to improve their forecasting methods to gain an edge over competitors in the market. The purpose of this paper is to evaluate various shrinkage methods for data containing large numbers of features. Here we focus on the Class 8 Group 2 North America Heavy Duty (NAHD) market and macroeconomic indicators from the ACT Research economic database to forecast engine shipments a full 3 months out. Various pre-processing techniques were applied to all the variables, which were then decomposed into their components (trend, seasonality and remainder) by applying Seasonal and Trend decomposition using Loess (STL). Then, for each pre-processing technique, the decomposition was analysed visually. After this, the relative significance of the variance associated with each decomposed component was used to select the appropriate pre-processing technique for each variable, in order to ensure their stationarity for reliable forecasting accuracy. We applied several statistical as well as machine learning methods and built an ensemble of them to minimize forecasting error. It was also noticed that there was hardly any increase in accuracy when the number of features was increased beyond 25-30. The following are a few important R packages that were used in our analysis: forecast, forecastHybrid, tseries, readxl, xts, quantmod, e1071, lars.
Keywords: Box-Cox Transformation, Stationarity, STL Decomposition, Least Angle Regression, Shrinkage, Lasso, Support Vector Regression
Optimal Boundary Value for Creating Anonymized Microdata: Empirical Analysis based on Economic Survey Data
By Yutaka Abe, Kiyomi Shirakawa and Hitotsubashi Ryota Chiba
Hitotsubashi University / National Statistics Center, Hitotsubashi University, National Statistics Center
When a ministry or agency of Japan creates anonymized data from official statistics, it often uses top (or bottom) coding to anonymize variables in the questionnaire information. Because the ministry or agency cannot anticipate how researchers will use the anonymized data, it does not apply data transformations such as a logarithmic transformation before the top coding. On the other hand, researchers often transform data before their analysis, especially when analyzing economic survey data. Therefore, we examined the influence of top coding on linear regression in order to find a suitable top code (upper limit value) for a variable. We used questionnaire information from the “2015 Survey of Research and Development”. The survey is designed to provide basic materials for promoting science and technology in Japan by studying the research and development (R&D) activities carried out in Japan. We developed an R program to estimate the lower bound of the suitable top code, to provide guidance on top coding of official anonymized data. First, this program searches for the borders of groups in a given dataset using a decision tree method and sets them as candidate lower bounds for the top code, and then selects the bound for which the AIC of the Chow test is the smallest. We stratified the questionnaire information and the top-coded data that satisfied the selected bound, and transformed each stratum. Thereafter, we verified the information loss from the top coding in each stratum by comparing a linear regression estimated on the distribution truncated by top coding with a linear regression on the original distribution.
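The interaction between top coding and a later log transformation can be illustrated with a toy example. The sketch below uses Python and made-up data. It only shows how capping values at an upper limit changes a regression slope fitted after a logarithmic transformation. It does not implement the authors' program, whose decision-tree search and Chow-test AIC selection are not reproduced here. All names are hypothetical.

```python
import math

def top_code(values, limit):
    """Replace every value above `limit` with `limit` (top coding)."""
    return [min(v, limit) for v in values]

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Made-up skewed variable: y doubles with each unit of x,
# so log2(y) is exactly linear in x with slope 1.
x = list(range(1, 11))
y = [2.0 ** k for k in x]

limit = 256.0                      # hypothetical top code (upper limit value)
y_top = top_code(y, limit)         # the two largest values are capped

slope_original = ols_slope(x, [math.log2(v) for v in y])
slope_top_coded = ols_slope(x, [math.log2(v) for v in y_top])
information_loss = slope_original - slope_top_coded
```

Here the slope on the original data is 1, while the top-coded data yield a smaller slope because the capped observations flatten the upper tail. Comparing the two fits for several candidate limits is one simple way to see how much a given top code distorts the regression.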
Overlapping classification for autocoding system
By Yukako Toko, Shinya Iijima and Mika Sato-Ilic
National Statistics Center, Research and Development Division, Faculty of Engineering, Information and Systems, University of Tsukuba
Coding is the classification of objects based on given classification codes, and it is often required in the field of official statistics. In our previous study we developed a supervised multiclass classifier for autocoding, implemented in R. The purpose of this study is to apply this classifier efficiently to the coding task of the Family Income and Expenditure Survey in Japan. Although the classifier has high accuracy for the autocoding task, some objects with ambiguous input information are still assigned incorrect codes. In analyzing the incorrectly classified data, we found that a classification code could not be uniquely determined for some objects because of semantic problems, interpretation problems, and insufficiently detailed input information. These issues suggest that there is a need for a classifier that lists multiple candidates for coding tasks. We propose a new classifier that lists multiple candidates in descending order of reliability in its output, and assists experts in selecting the correct code from the listed candidates. In addition, by using a new reliability score based on weights of entropy, the accuracy and practicability of the proposed classifier are improved while the advantages of a structurally simple algorithm and practical calculation time are retained. The proposed algorithm is implemented in R to improve its versatility. During our presentation, we will present the details of the algorithm of the new classifier, including an illustrative numerical example with survey data.
Keywords: Coding, Machine learning, Overlapping classification
R packages for optimal stratified sampling: a review and compared evaluation
By Marco Ballin and Giulio Barcaroli
Italian National Institute of Statistics
“SamplingStrata” and “stratification” are two R packages for determining the optimal stratification that ensures the best combination of expected precision levels and sample size. In this presentation they are compared, in particular with respect to the efficiency of the solutions, but also with respect to their coverage of the overall sampling process.
Package “stratification” operates only in univariate cases, where one continuous variable Y is considered as the target. Various algorithms are available to optimize stratification and allocation, minimizing the total sample size while satisfying a given precision constraint under a given number of strata. Y can coincide with the X variable available in the frame; otherwise the package allows a model between X and Y to be introduced, taking into account the related anticipated variance.
“SamplingStrata” handles a more general problem, with no limits on the number of target (Y) variables and auxiliary (X) frame variables; moreover, the optimization problem can be solved by considering different domains. In this case too it is possible to introduce models (one for each Y variable) in order to take into account the anticipated variance.
A complete efficiency comparison can be performed in a straightforward way in the univariate case, where the packages produce the same results (but “stratification” in less time), both in the simple case of Y as target variable and in the case where a model between X and Y is considered. The multivariate comparison can be carried out only by introducing a procedure in which “stratification” is applied separately to the different Y’s and the results are combined in order to be compared with those of “SamplingStrata”.
Baillargeon S and Rivest L-P 2011. The Construction of Stratified designs in R with the package stratification. Survey Methodology, 37: 53-65.
Baillargeon S and Rivest L-P 2014. stratification: Univariate Stratification of Survey Populations. R package version 2.2-5. (http://CRAN.R-project.org/package=stratification)
Ballin M and Barcaroli G 2013. Joint Determination of optimal Stratification and Sample Allocation Using Genetic Algorithm. Survey Methodology, 39: 369-393.
Barcaroli G 2014. SamplingStrata: An R Package for the Optimization of Stratified Sampling. Journal of Statistical Software, 61(4), 1-24. (http://www.jstatsoft.org/v61/i04/)
Barcaroli G, Ballin M, Pagliuca D, Willighagen E and Zardetto D 2018. SamplingStrata: Optimal Stratification of Sampling Frames for Multipurpose Sampling Surveys. R package version 1.2. (https://CRAN.R-project.org/package=SamplingStrata)
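The univariate problem described above, the smallest total sample that meets a precision constraint for fixed strata, has a closed form under Neyman allocation. The sketch below (Python, textbook formula, not the algorithm of either package) assumes simple random sampling within strata; the strata sizes, standard deviations and population total are made-up numbers, and stratum boundaries are taken as given rather than optimized.

```python
import math


def neyman_sample_size(strata, population_total, target_cv):
    """Minimum total sample size and Neyman allocation for a target CV.

    strata: list of (N_h, S_h) pairs, stratum size and standard deviation of Y.
    Returns (n, allocation) as real numbers; round up in practice.
    """
    sum_ns = sum(size * sd for size, sd in strata)
    sum_ns2 = sum(size * sd ** 2 for size, sd in strata)
    target_var = (target_cv * population_total) ** 2
    n_total = sum_ns ** 2 / (target_var + sum_ns2)
    allocation = [n_total * size * sd / sum_ns for size, sd in strata]
    return n_total, allocation


def cv_of_estimated_total(strata, allocation, population_total):
    """CV of the stratified estimator of the total, with finite population correction."""
    variance = sum(
        size ** 2 * sd ** 2 / n_h * (1.0 - n_h / size)
        for (size, sd), n_h in zip(strata, allocation)
    )
    return math.sqrt(variance) / population_total


# Made-up frame: three strata of increasing variability
strata = [(1000, 5.0), (300, 20.0), (50, 100.0)]
population_total = 100000.0

n_total, allocation = neyman_sample_size(strata, population_total, target_cv=0.02)
achieved_cv = cv_of_estimated_total(strata, allocation, population_total)
```

With real frames the allocation would be rounded up and strata where the allocation exceeds the stratum size would be treated as take-all strata; both steps are omitted here.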
pestim - an R package to compute population estimations using mobile phone data
By Bogdan Oancea, David Salgado, Luis Sanguiao, and Antoniade Ciprian Alexandru
University of Bucharest and National Statistics Institute of Romania, Dept. Methodology and Development of Statistical Production, INE, Ecological University of Bucharest and National Statistics Institute of Romania
Integrating mobile phone data into the production of official statistics was one of the main goals of the ESSnet Big Data project. In this regard, we developed an R package to compute population estimates following a methodology inspired by ecological sampling techniques. This methodology uses a Bayesian approach that is computationally intensive but allows straightforward parallelization of the code. Since some functions of our package are very demanding from a computational point of view, we implemented them in C++ and integrated them with the rest of the package using Rcpp. Moreover, the C++ code is also parallelized, and we chose the RcppParallel package for this purpose. The estimation procedure combines mobile phone data sets with another data source, which can be a population register, and produces estimates for each territorial division of a geographical area and along a sequence of time instants for which we have data from Mobile Network Operators. One of the hypotheses of the underlying mathematical model is the independence of the estimates between different cells, which allowed us to use a parallel procedure to perform the computations for each cell. Besides population estimates, the package provides a set of accuracy indicators. pestim was developed with an eye on portability and can be used on both Windows and Unix-like operating systems. We tested our package using synthetically generated data and showed that it scales well, which is an essential characteristic for real mobile phone data.
reclin: a package for record linkage and deduplication
By Jan van der Laan (Statistics Netherlands)
Statistics Netherlands
The goal of record linkage and deduplication is to determine which records belong to the same entities. When the records belong to different sources the term record linkage is used; when the records come from the same source the term deduplication is used. When a unique key that identifies the entities is available for all records, this process is simple and there are plenty of routines in R that can be used (e.g. merge or dplyr::left_join). However, when the linkage key consists of multiple variables that can contain errors and are not necessarily unique (such as name, address, date of birth), the process becomes more complex. The reclin package has been developed for this case. It implements the Fellegi-Sunter method of probabilistic record linkage, but it is set up as a toolbox that can also be used to assist in using other methods such as machine learning. The general process of probabilistic record linkage will be presented, together with how the reclin package assists this process.
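The core of the Fellegi-Sunter approach mentioned above is a match weight built from per-field agreement probabilities. The following is a minimal sketch in Python, not the reclin API: the m- and u-probabilities, the field names, the example records and the two thresholds are all assumptions chosen for illustration (in practice the probabilities are typically estimated, for example with an EM algorithm).

```python
import math

# Assumed probabilities per field: m = P(agree | same entity), u = P(agree | different entities)
PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.90, 0.005),
    "postcode": (0.85, 0.05),
}


def match_weight(record_a, record_b, params=PARAMS):
    """Sum of log2 likelihood ratios over all comparison fields."""
    weight = 0.0
    for field, (m, u) in params.items():
        if record_a[field] == record_b[field]:
            weight += math.log2(m / u)              # agreement weight
        else:
            weight += math.log2((1 - m) / (1 - u))  # disagreement weight
    return weight


def classify(weight, lower=0.0, upper=8.0):
    """Three-way decision using two (hypothetical) thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Made-up records
a = {"last_name": "Jansen", "birth_date": "1980-02-01", "postcode": "2492"}
b = {"last_name": "Jansen", "birth_date": "1980-02-01", "postcode": "2511"}
c = {"last_name": "Bakker", "birth_date": "1975-11-30", "postcode": "1012"}

decision_ab = classify(match_weight(a, b))
decision_ac = classify(match_weight(a, c))
```

Pairs falling between the two thresholds are usually sent to clerical review; exact string equality is used here for brevity, whereas real linkage often relies on approximate string comparison.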
Responsive, web-based graphical user interfaces to R
By Adrian Dușa
University of Bucharest
There are a fair number of graphical user interfaces in and for R, written in various programming languages such as Java (JGR, Deducer, Glotaran) and Python (OpenMeta-Analyst), either as standalone software (such as Rattle) or available as internal R packages (for instance R Commander).
Lawrence and Verzani (2012) present no fewer than four different frameworks to develop GUIs in and for R: gWidgets, RGtk2, Qt and Tcl/Tk, but interestingly they did not touch on yet another, currently fashionable environment: JavaScript. The well-known RStudio is essentially a web page, and modern interfaces to R can be written using the shiny package, yet another tool to make R available via a web page.
This presentation introduces a different kind of web-based graphical user interface. Starting from an initial application based on the shiny package, it is now evolving into self-contained, installable software using the Electron.js framework.
(R)evolution of generalized systems and statistical tools at Statistics Canada
By Susie Fortier and Steven Thomas
Statistics Canada
As part of its modernization initiative, Statistics Canada is moving beyond a survey-first approach, with leading-edge methods to integrate and transform a variety of data sources into statistical information to support evidence-based decision-making. To go along with the development of new methods that address emerging data availability and data needs, new computer software options must be studied, developed and implemented within Statistics Canada’s environment. A large part of this work concentrates on the adoption, use and support of open-source solutions such as R. At the same time, continuous support must be provided for the set of robust methodological tools developed throughout the years to apply methods said to be “traditional”. This talk will present Statistics Canada’s experience with the exploration of R as a research and production tool. Issues related to governance, risk assessment and capacity building will be covered, as will other opportunities and challenges.
R’s Shiny package and Survey Solutions for (Active) Survey Management (single slot)
By Michael Wild
World Bank
This tutorial will give an introduction to combining R’s powerful shiny package with the Survey Solutions REST API to manage surveys in different survey modes. The focus will mainly be on face-to-face and web surveys; however, the interfaces we create can also be used for telephone surveys. We will use remote sensing data to determine frame imperfections, as well as survey data and paradata to monitor data quality. With these data at hand, we will create different dashboards which facilitate the work of survey managers and allow the quality of incoming data to be monitored almost in real time.
Shiny is an open source R package that provides an elegant and powerful web framework for building web applications using R. Shiny helps you turn your analyses into interactive web applications without requiring HTML, CSS, or JavaScript knowledge. Survey Solutions is a powerful Computer Assisted Survey System, which includes data and survey management functionalities plus a highly versatile REST API. This REST API makes it possible to connect fully customized survey and data management tools, which take the active survey management paradigm to the next level.
Besides shiny we will also make use of the following packages: data.table (Dowle et al., 2018), dplyr (Wickham, 2017), httr (Wickham, 2017), raster (Hijmans et al., 2017), and the new sf (Pebesma et al., 2018) package.
rtrim – an R implementation of Trends and Indices in Monitoring data
By Patrick Bogaart, Mark van der Loo, Jeroen Pannekoek
Statistics Netherlands
In the Netherlands, ecological monitoring data for birds, butterflies, and other faunal groups are collected by NGOs such as the Dutch Butterfly Conservation and analysed by Statistics Netherlands to compute national or regional trends in annual species abundance. The methodology used is an extended form of Poisson regression, implemented as a GLM, but allowing for e.g. overdispersion, serial correlation and upscaling from multiple subregions. The specific nature of the monitoring data, often with many monitoring sites and hence a large number of model parameters, calls for dedicated numerical methods to build a fast and stable estimation algorithm.
The original implementation of this method, TRIM, in Pascal/Delphi, has been in use at Statistics Netherlands and in other European countries for more than two decades. However, because of the diminishing support for Pascal/Delphi, the desire to extend the methodology and, more generally, the wish for a more versatile and better maintainable system, a need emerged to re-implement the method, preferably as open source, in a modern and well-supported language. These requirements naturally suggest R as the preferred tool. Additional benefits include the opportunity to link to the rich library set provided by the R environment, allowing easy integration in the statistical workflow.
After sketching this historical background, we will highlight key characteristics of the re-implementation process. For example, continuity of the statistical process called for strong requirements on validation of the new R implementation against the former Pascal code.
We will demonstrate both the core functionality of rtrim and many of the novel extensions. Examples include monthly count data and various types of visualization, ranging from dedicated composite plots and heatmaps to methods to derive uncertainty intervals or detect outliers in the data.
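To make the modelling idea concrete, the sketch below fits the simplest log-linear Poisson model with site and year effects, mu[i][j] = site[i] * year[j], to counts with missing observations, and derives year indices relative to the first year. It is written in Python, is not the rtrim algorithm or API, and leaves out everything that makes TRIM distinctive (overdispersion, serial correlation, covariates, standard errors). The counts are made up, and the sketch assumes every site and every year has at least one observed count.

```python
def fit_site_year_model(counts, iterations=200):
    """Maximum-likelihood fit of mu[i][j] = site[i] * year[j] for Poisson counts.

    counts[i][j] is the count at site i in year j, or None if not observed.
    Uses the Poisson likelihood equations: observed and fitted totals must
    agree for every site and every year, over the observed cells only.
    """
    n_sites, n_years = len(counts), len(counts[0])
    site = [1.0] * n_sites
    year = [1.0] * n_years
    for _ in range(iterations):
        for i in range(n_sites):
            seen = [j for j in range(n_years) if counts[i][j] is not None]
            site[i] = sum(counts[i][j] for j in seen) / sum(year[j] for j in seen)
        for j in range(n_years):
            seen = [i for i in range(n_sites) if counts[i][j] is not None]
            year[j] = sum(counts[i][j] for i in seen) / sum(site[i] for i in seen)
    return site, year


def year_indices(year):
    """Year effects expressed relative to the first year (index 1.0)."""
    return [effect / year[0] for effect in year]


# Made-up counts for 3 sites and 4 years; None marks a year without a visit
counts = [
    [10, 15, None, 20],
    [20, 30, 10, 40],
    [40, None, 20, 80],
]
site_effects, year_effects = fit_site_year_model(counts)
indices = year_indices(year_effects)
imputed = site_effects[0] * year_effects[2]   # model-based value for the missing cell
```

Because the toy counts are exactly multiplicative, the fitted indices reproduce the pattern 1.0, 1.5, 0.5, 2.0 despite the two missing cells; real monitoring data would not fit this cleanly.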
SelEdit… - a collection of R packages to implement some optimization-based selective editing techniques
By Elisa Esteban, Soledad Saldaña, and David Salgado
Dept. Methodology and Development of Statistical Production, INE
In the quest to industrialise statistical production at Statistics Spain (INE), we have recently proposed following the principles of functional modularity to implement a modern production system. We conceive of such a system as a sequence of layers, ranging from the definition of the statistical needs/problem, through the statistical method(s) that provide a solution and the software implementation following both the functional and object-oriented paradigms, to the software implementation at a scripting level and finally a user interface. We have developed a collection of R packages providing a first proof of concept of an optimization-based approach to selective editing, proposed and followed at Statistics Spain (INE) in a number of monthly Short-Term Business Statistics. This approach has been complemented with a computationally intensive proposal for the construction of validation intervals for each time period, each statistical unit, and each variable. All our R packages make intensive use of the S4 object-oriented system, pursuing a direct implementation of functional modularity principles with a strong methodological foundation for the design of modules. We present the current status of the project, focusing on the implementation of the statistical methods, going from small functions in diverse R packages to survey-specific scripts in actual production.
The Use of R Shiny at the U.S. Bureau of Labor Statistics
By Brandon Kopp
U.S. Bureau of Labor Statistics
Economists and statisticians at the U.S. Bureau of Labor Statistics (BLS) have developed several R Shiny applications that are or will soon be used at various points throughout the survey lifecycle. The National Compensation Survey is nearing completion on an app that will allow field supervisors to monitor the progress of data collection at the national, regional, and state levels. The Producer Price Index uses an app which provides time series and other visualizations to aid in data review. The Consumer Expenditure Survey has developed an app that allows stakeholders to view key data quality metrics such as response rates and imputation rates over time. The Office of Publications is working on a system that will automatically generate news release statements from structured data tables. In this presentation, I will show examples of how these applications are or will be used and discuss challenges we have encountered in developing and deploying these applications. I will discuss the use of R-Portable as a workaround for deploying applications within an organization.
Transforming Health and Social Care Publications in Scotland
By Anna Price, David Caldwell, Ewout Jaspers, Maighread Simpson
NHS National Services Scotland
Objectives
The Information Services Division (ISD) of the National Health Service Scotland produces around 200 health and social care publications each year which are designated either official or national statistics. Most publications are produced using SPSS, and published as static PDF documents with accompanying Excel tables. Feedback has shown that our data can be challenging to find and digest in this format. Furthermore, production is time-consuming, involving extensive manual formatting and checking. The transforming publications programme aims to modernise how ISD produces and releases data.
Methods
The team initially focused on one publication as a proof of concept. To make the process more efficient, robust and reliable, we transferred data production from SPSS to R, using modern data wrangling code from the tidyverse suite of packages. To build the new publication platform we used a combination of RShiny dashboards and D3 charts. We used git and GitHub for version control and have published the code behind the RShiny dashboard. We also developed an R Style Guide, GitHub best practice and a suite of R resources in order to facilitate learning, development and collaborative working within ISD and across the wider public sector in Scotland.
Results
A prototype for statistical publications was released in December 2017, providing customers with the data they need in a way that they can understand. Furthermore, the new method of production has created time savings and reduced the risk of manual errors. We are now working with several teams within ISD to transform their publications into this new design. We are also developing the first Reproducible Analytical Pipeline for an official statistics publication in Scotland in order to streamline the production process further.
Two main uses of R in Statistics Portugal: sampling and confidentiality
By Pedro Sousa, Conceição Ferreira, Inês Rodrigues, Pedro Campos
Statistics Portugal, Statistical Methods Unit
R has been used in Statistics Portugal for more than 15 years and its use is currently widespread throughout the organization. In this presentation, we focus on the Statistical Methods Unit, where there are two main areas of R usage: sampling and disclosure control.
For many of our sampling procedures, R is applied as a primary tool: we make use of packages such as RODBC for database access and survey for data analysis on complex survey samples.
With regard to statistical disclosure control, the use of R in Statistics Portugal is strong, given the recent developments concerning packages for protecting the confidentiality of microdata and tabular data. The R package sdcMicro has been a valuable tool for estimating disclosure risk under different intruder scenarios, in a quick and friendly manner. R has also played a central role in studying techniques for producing Public Use Files for the Household Budget Survey: parametric and non-parametric methods have been compared regarding their capacity to generate safe and useful synthetic data. With respect to Census data, perturbative methods for table protection have been studied, which included writing R functions to check for two priorities when analyzing usefulness: table consistency and additivity.
Besides applying R at the Unit, we encourage its use in Statistics Portugal through systematic four-day courses covering some basic commands. These allow ever more users to manage, analyze or visualize data using R.
Use of Choropleth Maps for Regional Statistics
By Jillian Delaney
Central Statistics Office, Quality Frameworks and Statistical Methods
One of the strategic aims of the Irish Central Statistics Office is to provide greater insight from our statistical outputs through improved communications, including the use of visualisation. Another key priority of the CSO is to address the demands of Irish users for more detail in regional statistics. To effectively communicate this extra detail, choropleth maps have proved invaluable. These maps use differences in shading, colouring, or the placing of symbols within predefined areas to indicate the average values of a particular quantity in those areas. To implement these maps in R, cartographic boundary shapefiles are required. These files are simplified representations of selected geographic areas. In this paper, an application of choropleth maps developed in R is described for a project which developed a new approach to measuring and understanding the activities of the tourism industries in Ireland. In this case, choropleth maps were used to illustrate a measure known as a Tourism Dependency Ratio (TDR) at county level.
Keywords: visualisation, R, choropleth maps, regional statistics, boundary files
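Before any map is drawn, a choropleth needs a rule that turns each area's value into a shading class. The sketch below shows one common choice, quantile classes, in Python; it is not taken from the CSO project, the area names and ratio values are invented, and drawing the map from boundary files is left out.

```python
def quantile_breaks(values, n_classes):
    """Upper class limits such that each class holds roughly the same number of areas."""
    ordered = sorted(values)
    breaks = []
    for k in range(1, n_classes + 1):
        # index of the last ordered value that belongs to class k (ceiling division)
        last = (k * len(ordered) + n_classes - 1) // n_classes - 1
        breaks.append(ordered[last])
    return breaks


def shade_class(value, breaks):
    """Index of the first class whose upper limit is not exceeded (0 = lightest)."""
    for k, upper in enumerate(breaks):
        if value <= upper:
            return k
    return len(breaks) - 1


# Invented ratio values for eight hypothetical areas
ratios = {
    "Area A": 2.1, "Area B": 3.4, "Area C": 5.0, "Area D": 6.2,
    "Area E": 7.7, "Area F": 9.1, "Area G": 12.5, "Area H": 20.3,
}
breaks = quantile_breaks(list(ratios.values()), n_classes=4)
classes = {area: shade_class(value, breaks) for area, value in ratios.items()}
```

Quantile classes give every shade the same number of areas; equal-width or natural-break classes are alternatives when the distribution is strongly skewed.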
Using R for analysis and production of Price Indices for the Production and Services sector of the economy
By Matt Mayhew
Office for National Statistics
At the Office for National Statistics (ONS), R has been used for the analysis and production of data used in the calculation of the Producer Price Index (PPI) and its Services (SPPI), Export (EPI) and Import (IPI) equivalents. This presentation will cover two uses of R, one in production and the second in analysis.
Since 2015 R has been used to draw the samples for the SPPI, EPI, and IPI. This was done to allow greater harmonisation in the creation of the samples through a set of common code, and harmonisation of the methodology for creating the samples. This would have been more difficult to do in the existing SAS system for the SPPI. It also allows for greater transparency of the methodology.
Recently R has been used to analyse the effect of changing the methodology in the PPI and the SPPI to comply with Eurostat regulation, moving from 5-yearly rebasing to annual chain-linking. R was used because of the high number of strata in the indices; it allowed for automation of the analysis and clear visualisation of the results. The use of RMarkdown aided the dissemination of the results.
Using R in both of these processes improved the processing time and increased the amount of analysis done. It has also promoted the use of R for other parts of the production process.
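Annual chain-linking, as opposed to holding weights fixed for five years, can be sketched as follows: each year's index movement is a weighted average of price relatives using the previous year's weights, and the yearly links are multiplied together. This is a generic Laspeyres-type illustration in Python with made-up prices and weights; it is not the ONS production method, and the function names are hypothetical.

```python
def laspeyres_link(prices_prev, prices_curr, weights):
    """Weighted average of price relatives; weights are shares that sum to 1."""
    return sum(w * (curr / prev) for w, prev, curr in zip(weights, prices_prev, prices_curr))


def chained_index(prices_by_year, weights_by_year, base=100.0):
    """Chain-linked index: each link uses the weights of the preceding year."""
    index = [base]
    for t in range(1, len(prices_by_year)):
        link = laspeyres_link(prices_by_year[t - 1], prices_by_year[t], weights_by_year[t - 1])
        index.append(index[-1] * link)
    return index


# Made-up prices for two products over three years
prices_by_year = [
    [10.0, 20.0],
    [11.0, 20.0],
    [11.0, 22.0],
]
# Made-up weights, updated annually (one set per link)
weights_by_year = [
    [0.5, 0.5],
    [0.6, 0.4],
]
index = chained_index(prices_by_year, weights_by_year)
```

Under 5-yearly rebasing the first set of weights would be reused for every link until the next rebase, so comparing the two series item by item shows where the change of method matters.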
Using R for data cleaning, integration and estimation challenges in Statistics Poland - some conclusions after VIP.ADMIN project
By Beręsewicz Maciej and Pawlikowski Dawid
Centre for Small Area Estimation, Statistical Office in Poznań, Centre for Urban Statistics, Statistical Office in Poznań
The main goal of the VIP.ADMIN project was to improve the use of administrative sources. This could be achieved either by linking available registers or by linking survey data with administrative records.
The Statistical Office in Poznań took part in the ESS.VIP ADMIN WP6 pilot studies and applications, which covered linking several registers and sample surveys in order to provide information about both marital statuses (de facto and de jure). However, not all registers contain the Person Identification Number (PESEL), and sample surveys by definition do not contain such information due to privacy issues. Therefore, there was a need for probabilistic record linkage of these sources.
In the presentation we would like to focus on how the problems with data cleaning, probabilistic record linkage and estimation based on multiple data sources were tackled with the use of R. The presentation covers packages for processing the data (e.g. tidyverse, data.table, stringi), linking surveys and registers (e.g. RecordLinkage, fastLink) and estimation (e.g. survey, laeken).
Using R for variance estimation in social surveys
By Eleanor Law, Vahé Nafilyan, Ria Sanderson
Office for National Statistics
Variance estimation is an important part of the production of official statistics and their accuracy, accessibility and comparability. Currently the primary tool used in sample design and estimation for social surveys at the UK Office for National Statistics is a SAS implementation of the generalised linearised jackknife method of variance estimation.
The UK Government Statistical Service is increasingly focussed on innovation and cost effectiveness, supported by increased use of open source software. To aid re-use and transparency of methods, we have implemented this method of variance estimation for complex survey designs in R.
We have used the Wealth and Assets Survey, a longitudinal survey of household wealth in Great Britain, as a case study to test this R implementation. This survey features cluster sampling, stratification, and calibration, so it offers a useful comparison between the R method and the existing approach.
In this presentation, we will give an overview of the method, how it is implemented in R and the tests that we have carried out. We will discuss next steps, where we are looking to implement more flexibility and the ability to estimate the variance of change between different time points.
Using R to access official data from the Guatemalan National Institute of Statistics
By Oscar de León
Centro de Estudios en Salud, Universidad del Valle de Guatemala
Although several technical standards are available for the preparation and dissemination of official statistics, entities producing these data will sometimes export them in formats native to the variety of tools they use for data management and analysis, and then publish these datasets in ad hoc web portals not necessarily well suited for data consumption by statistical tools like R. Even if the data are available in one of the more common file formats (e.g. SAS, SPSS, or Stata data files, and even Excel spreadsheets), the web portals can prove difficult even to navigate in a web browser, let alone to use as repositories for more automated data access.
The Guatemalan National Institute of Statistics (INE) provides several official data resources through a web portal roughly arranged by categories and time periods (https://www.ine.gob.gt/index.php/estadisticas/fuente-base-de-datos, website in Spanish), but the available datasets are not clearly documented and are typically discovered through manual browsing of the site. Once a user finds a dataset of interest, the format used to export the data is only implied by the context, and no additional instructions on how to use the data are provided.
This work presents the overall structure of the INE data portal and showcases the nsgtm (National Statistics - Guatemala) R package, which allows users to explore the available INE datasets within R and download them directly for use in an R session. As a motivating example, the package is used to find and load the official birth records from 2010-2015, which are spread across several pages.
Variance estimation for annual point estimates and net changes for LFS using R package vardpoor
By Juris Breidaks
Central Statistical Bureau of Latvia
The presentation is devoted to the function vardannual from the R package vardpoor, which the Central Statistical Bureau of Latvia developed in 2017. The paper describes the variance estimation of quarterly estimates and the correlation estimation of two-quarter change estimates, and finally explains how to extend the approach to variance estimation for annual point estimates and net changes for Labour Force Survey (LFS) indicators. Variance estimates for annual point estimates and net changes were computed for LFS indicators using the function vardannual. The function was tested on simulated and real data. It is important for assessing the quality of LFS estimates and the statistical significance of the estimates. The annual net changes of all indicators are calculated with a confidence interval, and if the confidence interval for the difference does not contain 0, we are able to conclude that the difference is statistically significant. The results also show that the confidence interval with calibration is narrower than the one without calibration. The function vardannual in the R package vardpoor has been applied in practice to the LFS in Latvia.
References
BERGER, Y., OSIER, G., GOEDEMÉ, T. (2017). Standard error estimation and related sampling issue, Monitoring social inclusion in Europe (Eurostat), pp. 465-480
BERGER, Y. G. and PRIAM, R. (2016), “A simple variance estimator of change for rotating repeated surveys: an application to the EU-SILC household surveys”, University of Southampton, Statistical Sciences Research Institute. Available at http://eprints.soton.ac.uk/347142
BREIDAKS J., LIBERTS M., IVANOVA, S. (2018). vardpoor: Variance Estimation for Sample Surveys by the Ultimate Cluster, R package version 0.10.0., URL http://cran.r-project.org/web/packages/vardpoor/index.html
OSIER G., RAYMOND V., (2015). Development of methodology for the estimate of variance of annual net changes for LFS-based indicators. Deliverable 1 - Short document with derivation of the methodology (FINAL), SOGETI
OSIER G., PERRAY P., (2016). Variance estimators of annual levels and net changes for a defined set of LFS-based indicators.
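The significance rule stated in the abstract can be written out in a few lines. The sketch below is in Python and does not call vardannual; it uses the standard formula for the variance of a difference between two correlated estimates, Var(d) = V1 + V2 - 2 * rho * sqrt(V1 * V2), with a normal-approximation 95% interval. The estimates, variances and correlation are made-up numbers.

```python
import math


def net_change_interval(est_prev, est_curr, var_prev, var_curr, correlation, z=1.96):
    """Net change between two periods with its confidence interval."""
    change = est_curr - est_prev
    var_change = var_prev + var_curr - 2.0 * correlation * math.sqrt(var_prev * var_curr)
    half_width = z * math.sqrt(var_change)
    return change, change - half_width, change + half_width


def is_significant(lower, upper):
    """The change is significant when the interval does not contain 0."""
    return not (lower <= 0.0 <= upper)


# Made-up annual unemployment rates (%) and their variances
with_overlap = net_change_interval(8.7, 8.2, 0.04, 0.04, correlation=0.5)
independent = net_change_interval(8.7, 8.2, 0.04, 0.04, correlation=0.0)

significant_with_overlap = is_significant(with_overlap[1], with_overlap[2])
significant_independent = is_significant(independent[1], independent[2])
```

The same change of half a percentage point is significant when the positive correlation induced by the rotating panel is taken into account and not significant when it is ignored, which is why estimating that correlation matters for LFS net changes.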
We hope you enjoyed your time at uRos2018. See you next time!