Services concerning ethical, communicational, skills issues and methodological cooperation related to the use of Big Data in European statistics

(Contract number 11104.205.005-2015.799)

Development of a training strategy to bridge the big data skills gap in European official statistics

Version 2

Date of the report: 30 November 2017

Drafted by: JOZEF STEFAN INSTITUTE (Inna NOVALIJA, Marko GROBELNIK)

Disseminated: EUROSTAT: Albrecht WIRTHMANN


Neither the European Commission nor any other person acting on behalf of the Commission is responsible for the use that might be made of the following information.

The information and views set out in this report are those of the author(s) and do not necessarily reflect the official opinion of the European Union. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use which may be made of the information contained therein.


Table of contents

List of Tables
List of Figures
1. Executive summary
2. Skills required to process and analyze big data sources for the purpose of official statistics
2.1 Background
2.1.1 Defining big data
2.1.2 Related studies and initiatives
2.1.2.1 EDSA
2.1.2.2 BDVA reports
2.1.2.3 Related MOOCs
2.2 Data collection
2.2.1 Methodology for data collection
2.2.2 Data statistics
2.2.3 Literature analysis
2.3 Skills analysis
2.3.1 Methodology for skills analysis
2.3.2 Clustering skills with OntoGen
2.3.3 Analysis by skills groups
2.4 Results
2.4.1 Trending skills
2.4.2 Correlated skills
2.4.3 Skills from literature analysis
2.5 Discussion
3. Existing skills in the statistical offices of the ESS, Eurostat and NSIs
3.1 Existing skills overview
4. Analysis of the training needs of the statistical offices
4.1 Big data training needs survey overview
4.2 Big data training needs survey results
4.3 Big data training needs survey summary
4.4 Towards bridging the skills gap for big data in statistics
5. Training objectives and content for the design of a training program
5.1 Learning models and curriculum design approaches
5.1.1 Learning models


5.1.2 Curricula guidelines
5.2 Related curricula and classifications
5.2.1 ACM classification for computer science
5.2.2 Curriculum development in EDISON project
5.2.3 Curriculum development in EDSA project
5.3 Training objectives for statistical offices in Europe in the area of Big Data
5.3.1 Defining training objectives
5.4 Content development in the area of Big Data
5.4.1 ESTP content
5.4.2 Data science content dashboard
6. Strategic analysis of bridging the gap via training
6.1 Training channels
6.1.1 Advantages and disadvantages of face-to-face training
6.1.2 European Statistics Training Programme
6.1.3 European Master in Official Statistics
6.1.4 Online Learning
6.1.5 Other possible training channels
6.2 Defining strategic training plan for Big Data in official statistics
6.2.1 ADDIE model
6.2.2 Analysis
6.2.3 Design
6.2.4 Development
6.2.5 Implementation
6.2.6 Evaluation
7. References
8. Annex
8.1 Appendix: Recommended Literature Sources in the Area of Big Data
8.2 Appendix: Trending Skills by Groups
8.3 Appendix: Correlated Skills by Groups
8.4 Appendix: Correlated Skills for Statistical Tools and Technologies
8.5 Appendix: Popularity of Tools and Technologies from Literature Analysis
8.6 Appendix: Big Data Training Needs Questionnaire Form
8.7 Appendix: Learning outcomes defined for CF-DS competences and different mastery/proficiency levels


List of Tables

Table 1: Running/Planned Big Data MOOCs
Table 2: Number of Job Postings by Country
Table 3: Skill Groups (Soft Skills, Tasks and Methods)
Table 4: Skill Groups (Tools and Technologies)
Table 5: Highly Demanded Skills (Soft Skills, Tasks and Methods)
Table 6: Highly Demanded Skills (Tools and Technologies)
Table 7: Emerging Skills
Table 8: Skills Classification as Basis for Questionnaire
Table 9: Skills Based on LinkedIn Experiment According to Skills Framework
Table 10: Skills from Big Data Training Needs Survey According to Skills Framework
Table 11: Knowledge Levels for Learning Outcomes in Data Science Model Curricula (MC-DS)
Table 12: Core EDSA Curriculum, version 1
Table 13: Core EDSA Curriculum, version 3
Table 14: Recommendations for EDSA Curriculum Development
Table 15: Training Objectives Mapped to Big Data Training Needs
Table 16: Related ESTP Courses
Table 17: ESTP Content Mapped to Training Objectives
Table 18: Web Content Mapped to Training Objectives
Table 19: Videolectures in the Area of Big Data
Table 20: Expected Outcomes for Big Data in Statistics

List of Figures

Figure 1: Big Data MOOCs Skills
Figure 2: Data Acquisition and Enrichment Pipeline
Figure 3: Example of Job Postings Wikification with JSI Wikifier
Figure 4: Top Locations for Data Analytics in Europe
Figure 5: Jobs Posting Content Clustering with OntoGen
Figure 6: Jobs Postings Skills Clustering with OntoGen
Figure 7: Skills by Groups
Figure 8: Technologies, Tasks and Soft Skills Trends
Figure 9: Statistical Tasks and Methods


Figure 10: Tools and Technologies Trends by Groups
Figure 11: Correlated Skills for Statistical Tasks
Figure 12: Correlated Skills for Statistics and Business Intelligence Skill Group
Figure 13: Correlated Skills for the Excel Skill
Figure 14: Correlated Skills for the SAS Skill
Figure 15: Correlated Skills for the Matlab Skill
Figure 16: Tools and Technologies by Groups from Literature Analysis
Figure 17: LinkedIn Experiment
Figure 18: Big Data Training Needs Survey - Answers Map
Figure 19: Bloom’s Taxonomy
Figure 20: EDSA Dashboard – Hadoop Search
Figure 21: EDSA Dashboard – Hadoop Related Trainings and Videolectures
Figure 22: EDSA Online Courses Portal
Figure 23: EDSA Learning Pathways


1. Executive summary

This document outlines the findings and recommendations related to:

- the identification of skills required for the use of big data sources,
- the analysis of training needs from the statistical offices of ESS and
- the definition of training objectives and content in the area of big data for official statistics.

The results of the skill analysis give a clear indication of the individual big data, data science and statistical skills, as well as the skill groups, that are currently in demand across Europe. The report addresses skill groups at different levels of the big data skills framework: soft skills; tasks and methods (statistical and data science tasks, administrative tasks for statistical purposes, information technologies tasks for statistical purposes); and tools and technologies (platform architecture, statistics and business intelligence, data management, cloud technologies, data mining tools, databases, Hadoop technology, programming languages, search, visualization technologies, and upper-level data science skills).

The skill analysis shows growing trends in the different big data skill groups and suggests that there is a great need for big data training. Demand is high across Europe for a number of established skills and skill areas, including Apache-related skills, databases, programming languages such as Python and JavaScript, cloud technologies, and data mining. Other skills, such as data visualisation and a number of data management technologies, are emerging trends with steady growth, though not yet at the demand levels of the established areas.

In order to identify big data training needs and obtain an overview of existing skills in NSIs, a survey targeted at big data focal points in EU countries was conducted. The report describes the survey results as well as the analysis of existing skills in the ESS according to the big data skills framework for official statistics.

In particular, the survey defined the groups of skills that NSIs would like to acquire:

- Methodological skills;
- Technical skills;
- Visualization and storytelling skills;
- Contextual skills and
- Soft skills.

Several data types/data sources, such as web-scraped data, mobile phone data, sensor data and scanner data, were frequently listed as priorities. The survey indicated that training should be offered at different levels (introductory and advanced), can be targeted at different employee profiles, and can be delivered to individuals as well as to big data teams. Training should take into account the big data sources and types that will be addressed in the ESS.

The minimum and maximum number of trainees varies depending on the size of the NSI and the NSI training strategy.


The priorities defined in the survey included:

- Priorities for training methods/knowledge transfer types (such as webinars and online courses along with face-to-face training);
- Priorities for data sources/data types (web-scraped data and mobile phone data are frequently mentioned in the priorities);
- Priorities for technologies/methods/skills (such as introductory big data methodologies and skills);
- Priorities for other issues (trainings of up to 1 week, training on the job).

Furthermore, based on the survey results and the skills required for working with big data, a set of training objectives with corresponding content was identified. The report defines what the trainees should be able to do as a result of the training, to what standards and under what conditions. The training objectives and content defined in the report are ready to be used for the design of an effective training plan.

Finally, the report presents an analysis of the possible ways big data skills training can be provided to the staff of the statistical offices of the ESS.

Several existing instruments, such as the European Statistics Training Programme (ESTP) and the European Master in Official Statistics (EMOS), are taken into consideration, as well as possible e-learning training mechanisms. In particular, options such as webinars, MOOCs, video lectures, workshops and personalized course portals are suggested as feasible training channels.

The report highlights the necessity of using blended (face-to-face and online) training approaches in order to reach the training needs in the area of big data for official statistics.

2. Skills required to process and analyze big data sources for the purpose of official statistics

2.1 Background

2.1.1 Defining big data

In an era of growing information technology, data is created more easily and faster than it can be analysed.

A big data analytics survey [1] states that data-driven innovations in European public and private sectors already bring benefits for businesses, government organizations and individual citizens.

Nada Elgendy and Ahmed Elragal [2] define “Big Data” as a term recently applied to datasets that grow so large that they become awkward to work with using traditional database management systems. Thus, big data datasets go beyond the ability of commonly used software tools and storage systems to capture, store, manage and process the data within a tolerable elapsed time [3]. At the same time, Paul MacDonnell and Daniel Castro [4] state that “Big Data” includes processing information that is often heterogeneous and frequently updated. Organizations can continuously aggregate data at a group level, collecting data for all users and not only for a sample of users. Big data sizes are constantly increasing, currently ranging from a few dozen terabytes (TB) to many petabytes (PB) in a single data set.

According to a Data & Analytics Survey [5], more and more companies intend to implement data-driven projects, and 53% of companies aim to generate greater value from existing data. Among the projects that are underway or planned, 26% are already implemented, 14% are in the process of implementation or pilot testing, and 13% are planned for implementation in the next 12 months. Another 8% are considering a data-driven project, and 8% say they are likely to pursue one, although they are currently struggling to find the right strategy or solutions.

38% of respondents have no plans to implement data-driven projects but would still like to perform analytics on existing data. The listed data challenges include finding correlations across multiple disparate data sources (60%), predicting customer behaviour (47%), predicting product or service sales (42%), identifying computer security risks, analysing high-scale machine data, and predicting fraud and financial risk. At the same time, a segment of respondents would like to pay more attention to analysing social media data.

Different organizations typically collect different types of data. Enterprises are more likely to collect transactional data, machine-generated/sensor data, government and public domain data, and data from security monitoring. Smaller organizations collect email, data from third-party databases, social media, and statistics from news media.

Top data sources include sales and financial transactions (56%), leads and sales contacts from customer databases (51%), and email and productivity applications (39%).

One of the biggest challenges for data-driven projects is dealing with unstructured data (emails, word documents, presentations, etc.).

Dealing with sensor streams (coming from wearable medical devices, automated homes and intelligent roadways) is becoming a new trend. The Internet of Things [4] is a term used to describe the set of physical objects embedded with sensors or actuators and connected to a network. Estimates suggest that by 2020 there will be around 50 billion sensor devices worldwide.

The necessity of working with big data creates training needs in this area. Several studies and initiatives in the area of big data skills analysis and big data training are described below in this deliverable.

2.1.2 Related studies and initiatives

In recent years, a number of initiatives related to big data skills analysis and training have appeared. Below, we describe the European Data Science Academy (EDSA) project and the Big Data Value Association (BDVA) reports related to demand analysis, and we analyse a set of big data MOOCs available via online training portals.

2.1.2.1 EDSA

EDSA project overview

The European Data Science Academy (EDSA) [6] is a Horizon 2020 project that:


- analyses the sector-specific skillsets for data analysts across Europe’s main industrial sectors;
- develops modular and adaptable curricula to meet these data science needs; and
- delivers training supported by multiplatform and multilingual learning resources based on these curricula.

The EDSA curriculum provides a set of data science training courses (including courses in the area of big data) on the following topics:

- Foundations of Data Science
- Foundations of Big Data
- Statistical / Mathematical Foundations
- Programming / Computational Thinking (R and Python)
- Data Management and Curation
- Big Data Architecture
- Distributed Computing
- Stream Processing
- Linked Data and the Semantic Web
- Machine Learning, Data Mining and Basic Analytics
- Big Data Analytics
- Process Mining
- Social Media Analytics
- Data Visualisation and Storytelling
- Data Exploitation, including data markets and licensing.

An important task of the EDSA project is monitoring data science trends to assess the demand for particular skills and expertise in Europe. EDSA partners have developed dashboards presenting the current state of the European data science landscape, with the data feeding into curriculum development through interviews with data science practitioners, an industry advisory board representing a mix of sectors, and automated tools for extracting data about job posts and news articles.

EDSA plans to align the demand for data science skills with the supply of training materials.

Training Delivery and Learning Analytics

Training delivery in EDSA is performed through eBooks, MOOCs, video lectures and face-to-face training. EDSA partners are working on integrated learning pathways, translated into European languages and expanded to meet the requirements of specific sectors as indicated by the demand analysis.

For learning analytics, EDSA partners are using VideoLectures.NET and FutureLearn – the largest European MOOC platform, founded by The Open University – to maximise outreach and uptake of the materials.

EDSA skills ontology

The Skills and Recruitment Ontology (SARO) [7] is a domain ontology representing occupations, skills and recruitment. It is modelled by considering several similar context models, but is mainly inspired by the European Skills, Competences, Qualifications and Occupations ontology (ESCO) and Schema.org. The ontology is structured along four dimensions: job posts, skills, qualifications and users.

Job posts refer to job advertisements by organizations. Advertised job openings comprise various essential attributes, such as the job role, title, the relevant sector and other related descriptions (defined by Schema.org, e.g. job location, date posted, working hours, etc.).

One of the most important job requirements, usually explicitly defined, is the list of qualifications for the role, including the fundamental skills required to fulfil it. SARO also describes the proficiency level for each skill.

Skills are used by different groups of users according to their tasks. For example, an educator or trainer could develop training resources related to certain skills or competences; a specific skill can be chosen by considering the skill gap of another user group, e.g. domain specialists.

2.1.2.2 BDVA reports

The BDVA reports [8] section contains a list of references to relevant reports, white papers and articles related to big data.

For instance, O'Reilly's 2016 Data Science Salary Survey [9] presents results from 983 respondents, working across a variety of industries, who answered questions about the tools they use, the tasks they engage in, and the salaries they make. The 2016 survey includes data scientists, engineers, and others in the data space from 45 countries and 45 US states.

O'Reilly's 2016 Data Science Salary Survey suggests the following tasks for a Data Scientist:

- Basic Exploratory Data Analysis
- Conducting Data Analysis to Answer Research Questions
- Communicating Findings to Business Decision-makers
- Data Cleaning
- Creating Visualizations
- Identifying Business Problems to be Solved with Analytics
- Feature Extraction
- Developing Prototype Models
- Organizing and Guiding Team Projects
- Implementing Models/Algorithms into Production
- Collaborating on Code Projects
- Teaching/Training Others
- Planning Large Software Projects or Data Systems
- Developing Dashboards
- ETL
- Communicating with People outside your Company
- Setting up/managing Data Platforms
- Developing Data Analytics Software
- Developing Products that Depend on Real-Time Data Analytics
- Using Dashboards and Spreadsheets to Make Decisions
- Developing Hardware


The survey states that coding is an important part of a data scientist's job. Python and Spark are among the tools that contribute most to salary. The top two tools in the sample were Excel and SQL, both used by 69% of the sample, followed by R (57%) and Python (54%).

Different skill groups from the O'Reilly survey include the following popular Data Scientist tools:

Programming languages: SQL, Python, R, JavaScript, Go, Octave, Ruby, SAS, Perl, C#, C, Scala, Matlab, C++, Visual Basic/VBA, Java, Bash.

Relational databases: MySQL, Oracle Exascale, Redshift, SAP HANA, Aster Data (Teradata), EMC/Greenplum, Netezza (IBM), Vertica, IBM DB2, Teradata, SQLite, PostgreSQL, Oracle, SQL Server.

Hadoop: EMC / Greenplum, Oracle, MapR, Amazon Elastic MapReduce (EMR), Hortonworks, Cloudera, Apache Hadoop, IBM.

Search: Solr, ElasticSearch, Lucene.

Data Management, Big Data Platforms: Couchbase, Storm, Amazon DynamoDB, Splunk, BigQuery/Fusion Tables, Neo4J, Redis, Zookeeper, Cassandra, Toad, Impala, Pig, HBase, Amazon RedShift, MongoDB, Hive.

Spreadsheets, Business Intelligence, Reporting: Excel, Jaspersoft, Alteryx, Microstrategy, Adobe Analytics, Pentaho, Oracle BI, Cognos, BusinessObjects, QlikView, Power BI, PowerPivot, Spark.

Visualization Tools: JavaScript InfoVis Toolkit, Processing, Bokeh, Google Charts, D3, Shiny, Matplotlib, Tableau, ggplot.

Machine Learning, Statistics: IBM Big Insights, BigML, Vowpal Wabbit, KNIME, Dato / GraphLab, Stata, Mathematica, Mahout, LIBSVM, RapidMiner, H2O, Weka, Spark MlLib, Scikit-learn, Google Prediction.

The O'Reilly report is a useful source of information that reflects the grouping of skills and tools for data science in general and big data in particular.

2.1.2.3 Related MOOCs

Following the high demand for big data skills and the growing trend of big data technologies, we have studied the available online big data training resources in the form of MOOCs. Table 1 presents running and planned MOOCs in the area of big data. As shown in Table 1, the content of MOOCs varies, from general MOOCs on the topics of big data and data science to more specific MOOCs in areas such as bioinformatics and the Internet of Things (IoT).

Table 1: Running/Planned Big Data MOOCs

- Accounting Analytics (Coursera). Topics: Business & Management, Statistics & Data Analysis. Tags: Business Analytics, Accounting Analytics, Financial Performance, Forecasting, Prediction Models, Big Data, Non-financial Metrics
- Foundations of marketing analytics (Coursera). Topics: Business & Management, Statistics & Data Analysis. Tags: Marketing Analytics, Business Analytics, Big Data, Databases, Data Analysis, Statistical Segmentation, Managerial Segmentation, Customer
- Hadoop Platform and Application Framework (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Hadoop Platform, Data Analysis, Spark, Map-Reduce, Hadoop, Hadoop Stack, HDFS
- A Crash Course in Data Science (Coursera). Topics: Business & Management, Data Science, Statistics & Data Analysis. Tags: Data Science, Big Data, Machine Learning, Statistics, Software Engineering
- Data-driven Decision Making (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Decision Making, Data Analytics, Big Data, Data Analysis, Business
- Big Data Integration and Processing (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Data Integration, Processing, Data Science, Hadoop, Spark
- Big Data for Better Performance (Open2Study). Topics: Marketing & Communication. Tags: Big Data, Marketing, Predictive Marketing
- Advanced Algorithms and Complexity (Coursera). Topics: Computer Science: Programming & Software Engineering, Computer Science: Theory. Tags: Algorithms, Data Structures, Big Data, Machine Learning
- Introduction to Big Data (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Data Science, Hadoop
- Graph Analytics for Big Data (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Graph Analytics, Data Analysis, Graphs, Neo4j, GraphX
- Machine Learning With Big Data (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Machine Learning, KNIME, Spark, Algorithms, Clustering Analysis
- Managing Big Data with MySQL (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, MySQL, Analytic Techniques, Databases, Business Analysis, Queries
- Big Data, Genes, and Medicine (Coursera). Topics: Biology & Life Sciences, Data Science, Information, Technology, and Design. Tags: Big Data, Genes, Medicine, Bioinformatics, Genetics, Human Body
- Cloud Computing Applications, Part 2: Big Data and Applications in the Cloud (Coursera). Topics: Computer Science: Systems, Security, Networking. Tags: Cloud Computing, Big Data, Applications, Cloud, Cloud Applications, Data Analysis, MapReduce, Spark, Cloudera, MapR, NOSQL Databases, HBase, Kafka, Spark Streaming, Lambda, Kappa, Graph Processing, Machine Learning, Deep Learning
- The Importance of Listening (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Social Media, Listening, Marketing, Big Data
- Big Data Modeling and Management Systems (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Modeling, Management Systems, Data Analysis, Analytical Tools, AsterixDB, HP Vertica, Impala, Neo4j, Redis, SparkSQL
- Processing Big Data with Hadoop in Azure HDInsight (edX). Topics: Computer Science: Programming & Software Engineering. Tags: Big Data, Hadoop, Azure HDInsight, Microsoft Azure, Hive, Pig, Sqoop, Oozie, Mahout, R Language, Storm, HBase
- Implementing Real-Time Analytics with Hadoop in Azure HDInsight (edX). Topics: Computer Science: Systems, Security, Networking. Tags: Hadoop, Azure HDInsight, Big Data, HBase, Storm, Spark, Microsoft
- Smartphone Emerging Technologies (Coursera). Topics: Computer Science: Programming & Software Engineering, Electronics, Engineering, Statistics & Data Analysis. Tags: Emerging Technologies, SmartPhone, Smartphones, IoT, Internet of Things, Big Data, Operating Systems, iOS, Android
- Big Data Science with the BD2K-LINCS Data Coordination and Integration Center (Coursera). Topics: Biology & Life Sciences, Statistics & Data Analysis. Tags: Big Data, Analysis, LINCS, Network Analysis, Bioinformatics
- Big Data, Cloud Computing, & CDN Emerging Technologies (Coursera). Topics: Computer Science: Systems, Security, Networking. Tags: Cloud Computing, Big Data, CDN, Content Delivery Network, Emerging Technologies, Smartphones, IoT, Internet of Things
- Internet Emerging Technologies (Coursera). Topics: Computer Science: Systems, Security, Networking. Tags: Emerging Technologies, Internet, IoT, Internet of Things, Big Data, IP, Internet Protocol, IPv4, IPv6, TCP, UDP
- Internet of Things & Augmented Reality Emerging Technologies (Coursera). Topics: Computer Science: Programming & Software Engineering, Electronics, Engineering. Tags: Emerging Technologies, Big Data, IoT, Internet of Things, Augmented Reality, AR, WSN, Wireless Sensor Network, M2M
- Python for Genomic Data Science (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Genomic Data, Python, Programming, Big Data

Figure 1 shows the most popular skills extracted from MOOCs, based on MOOC topics and tags. Some of the most popular skills represented in MOOC training are Data Analysis, HBase, Internet of Things, Bioinformatics, Cloud Computing and Algorithms.
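The counting behind Figure 1 is straightforward: tally how often each topic or tag appears across the MOOC list in Table 1. A minimal sketch in Python (the two sample entries below stand in for the full table):

```python
from collections import Counter

# Topics and tags per MOOC, as listed in Table 1 (two entries shown for brevity).
mooc_tags = {
    "Introduction to Big Data (Coursera)": ["Big Data", "Data Science", "Hadoop"],
    "Graph Analytics for Big Data (Coursera)": ["Big Data", "Graph Analytics", "Neo4j"],
}

# Count how many MOOCs mention each skill; this ranking is what Figure 1 plots.
tag_counts = Counter(tag for tags in mooc_tags.values() for tag in tags)
print(tag_counts.most_common(10))
```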


Figure 1: Big Data MOOCs Skills (bar chart: number of MOOCs per skill; top skills include Big Data, Data Analysis, Computer Science: Programming & Software Engineering, Internet of Things, Biology & Life Sciences, HBase, Algorithms, Cloud Computing, Engineering and Marketing)

The data science MOOCs shown above and the training materials from the EDSA project represent the supply of training in the area of big data. Below, we present the methodology that enables us to identify the demand for skills in the area of big data, with a particular emphasis on big data for statistics.

2.2 Data collection

2.2.1 Methodology for data collection

For the skills analysis we have built the data acquisition and enrichment pipeline displayed in Figure 2. We have used the Adzuna API [10] to obtain a dataset of job postings related to data science from the UK, France, Germany and the Netherlands, and other crawling mechanisms for data provision from a variety of European countries, such as Denmark, Ireland, Romania, Italy, Switzerland, Belgium, Austria, Spain, Hungary, Sweden, the Czech Republic, Poland and Portugal.

Adzuna is a search engine for job ads that operates websites in 11 countries, aggregating vacancies from different job portals into one storage unit.
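As an illustration of the collection step, a minimal sketch of a query against the Adzuna search API is shown below (Python; the credentials are placeholders that Adzuna issues on registration, and the query term is only an example):

```python
import requests

# Placeholder credentials; Adzuna issues an app_id/app_key pair on registration.
APP_ID, APP_KEY = "your_app_id", "your_app_key"

def fetch_postings(country="gb", what="data scientist", page=1):
    """Fetch one page of job postings from the Adzuna search API."""
    url = f"https://api.adzuna.com/v1/api/jobs/{country}/search/{page}"
    params = {
        "app_id": APP_ID,
        "app_key": APP_KEY,
        "what": what,              # free-text search term (example)
        "results_per_page": 50,
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    # Each result carries a title, description, location and creation date,
    # which map onto the JSON representation described in the text.
    return response.json()["results"]

postings = fetch_postings()
print(postings[0]["title"], "|", postings[0]["location"]["display_name"])
```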

The crawled data are characterized by several important features, such as multi-linguality, representation in JSON form, a cross-country view and the presence of geographical and time components. After obtaining the relevant datasets, we perform a number of data enrichment steps, including wikification and geo enrichment. Wikification refers to identifying textual components and linking them to the corresponding disambiguated Wikipedia pages [11]. We have used the JSI Wikifier [12], whose cross-lingual and multi-lingual functions make it possible to annotate the textual content of job postings in different languages with cross-lingual Wikipedia information.
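A sketch of such an annotation call against the JSI Wikifier web service might look as follows (the user key is a placeholder issued on registration at wikifier.org, and the response handling is a simplified assumption):

```python
import requests

def wikify(text, lang="en", user_key="your_wikifier_key"):
    """Annotate text with cross-lingual Wikipedia concepts via the JSI Wikifier."""
    response = requests.post(
        "http://www.wikifier.org/annotate-article",
        data={
            "userKey": user_key,  # placeholder; issued on registration
            "text": text,
            "lang": lang,         # language of the job posting
        },
        timeout=60,
    )
    response.raise_for_status()
    # Each annotation links a span of the text to a Wikipedia concept.
    return [ann["title"] for ann in response.json().get("annotations", [])]

print(wikify("Experience with Hadoop, Spark and statistical modelling required."))
```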


Figure 2: Data Acquisition and Enrichment Pipeline (multi-lingual, cross-country job postings crawled in JSON format are passed through wikification and geo enrichment, producing RDF data with the original job info, extracted skills, extracted Wiki concepts and geo info)

Figure 3 provides a snapshot of job posting wikification with the JSI Wikifier. The results obtained from the Wikifier have been aligned with the skill ontology via name matching.

Following that, we have enriched data with concepts from the GeoNames ontology [13]. We have added the GeoNames location URI and location name to job postings where latitude and longitude were available. We have also added the coordinates and location URI to the postings where only the location name was available.
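A minimal sketch of this geo enrichment step, using the public GeoNames web services (the username is a placeholder, and the shape of the posting dictionary is an assumption for illustration):

```python
import requests

GEONAMES_USER = "demo_user"  # placeholder; GeoNames requires a registered username

def enrich_with_geonames(posting):
    """Attach a GeoNames URI to a posting, from coordinates or a location name."""
    if posting.get("latitude") is not None and posting.get("longitude") is not None:
        r = requests.get(
            "http://api.geonames.org/findNearbyPlaceNameJSON",
            params={"lat": posting["latitude"], "lng": posting["longitude"],
                    "username": GEONAMES_USER},
            timeout=30,
        )
    else:
        r = requests.get(
            "http://api.geonames.org/searchJSON",
            params={"q": posting.get("location_name", ""), "maxRows": 1,
                    "username": GEONAMES_USER},
            timeout=30,
        )
    match = r.json()["geonames"][0]
    posting["geonames_uri"] = f"https://sws.geonames.org/{match['geonameId']}/"
    # Fill in whichever of the name/coordinates was missing.
    posting.setdefault("location_name", match["name"])
    posting.setdefault("latitude", match["lat"])
    posting.setdefault("longitude", match["lng"])
    return posting
```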

Figure 3: Example of Job Postings Wikification with JSI Wikifier


2.2.2 Data statistics

Figure 4 presents a glance at the top locations for Data Analytics in Europe. Our collection mechanisms cover the majority of European countries, which we consider particularly useful, since the collected data reflect skill specifics at both the European and national levels.

Figure 4: Top Locations for Data Analytics in Europe

Table 2 presents information about the achieved data coverage.

Table 2: Number of Job Postings by Country

UK: 251,357
France: 101,661
Germany: 140,574
The Netherlands: 108,202
Switzerland: 24,253
Italy: 20,623
Belgium: 20,286
Poland: 22,837
Denmark: 10,844
Romania: 15,807
Austria: 9,857
Ireland: 31,510
Hungary: 10,301
Sweden: 16,360
Spain: 16,206
Czech Republic (Czechia): 11,634
Portugal: 25,414
Malta: 21
Norway: 21
Bulgaria: 35
Slovakia: 3
Estonia: 1
TOTAL: >837,000

2.2.3 Literature analysis

In addition to analysing job posting data, we have analysed an extensive literature collection, represented by the Microsoft Academic Graph [14] and recommended sources in the area of big data (presented in Appendix 8.1).

The Microsoft Academic Graph [14] is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. This graph is used to power experiences in Bing, Cortana, Word, and in Microsoft Academic.

From the Microsoft Academic Graph we have extracted 4683 papers marked with the keyword “Big Data”. The recommended literature included paper collections in the following areas:

- Credit card data;
- Mobile network data;
- Mobile phone & wearables sensors data;
- Network data;
- Text Analytics;
- Web data;
- Wikipedia.

We have used the JSI Wikifier described above to annotate the literature sources and extract the relevant skills.
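As a sketch of the selection step, filtering paper records by keyword reduces to a simple predicate over each record's keyword labels (the record shape shown here is an assumption for illustration, not the Microsoft Academic Graph's actual schema):

```python
def papers_with_keyword(papers, keyword="Big Data"):
    """Select paper records labelled with a given keyword."""
    return [p for p in papers if keyword in p.get("keywords", [])]

# Toy records standing in for Microsoft Academic Graph entries.
papers = [
    {"title": "Scalable stream processing", "keywords": ["Big Data", "Streaming"]},
    {"title": "Survey sampling theory", "keywords": ["Statistics"]},
]

selected = papers_with_keyword(papers)
# The abstracts of the selected papers are then annotated with the JSI Wikifier,
# in the same way as the job postings, to extract the skills they mention.
print([p["title"] for p in selected])  # -> ['Scalable stream processing']
```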

2.3 Skills analysis

In order to analyse the demand for skills in the area of big data for official statistics, we have developed the methodology for skills analysis presented below.

2.3.1 Methodology for skills analysis

The methodology for skills analysis includes the following steps:

1. Clustering of job postings with the OntoGen tool [15].
2. Establishing relevant skill groups based on the clustering outcomes and the results from the related studies and initiatives.
3. Detailed analysis of big data trends for the different skill groups: characterising the behaviour of each skill group and identifying the highly demanded and emerging skills within it.
4. Identification of the correlated skills for each defined skill group, as well as of the correlated skills for skills of particular interest (such as skills from the Statistical tasks and the Statistics and Business intelligence groups).
5. Skills analysis for literature sources.


2.3.2 Clustering skills with OntoGen

OntoGen is a semi-automatic, data-driven ontology editor focused on editing topic ontologies (sets of topics connected by different types of relations). OntoGen allows us to cluster documents and identify a set of skill groups based on the collection of job postings.
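OntoGen itself is an interactive tool, but the grouping step it supports can be approximated with standard text clustering. A minimal sketch using TF-IDF vectors and k-means (an analogy under simplifying assumptions, not OntoGen's actual internals):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy job descriptions standing in for the crawled postings.
postings = [
    "statistical analysis sas experience with surveys and sampling",
    "hadoop spark cluster administration java scala",
    "team leadership communication stakeholder management",
    "sql database design data warehousing etl oracle",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(postings)

# The number of clusters (candidate skill groups) is chosen by the analyst.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The top-weighted terms of each cluster centre suggest a label for the group.
terms = vectorizer.get_feature_names_out()
for c in range(kmeans.n_clusters):
    top = kmeans.cluster_centers_[c].argsort()[::-1][:5]
    print(c, [terms[i] for i in top])
```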

Figure 5 and Figure 6 below present the results of job posting clustering. In Figure 5 we have highlighted a cluster of job postings that explicitly mention the “statistics” skill. In Figure 6 it is possible to notice the areas that reflect the database skill group (including skills such as “SQL”) in the centre, the soft skill group (including skills such as “leadership”) and the upper-level skill group (including skills like “computing” and “artificial intelligence”).

Figure 5: Jobs Posting Content Clustering with OntoGen (highlighted cluster: “statistics”)

Figure 6: Jobs Postings Skills Clustering with OntoGen (highlighted areas: database skills in the centre, upper-level skills and soft skills)

2.3.3 Analysis by skills groups

Based on the OntoGen job posting clustering and other relevant skill classifications (for instance, the O’Reilly data science report), we have established a number of skill groups, taking into account the needs of statisticians who work with big data. In particular, our skill group classification includes skills related to technologies for statisticians and data scientists, tasks for statisticians and data scientists, and the soft skills required of statisticians and data scientists [16].

Figure 7 presents the initial skills classification by groups:

- Soft skills,
- Tasks and Methods (related to data science, Statistics, Administrative tasks for statistical purposes, IT tasks for statistical purposes, Budget tasks for statistical purposes) and
- Tools and Technologies (related to Statistics and Business Intelligence, Architecture Technologies, Cloud Technologies, Data Management, Search Technologies, Data Mining Tools, Databases, Upper-Level Technologies, Hadoop, Programming Languages, Visualization Technologies).


Figure 7: Skills by Groups

Table 3 presents the extended and updated sets of individual skills for the Soft skills and Tasks and Methods groups. Table 4 presents the extended and updated sets of individual skills for the Tools and Technologies groups. Together, Table 3 and Table 4 constitute the big data skills framework.

Table 3: Skill Groups (Soft Skills, Tasks and Methods)

SOFT SKILLS: Communication, Coordination, Creative problem solving, Delivery of results, Ethics, Information privacy, Initiative, Innovation and contextual awareness, Leadership, Logic, Negotiation, Specialist knowledge and expertise, Teamwork

STATISTICAL AND DATA SCIENCE TASKS: Algorithmic-based inference, Analysis of aggregated data, Analysis of micro data, Data conversion, Data curation, Data dissemination, Data governance, Data processing, Data querying, Data resource management, Data search, Data storage, Data visualization, Design-based estimation, Developing and maintaining statistical classifications, Developing dashboards, Developing prototypes, Editing and Imputation techniques, Model-based estimation, Multivariate analysis, Nonresponse adjustment and weighting, Nowcasting and projections, Quality assessment, Sampling design, Setting up data hubs, Setting up data warehouses, Spatial analysis/GIS/cartography, Standardizing data, Statistical confidentiality and statistical disclosure control, Time series and seasonal adjustment, User needs assessment

ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES): Conducting stakeholder’s consultation, Contract negotiation, Delivering training, Documentation writing, Drafting legal acts, Managing contracts, grants or other agreements, Managing task forces, Preparing contracts, grants or other agreements, Project management, Quality assurance and compliance

INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES): Analysis of requirements, Developing software, Hardware and infrastructure, Security, System Architecture, Systems and software maintenance, Testing of systems and software, User support

Table 4: Skill Groups (Tools and Technologies)

PROGRAMMING LANGUAGES: C, C#, C++, ECL, Go, Java, Javascript, Julia, Octave, Python, R, Ruby, Scala, Visual Basic

ARCHITECTURE TOOLS AND TECHNOLOGIES: 5C architecture, Data intensive computing, Data intensive systems, Distributed computing, Distributed filesystems, Distributed parallel architecture, High-Performance Computing, HPCC, MIKE2.0

CLOUD TOOLS AND TECHNOLOGIES: Cloud computing

HADOOP: Hadoop, RHIPE, YARN

UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES: Artificial intelligence, Business intelligence, Data mining, Deep learning, Feature engineering, Inductive statistics, IoT (Internet of Things), Multimedia analysis, Natural language processing, Network analysis, Stream processing and analysis, Understanding algorithms, Web technologies (Web scraping, Web services et al.)

STATISTICS AND BUSINESS INTELLIGENCE: Adobe Analytics, Alteryx, BusinessObjects, Cognos, Excel, Jaspersoft, Mathematica, Matlab, Microstrategy, Oracle BI, pbdR, Pentaho, PowerPivot, SAS, SPSS, STATA

DATA MANAGEMENT: Amazon RedShift, Apache Flume, Apache HBase, Apache Hive, Apache Mesos, Apache Oozie, Apache Phoenix, Apache Pig, Apache Sqoop, Apache Storm, Apache ZooKeeper, Aster Data (Teradata), BigQuery, Cassandra, Cloudera Impala, EMC (Greenplum), Netezza, Redis, Splunk, Vertica

DATABASES: Couchbase, Database, DBMS, MongoDB, MySQL, NoSQL, Oracle, PostgreSQL, Query languages, RDBMS, SAP HANA, SQL, SQL Server, SQLite, Toad

SEARCH TECHNOLOGIES: ElasticSearch, Lucene, Search based applications, Solr

DATA MINING: Apache Mahout, BigInsights, BigML, Google Prediction, LIBSVM, Orange, RapidMiner, Scikit-learn, Spark MlLib, Vowpal Wabbit, Weka

VISUALIZATION TECHNOLOGIES: Bokeh, Chart.js, ChartBlocks, Chartist.js, D3, Datawrapper, Ember Charts, FusionCharts, ggplot, Highcharts, InfoVis, Leaflet, Matplotlib, N3-charts, NVD3, Plotly, Polymaps, Processing.js, Sigma JS, Shiny, Tableau, Visually, ZoomData

2.4 Results

Based on the skill group analysis and the job posting data, we were able to observe the trends in different skill groups over time. The results of the skills analysis for literature sources are presented in Section 2.4.3.


2.4.1 Trending skills

Figure 8 shows growing trends for tools & technologies, tasks & methods and soft skills in 2016, which supports the expectation that big data technologies and tasks will continue to develop actively in the coming years. Figure 9 presents trends for Statistical tasks (based on the available data).

Figure 10, which presents trends for different technologies, shows that many big data technology areas (in particular, statistics and business intelligence) have increasing trends.

Furthermore, Appendix 8.2 presents more information about trending skills by groups.
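The trend lines in Figures 8-10 are simple linear fits over monthly posting counts. A sketch of that computation for one skill group (the counts below are illustrative, not the report's actual data):

```python
import numpy as np

# Illustrative monthly counts of postings mentioning one skill group (Jan-Nov 2016).
months = np.arange(11)
counts = np.array([310, 335, 360, 342, 398, 410, 401, 455, 470, 492, 510])

# Least-squares line; a positive slope indicates growing demand.
slope, intercept = np.polyfit(months, counts, deg=1)
print(f"trend: {slope:+.1f} postings per month")
```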

Figure 8: Technologies, Tasks and Soft Skills Trends (number of job postings over time, with linear trend lines for technologies, tasks and soft skills)

Figure 9: Statistical Tasks and Methods (number of job postings per month, January to November 2016)


Figure 10: Tools and Technologies Trends by Groups (number of job postings over time, with linear trend lines for architecture, cloud technologies, data management, data mining tools, databases, upper-level technologies, Hadoop, programming languages, search, statistics and business intelligence, and visualization)

In addition to the trend analysis, we have also identified the highly demanded and emerging skills in the different skill groups based on our data (Table 5, Table 6, Table 7).

Table 5: Highly Demanded Skills (Soft Skills, Tasks and Methods)

SOFT SKILLS: Specialist knowledge and expertise, Creative problem solving, Communication, Innovation and contextual awareness, Logic
STATISTICAL AND DATA SCIENCE TASKS: Data querying, Data search, Data storage
ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES): Managing contracts, grants or other agreements
INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES): Analysis of requirements, User support

Table 6: Highly Demanded Skills (Tools and Technologies)

PROGRAMMING LANGUAGES: Python, Javascript, C#
ARCHITECTURE TOOLS AND TECHNOLOGIES: Distributed parallel architecture, 5C architecture, High-Performance Computing
CLOUD TOOLS AND TECHNOLOGIES: Cloud computing
HADOOP: Hadoop, YARN
UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES: Data mining, Artificial intelligence, Network analysis, Business intelligence, Stream processing and analysis
STATISTICS AND BUSINESS INTELLIGENCE: Matlab, SAS, Excel, Oracle BI
DATA MANAGEMENT: Redis, Apache Hive, Apache HBase, Apache Sqoop, Apache Oozie
DATABASES: Database, DBMS, SQL, SQL Server
SEARCH TECHNOLOGIES: Search based applications
DATA MINING: Google Prediction
VISUALIZATION TECHNOLOGIES: Sigma JS


Table 7: Emerging Skills

SOFT SKILLS: Leadership, Initiative
STATISTICAL AND DATA SCIENCE TASKS: Data curation
PROGRAMMING LANGUAGES: ECL
ARCHITECTURE TOOLS AND TECHNOLOGIES: HPCC
STATISTICS AND BUSINESS INTELLIGENCE: PowerPivot
DATA MANAGEMENT: Cloudera Impala, Apache Storm, Amazon RedShift
DATA MINING: Scikit-learn
VISUALIZATION TECHNOLOGIES: D3, Bokeh, Ember Charts, Shiny, FusionCharts, Matplotlib, Highcharts

The results in Table 5, Table 6 and Table 7 serve as a basis for the subsequent tasks in this project, in particular for preparing a questionnaire for the statistical offices.

2.4.2 Correlated skills

Another analysis we performed was the identification of correlated skills for the Statistical tasks and the Statistics and Business Intelligence tools and technologies skill groups, both at the group level and at the level of individual skills inside each group. Appendix 8.3 contains information on correlated skills for all skill groups.

Figure 11 presents the top correlated skills for Statistical tasks. As can be noticed, these skills are related to “databases”, “machine learning”, “data management”, web development etc.
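Correlation here can be read as co-occurrence: for a target skill or skill group, count how often every other extracted skill appears in the postings that also mention the target. A minimal sketch (assuming each posting has been reduced to its list of extracted skills; the data below is illustrative):

```python
from collections import Counter

def correlated_skills(postings, target):
    """Count the skills that co-occur with a target skill across job postings."""
    counter = Counter()
    for skills in postings:  # each posting is a list of extracted skill names
        if target in skills:
            counter.update(s for s in skills if s != target)
    return counter.most_common()

# Illustrative postings reduced to skill lists.
postings = [
    ["statistical tasks", "sql", "machine learning"],
    ["statistical tasks", "sql", "hadoop"],
    ["javascript", "html", "jquery"],
]
print(correlated_skills(postings, "statistical tasks"))
# -> [('sql', 2), ('machine learning', 1), ('hadoop', 1)]
```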


Figure 11: Correlated Skills for Statistical Tasks (bar chart: number of job postings for each correlated skill, including sql, machine learning, database, javascript, hadoop, data management, nosql, leadership, software development and others)

Figure 12: Correlated Skills for Statistics and Business Intelligence Skill Group (bar chart: number of job postings for each correlated skill, including database, analytics, machine learning, business intelligence, software development, cloud, statistics and others)


Figure 12 presents the correlated skills for the Statistics and Business Intelligence skill group. Prominent skills include “database”, “software development”, “analytics”, “machine learning”, “cloud technologies”, “distributed technologies”, “java”, “HTML” and “web analytics”.

If we look at the correlated skills for individual statistical skills, such as “Excel” (Figure 13), “SAS” (Figure 14) and “Matlab” (Figure 15), we can see a number of database-related and machine learning related skills, programming/statistical languages such as C++, R, Python and Java, skills related to data science tasks, and soft skills such as “leadership”.

The correlated skills for other statistical tools/skills can be found in Appendix 8.4.

Figure 13: Correlated Skills for the Excel Skill (bar chart: number of job postings for each correlated skill, including sql, python, java, statistics, oracle, linux, web analytics, project management and others)


Figure 14: Correlated Skills for the SAS Skill (bar chart: number of job postings for each correlated skill, including statistics, database, data analysis, machine learning, c++, linux, leadership and others)

Figure 15: Correlated Skills for the Matlab Skill (bar chart: number of job postings for each correlated skill, including java, r, machine learning, data analysis, image processing, user interface and others)


2.4.3 Skills from literature analysis

A detailed literature analysis allowed us to see which skills and skill groups are the most popular in our literature collection.

Figure 16: Tools and Technologies by Groups from Literature Analysis

Figure 16 shows the popularity of different technologies by skill group. Appendix 8.5 contains a detailed view of the particular technologies within each group.

2.5 Discussion

Based on the results of the skills analysis, the literature analysis and discussions with Eurostat, we propose the skills classification (big data skills framework) shown in Table 8.

Table 8: Skills Classification as Basis for Questionnaire

Soft skills: Communication, Coordination, Creative problem solving, Delivery of results, Ethics, Information privacy, Initiative, Innovation and contextual awareness, Leadership, Logic, Negotiation, Specialist knowledge and expertise, Teamwork

Statistical and data science tasks: Algorithmic-based inference, Analysis of aggregated data, Analysis of micro data, Data conversion, Data curation, Data dissemination, Data governance, Data processing, Data querying, Data resource management, Data search, Data storage, Data visualization, Design-based estimation, Developing and maintaining statistical classifications, Developing dashboards, Developing prototypes, Editing and Imputation techniques, Model-based estimation, Multivariate analysis, Nonresponse adjustment and weighting, Nowcasting and projections, Quality assessment, Sampling design, Setting up data hubs, Setting up data warehouses, Spatial analysis/GIS/cartography, Standardizing data, Statistical confidentiality and statistical disclosure control, Time series and seasonal adjustment, User needs assessment

Administrative support tasks (for statistical purposes): Conducting stakeholder’s consultation, Contract negotiation, Delivering training, Documentation writing, Drafting legal acts, Managing contracts, grants or other agreements, Managing task forces, Preparing contracts, grants or other agreements, Project management, Quality assurance and compliance

Information technologies tasks (for statistical purposes): Analysis of requirements, Developing software, Hardware and infrastructure, Security, System Architecture, Systems and software maintenance, Testing of systems and software, User support

Statistics and business intelligence tools and technologies: Adobe Analytics, Alteryx, Apache Spark, BusinessObjects, Cognos, Excel, Jaspersoft, Mathematica, Matlab, Microstrategy, Oracle BI, pbdR, Pentaho, PowerPivot, SAS, SPSS, STATA

Visualization tools and technologies: Bokeh, Chart.js, ChartBlocks, Chartist.js, D3, Datawrapper, Ember Charts, FusionCharts, ggplot, Highcharts, InfoVis, Leaflet, Matplotlib, N3-charts, NVD3, Plotly, Polymaps, Processing.js, Sigma JS, Shiny, Tableau, Visually, ZoomData

Data management: Amazon RedShift, Apache Flume, Apache HBase, Apache Hive, Apache Mesos, Apache Oozie, Apache Phoenix, Apache Pig, Apache Sqoop, Apache Storm, Apache ZooKeeper, Aster Data (Teradata), BigQuery, Cassandra, Cloudera Impala, EMC (Greenplum), Netezza, Redis, Splunk, Vertica

Databases: Couchbase, Database, DBMS, MongoDB, MySQL, NoSQL, Oracle, PostgreSQL, Query languages, RDBMS, SAP HANA, SQL, SQL Server, SQLite, Toad

Upper-level data science tools and technologies: Artificial intelligence, Business intelligence, Data mining, Deep learning, Feature engineering, Inductive statistics, IoT (Internet of Things), Multimedia analysis, Natural language processing, Network analysis, Stream processing and analysis, Understanding algorithms, Web technologies (Web scraping, Web services et al.)

Data mining tools and technologies: Apache Mahout, BigInsights, BigML, Google Prediction, LIBSVM, Orange, RapidMiner, Scikit-learn, Spark MlLib, Vowpal Wabbit, Weka

Search tools and technologies: ElasticSearch, Lucene, Search based applications, Solr

Programming languages: C, C#, C++, ECL, Go, Java, Javascript, Julia, Octave, Python, R, Ruby, Scala, Visual Basic

Architecture tools and technologies: 5C architecture, Data intensive computing, Data intensive systems, Distributed computing, Distributed filesystems, Distributed parallel architecture, High-Performance Computing, HPCC, MIKE2.0

Cloud tools and technologies: Cloud computing

Hadoop: Hadoop, RHIPE, YARN

3. Existing skills in the statistical offices of the ESS, Eurostat and NSIs

3.1 Existing skills overview

The analysis of existing skills plays an important role in big data training for official statistics. It can be performed via specialized questionnaires providing big data skills assessments. Another way to identify the skills available at NSIs is to look at publicly available statistician profiles.

In order to obtain a glimpse of the currently available skills of statisticians, JSI conducted an experiment analysing statisticians’ profiles on LinkedIn1. LinkedIn is a business- and employment-oriented social networking service that operates via websites and mobile apps. LinkedIn is mainly used for professional networking, including employers posting jobs and job seekers posting their CVs.

In particular, in the experiment, JSI manually extracted skills from the available statistician profiles obtained using the filter “National Statistical Institute”. Skills from the profiles of 40 statisticians working, or having previously worked, in European NSIs have been extracted and analysed.

Figure 17 presents the most popular skills from the statisticians’ profiles. Among the top skills, it is possible to notice general skills (like statistics, economics, international relations etc.), tools and technologies for statistics (SPSS, Stata etc.), skills related to tasks and methods (like project management) and in particular, to statistical tasks (statistical modelling, forecasting etc.), soft skills (like leadership) and skills related to programming languages, databases (like R, SQL).

The analysed dataset reflects the necessity of raising awareness about big data methods, tools and technologies, since big data skills and the pre-requisite skills for working with big data rarely appear in statisticians’ profiles.

1 https://www.linkedin.com (accessed in September 2017)

Figure 17: LinkedIn Experiment (bar chart: number of statistician profiles mentioning each skill; top skills include statistics, data analysis, economics, SPSS, Stata, R, SQL, microsoft excel, project management, leadership and others)

Table 9 presents the skills extracted from statisticians’ profiles in the LinkedIn experiment, organized according to the big data skills framework.

Table 9: Skills Based on LinkedIn Experiment According to Skills Framework

SOFT SKILLS: Communication, Teamwork, Information privacy, Negotiation, Ethics, Coordination, Leadership
STATISTICAL AND DATA SCIENCE TASKS: Algorithmic-based inference, Statistical confidentiality and statistical disclosure control, Multivariate analysis, Time series and seasonal adjustment, Data visualization, Data resource management, Data governance, Setting up data warehouses
ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES): Project management, Delivering training
INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES): Testing of systems and software, Security, Developing software
PROGRAMMING LANGUAGES: Visual Basic, C++, Java, Python, R
ARCHITECTURE TOOLS AND TECHNOLOGIES: Distributed computing
UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES: Machine learning, Databases, Data mining, Business intelligence
STATISTICS AND BUSINESS INTELLIGENCE: Excel, SAS, Matlab, SPSS, Stata
DATA MANAGEMENT: (no skills identified)
DATABASES: SQL, Oracle, SAP
VISUALIZATION TECHNOLOGIES: Tableau

4. Analysis of the training needs of the statistical offices

In order to obtain an overview of the training needs in the area of big data from the inside perspective, JSI conducted a big data training needs survey targeted at the big data focal points in European NSIs.

The survey contained a set of the following questions to be answered and commented by respondents:

- What skills need to be acquired?
- What data sources should be covered?
- How many staff members need training?
- By when do they need it?
- What are the priorities?

Appendix 8.6 presents the questionnaire form. Sections 4.1, 4.2 and 4.3 provide more details about the survey and the obtained results.


4.1 Big data training needs survey overview

The Big Data Training Needs Survey was conducted from July to September 2017. The questionnaire was sent to the big data focal points in 28 NSIs and 3 EFTA countries.

By 30 September 2017, 20 replies (Germany, Denmark, Hungary, Malta, United Kingdom, Poland, Austria, Czech Republic, Finland, Belgium, Latvia, Spain, Estonia, France, Slovenia, Cyprus, Slovakia, Croatia, Luxembourg and Ireland) had been received.

Figure 18 shows the countries whose NSIs responded to the survey (marked in red).

Figure 18: Big Data Training Needs Survey - Answers Map

4.2 Big data training needs survey results

4.2.1 Skills to be acquired

NSIs have identified different skills relevant to the use of big data in official statistics. The skills can be aggregated into the following groups:

1. Methodological skills;
2. Technical skills;
3. Visualization and storytelling skills;
4. Contextual skills and
5. Soft skills.

It was proposed that each group can have a foundational/introductory and an advanced skill level.


The skills groups based on the suggested types are described below.

Methodological skills

Introductory/Foundational skills (cited from the questionnaire replies):

- Understanding big data sources in terms of their usability;
- Introduction to new data science techniques (e.g. machine learning, NLP);
- Machine learning in general;
- Methodology, both traditional-statistical and specific to big data, with the emphasis on data exploration, pattern recognition, information extraction and modelling;
- Statistical modelling knowledge (understand the scope, content, units, etc. of information stored in different data sources; how to build a common dataset; how to clean datasets to make them possible to combine; how to design and apply statistical models);
- Acquisition, mining and processing of big data;
- Methods of cleaning, editing source data and adjustment in large datasets;
- Knowledge of new data structures and new data science techniques;
- Knowledge of database models associated with big data;
- Data streaming.

Several NSIs (Hungary, Malta, Estonia, Slovenia) mentioned the importance of exploring different methods of combining and linking different data sources:

- How to use different data sources (statistical surveys, administrative data, big data, other) and methods to combine them (data linking, matching techniques);
- Efficient data linkage methods in order to link big data with primary or auxiliary secondary data;
- Data linking methods (when direct links do not exist).

Other foundational skills might also include:

- Bayesian learning;
- Analysis of time series and prediction;
- Automatic learning;
- Statistical skills such as quantitative and qualitative analysis, weighting, inference, validity and modelling.

Advanced skills would include:

- Advanced big data analytics, modelling with big data;
- Methods to detect and manage atypical data (outlier treatment);
- How to practically apply classification methods and other machine learning methods, text analysis methods;
- Nowcasting;
- Data mining methods;
- Deep learning;
- Selecting the relevant information from huge volumes of data (high-dimensionality, high-frequency or both);
- How to deal with interoperability issues, access and use big data for statistical purposes and consider samples of it instead of the entire dataset for confidentiality purposes (e.g. scanner data);
- Dealing with computational time issues (feasible algorithms, choice of software);
- Fault tolerance and resilience;
- Efficient programming.

To summarise this group of skills: the foundational skills generally map to the various tasks and methods and to the upper-level technologies of the skills framework.

It is clear that NSIs would like training that, at the introductory level, includes a general overview of the technologies for working with big data and, at the advanced level, provides deeper knowledge about methodologies and technologies in the area of big data.

Technical skills

Technical skills represent tools and technologies relevant for the use of big data. Depending on the training programme objective, the Technical Skills can also be grouped into Introductory Skills and Advanced Skills.

The Introductory Skills would include:

- Fundamentals of the software and hardware needed for the collection, storage, processing and reporting of large volumes of data, as a basis for the development and adaptation of such techniques to specific problems.

Many NSIs (Spain, Poland, UK, Denmark, Belgium) mentioned that they would like to receive training on cloud technologies and distributed processing (cited from the questionnaire replies):

- Introduction to distributed processing;
- Cloud computing;
- Distributed computing;
- Applied knowledge of distributed processing methods (e.g. Spark);
- Distributed memory systems (clusters, clouds);
- Computing platforms for big data (computing architectures, parallel computing, big data frameworks, especially the Hadoop ecosystem and Spark);
- Storage systems for big data (distributed file systems, parallel file systems, storage technologies);
- Affinity for new technologies (open source tools, Hadoop), hands-on experience with such tools;
- Parallel and distributed computing paradigms;
- MapReduce2 (Pig, Hive, Giraph, Sqoop, Mahout).

The Spark framework and Hadoop are among the most popular technologies (a minimal distributed word-count example is sketched after the list):

- Spark (MLlib, Spark SQL, Streaming, GraphX);
- Hadoop file system;
- Impala;
- Storm;
- Flume.
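To make the distributed processing concepts above concrete, the following is a minimal word-count sketch using PySpark, assuming a local Spark installation; the input path is an illustrative placeholder, not a file referenced elsewhere in this report.

# Minimal MapReduce-style word count with PySpark (illustrative sketch).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.textFile("data/sample.txt")  # hypothetical input path
counts = (lines.flatMap(lambda line: line.split())      # map: line -> words
               .map(lambda word: (word, 1))             # map: word -> (word, 1)
               .reduceByKey(lambda a, b: a + b))        # reduce: sum counts per word

for word, n in counts.take(10):
    print(word, n)

spark.stop()

The flatMap/map/reduceByKey chain is the classic MapReduce pattern that the requested trainings would cover; the same program scales from a laptop to a Hadoop/Spark cluster without code changes.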

Some countries included databases and data structures skills (a minimal example follows the list below):

- Management of large (sometimes unstructured) datasets;
- Using, building, maintaining databases;
- Basic SQL knowledge;
- NoSQL (MongoDB);
- Introduction to data structures (e.g. JSON, XML, SQL/NoSQL);
- More complex data structures (e.g. graph databases).
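As an illustration of the NoSQL entries above, the following is a minimal sketch of storing and querying schema-free (JSON-like) documents in MongoDB via the pymongo driver; the server address, database and collection names are illustrative assumptions.

# Minimal MongoDB usage sketch with pymongo (illustrative names).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")    # assumed local server
collection = client["statistics"]["scraped_prices"]   # hypothetical db/collection

# Documents need no fixed schema, which suits heterogeneous big data records.
collection.insert_one({"product": "milk 1l", "price": 0.89, "source": "web"})

# Query with a JSON-based filter document.
for doc in collection.find({"price": {"$lt": 1.0}}):
    print(doc["product"], doc["price"])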

Programming skills are usually represented by R and Python (a short Python example is sketched after the list):

- Programming/technologies/scripting languages;
- Basic proficiency in R and/or Python;
- Scala.
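As a minimal illustration of the basic Python proficiency requested, the sketch below loads a tabular dataset with pandas and computes simple grouped statistics; the file name and column names are illustrative placeholders.

# Basic data exploration with pandas (illustrative file and columns).
import pandas as pd

df = pd.read_csv("data/turnover.csv")               # hypothetical dataset
print(df.describe())                                # per-column summary statistics
monthly = df.groupby("month")["turnover"].mean()    # grouped aggregation
print(monthly.head())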

Several responses (Latvia, Poland) also mention the following Statistical skills (tools):

- SAS;
- SPSS.

Web scraping plays an important role in the NSIs’ expectations: they would like to learn how to create and use web scraping tools. A minimal scraping example is sketched below.
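A minimal sketch of such a tool, using the widely taught requests and BeautifulSoup libraries, is shown below; the URL and the CSS class are illustrative placeholders rather than a real data source.

# Minimal web-scraping sketch (illustrative URL and selector).
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.org/prices", timeout=10)
response.raise_for_status()                       # fail early on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")
# Extract the text of all elements carrying a (hypothetical) "price" class.
prices = [tag.get_text(strip=True) for tag in soup.select(".price")]
print(prices)

In practice, trainings would also need to cover crawling etiquette (robots.txt, rate limiting) and the methodological issues the respondents mention.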

Additionally, respondents mentioned the following skills/technologies/needs:

- TensorFlow;
- mxnet;
- System administrators’ needs;
- Introduction to code repositories (e.g. GitHub);
- Collaborative project management, collaborative project development tools.

Visualization and storytelling skills

Visualization skills play an important role in modern data science.

Introductory visualization skills include (cited from questionnaire replies):

- Data analytical/visualisation skills;
- How to communicate new indicators in easy and understandable ways (especially interactive and dynamic outputs);
- Introduction to basic visualisation techniques (e.g. using R, Shiny, Bokeh).

Advanced visualization skills (a minimal example is sketched after the list):

- Data analytical/visualisation skills, such as combining various data processing techniques;
- More advanced visualisation (e.g. D3).
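To illustrate the visualization tools listed above, the following is a minimal Bokeh sketch that produces a standalone interactive HTML chart; the data values are illustrative.

# Minimal interactive chart with Bokeh (illustrative data).
from bokeh.plotting import figure, output_file, show

output_file("indicator.html")                     # standalone HTML output

p = figure(title="Monthly indicator (illustrative)",
           x_axis_label="month", y_axis_label="index")
p.line([1, 2, 3, 4, 5], [100, 102, 101, 105, 107], line_width=2)
show(p)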


Contextual skills

Other important big data skills are related to quality, security and privacy issues:

- Knowledge about the context and environment, the data and the data owners: where the data are, who owns them, how they are generated and stored, what their characteristics are (technical, legal and privacy status, …) and what their limits are;
- Understanding of security, quality and privacy issues;
- Quality issues in big data;
- Innovation and contextual awareness;
- Information security and technology risk management.

Soft skills

The NSIs listed the following soft skills needed for the use of big data in official statistics:

- Teamwork;
- Communication;
- Cooperation and negotiation;
- Creative and innovative mind-set;
- Interpersonal and communication skills;
- Leadership and strategic direction;
- Judgement and decision-making;
- Management and delivery of results;
- Building relationships and communication.

4.2.2 Data sources to be covered
The NSIs often mention that they would like to obtain skills fitted to particular data sources/data types:

- Analysis and exploitation of web data;
- Social media analysis;
- Working with structured, unstructured and semi-structured data.

Textual data, web-scraped data and sensor data are the most frequently targeted big data types and data sources.

For instance, the following unstructured or semi-structured data types and data sources were mentioned:

- Textual data;
- Social media data;
- Sources providing unstructured data that needs a lot of transformation, e.g. natural language processing or statistical image analysis (this could also be called feature extraction). As a two-way characterization of a data source, one should consider the complexity of the data (messy/unstructured vs straightforward) and the amount of data (the need for distributed computing may be more critical for some sources);
- Other sources that are more closely related to text mining or of a qualitative nature should also be presented as some other potential examples, but detailed training is not needed in these areas;
- Large-scale and/or complex administrative data sources.

Web-scraped data is the most frequently mentioned data source in the responses (cited from the questionnaire replies):

- The web-scraping techniques and potential sources for web-scraping (typically prices) are very important and relevant for actual developments. This should definitely be covered in the training (sources, tools, methodological problems and solutions, examples).
- Web-scraped data have not yet proved to be a success, but we are looking into that in certain fields. We have other sources as well, but no special training needs have been recognized for those.
- Web-scraped data (INE).
- Web scraping as an additional source for updating registers (e.g. the tourism register).

Smart meter data and sensor data are targeted as well (cited from the questionnaire replies):

- Water and electricity consumption;
- Smart meter data;
- Sensor data; any other potential source that is quantitative in nature (financial transactions, sensors) should also be included (sources, tools, methodological problems and solutions, examples);
- Sensor data (especially images).

Structured data:

- Traffic loops data and similar data sources.

Other expected sources are mobile phone data and mobile telephony services data.

NSIs would also like to analyse scanner data.

Financial transaction data would be an appropriate data source for several NSIs (Malta, Hungary, Spain, Austria).

Finally, satellite and aerial photo data and license plate data have been mentioned.


4.2.3 Staff members training
Depending on the number of employees in a particular NSI, different numbers have been mentioned, ranging from 3-4 people (including all professional types) to one statistician per domain, one methodologist per domain and several IT experts. Several NSIs responded that they would like all their staff to be trained.

For instance, in the case of Slovenia, the reply states that in the broad sense (understanding and usage of big data sources) many NSI staff members (especially subject-matter statisticians) should be trained, while in the narrow sense (processing of big data, creating statistics and quality indicators) only a few staff members (two methodologists and two IT experts) should be trained.

4.2.4 Training needs for different staff members
NSIs suggested that the training should differ according to staff type and level. For instance, the NSI of Denmark proposed two levels of courses (introductory and deep level) for different staff types. Introductory and advanced skill levels were already discussed in section 4.2.1.

The training can be targeted at individual experts as well as at data science teams.

Training targeted at levels:

- Introductory level, i.e. knowledge but not necessarily the ability to implement. This is a prerequisite for having an informed opinion about e.g. the technical solutions needed. The foundational skills for a statistician vs a methodologist would be fairly similar; however, more advanced big data skills would likely deviate. For example, visualization may become more important for domain statisticians, while methodologists may need to incorporate specific big data knowledge into specific areas of expertise (e.g. use of computationally intensive algorithms for disclosure checking). IT experts have a completely different set of training requirements: the emphasis will be much more on data engineering and distributed processing. They should already have good programming skills and do not necessarily need knowledge of statistics; their needs should be met through a separate training course.
- Deep level, i.e. the ability to implement and develop solutions further. Here the participants are already proficient with the technology, and the format of the training will be more that of a workshop, where people work together to solve a specific problem, applying techniques at a high level.

Training targeted at data science teams

Generally, respondents thought that there should be data science teams, as data science itself requires versatile skills and no single person is expected to master all of them. This implies specific training needs for experts, but also some more general common ones. Teams can be composed of different specialists rather than employing single data scientists. The training needs are different for team leaders and experts.

For instance, the focal point from France mentions that the objective would be to create multidisciplinary teams.

Training differentiated by profile:

- IT support group;


- Statistical group/methodologists/domain units;

(Latvia)

- Infrastructure, fault tolerance techniques, security, optimization, storage, etc. for IT experts;
- Machine learning and big data frameworks for methodologists and statisticians.

(Spain)

- IT tools for managing big data sources (IT infrastructure is not an issue): IT experts and/or general methodologists.
- Skills regarding the potential of big data sources and their usage in statistical production: related mostly to statisticians in particular domains.
- Methods for nowcasting: related mostly to general methodologists.
- Negotiating skills: here we have a real drawback (in my experience and opinion), because people do not have the skills to negotiate with data providers in order to ensure data access. This is crucial.

(Slovenia)

- General statistical methodologists;
- Statisticians in a particular domain;
- IT experts.

(Slovakia)

Training related to different data types/data sources:

- Sensor data, mobile phone data, financial transaction data etc. are similarly structured data. If one has experience with one such data source, it is quite easy to switch to similar sources. General methodologists are the most appropriate to work with the mentioned data.
- The same goes for unstructured data (web scraping, social media data). General methodologists are the most appropriate to work with the mentioned data; IT experts with deep methodological knowledge are also suitable (web scraping, detection of phenomena of interest).

4.2.5 Training timeline
Many NSIs mentioned that trainings should start as soon as possible and be continuous; NSIs have listed 2018 as a possible starting point.

The NSI of France states that the training should start within two years.


The NSI of Belgium specifies that some data types (scanner data used for the CPI and HICP) are already operational for statistical purposes and that in the next 2-3 years mobile phone data, smart meter data and web-scraped data would be used in statistical offices.

The NSI of the UK mentions that they have access to data (textual data) that they would like to start analysing.

Indeed, several NSIs have already started trainings.

One NSI mentioned that trainings should be project-oriented.

Two NSIs stated that using big data is a long-term goal and that trainings can be organized in the coming years.

4.2.6 Training needs priorities
Priorities for training methods/knowledge transfer types:

- For the introductory level and for acquiring specific programming skills, a cost-effective solution would be web-based courses; e.g. learning Spark is better done by following a specialization on Coursera than by attending a one-week course on site. As mentioned, for intermediate and especially advanced courses the format could be more workshop-like (“jam session”).
- One NSI mentioned that it is not yet at the stage where it can transmit skills systematically and formally; it is still building knowledge. Knowledge is currently transmitted informally and ad hoc, from the people who have worked with the data and developed methods to statistical domain specialists, in two main areas.
- Another NSI responds that training methods do not have to be presential (face-to-face); they can be webinars, online courses, use of social media, etc.
- Some of the training could be done by simply compiling study material and making it available electronically. Otherwise, webinars are useful for learning to use specific software. Also, some kind of support channel or a dedicated wiki page with a discussion board option would be very useful.
- One NSI would like to have more expert courses for a smaller number of participants with a high level of expertise, and to use more external courses for this group.

Priorities for data sources/data types:

- Web-scraped data and mobile phone data are the most popular data types among the data source priorities;
- Sensor data;
- Scanner data (price statistics);
- Water and electricity consumption (various social and economic domains);
- Financial transaction data (various economic domains) and
- Textual data (in general) are mentioned.
- Priority should be focused on guaranteeing access to data sources, such as smart meters and mobile phone data.


Priorities for technologies/methods/skills:

- Foundational data science skills;
- Best practices or guidelines for forming big data partnerships;
- Technical aspects of big data processing;
- Methods for big data processing.

- Methods to combine big data sources with other types of data sources.

- Managing large (sometimes unstructured) datasets.
- Using, building, maintaining databases.

- Selection of the relevant information from huge volumes of data (high-dimensionality, high-frequency or both).

- Methods for nowcasting (rapid estimates).

- Hands-on experience with big data IT tools.
- Setting up a test environment, so the skills required by the system administrator are the most needed.
- IT skills and methods applied to web-scraped data are important, as well as text analysis skills.

- Quality measures for processing of big data and calculated statistics.

- Programming and collaborative management tools.

Priorities for other issues:

- Time is the major constraint; a lot of tutorials and training material is already available for free on the Internet.
- Trainings up to 1 week, trainings on the job.

4.3 Big data training needs survey summary
The Big Data Training Needs Survey was conducted in July-September 2017. The responses received from the big data focal points included filled-in questionnaires (from 17 NSIs) and general comments (from 3 NSIs).

In particular, the survey defined the groups of skills that NSIs would like to acquire:

- Methodological skills;
- Technical skills;
- Visualization and storytelling skills;
- Contextual skills and
- Soft skills.


A number of individual skills stand out, such as skills targeted at understanding big data in general and linking big data to statistical data sources. Cloud technologies and distributed processing, the Spark framework and Hadoop, and R and Python are among the most requested tools and technologies. According to the NSIs, visualization can be addressed with R, Shiny, Bokeh and D3. Quality issues of big data and understanding the context and environment of data and data owners are frequently mentioned, as are soft skills such as teamwork, communication and leadership.

In order to map the skills requested by the NSIs onto the skills listed in the big data skills framework, JSI analysed the set of skills identified in the Big Data Training Needs Survey.

Table 10 presents the skills identified by the Big Data Training Needs Survey according to the big data skills framework. It shows in detail the skills that NSIs would like to target.

Table 10: Skills from Big Data Training Needs Survey According to Skills Framework

SKILL GROUP | SKILLS FROM BIG DATA TRAINING NEEDS SURVEY REPLIES
SOFT SKILLS | Communication, Innovation and contextual awareness, Teamwork, Creative problem solving, Negotiation, Leadership, Delivery of results, Information privacy, Coordination
STATISTICAL AND DATA SCIENCE TASKS | Nowcasting and projections, Nonresponse adjustment and weighting, Analysis of aggregated data, Multivariate analysis, Time series and seasonal adjustment, Quality assessment, Data visualization, Data resource management, Setting up data warehouses, Data storage, Data processing, Data conversion
ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES) | Quality assurance and compliance, Project management
INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES) | System architecture, Hardware and infrastructure, Developing software, Systems and software maintenance
PROGRAMMING LANGUAGES | R, Python, Scala
ARCHITECTURE TOOLS AND TECHNOLOGIES | Distributed parallel architecture, Distributed computing, Distributed filesystems
CLOUD TOOLS AND TECHNOLOGIES | Cloud computing
HADOOP | Hadoop
UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES | Machine learning, Databases, Understanding algorithms, Data mining, Deep learning, Artificial intelligence, Natural language processing, Stream processing and analysis, IoT (Internet of Things), Multimedia analysis, Web technologies (web scraping)
STATISTICS AND BUSINESS INTELLIGENCE | SAS, Apache Spark, SPSS
DATA MANAGEMENT | Apache Hive, Apache HBase, Apache Sqoop, Apache Pig, Cloudera Impala
DATABASES | Massively parallel-processing (MPP) databases, DBMS, SQL, NoSQL, MongoDB
SEARCH TECHNOLOGIES | Search-based applications
VISUALIZATION TECHNOLOGIES | D3, Shiny, Bokeh

Several data types / data sources, such as web-scraped data, mobile phone data, sensor data, scanner data etc., have been frequently listed as priorities.

The survey also established that training should be provided at different levels (introductory and advanced), can be targeted at different employee profiles, and can be delivered to individuals as well as to big data teams. Training should take into account the big data sources and types that would be addressed in the ESS.

The minimum and maximum number of trainees varies depending on the size of the NSI and the NSI training strategy.

The priorities defined in the survey included:


- Priorities for training methods/knowledge transfer types (such as webinars, online courses along with face-to-face training);

- Priorities for data sources/data types (web-scraped data and mobile phone data are frequently mentioned in the priorities);

- Priorities for technologies/methods/skills (like introductory big data methodologies and skills);

- Priorities for other issues (trainings up to 1 week, trainings on the job).

4.4 Towards bridging the skills gap for big data in statistics
The information presented in the previous sections shows that there is a gap between the current skills of statisticians in the NSIs who would like to work with big data and the skills that would allow them to analyse big data for statistical purposes. The Big Data Training Needs Survey provided valuable insight into how to bridge this gap and gave a basis for developing the training strategy for the NSIs.

In particular, the Big Data Training Needs Survey shows how the European NSIs see training in relation to their working environment in the coming years.

One of the aspects identified by the survey is that training should be provided at different levels, where many team members would receive introductory training with the possibility to proceed to an advanced level through advanced training.

Some training courses could also be delivered in a way that allows involving the whole team; online training methods can be seen as a possible delivery option.

The majority of the NSIs indicated that they would like to start the training as soon as possible, in 2018.

5. Training objectives and content for the design of a training program

5.1 Learning models and curriculum design approaches

5.1.1 Learning models
The modern literature describes a number of learning models and curriculum design approaches. The Horizon 2020 EDISON project (1 September 2015 – 31 August 2017) [1] developed a framework for what defines the data science profession. The EDISON framework includes components such as the Data Science Competences Framework, the Data Science Body of Knowledge, the Data Science Model Curriculum and the Data Science Professional Framework.

The learning models frequently described in the literature (and addressed in the EDISON project) are:

 Bloom’s taxonomy;

 Problem-based learning and

 Competences-based learning.


Bloom’s taxonomy

Bloom’s taxonomy [18] provides a conceptual framework to organize levels of learning of a topic or subject, and assigns action verbs to each level that help to understand the activities related to particular levels of learning.

Figure 19 presents the structure of Bloom’s taxonomy.

The levels of Bloom’s taxonomy include the remembering level (where students identify the relevant technologies), the understanding level (where students can explain how technologies work), the applying level (where the right technology for a specific problem is chosen), the analyzing level (where relationships are analyzed), the evaluating level (where judgements are made) and the creating level (where new solutions are created).

Figure 19: Bloom’s taxonomy

Constructive Alignment and Problem-based Learning

Contrary to traditional learning models, where students are provided with knowledge by a teacher, passively absorb it by memorizing schemes and are then evaluated through examinations, the constructive alignment model gives students a central role in the learning process and the knowledge construction process [19].


Problem-based learning

Problem Based Learning (PBL) [20,21] is based on problems given to students to solve with teacher consultation. PBL assumes active student involvement and motivation through evaluation. Problem-based learning is considered to be one of the forms of constructive alignment. Constructive alignment was described by Biggs [22]: "constructive alignment" refers to the learner constructing his or her learning through relevant learning activities. The teacher in this process sets up a learning environment that supports achieving the desired learning outcomes. In particular, in the area of computer science the constructive alignment process was described by Ben-Ari [23].

The EDISON framework developers state that constructive alignment and problem-based learning can be implemented in the form of project-based learning: the regular classes provide students with competences related to specific knowledge areas, while additional project classes allow them to establish a link between these competences [24].

Competence Based Learning Model

Competency Based Learning (CBL) or Competence Based Education (CBE), also known as outcomes-based learning, uses an approach different from the one in traditional education. Instead of focusing on how much time students spend learning a particular topic or concept, CBL assesses whether students have mastered the given competencies: the knowledge, skills and abilities [25]. CBL is often used for re-skilling or additional training scenarios. The benefit of CBL is its flexibility, since it allows both self-study and instructor guidance. CBL programs usually offer the following features [26]:

• Self-pacing;
• Modularization;
• Effective assessments;
• Intentional and explicit learning objectives shared with the student;
• Anytime/anywhere access to learning objects and resources;
• Personalized, adaptive or differentiated instruction;
• Learner support through instructional advising or coaching.

The literature states [26] that CBL was created to address the needs of non-traditional students who cannot devote their entire time to traditional academic studies, as well as to provide effective models for companies to (re/up)skill their staff. Thus, CBL approaches, or mixed approaches involving CBL, seem suitable for modeling the training objectives and training content for trainees in the big data area in the statistical offices of Europe.

5.1.2 Curricula guidelines
The ACM Committee for Computing Education in Community Colleges (CCECC) and the IEEE Computer Society have jointly produced curricular recommendations and guidelines for baccalaureate computing programs, known collectively as the ACM Computing Curricula series. The guidelines include the ACM Competency Model of Core Learning Outcomes and Assessment for Associate-Degree Curriculum in Information Technology (IT2014) [25]. Mainly, the recommendations focus on student competencies instead of credit points. The measurable learning outcomes take Bloom’s taxonomy into account.

5.2 Related curricula and classifications

5.2.1 ACM classification for computer science
In the ACM classification for computer science [27], the Body of Knowledge is defined as a specification of the content to be covered in a curriculum, a curriculum being an implementation of it. The ACM Body of Knowledge includes 18 Knowledge Areas (KA):

 AL - Algorithms and Complexity

 AR - Architecture and Organization

 CN - Computational Science

 DS - Discrete Structures

 GV - Graphics and Visualization

 HCI - Human-Computer Interaction

 IAS - Information Assurance and Security (new)

 IM - Information Management

 IS - Intelligent Systems

 NC - Networking and Communications (new)


 OS - Operating Systems

 PBD - Platform-based Development (new)

 PD - Parallel and Distributed Computing (new)

 PL - Programming Languages

 SDF - Software Development Fundamentals (new)

 SE - Software Engineering

 SF - Systems Fundamentals (new)

 SP - Social Issues and Professional Practice

The courses that compose a computer science curriculum should cover two types of topics: topics mandatory for each curriculum (“Tier-1”) and topics expected to be covered to at least 80% (“Tier-2”). Tier-1 and Tier-2 topics are defined differently for different programmes and specializations.

The work-place skills from the ACM classification describe the ability of the student/trainee to:

 function effectively as a member of a diverse team,
 read and interpret technical information,
 engage in continuous learning,
 demonstrate professional, legal, and ethical behavior,
 demonstrate business awareness and workplace effectiveness.

Each KA in the ACM classification is organized into a set of Knowledge Units (KU). In the final step, each KU lists a set of topics and learning outcomes (LO). The LOs are associated with a level of mastery derived from Bloom’s taxonomy (familiarity, usage, and assessment).

5.2.2 Curriculum development in EDISON project
The Data Science model curriculum developed within the EDISON project is based on the ACM guidelines, taking into account the Competence Based Learning model. The curriculum is organized into core and elective topics, following the ACM definition: core topics are required in every data science program, while elective topics are specific to particular areas.


Table 11 presents the knowledge levels for learning outcomes in the Data Science model curricula of the EDISON project.

Table 11: Knowledge Levels for Learning Outcomes in Data Science Model Curricula (MC-DS)

Level | Action Verbs
Familiarity | Choose, Classify, Collect, Compare, Configure, Contrast, Define, Demonstrate, Describe, Execute, Explain, Find, Identify, Illustrate, Label, List, Match, Name, Omit, Operate, Outline, Recall, Rephrase, Show, Summarize, Tell, Translate
Usage | Apply, Analyze, Build, Construct, Develop, Examine, Experiment with, Identify, Infer, Inspect, Model, Motivate, Organize, Select, Simplify, Solve, Survey, Test for, Visualize
Assessment | Adapt, Assess, Change, Combine, Compile, Compose, Conclude, Criticize, Create, Decide, Deduct, Defend, Design, Discuss, Determine, Disprove, Evaluate, Imagine, Improve, Influence, Invent, Judge, Justify, Optimize, Plan, Predict, Prioritize, Prove, Rate, Recommend, Solve

Annex 8.7 provides a template and examples for defining the learning outcomes related to the enumerated Data Science Competences Framework (CF-DS) competences.

Developing training objectives for a big data training program follows this learning outcomes model, where each topic (knowledge area) is supported by a number of learning outcomes.

5.2.3 Curriculum development in EDSA project
The European Data Science Academy (EDSA) project [6] designs curricula for data science training and data science education across the European Union (EU). The EDSA establishes a virtuous learning production cycle whereby: a) the required sector-specific skillsets for data scientists across the main industrial sectors in Europe are analyzed; b) modular and adaptable data science curricula to meet industry expectations are developed; and c) data science training supported by multi-platform and multilingual learning resources is delivered.

The EDSA courses portfolio

The EDSA provides a wide spectrum of courses from the following categories:

 Self-study courses: These courses consist of self-study learning materials available as Open Educational Resources (OERs). Learners can study them at their own pace, as there is no predetermined start or end date.


 MOOCs: These Massive Open Online Courses (MOOCs) are available on external MOOC platforms, such as FutureLearn.
 Blended courses: These courses are taught in a blended way (face-to-face and online) by EDSA partners and associate EDSA partners.
 Face-to-face courses: These courses are taught face-to-face by EDSA partners and associate EDSA partners.

As can be seen from the above list, the EDSA courses cover all types of learning contexts, from the traditional face-to-face pedagogical model to the more recent trends in online education (MOOCs and OERs).

Delivery channels and formats

EDSA courses are delivered:

• Via the Moodle Learning Management System (HTML format);
• As an eBook (available via iBooks, in ePUB format).

EDSA curricula

The EDSA curriculum contains courses in several stages:

 Foundations;
 Storage and Processing;
 Analysis;
 Interpretation and Use.

Tables 12-13 present the evolution of the EDSA curriculum over time.

Table 12: Core EDSA Curriculum, version 1

Topic | Stage
Foundations of Data Science | Foundations
Foundations of Big Data | Foundations
Statistical / Mathematical Foundations | Foundations
Programming / Computational Thinking (R and Python) | Foundations
Data Management and Curation | Storage and Processing
Big Data Architecture | Storage and Processing
Distributed Computing | Storage and Processing
Data Intensive Computing | Storage and Processing
Machine Learning, Data Mining and Basic Analytics | Analysis
Big Data Analytics | Analysis
Process Mining | Analysis
Data Visualisation | Interpretation and Use
Visual Analytics | Interpretation and Use
Finding Stories in Open Data | Interpretation and Use
Data Exploitation including data markets and licensing | Interpretation and Use

Table 13: Core EDSA Curriculum, version 3

Topic | Stage
Foundations of Data Science | Foundations
Foundations of Big Data | Foundations
Statistical / Mathematical Foundations | Foundations
Programming / Computational Thinking (R and Python) | Foundations
Data Management and Curation | Storage and Processing
Big Data Architecture | Storage and Processing
Distributed Computing | Storage and Processing
Data Intensive Computing | Storage and Processing
Linked Data and the Semantic Web | Storage and Processing
Machine Learning, Data Mining and Basic Analytics | Analysis
Big Data Analytics | Analysis
Process Mining | Analysis
Social Media Analytics | Interpretation and Use
Data Visualisation and Storytelling | Interpretation and Use
Data Exploitation including data markets and licensing | Interpretation and Use

Table 14 presents recommendations for EDSA curriculum development that were taken into account while developing training objectives for statistical offices in Europe in the area of big data.

Table 14: Recommendations for EDSA Curriculum Development

Title | Intervention level | Summary description
Holistic training approach | General training approach | Refine the training approach and curriculum cycle to strengthen skills along the full data exploitation chain.
Open source based training | Existing curriculum design | Continue current technical and analytical training based on open source technologies; apply a cross-tool focus to deliver overarching training.
Soft skills training | Expansion of curriculum | Implement soft skill training to increase the performance and organization impact of data scientists / data science teams.
Basic data literacy training | Expansion of curriculum | Develop basic data literacy and data science training for non-data scientists to improve basic skills across organizations and facilitate the uptake of data-driven decision making and operations.
Blended training | Course delivery | Develop blended training approaches including sector-specific exercises and examples to increase the effectiveness of training delivery.
Data science skills framework | Training approach and delivery | Implement a data science skills framework to structure skills requirements, assess skills of data scientists, and identify individual skills needs.
Navigation and guidance | Training market | Develop quality assessment of third-party courses; provide navigation support to identify relevant trainings from EDSA and third parties.

The EDSA project’s experience in creating a personalized learning environment is considered useful for developing training opportunities in the big data area for statistical offices in Europe.


5.3 Training objectives for statistical offices in Europe in the area of Big Data

5.3.1 Defining training objectives
Previously, in Task 4, the skill groups (Tables 3 and 4) and the skills from the Big Data Training Needs Survey (Table 10) were defined.

Taking into account the learning models, relevant initiatives and big data training needs described above, Table 15 presents a set of training objectives (based on learning outcomes) for big data training in the statistical offices in Europe.

Table 15: Training Objectives Mapped to Big Data Training Needs

Each training objective (TO) lists the big data topics covered and the learning outcomes (LO) at the three knowledge levels of Table 11 (Familiarity, Usage, Assessment); the action verbs associated with each level are those given in Table 11.

SOFT SKILLS

TO1. Big data topics: Communication, Innovation and contextual awareness, Teamwork, Creative problem solving, Negotiation, Leadership, Delivery of results, Information privacy, Coordination.
Familiarity: Use relevant soft skills to solve the problem, obtain results, communicate and deliver them at different levels. Motivate the team.
Usage: Coordinate, organize and lead the team. Identify the strategy for task execution.
Assessment: Be able to assess the current working strategy and, in particular, the obtained results. Be able to prioritize important tasks. Be able to criticize or defend the selected position. Be able to predict and give recommendations.

STATISTICAL AND DATA SCIENCE TASKS

TO2. Big data topics: Data resource management, Nowcasting and projections, Data visualization, Setting up data warehouses, Data storage, Data processing, Data conversion, Nonresponse adjustment and weighting, Spatial analysis/GIS/cartography, Analysis of aggregated data, Multivariate analysis, Time series and seasonal adjustment.
Familiarity: Execute the data management strategy and data management plan. Define technical requirements. Recognize big data sources as useful for nowcasting. Identify formats for relevant types of big data and select the suitable data formats for a concrete problem. Identify data in different formats and select the most appropriate techniques for big data processing. Find possible solutions for big data storage for statistical purposes. Choose potential technologies for data warehouses. Choose appropriate existing analytical methods and existing tools to do specified big data analysis. Present data in the required form. Be able to select the appropriate software to visualize big data.
Usage: Develop elements of a data management strategy and data management plan. Collect/web-scrape required big data datasets. Select and execute the most appropriate techniques for data warehouse set-ups. Execute and operate statistical tasks using selected data storage technologies. Use standard technologies for big data processing. Be able to use standard methods and tools for big data conversion, pulling together heterogeneous data. Identify necessary methods and use and operate them in combination if necessary. Develop big data analysis applications for concrete statistical problems. Identify relations and provide consistent reports and visualizations. Build visualizations for complex and variable data.
Assessment: Evaluate possible data management strategies and combine them into data management plans, taking into account organizational specifics. Discover relations, propose optimization and improvements. Develop new models and methods (e.g. for nowcasting) if necessary. Evaluate outcomes of data processing. Create solutions and methods for data conversion in the statistical domain. Recommend and influence improvements based on continuous data analysis. Discover hidden relations via visualizations. Create and optimize visualizations to properly support decision making. Predict and evaluate the differences between different technological solutions for specific datasets and tasks or processes.

ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES)

TO3. Big data topics: Big data quality assessment, assurance and compliance; Big data project management.
Familiarity: Follow big data specific quality frameworks and quality assurance methodologies. Follow the specified action plan for a project.
Usage: Develop and adapt big data specific methodologies for quality assurance for particular applications of big data and uses of big data sources. Identify the elements of an action plan for a specific project.
Assessment: Evaluate and predict the success of big data specific quality frameworks and quality assurance methodologies in the context of specific projects. Evaluate and predict the success of different strategies and action plans for projects. Evaluate risks related to big data technologies.

INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES)

TO4. Big data topics: System architecture, Hardware and infrastructure, Developing software, Systems and software maintenance.
Familiarity: Provide requirement analysis, perform architectural design and requirements allocation according to specific big data applications for statistical production. Compare traditional architecture requirements with system architectural requirements for big data. Provide software prototype mock-ups. Compare traditional software requirements with software requirements for big data. Select the relevant technologies for developing big data software. Follow the established process of systems and software maintenance (in the area of big data in statistical production).
Usage: Be able to solve the key issues in software design: concurrency, control and handling of events, data persistence, distribution of components, error and exception handling, interaction and presentation, and security (in the area of big data). Perform software engineering management. Develop the infrastructure and allocate the relevant hardware components within the developed infrastructure, taking into account big data specifics. Develop actual software for big data in statistics. Establish a process of systems and software maintenance in the area of big data.
Assessment: Perform quality analysis and system evaluation for big data in statistics. Perform software evaluation and product management. Perform evaluation of the process of systems and software maintenance (in the area of big data in statistics).

PROGRAMMING LANGUAGES

TO5. Big data topics: R, Python, Scala, JavaScript.
Familiarity: Know the basics of R: the R console, data types and structures, exploring and visualizing data, programming structures, main functions of R base packages, the most common external R packages (ggplot2, stringr, tidyr, dplyr, readr, data.table). Know how to work with an R scripting IDE (e.g. RStudio) and how to execute R scripts. Be aware of the relevant packages in R for data science, big data and distributed computing: pbdR, rhdfs, rhbase, SparkR, sparklyr, tidytext. Know the basics of Python: types and structures, exploring and visualizing data, programming structures, functions, and data relationships. Be aware of the relevant packages in Python for data science, big data and distributed computing: PySpark, NumPy, SciPy, Pandas, Scikit-learn, PySAL for spatial data, ClusterPy, GeoGrouper. Know the basics of Scala: basic syntax, environment setup, data types, variables, classes and objects, operators, functions, traits, Scala web frameworks, and comparing Scala, Java, Python and R in Apache Spark. Know the basics of JavaScript: reading in data, combining data, summarizing data, iterating and reducing, nesting and grouping data, using Node.
Usage: Perform machine learning with R. Execute Spark jobs from R. Perform interactive data visualizations with RShiny. Use Python for big data analytics, machine learning and web scraping. Actively use big data packages, such as pbdR and Rmpi. Actively use R for the analysis of spatial data. Actively use packages in Python for big data, such as PySpark, NumPy, SciPy, Pandas, Scikit-learn, PySAL for spatial data, ClusterPy, GeoGrouper etc. Perform big data analysis with Scala: work with RDDs and DataFrames in Apache Spark using Scala; use the jvmr package; perform reduction operations and work with distributed key-value pairs, partitioning and shuffling. Be able to combine traditional statistical data sources with IoT and other big data sources. Perform big data visualization with JavaScript.
Assessment: Optimize solutions in R, Python, Scala. Adapt algorithms in R, Python, Scala. Develop packages for R, Python and Scala. Evaluate the benefits of using the specified programming language for the concrete task.

ARCHITECTURE TOOLS AND TECHNOLOGIES

TO6. Big data topics: Distributed parallel architecture, Distributed computing, Distributed filesystems.
Familiarity: Understand the concept of distributed computing, its strengths and limitations, and its applications in the statistical domain. Understand architecture models and parallel programming models.
Usage: Be able to choose parallel computing platforms for parallel applications. Be able to design and develop distributed systems and distributed systems applications for the statistical domain. Be able to set up a Hadoop cluster. Be able to apply fundamental computer science methods and algorithms in the development of distributed systems and distributed systems applications. Be able to perform system testing in the statistical domain.
Assessment: Identify problems, and explain, analyze and evaluate various distributed systems solutions in the statistical domain.

CLOUD TOOLS AND TECHNOLOGIES

TO7. Big data topics: Cloud computing.
Familiarity: Understand the basic concepts and key technologies, strengths and limitations of cloud computing and the possible applications of cloud computing in the statistical domain.
Usage: Identify the architecture and infrastructure of cloud computing, including SaaS, PaaS, IaaS, public cloud, private cloud, hybrid cloud, etc. Choose the appropriate technologies, algorithms and approaches for the related issues. Provide the appropriate cloud computing solutions and recommendations according to the applications used in the statistical domain.
Assessment: Identify problems and explain, analyze and evaluate various cloud computing solutions in the statistical domain.

UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES

TO8. Big data topics: Machine learning, Databases, Understanding algorithms, Data mining, Deep learning, Artificial intelligence, Natural language processing, Stream processing and analysis, IoT (Internet of Things), Multimedia analysis, Web technologies (web scraping).
Familiarity: Identify statistical problems that can benefit from machine learning and other methods applicable to big data. Identify the characteristics of datasets and be able to spot big data in various applications in the statistical domain. Understand the basic principles of machine learning, data mining, artificial intelligence, natural language processing, web technologies (and other technologies) and big data in the statistical domain. Understand the machine learning techniques, web technologies (and other technologies) and computing environments that are suitable for applications in the statistical domain. Identify key challenges for machine learning (and other technologies) in the statistical domain.
Usage: Solve problems associated with batch learning and online learning and with big data characteristics such as high dimensionality, dynamically growing data and scalability issues. Develop scaled-up machine learning techniques (and other technologies) and associated computing techniques and technologies for various applications in the statistical domain. Implement various ways of selecting suitable model parameters for different machine learning techniques. Integrate machine learning libraries (and other technologies) and mathematical and statistical tools with modern technologies like the Hadoop distributed file system and the MapReduce programming model. Use tools for big data analytics and present the analysis results. Be able to integrate various statistical and big data types and sources.
Assessment: Perform evaluation of the selected technologies. Understand the impact of big data on decisions and strategy in the statistical domain. Be able to predict the possible challenges that particular technologies bring for a specific task.

STATISTICS AND BUSINESS INTELLIGENCE

TO9. Big data topics: Apache Spark.
Familiarity: Have a general knowledge of Apache Spark and its possible applications in the statistical domain. Describe Spark’s fundamental mechanics. Understand Spark internals.
Usage: Extract, process, analyze and visualize data and perform machine learning with Spark. Implement typical use cases for Spark. Build data pipelines and query large data sets using Spark SQL and DataFrames. Learn how to work with RDDs and DataFrames in Spark. Analyse Spark jobs using the administration UIs and logs inside Databricks. Create Structured Streaming and machine learning jobs. Use the core Spark APIs to operate on data.
Assessment: Evaluate the benefits of using Spark for statistical purposes.

DATA MANAGEMENT

TO10. Big data topics: Apache Hive, Apache HBase, Apache Sqoop, Apache Pig, Cloudera Impala.
Familiarity: Have a general knowledge of Apache Hive, Pig, HBase, Sqoop and Cloudera Impala and their possible applications in the statistical domain. Describe the tools’ fundamental mechanisms. Differentiate Hive from traditional relational database management systems. Create and query tables. Create databases. Import/add and delete data. Know data techniques using Sqoop. Be able to set up and customize Cloudera Manager to monitor and improve the performance of a Hadoop cluster of any size.
Usage: Use advanced data structures with Hive. Set up and load partitioned tables. Use views to query data. Create indexes for tables. Utilize HDFS to store and manage big data. Use advanced Hive and HBase features: the HBase data model, HBase shell, HBase client API. Perform real-time, interactive analytical queries of the data stored in HBase or HDFS with Cloudera Impala.
Assessment: Evaluate the benefits of using big data management technologies for statistical purposes.

DATABASES

TO11. Big data topics: Massively parallel-processing (MPP) databases, DBMS, SQL, NoSQL, MongoDB.
Familiarity: Have a broad understanding of database concepts and database management system software. Have a high-level understanding of major DBMS components and their function. Understand the differences between relational and non-relational databases. Understand how to choose a suitable database for an application in the statistical domain. Know the basics of SQL, NoSQL, MongoDB. Be comfortable with query and update languages.
Usage: Be proficient in SQL. Define, compare and use the four types of NoSQL databases (document-oriented, key-value pairs, column-oriented and graph). Know the concepts of replication, distribution, sharding and resilience in a NoSQL database. Work with the common statistical use cases and architectures of Mongo. Use Mongo’s built-in JavaScript interpreter. Query Mongo using Mongo’s JSON-based query language. Index Mongo collections. Handle data with Mongo’s built-in MapReduce capabilities.
Assessment: Evaluate database development tools and programming languages (for statistical purposes). Understand the benefits of using database technologies for the statistical domain.

SEARCH TECHNOLOGIES

TO12. Big data topics: Search engines.
Familiarity: Get basic knowledge in the area of search-based applications.
Usage: Understand the concepts of Apache Lucene and the respective APIs. Understand Apache Solr. Learn indexing and searching using Solr.
Assessment: Evaluate the appropriateness and the benefits of using a search engine approach for particular statistical applications.

VISUALIZATION TECHNOLOGIES

TO13. Big data topics: D3 (JavaScript), Shiny (R), Bokeh (Python).
Familiarity: Effectively use data visualization tools. Be aware of data visualization libraries.
Usage: Design effective data visualizations. Be able to effectively work with D3: selections, SVGs, data binding, styling with D3, scaling with D3, interactive visualizations. Be able to build Shiny apps and customize reactions and appearance. Be able to effectively use Bokeh for statistical purposes.
Assessment: Evaluate visualization outcomes. Be able to select an appropriate data visualization tool/library for statistical purposes.

HADOOP

TO14. Big data topics: Apache Hadoop, RHIPE, YARN.
Familiarity: Master the concepts of HDFS and the MapReduce framework. Understand the Hadoop 2.x architecture.
Usage: Write complex MapReduce programs. Learn data loading techniques using Apache Sqoop and Apache Flume. Perform data analytics using Pig, Hive and YARN. Implement HBase and MapReduce integration. Implement advanced usage and indexing. Schedule jobs using Oozie. Implement best practices for Hadoop development. Understand Spark and its ecosystem. Work on a real-life project on big data analytics.
Assessment: Evaluate the outcomes of the performed task. Evaluate the benefits of using Hadoop technologies for statistical tasks.

DATA MINING TOOLS AND TECHNOLOGIES

TO15. Big data topics: Apache Mahout, BigInsights, BigML, Google Prediction, LIBSVM, Orange, RapidMiner, Scikit-learn, Spark MLlib, Vowpal Wabbit, Weka.
Familiarity: Be able to identify problems that can be addressed via data mining methods. Understand data mining techniques. Have a working knowledge of the strengths and limitations of modern data mining methods (algorithms). Be able to select the appropriate tool for data mining tasks.
Usage: Know how to set up data for data mining experiments. Solve statistical problems with specific data mining tools. Use advanced data mining options with specific libraries.
Assessment: Perform evaluation of the selected technologies. Be able to predict the possible challenges that particular technologies bring for a specific task.
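To ground the Usage-level outcome for TO9 ("build data pipelines and query large data sets using Spark SQL and DataFrames"), the following is a minimal PySpark sketch; the file path and column names are illustrative placeholders.

# Minimal Spark SQL / DataFrame pipeline sketch (illustrative path and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("to9-sketch").getOrCreate()

df = spark.read.csv("data/transactions.csv", header=True, inferSchema=True)

# Aggregate with the DataFrame API...
monthly = df.groupBy("month").agg(F.avg("amount").alias("avg_amount"))

# ...or express the same query in Spark SQL over the same data.
df.createOrReplaceTempView("transactions")
monthly_sql = spark.sql(
    "SELECT month, AVG(amount) AS avg_amount FROM transactions GROUP BY month")

monthly.show()
spark.stop()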

5.4 Content development in the area of Big Data

5.4.1 ESTP content
In this chapter, possible content sources relevant to the training objectives specified in section 5.3 are identified.

The European Statistical Training Programme (ESTP) contains a set of courses developed especially for statistical offices in Europe. Table 16 presents a set of identified ESTP courses that could possibly be relevant for the big data training programme.

Table 16: Related ESTP Courses

ESTP course | Date and time
Introduction to Big Data and its tools | 24 - 26 January 2017
Presentation, Facilitation and Consultation Skills for Statistical Trainers – Introductory Course | 31 January - 2 February 2017
The Use of R in Official Statistics | 4 - 7 April 2017
Can a Statistician become a Data Scientist? | 16 - 18 May 2017
Machine Learning Econometrics | 12 - 14 June 2017
Hands-on immersion on Big Data tools | 19 - 22 June 2017
Big data sources - Web, Social media and text analytics | 18 - 21 September 2017
Introduction to Linked Open Data | 28 - 29 September 2017
Automated collection of online prices: sources, tools and methodological aspects | 23 - 26 October 2017
Advanced Big Data Sources - Mobile phone and other sensors | 6 - 9 November 2017

In Table 17 specific training objectives for each ESTP course were identified.

Table 17: ESTP Content Mapped to Training Objectives ESTP course Training objectives Introduction to Big Data and its tools TO1 (Privacy), TO2(Visualization technologies), TO6 (Distributed computing), TO14 (Hadoop) Presentation, Facilitation and Consultation Skills for Statistical TO1 (Soft skills, Delivery of results) Trainers – Introductory Course

The Use of R in Official Statistics TO5 (R)

Can a Statistician become a Data Scientist? TO2 (Data processing, Data visualization), TO8 (Web technologies/Web scraping)

Machine Learning Econometrics TO8 (Machine learning)

Hands-on immersion on Big Data tools TO6 (Distributed computing), TO9 (Spark), TO10 (Hive), TO11 (NoSQL), TO8 (Web technologies/Web scraping), TO14 (Hadoop)


Big data sources - Web, Social media and text analytics TO8 (Web technologies/Web scraping), TO2 (Data processing), TO8 (Data mining), TO8 (NLP)

Introduction to Linked Open Data TO8 (Upper-level technologies)

Automated collection of online prices: sources, tools and methodological aspects TO8 (Web technologies/Web scraping), TO2 (Nowcasting and projections)

Advanced Big Data Sources - Mobile phone and other sensors TO8 (Multimedia analysis), TO2 (Visualization), TO8 (IoT), TO8 (Stream processing), TO5 (R), TO5 (Python)

5.4.2 Data science content dashboard
A further source of content is the EDSA dashboard [28], which provides a set of recent job postings and training materials for a given query.

Figure 20: EDSA dashboard – Hadoop search


Figure 21: EDSA dashboard – Hadoop-related trainings and videolectures

Figure 20 and Figure 21 demonstrate the demand for jobs and the supply of training materials in the area of data science in Europe.

Following the defined training objectives, the content (in the form of MOOC courses and available online training materials for each objective) is specified in Table 18.

Table 18: Web Content Mapped to Training Objectives

TO1 – SOFT SKILLS

Communicating Business Analytics Results https://www.coursera.org/learn/communicating-business-analytics-results

Oral Communication for Engineering Leaders https://www.coursera.org/learn/oral-communication

Research Report: Delivering Insights https://www.coursera.org/learn/research-report

Communicating Complex Information: Presenting Your Ideas Clearly and Effectively https://www.futurelearn.com/courses/communicating-complex-information?lr=29


Effective Problem-Solving and Decision-Making https://www.coursera.org/learn/problem-solving

Using Creative Problem Solving https://www.futurelearn.com/courses/creative-problem-solving?lr=20

Creative Leadership for Effective Leaders (Foundation Level) https://www.openlearning.com/courses/creativethinkingandcreativeproblemsolving

TO2 – STATISTICAL AND DATA SCIENCE TASKS

Data Science Fundamentals https://bigdatauniversity.com/learn/data-science

Where to go Online to Find the Data: Data Analysis for Journalists http://www.newsu.org/courses/find-online-data

Data Processing Using Python https://www.coursera.org/learn/python-data-processing

Interactive Data Visualization for the Web http://alignedleft.com/tutorials/d3

Data Visualization and D3.js https://www.udacity.com/course/data-visualization-and-d3js--ud507

Data Tells a Story: Reading Data in the Social Sciences and Humanities https://www.futurelearn.com/courses/data-explosion?lr=161

Building Data Visualization Tools https://www.coursera.org/learn/r-data-visualization

Fundamentals of Visualization with Tableau https://www.coursera.org/learn/data-visualization-tableau

Data Warehouse Concepts, Design, and Data Integration https://www.coursera.org/learn/dwdesign

Relational Database Support for Data Warehouses https://www.coursera.org/learn/dwrelational

Design and Build a Data Warehouse for Business Intelligence Implementation https://www.coursera.org/learn/data-warehouse-bi-building

Practical Predictive Analytics: Models and Methods https://www.coursera.org/learn/predictive-analytics

TO3 – ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES)

Introduction to Project Management https://www.coursesites.com/webapps/Bb-sites-course-creation-BBLEARN/handleSelfEnrollment.htmlx?course_id=_239834_1

Fundamentals of Project Planning and Management https://www.futurelearn.com/courses/fundamentals-of-project-planning-and-management?lr=87

Software Product Management Capstone https://www.coursera.org/learn/software-product-management-capstone

Fundamentals of Management https://www.coursera.org/learn/fundamentals-of-management

Data Management and Visualization https://www.coursera.org/learn/data-visualization

TO4 – INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES)

Agile Planning for Software Products https://www.coursera.org/learn/agile-planning-for-software-products

Software Processes and Agile Practices https://www.coursera.org/learn/software-processes-and-agile-practices

Mastering Software Development in R Capstone https://www.coursera.org/learn/r-capstone

Software Debugging https://www.udacity.com/course/software-debugging--cs259?utm_medium=referral&utm_campaign=api

Systems Thinking and Complexity https://www.futurelearn.com/courses/systems-thinking-complexity?lr=111

TO5 – PROGRAMMING LANGUAGES

Data Science and Machine Learning Bootcamp with R https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/data-science-and-machine-learning-bootcamp-with-r

R for Data Science http://r4ds.had.co.nz

Statistics with R Capstone https://www.coursera.org/learn/statistics-project

R Programming https://www.coursera.org/learn/r-programming

The R Programming Environment https://www.coursera.org/learn/r-programming-environment



Python Data Science Handbook https://jakevdp.github.io/PythonDataScienceHandbook

Capstone: Retrieving, Processing, and Visualizing Data with Python https://www.coursera.org/learn/python-data-visualization

Programming for Everybody (Getting Started with Python) https://www.coursera.org/learn/python

Python Data Structures https://www.coursera.org/learn/python-data

Using Python to Access Web Data https://www.coursera.org/learn/python-network-data

Using Databases with Python https://www.coursera.org/learn/python-databases

Introduction to Data Science in Python https://www.coursera.org/learn/python-data-analysis

Official Python Tutorial https://docs.python.org/3/tutorial/index.html

Functional Programming Principles in Scala https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=40328&murl=https://www.coursera.org/learn/progfun1

Scala by Example http://www.scala-lang.org/docu/files/ScalaByExample.pdf

Scala: Learn by Example https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/learn-by-example-scala

Scala School http://twitter.github.io/scala_school

Programming in Scala http://www.artima.com/pins1ed

Effective Scala http://twitter.github.io/effectivescala

Introduction to Programming and Problem Solving Using Scala https://hackr.io/tutorial/introduction-to-programming-and-problem-solving-using-scala

Learning Scala - Joel Abrahamsson http://joelabrahamsson.com/learning-scala

Scala Exercises https://www.scala-exercises.org

TO6 – ARCHITECTURE TOOLS AND TECHNOLOGIES

Distributed Programming in Java https://www.coursera.org/learn/distributed-programming-in-java

Advanced Operating Systems https://www.udacity.com/course/advanced-operating-systems--ud189?utm_medium=referral&utm_campaign=api

Intro to Hadoop and MapReduce https://www.udacity.com/course/intro-to-hadoop-and-mapreduce--ud617?utm_medium=referral&utm_campaign=api

Deploying a Hadoop Cluster https://www.udacity.com/course/deploying-a-hadoop-cluster--ud1000?utm_medium=referral&utm_campaign=api

Software Architecture for the Internet of Things https://www.coursera.org/learn/iot-software-architecture

TO7 – CLOUD TOOLS AND TECHNOLOGIES

Cloud Computing Concepts, Part 1 https://www.coursera.org/learn/cloud-computing

Cloud Computing Concepts: Part 2 https://www.coursera.org/learn/cloud-computing-2

Big Data, Cloud Computing, & CDN Emerging Technologies https://www.coursera.org/learn/big-data-cloud-computing-cdn

Cloud Computing Project https://www.coursera.org/learn/cloud-computing-project

Cloud Computing Applications, Part 1: Cloud Systems and Infrastructure https://www.coursera.org/learn/cloud-applications-part1

Cloud Networking https://www.coursera.org/learn/cloud-networking

TO8 – UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES

Algorithmic Thinking (Part 1) https://www.coursera.org/learn/algorithmic-thinking-1

Algorithmic Thinking (Part 2) https://www.coursera.org/learn/algorithmic-thinking-2

Learn Algorithms by Solving Challenges https://www.learneroo.com/subjects/8

Introduction to Algorithms and Data structures in C++ https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/introduction-to-algorithms-and-data-structures-in-c

Machine Learning A-Z: Hands-On Python & R in Data Science https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/machinelearning

Machine Learning Foundations: A Case Study Approach https://www.coursera.org/learn/ml-foundations

Neural Networks for Machine Learning https://www.coursera.org/learn/neural-networks

Applied Machine Learning in Python https://www.coursera.org/learn/python-machine-learning

Machine Learning with Python https://hackr.io/tutorial/machine-learning-with-python

Practical Machine Learning https://www.coursera.org/learn/practical-machine-learning

Serverless Machine Learning with Tensorflow on Google Cloud Platform https://www.coursera.org/learn/serverless-machine-learning-gcp

Machine Learning https://www.coursera.org/learn/machine-learning

Machine Learning for Data Analysis https://www.coursera.org/learn/machine-learning-data-analysis

Machine Learning with Big Data https://www.coursera.org/learn/big-data-machine-learning


Machine Learning: Classification https://www.coursera.org/learn/ml-classification

Intro to Machine Learning https://www.udacity.com/course/intro-to-machine-learning--ud120?utm_medium=referral&utm_campaign=api

Predictive Modeling and Analytics https://www.coursera.org/learn/predictive-modeling-analytics

Pattern Discovery in Data Mining https://www.coursera.org/learn/data-patterns

Data Mining Project https://www.coursera.org/learn/data-mining-project

Deep Learning http://www.deeplearningbook.org

Neural Networks and Deep Learning https://www.coursera.org/learn/neural-networks-deep-learning

Deep Learning https://www.udacity.com/course/deep-learning--ud730?utm_medium=referral&utm_campaign=api

Applied Text Mining in Python https://www.coursera.org/learn/python-text-mining

Introduction to Natural Language Processing https://www.coursera.org/learn/natural-language-processing

Applied Social Network Analysis in Python https://www.coursera.org/learn/python-social-network-analysis

Social Media Analytics: Using Data to Understand Public Conversations https://www.futurelearn.com/courses/social-media-analytics?lr=137

Intro to Artificial Intelligence https://www.udacity.com/course/intro-to-artificial-intelligence--cs271?utm_medium=referral&utm_campaign=api

Artificial Intelligence https://www.udacity.com/course/artificial-intelligence--ud954?utm_medium=referral&utm_campaign=api

Learn the fundamentals of Artificial Intelligence http://www.awin1.com/cread.php?awinmid=6798&awinaffid=428263&p=https://www.edx.org/course/artificial-intelligence-ai-columbiax-csmm-101x-0

MIT Open Courseware - Artificial Intelligence https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/lecture-videos

Intro to AI — UC Berkeley CS188 http://ai.berkeley.edu/lecture_videos.html

Advanced AI: Deep Reinforcement Learning in Python https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/deep-reinforcement-learning-in-python

Internet of Things & Augmented Reality Emerging Technologies https://www.coursera.org/learn/iot-augmented-reality-technologies

Big Data, Cloud Computing, & CDN Emerging Technologies https://www.coursera.org/learn/big-data-cloud-computing-cdn

Internet of Things: Setting Up Your DragonBoard™ Development Platform https://www.coursera.org/learn/internet-of-things-dragonboard

Internet of Things: How did we get here? https://www.coursera.org/learn/internet-of-things-history

Internet of Things Capstone: Build a Mobile Surveillance System https://www.coursera.org/learn/internet-of-things-capstone

Introduction to the Internet of Things and Embedded Systems https://www.coursera.org/learn/iot

Programming for the Internet of Things Project https://www.coursera.org/learn/internet-of-things-project

Embedded Systems https://www.udacity.com/course/embedded-systems-- ud169?utm_medium=referral&utm_campaign=api

Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform https://www.coursera.org/learn/leveraging-unstructured-data-dataproc-gcp

TO9 – STATISTICS AND BUSINESS INTELLIGENCE

Taming Big Data with Apache Spark and Python https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/taming-big-data-with-apache-spark-hands-on

Apache Spark in Python: Beginner's Guide https://www.datacamp.com/community/tutorials/apache-spark-python#gs.fMIIqxM

Apache Spark 2.0 with Scala https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/apache-spark-with-scala-hands-on-with-big-data

Scalable Programming with Scala and Spark https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/scalable-programming-with-scala-and-spark

Big Data Analysis with Scala and Spark https://www.coursera.org/learn/scala-spark-big-data

Statistical Analysis for Educational Researchers https://www.openlearning.com/courses/statisticalanalysis

TO10 – DATA MANAGEMENT

Getting Started with Apache Cassandra https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/apache-cassandra

TO11 – DATABASES

Database Systems Concepts & Design https://www.udacity.com/course/database-systems-concepts-design--ud150?utm_medium=referral&utm_campaign=api

An Introduction to Database https://www.openlearning.com/courses/databaseanintroduction

Database Fundamentals http://www.microsoftvirtualacademy.com/training-courses/database-fundamentals

Database Management Essentials https://www.coursera.org/learn/database-management

Data Manipulation at Scale: Systems and Algorithms https://www.coursera.org/learn/data-manipulation

Intro to Relational Databases https://www.udacity.com/course/intro-to-relational-databases--ud197?utm_medium=referral&utm_campaign=api

Relational Database Support for Data Warehouses https://www.coursera.org/learn/dwrelational

SQL basics by Khan Academy https://www.khanacademy.org/computing/computer-programming/sql/sql-basics

A beginner's guide to thinking in SQL http://www.sohamkamani.com/blog/2016/07/07/a-beginners-guide-to-sql


Get Started with SQL Programming http://www.ntu.edu.sg/home/ehchua/programming/sql/MySQL_HowTo.html

SQL Tutorial by Tutorials Point http://www.tutorialspoint.com/sql/sql_tutorial.pdf

Learn SQL the Hard Way https://learncodethehardway.org/sql

Try SQL http://campus.codeschool.com/courses/try-sql/contents

Managing Big Data with MySQL https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=40328&murl=https://www.coursera.org/learn/analytics-mysql

SQL for Newbies: Data Analysis for Beginners https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/sql-for-newbs

NoSQL Databases (General) http://www.christof-strauch.de/nosqldbs.pdf

Server-side Development with NodeJS, Express and MongoDB https://www.coursera.org/learn/server-side-nodejs

Web Application Development with JavaScript and MongoDB https://www.coursera.org/learn/web-application-development

Data Wrangling with MongoDB https://www.udacity.com/course/data-wrangling-with-mongodb--ud032?utm_medium=referral&utm_campaign=api

The MongoDB Manual http://docs.mongodb.org/manual

The Complete Developer's Guide to MongoDB https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/the-complete-developers-guide-to-mongodb

The Little MongoDB Book http://openmymind.net/2011/3/28/The-Little-MongoDB-Book

MongoDB Tutorial for Beginners https://hackr.io/tutorial/mongodb-tutorial-for-beginners

MongoDB for Node.js Developers https://university.mongodb.com/courses/M101JS/about


React Native with an Express/MongoDB Backend https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/build-your-next-app-with-react-native-and-express

MongoDB for Beginners Tutorials https://hackr.io/tutorial/mongodb-for-beginners-tutorials

TO12 – SEARCH TECHNOLOGIES

Text Retrieval and Search Engines https://www.coursera.org/learn/text-retrieval

Introduction to Search Engine Optimization https://www.coursera.org/learn/search-engine-optimization

Search Engine Optimization Fundamentals https://www.coursera.org/learn/seo-fundamentals

TO13 – VISUALIZATION TECHNOLOGIES

Building Data Visualization Tools https://www.coursera.org/learn/r-data-visualization

Interactive Data Visualization for the Web http://alignedleft.com/tutorials/d3

Data Visualization and D3.js https://www.udacity.com/course/data-visualization-and-d3js--ud507

TO14 – HADOOP

Big Data Integration and Processing https://www.coursera.org/learn/big-data-integration-processing

Deploying a Hadoop Cluster https://www.udacity.com/course/deploying-a-hadoop-cluster--ud1000?utm_medium=referral&utm_campaign=api

Hadoop Platform and Application Framework https://www.coursera.org/learn/hadoop

Hadoop Illuminated http://hadoopilluminated.com/index.html

Hadoop Tutorial http://www.tutorialspoint.com/hadoop/index.htm

Become a Hadoop Developer https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/hadoop-tutorial/


The Ultimate Hands-On Hadoop https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/the-ultimate-hands-on-hadoop-tame-your-big-data/

Hadoop Tutorial by Tutorials Point http://www.tutorialspoint.com/hadoop/hadoop_tutorial.pdf

Bigdata and Hadoop Tutorial https://hackr.io/tutorial/bigdata-and-hadoop-tutorial

Hadoop by Durga Software https://hackr.io/tutorial/hadoop-by-durga-software

TO15 – DATA MINING TOOLS AND TECHNOLOGIES

Data Mining Project https://www.coursera.org/learn/data-mining-project

Pattern Discovery in Data Mining https://www.coursera.org/learn/data-patterns

Data Visualization https://www.coursera.org/learn/datavisualization

Introduction to Data Science in Python https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=40328&murl=https://www.coursera.org/learn/python-data-analysis

Data Science Fundamentals https://bigdatauniversity.com/learn/data-science

6. Strategic analysis of bridging the gap via training

6.1 Training channels

6.1.1 Advantages and disadvantages of face-to-face training
Face-to-face training courses, such as those provided by the ESTP [29], are beneficial from the following perspectives:

- Networking. Real-life human interaction with another person is an important feature of training and increases networking opportunities.
- Engagement. Face-to-face training tends to be more focused on concrete tasks at a concrete moment in time.
- Discussion. It is easy to have an open discussion in face-to-face training.
- Specificity. Face-to-face trainings can be adapted to the requirements of a concrete group.
- Feedback. In face-to-face training it is easy to obtain feedback and help from the instructor.


However, there are a number of disadvantages associated with face-to-face trainings, such as:

- Unsuitable for some people. Face-to-face trainings can be unsuitable in terms of time constraints and costs.
- Unsuitable for large audiences. Face-to-face trainings cannot cover very large groups of people.
- Unsuitable for large organizations. Face-to-face training is less suitable for large organizations whose branches are located at different sites.
- Low reference value. Face-to-face communication is oral; no written records are kept.
- Poor retention by the listener. The listener cannot process all of the information given in face-to-face training.

6.1.2 European Statistics Training Programme
The European Statistical Training Programme (ESTP) [29] is the main training channel of Eurostat concerning European statistics. The ESTP contains a set of courses, some of which partially cover the specified training objectives. In particular, Table 17 shows that ESTP courses targeted at soft skills, at data science and statistical tasks, and at big data tools and technologies already exist.

6.1.3 European Master in Official Statistics
The European Master in Official Statistics (EMOS) [30] is a project aimed at developing a programme for the training and education of potential future official statisticians within existing Master programmes at European universities.

An EMOS-labelled Master programme is made up of four main parts:

- EMOS module (approx. 10% of ECTS credits);

- Semi-elective courses (approx. 30% of ECTS credits);

- Elective courses (approx. 25% of ECTS credits);

- Internship and Master thesis (approx. 35% of ECTS credits).

EMOS provides students with advanced training in the area of statistics in general and official statistics in particular. The programme offers complementary quantitative and statistical tools and enhances students' ability to understand and analyse European official statistics at different levels: quality, production process, dissemination, and analysis in a national, European and international context.

6.1.4 Online Learning
E-learning theory describes the cognitive science principles of effective multimedia learning using electronic educational technology [31].

The possible online training channels are described below.

Webinars

A webinar [32] is an event held on the internet and attended by an online audience. Video can be broadcast in sync with PowerPoint slides, and screen capture can be used.


A webinar is a form of one-to-many communication: a presenter can reach a large and specific group of online viewers from a single location. Webinars are widely attended. The participants use the following interactive opportunities:

- Ask a question
- Chat
- Poll
- Survey
- Test
- Call to action
- Twitter

MOOCs

A massive open online course (MOOC) [33] is an online course aimed at unlimited participation and open access via the web. MOOCs provide recorded video lectures, problem sets, quizzes and other materials, and interaction takes place in user forums. MOOCs are characterized by massive enrolments. Consequently, MOOCs require instructional design that facilitates large-scale feedback and interaction, such as peer review, group collaboration and automated feedback through objective online assessments, e.g. quizzes and exams.

Videolectures

Videolectures.net [34] is an award-winning, free and open access educational video lectures repository. The lectures are given by distinguished scholars and scientists at important and prominent events, such as conferences, summer schools, workshops and science promotional events, from many fields of science. The portal is aimed at promoting science, exchanging ideas and fostering knowledge sharing by providing high-quality didactic content not only to the scientific community but also to the general public. All lectures, accompanying documents, information and links are systematically selected and classified through an editorial process that also takes users' comments into account.

Table 19 presents a sample of videolectures related to big data available on videolectures.net.

Table 19: Videolectures in the Area of Big Data

Big-Data Tutorial – 14132 views
BigData and MapReduce with Hadoop – 1937 views
Big Data Clustering – 1930 views
Mining Big Data in Real Time – 945 views
On Big Data Algorithmics – 799 views
Text Analytics and Big Data – 531 views
Sampling for Big Data – 385 views
Big Data – Big opportunities – Big risks? And what about Europe? – 253 views
Making (Big) Data – 153 views
Technological challenges of Big-Data – 64 views


Personalized learning portal

A personalized learning portal provides educators, administrators and learners with a single robust, secure and integrated system to create personalized learning environments. The software provided by such platforms can be installed on web servers. Moodle [35] is one such platform, designed to support both teaching and learning. It is free, with no licensing fees, easy to use, and has multilingual capabilities. The Moodle project is well supported by an active international community.

The EDSA online courses portal [36] is based on the Moodle Learning Management System. A Learning Management System (LMS) is an online software application offering facilities for student registration, enrolment into courses, delivery of learning materials to students, student assessment and progress monitoring. Moodle is an open-source learning platform that has been adopted by numerous educational institutions worldwide, including the Open University, and currently has more than 79 million users across the academic and enterprise sectors, making it the world's most widely used learning platform. Additionally, as open source it has attracted a sizeable community of developers, which offers a wide range of free and open plugins that extend and enrich the functionality provided by Moodle.

Figure 22 presents a snapshot of EDSA online courses portal.

Figure 22: EDSA online courses portal


Figure 23 presents an example of an EDSA learning pathway, generated automatically from the skills selected by the user. In this way, users can follow their own learning pathways based on their qualifications and intentions.

Figure 23: EDSA learning pathways

6.1.5 Other possible training channels
Other possible training channels include training workshops and self-study materials. Training workshops have the benefits of face-to-face training and, at the same time, strengthen the collaboration and networking aspects of training.

A blended training format combines elements of face-to-face training with supporting online learning.

Training workshops

A training workshop is a type of interactive training where participants carry out a number of training activities rather than passively listen to a lecture or presentation [37].

Self-study materials

An electronic book (or e-book) is a digital book, consisting of text, images, or both, readable on the flat-panel display of computers or other electronic devices. E-books (ePUB format) [38] can be downloaded and used even without an internet connection on iPads and iPhones (iBooks format), as well as on other tablets and smartphones (ePUB format).


6.2 Defining strategic training plan for Big Data in official statistics

6.2.1 ADDIE model
ADDIE is an instructional system design framework for training developers [39, 40]. The ADDIE model comprises five phases: Analysis, Design, Development, Implementation and Evaluation, described in detail below.

The ADDIE model has a number of specific goals:

- Evaluation of the trainees' needs;
- Design and development of training materials;
- Ensuring that trainees reach the training objectives and achieve the defined learning outcomes;
- Evaluation of the training process.

6.2.2 Analysis
Initial analysis and assessment of training needs is one of the key stages of the ADDIE model. Section 2 covered an analysis of the skills required for big data tasks in the statistical domain, while Sections 3 and 4 described the big data training needs from the perspective of the focal points in statistical offices. The training objectives have been defined in Section 5.

The outcomes for the strategic training plan are usually grouped into short-, medium- and long-term outcomes, which can be further expanded (Table 20):

Table 20: Expected Outcomes for Big Data in Statistics

Short term (change in knowledge, skills, attitude, motivation, awareness): The relevant personnel in statistical offices around Europe will become aware of the existence of big data and its possible applications in statistics. A change of attitude towards using new and emerging technologies in statistics should take place among NSI personnel. The relevant NSI personnel will acquire skills for working with big data depending on their background, motivation and the training needs of the particular NSI: statisticians at the level of Familiarity, IT experts at the level of Usage, and managers at the level of Assessment.

Medium term (change in behaviours, practices, policies, procedures): The level of expertise in working with big data will be continuously supported in NSIs around Europe. Changes in behaviour and practices around the use of big data and other emerging technologies for statistics will be observed. New standards, policies and procedures will be adopted.

Long term (change in situation: environment, social and economic conditions): The overall situation in the statistical domain in Europe will change, with knowledge and new technologies driving statistical production.

6.2.3 Design
In the Design phase of training development, it is defined how the training courses should look in order to meet the needs identified in the Analysis phase.

In particular, for the purposes of big data training for NSIs in Europe, the Design phase should answer the following questions:

- How should the short-term, medium-term and long-term training programme be implemented?
- Which topics/objectives could be covered via face-to-face trainings (ESTP, workshops) in the short, medium and long term?
- Which online mechanisms can be implemented in the short, medium and long term?
- Which topics/objectives should have a blended training format (face-to-face with supportive online materials) in the short, medium and long term?

For the specific courses:

- How should course content be organized?
- How should ideas be presented to participants?
- What delivery formats should be used?
- What types of activities and exercises will be most suitable for participants?
- How should the trainees be evaluated?

The basic steps in designing a specific course are the following:

- Planning the instructional strategy
- Selecting the course format
- Writing the instructional design document

6.2.4 Development
In the Development phase, training contents are created and assembled, and course materials are produced according to the decisions made during the Design and Analysis phases.

This includes determining and developing appropriate activities and evaluation. The Development phase of the preparation of a specific course can be broken down into the following five components:

- Reviewing/revising existing information sources and training materials;
- Selecting appropriate methods and media;
- Developing all new course material;
- Validating course materials; and
- Developing an Instructional Management Plan.


In particular, the content suggested as part of deliverable D4.4 could provide the basis for course development in the area of big data.

6.2.5 Implementation
The Implementation phase follows the Development phase and ensures that:

- the course meets important business goals;
- the course covers the content that learners need to know;
- the course reflects the learners' existing capabilities.

The Implementation phase raises a number of issues common to face-to-face training and a set of issues specific to online training or e-learning.

Common Issues

Course Materials

- How many copies of the course materials need to be printed?
- Will course materials be printed in-house or outsourced to a printer?
- How will course materials be delivered and who will be responsible?

Instructors

- How many trainers will be needed for the project?
- Will the trainers come from an in-house team or from an outside provider?
- Will the project require the trainers to travel?
- Should the trainers be geographically based?
- How will the instructors learn to teach this course?
- Will the project require a train-the-trainer session?
- When and how will trainers receive their schedule?
- Who will be the technical contact for trainers?
- Can enhanced/leveraged use of multimedia training/partnerships be included?

Course Schedule

- Where will the courses be offered?
- On what dates and times will the course be offered?
- How will this schedule be communicated?

Classroom Space

- Will the classroom require any specific technology (computers, light box, etc.)?
- Will the classroom require desks, tables or just chairs?

Registration

- How will learners be enrolled for the course?
- How will course rosters be tracked?
- How will rosters be communicated to instructors?


- How will instructors record attendance and test scores?
- Will this course be entered into a learning management system?

Logistics

- Who will manage training administration?
- Who will manage training logistics?
- Who will be responsible for collecting and communicating these statistics?

E-learning Issues

Hosting

- Where will the course be hosted?
- How many learners will need to access the course in total?
- How many learners will need to access the course at any one time?

Access

- How will learners enroll for the course?
- Will learners be able to access the course through the web, or will they need to connect to an intranet?

Learners’ Computers

- Who will ensure all sites have internet-ready computers?
- Who will ensure that learners have all necessary applications loaded onto their computers?
- Will learners need to download any applications or plug-ins?

The implementation of the training strategy is carried out based on the decisions taken in the Design and Analysis phases described above.

6.2.6 Evaluation
The outcome of a training seminar can be measured against the following metrics:

- Were the goals set out in the Analysis phase met?
- Was an improvement in the targeted set of skills observed?
- Was there an increase in training attendance?


7. References

[1] Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao et al., Big data analytics: a survey. Journal of Big Data (2015) 2: 21, doi:10.1186/s40537-015-0030-3.

[2] Nada Elgendy, Ahmed Elragal, Big Data Analytics: A Literature Review. Advances in Data Mining. Applications and Theoretical Aspects, Vol. 8557 (2014), pp. 214-227, doi:10.1007/978-3-319-08976-8_16.

[3] Kubick, W.R.: Big Data, Information and Meaning. In: Clinical Trial Insights (2012), pp. 26–28.

[4] Paul MacDonnell and Daniel Castro. Europe Should Embrace the Data Revolution. Center for Data Innovation (2016), http://www2.datainnovation.org/2016-europe-embrace-data-revolution.pdf.

[5] IDG Enterprise Data and Analytics Survey 2016, http://core0.staticworld.net/assets/2016/06/29/idge- data-analysis-2016.pdf.

[6] EDSA project, http://edsa-project.eu (accessed in January 2017).

[7] SARO ontology, http://eis.iai.uni-bonn.de/Projects/SARO.html (accessed in January 2017).

[8] BDVA reports, http://www.bdva.eu/?q=big-data-reports (accessed in January 2017).

[9] O'Reilly's 2016 Data Science Salary Survey, http://www.oreilly.com/data/free/2016-data-science- salary-survey.csp?intcmp=il-data-free-lp-lgen_free_reports_page.

[10] Adzuna API, https://developer.adzuna.com/overview (accessed in January 2017).

[11] L. Ratinov, D. Roth, D. Downey, and M. Anderson, Local and global algorithms for disambiguation to Wikipedia. ACL (2011).

[12] JSI Wikifier, http://wikifier.org (accessed in January 2017).

[13] GeoNames ontology, http://www.geonames.org/ontology/documentation.html (accessed in January 2017).

[14] Microsoft Academic Graph, https://www.microsoft.com/en-us/research/project/microsoft- academic-graph (accessed in April, 2017).

[15] Ontogen tool, ontogen.ijs.si (accessed in January 2017).

[16] Letheby R.S., Nicholson D., The ABS statistical capability framework – the first step in transforming the statistical capability learning environment, http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.54/2014/Australia_The_ABS_Statistical_Capability_Framework_01.pdf (accessed January 2017).

[17] EDISON Project: Building the Data Science Profession [online] http://edison-project.eu

[18] B. S. Bloom, M. D. Engelhart, E. J. Furst, W. H. Hill, D. R. Krathwohl (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York: David McKay Company.

[19] D. A. Kolb. Experiential learning: experience as the source of learning and development. Prentice-Hall, 1984.


[20] T. W. Wlodarczyk, T. J. Hacker, "Problem-Based Learning Approach to a Course in Data Intensive Systems." Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on. IEEE, 2014.

[21] The Aalborg Model for Problem Based Learning (PBL) [online] http://www.en.aau.dk/education/problem-based-learning

[22] J. Biggs, “Enhancing teaching through constructive alignment,” Higher education, vol. 32, no. 3, pp. 347–364, 1996.

[23] M. Ben-Ari, “Constructivism in computer science education,” Journal of Computers in Mathematics and Science Teaching, vol. 20, no. 1, pp. 45–73, 2001.

[24] Data Science Competence Framework [online] http://edison-project.eu/data-science-competence- framework-cf-ds

[25] Information Technology Competency Model of Core Learning Outcomes and Assessment for Associate-Degree Curriculum (2014) http://www.capspace.org/uploads/ACMITCompetencyModel14October2014.pdf

[26] A. Sasha Thackaberry, A CBE Overview: The Recent History of CBE [online] http://evolllution.com/programming/applied-and-experiential-learning/a-cbe-overview-the-recent- history-of-cbe

[27] Computer Science 2013: Curriculum Guidelines for Undergraduate Programs in Computer Science http://www.acm.org/education/CS2013-final-report.pdf

[28] EDSA Dashboard [online] http://jobs.videolectures.net

[29] ESTP program [online] http://ec.europa.eu/eurostat/web/european-statistical-system/training- programme-estp

[30] EMOS project [online] http://www.cros-portal.eu/content/emos

[31] R.E. Mayer, R. Moreno (1998). A Cognitive Theory of Multimedia Learning: Implications for Design Principles (PDF).

[32] What is a webinar? [online] https://www.webinar.nl/en/webinars/what-is-a-webinar

[33] Massive open online course [online] https://en.wikipedia.org/wiki/Massive_open_online_course

[34] VideoLectures.NET [online] http://videolectures.net

[35] About Moodle [online] https://docs.moodle.org/34/en/About_Moodle

[36] EDSA courses portal [online] http://courses.edsa-project.eu

[37] R. L. Jolles (2005). How to Run Seminars and Workshops (3 ed.). John Wiley & Sons. pp. 5, 12, 48, 155, 320. ISBN 978-0-471-71587-0. Retrieved 2014-11-23.


[38] E-book [online] https://en.wikipedia.org/wiki/E-book

[39] G. R. Morrison (2010). Designing Effective Instruction, 6th Edition. John Wiley & Sons.

[40] Strategic Training Plan [online] http://www.nj.gov/dep/transformation/enforcement/docs/092811trn/Strategic%20Training%20Plan%2 09-27-2011.pdf

[41] I. Novalija, M. Grobelnik. Deliverable D4.1: Report containing a description of the skills required to process and analyse big data sources for the purpose of official statistics. April, 2017.

[42] I. Novalija, M. Grobelnik. Deliverables D4.2 and D4.3: Report containing an analysis of the existing skills in the statistical offices of the ESS, Eurostat and NSIs (Deliverable 4.2), and Report containing an analysis of the training needs of the statistical offices of the ESS, Eurostat and NSIs (Deliverable 4.3). September, 2017.

[43] I. Novalija, M. Grobelnik. Report containing training objectives and content ready to be used in the design of a training programme to supply the statistical offices in Europe with the skills required to use big data sources in statistical production (Deliverable 4.4). November, 2017.

[44] I. Novalija, M. Grobelnik. Report containing a strategic analysis of how the skills gap can be bridged via training (Deliverable 4.5). November, 2017.


8. Annex

8.1 Appendix: Recommended Literature Sources in the Area of Big Data

Credit card data:

[Sobolevsky,et.al.](2015)Predicting Regional Economic Indices using Big Data of Individual Bank Card Transactions

Mobile network data:

[Ahas,et.al](2008)Evaluating passive mobile positioning in tourism

[Blondel,Decuyper,Krings](2015)A survey of results on mobile phone datasets analysis

[Bogomolov,et.al](2014)Once Upon a Crime - Towards Crime Prediction from Demographics and Mobile Data

[Csaji,et.al.](2014)Exploring the Mobility of Mobile Phone Users

[Decuyper, et.al.](2014)Estimating Food Consumption and Poverty indices with Mobile Phone Data

[Deville,et.al.](2014)Dynamic population mapping using mobile phone data

[Diminescu,Licoppe,Smoreda,Ziemlicki](2006)Using Mobile Phone Geolocation Data for the Analysis of Patterns of Coordination

[Fillekes](2014)Reconstructing trajectories from sparse call detail records

[Frias-Martinez,Frias-Martinez,Oliver](2010)A Gender-Centric Analysis of Calling Behavior in a Developing Economy Using Call Detail Records

[Frias-Martinez,Soguero-Ruiz,Josephidou](2013)Forecasting Socioeconomic Trends With Cell Phone Records

[Furletti,et.al](2014)Use of mobile phone data to estimate mobility flows

[Furletti,Gabrielli,Renso,Rinzivillo](2013)Analysis of GSM calls data for understanding user mobility behaviour

[GP](2015)Mapping the risk-utility landscape of mobile phone data

[Hongyan,Fasheng](2013)Estimating freeway traffic measures from mobile phone location data

[Kanasugi,et.al.](2013)Spatiotemporal route estimation consistent with human mobility using cellular network data

[Licoppe,et.al.](2008)Using mobile phone geolocalisation for socio-geographical analysis

[Monsted,Mollgard,Mathiesen](2016)Phone-based Metric as a Predictor for Basic Personality Traits

[Montjoye,et.al.](2013)Predicting Personality Using Novel Mobile Phone-Based Metrics

[Pappalardo, et.al.](2014)Human Mobility, Social Networks and Economic Development


[Pei,et.al](2014)A New Insight into Land Use Classification Based on Aggregated Mobile Phone Data

[Reades,Reades,Ratti](2009)Eigenplaces - analyzing cities using the space-time structure of the mobile phone network

[Rey-del-Castillo,Cardeñosa](2016)An Exercise in Exploring Big Data for Producing Reliable Statistical Information

[Sapiezynski,et.al.](2015)Tracking Human Mobility Using WiFi Signals

[Soto,Frias-Martinez,Virseda](2011)Prediction of Socioeconomic Levels using Cell Phone Records

[Toole,et.al.](2015)Tracking Employment Shocks Using Mobile Phone Data

[Vanhoof](2014)PhD 6 months report

Mobile phone & wearables sensors data:

[AAPOR](2014)Mobile Technologies for Conducting, Augmenting and Potentially Replacing Surveys

[Bouwman,Heerschap,Reuver](2013)Smartphone measurement study 2012

[Fernee,Sonck,Scherpenzeel](2013)Data collection with smartphones - experiences in a time use survey

[Mastrandrea,Fournet,Barrat](2015)Comparison between Data Collected Using Wearable Sensors, Contact Diaries and Friendship Surveys

[Neverova,et.al.](2016)Learning Human Identity from Motion Patterns

[Parslow](2014)How big data could be used to predict a patient's future

Network data:

[Benedictis,Tajoli](2010)Comparing sectoral international trade networks

[Benedictis,Tajoli](2011)The World Trade Network

[Iapadre,Tajoli](2014)Emerging countries and trade regionalization - A network analysis

[Kurka,Godoy,Zuben](2016)Online Social Network Analysis

[Mandel](2014)Connections as a tool for growth - evidence from the LinkedIn economic graph

[Piccardi,Tajoli](2015)Are Preferential Agreements Significant for the World Trade Structure - A Network Community Analysis

Text Analytics:

[Blei](2012)Probabilistic Topic Models

[Gillick,et.al.](2016)Multilingual Language Processing From Byte

[Rehurek,Kolkus](2009)Language Identification on the Web - Extending the Dictionary Method

[Schakel,Wilson](2015)Measuring Word Significance using Distributed Representations of Words


[Sonoda,Daisuke](2015)Predicting Latent Trends of Labels in the Social Media Using Infectious Capacity

[Spaniol,Prytkova,Weikum](2013)Knowledge Linking for Online Statistics

Web data:

[AAPOR](2014)Social Media in Public Opinion Research

[Antenucci,et.al.](2014)Using Social Media to Measure Labor Market Flows

[Arrington](2006) Launches

[Askitas,Zimmermann](2009)Google Econometrics and Unemployment Forecasting

[Bacchini,et.al](2014)Does Google index improve the forecast of Italian labour market

[Banbura,et.al.](2013)Now-casting and the real-time data flow

[Barreira,et.al](2013)Nowcasting with Google Trends in an Emerging Market

[Beiro](2016)Predicting human mobility through the assimilation of social media traces into mobility models

[Berg](2013)Evaluating Quality of Online Behavior Data

[Breton,et.al](2015)Research indices using web scraped data

[Bughin](2011)Nowcasting the Belgian Economy

[Butler](2013)When Google got flu wrong

[Carriere,Labbe](2010)Nowcasting with google trends in an emerging market

[Carriere,Labbe](2013)Nowcasting with google trends in an emerging market

[Chadwik,Sengul](2012)Nowcasting unemployment rate in turkey let's ask Google

[Chamberlin](2010)Googling the present

[Choi,Varian](2009)Predicting Initial Claims for Unemployment Benefits

[Choi,Varian](2009)Predicting the Present with Google Trends

[Choi,Varian](2012)Predicting the Present with Google Trends

[Choudhury,et.al](2010)Sampling Impact on Discovery of Information Diffusion in Social Media

[Compton, Jurgens, Allen](2014)Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization

[Cook,et.al](2011)Assessing Google Flu Trends Performance

[Curti,Iacus,Porro](2015)Measuring Social Well Being in The Big Data Era - Asking or Listening

[Daas,Puts](2014)Social media sentiment and consumer confidence

[DAmuri,Marcucci](2009)Forecasting US unemployment with Google job search index


[DAmuri,Marcucci](2009)Google it - forecasting the US unemployment rate with google job search index

[Dombrovskyi](2014)Using internet search data for nowcasting unemployment rate in Ukraine

[Ettredge,Gerdes,Karuga](2005)Using Web-based Search Data to Predict Macroeconomic Statistics

[European Commission](2010)Internet as data source

[Falorsi,Naccarato,Pierini](2015)Using google trend data to predict the Italian unemployment rate

[Fondeur,Karame](2013)Can google data help predict French unemployment

[Fung](2014)Google Flu Trends Failure Shows Good Data better than Big Data

[Gayo-Avello](2012)A Balanced Survey on Election Prediction using Twitter Data

[Ginsberg,et.al](2009)Detecting influenza epidemics using search engine query data - Supplementary information 1

[Ginsberg,et.al](2009)Detecting influenza epidemics using search engine query data

[Hamid,Heiden](2014)Forecasting Volatility with Empirical Similarity and Google Trends

[Kapounek](2016)Determinants of Foreign Currency Savings - Evidence from Data

[Kholodilin,Podstawski,Siliverstovs,Bürgi](2009)Google Searches as Means of Improving Nowcasts of Macroeconomic Variables

[Kholodilin,Podstawski,Siliverstovs](2010)Do Google searches help in nowcasting private consumption

[Koop,Onorante](2013)Macroeconomic Nowcasting Using Google Probabilities

[Kuhn,Skuterud](2004)Internet Job Search and Unemployment Durations

[Lampos,et.al.](2015)Advances in nowcasting influenzalike illness rates using search query logs

[Lazer, Kennedy, King, Vespignani](2014)The Parable of Google Flu - Traps in big data analysis

[Long,Shen](2014)Population specialization and synthesis with open data

[Mao,Counts,Bollen](2015)Quantifying the effects of online bullishness on international financial markets

[McIver, Brownstein](2014)Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time

[Miao,Ma](2015)The Dynamic Impact of Web Search Volume on Product Sales - An Empirical Study Based on Box Office Revenues

[Milinovich,et.al.](2014)Using internet search queries for infectious disease surveillance

[Mohebbi,et.al](2011)Google Correlate Whitepaper

[Olson,et.al](2013)Reassessing Google Flu Trends Data for Detection of Seasonal and Pandemic Influenza

[Preis,Moat](2014)Adaptive nowcasting of influenza outbreaks using Google searches


[Rivera](2015)Dynamic model to forecast hotel registrations using Google Trends data

[Rubin,Puranmalka](2014)Google insights into FNMA MBS prepayments

[Santillana,et.al](2014)What can disease detection learn from (an external revision to)Google Flu Trends

[Schmidt,Vosen](2011)Forecasting private consumption - survey-based indicators vs Google trends

[Seo,et.al.](2014)Cumulative Query Method for Influenza Surveillance Using Search Engine Data

[Shimshoni,Efron,Matias](2009)On the Predictability of Search Trends

[Siddiqui](2015)Mining wikipedia to rank rock guitarists

[Stilo,Vincenzi,Tozzi,Velardi](2013)Automated Learning of Everyday Patients Language for Medical Blogs Analytics

[The Economist](2014)The Economist explains - The backlash against big data

[Toth,Hajdu](2013)Google as a tool for nowcasting household consumption - estimation on Hungarian data

[Vicente,Menéndez,Pérez](2014)Forecasting unemployment with internet search data - Does it help to improve predictions when job destruction is skyrocketing

[Vosen,Schmidt](2011)Forecasting private consumption survey based indicators vs Google Trends

[Vosen,Schmidt](2012)A monthly consumption indicator for Germany based on Internet search query data

[Wang.et.al.](2014)Forecasting elections with non-representative polls

[Xiaoxuan](2016)Tourism forecasting by search engine data with noise-processing

[Zagheni,Kiran,State](2014)Inferring International and Internal Migration Patterns from Twitter Data

[Zeynalov](2014)Nowcasting Tourist Arrivals to Prague

Wikipedia:

[Ciglan,Nørvåg](2010)WikiPop - Personalized Event Detection System Based on Wikipedia Page View Statistics

[Cozza,Petrocchi,Spognardi](2016)A matter of words - NLP for quality evaluation of Wikipedia medical articles

[Eom,et.al.](2015)Interactions of Cultures and Top People of Wikipedia from Ranking of 24 Language Editions

[Guisado-Gámez,Prat-Pérez](2015)Understanding Graph Structure of Wikipedia for Query Expansion

[Katz,Shapira](2015)Enabling Complex Wikipedia Queries

[Khan,Khan,Mahmood](2015)Cloud service for assessment of news' Popularity in internet based on Google and Wikipedia indicators


[McIver, Brownstein](2014)Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time

[Milne,Witten](2012)An open-source toolkit for mining Wikipedia

[Munzert](2015)Using wikipedia page views statistics to measure issue salience

[Navarrete,Borowiecki](2015)Change in access after digitization - Ethnographic collections in Wikipedia

[Pohl](2012)Improving the wikipedia miner word sense disambiguation algorithm

[Yasseri,Bright](2015)Wikipedia traffic data and electoral prediction - towards theoretically informed models

[Yucesoy,Barabasi](2015)Untangling Performance from Success


8.2 Appendix: Trending Skills by Groups

For each skills group, the trending skills identified in the analysis (presented as bar charts in the source) are listed below.

Statistical tasks: Sampling, Legal acts, Microdata, Calculation, Data access, Aggregation, Data analysis, Selection bias, Quality control, Data processing, Quality reporting, Statistical surveys, Statistical content, Disclosure control, Statistical systems, EU nomenclatures, Statistical analyses, Statistical software, Technical standards, Statistical indicators, Statistical databases, Multivariate analysis, Seasonal adjustment, Estimation techniques, Imputation techniques, Administrative sources, Model-based estimation, Statistical confidentiality, Geographical information…, Nowcasting and projections

Administrative tasks for statistical purposes: Task Forces, Project monitoring, Quality assessment, Administrative rules, Communication and information strategy, Contract negotiation, People management and techniques, Interservice consultation, European Statistical System, Decision-making procedures, Communication instruments, Inter-institutional procedures

Budget tasks for statistical purposes: Contract management, Proposals writing, Public Procurement, Financial regulation and procedures

IT tasks for statistical purposes: Testing, Security, Training, Maintenance, Development, Implementation, Customer support, System Architecture, Statistical databases, Documentation writing, Analysis of requirements, Hardware and infrastructure

Data science tasks: prototype, dashboard, data search, data sharing, data storage, data capture, data cleaning, data analysis, data transfer, data platform, data querying, data curation, data modelling, data conversion, data warehouse, data governance, data visualization, data management, data standardization

Architecture: HPCC, MIKE2.0, 5C architecture, distributed databases, distributed computing, data intensive systems, distributed filesystems, data intensive computing, High-Performance Computing, distributed parallel architecture

Data management technologies: Toad, Redis, Neo4J, Splunk, BigQuery, Cassandra, Couchbase, Apache Pig, Apache Hive, Apache Storm, Apache Oozie, Apache Mesos, Apache Flume, Apache HBase, Apache Sqoop, Apache Phoenix, Cloudera Impala, Amazon RedShift, Apache ZooKeeper, Amazon DynamoDB

Data mining tools: H2O, Weka, BigML, Orange, LIBSVM, Scikit-learn, BigInsights, Spark MLlib, RapidMiner, Vowpal Wabbit, Apache Mahout, Google Prediction

Databases technologies: sql, DB2, DBMS, SQLite, mysql, Oracle, NoSQL, Vertica, RDBMS, Netezza, Redshift, Teradata, database, mongodb, SAP HANA, SQL Server, PostgreSQL, Oracle Exascale, query languages, EMC (Greenplum), scripting languages, Aster Data (Teradata), network-attached storage…, storage area network (SAN), massively parallel-processing…, direct-attached storage (DAS)

Upper-level technologies: web service, data mining, deep learning, stream analysis, digital footprint, network analysis, machine learning, stream processing, inductive statistics, artificial intelligence, business intelligence, software development, social network analysis, natural language processing

Hadoop: Apache Hadoop, HDFS, YARN, MapReduce, Cloudera, RHIPE

Programming languages: C, C#, Go, c++, ECL, perl, java, Julia, Bash, scala, Ruby, Octave, python, IPython, javascript, Visual Basic

Search technologies: search-based applications, ElasticSearch, Solr, Lucene

Statistics and business intelligence: R, SAS, SPSS, Dato, Excel, Stata, pbdR, matlab, Alteryx, Cognos, Pentaho, QlikView, Power BI, Oracle BI, Jaspersoft, PowerPivot, Mathematica, Apache Spark, Microstrategy, Adobe Analytics, BusinessObjects

Visualization technologies: D3, Shiny, Plotly, NVD3, Bokeh, ggplot, Leaflet, InfoVis, Chart.js, Tableau, Visual.ly, Sigma JS, Infogram, n3-charts, Polymaps, Chartist.js, Matplotlib, Highcharts, ZoomData, ChartBlocks, Processing.js, FusionCharts, Datawrapper, Ember Charts

Soft skills: Logic, Ethics, Initiative, Teamwork, Leadership, Negotiation, Coordination, Communication, Delivery of results, Information privacy, Specialist knowledge and expertise, Creative problem solving, Innovation and contextual awareness

8.3 Appendix: Correlated Skills by Groups

Data science tasks: ios, php, unix, html, d3.js, nosql, html5, devops, node.js, hadoop, analysis, compiler, architect, mongodb, assurance, developer, leadership, simulation, sharepoint, unit testing, automation, data mining, data analysis, elasticsearch, virtualization, web analytics, data modeling, version control, relational database, artificial intelligence, business intelligence, software engineering, software architecture, software development.
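Charts like the one above show how often other skills appear together with the skills of a group. A minimal sketch of how such co-occurrence counts could be derived from job postings is given below; the postings and the exact counting rule are assumptions for illustration, not the study's actual pipeline:

    # Hedged sketch: counting how often skills co-occur across job postings.
    from collections import Counter
    from itertools import combinations

    postings = [
        {"python", "sql", "machine learning"},
        {"python", "hadoop", "sql"},
        {"r", "statistics", "sql"},
    ]

    pair_counts = Counter()
    for skills in postings:
        for pair in combinations(sorted(skills), 2):  # every skill pair in a posting
            pair_counts[pair] += 1

    # Skill pairs involving "sql", most frequent first:
    print([(p, n) for p, n in pair_counts.most_common() if "sql" in p])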

Administrative tasks for statistical purposes: sql, ios, c++, mysql, jquery, analyst, devops, finance, asp.net, analysis, security, vmware, statistics, architect, selenium, database, hardware, postgresql, unit testing, automation, device driver, virtualization, web analytics, machine code, software design, troubleshooting, machine learning, risk management, data visualization, software engineer, relational database, business intelligence, product management, responsive web design.


Budget tasks for statistical purposes: sql, c++, dba, php, dojo, linux, cloud, mysql, debian, java ee, finance, robotics, compiler, database, metadata, leadership, monitoring, automation, data science, data analysis, web analytics, machine code, version control, user experience, data conversion, database design, software engineer, integration testing, business intelligence, software architecture, product management, responsive web design, functional programming.

IT tasks for statistical purposes: ios, dba, php, perl, mysql, design, mobile, matlab, java ee, finance, backend, statistics, angularjs, database, javascript, prototype, postgresql, linked data, automation, data mining, data science, elasticsearch, apache spark, user interface, data modeling, troubleshooting, customer support, scripting language, software engineer, integration testing, relational database, amazon web services, continuous integration.


Statistics and business intelligence: dba, html, soap, sales, nosql, cloud, mysql, sqoop, oracle, java ee, finance, analysis, security, robotics, pentaho, statistics, analytics, cloudera, database, postgresql, simulation, wordpress, automation, web analytics, user interface, data modeling, version control, data conversion, business objects, machine learning, business intelligence, product management, distributed computing, software development.

Visualization technologies: c++, xml, perl, .net, pmp, scipy, oracle, jquery, github, design, ember, joomla, numpy, matlab, analyst, node.js, jasmine, android, analysis, network, statistics, angularjs, selenium, javascript, mongodb, postgresql, data mining, user interface, machine code, .net framework, computer science, data management, business intelligence, project management, amazon web services, model view controller, functional programming.


Search technologies: ios, php, perl, json, ruby, redis, cloud, mysql, hbase, scrum, oracle, design, lucene, devops, node.js, android, database, mongodb, hardware, metadata, postgresql, unit testing, virtualization, machine code, data modeling, version control, .net framework, software design, data warehouse, reverse engineering, artificial intelligence, distributed computing, software development, continuous integration.

Programming languages: git, sql, ruby, bash, redis, nginx, nosql, xhtml, debian, java ee, python, puppet, asp.net, security, vmware, statistics, analytics, compiler, selenium, javascript, hibernate, openstack, automation, virtualization, bioinformatics, .net framework, data warehouse, machine learning, scripting language, software engineer, project management, software engineering, distributed computing, software development.


Hadoop: pig, c++, php, perl, unix, html, linux, nosql, html5, mysql, hbase, java ee, ansible, analyst, analysis, statistics, angularjs, database, javascript, cassandra, developer, leadership, monitoring, mapreduce, apache spark, machine code, virtual machine, software design, data warehouse, business objects, data visualization, relational database, functional programming, complex event processing.

Upper level technologies: ios, perl, java, html5, oracle, java ee, node.js, analysis, analytics, compiler, architect, javascript, mongodb, metadata, assurance, simulation, automation, data mining, elasticsearch, device driver, virtualization, web analytics, user interface, troubleshooting, data warehouse, machine learning, image processing, computer science, scripting language, integration testing, amazon web services, product management, continuous integration, functional programming.


Databases technologies: c++, dba, php, json, html, bash, sybase, debian, matlab, java ee, devops, node.js, analysis, backend, statistics, compiler, database, metadata, simulation, wordpress, linked data, unit testing, data mining, apache spark, user interface, data modeling, troubleshooting, data warehouse, database design, machine learning, data management, integration testing, software engineering, software architecture.

Data mining tools: sql, c++, xml, perl, unix, linux, html5, numpy, android, vmware, polymer, statistics, analytics, compiler, magento, angularjs, selenium, assurance, wordpress, unit testing, automation, data analysis, web analytics, microservices, virtual machine, user experience, data conversion, data visualization, scripting language, software engineer, relational database, business intelligence, software engineering, software architecture, product management.


Data management technologies: html, cloud, hbase, sqoop, design, lucene, ansible, python, vmware, pentaho, analytics, cloudera, architect, database, rabbitmq, mongodb, cassandra, leadership, mapreduce, elasticsearch, microservices, apache camel, machine code, version control, virtual machine, machine learning, image processing, data visualization, scripting language, software engineer, data management, business intelligence, amazon web services, functional programming.

Cloud technologies: git, java, html, bash, nosql, oracle, design, docker, devops, node.js, firewall, hadoop, architect, database, hardware, openstack, postgresql, amazon s3, mapreduce, data analysis, microservices, machine code, .net framework, high availability, software design, troubleshooting, scripting language, data management, integration testing, relational database, continuous delivery, project management, distributed computing, software development.


Architecture: pig, c++, xml, php, json, redis, nosql, html5, storm, jquery, design, debian, analyst, finance, hadoop, security, robotics, architect, database, mongodb, assurance, openstack, postgresql, monitoring, automation, data modeling, version control, software design, machine learning, data management, integration testing, reverse engineering, artificial intelligence, project management.

Soft skills: git, ios, c++, xml, php, .net, nosql, cloud, html5, debian, mobile, java ee, asp.net, security, statistics, analytics, database, mongodb, hardware, prototype, openstack, leadership, linked data, user interface, software design, user experience, machine learning, risk management, computer science, software engineer, data management, amazon web services, software architecture, functional programming.


8.4 Appendix: Correlated Skills for Statistical Tools and Technologies

Adobe Analytics: 3d, git, ios, xml, perl, java, html, linux, arcgis, hybris, oracle, drupal, design, debian, firewall, tableau, analysis, pentaho, compiler, magento, database, metadata, prototype, coldfusion, postgresql, wordpress, automation, web crawler, data analysis, .net framework, business objects, artificial intelligence, functional programming.

Alteryx: sql, c++, perl, mysql, design, matlab, analysis, robotics, statistics, analytics, cloudera, database, metadata, leadership, simulation, data mining, data science, data analysis, apache spark, data modeling, data warehouse, amazon redshift, machine learning, data visualization, data management, relational database, reverse engineering, artificial intelligence, business intelligence, project management, software development.


Apache Spark: sql, c++, perl, unix, html, d3.js, laser, sales, redis, lamp, nosql, cloud, html5, sqoop, impala, python, apache, asp.net, hadoop, analysis, statistics, analytics, database, javascript, cassandra, assurance, prototype, postgresql, asp.net mvc, semantic web, software design, data warehouse, artificial intelligence, software architecture, product management, software development.

BusinessObjects: perl, .net, html, nosql, oracle, owasp, finance, analysis, pentaho, analytics, database, metadata, leadership, mapreduce, data science, data analysis, .net framework, software design, user experience, database design, business objects, machine learning, data visualization, data management, business intelligence, project management, software architecture, distributed computing, software development.


Cognos: etl, php, perl, unix, html, nosql, cloud, neo4j, mysql, hadoop, ipython, analytics, cloudera, compiler, architect, selenium, database, metadata, peoplesoft, sharepoint, linked data, monitoring, data modeling, data conversion, data warehouse, computer science, data management, relational database, artificial intelligence, business intelligence, amazon web services, software engineering, product management.

Excel: r, sql, x86, xml, .net, java, json, html, linux, stata, nosql, spark, cloud, oracle, debian, python, android, statistics, postgresql, peoplesoft, monitoring, web analytics, user interface, troubleshooting, database design, customer support, scripting language, relational database, reverse engineering, artificial intelligence, project management, amazon web services, distributed computing, software development.


Jaspersoft: sql, xml, perl, .net, soap, sales, mysql, java ee, analyst, asp.net, statistics, database, data mining, data analysis, user interface, .net framework, machine learning, data management, reverse engineering, artificial intelligence, business intelligence, software engineering, product management, software development.

Mathematica: sql, c++, perl, laser, nosql, mysql, design, matlab, python, analysis, statistics, database, mongodb, metadata, leadership, data mining, data science, data analysis, machine code, machine learning, computer science, software development.


Matlab: r, tcl, git, ios, java, laser, sales, scipy, xilinx, spark, nosql, boost, mysql, oracle, design, debian, directx, android, polymer, analytics, architect, javascript, prototype, developer, automation, data analysis, device driver, user interface, machine code, grid computing, .net framework, database design, machine learning, image processing, computer science, software engineer, project management.

Microstrategy: sql, html, sales, nosql, sqoop, github, design, fortran, analysis, statistics, analytics, compiler, sharepoint, data science, business objects, software engineer, relational database, reverse engineering, artificial intelligence, business intelligence, project management, software engineering, software architecture, product management.


Oracle BI: jira, c++, dba, java, json, cloud, html5, jquery, drupal, mobile, matlab, java ee, devops, asp.net, android, analysis, vmware, analytics, architect, angularjs, database, hibernate, mapreduce, virtualization, version control, data visualization, computer science, data management, relational database, artificial intelligence, business intelligence, software engineering, software development, functional programming.

Pentaho: sql, sas, c++, xml, php, perl, storm, sqoop, jquery, design, python, hadoop, analysis, symfony, analytics, database, mongodb, developer, leadership, mapreduce, data mining, data science, elasticsearch, .net framework, data visualization, business intelligence, project management, amazon web services, software engineering, software development.


Power BI: r, ios, java, unix, html, sales, nosql, mysql, storm, hbase, sybase, fortran, java ee, devops, tableau, hadoop, analytics, architect, postgresql, sharepoint, data mining, data science, virtualization, development, web analytics, version control, machine learning, computer science, relational database, business intelligence, project management, software architecture, distributed computing, responsive web design.

PowerPivot: sql, perl, mysql, finance, analysis, statistics, compiler, database, leadership, mapreduce, automation, data mining, data analysis, device driver, web analytics, data modeling, data warehouse, data visualization, reverse engineering, artificial intelligence, business intelligence, software development, functional programming.


QlikView: sql, pig, c++, json, html, storm, design, finance, robotics, pentaho, database, leadership, mapreduce, automation, data mining, web analytics, user interface, .net framework, software design, computer science, data management, artificial intelligence, product management, software development, complex event processing.

R: c++, perl, java, d3.js, linux, nosql, mysql, design, java ee, finance, asp.net, analysis, robotics, statistics, compiler, architect, javascript, hardware, metadata, assurance, simulation, sharepoint, linked data, data mining, data analysis, device driver, virtualization, development, cryptography, version control, machine learning, risk management, artificial intelligence, business intelligence, functional programming.


SAS: c++, x86, jade, linux, sales, sybase, impala, fortran, node.js, finance, analysis, security, atlassian, statistics, database, leadership, confluence, automation, data analysis, apache spark, bioinformatics, .net framework, machine learning, risk management, regression testing, scripting language, relational database, amazon web services, software engineering, distributed computing, software development.

SPSS: r, sas, .net, html, scipy, spark, nosql, scrum, scrapy, design, matlab, analysis, security, statistics, analytics, database, mongodb, javascript, metadata, monitoring, development, web analytics, machine code, bioinformatics, .net framework, business objects, risk management, computer science, data management, software architecture, software development.


Stata: r, sql, c++, html, scipy, arcgis, mysql, design, numpy, matlab, finance, hadoop, analysis, security, vmware, robotics, statistics, compiler, database, metadata, data mining, data science, data analysis, machine code, data modeling, virtual machine, machine learning, data visualization, computer science, scripting language, data management, reverse engineering, project management.


8.5 Appendix: Popularity of Tools and Technologies from Literature Analysis

Statistics and BI [chart]

Search Tools and Technologies [chart]: search-based applications, ElasticSearch, Solr, Lucene.

Programming Languages [chart]

Data Mining [chart]

Hadoop [chart]: Apache MapReduce, Apache HDFS, Hadoop YARN, RHIPE, Cloudera.

Upper-level Technologies [chart]

Databases [chart]

Architecture [chart]

8.6 Appendix: Big Data Training Needs Questionnaire Form

BIG DATA TRAINING NEEDS QUESTIONNAIRE

Q1. What skills need to be acquired?

Q2. What data sources should be covered? (Sensor data, Social media data, Financial transaction data, Web-scraped data etc.)

Q3. How many staff members need training?


Q4. Are the training needs the same for all types of staff members? (Statistician in a particular domain, General statistical methodologist, IT expert etc.)

Q5. By when do they need it?

Q6. What are the priorities? (with respect to skills, data sources, time constraints, training delivery methods etc.)

Thank you. That is the end of the questions.


8.7 Appendix: Learning outcomes defined for CF-DS competences and different mastery/proficiency levels

For each CF-DS competence, learning outcomes (LOs) are defined at three knowledge levels, compliant to the ACM classification: Familiarity, Usage and Assessment. The key verbs characteristic of each level are:

Familiarity: Choose, Classify, Collect, Compare, Contrast, Define, Demonstrate, Describe, Execute, Explain, Find, Identify, Illustrate, Label, List, Match, Name, Omit, Operate, Outline, Recall, Rephrase, Show, Summarize, Tell, Translate.

Usage: Apply, Analyze, Build, Construct, Change, Configure, Develop, Examine, Experiment with, Identify, Infer, Inspect, Model, Motivate, Organize, Select, Simplify, Solve, Survey, Test for, Visualize.

Assessment: Adapt, Assess, Combine, Compile, Compose, Conclude, Create, Criticize, Decide, Deduct, Defend, Design, Determine, Discuss, Disprove, Evaluate, Imagine, Improve, Influence, Invent, Judge, Justify, Optimize, Plan, Predict, Prioritize, Prove, Rate, Recommend, Solve.

Data Science Data Analytics (DSDA)

LO1-DA (DSDA-DA: Use existing analytical applications and statistical techniques on available data to discover new relations and deliver insights into research problems or organizational processes and support decision-making)
Familiarity: Choose appropriate method and operate existing tools to do specified data analysis. Present data in the required form.
Usage: Develop data analysis model for the specific data sets and tasks or processes. Identify necessary methods and use them in combination if necessary. Identify relations and provide consistent reports and visualizations.
Assessment: Create formal model for the organizational tasks and processes and use it to discover hidden relations, propose optimization and improvements. Develop new models and methods if necessary. Recommend and influence organizational improvement based on continuous data analysis.

LO1.01 (DSDA01: Effectively use variety of data analytics techniques, such as Machine Learning (including supervised, unsupervised, semi-supervised learning), Data Mining, Prescriptive and Predictive Analytics, for complex data analysis through the whole data lifecycle)
Familiarity: Choose and execute existing data analytics and analysis tools.
Usage: Identify existing requirements and develop predictive analysis tools to discover new relations.
Assessment: Design and evaluate predictive analytics tools.

LO1.02 (DSDA02: Apply designated quantitative techniques, including statistics, time series analysis, optimization, and simulation to deploy appropriate models for analysis and prediction)
Familiarity: Choose and execute standard methods from existing statistical libraries to provide overview.
Usage: Select most appropriate statistical techniques and model available data to deliver insights.
Assessment: Assess and optimize organization processes using statistical techniques.

LO1.03 (DSDA03: Identify, extract, and pull together available and pertinent heterogeneous data, including modern data sources such as social media data, open data, governmental data)
Familiarity: Operate tools for complex data handling.
Usage: Analyze available data sources and develop tools that work with complex datasets.
Assessment: Assess, adapt, and combine data sources to improve analytics.

LO1.04 (DSDA04: Understand and use different performance and accuracy metrics for model validation in analytics projects, hypothesis testing, and information retrieval)
Familiarity: Name and use basic performance assessment metrics and tools.
Usage: Use multiple performance and accuracy metrics, select and use most appropriate for specific type of data analytics application.
Assessment: Evaluate and recommend the most appropriate metrics, propose new ones for new applications.

LO1.05 (DSDA05: Develop required data analytics for organizational tasks, integrate data analytics and processing applications into organization workflow and business processes to enable agile decision making)
Familiarity: Define data elements necessary to develop specified data analytics.
Usage: Develop specialized analytics to enable decision-making.
Assessment: Design specialized analytics to improve decision-making.

LO1.06 (DSDA06: Visualise results of data analysis, design dashboard and use storytelling methods)
Familiarity: Choose and execute standard data visualization.
Usage: Build visualizations for complex and variable data.
Assessment: Create and optimize visualizations to influence executive decisions.

Data Science Engineering

LO2-ENG (DSENG: Use engineering principles and modern computer technologies to research, design, implement new data analytics applications; develop experiments, processes, instruments, systems, infrastructures to support data handling during the whole data lifecycle)
Familiarity: Identify and operate instruments and applications for data collection, analysis and management.
Usage: Model problems and develop new instruments and applications for data collection, analysis and management, following established engineering principles.
Assessment: Evaluate instruments and applications to optimize data collection, analysis and management.

LO2.01 (DSENG01: Use engineering principles (general and software) to research, design, develop and implement new instruments and applications for data collection, storage, analysis and visualisation)
Familiarity: Choose potential technologies to develop, structure and instrument machines, experiments, processes, and systems.
Usage: Model data analytics application to better develop suitable instruments, machines, experiments, processes, and systems.
Assessment: Create innovative solution to research and design data analytics.

LO2.02 (DSENG02: Develop and apply computational and data driven solutions to domain related problems using wide range of data analytics platforms, with the special focus on Big Data technologies for large datasets and cloud based data analytics platforms)
Familiarity: Name computational solution and identify potential data analytics platform.
Usage: Apply existing computational solutions to data analytic platform.
Assessment: Adapt and optimize existing computational solutions to better fit a given data analytics platform.

LO2.03 (DSENG03: Develop and prototype specialised data analysis applications, tools and supporting infrastructures for data driven scientific, business or organisational workflow; use distributed, parallel, batch and streaming processing platforms, including online and cloud based solutions for on-demand provisioned and scalable services)
Familiarity: Identify a set of potential data analytics tools to fit specification.
Usage: Survey various specialized data analytics tools and identify the best option.
Assessment: Evaluate and recommend optimal data analytics tools to influence decision making.

LO2.04 (DSENG04: Develop, deploy and operate large scale data storage and processing solutions using different distributed and cloud based platforms for storing data (e.g. Data Lakes, Hadoop, Hbase, Cassandra, MongoDB, Accumulo, DynamoDB, others))
Familiarity: Find possible database solutions including both relational and non-relational databases.
Usage: Model the problem to apply database technology.
Assessment: Predict the difference in terms of performance between relational and non-relational databases and recommend a solution.

LO2.05 (DSENG05: Consistently apply data security mechanisms and controls at each stage of the data processing, including data anonymisation, privacy and IPR protection)
Familiarity: Identify security issues related to reliable data access.
Usage: Analyze security threats and solve them using known techniques.
Assessment: Evaluate security threats and recommend adequate solutions.

LO2.06 (DSENG06: Design, build and operate relational and non-relational databases (SQL and NoSQL), integrate them with the modern Data Warehouse solutions, and ensure effective ETL (Extract, Transform, Load), OLTP, OLAP processes for large datasets)
Familiarity: Define technical requirements for SQL/NoSQL databases and Data Warehouse technologies for data ingest.
Usage: Apply existing SQL/NoSQL databases and Data Warehouse technologies for creating data pipelines.
Assessment: Combine several techniques and optimize them to design new or custom environment to integrate existing DW and database technologies for new type of data analytic applications.

Data Science Data Management (DSDM)

LO3-DM (DSDM-DM: Develop and implement data management strategy for data collection, storage, preservation, and availability for further processing)
Familiarity: Execute data strategy in a form of Data Management Plan and illustrate how available software can help to promote data quality and accessibility.
Usage: Develop components of data strategy and methods that improve quality, accessibility and publications of data.
Assessment: Create Data Management Plan aligned with the organizational needs, evaluate IPR and ethical issues.

LO3.01 (DSDM01: Develop and implement data strategy, in particular, in a form of Data Management Plan (DMP))
Familiarity: Explain and execute data strategy in a form of Data Management Plan.
Usage: Develop components of data strategy in a form of Data Management Plan.
Assessment: Assess various data strategies and create data strategy, in a form of Data Management Plan, aligned with organizational needs.

LO3.02 (DSDM02: Develop and implement relevant data models, including metadata)
Familiarity: Operate data models including metadata.
Usage: Experiment with data models and model relevant metadata.
Assessment: Evaluate and design data models, including metadata.

LO3.03 (DSDM03: Collect and integrate different data sources and provide them for further analysis)
Familiarity: Collect different data sources.
Usage: Survey and visualize connection between different data sources.
Assessment: Compose different data sources to enable further analysis.

LO3.04 (DSDM04: Develop and maintain a historical data repository of analysis results (data provenance))
Familiarity: Operate a historical data repository.
Usage: Construct a historical data repository.
Assessment: Improve or design a historical data repository.

LO3.05 (DSDM05: Ensure data quality, accessibility, publications (data curation))
Familiarity: Illustrate how available software can help to promote data quality, accessibility and publications.
Usage: Develop methods that improve quality, accessibility and publications of data.
Assessment: Improve quality, accessibility and publications of data.

Data Science Research Methods and Project Management (DSRMP)

LO4-RMP (DSRM: Create new understandings and capabilities by using the scientific method (hypothesis, test/artefact, evaluation) or similar engineering methods to discover new approaches to create new knowledge and achieve research or organisational goals)
Familiarity: Match elements of scientific or similar method and identify appropriate actions for organizational strategy to create new capabilities.
Usage: Apply scientific or similar method and develop action plans to translate organizational strategies to create new capabilities.
Assessment: Evaluate methodologies to optimize the development of organizational objectives.

LO4.01 (DSRM01: Create new understandings by using the research methods (including hypothesis, artefact/experiment, evaluation) or similar engineering research and development methods)
Familiarity: Match elements of scientific or similar method to a given problem.
Usage: Apply scientific method to create new understandings and capabilities.
Assessment: Evaluate various methods and predict which method can optimize creation of new understandings and capabilities.

LO4.02 (DSRM02: Direct systematic study toward understanding of the observable facts, and discover new approaches to achieve research or organisational goals)
Familiarity: Choose observable facts from an existing study for a better understanding.
Usage: Apply systematic study toward a fuller knowledge or understanding of the observable facts.
Assessment: Combine several methods to discover new approaches to achieve organizational goals.

LO4.03 (DSRM03: Analyse domain related research process model, identify and analyse available data used to identify research questions and/or organisational objectives and formulate sound hypothesis)
Familiarity: Formulate and test hypothesis for specified task or research question.
Usage: Create full experiment to test hypothesis for domain specific task or experiment.
Assessment: Analyse domain related models and propose analytics methods, suggest new data or improve quality of used data.

LO4.04 (DSRM04: Undertake creative work, making systematic use of investigation or experimentation, to discover or revise knowledge of reality, and use this knowledge to devise new applications, contribute to the development of organizational objectives)
Familiarity: Show creativity under guidance of a senior staff in discovering and revising knowledge.
Usage: Develop creative solutions using systematic investigation or experimentation to revise and discover knowledge.
Assessment: Adapt common systematic investigation to design and plan creative work to discover or revise knowledge.

LO4.05 (DSRM05: Design experiments which include data collection (passive and active) for hypothesis testing and problem solving)
Familiarity: Illustrate outstanding ideas to solve complex problems.
Usage: Identify non-standard solutions to solve complex problems.
Assessment: Recommend cost effective solution to a complex problem.

LO4.06 (DSRM06: Develop and guide data driven projects, including project planning, experiment design, data collection and handling)
Familiarity: Identify appropriate actions for a given project plan or experiment.
Usage: Develop actions and action plan to translate strategies into actionable plan.
Assessment: Recommend effective action plans to translate strategies, suggest new data to improve effectiveness.

Business Process Management

LO5-BA (DSDK: Use domain knowledge (scientific or business) to develop relevant data analytics applications; adopt general Data Science methods to domain specific data types and presentations, data and process models, organisational roles and relations)
Familiarity: Match elements of a mathematical framework to a given business problem and operate data support services for other organizational roles.
Usage: Model business problems into an abstract mathematical framework and identify critical points which influence development of organizational objectives.
Assessment: Evaluate various methods to predict which method can optimize solving business problems, and recommend strategies that optimize the development of organizational objectives.

LO5.01 (DSBA01: Analyse information needs, assess existing data and suggest/identify new data required for specific business context to achieve organizational goal, including using social network and open data sources)
Familiarity: Match elements of a mathematical framework to a given business problem.
Usage: Model an unstructured business problem into an abstract mathematical framework.
Assessment: Evaluate various methods and predict which method can optimize solving business problems.

LO5.02 (DSBA02: Operationalise fuzzy concepts to enable key performance indicators measurement to validate the business analysis, identify and assess potential challenges)
Familiarity: Match data to specification of services.
Usage: Analyze services to develop data specification.
Assessment: Assess and improve use of data in services.

LO5.03 (DSBA03: Deliver business focused analysis using appropriate BA/BI methods and tools, identify business impact from trends; make business case as a result of organisational data analysis and identified trends)
Familiarity: Identify appropriate actions for management and organizational decisions.
Usage: Identify critical points which influence development of organizational objectives.
Assessment: Recommend strategies that optimize the development of organizational objectives.

LO5.04 (DSBA04: Analyse opportunity and suggest use of historical data available at organisation for organizational processes optimization)
Familiarity: Operate data support services for other organizational roles.
Usage: Develop data support services for other organizational roles.
Assessment: Optimize data support services for other organizational roles.

LO5.05 (DSBA05: Analyse customer relations data to optimise/improve interacting with the specific user groups or in the specific business sectors)
Familiarity: Summarize customer data.
Usage: Survey and visualize customer data.
Assessment: Recommend actions based on data analysis to improve customer relations.

LO5.06 (DSBA06: Analyse multiple data sources for marketing purposes; identify effective marketing actions)
Familiarity: Access and use external open data and social network data.
Usage: Identify data that bring value to analytics used for marketing. Use cloud based solutions.
Assessment: Suggest new marketing models based on existing and external data.