Services concerning ethical, communicational, skills issues and methodological cooperation related to the use of Big Data in European statistics
(Contract number 11104.205.005-2015.799)
Development of a training strategy to bridge the big data skills gap in European official statistics
Version 2
Date of the report: 30 November 2017
Drafted by: JOZEF STEFAN INSTITUTE Inna NOVALIJA
Marko GROBELNIK
Disseminated: EUROSTAT: Albrecht WIRTHMANN
Neither the European Commission nor any other person acting on behalf of the Commission is responsible for the use that might be made of the following information.
The information and views set out in this report are those of the author(s) and do not necessarily reflect the official opinion of the European Union. Neither the European Union institutions or bodies nor any person acting on their behalf may be held responsible for the use which may be made of the information contained therein.
Table of contents

List of Tables
List of Figures
1. Executive summary
2. Skills required to process and analyze big data sources for the purpose of official statistics
2.1 Background
2.1.1 Defining big data
2.1.2 Related studies and initiatives
2.1.2.1 EDSA
2.1.2.2 BDVA reports
2.1.2.3 Related MOOCs
2.2 Data collection
2.2.1 Methodology for data collection
2.2.2 Data statistics
2.2.3 Literature analysis
2.3 Skills analysis
2.3.1 Methodology for skills analysis
2.3.2 Clustering skills with OntoGen
2.3.3 Analysis by skills groups
2.4 Results
2.4.1 Trending skills
2.4.2 Correlated skills
2.4.3 Skills from literature analysis
2.5 Discussion
3. Existing skills in the statistical offices of the ESS, Eurostat and NSIs
3.1 Existing skills overview
4. Analysis of the training needs of the statistical offices
4.1 Big data training needs survey overview
4.2 Big data training needs survey results
4.3 Big data training needs survey summary
4.4 Towards bridging the skills gap for big data in statistics
5. Training objectives and content for the design of a training program
5.1 Learning models and curriculum design approaches
5.1.1 Learning models
5.1.2 Curricula guidelines
5.2 Related curricula and classifications
5.2.1 ACM classification for computer science
5.2.2 Curriculum development in EDISON project
5.2.3 Curriculum development in EDSA project
5.3 Training objectives for statistical offices in Europe in the area of Big Data
5.3.1 Defining training objectives
5.4 Content development in the area of Big Data
5.4.1 ESTP content
5.4.2 Data science content dashboard
6. Strategic analysis of bridging the gap via training
6.1 Training channels
6.1.1 Advantages and disadvantages of face-to-face training
6.1.2 European Statistics Training Programme
6.1.3 European Master in Official Statistics
6.1.4 Online Learning
6.1.5 Other possible training channels
6.2 Defining strategic training plan for Big Data in official statistics
6.2.1 ADDIE model
6.2.2 Analysis
6.2.3 Design
6.2.4 Development
6.2.5 Implementation
6.2.6 Evaluation
7. References
8. Annex
8.1 Appendix: Recommended Literature Sources in the Area of Big Data
8.2 Appendix: Trending Skills by Groups
8.3 Appendix: Correlated Skills by Groups
8.4 Appendix: Correlated Skills for Statistical Tools and Technologies
8.5 Appendix: Popularity of Tools and Technologies from Literature Analysis
8.6 Appendix: Big Data Training Needs Questionnaire Form
8.7 Appendix: Learning outcomes defined for CF-DS competences and different mastery/proficiency levels
List of Tables

Table 1: Running/Planned Big Data MOOCs
Table 2: Number of Job Postings by Country
Table 3: Skill Groups (Soft Skills, Tasks and Methods)
Table 4: Skill Groups (Tools and Technologies)
Table 5: Highly Demanded Skills (Soft Skills, Tasks and Methods)
Table 6: Highly Demanded Skills (Tools and Technologies)
Table 7: Emerging Skills
Table 8: Skills Classification as Basis for Questionnaire
Table 9: Skills Based on LinkedIn Experiment According to Skills Framework
Table 10: Skills from Big Data Training Needs Survey According to Skills Framework
Table 11: Knowledge Levels for Learning Outcomes in Data Science Model Curricula (MC-DS)
Table 12: Core EDSA Curriculum, version 1
Table 13: Core EDSA Curriculum, version 3
Table 14: Recommendations for EDSA Curriculum Development
Table 15: Training Objectives Mapped to Big Data Training Needs
Table 16: Related ESTP Courses
Table 17: ESTP Content Mapped to Training Objectives
Table 18: Web Content Mapped to Training Objectives
Table 19: Videolectures in the Area of Big Data
Table 20: Expected Outcomes for Big Data in Statistics
List of Figures

Figure 1: Big Data MOOCs Skills
Figure 2: Data Acquisition and Enrichment Pipeline
Figure 3: Example of Job Postings Wikification with JSI Wikifier
Figure 4: Top Locations for Data Analytics in Europe
Figure 5: Jobs Posting Content Clustering with OntoGen
Figure 6: Jobs Postings Skills Clustering with OntoGen
Figure 7: Skills by Groups
Figure 8: Technologies, Tasks and Soft Skills Trends
Figure 9: Statistical Tasks and Methods
Figure 10: Tools and Technologies Trends by Groups
Figure 11: Correlated Skills for Statistical Tasks
Figure 12: Correlated Skills for Statistics and Business Intelligence Skill Group
Figure 13: Correlated Skills for the Excel Skill
Figure 14: Correlated Skills for the SAS Skill
Figure 15: Correlated Skills for the Matlab Skill
Figure 16: Tools and Technologies by Groups from Literature Analysis
Figure 17: LinkedIn Experiment
Figure 18: Big Data Training Needs Survey - Answers Map
Figure 19: Bloom's Taxonomy
Figure 20: EDSA Dashboard – Hadoop Search
Figure 21: EDSA Dashboard – Hadoop Related Trainings and Videolectures
Figure 22: EDSA Online Courses Portal
Figure 23: EDSA Learning Pathways
1. Executive summary

This document outlines the findings and recommendations related to:
- the identification of skills required for the use of big data sources,
- the analysis of training needs of the statistical offices of the ESS, and
- the definition of training objectives and content in the area of big data for official statistics.
The results of the skill analysis give a clear indication of the individual big data, data science and statistical skills, as well as the skill groups, that are currently in demand across Europe. The report addresses skill groups at different levels of the big data Skills Framework: soft skills; tasks and methods (statistical and data science tasks, administrative tasks for statistical purposes, information technology tasks for statistical purposes); and tools and technologies (big data skills related to platform architecture, statistics and business intelligence, data management, cloud technologies, data mining tools, databases, Hadoop technology, programming languages, search and visualization technologies, plus upper-level data science skills).
The skill analysis shows growing trends across the big data skill groups and suggests a great need for big data training. Demand is high across Europe for a number of established skills and skill areas, including Apache-related skills, databases, programming languages such as Python and JavaScript, cloud technologies and data mining. Other skills, such as data visualisation and a number of data management technologies, are emerging trends with steady growth, though not yet at the levels of the established areas.
In order to identify big data training needs and obtain a picture of the existing skills in NSIs, a survey targeted at big data focal points in EU countries was conducted. The report describes the survey results as well as the analysis of existing skills in the ESS according to the big data skills framework for official statistics.
In particular, the survey defined the groups of skills that NSIs would like to acquire:
- Methodological skills;
- Technical skills;
- Visualization and storytelling skills;
- Contextual skills; and
- Soft skills.
Several data types/data sources, such as web-scraped data, mobile phone data, sensor data and scanner data, have frequently been listed as priorities. The survey established that training should be offered at different levels (introductory and advanced) and can be targeted at different employee profiles; both individuals and big data teams can receive training. Training should take into account the big data sources and types that will be addressed in the ESS.
The minimum and maximum number of trainees varies depending on the size of the NSI and the NSI training strategy.
The priorities defined in the survey included:
- Priorities for training methods/knowledge transfer types (such as webinars and online courses along with face-to-face training);
- Priorities for data sources/data types (web-scraped data and mobile phone data are frequently mentioned);
- Priorities for technologies/methods/skills (such as introductory big data methodologies and skills);
- Priorities for other issues (trainings of up to 1 week, training on the job).
Furthermore, based on the survey results and the skills required for working with big data, a set of training objectives with corresponding content was identified. The report defines what trainees should be able to do as a result of the training, to what standards and under what conditions. The training objectives and content defined in the report are ready to be used for the design of an effective training plan.
Finally, the report presents an analysis of the possible ways big data skills training can be provided to the staff of the statistical offices of the ESS.
Several existing instruments, such as the European Statistics Training Programme (ESTP) and the European Master in Official Statistics (EMOS), are taken into consideration, as well as possible e-learning mechanisms. In particular, options such as webinars, MOOCs, video lectures, workshops and personalized course portals are suggested as feasible training channels.
The report highlights the necessity of blended (face-to-face and online) training approaches in order to meet the training needs in the area of big data for official statistics.
2. Skills required to process and analyze big data sources for the purpose of official statistics
2.1 Background
2.1.1 Defining big data

In the current era of information technology, data is being created more easily and quickly than it can be analysed.
A big data analytics survey [1] states that data-driven innovations in European public and private sectors already bring benefits for businesses, government organizations and individual citizens.
Nada Elgendy and Ahmed Elragal [2] define “Big Data” as a term recently applied to datasets that grow so large that they become awkward to work with using traditional database management systems. Thus, big data datasets go beyond the ability of commonly used software tools and storage systems to capture, store, manage, as well as process the data within a tolerable elapsed time [3]. At the same time, Paul MacDonnell and Daniel Castro [4] state that “Big Data” includes processing information that is often heterogeneous and frequently updated. Organizations can continuously aggregate data at a group level,
collecting data for all users and not only for a sample of users. Big data sizes are constantly increasing, currently ranging from a few dozen terabytes (TB) to many petabytes (PB) of data in a single data set.
According to a Data & Analytics Survey [5], more and more companies intend to implement data-driven projects, and 53% of companies aim to generate greater value from existing data. Of the projects underway or in the planning stages, 26% are already implemented, 14% are in the process of implementation or pilot testing and 13% are planned for implementation in the next 12 months. Another 8% are considering a data-driven project, and 8% say they are likely to pursue one, although they are currently struggling to find the right strategy or solutions.
38% of respondents do not have plans to implement data-driven projects but would still like to perform analytics on existing data. The listed data challenges include finding correlations across multiple disparate data sources (60%), predicting customer behaviour (47%) and predicting product or service sales (42%), as well as identifying computer security risks, analysing high-scale machine data and predicting fraud and financial risk. At the same time, a segment of respondents would like to pay more attention to analysing social media data.
Different organizations usually operate with different types of data collected. Enterprises are more likely to collect transactional data, machine-generated/sensor data, government and public domain data, and data from security monitoring. Smaller organizations collect email, data from third-party databases, social media, and statistics from news media.
Top data sources include sales and financial transactions (56%), leads and sales contacts from customer databases (51%), and email and productivity applications (39%).
One of the biggest challenges for data-driven projects is dealing with unstructured data (emails, word documents, presentations, etc.).
Dealing with sensor streams (from wearable medical devices, automated homes and intelligent roadways) is now becoming a new trend. The Internet of Things [4] is a term used to describe the set of physical objects embedded with sensors or actuators and connected to a network. Estimates suggest that by 2020 there will be around 50 billion sensor devices worldwide.
The necessity to work with big data produces training needs in this area. Several studies and initiatives in the area of big data skills analysis and big data training are described below in this deliverable.
2.1.2 Related studies and initiatives

In recent years, a number of initiatives related to big data skills analysis and training have appeared. Below, we describe the European Data Science Academy (EDSA) project and the Big Data Value Association (BDVA) reports related to demand analysis, and we analyse a set of big data MOOCs available via online training portals.
2.1.2.1 EDSA
EDSA project overview
The European Data Science Academy (EDSA) [6] is a Horizon2020 project that
- Analyses the sector-specific skillsets for data analysts across Europe’s main industrial sectors;
- Develops modular and adaptable curricula to meet these data science needs; and
- Delivers training supported by multiplatform and multilingual learning resources based on these curricula.
The EDSA curriculum provides a set of data science trainings (including trainings in the area of big data) on the following topics:
- Foundations of Data Science
- Foundations of Big Data
- Statistical / Mathematical Foundations
- Programming / Computational Thinking (R and Python)
- Data Management and Curation
- Big Data Architecture
- Distributed Computing
- Stream Processing
- Linked Data and the Semantic Web
- Machine Learning, Data Mining and Basic Analytics
- Big Data Analytics
- Process Mining
- Social Media Analytics
- Data Visualisation and Storytelling
- Data Exploitation, including data markets and licensing.
An important task of the EDSA project is monitoring data science trends to assess the demand for particular skills and expertise in Europe. EDSA partners have developed dashboards to present the current state of the European data science landscape; the data feed into curriculum development through interviews with data science practitioners, an industry advisory board representing a mix of sectors, and automated tools for extracting data about job posts and news articles.
The EDSA plans to align demand with supply of training materials in data science.
Training Delivery and Learning Analytics
Training delivery in EDSA is performed through eBooks, MOOCs, video lectures and face-to-face training. EDSA partners are working on integrated learning pathways, translated into European languages and expanded to meet the requirements for specific sectors as indicated by our demand analysis.
For learning analytics, EDSA partners are using VideoLectures.NET and FutureLearn – the largest European MOOC platform, founded by The Open University – to maximise outreach and uptake of our materials.
EDSA skills ontology
The Skills and Recruitment Ontology (SARO) [7] is a domain ontology representing occupations, skills and recruitment. It is modelled by considering several similar context models, but is mainly inspired by the
European Skills, Competences, Qualifications and Occupations ontology (ESCO) and Schema.org. The ontology is structured along four dimensions: job posts, skills, qualifications and users.
Job posts refer to job advertisements by organizations. Advertised job openings comprise various essential attributes, such as the job role, title, the relevant sector and other related descriptions (defined by Schema.org, e.g. job location, date posted, working hours, etc.).
One of the most important job requirements that is usually explicitly defined is the list of qualifications fitting the role, including the fundamental skills required to fulfil it. SARO also describes the proficiency level for each skill.
Skills are used by different groups of users depending on their tasks. For example, an educator or trainer could develop training resources related to certain skills or competences; a specific skill can be chosen by considering the skill gap of another user group, e.g. domain specialists.
2.1.2.2 BDVA reports

The BDVA reports [8] section contains a list of references to interesting reports/white papers/articles related to big data.
For instance, O'Reilly's 2016 Data Science Salary Survey [9] presents results from 983 respondents, working across a variety of industries, who answered questions about the tools they use, the tasks they engage in, and the salaries they make. The 2016 survey includes data scientists, engineers, and others in the data space from 45 countries and 45 US states.
O'Reilly's 2016 Data Science Salary Survey suggests the following tasks for Data Scientists:

- Basic Exploratory Data Analysis
- Conducting Data Analysis to Answer Research Questions
- Communicating Findings to Business Decision-makers
- Data Cleaning
- Creating Visualizations
- Identifying Business Problems to be Solved with Analytics
- Feature Extraction
- Developing Prototype Models
- Organizing and Guiding Team Projects
- Implementing Models/Algorithms into Production
- Collaborating on Code Projects
- Teaching/Training Others
- Planning Large Software Projects or Data Systems
- Developing Dashboards
- ETL
- Communicating with People outside your Company
- Setting up/managing Data Platforms
- Developing Data Analytics Software
- Developing Products that Depend on Real-Time Data Analytics
- Using Dashboards and Spreadsheets to Make Decisions
- Developing Hardware
The survey states that coding is an important part of a data scientist's job. Python and Spark are among the tools that contribute most to salary. The top two tools in the sample were Excel and SQL, both used by 69% of the sample, followed by R (57%) and Python (54%).
Different skill groups from the O'Reilly survey include the following popular Data Scientist tools:
Programming languages: SQL, Python, R, JavaScript, Go, Octave, Ruby, SAS, Perl, C#, C, Scala, Matlab, C++, Visual Basic/VBA, Java, Bash.
Relational databases: MySQL, Oracle Exascale, Redshift, SAP HANA, Aster Data (Teradata), EMC/Greenplum, Netezza (IBM), Vertica, IBM DB2, Teradata, SQLite, PostgreSQL, Oracle, Microsoft SQL Server.
Hadoop: EMC / Greenplum, Oracle, MapR, Amazon Elastic MapReduce (EMR), Hortonworks, Cloudera, Apache Hadoop, IBM.
Search: Solr, ElasticSearch, Lucene.
Data Management, Big Data Platforms: Couchbase, Storm, Amazon DynamoDB, Splunk, Google BigQuery/Fusion Tables, Neo4J, Redis, Zookeeper, Cassandra, Toad, Impala, Pig, Hbase, Amazon RedShift, MongoDB, Hive.
Spreadsheets, Business Intelligence, Reporting: Excel, Jaspersoft, Alteryx, Microstrategy, Adobe Analytics, Pentaho, Oracle BI, Cognos, BusinessObjects, QlikView, Power BI, PowerPivot, Spark.
Visualization Tools: JavaScript InfoVis Toolkit, Processing, Bokeh, Google Charts, D3, Shiny, Matplotlib, Tableau, ggplot.
Machine Learning, Statistics: IBM Big Insights, BigML, Vowpal Wabbit, KNIME, Dato / GraphLab, Stata, Mathematica, Mahout, LIBSVM, RapidMiner, H2O, Weka, Spark MlLib, Scikit-learn, Google Prediction.
The O'Reilly report is a useful source of information that reflects the grouping of skills and tools for data science in general and big data in particular.
2.1.2.3 Related MOOCs

Following the high demand for big data skills and the growing trend of big data technologies, we have studied the available online big data training resources in the form of MOOCs. Table 1 presents running and planned MOOCs in the area of big data. As shown in Table 1, the content of the MOOCs varies – from general MOOCs on big data and data science to more specific MOOCs in areas such as Bioinformatics and the Internet of Things (IoT).
Table 1: Running/Planned Big Data MOOCs

Each entry lists the MOOC title (platform), its topics and its tags:

- Accounting Analytics (Coursera). Topics: Business & Management, Statistics & Data Analysis. Tags: Business Analytics, Accounting Analytics, Financial Performance, Forecasting, Prediction Models, Big Data, Non-financial Metrics
- Foundations of marketing analytics (Coursera). Topics: Business & Management, Statistics & Data Analysis. Tags: Marketing Analytics, Business Analytics, Big Data, Databases, Data Analysis, Statistical Segmentation, Managerial Segmentation, Customer
- Hadoop Platform and Application Framework (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Hadoop Platform, Data Analysis, Spark, Map-Reduce, Hadoop, Hadoop Stack, HDFS
- A Crash Course in Data Science (Coursera). Topics: Business & Management, Data Science, Statistics & Data Analysis. Tags: Data Science, Big Data, Machine Learning, Statistics, Software Engineering
- Data-driven Decision Making (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Decision Making, Data Analytics, Big Data, Data Analysis, Business
- Big Data Integration and Processing (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Data Integration, Processing, Data Science, Hadoop, Spark
- Big Data for Better Performance (Open2Study). Topics: Marketing & Communication. Tags: Big Data, Marketing, Predictive Marketing
- Advanced Algorithms and Complexity (Coursera). Topics: Computer Science: Programming & Software Engineering, Computer Science: Theory. Tags: Algorithms, Data Structures, Big Data, Machine Learning
- Introduction to Big Data (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Data Science, Hadoop
- Graph Analytics for Big Data (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Graph Analytics, Data Analysis, Graphs, Neo4j, GraphX
- Machine Learning With Big Data (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Machine Learning, KNIME, Spark, Algorithms, Clustering Analysis
- Managing Big Data with MySQL (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, MySQL, Analytic Techniques, Databases, Business Analysis, Queries
- Big Data, Genes, and Medicine (Coursera). Topics: Biology & Life Sciences, Data Science, Information, Technology, and Design. Tags: Big Data, Genes, Medicine, Bioinformatics, Genetics, Human Body
- Cloud Computing Applications, Part 2: Big Data and Applications in the Cloud (Coursera). Topics: Computer Science: Systems, Security, Networking. Tags: Cloud Computing, Big Data, Applications, Cloud, Cloud Applications, Data Analysis, MapReduce, Spark, Cloudera, MapR, NOSQL Databases, HBase, Kafka, Spark Streaming, Lambda, Kappa, Graph Processing, Machine Learning, Deep Learning
- The Importance of Listening (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Social Media, Listening, Marketing, Big Data
- Big Data Modeling and Management Systems (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Big Data, Modeling, Management Systems, Data Analysis, Analytical Tools, AsterixDB, HP Vertica, Impala, Neo4j, Redis, SparkSQL
- Processing Big Data with Hadoop in Azure HDInsight (edX). Topics: Computer Science: Programming & Software Engineering. Tags: Big Data, Hadoop, Azure HDInsight, Microsoft Azure, Hive, Pig, Sqoop, Oozie, Mahout, R Language, Storm, HBase
- Implementing Real-Time Analytics with Hadoop in Azure HDInsight (edX). Topics: Computer Science: Systems, Security, Networking. Tags: Hadoop, Azure HDInsight, Big Data, HBase, Storm, Spark, Microsoft
- Smartphone Emerging Technologies (Coursera). Topics: Computer Science: Programming & Software Engineering, Electronics, Engineering, Statistics & Data Analysis. Tags: Emerging Technologies, SmartPhone, Smartphones, IoT, Internet of Things, Big Data, Operating Systems, iOS, Android
- Big Data Science with the BD2K-LINCS Data Coordination and Integration Center (Coursera). Topics: Biology & Life Sciences, Statistics & Data Analysis. Tags: Big Data, Analysis, LINCS, Network Analysis, Bioinformatics
- Big Data, Cloud Computing, & CDN Emerging Technologies (Coursera). Topics: Computer Science: Systems, Security, Networking. Tags: Cloud Computing, Big Data, CDN, Content Delivery Network, Emerging Technologies, Smartphones, IoT, Internet of Things
- Internet Emerging Technologies (Coursera). Topics: Computer Science: Systems, Security, Networking. Tags: Emerging Technologies, Internet, IoT, Internet of Things, Big Data, IP, Internet Protocol, IPv4, IPv6, TCP, UDP
- Internet of Things & Augmented Reality Emerging Technologies (Coursera). Topics: Computer Science: Programming & Software Engineering, Electronics, Engineering. Tags: Emerging Technologies, Big Data, IoT, Internet of Things, Augmented Reality, AR, WSN, Wireless Sensor Network, M2M
- Python for Genomic Data Science (Coursera). Topics: Data Science, Statistics & Data Analysis. Tags: Genomic Data, Python, Programming, Big Data
Figure 1 shows the most popular skills extracted from MOOCs, based on MOOC topics and tags. Some of the most popular skills represented in MOOC training are Data Analysis, HBase, Internet of Things, Bioinformatics, Cloud Computing and Algorithms.
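The counts behind Figure 1 amount to a simple frequency count over the MOOC tag lists. The sketch below is illustrative only: it uses a small hand-picked subset of the Table 1 entries rather than the full collection.

```python
from collections import Counter

# Hypothetical sample: (MOOC title, tag list) pairs modelled on Table 1.
moocs = [
    ("Introduction to Big Data", ["Big Data", "Data Science", "Hadoop"]),
    ("Graph Analytics for Big Data", ["Big Data", "Graph Analytics", "Data Analysis", "Neo4j"]),
    ("Machine Learning With Big Data", ["Big Data", "Machine Learning", "KNIME", "Spark"]),
    ("Managing Big Data with MySQL", ["Big Data", "MySQL", "Databases"]),
]

# Count in how many MOOCs each skill/tag appears.
tag_counts = Counter(tag for _, tags in moocs for tag in tags)

# The most frequent tag corresponds to the longest bar in a chart like Figure 1.
top_tag, top_n = tag_counts.most_common(1)[0]
```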
[Figure: horizontal bar chart “MOOCs Skills” showing the number of MOOCs per skill; the largest bars are Big Data, Data Analysis, Computer Science: Programming & Software Engineering, Internet of Things, Biology & Life Sciences, HBase, Algorithms, Cloud Computing, Engineering and Marketing.]
Figure 1: Big Data MOOCs Skills

The data science MOOCs shown above and the training materials from the EDSA project represent the supply of training in the area of big data. Below, we present the methodology that enables us to identify the demand for skills in the area of big data, with a particular emphasis on big data for statistics.
2.2 Data collection
2.2.1 Methodology for data collection

For the skills analysis, we have built the data acquisition and enrichment pipeline displayed in Figure 2. We have used the Adzuna API [10] to obtain a dataset of job postings related to data science from the UK, France, Germany and the Netherlands, and other crawling mechanisms for data provision from a variety of European countries, such as Denmark, Ireland, Romania, Italy, Switzerland, Belgium, Austria, Spain, Hungary, Sweden, the Czech Republic, Poland and Portugal.
Adzuna is a search engine for job ads that operates websites in 11 countries, aggregating vacancies from different job portals into one storage unit.
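As an illustration of the acquisition step, the sketch below builds an Adzuna search request and post-processes a sample response. The endpoint shape and parameter names (`app_id`, `app_key`, `what`) follow the public Adzuna documentation but should be verified against [10]; the credentials and the response excerpt are invented.

```python
import json
from urllib.parse import urlencode

# Hypothetical credentials; real values come from an Adzuna developer account.
APP_ID, APP_KEY = "my_app_id", "my_app_key"

def search_url(country, page, what):
    """Build a job-search request URL (parameter names assumed from the
    public Adzuna documentation; verify against [10])."""
    base = f"https://api.adzuna.com/v1/api/jobs/{country}/search/{page}"
    query = urlencode({"app_id": APP_ID, "app_key": APP_KEY, "what": what})
    return f"{base}?{query}"

# Invented excerpt of the JSON shape that is post-processed downstream.
sample_response = json.loads("""
{"results": [{"title": "Data Scientist",
              "description": "Python, Spark and statistics required.",
              "latitude": 51.5074, "longitude": -0.1278,
              "location": {"display_name": "London, UK"}}]}
""")

# Keep the fields the enrichment pipeline needs later on.
postings = [(r["title"], r["location"]["display_name"])
            for r in sample_response["results"]]
```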
The crawled data are characterized by several important features: multi-linguality, JSON representation, a cross-country view and the presence of geographical and time components. After obtaining the relevant datasets, we perform a number of data enrichment steps, including wikification and geo enrichment. Wikification refers to identifying textual components and linking them to the corresponding disambiguated Wikipedia pages [11]. We have used the JSI Wikifier [12], whose cross-lingual and multi-lingual functions make it possible to annotate the textual information of job postings in different languages with cross-lingual Wikipedia information.
[Figure: pipeline diagram. Multi-lingual crawled job postings in JSON format pass through Wikification and Geo Enrichment steps, producing extracted skills, extracted Wiki concepts and geo information, stored as cross-country RDF data.]
Figure 2: Data Acquisition and Enrichment Pipeline
Figure 3 provides a snapshot of job posting wikification with the JSI Wikifier. The results obtained from the Wikifier have been aligned with the skill ontology via name matching.
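The alignment via name matching can be sketched as follows; the skill vocabulary and the concept titles are invented examples, and the normalisation rules are one plausible choice rather than the exact ones used in the pipeline.

```python
import re

def normalise(name):
    """Lower-case and strip punctuation so that surface variants of the
    same skill name compare equal."""
    return re.sub(r"[^a-z0-9+#]+", " ", name.lower()).strip()

# Invented excerpt of the skill ontology vocabulary.
skill_ontology = ["Python", "Apache Hadoop", "Machine learning", "SQL"]
lookup = {normalise(s): s for s in skill_ontology}

def match(concept_title):
    """Map a Wikipedia concept title to an ontology skill, or None."""
    words = normalise(concept_title).split()
    for key, skill in lookup.items():
        kw = key.split()
        # Whole-word phrase containment, so "SQL" does not match "MySQL".
        if any(words[i:i + len(kw)] == kw for i in range(len(words))):
            return skill
    return None
```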
Following that, we have enriched the data with concepts from the GeoNames ontology [13]. We have added the GeoNames location URI and location name to job postings where latitude and longitude were available, and the coordinates and location URI to postings where only the location name was available.
Figure 3: Example of Job Postings Wikification with JSI Wikifier
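The geo enrichment step described above can be sketched as follows. The three-entry gazetteer is an invented stand-in for the full GeoNames dataset, and the matching strategy (nearest entry by great-circle distance) is an assumption, not necessarily the exact mechanism used in the pipeline.

```python
from math import radians, sin, cos, asin, sqrt

# Invented stand-in for GeoNames: name -> (latitude, longitude, URI).
gazetteer = {
    "London":    (51.5074, -0.1278, "http://sws.geonames.org/2643743/"),
    "Paris":     (48.8566,  2.3522, "http://sws.geonames.org/2988507/"),
    "Ljubljana": (46.0569, 14.5058, "http://sws.geonames.org/3196359/"),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def enrich(posting):
    """Add the missing half of (coordinates, location name/URI) in place."""
    if "latitude" in posting:                        # coordinates known
        name, (_, _, uri) = min(
            gazetteer.items(),
            key=lambda kv: haversine_km(posting["latitude"],
                                        posting["longitude"],
                                        kv[1][0], kv[1][1]))
        posting.update(location_name=name, location_uri=uri)
    elif posting.get("location_name") in gazetteer:  # only the name known
        lat, lon, uri = gazetteer[posting["location_name"]]
        posting.update(latitude=lat, longitude=lon, location_uri=uri)
    return posting

job = enrich({"title": "Data Engineer", "latitude": 51.50, "longitude": -0.12})
```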
2.2.2 Data statistics

Figure 4 presents an overview of the top locations for Data Analytics in Europe. Our collection mechanisms cover the majority of European countries, which we consider particularly useful, since the collected data reflect skill specifics at both the European and national levels.
Figure 4: Top Locations for Data Analytics in Europe

Table 2 presents information about the achieved data coverage.

Table 2: Number of Job Postings by Country

COUNTRY                    NUMBER OF JOB POSTINGS
UK                         251357
France                     101661
Germany                    140574
The Netherlands            108202
Switzerland                24253
Italy                      20623
Belgium                    20286
Poland                     22837
Denmark                    10844
Romania                    15807
Austria                    9857
Ireland                    31510
Hungary                    10301
Sweden                     16360
Spain                      16206
Czech Republic (Czechia)   11634
Portugal                   25414
Malta                      21
Norway                     21
Bulgaria                   35
Slovakia                   3
Estonia                    1
Total                      >837,000
2.2.3 Literature analysis

In addition to analysing job posting data, we have analysed an extensive literature collection, represented by the Microsoft Academic Graph [14] and the recommended sources in the area of big data (presented in Appendix 8.1).
The Microsoft Academic Graph [14] is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. This graph is used to power experiences in Bing, Cortana, Word, and in Microsoft Academic.
From the Microsoft Academic Graph we have extracted 4683 papers marked with the keyword “Big Data”. The recommended literature includes paper collections in the following areas:
- Credit card data;
- Mobile network data;
- Mobile phone & wearables sensors data;
- Network data;
- Text Analytics;
- Web data;
- Wikipedia.
We have used the JSI Wikifier described above to annotate the literature sources and to extract the relevant skills from them.
2.3 Skills analysis

In order to analyse the skills demand in the area of big data for official statistics, we have developed the methodology for skills analysis presented below.
2.3.1 Methodology for skills analysis

The methodology for skills analysis includes the following steps:

1. Clustering of job postings with the OntoGen tool [15].
2. Establishing relevant skill groups based on the clustering outcomes and on results from related studies and initiatives.
3. Detailed analysis of big data trends for the different skill groups: characterizing the behaviour of each skill group and identifying the highly demanded and emerging skills within it.
4. Identification of correlated skills for each defined skill group, as well as for skills of particular interest (such as skills from the Statistical tasks and the Statistics and Business Intelligence groups).
5. Skills analysis for literature sources.
2.3.2 Clustering skills with OntoGen

OntoGen is a semi-automatic, data-driven ontology editor focused on topic ontologies (sets of topics connected by different types of relations). Using OntoGen, we clustered the collection of job postings and identified a set of skill groups.
Figure 5 and Figure 6 below present the results of job posting clustering. In Figure 5 we have highlighted a cluster of job postings that explicitly mention the “statistics” skill. In Figure 6 it is possible to notice the areas that reflect the databases skill group (including skills such as “SQL”), the soft skill group (including skills such as “leadership”) and the upper-level skill group (including skills like “computing” and “artificial intelligence”).
[Figure omitted: clustering view with the “Statistics” cluster highlighted]
Figure 5: Job Postings Content Clustering with OntoGen

[Figure omitted: clustering view with the “Databases skills”, “Upper-level skills” and “Soft skills” areas highlighted]
Figure 6: Job Postings Skills Clustering with OntoGen
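The idea behind this clustering step can be illustrated with a toy stand-in: OntoGen works over bag-of-words representations of documents, so assigning postings to topic seeds by cosine similarity over word-count vectors captures the flavour of the approach (this sketch is not the actual OntoGen algorithm; documents, seeds and the threshold are illustrative assumptions):

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector of a document."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def group_by_seed(docs, seeds, threshold=0.2):
    """Assign each document to the most similar seed topic -- a crude
    stand-in for OntoGen's semi-automatic topic construction."""
    seed_vecs = {name: vectorize(text) for name, text in seeds.items()}
    groups = {name: [] for name in seeds}
    for i, doc in enumerate(docs):
        v = vectorize(doc)
        best = max(seed_vecs, key=lambda n: cosine(v, seed_vecs[n]))
        if cosine(v, seed_vecs[best]) >= threshold:
            groups[best].append(i)
    return groups

# Toy job-posting snippets and two seed topics.
docs = [
    "statistics regression sampling methods",
    "sql database query optimisation",
    "sql server database administration",
]
seeds = {"statistics": "statistics sampling", "databases": "sql database"}
clusters = group_by_seed(docs, seeds)
```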
2.3.3 Analysis by skills groups

Based on the OntoGen job posting clustering and other relevant skill classifications (for instance, the O’Reilly data science report), we established a number of skill groups, taking into account the needs of statisticians who work with big data. In particular, our skill group classification includes skills related to technologies for statisticians and data scientists, tasks for statisticians and data scientists, and the soft skills they require [16].
Figure 7 presents the initial skills classification by groups:

- Soft skills,
- Tasks and Methods (related to data science, Statistics, Administrative tasks for statistical purposes, IT tasks for statistical purposes, Budget tasks for statistical purposes) and
- Tools and Technologies (related to Statistics and Business Intelligence, Architecture Technologies, Cloud Technologies, Data Management, Search Technologies, Data Mining Tools, Databases, Upper-Level Technologies, Hadoop, Programming Languages, Visualization Technologies).
Figure 7: Skills by Groups
Table 3 presents the extended and updated set of individual skills for the Soft skills and Tasks and Methods groups. Table 4 presents the extended and updated set of individual skills for the Tools and Technologies group. Together, Table 3 and Table 4 constitute a big data skills framework.
Table 3: Skill Groups (Soft Skills, Tasks and Methods)

SOFT SKILLS: Communication; Coordination; Creative problem solving; Delivery of results; Ethics; Information privacy; Initiative; Innovation and contextual awareness; Leadership; Logic; Negotiation; Specialist knowledge and expertise; Teamwork.

STATISTICAL AND DATA SCIENCE TASKS: Algorithmic-based inference; Analysis of aggregated data; Analysis of micro data; Data conversion; Data curation; Data dissemination; Data governance; Data processing; Data querying; Data resource management; Data search; Data storage; Data visualization; Design-based estimation; Developing and maintaining statistical classifications; Developing dashboards; Developing prototypes; Editing and imputation techniques; Model-based estimation; Multivariate analysis; Nonresponse adjustment and weighting; Nowcasting and projections; Quality assessment; Sampling design; Setting up data hubs; Setting up data warehouses; Spatial analysis/GIS/cartography; Standardizing data; Statistical confidentiality and statistical disclosure control; Time series and seasonal adjustment; User needs assessment.

ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES): Conducting stakeholder consultations; Contract negotiation; Delivering training; Documentation writing; Drafting legal acts; Managing contracts, grants or other agreements; Managing task forces; Preparing contracts, grants or other agreements; Project management; Quality assurance and compliance.

INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES): Analysis of requirements; Developing software; Hardware and infrastructure; Security; System architecture; Systems and software maintenance; Testing of systems and software; User support.
Table 4: Skill Groups (Tools and Technologies)

PROGRAMMING LANGUAGES: C; C#; C++; ECL; Go; Java; Javascript; Julia; Octave; Python; R; Ruby; Scala; Visual Basic.

ARCHITECTURE TOOLS AND TECHNOLOGIES: 5C architecture; Data intensive computing; Data intensive systems; Distributed computing; Distributed filesystems; Distributed parallel architecture; High-Performance Computing; HPCC; MIKE2.0.

CLOUD TOOLS AND TECHNOLOGIES: Cloud computing.

HADOOP: Hadoop; RHIPE; YARN.

UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES: Artificial intelligence; Business intelligence; Data mining; Deep learning; Feature engineering; Inductive statistics; IoT (Internet of Things); Multimedia analysis; Natural language processing; Network analysis; Stream processing and analysis; Understanding algorithms; Web technologies (web scraping, web services etc.).

STATISTICS AND BUSINESS INTELLIGENCE: Adobe Analytics; Alteryx; Apache Spark; BusinessObjects; Cognos; Excel; Jaspersoft; Mathematica; Matlab; Microstrategy; Oracle BI; pbdR; Pentaho; PowerPivot; SAS; SPSS; STATA.

DATA MANAGEMENT: Amazon RedShift; Apache Flume; Apache HBase; Apache Hive; Apache Mesos; Apache Oozie; Apache Phoenix; Apache Pig; Apache Sqoop; Apache Storm; Apache ZooKeeper; Aster Data (Teradata); BigQuery; Cassandra; Cloudera Impala; EMC (Greenplum); Netezza; Redis; Splunk; Vertica.

DATABASES: Couchbase; Database; DBMS; MongoDB; MySQL; NoSQL; Oracle; PostgreSQL; Query languages; RDBMS; SAP HANA; SQL; SQL Server; SQLite; Toad.

SEARCH TECHNOLOGIES: ElasticSearch; Lucene; Search based applications; Solr.

DATA MINING: Apache Mahout; BigInsights; BigML; Google Prediction; LIBSVM; Orange; RapidMiner; Scikit-learn; Spark MLlib; Vowpal Wabbit; Weka.

VISUALIZATION TECHNOLOGIES: Bokeh; Chart.js; ChartBlocks; Chartist.js; D3; Datawrapper; Ember Charts; FusionCharts; ggplot; Highcharts; InfoVis; Leaflet; Matplotlib; N3-charts; NVD3; Plotly; Polymaps; Processing.js; Sigma JS; Shiny; Tableau; Visually; ZoomData.
2.4 Results

Based on the skill group analysis and the job posting data, we were able to observe trends in the different skill groups over time. The results of the skills analysis for literature sources are presented in section 5.3.
2.4.1 Trending skills

Figure 8 shows growing trends for tools & technologies, tasks & methods and soft skills in 2016, which supports the assumption that big data technologies and tasks will continue to develop actively in the coming years. Figure 9 presents trends for Statistical tasks (based on the available data).
Figure 10, which presents trends for the different technology groups, shows that many big data technological areas (in particular, statistics and business intelligence) exhibit increasing trends.
Furthermore, Appendix 9.2 presents more information about trending skills by groups.
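The linear trendlines in Figures 8-10 correspond to ordinary least-squares fits over monthly mention counts; a minimal sketch of the slope computation (the monthly counts below are illustrative, not the project's data):

```python
def ols_slope(ys):
    """Least-squares slope of ys against time indices 0..n-1.
    A positive slope indicates a growing skill-demand trend."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical monthly mention counts for one skill, Jan-Jun 2016.
assert ols_slope([10, 12, 15, 14, 18, 21]) > 0   # increasing demand
```

Skills with a clearly positive slope but a still-modest absolute count are natural candidates for the "emerging" category discussed below.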
[Figure: line chart with linear trendlines for technologies, tasks and soft skills over monthly job postings in 2016]
Figure 8: Technologies, Tasks and Soft Skills Trends

[Figure: monthly job posting counts for Statistical tasks, January-November 2016]
Figure 9: Statistical Tasks and Methods
[Figure: line chart with linear trendlines per technology group (architecture, cloud technologies, data management, data mining tools, databases, upper-level technologies, hadoop, programming languages, search, statistics and business intelligence, visualization) over monthly job postings]
Figure 10: Tools and Technologies Trends by Groups

In addition to the trend analysis, we have also identified highly demanded and emerging skills in the different skill groups based on our data (Table 5, Table 6, Table 7).
Table 5: Highly Demanded Skills (Soft Skills, Tasks and Methods)

SOFT SKILLS: Specialist knowledge and expertise; Creative problem solving; Communication; Innovation and contextual awareness; Logic.

STATISTICAL AND DATA SCIENCE TASKS: Data querying; Data search; Data storage.

ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES): Managing contracts, grants or other agreements.

INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES): Analysis of requirements; User support.

Table 6: Highly Demanded Skills (Tools and Technologies)

PROGRAMMING LANGUAGES: Python; Javascript; C#.

ARCHITECTURE TOOLS AND TECHNOLOGIES: Distributed parallel architecture; 5C architecture; High-Performance Computing.

CLOUD TOOLS AND TECHNOLOGIES: Cloud computing.

HADOOP: Hadoop; YARN.

UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES: Data mining; Artificial intelligence; Network analysis; Business intelligence; Stream processing and analysis.

STATISTICS AND BUSINESS INTELLIGENCE: Matlab; SAS; Excel; Oracle BI.

DATA MANAGEMENT: Redis; Apache Hive; Apache HBase; Apache Sqoop; Apache Oozie.

DATABASES: Database; DBMS; SQL; SQL Server.

SEARCH TECHNOLOGIES: Search based applications.

DATA MINING: Google Prediction.

VISUALIZATION TECHNOLOGIES: Sigma JS.
Table 7: Emerging Skills

SOFT SKILLS: Leadership; Initiative.

STATISTICAL AND DATA SCIENCE TASKS: Data curation.

PROGRAMMING LANGUAGES: ECL.

ARCHITECTURE TOOLS AND TECHNOLOGIES: HPCC.

STATISTICS AND BUSINESS INTELLIGENCE: PowerPivot.

DATA MANAGEMENT: Cloudera Impala; Apache Storm; Amazon RedShift.

DATA MINING: Scikit-learn.

VISUALIZATION TECHNOLOGIES: D3; Bokeh; Ember Charts; Shiny; FusionCharts; Matplotlib; Highcharts.
The results in Table 5, Table 6 and Table 7 form a basis for subsequent tasks in this project, in particular for preparing the questionnaire for statistical offices.
2.4.2 Correlated skills

Another analysis we performed was the identification of correlated skills for the Statistical tasks group and the Statistics and Business Intelligence tools and technologies group, both at group level and at the level of individual skills inside each group. Appendix 9.3 contains information on correlated skills for all skill groups.
Figure 11 presents the top correlated skills for Statistical tasks. As can be seen, these skills are related to databases, machine learning, data management, web development etc.
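Correlated skills of this kind can be obtained by counting, for each other skill, the number of job postings in which it co-occurs with the skill or group of interest. A hedged sketch of the counting step (the postings below are toy examples, not the collected corpus):

```python
from collections import Counter

def cooccurring_skills(postings, target):
    """Count how often each other skill appears in postings that also
    mention `target`. Each posting is a collection of skill strings."""
    counts = Counter()
    for skills in postings:
        skills = set(skills)
        if target in skills:
            counts.update(skills - {target})
    return counts

postings = [
    {"statistics", "sql", "machine learning"},
    {"statistics", "sql", "leadership"},
    {"sql", "php"},
]
top_correlated = cooccurring_skills(postings, "statistics").most_common()
```

Sorting the counts, as `most_common()` does, yields exactly the kind of ranked bar chart shown in Figures 11-15.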
[Figure: bar chart of the number of job postings for skills co-occurring with Statistical tasks, including sql, database, javascript, machine learning, data management, hadoop, nosql, metadata, leadership, software development and others]
Figure 11: Correlated Skills for Statistical Tasks
[Figure: bar chart of the number of job postings for skills co-occurring with the Statistics and Business Intelligence group, including database, analytics, machine learning, cloud, mysql, postgresql, business intelligence, software development and others]
Figure 12: Correlated Skills for Statistics and Business Intelligence Skill Group
Figure 12 presents correlated skills for the Statistics and Business Intelligence skill group. Prominent skills include “database”, “software development”, “analytics”, “machine learning”, cloud and distributed technologies, “java”, “HTML” and “web analytics”.
Looking at the correlated skills for individual statistical skills such as “Excel” (Figure 13), “SAS” (Figure 14) and “Matlab” (Figure 15), we can see a number of database-related skills, machine learning related skills, programming/statistical languages such as C++, R, Python and Java, skills related to data science tasks, and soft skills such as “leadership”.
Correlated skills for other statistical tools/skills can be found in Appendix 9.4.
[Figure: bar chart of skills co-occurring with Excel, including sql, java, python, r, oracle, statistics, database design, distributed computing, software development and others]
Figure 13: Correlated Skills for the Excel Skill
[Figure: bar chart of skills co-occurring with SAS, including c++, linux, statistics, database, leadership, machine learning, apache spark, data analysis, software development and others]
Figure 14: Correlated Skills for the SAS Skill
[Figure: bar chart of skills co-occurring with Matlab, including r, java, git, mysql, oracle, machine learning, image processing, data analysis, software engineering and others]
Figure 15: Correlated Skills for the Matlab Skill
2.4.3 Skills from literature analysis

A detailed literature analysis allowed us to identify the most popular skills and skill groups in our literature collection.
[Figure: popularity of Tools and Technologies skill groups in the literature collection]
Figure 16: Tools and Technologies by Groups from Literature Analysis

Figure 16 shows the popularity of different technologies by skill groups. Appendix 9.5 contains a detailed view of particular technologies within each group.
2.5 Discussion

Based on the results of the skills analysis, the literature analysis and discussions with Eurostat, we propose the skill group classification (big data skills framework) shown in Table 8.
Table 8: Skills Classification as Basis for Questionnaire

Soft skills: Communication; Coordination; Creative problem solving; Delivery of results; Ethics; Information privacy; Initiative; Innovation and contextual awareness; Leadership; Logic; Negotiation; Specialist knowledge and expertise; Teamwork.

Statistical and data science tasks: Algorithmic-based inference; Analysis of aggregated data; Analysis of micro data; Data conversion; Data curation; Data dissemination; Data governance; Data processing; Data querying; Data resource management; Data search; Data storage; Data visualization; Design-based estimation; Developing and maintaining statistical classifications; Developing dashboards; Developing prototypes; Editing and imputation techniques; Model-based estimation; Multivariate analysis; Nonresponse adjustment and weighting; Nowcasting and projections; Quality assessment; Sampling design; Setting up data hubs; Setting up data warehouses; Spatial analysis/GIS/cartography; Standardizing data; Statistical confidentiality and statistical disclosure control; Time series and seasonal adjustment; User needs assessment.

Administrative support tasks (for statistical purposes): Conducting stakeholder consultations; Contract negotiation; Delivering training; Documentation writing; Drafting legal acts; Managing contracts, grants or other agreements; Managing task forces; Preparing contracts, grants or other agreements; Project management; Quality assurance and compliance.

Information technologies tasks (for statistical purposes): Analysis of requirements; Developing software; Hardware and infrastructure; Security; System architecture; Systems and software maintenance; Testing of systems and software; User support.

Statistics and business intelligence tools and technologies: Adobe Analytics; Alteryx; Apache Spark; BusinessObjects; Cognos; Excel; Jaspersoft; Mathematica; Matlab; Microstrategy; Oracle BI; pbdR; Pentaho; PowerPivot; SAS; SPSS; STATA.

Visualization tools and technologies: Bokeh; Chart.js; ChartBlocks; Chartist.js; D3; Datawrapper; Ember Charts; FusionCharts; ggplot; Highcharts; InfoVis; Leaflet; Matplotlib; N3-charts; NVD3; Plotly; Polymaps; Processing.js; Sigma JS; Shiny; Tableau; Visually; ZoomData.

Data management: Amazon RedShift; Apache Flume; Apache HBase; Apache Hive; Apache Mesos; Apache Oozie; Apache Phoenix; Apache Pig; Apache Sqoop; Apache Storm; Apache ZooKeeper; Aster Data (Teradata); BigQuery; Cassandra; Cloudera Impala; EMC (Greenplum); Netezza; Redis; Splunk; Vertica.

Databases: Couchbase; Database; DBMS; MongoDB; MySQL; NoSQL; Oracle; PostgreSQL; Query languages; RDBMS; SAP HANA; SQL; SQL Server; SQLite; Toad.

Upper-level data science tools and technologies: Artificial intelligence; Business intelligence; Data mining; Deep learning; Feature engineering; Inductive statistics; IoT (Internet of Things); Multimedia analysis; Natural language processing; Network analysis; Stream processing and analysis; Understanding algorithms; Web technologies (web scraping, web services etc.).

Data mining tools and technologies: Apache Mahout; BigInsights; BigML; Google Prediction; LIBSVM; Orange; RapidMiner; Scikit-learn; Spark MLlib; Vowpal Wabbit; Weka.

Search tools and technologies: ElasticSearch; Lucene; Search based applications; Solr.

Programming languages: C; C#; C++; ECL; Go; Java; Javascript; Julia; Octave; Python; R; Ruby; Scala; Visual Basic.

Architecture tools and technologies: 5C architecture; Data intensive computing; Data intensive systems; Distributed computing; Distributed filesystems; Distributed parallel architecture; High-Performance Computing; HPCC; MIKE2.0.

Cloud tools and technologies: Cloud computing.

Hadoop: Hadoop; RHIPE; YARN.
3. Existing skills in the statistical offices of the ESS, Eurostat and NSIs
3.1 Existing skills overview

The analysis of existing skills plays an important role in big data training for official statistics. It can be performed via specialized questionnaires providing big data skills assessments. Another way to identify the skills available at NSIs is to look at publicly available statistician profiles.
In order to obtain a glimpse of the skills of statisticians, JSI conducted an experiment analysing statisticians’ profiles on LinkedIn1. LinkedIn is a business- and employment-oriented social networking service that operates via websites and mobile apps. It is mainly used for professional networking, including employers posting jobs and job seekers posting their CVs.
In particular, in the experiment, JSI manually extracted skills from the available statistician profiles obtained using the filter “National Statistical Institute”. Skills from the profiles of 40 statisticians working, or having previously worked, in European NSIs were extracted and analysed.
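Ranking the manually extracted skills by frequency is then a simple counting exercise; a minimal sketch with hypothetical profile data (the profiles below are illustrative, not the extracted ones):

```python
from collections import Counter

profiles = [                       # hypothetical extracted skill lists
    ["statistics", "spss", "project management"],
    ["statistics", "r", "sql"],
    ["statistics", "spss", "forecasting"],
]

# One Counter over all skills across all profiles.
skill_counts = Counter(skill for profile in profiles for skill in profile)
most_popular = skill_counts.most_common(3)
```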
Figure 17 presents the most popular skills from the statisticians’ profiles. Among the top skills, it is possible to notice general skills (like statistics, economics, international relations etc.), tools and technologies for statistics (SPSS, Stata etc.), skills related to tasks and methods (like project management), and in particular to statistical tasks (statistical modelling, forecasting etc.), soft skills (like leadership) and skills related to programming languages and databases (like R and SQL).
The analysed dataset reflects the necessity of raising awareness about big data methods, tools and technologies, since big data skills and the prerequisite skills for working with big data are rarely found in statisticians’ profiles.
1 https://www.linkedin.com (accessed in September 2017)
[Figure: bar chart of skill mentions in LinkedIn statisticians’ profiles, including statistics, data analysis, spss, stata, r, sql, economics, forecasting, statistical modeling, project management, leadership, microsoft excel and others]
Figure 17: LinkedIn Experiment
Table 9 presents the skills extracted from statisticians’ profiles in the LinkedIn experiment according to the big data skills framework.
Table 9: Skills Based on LinkedIn Experiment According to Skills Framework

SOFT SKILLS: Communication; Teamwork; Information privacy; Negotiation; Ethics; Coordination; Leadership.

STATISTICAL AND DATA SCIENCE TASKS: Algorithmic-based inference; Statistical confidentiality and statistical disclosure control; Multivariate analysis; Time series and seasonal adjustment; Data visualization; Data resource management; Data governance; Setting up data warehouses.

ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES): Project management; Delivering training.

INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES): Testing of systems and software; Security; Developing software.

PROGRAMMING LANGUAGES: Visual Basic; C++; Java; Python; R.

ARCHITECTURE TOOLS AND TECHNOLOGIES: Distributed computing.

UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES: Machine learning; Databases; Data mining; Business intelligence.

STATISTICS AND BUSINESS INTELLIGENCE: Excel; SAS; Matlab; SPSS; Stata.

DATA MANAGEMENT: (none identified)

DATABASES: SQL; Oracle; SAP.

VISUALIZATION TECHNOLOGIES: Tableau.
4. Analysis of the training needs of the statistical offices

In order to obtain an overview of the training needs in the area of big data from an inside perspective, JSI conducted a big data training needs survey targeted at the big data focal points in European NSIs.
The survey contained the following questions to be answered and commented on by respondents:
- What skills need to be acquired?
- What data sources should be covered?
- How many staff members need training?
- By when do they need it?
- What are the priorities?
Appendix 9.6 presents the questionnaire. Sections 4.1, 4.2 and 4.3 provide more details about the survey and the results obtained.
4.1 Big data training needs survey overview

The Big Data Training Needs Survey was conducted between July and September 2017. The questionnaire was sent to the big data focal points in 28 NSIs plus 3 EFTA countries.
By 30 September 2017, 20 replies (from Germany, Denmark, Hungary, Malta, the United Kingdom, Poland, Austria, the Czech Republic, Finland, Belgium, Latvia, Spain, Estonia, France, Slovenia, Cyprus, Slovakia, Croatia, Luxembourg and Ireland) had been received.
Figure 18 presents the countries whose NSIs responded to the survey (marked in red).
Figure 18: Big Data Training Needs Survey - Answers Map
4.2 Big data training needs survey results

4.2.1 Skills to be acquired

The NSIs identified different skills relevant to the use of big data in official statistics. The skills can be aggregated into the following groups:
1. Methodological skills;
2. Technical skills;
3. Visualization and storytelling skills;
4. Contextual skills and
5. Soft skills.
It was proposed that each group can have a foundational/introductory and an advanced skill level.
The skills groups based on the suggested types are described below.
Methodological skills
Introductory/Foundational skills (cited from the questionnaire replies):
- Understanding big data sources in terms of their usability;
- Introduction to new data science techniques (e.g. machine learning, NLP);
- Machine Learning in general;
- Methodology, both traditional-statistical and specific to big data, with the emphasis on data exploration, pattern recognition, information extraction and modelling;
- Statistical modelling knowledge (understand the scope, content, units, etc. of information stored in different data sources, how to build a common dataset / how to clean datasets to make them possible to combine, design and apply statistical models);
- Acquisition, mining and processing of big data;
- Methods of cleaning, editing source data & adjustment in large datasets;
- Knowledge of new data structures and new data science techniques;
- Knowledge of database models associated with big data;
- Data streaming.
Several NSIs (Hungary, Malta, Estonia, Slovenia) mentioned the importance of exploring different methods of combining and linking data from different sources:
- How to use different data sources (statistical surveys, administrative data, big data, other) and methods to combine them (data linking, matching techniques);
- Efficient data linkage methods in order to link big data with primary or auxiliary secondary data;
- Data linking methods (when direct links do not exist).
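When direct keys do not exist, one common family of techniques is approximate matching over normalized identifying fields. A toy sketch using standard-library string similarity (difflib) is shown below; the records and the threshold are hypothetical, and real probabilistic record linkage is considerably more sophisticated:

```python
from difflib import SequenceMatcher

def normalize(record):
    """Concatenate and normalize a record's identifying fields."""
    return " ".join(str(v).strip().lower() for v in record)

def link(records_a, records_b, threshold=0.85):
    """Greedy approximate linkage: pair each record in A with the most
    similar record in B when the similarity exceeds `threshold`."""
    links = []
    for i, a in enumerate(records_a):
        key_a = normalize(a)
        best_j, best_score = None, 0.0
        for j, b in enumerate(records_b):
            score = SequenceMatcher(None, key_a, normalize(b)).ratio()
            if score > best_score:
                best_j, best_score = j, score
        if best_score >= threshold:
            links.append((i, best_j, round(best_score, 2)))
    return links
```

For example, linking a survey record ("ACME Ltd", "Ljubljana") with an administrative record ("Acme Ltd.", "ljubljana") succeeds despite the differing punctuation and capitalization.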
Other foundational skills might also include:
- Bayesian Learning;
- Analysis of time series and prediction;
- Automatic learning;
- Statistical skills such as quantitative and qualitative analysis, weighting, inference, validity and modelling.
Advanced skills would include:
- Advanced big data analytics, modelling with big data;
- Methods to detect and manage atypical data (outlier treatment);
- How to practically apply classification methods, other machine learning methods and text analysis methods;
- Nowcasting;
- Data mining methods;
- Deep Learning;
- Selecting the relevant information from huge volumes of data (high-dimensionality, high-frequency or both);
- How to deal with interoperability issues, access and use big data for statistical purposes and consider samples of it instead of the entire dataset for confidentiality purposes (e.g. scanner data);
- Dealing with computational time issues (feasible algorithms, choice of software);
- Fault-tolerance and resilience;
- Efficient programming.
In summary, the foundational skills in this group map broadly to the Tasks and Methods and Upper-level technologies groups of the skills framework.
It is clear that NSIs would like training that, at the introductory level, includes a more general overview of technologies for working with big data and, at the advanced level, provides deeper knowledge about methodologies and technologies in the area of big data.
Technical skills
Technical skills represent tools and technologies relevant for the use of big data. Depending on the training programme objective, the Technical Skills can also be grouped into Introductory Skills and Advanced Skills.
The Introductory Skills would include:
- Fundamentals of software and hardware to meet the necessary technologies for the collection, storage, processing and reporting, especially for large volumes of data as a basis for the development and adaptation of such techniques to specific problems;
Many NSIs (Spain, Poland, the UK, Denmark, Belgium) mentioned that they would like training in cloud technologies and distributed processing (cited from questionnaire replies):
- Introduction to distributed processing;
- Cloud computing;
- Distributed computing;
- Applied knowledge of distributed processing methods (e.g. Spark);
- Distributed memory systems (clusters, clouds);
- Computing platforms for big data (computing architectures, parallel computing, big data frameworks, especially the Hadoop ecosystem and Spark);
- Storage systems for big data (distributed file systems, parallel file systems, storage technologies);
- Affinity for new technologies (open source tools, Hadoop), hands-on experience with such tools;
- Parallel and distributed computing paradigms;
- MapReduce2 (Pig, Hive, Giraph, Sqoop, Mahout).
The Spark framework and Hadoop are among the most popular technologies:

- Spark (MLlib, Spark SQL, Streaming, GraphX);
- Hadoop file system;
- Impala;
- Storm;
- Flume.
Some countries included Databases and Data Structures skills:
- Management of large (sometimes unstructured) datasets;
- Using, building, maintaining databases;
- Basic SQL knowledge;
- NoSQL (MongoDB);
- Introduction to data structures (e.g. JSON, XML, SQL/NoSQL);
- More complex data structures (e.g. graph databases).
Programming skills are usually represented by R and Python:
- Programming/technologies/scripting languages;
- Basic proficiency in R and/or Python;
- Scala.
Several responses (Latvia, Poland) also mention the following Statistical skills (tools):
- SAS;
- SPSS.
Web scraping plays an important role in the NSIs’ expectations. They would like to learn how to create and use web scraping tools.
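As a minimal illustration of what such a tool involves, the sketch below extracts price strings from HTML using only the standard library's html.parser; the `class="price"` marker is purely hypothetical, and a real scraper must additionally handle fetching, robots.txt, throttling and site-specific markup:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of elements tagged with class="price" (hypothetical)."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

def extract_prices(html):
    parser = PriceParser()
    parser.feed(html)
    return parser.prices

# In practice the HTML would come from urllib.request.urlopen(url).read().
sample = '<ul><li class="price">1.99</li><li class="price">2.49</li></ul>'
```

Applied to scraped price pages, this kind of extractor is the building block behind the price-statistics use cases mentioned in the replies.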
Additionally, respondents mentioned the following skills/technologies/needs:
- TensorFlow;
- MXNet;
- System administrators’ needs;
- Introduction to code repositories (e.g. GitHub);
- Collaborative project management, collaborative project development tools.
Visualization and storytelling skills
Visualization skills play an important role in modern data science.
Introductory visualization skills include (cited from questionnaire replies):
- Data analytical/visualisation skills;
- How to communicate new indicators in easy and understandable ways (especially interactive and dynamic outputs);
- Introduction to basic visualisation techniques (e.g. using R, Shiny, Bokeh).
Advanced Visualization Skills:
- Data analytical/visualisation skills, such as combining various data processing techniques;
- More advanced visualisation (e.g. D3).
Contextual skills
Other important big data skills are related to quality, security and privacy issues:

- Knowledge about the context and environment, the data and the data owners: where are the data, who owns them, how are they generated and stored, what are their characteristics (technical, legal and privacy status, …), what are their limits;
- Understanding of security, quality and privacy issues;
- Quality issues in big data;
- Innovation and contextual awareness;
- Information security and technology risk management.
Soft skills
The NSIs listed the following soft skills needed for the use of big data in official statistics:
- Teamwork;
- Communication;
- Cooperation and negotiation;
- Creative and innovative mind-set;
- Interpersonal and communication skills;
- Leadership and strategic direction;
- Judgement and decision-making;
- Management and delivery of results;
- Building relationships and communication.
4.2.2 Data sources to be covered

The NSIs often mention that they would like to acquire skills suited to particular data sources and data types:
- Analysis and exploitation of web data;
- Social media analysis;
- Working with structured, unstructured and semi-structured data.
Textual data, web-scraped data and sensor data are the most frequently targeted big data types and data sources.
For instance, the following unstructured or semi-structured data types and data sources were mentioned:
- Textual data;
- Social media data;
- Sources providing unstructured data that needs a lot of transformation, e.g. natural language processing or statistical image analysis. This could also be called feature extraction. As a two-way characterization of a data source, one should consider the complexity of the data (messy/unstructured data vs straightforward) and the amount of data (the need for distributed computing may be more critical for some sources).
- Other sources, more closely related to text mining or of a qualitative nature, should also be presented as potential examples, but detailed training is not needed in these areas.
- Large scale and/or complex administrative data sources.
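The "feature extraction" step mentioned above can be illustrated with a minimal sketch. The example below, using only the Python standard library, turns free text into a fixed-length numeric feature vector (a bag-of-words count); the vocabulary and the input sentence are invented for illustration, and a real pipeline would add proper tokenization, stemming, weighting (e.g. TF-IDF) and a learned vocabulary.

```python
from collections import Counter
import re

def text_features(text, vocabulary):
    """Turn free text into a fixed-length count vector (bag of words).

    A toy illustration of feature extraction from unstructured data:
    each position in the output counts one vocabulary term.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocab = ["price", "hotel", "room"]  # hypothetical vocabulary
vec = text_features("Hotel room price: the room was cheap.", vocab)
print(vec)  # [1, 1, 2]
```

Once text is represented this way, standard statistical and machine learning methods can be applied to it like to any other structured dataset.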
Web-scraped data is the most frequently mentioned data source in the responses (cited from the questionnaire replies):
- The web-scraping techniques and potential sources for web scraping (typically prices) are very important and relevant for current developments. This should definitely be covered in the training (sources, tools, methodological problems and solutions, examples).
- Web-scraped data have not yet proved to be a success, but we are looking into that in certain fields. We have other sources as well but no special training needs recognized for those.
- Web-scraped data (INE).
- Web scraping as an additional source for updating registers (e.g. tourism register).
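A minimal sketch of the price-scraping idea mentioned in the replies is shown below, using only the Python standard library. The markup convention (a `class="price"` attribute) and the inlined HTML snippet are assumptions for illustration; a production scraper would fetch pages over HTTP (e.g. with urllib or requests), respect robots.txt and rate limits, and handle much messier markup.

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Collect the text content of elements marked with class 'price'.

    A minimal sketch of web scraping for price statistics; the markup
    convention is hypothetical.
    """
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:  # hypothetical markup convention
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

# In practice the HTML would come from an HTTP response; here it is inlined.
html = '<ul><li class="price">19.99</li><li class="price">4.50</li></ul>'
scraper = PriceScraper()
scraper.feed(html)
print(scraper.prices)  # ['19.99', '4.50']
```

The methodological problems mentioned by the NSIs (source selection, representativity, page changes) sit on top of this simple mechanical step.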
Smart meter data and sensor data are targeted as well (cited from the questionnaire replies):
- Water and electricity consumption;
- Smart meter data;
- Sensor data; any other potential source that is quantitative in nature (financial transactions, sensors) should also be included (sources, tools, methodological problems and solutions, examples);
- Sensor data (especially images).
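A typical first processing step for smart meter and sensor data is aggregating raw interval readings into per-meter, per-day totals before they can feed consumption statistics. The sketch below uses only the Python standard library; the meter identifiers, timestamps and kWh values are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw readings: (meter id, interval timestamp, kWh in interval).
readings = [
    ("m1", "2017-06-01T00:15", 0.4),
    ("m1", "2017-06-01T13:30", 1.1),
    ("m1", "2017-06-02T09:00", 0.7),
    ("m2", "2017-06-01T08:45", 2.0),
]

def daily_consumption(rows):
    """Sum interval readings into per-meter, per-day totals.

    A sketch of the aggregation needed before smart meter data can
    feed electricity consumption statistics.
    """
    totals = defaultdict(float)
    for meter, ts, kwh in rows:
        day = datetime.strptime(ts, "%Y-%m-%dT%H:%M").date()
        totals[(meter, day.isoformat())] += kwh
    return dict(totals)

print(daily_consumption(readings))
```

In practice this step would run on a distributed platform (e.g. Spark) because of data volume, but the aggregation logic is the same.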
Structured data:
- Traffic loops data and similar data sources.
Other expected sources are mobile phone data or mobile telephony service data.
NSIs would also like to analyse scanner data.
Financial transaction data would be an appropriate data source for several NSIs (Malta, Hungary, Spain, Austria).
Finally, satellite and aerial photo data and licence plate data have been mentioned.
4.2.3 Staff members training

Depending on the number of employees in a particular NSI, different numbers have been mentioned – from 3-4 people (including all professional types) to one statistician per domain, one methodologist per domain and several IT experts. Several NSIs responded that they would like all their staff to be trained.
For instance, Slovenia indicated that in the broad sense (understanding and usage of big data sources) many NSI members (especially subject-matter statisticians) should be trained, while in the narrow sense (processing big data, creating statistics and quality indicators) only a few NSI members (two methodologists and two IT experts) need to be trained.
4.2.4 Training needs for different staff members

NSIs suggested that the training should be differentiated by staff type and level. For instance, the NSI of Denmark proposed two levels of courses (introductory and deep) for different staff types. Introductory and advanced skill levels were previously discussed in section 4.2.1 on required skills.
The training can be targeted at individual experts as well as at data science teams.
Training targeted at levels:
- Introductory level, i.e. knowledge but not necessarily the ability to implement. This is a prerequisite for having an informed opinion about, for example, the technical solutions needed. The foundational skills for a statistician vs. a methodologist would be fairly similar; more advanced big data skills, however, would likely deviate. For example, visualization may become more important for domain statisticians, while methodologists may need to incorporate specific big data knowledge into specific areas of expertise (e.g. use of computationally intensive algorithms for disclosure checking). IT experts have a completely different set of training requirements: the emphasis will be much more on data engineering and distributed processing. They should already have good programming skills and do not necessarily need knowledge of statistics; their needs should be met through a separate training course.
- Deep level, i.e. the ability to implement and develop solutions further. Here the participants are already proficient with the technology, and the training format will be more that of a workshop, where people work together to solve a specific problem by applying techniques at a high level.
Training targeted at data science teams:
Generally, it was felt that there should be data science teams, as data science requires versatile skills and no single person is expected to master all of them. This implies specific training needs for individual experts but also some common, more general ones. Teams can be composed of different specialists rather than relying on single data scientists. Training needs differ for team leaders and experts.
For instance, the focal point from France mentions that the objective would be to create multidisciplinary teams.
Training different by profiles:
- IT support group;
- Statistical group/methodologists/domain units;
(Latvia)
- Infrastructure, fault tolerance techniques, security, optimization, storage, etc. for IT experts;
- Machine learning and big data frameworks for methodologists and statisticians.
(Spain)
- IT tools for managing big data sources (IT infrastructure is not an issue): IT experts and/or general methodologists.
- Skills regarding the potential of big data sources and their usage in statistical production: mostly statisticians in particular domains.
- Methods for nowcasting: mostly general methodologists.
- Negotiating skills: here we have a real drawback (in my experience and opinion), because people do not have the skills to negotiate with data providers in order to ensure data access. This is crucial.
(Slovenia)
- General statistical methodologists;
- Statisticians in a particular domain;
- IT experts.
(Slovakia)
Training related to different data types/data sources:
- Sensor data, mobile phone data, financial transaction data, etc. are similarly structured; someone with experience of one such source can quite easily switch to similar sources. General methodologists are the most appropriate to work with such data.
- The same goes for unstructured data (web scraping, social media data). General methodologists are the most appropriate to work with such data; IT experts with deep methodological knowledge are also suitable (web scraping, detection of phenomena of interest).
4.2.5 Training timeline

Many NSIs mentioned that training should start as soon as possible and be continuous; NSIs listed 2018 as a possible starting point.
The NSI of France states that the training should start within two years.
The NSI of Belgium specifies that some data types (scanner data used for the CPI and HICP) are already used for statistical purposes, and that in the next 2-3 years mobile phone data, smart meter data and web-scraped data would come into use in statistical offices.
The NSI of the UK mentions that they have access to data (textual data) that they would like to start analysing.
Indeed, several NSIs have already started training.
One NSI mentioned that training should be project-oriented.
Two NSIs stated that using big data is a long-term goal and that training can be organized in the coming years.
4.2.6 Training needs priorities

Priorities for training methods/knowledge transfer types:
- For the introductory level and for acquiring specific programming skills, a cost-effective solution would be web-based courses; e.g. learning Spark is better done by following a specialization on Coursera than by attending a one-week course on site. As mentioned, for intermediate and especially advanced courses the format could be more workshop-like ("jam session"). One NSI mentioned that they are not yet at the stage where they can transmit skills systematically and formally – they are still building knowledge. Knowledge is currently transmitted informally and ad hoc, from the people who have worked with the data and developed methods to statistical domain specialists, in two main areas.
- Another NSI responds that training methods do not have to be presential (face-to-face): they can be webinars, online courses, use of social media, etc.
- Some of the training could be done by simply compiling study material and making it available electronically. Otherwise, webinars are useful for learning to use specific software. Also, some kind of support channel or dedicated wiki page with a discussion board would be very useful.
- One NSI would like to have more expert courses for a smaller number of participants with a high level of expertise, and to use more external courses for this group.
Priorities for data sources/data types:
- Web-scraped data and mobile phone data are the most frequently prioritized data types;
- Sensor data;
- Scanner data (price statistics);
- Water and electricity consumption (various social and economic domains);
- Financial transaction data (various economic domains);
- Textual data (in general);
- Priority should be focused on guaranteeing access to data sources, such as smart meters and mobile phone data.
Priorities for technologies/methods/skills:

- Foundational data science skills;
- Best practices or guidelines for forming big data partnerships;
- Technical aspects of big data processing;
- Methods for big data processing;
- Methods to combine big data sources with other types of data sources;
- Managing large (sometimes unstructured) datasets;
- Using, building and maintaining databases;
- Selection of the relevant information from huge volumes of data (high dimensionality, high frequency, or both);
- Methods for nowcasting (rapid estimates);
- Hands-on experience with big data IT tools;
- Setting up a test environment, so the skills required by the system administrator are the most needed;
- IT skills and methods applied to web-scraped data, as well as text analysis skills;
- Quality measures for the processing of big data and the calculated statistics;
- Programming and collaborative management tools.
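The nowcasting (rapid estimates) priority mentioned above can be sketched in its simplest bridge-equation form: regress an official statistic on a timely auxiliary indicator (e.g. a web-scraped or search index) and use the already-observed current value of the indicator to estimate the not-yet-published official figure. The sketch below uses only the Python standard library; all figures are invented for illustration, and real nowcasting models are considerably more elaborate.

```python
def ols(x, y):
    """Ordinary least squares fit y ≈ a + b*x for a single regressor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Hypothetical history: a timely indicator paired with the official
# statistic it should track.
indicator = [10.0, 12.0, 11.0, 14.0]
official = [100.0, 104.0, 102.0, 108.0]

a, b = ols(indicator, official)
# Nowcast: the indicator for the current period is already observed,
# while the official figure is not yet available.
nowcast = a + b * 13.0
print(round(nowcast, 1))  # 106.0
```

The same idea scales up to richer models (multiple indicators, mixed frequencies), which is where the methodological training requested by the NSIs comes in.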
Priorities for other issues:

- Time is the major constraint; a lot of tutorials and training material is already available for free on the Internet;
- Training of up to one week; training on the job.
4.3 Big data training needs survey summary

The Big Data Training Needs Survey was conducted in July-September 2017. The responses received from the big data focal points included filled-in questionnaires (from 17 NSIs) and general comments (from 3 NSIs).
In particular, the survey defined the groups of skills that NSIs would like to acquire:
- Methodological skills;
- Technical skills;
- Visualization and storytelling skills;
- Contextual skills; and
- Soft skills.
A number of individual skills stand out, such as those targeted at understanding big data in general and linking big data to statistical data sources. Cloud technologies and distributed processing, the Spark framework and Hadoop, and R and Python are among the most required tools and technologies. According to the NSIs, visualization can be addressed with R, Shiny, Bokeh and D3. Quality issues of big data and understanding the context and environment of data and data owners are frequently mentioned, as are soft skills such as teamwork, communication and leadership.
In order to compare the existing skills of statisticians with the skills listed in the big data skills framework, JSI analysed the set of skills listed in the Big Data Training Needs Survey.
Table 10 maps the skills identified by the Big Data Training Needs Survey to the big data skills framework. It shows in detail the skills that NSIs would like to target.
Table 10: Skills from Big Data Training Needs Survey According to Skills Framework

SOFT SKILLS: Communication, Innovation and contextual awareness, Teamwork, Creative problem solving, Negotiation, Leadership, Delivery of results, Information privacy, Coordination

STATISTICAL AND DATA SCIENCE TASKS: Nowcasting and projections, Nonresponse adjustment and weighting, Analysis of aggregated data, Multivariate analysis, Time series and seasonal adjustment, Quality assessment, Data visualization, Data resource management, Setting up data warehouses, Data storage, Data processing, Data conversion

ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES): Quality assurance and compliance, Project management

INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES): System architecture, Hardware and infrastructure, Developing software, Systems and software maintenance

PROGRAMMING LANGUAGES: R, Python, Scala

ARCHITECTURE TOOLS AND TECHNOLOGIES: Distributed parallel architecture, Distributed computing, Distributed filesystems

CLOUD TOOLS AND TECHNOLOGIES: Cloud computing

HADOOP: Hadoop

UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES: Machine learning, Databases, Understanding algorithms, Data mining, Deep learning, Artificial intelligence, Natural language processing, Stream processing and analysis, IoT (Internet of Things), Multimedia analysis, Web technologies (web scraping)

STATISTICS AND BUSINESS INTELLIGENCE: SAS, Apache Spark, SPSS

DATA MANAGEMENT: Apache Hive, Apache HBase, Apache Sqoop, Apache Pig, Cloudera Impala

DATABASES: Massively parallel-processing databases (MPP), DBMS, SQL, NoSQL, MongoDB

SEARCH TECHNOLOGIES: Search-based applications

VISUALIZATION TECHNOLOGIES: D3, Shiny, Bokeh
Several data types/data sources, such as web-scraped data, mobile phone data, sensor data and scanner data, have been frequently listed as priorities.
The survey also established that training should be provided at different levels (introductory and advanced) and can be targeted at different profiles of employees; both individuals and big data teams can receive training. Training should take into account the big data sources and types that would be addressed in the ESS.
The minimum and maximum number of trainees varies depending on the size of the NSI and the NSI training strategy.
The priorities defined in the survey included:
- Priorities for training methods/knowledge transfer types (such as webinars, online courses along with face-to-face training);
- Priorities for data sources/data types (web-scraped data and mobile phone data are frequently mentioned in the priorities);
- Priorities for technologies/methods/skills (like introductory big data methodologies and skills);
- Priorities for other issues (training of up to one week, training on the job).
4.4 Towards bridging the skills gap for big data in statistics

The information presented in the previous sections shows that there is a gap between the current skills of NSI statisticians who would like to work with big data and the skills that would allow them to analyse big data for statistical purposes. The Big Data Training Needs Survey provided valuable insight into how to bridge this gap and gave a basis for developing the training strategy for the NSIs.
In particular, the Big Data Training Needs Survey shows how the European NSIs view the training aspects in relation to their working environment in the coming years.
One of the aspects identified by the survey is that training should be provided at different levels, where many team members would receive introductory training with the possibility of proceeding to an advanced level through advanced training.
Some training courses could also be delivered in a way that allows the whole team to be involved; online training methods can be seen as a possible delivery option.
The majority of the NSIs indicated that they would like to start the training as soon as possible, in 2018.
5. Training objectives and content for the design of a training program
5.1 Learning models and curriculum design approaches
5.1.1 Learning models

The literature describes a number of learning models and curriculum design approaches. The Horizon 2020 EDISON project (1 September 2015 – 31 August 2017) [1] developed a framework defining the data science profession. The EDISON framework includes components such as the Data Science Competences Framework, the Data Science Body of Knowledge, the Data Science Model Curriculum and the Data Science Professional Framework.
The learning models frequently described in the literature (and addressed in the EDISON project) are:

- Bloom's taxonomy;
- Problem-based learning; and
- Competence-based learning.
Bloom’s taxonomy
Bloom’s taxonomy [18] provides a conceptual framework for organizing the levels at which a topic or subject is learned, and assigns to each level action verbs that help to understand the activities associated with that level of learning.
Figure 19 presents the structure of Bloom’s taxonomy.
The levels of Bloom’s taxonomy are: remembering (students identify the relevant technologies), understanding (students can explain how technologies work), applying (the right technology is chosen for a specific problem), analyzing (relationships are analyzed), evaluating (judgements are made) and creating (new solutions are created).
Figure 19: Bloom’s taxonomy
Constructive Alignment and Problem-based Learning
In contrast to traditional learning models, in which the teacher provides knowledge that students passively absorb and memorize before being evaluated through examinations, the constructive alignment model gives students a central role in the learning and knowledge construction process [19].
Problem-based learning
Problem-Based Learning (PBL) [20,21] is built around problems that students solve with teacher consultation. PBL assumes active student involvement and motivation through evaluation, and is considered one form of constructive alignment. Constructive alignment was described by Biggs [22]: the term refers to the learner constructing his or her own learning through relevant learning activities, with the teacher setting up a learning environment that supports achieving the desired learning outcomes. In the area of computer science, the constructive alignment process was described by Ben-Ari [23].
The EDISON framework developers state that constructive alignment and problem-based learning can be implemented in the form of project-based learning: regular classes provide students with competences related to specific knowledge areas, while additional project classes allow them to establish links between these competences [24].
Competence Based Learning Model
Competency-Based Learning (CBL), or Competence-Based Education (CBE), also known as outcomes-based learning, takes a different approach from traditional education. Instead of focusing on how much time students spend learning a particular topic or concept, CBL assesses whether students have mastered the given competencies – the knowledge, skills and abilities [25]. CBL is often used for re-skilling or additional training scenarios. A key benefit of CBL is its flexibility, since it allows both self-study and instructor guidance. CBL programs usually offer the following features [26]:
• Self-pacing;
• Modularization;
• Effective assessments;
• Intentional and explicit learning objectives shared with the student;
• Anytime/anywhere access to learning objects and resources;
• Personalized, adaptive or differentiated instruction;
• Learner support through instructional advising or coaching.
The literature [26] states that CBL was created to address the needs of non-traditional students who cannot devote their entire time to traditional academic studies, and to give companies effective models for (re/up)skilling their staff. CBL approaches, or mixed approaches involving CBL, therefore appear suitable for modeling the training objectives and training content for trainees in the big data area for statistical offices in Europe.
5.1.2 Curricula guidelines

The ACM Committee for Computing Education in Community Colleges (CCECC) and the IEEE Computer Society have jointly produced curricular recommendations and guidelines for baccalaureate computing programs, known collectively as the ACM Computing Curricula series. The guidelines include the ACM Competency Model of Core Learning Outcomes and Assessment for Associate-Degree Curriculum in Information Technology (IT2014) [25]. The recommendations focus on student competencies instead of credit points, and the measurable learning outcomes take Bloom's taxonomy into account.
5.2 Related curricula and classifications
5.2.1 ACM classification for computer science

In the ACM classification for computer science [27], the Body of Knowledge is defined as a specification of the content to be covered in a curriculum implementation. The ACM Body of Knowledge includes 18 Knowledge Areas (KA):
AL - Algorithms and Complexity
AR - Architecture and Organization
CN - Computational Science
DS - Discrete Structures
GV - Graphics and Visualization
HCI - Human-Computer Interaction
IAS - Information Assurance and Security (new)
IM - Information Management
IS - Intelligent Systems
NC - Networking and Communications (new)
OS - Operating Systems
PBD - Platform-based Development (new)
PD - Parallel and Distributed Computing (new)
PL - Programming Languages
SDF - Software Development Fundamentals (new)
SE - Software Engineering
SF - Systems Fundamentals (new)
SP - Social Issues and Professional Practice
The courses that make up a Computer Science curriculum should cover two types of topics: those mandatory for every curriculum (“Tier-1”) and those expected to be covered at least at 80% (“Tier-2”). Tier-1 and Tier-2 topics are defined differently for different programmes and specializations.
The workplace skills from the ACM classification describe the ability of the student/trainee to:

- function effectively as a member of a diverse team;
- read and interpret technical information;
- engage in continuous learning;
- maintain professional, legal and ethical behavior;
- demonstrate business awareness and workplace effectiveness.
Each KA in the ACM classification is organized into a set of Knowledge Units (KU). Each KU in turn lists a set of topics and learning outcomes (LO). The LO are associated with a level of mastery derived from Bloom's taxonomy (familiarity, usage and assessment).
5.2.2 Curriculum development in the EDISON project

The Data Science model curriculum developed within the EDISON project is based on the ACM guidelines, taking into account the Competence-Based Learning model. Following the ACM definition, the curriculum is organized into core and elective topics: core topics are required in every data science program, while elective topics are specific to particular areas.
Table 11 presents the knowledge levels for learning outcomes in the Data Science model curriculum of the EDISON project.
Table 11: Knowledge Levels for Learning Outcomes in Data Science Model Curricula (MC-DS)
Level: Familiarity
Action verbs: Choose, Classify, Collect, Compare, Configure, Contrast, Define, Demonstrate, Describe, Execute, Explain, Find, Identify, Illustrate, Label, List, Match, Name, Omit, Operate, Outline, Recall, Rephrase, Show, Summarize, Tell, Translate

Level: Usage
Action verbs: Apply, Analyze, Build, Construct, Develop, Examine, Experiment with, Identify, Infer, Inspect, Model, Motivate, Organize, Select, Simplify, Solve, Survey, Test for, Visualize

Level: Assessment
Action verbs: Adapt, Assess, Change, Combine, Compile, Compose, Conclude, Criticize, Create, Decide, Deduct, Defend, Design, Discuss, Determine, Disprove, Evaluate, Imagine, Improve, Influence, Invent, Judge, Justify, Optimize, Plan, Predict, Prioritize, Prove, Rate, Recommend, Solve
Annex 8.7 provides a template and examples for defining the learning outcomes related to the enumerated Data Science Competences Framework (CF-DS).
Developing training objectives for a big data training program would follow the learning outcomes model, where each topic (knowledge area) is supported by a number of learning outcomes.
5.2.3 Curriculum development in the EDSA project

The European Data Science Academy (EDSA) project [6] designs curricula for data science training and education across the European Union (EU). The EDSA establishes a virtuous learning production cycle whereby: a) the sector-specific skillsets required of data scientists across the main industrial sectors in Europe are analyzed; b) modular and adaptable data science curricula meeting industry expectations are developed; and c) data science training supported by multi-platform and multilingual learning resources is delivered.
The EDSA courses portfolio
The EDSA provides a wide spectrum of courses from the following categories:
Self-study courses: These courses consist of self-study learning materials available as Open Educational Resources (OERs). Learners can study them at their own pace, as there is no predetermined start or end date.
MOOCs: These Massive Open Online Courses (MOOCs) are available on external MOOC platforms, such as FutureLearn.

Blended courses: These courses are taught in a blended way (face-to-face and online) by EDSA partners and associate EDSA partners.

Face-to-face courses: These courses are taught face-to-face by EDSA partners and associate EDSA partners.
As can be seen from the above list, the EDSA courses cover all types of learning contexts, from the traditional face-to-face pedagogical model to more recent trends in online education (MOOCs and OERs).
Delivery channels and formats
EDSA courses are delivered:
• Via the Moodle Learning Management System (HTML format);
• As an eBook (available via iBooks, in ePUB format).
EDSA curricula
The EDSA curriculum contains courses in several stages: Foundations; Storage and Processing; Analysis; Interpretation and Use.
Tables 12-13 present the evolution of the EDSA curriculum over time.
Table 12: Core EDSA Curriculum, version 1
Topic Stage
Foundations of Data Science Foundations
Foundations of Big Data Foundations
Statistical / Mathematical Foundations Foundations
Programming / Computational Thinking (R and Python) Foundations
Data Management and Curation Storage and Processing
Big Data Architecture Storage and Processing
Distributed Computing Storage and Processing
Data Intensive Computing Storage and Processing
Machine Learning, Data Mining and Basic Analytics Analysis
Big Data Analytics Analysis
Process Mining Analysis
Data Visualisation Interpretation and Use
Visual Analytics Interpretation and Use
Finding Stories in Open Data Interpretation and Use
Data Exploitation including data markets and licensing Interpretation and Use
Table 13: Core EDSA Curriculum, version 3
Topic Stage
Foundations of Data Science Foundations
Foundations of Big Data Foundations
Statistical / Mathematical Foundations Foundations
Programming / Computational Thinking (R and Python) Foundations
Data Management and Curation Storage and Processing
Big Data Architecture Storage and Processing
Distributed Computing Storage and Processing
Data Intensive Computing Storage and Processing
Linked Data and the Semantic Web Storage and Processing
Machine Learning, Data Mining and Basic Analytics Analysis
Big Data Analytics Analysis
Process Mining Analysis
Social Media Analytics Interpretation and Use
Data Visualisation and Storytelling Interpretation and Use
Data Exploitation including data markets and licensing Interpretation and Use
Table 14 lists recommendations for EDSA curriculum development that were taken into account when developing training objectives for statistical offices in Europe in the area of big data.
Table 14: Recommendations for EDSA Curriculum Development
Title | Intervention level | Summary description

Holistic training approach | General training approach | Refine the training approach and curriculum cycle to strengthen skills along the full data exploitation chain.

Open source based training | Existing curriculum design | Continue current technical and analytical training based on open source technologies; apply a cross-tool focus to deliver overarching training.

Soft skills training | Expansion of curriculum | Implement soft skill training to increase the performance and organizational impact of data scientists / data science teams.

Basic data literacy training | Expansion of curriculum | Develop basic data literacy and data science training for non-data scientists to improve basic skills across organizations and facilitate the uptake of data-driven decision making and operations.

Blended training | Course delivery | Develop blended training approaches including sector-specific exercises and examples to increase the effectiveness of training delivery.

Data science skills framework | Training approach and delivery | Implement a data science skills framework to structure skills requirements, assess skills of data scientists, and identify individual skills needs.

Navigation and guidance | Training market | Develop quality assessment of third-party courses; provide navigation support to identify relevant trainings from EDSA and third parties.
The EDSA project's experience in creating a personalized learning environment is considered useful for developing training opportunities in the big data area for statistical offices in Europe.
5.3 Training objectives for statistical offices in Europe in the area of Big Data
5.3.1 Defining training objectives

Previously, in Task 4, the skill groups (Tables 3 and 4) and the skills from the Big Data Training Needs Survey (Table 10) were defined.
Taking into account the learning models, relevant initiatives and big data training needs described above, Table 15 presents a set of training objectives (based on learning outcomes) for big data training in the statistical offices in Europe.
Table 15: Training Objectives Mapped to Big Data Training Needs

For each training objective (TO), learning outcomes (LO) are given at three knowledge levels – Familiarity, Usage and Assessment – using the action verbs listed in Table 11 (Familiarity: Choose, Classify, Collect, …; Usage: Apply, Analyze, Build, …; Assessment: Adapt, Assess, Change, …).

TO1 – SOFT SKILLS
Big data topics: Communication, Innovation and contextual awareness, Teamwork, Creative problem solving, Negotiation, Leadership, Delivery of results, Information privacy, Coordination.
Familiarity: Use relevant soft skills to solve the problem, obtain results, and communicate and deliver them at different levels.
Usage: Coordinate, organize and lead the team. Identify the strategy for task execution. Motivate the team.
Assessment: Be able to assess the current working strategy and, in particular, the obtained results. Be able to prioritize important tasks. Be able to criticize or defend the selected position. Be able to predict and give recommendations.
STATISTICAL AND DATA SCIENCE TASKS

TO2: Data resource management; Nowcasting and projections; Data visualization; Setting up data warehouses; Data storage; Data processing; Data conversion; Nonresponse adjustment and weighting; Spatial analysis/GIS/cartography; Analysis of aggregated data; Multivariate analysis; Time series and seasonal adjustment

Basic level:
- Execute the data management strategy and data management plan.
- Define technical requirements.
- Recognize big data sources as useful for nowcasting.
- Identify relevant types of big data; select the suitable data formats for a concrete statistical problem.
- Identify data in different formats; select the most appropriate techniques for big data processing.
- Find possible solutions for big data storage for statistical purposes.
- Choose potential technologies for data warehouses.
- Choose appropriate existing analytical methods and existing tools to do specified big data analysis.
- Be able to select the appropriate software to visualize big data.

Intermediate level:
- Develop elements of the data management strategy and data management plan.
- Collect/web-scrape required big data datasets.
- Select and execute the most appropriate techniques for data warehouse set-ups.
- Execute and operate statistical tasks using selected data storage technologies.
- Use standard technologies for big data processing.
- Be able to use standard methods and tools for big data conversion, pulling together heterogeneous data.
- Identify necessary methods and use them in combination if necessary.
- Develop big data analysis applications for specific datasets and tasks or processes.
- Present data in the required form.
- Identify relations and provide consistent reports and visualizations.

Advanced level:
- Evaluate possible data management strategies and combine them into data management plans, taking into account organizational specifics.
- Discover relations; propose optimization and improvements.
- Develop new models and methods (e.g. for nowcasting) if necessary.
- Evaluate outcomes of data processing.
- Create solutions and methods for data conversion in the statistical domain.
- Recommend and influence improvements based on continuous data analysis.
- Discover hidden relations via visualizations.
- Create and optimize visualizations to properly support decision making.
- Predict and evaluate the differences between different technological solutions for concrete statistical problems.
- Build visualizations for complex and variable data.

ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES)

TO3: Big data quality assessment, assurance and compliance; Big data project management

Basic level:
- Follow big data specific quality frameworks and quality assurance methodologies.
- Follow the specified action plan for a project.

Intermediate level:
- Develop and adapt big data specific methodologies for quality assurance for particular applications of big data and use of big data sources.
- Identify the elements of the action plan for a specific project.

Advanced level:
- Evaluate and predict the success of big data specific quality frameworks and quality assurance methodologies in the context of specific projects.
- Evaluate and predict the success of different strategies and action plans for projects.
- Evaluate risks related to big data technologies.
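Several of the TO2 skills concern collecting and web-scraping big data sources. As a minimal illustration only (the HTML snippet and all names are made up for this sketch; a real collection task would fetch pages with urllib.request or a dedicated scraping framework), the standard-library html.parser can extract structured values from markup:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []
    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True
    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False
    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

# A literal snippet keeps the sketch self-contained; in practice the page
# would be downloaded first.
html = '<div><span class="price">1.99</span><span class="price">2.49</span></div>'
parser = PriceParser()
parser.feed(html)
print(parser.prices)  # ['1.99', '2.49']
```

The same pattern (walk the markup, keep state, collect matching text) underlies the online-price collection use case discussed later among the ESTP courses.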
INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES)

TO4: System Architecture; Hardware and infrastructure; Developing statistical software; Systems and software maintenance

Basic level:
- Provide requirement analysis; perform architectural design and requirements allocation according to specific big data applications for statistical production.
- Compare traditional software product requirements with system architectural requirements for big data.
- Provide software prototype mock-ups.
- Compare traditional software requirements with software requirements for big data.
- Select the relevant technologies for developing big data software.
- Follow the established process of systems and software maintenance (in the area of big data in statistical production).

Intermediate level:
- Be able to solve the key issues in software design: concurrency, control and handling of events, data persistence, distribution of components, error and exception handling, interaction and presentation, security (in the area of big data).
- Develop the infrastructure and allocate the relevant hardware components within the developed infrastructure, taking into account big data specifics.
- Develop actual software for big data in statistics.
- Establish a process of systems and software maintenance in the area of big data.

Advanced level:
- Perform quality analysis and system evaluation for big data in statistics.
- Perform software engineering management.
- Perform software evaluation and maintenance management.
- Perform evaluation of the process of systems and software maintenance (in the area of big data in statistics).

PROGRAMMING LANGUAGES

TO5: R; Python; Scala; Javascript

Basic level:
- Know the basics of R: the R console, data types and structures, exploring and visualizing data, programming structures, main functions of the R base packages, the most common external R packages (ggplot2, stringr, tidyr, dplyr, readr, data.table).
- Know how to work with an R scripting IDE (e.g. RStudio) and how to execute R scripts.
- Be aware of the relevant packages in R for data science, big data and distributed computing: pbdR, rhdfs, rhbase, SparkR, sparklyr, tidytext.
- Know the basics of Python: types and structures, exploring and visualizing data, programming structures, functions, and data relationships.
- Be aware of the relevant packages in Python for data science, big data and distributed computing: PySpark, NumPy, SciPy, Pandas, Scikit-learn, PySAL for spatial data, ClusterPy, GeoGrouper.
- Know the basics of Scala: basic syntax, environment setup, data types, variables, classes and objects, operators, functions, traits, Scala web frameworks; comparing Scala, Java, Python and R in Apache Spark.
- Know the basics of Javascript: reading in data, combining data, summarizing data, iterating and reducing, nesting and grouping data, using Node.

Intermediate level:
- Perform machine learning with R.
- Execute Spark jobs from R.
- Perform interactive data visualizations with RShiny.
- Use Python for big data analytics, machine learning and web scraping.
- Actively use big data packages, such as pbdR, Rmpi.
- Actively use R for the analysis of spatial data.
- Actively use packages in Python for big data, such as PySpark, NumPy, SciPy, Pandas, Scikit-learn, PySAL for spatial data, ClusterPy, GeoGrouper, etc.
- Perform big data analysis with Scala.
- Work with RDDs in Apache Spark using Scala.
- Work with DataFrames in Apache Spark using Scala.
- Use the jvmr package.
- Perform reduction operations and work with distributed key-value pairs, partitioning and shuffling.
- Be able to combine traditional statistical data sources with IoT and other big data sources.
- Perform big data visualization with Javascript.

Advanced level:
- Optimize solutions in R, Python, Scala.
- Adapt algorithms in R, Python, Scala.
- Develop packages for R, Python and Scala.
- Evaluate the benefits of using the specified programming language for the concrete task.

ARCHITECTURE TOOLS AND TECHNOLOGIES

TO6: Distributed and parallel architecture; Distributed computing; Distributed filesystems

Basic level:
- Understand the concept of distributed computing, its strengths and limitations, and its applications in the statistical domain.
- Understand architecture models and parallel programming models.

Intermediate level:
- Be able to choose parallel computing platforms for parallel applications.
- Be able to design and develop distributed systems and distributed systems applications for the statistical domain.
- Be able to set up a Hadoop cluster.
- Be able to apply fundamental computer science methods and algorithms in the development of distributed systems and distributed systems applications.
- Be able to perform system testing in the statistical domain.

Advanced level:
- Identify problems, and explain, analyze, and evaluate various distributed systems solutions in the statistical domain.
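The TO6 idea of dividing independent work over parallel workers can be illustrated at small scale with Python's standard concurrent.futures module; the task function below is a hypothetical stand-in for a per-chunk statistical computation, and the same pattern scales out to process pools or cluster schedulers:

```python
from concurrent.futures import ThreadPoolExecutor

def tabulate(n):
    """Stand-in for an independent per-partition computation."""
    return sum(i * i for i in range(n))

# Distribute the tasks over a pool of workers and gather results in order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(tabulate, [10, 100, 1000]))

print(results)  # [285, 328350, 332833500]
```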
CLOUD TOOLS AND TECHNOLOGIES

TO7: Cloud computing

Basic level:
- Understand the basic concepts and key technologies, strengths and limitations of cloud computing, and possible applications of cloud computing in the statistical domain.

Intermediate level:
- Identify the architecture and infrastructure of cloud computing, including SaaS, PaaS, IaaS, public cloud, private cloud, hybrid cloud, etc.
- Choose the appropriate technologies, algorithms, and approaches for the related issues.
- Provide the appropriate cloud computing solutions and recommendations according to the applications used in the statistical domain.

Advanced level:
- Identify problems and explain, analyze, and evaluate various cloud computing solutions in the statistical domain.
UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES

TO8: Machine learning; Databases; Data mining; Deep learning; Artificial intelligence; Natural language processing; Stream processing and analysis; IoT (Internet of Things); Multimedia analysis; Web technologies (Web scraping)

Basic level:
- Identify statistical problems that can benefit from machine learning and other methods applicable to big data.
- Identify the characteristics of datasets and be able to spot big data in various applications in the statistical domain.
- Understand the basic principles of machine learning, data mining, artificial intelligence, natural language processing, web technologies (and other technologies) and big data in the statistical domain.
- Understand machine learning techniques, web technologies (and other technologies) and computing environments that are suitable for the applications in the statistical domain.
- Identify key challenges for machine learning (and other technologies) in the statistical domain.

Intermediate level:
- Solve problems associated with batch learning and online learning, and with big data characteristics such as high dimensionality, dynamically growing data and scalability issues.
- Develop scaled-up machine learning techniques (and other technologies) and associated computing techniques and technologies for various applications in the statistical domain.
- Implement various ways of selecting suitable model parameters for different machine learning techniques.
- Integrate machine learning libraries (and other technologies), and mathematical and statistical tools, with modern technologies like the Hadoop distributed file system and the MapReduce programming model.
- Use tools for big data analytics and present the analysis results.
- Be able to integrate various statistical and big data types and sources.

Advanced level:
- Perform evaluation of the selected technologies.
- Understand the impact of big data on decisions and strategy in the statistical domain.
- Be able to predict the possible challenges that particular technologies bring for a specific task.

STATISTICS AND BUSINESS INTELLIGENCE

TO9: Apache Spark

Basic level:
- Have a general knowledge of Apache Spark and its possible applications for the statistical domain.
- Describe Spark's fundamental mechanics.
- Learn how to work with RDDs and DataFrames in Spark.
- Understand Spark internals.

Intermediate level:
- Extract, process, analyze and visualize data, and perform machine learning, with Spark.
- Implement typical use cases for Spark.
- Build data pipelines and query large data sets using Spark SQL and DataFrames.
- Analyse Spark jobs using the administration UIs and logs inside Databricks.
- Create Structured Streaming and Machine Learning jobs.
- Use the core Spark APIs to operate on data.

Advanced level:
- Evaluate the benefits of using Spark for statistical purposes.
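The contrast between batch and online learning under TO8 can be made concrete with a streaming statistic. A classic illustration is Welford's online algorithm, which maintains the mean and variance of a data stream in a single pass without storing the observations; the data values below are illustrative:

```python
class RunningStats:
    """Welford's online algorithm: one-pass mean and variance,
    suitable for data that arrive as an unbounded stream."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 6.0, 8.0]:
    stats.update(x)
print(stats.mean, stats.variance)  # 5.0 6.666666666666667
```

Each update costs constant time and memory, which is what makes the approach viable for dynamically growing data where a batch recomputation would not be.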
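The RDD programming model behind several TO9 skills (map, flatMap, reduceByKey) can be sketched without a Spark cluster. The plain-Python word count below mirrors the canonical Spark example; in PySpark the same pipeline would be roughly `sc.parallelize(lines).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(operator.add)`. The input lines are illustrative:

```python
from collections import defaultdict

lines = ["big data in statistics", "big data tools", "statistics with spark"]

# map/flatMap step: flatten each line into (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey step: aggregate the counts per word
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(dict(counts))
```

Spark executes the same two logical steps, but partitions the pairs across the cluster and shuffles them by key before reducing.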
DATA MANAGEMENT

TO10: Apache Hive; Apache HBase; Apache Sqoop; Apache Pig; Cloudera Impala

Basic level:
- Have a general knowledge of Apache Hive, Pig, HBase, Sqoop and Cloudera Impala and their possible applications in the statistical domain.
- Describe the tools' fundamental mechanisms.
- Differentiate Hive from traditional relational database management systems.
- Create and query tables; create databases; import/add and delete data.
- Know data techniques using Sqoop.
- Be able to set up and customize Cloudera Manager to monitor and improve the performance of a Hadoop cluster of any size.

Intermediate level:
- Use advanced data structures with Hive.
- Set up and load partitioned tables.
- Use views to query data.
- Create indexes for tables.
- Utilize HDFS to store and manage big data.
- Use advanced Hive and HBase features: the HBase data model, HBase Shell, the HBase Client API.
- Perform real-time, interactive analytical queries of the data stored in HBase or HDFS with Cloudera Impala.

Advanced level:
- Evaluate the benefits of using big data management technologies for statistical purposes.

DATABASES
TO11: Massively parallel-processing (MPP) databases; DBMS; SQL; NoSQL; MongoDB

Basic level:
- Have a broad understanding of database concepts and database management system software.
- Have a high-level understanding of the major DBMS components and their functions.
- Understand the differences between relational and non-relational databases.
- Understand how to choose a suitable database for an application in the statistical domain.
- Know the basics of SQL, NoSQL, MongoDB.
- Be comfortable with query and update languages.

Intermediate level:
- Be proficient in SQL.
- Define, compare and use the four types of NoSQL databases (document-oriented, key-value pairs, column-oriented and graph).
- Know the concepts of replication, distribution, sharding, and resilience in a NoSQL database.
- Work with the common use-cases and architectures of Mongo.
- Use Mongo's built-in JavaScript interpreter.
- Query Mongo using Mongo's JSON-based query language.
- Index Mongo collections.
- Handle data with Mongo's built-in MapReduce capabilities.

Advanced level:
- Evaluate database development tools and programming languages (for statistical purposes).
- Understand the benefits of using database technologies for the statistical domain.
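The SQL basics listed under TO11 can be practised without installing any server using Python's built-in sqlite3 module. The table, column names and figures below are purely illustrative:

```python
import sqlite3

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enterprise (nace TEXT, turnover REAL)")
conn.executemany("INSERT INTO enterprise VALUES (?, ?)",
                 [("C10", 120.0), ("C10", 80.0), ("J62", 300.0)])

# An aggregate query of the kind used for statistical tabulation.
rows = conn.execute(
    "SELECT nace, SUM(turnover) FROM enterprise GROUP BY nace ORDER BY nace"
).fetchall()
print(rows)  # [('C10', 200.0), ('J62', 300.0)]
conn.close()
```

The same GROUP BY pattern carries over directly to Hive, Impala and the MPP databases mentioned above, which is why SQL fluency is treated as the entry point.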
SEARCH TECHNOLOGIES
TO12: Search engines

Basic level:
- Get basic knowledge in the area of search-based apps.

Intermediate level:
- Understand the concepts of Apache Lucene and the respective APIs.
- Understand Apache Solr.
- Learn indexing and searching using Solr.

Advanced level:
- Evaluate the appropriateness and the benefits of using a search engine approach for particular statistical applications.

VISUALIZATION TECHNOLOGIES

TO13: D3 (JavaScript); Shiny (R); Bokeh (Python)

Basic level:
- Effectively use data visualization tools.
- Be aware of data visualization libraries.

Intermediate level:
- Design effective data visualizations.
- Be able to effectively work with D3: selections, SVGs, data binding, styling with D3, scaling with D3, interactive visualizations.
- Be able to build Shiny apps, customizing reactions and appearance.
- Be able to effectively use Bokeh for statistical purposes.

Advanced level:
- Evaluate visualization outcomes.
- Be able to select an appropriate data visualization tool/library for statistical purposes.

HADOOP

TO14: Apache Hadoop; RHIPE; YARN

Basic level:
- Master the concepts of HDFS and the MapReduce framework.
- Understand the Hadoop 2.x architecture.
- Understand Spark and its ecosystem.

Intermediate level:
- Write complex MapReduce programs.
- Learn data loading techniques using Apache Sqoop and Apache Flume.
- Perform data analytics using Pig, Hive and YARN.
- Implement HBase and MapReduce integration.
- Implement advanced usage and indexing; schedule jobs using Oozie.
- Implement best practices for Hadoop development.
- Work on a real-life project on big data analytics.

Advanced level:
- Evaluate the outcomes of the performed task.
- Evaluate the benefits of using Hadoop technologies for statistical tasks.

DATA MINING TOOLS AND TECHNOLOGIES

TO15: Apache Mahout; BigInsights; BigML; Google Prediction; LIBSVM; Orange; RapidMiner; Scikit-learn; Spark MLlib; Vowpal Wabbit; Weka

Basic level:
- Be able to identify problems that can be addressed via data mining methods.
- Understand data mining techniques.
- Have a working knowledge of the strengths and limitations of modern data mining methods (algorithms).
- Be able to select the appropriate tool for data mining tasks.

Intermediate level:
- Know how to set up data for data mining experiments.
- Solve statistical problems with specific data mining tools.
- Use advanced data mining options with specific libraries.

Advanced level:
- Perform evaluation of the selected technologies.
- Be able to predict the possible challenges that particular technologies bring for a specific task.
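The search-engine concepts under TO12, as implemented in Lucene and Solr, rest on the inverted index: a mapping from each term to the documents containing it. A toy Python version, with illustrative documents, shows both the indexing and the conjunctive (AND) query step:

```python
from collections import defaultdict

docs = {
    1: "census of population and housing",
    2: "population register data",
    3: "housing price statistics",
}

# Indexing: term -> set of document ids containing the term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    """Return the ids of documents containing all query terms."""
    result = set(docs)
    for term in terms:
        result &= index.get(term, set())
    return sorted(result)

print(search("population"))        # [1, 2]
print(search("housing", "price"))  # [3]
```

Production engines add tokenization, ranking and compressed index storage on top, but the lookup-and-intersect core is the same.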
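Many of the data mining tools listed under TO15 ship clustering algorithms such as k-means. The following self-contained Python sketch, using one-dimensional data and fixed initial centres chosen for brevity, shows the two alternating steps: assigning points to the nearest centre and moving each centre to the mean of its cluster:

```python
def kmeans(points, centers, iterations=10):
    """Minimal k-means on 1-D points with given initial centres."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans(data, centers=[0.0, 15.0]))  # [2.0, 11.0]
```

Library implementations (e.g. in Scikit-learn or Spark MLlib) differ mainly in initialization strategy, distance metric and scalability, not in this core loop.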
5.4 Content development in the area of Big Data
5.4.1 ESTP content
In Chapter 5, possible content sources relevant for the training objectives specified in Chapter 4 have been identified.
The European Statistical Training Programme (ESTP) contains a set of courses developed especially for statistical offices in Europe. Table 16 lists the identified ESTP courses that could be relevant to the big data training programme.
Table 16: Related ESTP Courses

ESTP course | Date and time
Introduction to Big Data and its tools | 24 - 26 January 2017
Presentation, Facilitation and Consultation Skills for Statistical Trainers – Introductory Course | 31 January - 2 February 2017
The Use of R in Official Statistics | 4 - 7 April 2017
Can a Statistician become a Data Scientist? | 16 - 18 May 2017
Machine Learning Econometrics | 12 - 14 June 2017
Hands-on immersion on Big Data tools | 19 - 22 June 2017
Big data sources - Web, Social media and text analytics | 18 - 21 September 2017
Introduction to Linked Open Data | 28 - 29 September 2017
Automated collection of online prices: sources, tools and methodological aspects | 23 - 26 October 2017
Advanced Big Data Sources - Mobile phone and other sensors | 6 - 9 November 2017
Table 17 identifies the specific training objectives addressed by each ESTP course.
Table 17: ESTP Content Mapped to Training Objectives

ESTP course | Training objectives
Introduction to Big Data and its tools | TO1 (Privacy), TO2 (Visualization technologies), TO6 (Distributed computing), TO14 (Hadoop)
Presentation, Facilitation and Consultation Skills for Statistical Trainers – Introductory Course | TO1 (Soft skills, Delivery of results)
The Use of R in Official Statistics | TO5 (R)
Can a Statistician become a Data Scientist? | TO2 (Data processing, Data visualization), TO8 (Web technologies/Web scraping)
Machine Learning Econometrics | TO8 (Machine learning)
Hands-on immersion on Big Data tools | TO6 (Distributed computing), TO9 (Spark), TO10 (Hive), TO11 (NoSQL), TO8 (Web technologies/Web scraping), TO14 (Hadoop)
Big data sources - Web, Social media and text analytics | TO8 (Web technologies/Web scraping), TO2 (Data processing), TO8 (Data mining), TO8 (NLP)
Introduction to Linked Open Data | TO8 (Upper-level technologies)
Automated collection of online prices: sources, tools and methodological aspects | TO8 (Web technologies/Web scraping), TO2 (Nowcasting and projections)
Advanced Big Data Sources - Mobile phone and other sensors | TO8 (Multimedia analysis), TO2 (Visualization), TO8 (IoT), TO8 (Stream processing), TO5 (R), TO5 (Python)
5.4.2 Data science content dashboard
A valuable source of content is the EDSA dashboard [28], which provides a set of recent jobs and training materials for a specific query.
Figure 20: EDSA dashboard – Hadoop search
Figure 21: EDSA dashboard – Hadoop-related trainings and video lectures
Figure 20 and Figure 21 demonstrate the demand for jobs and the supply of training materials in the area of data science in Europe.
Following the defined training objectives, the content (in the form of MOOC courses and available online training materials for each objective) is specified in Table 18.
Table 18: Web Content Mapped to Training Objectives

TO1 – SOFT SKILLS
- Communicating Business Analytics Results: https://www.coursera.org/learn/communicating-business-analytics-results
- Oral Communication for Engineering Leaders: https://www.coursera.org/learn/oral-communication
- Research Report: Delivering Insights: https://www.coursera.org/learn/research-report
- Communicating Complex Information: Presenting Your Ideas Clearly and Effectively: https://www.futurelearn.com/courses/communicating-complex-information?lr=29
- Effective Problem-Solving and Decision-Making: https://www.coursera.org/learn/problem-solving
- Using Creative Problem Solving: https://www.futurelearn.com/courses/creative-problem-solving?lr=20
- Creative Leadership for Effective Leaders (Foundation Level): https://www.openlearning.com/courses/creativethinkingandcreativeproblemsolving
TO2 – STATISTICAL AND DATA SCIENCE TASKS
- Data Science Fundamentals: https://bigdatauniversity.com/learn/data-science
- Where to go Online to Find the Data: Data Analysis for Journalists: http://www.newsu.org/courses/find-online-data
- Data Processing Using Python: https://www.coursera.org/learn/python-data-processing
- Interactive Data Visualization for the Web: http://alignedleft.com/tutorials/d3
- Data Visualization and D3.js: https://www.udacity.com/course/data-visualization-and-d3js--ud507
- Data Tells a Story: Reading Data in the Social Sciences and Humanities: https://www.futurelearn.com/courses/data-explosion?lr=161
- Building Data Visualization Tools: https://www.coursera.org/learn/r-data-visualization
- Fundamentals of Visualization with Tableau: https://www.coursera.org/learn/data-visualization-tableau
- Data Warehouse Concepts, Design, and Data Integration: https://www.coursera.org/learn/dwdesign
- Relational Database Support for Data Warehouses: https://www.coursera.org/learn/dwrelational
- Design and Build a Data Warehouse for Business Intelligence Implementation: https://www.coursera.org/learn/data-warehouse-bi-building
- Practical Predictive Analytics: Models and Methods: https://www.coursera.org/learn/predictive-analytics
TO3 – ADMINISTRATIVE TASKS (FOR STATISTICAL PURPOSES)
- Introduction to Project Management: https://www.coursesites.com/webapps/Bb-sites-course-creation-BBLEARN/handleSelfEnrollment.htmlx?course_id=_239834_1
- Fundamentals of Project Planning and Management: https://www.futurelearn.com/courses/fundamentals-of-project-planning-and-management?lr=87
- Software Product Management Capstone: https://www.coursera.org/learn/software-product-management-capstone
- Fundamentals of Management: https://www.coursera.org/learn/fundamentals-of-management
- Data Management and Visualization: https://www.coursera.org/learn/data-visualization
TO4 – INFORMATION TECHNOLOGIES TASKS (FOR STATISTICAL PURPOSES)
- Agile Planning for Software Products: https://www.coursera.org/learn/agile-planning-for-software-products
- Software Processes and Agile Practices: https://www.coursera.org/learn/software-processes-and-agile-practices
- Mastering Software Development in R Capstone: https://www.coursera.org/learn/r-capstone
- Software Debugging: https://www.udacity.com/course/software-debugging--cs259?utm_medium=referral&utm_campaign=api
- Systems Thinking and Complexity: https://www.futurelearn.com/courses/systems-thinking-complexity?lr=111
TO5 – PROGRAMMING LANGUAGES
- Data Science and Machine Learning Bootcamp with R: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/data-science-and-machine-learning-bootcamp-with-r
- R for Data Science: http://r4ds.had.co.nz
- Statistics with R Capstone: https://www.coursera.org/learn/statistics-project
- R Programming: https://www.coursera.org/learn/r-programming
- The R Programming Environment: https://www.coursera.org/learn/r-programming-environment
- Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook
- Capstone: Retrieving, Processing, and Visualizing Data with Python: https://www.coursera.org/learn/python-data-visualization
- Programming for Everybody (Getting Started with Python): https://www.coursera.org/learn/python
- Python Data Structures: https://www.coursera.org/learn/python-data
- Using Python to Access Web Data: https://www.coursera.org/learn/python-network-data
- Using Databases with Python: https://www.coursera.org/learn/python-databases
- Introduction to Data Science in Python: https://www.coursera.org/learn/python-data-analysis
- Official Python Tutorial: https://docs.python.org/3/tutorial/index.html
- Functional Programming Principles in Scala: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=40328&murl=https://www.coursera.org/learn/progfun1
- Scala by Example: http://www.scala-lang.org/docu/files/ScalaByExample.pdf
- Scala: Learn by Example: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/learn-by-example-scala
- Scala School: http://twitter.github.io/scala_school
- Programming in Scala: http://www.artima.com/pins1ed
- Effective Scala: http://twitter.github.io/effectivescala
- Introduction to Programming and Problem Solving Using Scala: https://hackr.io/tutorial/introduction-to-programming-and-problem-solving-using-scala
- Learning Scala - Joel Abrahamsson: http://joelabrahamsson.com/learning-scala
- Scala Exercises: https://www.scala-exercises.org
TO6 – ARCHITECTURE TOOLS AND TECHNOLOGIES
- Distributed Programming in Java: https://www.coursera.org/learn/distributed-programming-in-java
- Advanced Operating Systems: https://www.udacity.com/course/advanced-operating-systems--ud189?utm_medium=referral&utm_campaign=api
- Intro to Hadoop and MapReduce: https://www.udacity.com/course/intro-to-hadoop-and-mapreduce--ud617?utm_medium=referral&utm_campaign=api
- Deploying a Hadoop Cluster: https://www.udacity.com/course/deploying-a-hadoop-cluster--ud1000?utm_medium=referral&utm_campaign=api
- Software Architecture for the Internet of Things: https://www.coursera.org/learn/iot-software-architecture
TO7 – CLOUD TOOLS AND TECHNOLOGIES
- Cloud Computing Concepts, Part 1: https://www.coursera.org/learn/cloud-computing
- Cloud Computing Concepts: Part 2: https://www.coursera.org/learn/cloud-computing-2
- Big Data, Cloud Computing, & CDN Emerging Technologies: https://www.coursera.org/learn/big-data-cloud-computing-cdn
- Cloud Computing Project: https://www.coursera.org/learn/cloud-computing-project
- Cloud Computing Applications, Part 1: Cloud Systems and Infrastructure: https://www.coursera.org/learn/cloud-applications-part1
- Cloud Networking: https://www.coursera.org/learn/cloud-networking

TO8 – UPPER-LEVEL DATA SCIENCE TOOLS AND TECHNOLOGIES
- Algorithmic Thinking (Part 1): https://www.coursera.org/learn/algorithmic-thinking-1
- Algorithmic Thinking (Part 2): https://www.coursera.org/learn/algorithmic-thinking-2
- Learn Algorithms by Solving Challenges: https://www.learneroo.com/subjects/8
- Introduction to Algorithms and Data structures in C++: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/introduction-to-algorithms-and-data-structures-in-c
- Machine Learning A-Z: Hands-On Python & R in Data Science: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/machinelearning
- Machine Learning Foundations: A Case Study Approach: https://www.coursera.org/learn/ml-foundations
- Neural Networks for Machine Learning: https://www.coursera.org/learn/neural-networks
- Applied Machine Learning in Python: https://www.coursera.org/learn/python-machine-learning
- Machine Learning with Python: https://hackr.io/tutorial/machine-learning-with-python
- Practical Machine Learning: https://www.coursera.org/learn/practical-machine-learning
- Serverless Machine Learning with Tensorflow on Google Cloud Platform: https://www.coursera.org/learn/serverless-machine-learning-gcp
- Machine Learning: https://www.coursera.org/learn/machine-learning
- Machine Learning for Data Analysis: https://www.coursera.org/learn/machine-learning-data-analysis
- Machine Learning with Big Data: https://www.coursera.org/learn/big-data-machine-learning
- Machine Learning: Classification: https://www.coursera.org/learn/ml-classification
- Intro to Machine Learning: https://www.udacity.com/course/intro-to-machine-learning--ud120?utm_medium=referral&utm_campaign=api
- Predictive Modeling and Analytics: https://www.coursera.org/learn/predictive-modeling-analytics
- Pattern Discovery in Data Mining: https://www.coursera.org/learn/data-patterns
- Data Mining Project: https://www.coursera.org/learn/data-mining-project
- Deep Learning: http://www.deeplearningbook.org
- Neural Networks and Deep Learning: https://www.coursera.org/learn/neural-networks-deep-learning
- Deep Learning: https://www.udacity.com/course/deep-learning--ud730?utm_medium=referral&utm_campaign=api
- Applied Text Mining in Python: https://www.coursera.org/learn/python-text-mining
- Introduction to Natural Language Processing: https://www.coursera.org/learn/natural-language-processing
- Applied Social Network Analysis in Python: https://www.coursera.org/learn/python-social-network-analysis
- Social Media Analytics: Using Data to Understand Public Conversations: https://www.futurelearn.com/courses/social-media-analytics?lr=137
- Intro to Artificial Intelligence: https://www.udacity.com/course/intro-to-artificial-intelligence--cs271?utm_medium=referral&utm_campaign=api
- Artificial Intelligence: https://www.udacity.com/course/artificial-intelligence--ud954?utm_medium=referral&utm_campaign=api
- Learn the fundamentals of Artificial Intelligence: http://www.awin1.com/cread.php?awinmid=6798&awinaffid=428263&p=https://www.edx.org/course/artificial-intelligence-ai-columbiax-csmm-101x-0
- MIT Open Courseware - Artificial Intelligence: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/lecture-videos
- Intro to AI — UC Berkeley CS188: http://ai.berkeley.edu/lecture_videos.html
- Advanced AI: Deep Reinforcement Learning in Python: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/deep-reinforcement-learning-in-python
- Internet of Things & Augmented Reality Emerging Technologies: https://www.coursera.org/learn/iot-augmented-reality-technologies
- Big Data, Cloud Computing, & CDN Emerging Technologies: https://www.coursera.org/learn/big-data-cloud-computing-cdn
- Internet of Things: Setting Up Your DragonBoard™ Development Platform: https://www.coursera.org/learn/internet-of-things-dragonboard
- Internet of Things: How did we get here?: https://www.coursera.org/learn/internet-of-things-history
- Internet of Things Capstone: Build a Mobile Surveillance System: https://www.coursera.org/learn/internet-of-things-capstone
- Introduction to the Internet of Things and Embedded Systems: https://www.coursera.org/learn/iot
- Programming for the Internet of Things Project: https://www.coursera.org/learn/internet-of-things-project
- Embedded Systems: https://www.udacity.com/course/embedded-systems--ud169?utm_medium=referral&utm_campaign=api
- Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform: https://www.coursera.org/learn/leveraging-unstructured-data-dataproc-gcp
TO9 – STATISTICS AND BUSINESS INTELLIGENCE
- Taming Big Data with Apache Spark and Python: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/taming-big-data-with-apache-spark-hands-on
- Apache Spark in Python: Beginner's Guide: https://www.datacamp.com/community/tutorials/apache-spark-python#gs.fMIIqxM
- Apache Spark 2.0 with Scala: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/apache-spark-with-scala-hands-on-with-big-data
- Scalable Programming with Scala and Spark: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/scalable-programming-with-scala-and-spark
- Big Data Analysis with Scala and Spark: https://www.coursera.org/learn/scala-spark-big-data
- Statistical Analysis for Educational Researchers: https://www.openlearning.com/courses/statisticalanalysis
TO10 – DATA MANAGEMENT
- Getting Started with Apache Cassandra: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/apache-cassandra
TO11 – DATABASES
- Database Systems Concepts & Design: https://www.udacity.com/course/database-systems-concepts-design--ud150?utm_medium=referral&utm_campaign=api
- An Introduction to Database: https://www.openlearning.com/courses/databaseanintroduction
- Database Fundamentals: http://www.microsoftvirtualacademy.com/training-courses/database-fundamentals
- Database Management Essentials: https://www.coursera.org/learn/database-management
- Data Manipulation at Scale: Systems and Algorithms: https://www.coursera.org/learn/data-manipulation
- Intro to Relational Databases: https://www.udacity.com/course/intro-to-relational-databases--ud197?utm_medium=referral&utm_campaign=api
- Relational Database Support for Data Warehouses: https://www.coursera.org/learn/dwrelational
- SQL basics by Khan Academy: https://www.khanacademy.org/computing/computer-programming/sql/sql-basics
- A beginners guide to thinking in SQL: http://www.sohamkamani.com/blog/2016/07/07/a-beginners-guide-to-sql
- Get Started with SQL Programming: http://www.ntu.edu.sg/home/ehchua/programming/sql/MySQL_HowTo.html
- SQL Tutorial by Tutorials Point: http://www.tutorialspoint.com/sql/sql_tutorial.pdf
- Learn SQL the Hard Way: https://learncodethehardway.org/sql
- Try SQL: http://campus.codeschool.com/courses/try-sql/contents
- Managing Big Data with MySQL: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=40328&murl=https://www.coursera.org/learn/analytics-mysql
- SQL for Newbies: Data Analysis for Beginners: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/sql-for-newbs
- NoSQL Databases (General): http://www.christof-strauch.de/nosqldbs.pdf
- Server-side Development with NodeJS, Express and MongoDB: https://www.coursera.org/learn/server-side-nodejs
- Web Application Development with JavaScript and MongoDB: https://www.coursera.org/learn/web-application-development
- Data Wrangling with MongoDB: https://www.udacity.com/course/data-wrangling-with-mongodb--ud032?utm_medium=referral&utm_campaign=api
- The MongoDB Manual: http://docs.mongodb.org/manual
- The Complete Developer's Guide to MongoDB: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/the-complete-developers-guide-to-mongodb
- The Little MongoDB Book: http://openmymind.net/2011/3/28/The-Little-MongoDB-Book
- MongoDB Tutorial for Beginners: https://hackr.io/tutorial/mongodb-tutorial-for-beginners
- MongoDB for Node.js Developers: https://university.mongodb.com/courses/M101JS/about
- React Native with an Express/MongoDB Backend: https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/build-your-next-app-with-react-native-and-express
- MongoDB for Beginners Tutorials: https://hackr.io/tutorial/mongodb-for-beginners-tutorials
TO12 SEARCH TECHNOLOGIES
Text Retrieval and Search Engines https://www.coursera.org/learn/text-retrieval
Introduction to Search Engine Optimization https://www.coursera.org/learn/search-engine-optimization
Search Engine Optimization Fundamentals https://www.coursera.org/learn/seo-fundamentals
TO13 VISUALIZATION TECHNOLOGIES
Building Data Visualization Tools https://www.coursera.org/learn/r-data-visualization
Interactive Data Visualization for the Web http://alignedleft.com/tutorials/d3
Data Visualization and D3.js https://www.udacity.com/course/data-visualization-and-d3js--ud507
TO14 HADOOP
Big Data Integration and Processing https://www.coursera.org/learn/big-data-integration-processing
Deploying a Hadoop Cluster https://www.udacity.com/course/deploying-a-hadoop-cluster--ud1000?utm_medium=referral&utm_campaign=api
Hadoop Platform and Application Framework https://www.coursera.org/learn/hadoop
Hadoop Illuminated http://hadoopilluminated.com/index.html
Hadoop Tutorial http://www.tutorialspoint.com/hadoop/index.htm
Become a Hadoop Developer https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/hadoop-tutorial/
Hadoop Platform and Application Framework https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=40328&murl=https://www.coursera.org/learn/hadoop
The Ultimate Hands-On Hadoop https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=39197&murl=https://www.udemy.com/the-ultimate-hands-on-hadoop-tame-your-big-data/
Hadoop Tutorial by Tutorials Point http://www.tutorialspoint.com/hadoop/hadoop_tutorial.pdf
Bigdata and Hadoop Tutorial https://hackr.io/tutorial/bigdata-and-hadoop-tutorial
Hadoop by Durga Software https://hackr.io/tutorial/hadoop-by-durga-software
TO15 DATA MINING TOOLS AND TECHNOLOGIES
Data Mining Project https://www.coursera.org/learn/data-mining-project
Pattern Discovery in Data Mining https://www.coursera.org/learn/data-patterns
Data Visualization https://www.coursera.org/learn/datavisualization
Introduction to Data Science in Python https://click.linksynergy.com/deeplink?id=jU79Zysihs4&mid=40328&murl=https://www.coursera.org/learn/python-data-analysis
Data Science Fundamentals https://bigdatauniversity.com/learn/data-science
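Most of the Hadoop courses listed above open with the classic word-count example. As a minimal sketch of the idea, the plain-Python code below imitates the map and reduce steps without a Hadoop cluster; the documents are invented for illustration.

```python
from collections import defaultdict

def map_phase(text):
    # Map step: emit a (word, 1) pair for every word, as a Hadoop mapper would.
    return [(word.lower(), 1) for word in text.split()]

def reduce_phase(pairs):
    # Shuffle + reduce step: group the pairs by key and sum the counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data big opportunities", "big risks"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(pairs)
print(word_counts)
```

In a real Hadoop job the map and reduce functions run in parallel across the cluster and the shuffle is handled by the framework, which is exactly what the tutorials above go on to demonstrate.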
6. Strategic analysis of bridging the gap via training
6.1 Training channels
6.1.1 Advantages and disadvantages of face-to-face training
Face-to-face training courses, such as those provided by the ESTP [29], are beneficial from the following perspectives:
- Networking. Real-life human interaction with another person is an important training feature and increases networking opportunities.
- Engagement. Face-to-face training is focused on concrete tasks at a concrete moment in time.
- Discussion. It is easy to hold an open discussion in face-to-face training.
- Specificity. Face-to-face training can be adapted to the requirements of a concrete group.
- Feedback. In face-to-face training it is easy to obtain feedback and help from the instructor.
However, face-to-face training has a number of disadvantages:
- Unsuitable for some people. Face-to-face training can be unsuitable in terms of time constraints and costs.
- Unsuitable for large audiences. Face-to-face training cannot cover very large groups of people.
- Unsuitable for large organizations. Face-to-face training is less suitable for large organizations, since their branches are spread across different locations.
- Low reference value. Face-to-face communication is oral; no written records are kept.
- Poor retention by the listener. The listener may receive more information in a face-to-face training session than they can retain.
6.1.2 European Statistics Training Programme
The European Statistical Training Programme (ESTP) [29] is Eurostat's main training channel for European statistics. The ESTP contains a set of courses, some of which partially cover the specified training objectives. In particular, Table 17 shows that there are ESTP courses targeted at soft skills, at data science and statistical tasks, and at big data tools and technologies.
6.1.3 European Master in Official Statistics
The European Master in Official Statistics (EMOS) [30] is a project aimed at developing a programme for training and educating potential future official statisticians within existing Master programmes at European universities.
An EMOS-labelled Master is made up of four main parts:
- EMOS module (approx. 10% of ECTS credits);
- Semi-elective courses (approx. 30% of ECTS credits);
- Elective courses (approx. 25% of ECTS credits);
- Internship and Master thesis (approx. 35% of ECTS credits).
EMOS provides students with advanced training in statistics in general and official statistics in particular. The project offers complementary quantitative and statistical tools and enhances students' ability to understand and analyse European official statistics at different levels (quality, production process, dissemination and analysis) in a national, European and international context.
6.1.4 Online Learning
E-learning theory describes the cognitive science principles of effective multimedia learning using electronic educational technology [31].
The possible online training channels are described below.
Webinars
A webinar [32] is an event held on the internet and attended by an online audience. Video can be broadcast in sync with slides, and screen capture can be used.
A webinar is a form of one-to-many communication: a presenter can reach a large and specific group of online viewers from a single location, and webinars are often widely attended. Participants can use the following interactive features:
- Ask a question
- Chat
- Poll
- Survey
- Test
- Call to action
- Twitter
MOOCs
A massive open online course (MOOC) [33] is an online course aimed at unlimited participation and open access via the web. MOOCs provide recorded video lectures, problem sets, quizzes and other materials, with interaction taking place in user forums. MOOCs are characterized by massive enrolments. Consequently, they require instructional design that facilitates large-scale feedback and interaction, such as peer review, group collaboration and automated feedback through objective online assessments, e.g. quizzes and exams.
Videolectures
Videolectures.net [34] is an award-winning free and open-access repository of educational video lectures. The lectures are given by distinguished scholars and scientists at important and prominent events, such as conferences, summer schools, workshops and science promotional events, in many fields of science. The portal aims to promote science, exchange ideas and foster knowledge sharing by providing high-quality didactic content not only to the scientific community but also to the general public. All lectures, accompanying documents, information and links are systematically selected and classified through an editorial process that also takes users' comments into account.
Table 19 presents a sample of the videolectures related to big data available on videolectures.net.
Table 19: Videolectures in the Area of Big Data
Big-Data Tutorial: 14132 views
BigData and MapReduce with Hadoop: 1937 views
Big Data Clustering: 1930 views
Mining Big Data in Real Time: 945 views
On Big Data Algorithmics: 799 views
Text Analytics and Big Data: 531 views
Sampling for Big Data: 385 views
Big Data – Big opportunities – Big risks? And what about Europe?: 253 views
Making (Big) Data: 153 views
Technological challenges of Big-Data: 64 views
Personalized learning portal
A personalized learning portal is designed to provide educators, administrators and learners with a single robust, secure and integrated system for creating personalized learning environments. The software provided by such platforms can be installed on web servers. Moodle [35] is an open-source learning platform of this kind, designed to support both teaching and learning; it is free, with no licensing fees, easy to use and has multilingual capabilities. The Moodle project is well supported by an active international community.
The EDSA online courses portal [36] is based on the Moodle Learning Management System. A Learning Management System (LMS) is an online software application offering facilities for student registration, enrolment into courses, delivery of learning materials to students, student assessment and progress monitoring. Moodle has been adopted by numerous educational institutions worldwide, including the Open University, and currently has more than 79 million users across the academic and enterprise sectors, making it the world's most widely used learning platform. Additionally, as open source it has attracted a sizeable community of developers, which offers a wide range of free and open plugins that extend and enrich the functionality provided by Moodle.
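Moodle also exposes its LMS functions through a REST web-service protocol, which is one way a portal of this kind can be integrated with other systems. As a minimal sketch, the Python snippet below builds a request URL for that protocol; the portal address and token are placeholders, and whether web services are enabled depends on how a given Moodle site is configured.

```python
from urllib.parse import urlencode

def moodle_rest_url(site, token, function, **params):
    """Build a request URL for Moodle's REST web-service endpoint.

    `site` and `token` are hypothetical here; a real token is issued by
    the Moodle administrator for an enabled web service.
    """
    query = {"wstoken": token, "wsfunction": function, "moodlewsrestformat": "json"}
    query.update(params)
    return f"{site}/webservice/rest/server.php?{urlencode(query)}"

# Illustrative call: list courses via the core_course_get_courses function.
url = moodle_rest_url("https://courses.example.org", "abc123", "core_course_get_courses")
print(url)
```

Fetching this URL on a configured site would return the course list as JSON, which could then feed dashboards or learning-pathway tools such as those used by EDSA.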
Figure 22 presents a snapshot of EDSA online courses portal.
Figure 22: EDSA online courses portal
Figure 23 presents an example of an EDSA learning pathway, generated automatically from the skills selected by the user. In this way, users can follow their own learning pathways based on their qualifications and intentions.
Figure 23: EDSA learning pathways
6.1.5 Other possible training channels
Other possible training channels include training workshops and self-study materials. Training workshops retain the benefits of face-to-face training while strengthening its collaboration and networking aspects.
A blended training format combines elements of face-to-face training with supporting online learning.
Training workshops
A training workshop is a type of interactive training where participants carry out a number of training activities rather than passively listen to a lecture or presentation [37].
Self-study materials
An electronic book (or e-book) is a digital book consisting of text, images, or both, readable on the flat-panel display of computers and other electronic devices. An eBook [38] can be downloaded and used even without an internet connection on iPads and iPhones (iBooks format), as well as on other tablets and smartphones (ePUB format).
6.2 Defining strategic training plan for Big Data in official statistics
6.2.1 ADDIE model
ADDIE is an instructional system design framework for training developers who are developing courses [39, 40]. The ADDIE model comprises five phases, Analysis, Design, Development, Implementation and Evaluation, which are described in detail below.
The ADDIE model has a number of specific goals:
- Evaluation of the trainees' needs;
- Design and development of training materials;
- Ensuring that trainees reach the training objectives and achieve the defined learning outcomes;
- Evaluation of the training process.
6.2.2 Analysis
Initial analysis and assessment of training needs is one of the important stages of the ADDIE model. Section 2 covered an analysis of the skills required for working on big data tasks in the statistical domain, while Section 3 and Section 4 identified big data training needs from the perspective of focal points in statistical offices. The training objectives have been defined in Section 5.
The outcomes of the strategic training plan are usually grouped into short-, medium- and long-term outcomes, which can be further expanded (Table 20):
Table 20: Expected Outcomes for Big Data in Statistics

Short term. Change in: knowledge, skills, attitude, motivation, awareness.
The relevant personnel in statistical offices around Europe will become aware of big data and its possible applications in statistics. A change of attitude should take place among NSI personnel with regard to using new and emerging technologies in statistics. The relevant NSI personnel will acquire skills for working with big data depending on their background, motivation and the training needs of the particular NSI: statisticians at the level of Familiarity, IT experts at the level of Usage, and managers at the level of Assessment.

Medium term. Change in: behaviours, practices, policies, procedures.
The level of expertise in working with big data will be continuously supported in NSIs around Europe. Changes in behaviour and practices around the use of big data and other emerging technologies for statistics will be observed. New standards, policies and procedures will be adopted.

Long term. Change in: situation (environment, social and economic conditions).
The overall situation in the statistical domain in Europe will change, with knowledge and new technologies driving statistical production.
6.2.3 Design
In the Design phase of training development, it is defined how the training courses should look in order to meet the needs identified in the Analysis phase.
In particular, for the purposes of big data training for NSIs in Europe, the Design phase should answer the questions:
- How should the short-term, medium-term and long-term training programme be implemented?
- Which topics/objectives could be covered via face-to-face training (ESTP, workshops) in the short, medium and long term?
- Which online mechanisms can be implemented in the short, medium and long term?
- Which topics/objectives should have a blended training format (face-to-face with supporting online materials) in the short, medium and long term?
For the specific courses:
- How should course content be organized?
- How should ideas be presented to participants?
- What delivery formats should be used?
- What types of activities and exercises will be most suitable for participants?
- How should the trainees be evaluated?
The basic steps for a specific course design are the following:
- Planning the instructional strategy;
- Selecting the course format;
- Writing the instructional design document.
6.2.4 Development
In the Development phase, training content is created and assembled, and course materials are produced according to the decisions made during the Analysis and Design phases.
This includes determining and developing appropriate activities and evaluations. The Development phase of preparing a specific course can be broken down into the following five components:
- Reviewing/revising existing information sources and training materials;
- Selecting appropriate methods and media;
- Developing all new course material;
- Validating course materials; and
- Developing an Instructional Management Plan.
In particular, content suggested as part of deliverable D4.4 could provide the basics for course development in the area of big data.
6.2.5 Implementation
The Implementation phase follows the Development phase and ensures that:
- the course meets important business goals;
- the course covers content that learners need to know;
- the course reflects the learners' existing capabilities.
The Implementation phase raises a number of common issues related to face-to-face training and a set of issues specific to online training or e-learning.
Common Issues
Course Materials
- How many copies of the course materials need to be printed?
- Will course materials be printed in-house or outsourced to a printer?
- How will course materials be delivered and who will be responsible?
Instructors
- How many trainers will be needed for the project?
- Will the trainers come from an in-house team or from an outside provider?
- Will the project require the trainers to travel? Should the trainers be geographically based?
- How will the instructors learn to teach this course? Will the project require a train-the-trainer session?
- When and how will trainers receive their schedule?
- Who will be the technical contact for trainers?
- Can enhanced/leveraged use of multimedia training/partnerships be included?
Course Schedule
- Where will the courses be offered?
- On what dates and times will the course be offered?
- How will this schedule be communicated?
Classroom Space
- Will the classroom require any specific technology (computers, light box, etc.)?
- Will the classroom require desks, tables or just chairs?
Registration
- How will learners be enrolled for the course?
- How will course rosters be tracked?
- How will rosters be communicated to instructors?
- How will instructors record attendance and test scores?
- Will this course be entered into a learning management system?
Logistics
- Who will manage training administration?
- Who will manage training logistics?
- Who will be responsible for collecting and communicating these statistics?
E-learning Issues
Hosting
- Where will the course be hosted?
- How many learners will need to access the course in total?
- How many learners will need to access the course at any one time?
Access
- How will learners enroll for the course?
- Will learners be able to access the course through the web or will they need to connect to an intranet?
Learners’ Computers
- Who will ensure all sites have internet-ready computers?
- Who will ensure that learners have all necessary applications loaded onto their computers?
- Will learners need to download any applications or plug-ins?
The training strategy is implemented on the basis of the decisions taken in the Analysis and Design phases described above.
6.2.6 Evaluation
The outcome of a training activity can be measured against the following questions:
- Were the goals set out in the Analysis phase met?
- Was an improvement in the targeted set of skills observed?
- Was there an increase in training attendance?
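Two of these questions lend themselves to simple quantitative indicators. The sketch below, with invented example numbers, shows how an attendance increase and an average pre/post skill gain could be computed; the scoring scale and figures are illustrative assumptions, not data from the report.

```python
def attendance_increase(before, after):
    # Relative change in attendance between two training rounds, in percent.
    return 100.0 * (after - before) / before

def mean_skill_gain(pre_scores, post_scores):
    # Average per-trainee improvement between pre- and post-training
    # assessments, assuming scores on a common scale (e.g. 1-5).
    gains = [post - pre for pre, post in zip(pre_scores, post_scores)]
    return sum(gains) / len(gains)

# Hypothetical figures for one course round.
print(attendance_increase(40, 50))             # percentage change in attendance
print(mean_skill_gain([3, 4, 2], [5, 5, 4]))   # average skill-score gain
```

Tracking such indicators across course rounds would feed directly back into the Analysis phase of the next ADDIE cycle.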
7. References
[1] Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao et al., Big data analytics: a survey. Journal of Big Data (2015) 2: 21, doi:10.1186/s40537-015-0030-3.
[2] Nada Elgendy, Ahmed Elragal, Big Data Analytics: A Literature Review. Advances in Data Mining. Applications and Theoretical Aspects, Vol. 8557 (2014), pp. 214-227, doi:10.1007/978-3-319-08976-8_16.
[3] Kubick, W.R.: Big Data, Information and Meaning. In: Clinical Trial Insights (2012), pp. 26–28.
[4] Paul MacDonnell and Daniel Castro. Europe Should Embrace the Data Revolution. Center for Data Innovation (2016), http://www2.datainnovation.org/2016-europe-embrace-data-revolution.pdf.
[5] IDG Enterprise Data and Analytics Survey 2016, http://core0.staticworld.net/assets/2016/06/29/idge-data-analysis-2016.pdf.
[6] EDSA project, http://edsa-project.eu (accessed in January 2017).
[7] SARO ontology, http://eis.iai.uni-bonn.de/Projects/SARO.html (accessed in January 2017).
[8] BDVA reports, http://www.bdva.eu/?q=big-data-reports (accessed in January 2017).
[9] O'Reilly's 2016 Data Science Salary Survey, http://www.oreilly.com/data/free/2016-data-science-salary-survey.csp?intcmp=il-data-free-lp-lgen_free_reports_page.
[10] Adzuna API, https://developer.adzuna.com/overview (accessed in January 2017).
[11] L. Ratinov, D. Roth, D. Downey, and M. Anderson, Local and global algorithms for disambiguation to Wikipedia. ACL (2011).
[12] JSI Wikifier, http://wikifier.org (accessed in January 2017).
[13] GeoNames ontology, http://www.geonames.org/ontology/documentation.html (accessed in January 2017).
[14] Microsoft Academic Graph, https://www.microsoft.com/en-us/research/project/microsoft-academic-graph (accessed in April 2017).
[15] Ontogen tool, ontogen.ijs.si (accessed in January 2017).
[16] Letheby R.S., Nicholson D., The ABS statistical capability framework – the first step in transforming the statistical capability learning environment, http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.54/2014/Australia_The_ABS_Statistical_Capability_Framework_01.pdf (accessed January 2017).
[17] EDISON Project: Building the Data Science Profession [online] http://edison-project.eu
[18] B. S. Bloom, M. D. Engelhart, E. J. Furst, W. H. Hill, D. R. Krathwohl (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York: David McKay Company.
[19] D. A. Kolb. Experiential learning: experience as the source of learning and development. Prentice-Hall, 1984.
[20] T. W. Wlodarczyk, T. J. Hacker, "Problem-Based Learning Approach to a Course in Data Intensive Systems." Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on. IEEE, 2014.
[21] The Aalborg Model for Problem Based Learning (PBL) [online] http://www.en.aau.dk/education/problem-based-learning
[22] J. Biggs, “Enhancing teaching through constructive alignment,” Higher education, vol. 32, no. 3, pp. 347–364, 1996.
[23] M. Ben-Ari, “Constructivism in computer science education,” Journal of Computers in Mathematics and Science Teaching, vol. 20, no. 1, pp. 45–73, 2001.
[24] Data Science Competence Framework [online] http://edison-project.eu/data-science-competence-framework-cf-ds
[25] Information Technology Competency Model of Core Learning Outcomes and Assessment for Associate-Degree Curriculum (2014) http://www.capspace.org/uploads/ACMITCompetencyModel14October2014.pdf
[26] A. Sasha Thackaberry, A CBE Overview: The Recent History of CBE [online] http://evolllution.com/programming/applied-and-experiential-learning/a-cbe-overview-the-recent-history-of-cbe
[27] Computer Science 2013: Curriculum Guidelines for Undergraduate Programs in Computer Science http://www.acm.org/education/CS2013-final-report.pdf
[28] EDSA Dashboard [online] http://jobs.videolectures.net
[29] ESTP programme [online] http://ec.europa.eu/eurostat/web/european-statistical-system/training-programme-estp
[30] EMOS project [online] http://www.cros-portal.eu/content/emos
[31] R.E. Mayer, R. Moreno (1998). A Cognitive Theory of Multimedia Learning: Implications for Design Principles (PDF).
[32] What is a webinar? [online] https://www.webinar.nl/en/webinars/what-is-a-webinar
[33] Massive open online course [online] https://en.wikipedia.org/wiki/Massive_open_online_course
[34] VideoLectures.NET [online] http://videolectures.net
[35] About Moodle [online] https://docs.moodle.org/34/en/About_Moodle
[36] EDSA courses portal [online] http://courses.edsa-project.eu
[37] R. L. Jolles (2005). How to Run Seminars and Workshops (3 ed.). John Wiley & Sons. pp. 5, 12, 48, 155, 320. ISBN 978-0-471-71587-0. Retrieved 2014-11-23.
[38] E-book [online] https://en.wikipedia.org/wiki/E-book
[39] G. R. Morrison (2010). Designing Effective Instruction, 6th Edition. John Wiley & Sons.
[40] Strategic Training Plan [online] http://www.nj.gov/dep/transformation/enforcement/docs/092811trn/Strategic%20Training%20Plan%209-27-2011.pdf
[41] I. Novalija, M. Grobelnik. Deliverable D4.1: Report containing a description of the skills required to process and analyse big data sources for the purpose of official statistics. April, 2017.
[42] I. Novalija, M. Grobelnik. Deliverables D4.2 and D4.3: Report containing an analysis of the existing skills in the statistical offices of the ESS, Eurostat and NSIs (Deliverable 4.2), and Report containing an analysis of the training needs of the statistical offices of the ESS, Eurostat and NSIs (Deliverable 4.3). September, 2017.
[43] I. Novalija, M. Grobelnik. Report containing training objectives and content ready to be used in the design of a training programme to supply the statistical offices in Europe with the skills required to use big data sources in statistical production (Deliverable 4.4). November, 2017.
[44] I. Novalija, M. Grobelnik. Report containing a strategic analysis of how the skills gap can be bridged via training (Deliverable 4.5). November, 2017.
8. Annex
8.1 Appendix: Recommended Literature Sources in the Area of Big Data
Credit card data:
[Sobolevsky,et.al.](2015)Predicting Regional Economic Indices using Big Data of Individual Bank Card Transactions
Mobile network data:
[Ahas,et.al](2008)Evaluating passive mobile positioning in tourism
[Blondel,Decuyper,Krings](2015)A survey of results on mobile phone datasets analysis
[Bogomolov,et.al](2014)Once Upon a Crime - Towards Crime Prediction from Demographics and Mobile Data
[Csaji,et.al.](2014)Exploring the Mobility of Mobile Phone Users
[Decuyper, et.al.](2014)Estimating Food Consumption and Poverty indices with Mobile Phone Data
[Deville,et.al.](2014)Dynamic population mapping using mobile phone data
[Diminescu,Licoppe,Smoreda,Ziemlicki](2006)Using Mobile Phone Geolocation Data for the Analysis of Patterns of Coordination
[Fillekes](2014)Reconstructing trajectories from sparse call detail records
[Frias-Martinez,Frias-Martinez,Oliver](2010)A Gender-Centric Analysis of Calling Behavior in a Developing Economy Using Call Detail Records
[Frias-Martinez,Soguero-Ruiz,Josephidou](2013)Forecasting Socioeconomic Trends With Cell Phone Records
[Furletti,et.al](2014)Use of mobile phone data to estimate mobility flows
[Furletti,Gabrielli,Renso,Rinzivillo](2013)Analysis of GSM calls data for understanding user mobility behaviour
[GP](2015)Mapping the risk-utility landscape of mobile phone data
[Hongyan,Fasheng](2013)Estimating freeway traffic measures from mobile phone location data
[Kanasugi,et.al.](2013)Spatiotemporal route estimation consistent with human mobility using cellular network data
[Licoppe,et.al.](2008)Using mobile phone geolocalisation for socio-geographical analysis
[Monsted,Mollgard,Mathiesen](2016)Phone-based Metric as a Predictor for Basic Personality Traits
[Montjoye,et.al.](2013)Predicting Personality Using Novel Mobile Phone-Based Metrics
[Pappalardo, et.al.](2014)Human Mobility, Social Networks and Economic Development
[Pei,et.al](2014)A New Insight into Land Use Classification Based on Aggregated Mobile Phone Data
[Reades,Reades,Ratti](2009)Eigenplaces - analyzing cities using the space-time structure of the mobile phone network
[Rey-del-Castillo,Cardeñosa](2016)An Exercise in Exploring Big Data for Producing Reliable Statistical Information
[Sapiezynski,et.al.](2015)Tracking Human Mobility Using WiFi Signals
[Soto,Frias-Martinez,Virseda](2011)Prediction of Socioeconomic Levels using Cell Phone Records
[Toole,et.al.](2015)Tracking Employment Shocks Using Mobile Phone Data
[Vanhoof](2014)PhD 6 months report
Mobile phone & wearables sensors data:
[AAPOR](2014)Mobile Technologies for Conducting, Augmenting and Potentially Replacing Surveys
[Bouwman,Heerschap,Reuver](2013)Smartphone measurement study 2012
[Fernee,Sonck,Scherpenzeel](2013)Data collection with smartphones - experiences in a time use survey
[Mastrandrea,Fournet,Barrat](2015)Comparison between Data Collected Using Wearable Sensors, Contact Diaries and Friendship Surveys
[Neverova,et.al.](2016)Learning Human Identity from Motion Patterns
[Parslow](2014)How big data could be used to predict a patient's future
Network data:
[Benedictis,Tajoli](2010)Comparing sectoral international trade networks
[Benedictis,Tajoli](2011)The World Trade Network
[Iapadre,Tajoli](2014)Emerging countries and trade regionalization - A network analysis
[Kurka,Godoy,Zuben](2016)Online Social Network Analysis
[Mandel](2014)Connections as a tool for growth - evidence from the LinkedIn economic graph
[Piccardi,Tajoli](2015)Are Preferential Agreements Significant for the World Trade Structure - A Network Community Analysis
Text Analytics:
[Blei](2012)Probabilistic Topic Models
[Gillick,et.al.](2016)Multilingual Language Processing From Bytes
[Rehurek,Kolkus](2009)Language Identification on the Web - Extending the Dictionary Method
[Schakel,Wilson](2015)Measuring Word Significance using Distributed Representations of Words
[Sonoda,Daisuke](2015)Predicting Latent Trends of Labels in the Social Media Using Infectious Capacity
[Spaniol,Prytkova,Weikum](2013)Knowledge Linking for Online Statistics
Web data:
[AAPOR](2014)Social Media in Public Opinion Research
[Antenucci,et.al.](2014)Using Social Media to Measure Labor Market Flows
[Arrington](2006)Google Trends Launches
[Askitas,Zimmermann](2009)Google Econometrics and Unemployment Forecasting
[Bacchini,et.al](2014)Does Google index improve the forecast of Italian labour market
[Banbura,et.al.](2013)Now-casting and the real-time data flow
[Barreira,et.al](2013)Nowcasting with Google Trends in an Emerging Market
[Beiro](2016)Predicting human mobility through the assimilation of social media traces into mobility models
[Berg](2013)Evaluating Quality of Online Behavior Data
[Breton,et.al](2015)Research indices using web scraped data
[Bughin](2011)Nowcasting the Belgian Economy
[Butler](2013)When Google got flu wrong
[Carriere,Labbe](2010)Nowcasting with google trends in an emerging market
[Carriere,Labbe](2013)Nowcasting with google trends in an emerging market
[Chadwik,Sengul](2012)Nowcasting unemployment rate in turkey let's ask Google
[Chamberlin](2010)Googling the present
[Choi,Varian](2009)Predicting Initial Claims for Unemployment Benefits
[Choi,Varian](2009)Predicting the Present with Google Trends
[Choi,Varian](2012)Predicting the Present with Google Trends
[Choudhury,et.al](2010)Sampling Impact on Discovery of Information Diffusion in Social Media
[Compton, Jurgens, Allen](2014)Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization
[Cook,et.al](2011)Assessing Google Flu Trends Performance
[Curti,Iacus,Porro](2015)Measuring Social Well Being in The Big Data Era - Asking or Listening
[Daas,Puts](2014)Social media sentiment and consumer confidence
[DAmuri,Marcucci](2009)Forecasting US unemployment with Google job search index
[DAmuri,Marcucci](2009)Google it - forecasting the US unemployment rate with google job search index
[Dombrovskyi](2014)Using internet search data for nowcasting unemployment rate in Ukraine
[Ettredge,Gerdes,Karuga](2005)Using Web-based Search Data to Predict Macroeconomic Statistics
[European Commission](2010)Internet as data source
[Falorsi,Naccarato,Pierini](2015)Using google trend data to predict the Italian unemployment rate
[Fondeur,Karame](2013)Can google data help predict French unemployment
[Fung](2014)Google Flu Trends Failure Shows Good Data better than Big Data
[Gayo-Avello](2012)A Balanced Survey on Election Prediction using Twitter Data
[Ginsberg,et.al](2009)Detecting influenza epidemics using search engine query data - Supplementary information 1
[Ginsberg,et.al](2009)Detecting influenza epidemics using search engine query data
[Hamid,Heiden](2014)Forecasting Volatility with Empirical Similarity and Google Trends
[Kapounek](2016)Determinants of Foreign Currency Savings - Evidence from Google Search Data
[Kholodilin,Podstawski,Siliverstovs,Bürgi](2009)Google Searches as Means of Improving Nowcasts of Macroeconomic Variables
[Kholodilin,Podstawski,Siliverstovs](2010)Do Google searches help in nowcasting private consumption
[Koop,Onorante](2013)Macroeconomic Nowcasting Using Google Probabilities
[Kuhn,Skuterud](2004)Internet Job Search and Unemployment Durations
[Lampos,et.al.](2015)Advances in nowcasting influenzalike illness rates using search query logs
[Lazer, Kennedy, King, Vespignani](2014)The Parable of Google Flu - Traps in big data analysis
[Long,Shen](2014)Population specialization and synthesis with open data
[Mao,Counts,Bollen](2015)Quantifying the effects of online bullishness on international financial markets
[McIver, Brownstein](2014)Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time
[Miao,Ma](2015)The Dynamic Impact of Web Search Volume on Product Sales - An Empirical Study Based on Box Office Revenues
[Milinovich,et.al.](2014)Using internet search queries for infectious disease surveillance
[Mohebbi,et.al](2011)Google Correlate Whitepaper
[Olson,et.al](2013)Reassessing Google Flu Trends Data for Detection of Seasonal and Pandemic Influenza
[Preis,Moat](2014)Adaptive nowcasting of influenza outbreaks using Google searches
[Rivera](2015)Dynamic model to forecast hotel registrations using Google Trends data
[Rubin,Puranmalka](2014)Google insights into FNMA MBS prepayments
[Santillana,et.al](2014)What can disease detection learn from (an external revision to)Google Flu Trends
[Schmidt,Vosen](2011)Forecasting private consumption - survey-based indicators vs Google trends
[Seo,et.al.](2014)Cumulative Query Method for Influenza Surveillance Using Search Engine Data
[Shimshoni,Efron,Matias](2009)On the Predictability of Search Trends
[Siddiqui](2015)Mining wikipedia to rank rock guitarists
[Stilo,Vincenzi,Tozzi,Velardi](2013)Automated Learning of Everyday Patients Language for Medical Blogs Analytics
[The Economist](2014)The Economist explains - The backlash against big data
[Toth,Hajdu](2013)Google as a tool for nowcasting household consumption - estimation on Hungarian data
[Vicente,Menéndez,Pérez](2014)Forecasting unemployment with internet search data - Does it help to improve predictions when job destruction is skyrocketing
[Vosen,Schmidt](2011)Forecasting private consumption survey based indicators vs Google Trends
[Vosen,Schmidt](2012)A monthly consumption indicator for Germany based on Internet search query data
[Wang.et.al.](2014)Forecasting elections with non-representative polls
[Xiaoxuan](2016)Tourism forecasting by search engine data with noise-processing
[Zagheni,Kiran,State](2014)Inferring International and Internal Migration Patterns from Twitter Data
[Zeynalov](2014)Nowcasting Tourist Arrivals to Prague
Wikipedia:
[Ciglan, Nørvåg] (2010) WikiPop - Personalized Event Detection System Based on Wikipedia Page View Statistics
[Cozza, Petrocchi, Spognardi] (2016) A matter of words - NLP for quality evaluation of Wikipedia medical articles
[Eom, et al.] (2015) Interactions of Cultures and Top People of Wikipedia from Ranking of 24 Language Editions
[Guisado-Gámez, Prat-Pérez] (2015) Understanding Graph Structure of Wikipedia for Query Expansion
[Katz, Shapira] (2015) Enabling Complex Wikipedia Queries
[Khan, Khan, Mahmood] (2015) Cloud service for assessment of news' Popularity in internet based on Google and Wikipedia indicators
[McIver, Brownstein] (2014) Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time
[Milne, Witten] (2012) An open-source toolkit for mining Wikipedia
[Munzert] (2015) Using wikipedia page views statistics to measure issue salience
[Navarrete, Borowiecki] (2015) Change in access after digitization - Ethnographic collections in Wikipedia
[Pohl] (2012) Improving the wikipedia miner word sense disambiguation algorithm
[Yasseri, Bright] (2015) Wikipedia traffic data and electoral prediction - towards theoretically informed models
[Yucesoy, Barabasi] (2015) Untangling Performance from Success
8.2 Appendix: Trending Skills by Groups
Statistical tasks
Sampling
Legal acts
Microdata
Calculation
Data access
Aggregation
Data analysis
Selection bias
Quality control
Data processing
Quality reporting
Statistical surveys
Statistical content
Disclosure control
Statistical systems
EU nomenclatures
Statistical analyses
Statistical software
Technical standards
Statistical indicators
Statistical databases
Multivariate analysis
Seasonal adjustment
Estimation techniques
Imputation techniques
Administrative sources
Model-based estimation
Statistical confidentiality
Geographical information…
Nowcasting and projections
Administrative tasks for statistical purposes
Task Forces
Project monitoring
Quality assessment
Administrative rules
Communication and information strategy
Contract negotiation
People management and techniques
Interservice consultation
European Statistical System
Decision-making procedures
Communication instruments
Inter-institutional procedures
Budget tasks for statistical purposes
Contract management
Proposals writing
Public procurement
Financial regulation and procedures
IT tasks for statistical purposes
Testing
Security
Training
Maintenance
Development
Implementation
Customer support
System architecture
Statistical databases
Documentation writing
Analysis of requirements
Hardware and infrastructure
Data science tasks
prototype
dashboard
data search
data sharing
data storage
data capture
data cleaning
data analysis
data transfer
data platform
data querying
data curation
data modelling
data conversion
data warehouse
data governance
data visualization
data management
data standardization
Architecture
HPCC
MIKE2.0
5C architecture
distributed databases
distributed computing
data intensive systems
distributed file systems
data intensive computing
High-Performance Computing
distributed parallel architecture
Data management technologies
Toad
Redis
Neo4J
Splunk
BigQuery
Cassandra
Couchbase
Apache Pig
Apache Hive
Apache Storm
Apache Oozie
Apache Mesos
Apache Flume
Apache HBase
Apache Sqoop
Apache Phoenix
Cloudera Impala
Amazon Redshift
Apache ZooKeeper
Amazon DynamoDB
Data mining tools
H2O
Weka
BigML
Orange
LIBSVM
Scikit-learn
BigInsights
Spark MLlib
RapidMiner
Vowpal Wabbit
Apache Mahout
Google Prediction
Database technologies
SQL
DB2
DBMS
SQLite
MySQL
Oracle
NoSQL
Vertica
RDBMS
Netezza
Redshift
Teradata
database
MongoDB
SAP HANA
SQL Server
PostgreSQL
Oracle Exascale
query languages
EMC (Greenplum)
scripting languages
Aster Data (Teradata)
network-attached storage…
storage area network (SAN)
massively parallel-processing…
direct-attached storage (DAS)
Upper-level technologies
web service
data mining
deep learning
stream analysis
digital footprint
network analysis
machine learning
stream processing
inductive statistics
artificial intelligence
business intelligence
software development
social network analysis
natural language processing
Hadoop
Apache HDFS
Hadoop YARN
Apache MapReduce
RHIPE
Cloudera
Programming Languages
C
C#
Go
C++
ECL
Perl
Java
Julia
Bash
Scala
Ruby
Octave
Python
IPython
JavaScript
Visual Basic
Search technologies
search-based applications
Elasticsearch
Solr
Lucene
Statistics and business intelligence
R
SAS
SPSS
Dato
Excel
Stata
pbdR
MATLAB
Alteryx
Cognos
Pentaho
QlikView
Power BI
Oracle BI
Jaspersoft
PowerPivot
Mathematica
Apache Spark
MicroStrategy
Adobe Analytics
BusinessObjects
Visualization technologies
D3
Shiny
Plotly
NVD3
Bokeh
ggplot
Leaflet
InfoVis
Chart.js
Tableau
Visual.ly
Sigma JS
Infogram
n3-charts
Polymaps
Chartist.js
Matplotlib
Highcharts
ZoomData
ChartBlocks
Processing.js
FusionCharts
Datawrapper
Ember Charts
Soft skills
Logic
Ethics
Initiative
Teamwork
Leadership
Negotiation
Coordination
Communication
Delivery of results
Information privacy
Specialist knowledge and expertise
Creative problem solving
Innovation and contextual awareness
8.3 Appendix: Correlated Skills by Groups
[Chart: correlated skills for "Data science tasks" (frequencies up to 50,000): ios, php, unix, html, d3.js, nosql, html5, devops, node.js, hadoop, analysis, compiler, architect, mongodb, assurance, developer, leadership, simulation, sharepoint, unit testing, automation, data mining, data analysis, elasticsearch, virtualization, web analytics, data modeling, version control, relational database, artificial intelligence, business intelligence, software engineering, software architecture, software development]
[Chart: correlated skills for "Administrative tasks for statistical purposes" (frequencies up to 18,000): sql, ios, c++, mysql, jquery, analyst, devops, finance, asp.net, analysis, security, vmware, statistics, architect, selenium, database, hardware, postgresql, unit testing, automation, device driver, virtualization, web analytics, machine code, software design, troubleshooting, machine learning, risk management, data visualization, software engineer, relational database, business intelligence, product management, responsive web design]
[Chart: correlated skills for "Budget tasks for statistical purposes" (frequencies up to 12,000): sql, c++, dba, php, dojo, linux, cloud, mysql, debian, java ee, finance, robotics, compiler, database, metadata, leadership, monitoring, automation, data science, data analysis, web analytics, machine code, version control, user experience, data conversion, database design, software engineer, integration testing, business intelligence, software architecture, product management, responsive web design, functional programming]
[Chart: correlated skills for "IT tasks for statistical purposes" (frequencies up to 50,000): ios, dba, php, perl, mysql, design, mobile, matlab, java ee, finance, backend, statistics, angularjs, database, javascript, prototype, postgresql, linked data, automation, data mining, data science, elasticsearch, apache spark, user interface, data modeling, troubleshooting, customer support, scripting language, software engineer, integration testing, relational database, amazon web services, continuous integration]
[Chart: correlated skills for "Statistics and business intelligence" (frequencies up to 12,000): dba, html, soap, sales, nosql, cloud, mysql, sqoop, oracle, java ee, finance, analysis, security, robotics, pentaho, statistics, analytics, cloudera, database, postgresql, simulation, wordpress, automation, web analytics, user interface, data modeling, version control, data conversion, business objects, machine learning, business intelligence, product management, distributed computing, software development]
[Chart: correlated skills for "Visualization technologies" (frequencies up to 90): c++, xml, perl, .net, pmp, scipy, oracle, jquery, github, design, ember, joomla, numpy, matlab, analyst, node.js, jasmine, android, analysis, network, statistics, angularjs, selenium, javascript, mongodb, postgresql, data mining, user interface, machine code, .net framework, computer science, data management, business intelligence, project management, amazon web services, model view controller, functional programming]
[Chart: correlated skills for "Search technologies" (frequencies up to 1,200): ios, php, perl, json, ruby, redis, cloud, mysql, hbase, scrum, oracle, design, lucene, devops, node.js, android, database, mongodb, hardware, metadata, postgresql, unit testing, virtualization, machine code, data modeling, version control, .net framework, software design, data warehouse, reverse engineering, artificial intelligence, distributed computing, software development, continuous integration]
[Chart: correlated skills for "Programming languages" (frequencies up to 16,000): git, sql, ruby, bash, redis, nginx, nosql, xhtml, debian, java ee, python, puppet, asp.net, security, vmware, statistics, analytics, compiler, selenium, javascript, hibernate, openstack, automation, virtualization, bioinformatics, .net framework, data warehouse, machine learning, scripting language, software engineer, project management, software engineering, distributed computing, software development]
[Chart: correlated skills for "Hadoop" (frequencies up to 4,000): pig, c++, php, perl, unix, html, linux, nosql, html5, mysql, hbase, java ee, ansible, analyst, analysis, statistics, angularjs, database, javascript, cassandra, developer, leadership, monitoring, mapreduce, apache spark, machine code, virtual machine, software design, data warehouse, business objects, data visualization, relational database, functional programming, complex event processing]
[Chart: correlated skills for "Upper-level technologies" (frequencies up to 30,000): ios, perl, java, html5, oracle, java ee, node.js, analysis, analytics, compiler, architect, javascript, mongodb, metadata, assurance, simulation, automation, data mining, elasticsearch, device driver, virtualization, web analytics, user interface, troubleshooting, data warehouse, machine learning, image processing, computer science, scripting language, integration testing, amazon web services, product management, continuous integration, functional programming]
[Chart: correlated skills for "Database technologies" (frequencies up to 120,000): c++, dba, php, json, html, bash, sybase, debian, matlab, java ee, devops, node.js, analysis, backend, statistics, compiler, database, metadata, simulation, wordpress, linked data, unit testing, data mining, apache spark, user interface, data modeling, troubleshooting, data warehouse, database design, machine learning, data management, integration testing, software engineering, software architecture]
[Chart: correlated skills for "Data mining tools" (frequencies up to 450): sql, c++, xml, perl, unix, linux, html5, numpy, android, vmware, polymer, statistics, analytics, compiler, magento, angularjs, selenium, assurance, wordpress, unit testing, automation, data analysis, web analytics, microservices, virtual machine, user experience, data conversion, data visualization, scripting language, software engineer, relational database, business intelligence, software engineering, software architecture, product management]
[Chart: correlated skills for "Data management technologies" (frequencies up to 3,000): html, cloud, hbase, sqoop, design, lucene, ansible, python, vmware, pentaho, analytics, cloudera, architect, database, rabbitmq, mongodb, cassandra, leadership, mapreduce, elasticsearch, microservices, apache camel, machine code, version control, virtual machine, machine learning, image processing, data visualization, scripting language, software engineer, data management, business intelligence, amazon web services, functional programming]
[Chart: correlated skills for "Cloud technologies" (frequencies up to 2,500): git, java, html, bash, nosql, oracle, design, docker, devops, node.js, firewall, hadoop, architect, database, hardware, openstack, postgresql, amazon s3, mapreduce, data analysis, microservices, machine code, .net framework, high availability, software design, troubleshooting, scripting language, data management, integration testing, relational database, continuous delivery, project management, distributed computing, software development]
[Chart: correlated skills for "Architecture" (frequencies up to 35,000): pig, c++, xml, php, json, redis, nosql, html5, storm, jquery, design, debian, analyst, finance, hadoop, security, robotics, architect, database, mongodb, assurance, openstack, postgresql, monitoring, automation, data modeling, version control, software design, machine learning, data management, integration testing, reverse engineering, artificial intelligence, project management]
[Chart: correlated skills for "Soft skills" (frequencies up to 4,500): git, ios, c++, xml, php, .net, nosql, cloud, html5, debian, mobile, java ee, asp.net, security, statistics, analytics, database, mongodb, hardware, prototype, openstack, leadership, linked data, user interface, software design, user experience, machine learning, risk management, computer science, software engineer, data management, amazon web services, software architecture, functional programming]
8.4 Appendix: Correlated Skills for Statistical Tools and Technologies
[Chart: skills correlated with Adobe Analytics (frequencies up to 200): 3d, git, ios, xml, perl, java, html, linux, arcgis, hybris, oracle, drupal, design, debian, firewall, tableau, analysis, pentaho, compiler, magento, database, metadata, prototype, coldfusion, postgresql, wordpress, automation, web crawler, data analysis, .net framework, business objects, artificial intelligence, functional programming]
[Chart: skills correlated with Alteryx (frequencies up to 30): sql, c++, perl, mysql, design, matlab, analysis, robotics, statistics, analytics, cloudera, database, metadata, leadership, simulation, data mining, data science, data analysis, apache spark, data modeling, data warehouse, amazon redshift, machine learning, data visualization, data management, relational database, reverse engineering, artificial intelligence, business intelligence, project management, software development]
[Chart: skills correlated with Apache Spark (frequencies up to 450): sql, c++, perl, unix, html, d3.js, laser, sales, redis, lamp, nosql, cloud, html5, sqoop, impala, python, apache, asp.net, hadoop, analysis, statistics, analytics, database, javascript, cassandra, assurance, prototype, postgresql, asp.net mvc, semantic web, software design, data warehouse, artificial intelligence, software architecture, product management, software development]
[Chart: skills correlated with BusinessObjects (frequencies up to 180): perl, .net, html, nosql, oracle, owasp, finance, analysis, pentaho, analytics, database, metadata, leadership, mapreduce, data science, data analysis, .net framework, software design, user experience, database design, business objects, machine learning, data visualization, data management, business intelligence, project management, software architecture, distributed computing, software development]
[Chart: skills correlated with Cognos (frequencies up to 300): etl, php, perl, unix, html, nosql, cloud, neo4j, mysql, hadoop, ipython, analytics, cloudera, compiler, architect, selenium, database, metadata, peoplesoft, sharepoint, linked data, monitoring, data modeling, data conversion, data warehouse, computer science, data management, relational database, artificial intelligence, business intelligence, amazon web services, software engineering, product management]
[Chart: skills correlated with Excel (frequencies up to 1,200): r, sql, x86, xml, .net, java, json, html, linux, stata, nosql, spark, cloud, oracle, debian, python, android, statistics, postgresql, peoplesoft, monitoring, web analytics, user interface, troubleshooting, database design, customer support, scripting language, relational database, reverse engineering, artificial intelligence, project management, amazon web services, distributed computing, software development]
[Chart: skills correlated with Jaspersoft (frequencies up to 8): sql, xml, perl, .net, soap, sales, mysql, java ee, analyst, asp.net, statistics, database, data mining, data analysis, user interface, .net framework, machine learning, data management, reverse engineering, artificial intelligence, business intelligence, software engineering, product management, software development]
[Chart: skills correlated with Mathematica (frequencies up to 30): sql, c++, perl, laser, nosql, mysql, design, matlab, python, analysis, statistics, database, mongodb, metadata, leadership, data mining, data science, data analysis, machine code, machine learning, computer science, software development]
[Chart: skills correlated with Matlab (frequencies up to 400): r, tcl, git, ios, java, laser, sales, scipy, xilinx, spark, nosql, boost, mysql, oracle, design, debian, directx, android, polymer, analytics, architect, javascript, prototype, developer, automation, data analysis, device driver, user interface, machine code, grid computing, .net framework, database design, machine learning, image processing, computer science, software engineer, project management]
[Chart: skills correlated with Microstrategy (frequencies up to 120): sql, html, sales, nosql, sqoop, github, design, fortran, analysis, statistics, analytics, compiler, sharepoint, data science, business objects, software engineer, relational database, reverse engineering, artificial intelligence, business intelligence, project management, software engineering, software architecture, product management]
[Chart: skills correlated with Oracle BI (frequencies up to 1,600): jira, c++, dba, java, json, cloud, html5, jquery, drupal, mobile, matlab, java ee, devops, asp.net, android, analysis, vmware, analytics, architect, angularjs, database, hibernate, mapreduce, virtualization, version control, data visualization, computer science, data management, relational database, artificial intelligence, business intelligence, software engineering, software development, functional programming]
[Chart: skills correlated with Pentaho (frequencies up to 70): sql, sas, c++, xml, php, perl, storm, sqoop, jquery, design, python, hadoop, analysis, symfony, analytics, database, mongodb, developer, leadership, mapreduce, data mining, data science, elasticsearch, .net framework, data visualization, business intelligence, project management, amazon web services, software engineering, software development]
[Chart: skills correlated with Power BI (frequencies up to 10,000): r, ios, java, unix, html, sales, nosql, mysql, storm, hbase, sybase, fortran, java ee, devops, tableau, hadoop, analytics, architect, postgresql, sharepoint, data mining, data science, virtualization, development, web analytics, version control, machine learning, computer science, relational database, business intelligence, project management, software architecture, distributed computing, responsive web design]
[Chart: skills correlated with PowerPivot (frequencies up to 50): sql, perl, mysql, finance, analysis, statistics, compiler, database, leadership, mapreduce, automation, data mining, data analysis, device driver, web analytics, data modeling, data warehouse, data visualization, reverse engineering, artificial intelligence, business intelligence, software development, functional programming]
[Chart: skills correlated with QlikView (frequencies up to 140): sql, pig, c++, json, html, storm, design, finance, robotics, pentaho, database, leadership, mapreduce, automation, data mining, web analytics, user interface, .net framework, software design, computer science, data management, artificial intelligence, product management, software development, complex event processing]
[Chart: skills correlated with R (frequencies up to 60): c++, perl, java, d3.js, linux, nosql, mysql, design, java ee, finance, asp.net, analysis, robotics, statistics, compiler, architect, javascript, hardware, metadata, assurance, simulation, sharepoint, linked data, data mining, data analysis, device driver, virtualization, development, cryptography, version control, machine learning, risk management, artificial intelligence, business intelligence, functional programming]
[Chart: skills correlated with SAS (frequencies up to 600): c++, x86, jade, linux, sales, sybase, impala, fortran, node.js, finance, analysis, security, atlassian, statistics, database, leadership, confluence, automation, data analysis, apache spark, bioinformatics, .net framework, machine learning, risk management, regression testing, scripting language, relational database, amazon web services, software engineering, distributed computing, software development]
[Chart: skills correlated with SPSS (frequencies up to 900): r, sas, .net, html, scipy, spark, nosql, scrum, scrapy, design, matlab, analysis, security, statistics, analytics, database, mongodb, javascript, metadata, monitoring, development, web analytics, machine code, bioinformatics, .net framework, business objects, risk management, computer science, data management, software architecture, software development]
[Chart: skills correlated with Stata (frequencies up to 70): r, sql, c++, html, scipy, arcgis, mysql, design, numpy, matlab, finance, hadoop, analysis, security, vmware, robotics, statistics, compiler, database, metadata, data mining, data science, data analysis, machine code, data modeling, virtual machine, machine learning, data visualization, computer science, scripting language, data management, reverse engineering, project management]
8.5 Appendix: Popularity of Tools and Technologies from Literature Analysis
Statistics and BI
Search Tools and Technologies
search-based applications
Elasticsearch
Solr
Lucene
Programming Languages
Data Mining
Hadoop
Apache MapReduce
Apache HDFS
Hadoop YARN
RHIPE
Cloudera
Upper-level Technologies
Databases
Architecture
8.6 Appendix: Big Data Training Needs Questionnaire Form
BIG DATA TRAINING NEEDS QUESTIONNAIRE
Q1. What skills need to be acquired?
Q2. What data sources should be covered? (Sensor data, Social media data, Financial transaction data, Web-scraped data etc.)
Q3. How many staff members need training?
Q4. Are the training needs the same for all types of staff members? (Statistician in a particular domain, General statistical methodologist, IT expert etc.)
Q5. By when do they need it?
Q6. What are the priorities? (with respect to skills, data sources, time constraints, training delivery methods etc.)
Thank you. That is the end of the questions.
8.7 Appendix: Learning outcomes defined for CF-DS competences and different mastery/proficiency levels LO ID Data Science LO by Knowledge levels (compliant to ACM classification) and Competence key verbs Familiarity Usage Assessment Choose, Classify, Apply, Analyze, Adapt, Assess, Collect, Compare, Build, Construct, Change, Configure, Contrast, Develop, Combine, Define, Demonstrate, Examine, Compile, Describe, Execute, Experiment with, Compose, Explain, Find, Identify, Infer, Conclude, Identify, Illustrate, Inspect, Model, Criticize, Create, Label, List, Match, Motivate, Decide, Deduct, Name, Omit, Organize, Select, Defend, Design, Operate, Outline, Simplify, Solve, Discuss, Recall, Rephrase, Survey, Test for, Determine, Show, Summarize, Visualize Disprove, Tell, Translate Evaluate, Imagine, Improve, Influence, Invent, Judge, Justify, Optimize, Plan, Predict, Prioritize, Prove, Rate, Recommend, Solve Data Science Data Analytics (DSDA) LO1-DA DSDA-DA Choose appropriate Develop data Create formal Use existing analytical analysis model for the appropriate method and operate application for specific data analytics existing tools to do specific data sets organizational and statistical specified data and tasks or tasks and techniques on analysis. Present processes. processes and available data data in the required Identify necessary use it to to discover form. methods and use discover hidden new relations them in relations, and deliver combination if propose insights into necessary. optimization research Identify relations and problem or and provide improvements. organizational consistent reports Develop new processes and and models and support visualizations. methods if decision- necessary. making. Recommend and influence organizational
135
improvement based on continuous data analysis.
LO1.01 DSDA01 Choose and execute Identify existing Design and Effectively use existing data requirements and evaluate variety of data analytics and develop predictive predictive analytics predictive analytics analysis tools. analysis tools to techniques, tools. discover new such as relations. Machine Learning (including supervised, unsupervised, semi- supervised learning), Data Mining, Prescriptive and Predictive Analytics, for complex data analysis through the whole data lifecycle LO1.02 DSDA02 Choose and execute Select most Assess and Apply standard methods appropriate optimize designated from existing statistical organization quantitative statistical libraries to techniques and processes using techniques, provide overview. model available statistical including data to deliver techniques. statistics, time insights. series analysis, optimization, and simulation to deploy appropriate models for analysis and prediction LO1.03 DSDA03 Operate tools for Analyze available Assess, adapt, Identify, complex data data sources and and combine extract, and handling. develop tool that data sources to pull together
136
available and work with improve pertinent complex datasets. analytics heterogeneous data, including modern data sources such as social media data, open data, governmental data LO1.04 DSDA04 Name and use basic Use multiple Evaluate and Understand performance performance and recommend the and use assessment metrics accuracy metrics, most different and tools. select and use appropriate performance most appropriate metrics, propose and accuracy for specific type of new for new metrics for data analytics applications. model application. validation in analytics projects, hypothesis testing, and information retrieval LO1.05 DSDA05 Define data elements Develop Design Develop necessary to develop specialized specialized required data specified data analytics to analytics to analytics for analytics. enable decision- improve organizational making. decision-making. tasks, integrate data analytics and processing applications into organization workflow and business processes to enable agile decision making LO1.06 DSDA06 Choose and execute Build Create and Visualise standard visualizations for optimize results of data visualization. complex and visualizations to analysis, design variable data. influence dashboard and
137
use storytelling executive methods decisions. Data Science Engineering LO2-ENG DSENG - Use Identify and operate Model problems Evaluate engineering instruments and and develop new instruments and principles and applications for data instruments and applications to modern collection, analysis applications for optimize data computer and management data collection, collection, technologies to analysis and analysis and research, management management. design, following implement established new data engineering analytics principles. applications; develop experiments, processes, instruments, systems, infrastructures to support data handling during the whole data lifecycle. LO2.01 DSENG01 Choose potential Model data Create Use technologies to analytics innovative engineering develop, structure, application to solution to principles instrument, better develop research and (general and machines, suitable design data software) to experiments, instruments, analytics research, processes, and machines, design, develop systems. experiments, and implement processes, and new systems. instruments and applications for data collection, storage, analysis and visualisation LO2.02 DSENG02 Name computational Apply existing Adapt and Develop and solution and identify computational optimize existing apply potential data solutions to data computational computational analytics platform analytic platform. solutions to and data driven better fit to a solutions to given data
138
domain related analytics problems using platform. wide range of data analytics platforms, with the special focus on Big Data technologies for large datasets and cloud based data analytics platforms LO2.03 DSENG03 Identify a set of Survey various Evaluate and Develop and potential data specialized data recommend prototype analytics tools to fit analytics tools and optimal data specialised specification. identify the best analytics tools to data analysis option. influence applicaions, decision making. tools and supporting infrastructures for data driven scientific, business or organisational workflow; use distributed, parallel, batch and streaming processing platforms, including online and cloud based solutions for on-demand provisioned and scalable services LO2.04 DSENG04 Find possible Model the Predict the Develop, database solutions problem to apply difference in deploy and including both database term of operate large relational and non- technology. performance scale data relational databases. between storage and relational and processing non-relational
139
solutions using databases and different recommend a distributed and solution. cloud based platforms for storing data (e.g. Data Lakes, Hadoop, Hbase, Cassandra, MongoDB, Accumulo, DynamoDB, others) LO2.05 DSENG05 Identify security Analyze security Evaluate security Consistently issues related to threats and solve threats and apply data reliable data access. them using known recommend security techniques. adequate mechanisms solutions. and controls at each stage of the data processing, including data anonymisation, privacy and IPR protection. LO2.06 DSENG06 Define technical Apply existing Combine several Design, build, requirements for SQL/NoSQL techniques and operate SQL/NoSQL databases, Data optimize them relational and databases, Data Warehouse to design new or non-relational Warehouse technologies for custom databases (SQL technologies for data creating data environment to and NoSQL), ingest. pipelines. integrate integrate them existing DW and with the database modern Data technologies for Warehouse new type of data solutions, and analytic ensure applications. effective ETL (Extract, Transform, Load), OLTP, OLAP processes for large datasets Data Science Data Management (DSDM)
LO3-DM DSDM-DM
Level 1: Execute data strategy in a form of Data Management Plan and illustrate how available software can help to promote data quality and accessibility.
Level 2: Develop components of data strategy and methods that improve quality, accessibility and publications of data.
Level 3: Create a Data Management Plan aligned with the organizational needs; evaluate IPR and ethical issues.
Competence: Develop and implement data management strategy for data collection, storage, preservation, and availability for further processing.

LO3.01 DSDM01
Level 1: Explain and execute data strategy in a form of Data Management Plan.
Level 2: Develop components of data strategy in a form of Data Management Plan.
Level 3: Assess various data strategies and create a data strategy, in a form of Data Management Plan, aligned with organizational needs.
Competence: Develop and implement data strategy, in particular in a form of Data Management Plan (DMP).

LO3.02 DSDM02
Level 1: Operate data models including metadata.
Level 2: Experiment with data models and model relevant metadata.
Level 3: Evaluate and design data models, including metadata.
Competence: Develop and implement relevant data models, including metadata.

LO3.03 DSDM03
Level 1: Collect different data sources.
Level 2: Survey and visualize connections between different data sources.
Level 3: Compose different data sources to enable further analysis.
Competence: Collect and integrate different data sources and provide them for further analysis.

LO3.04 DSDM04
Level 1: Operate a historical data repository.
Level 2: Construct a historical data repository.
Level 3: Improve or design a historical data repository.
Competence: Develop and maintain a historical data repository of analysis results (data provenance).

LO3.05 DSDM05
Level 1: Illustrate how available software can help to promote data quality, accessibility and publications.
Level 2: Develop methods that improve quality, accessibility and publications of data.
Level 3: Improve quality, accessibility and publications of data.
Competence: Ensure data quality, accessibility, publications (data curation).

Data Science Research Methods and Project Management (DSRMP)

LO4-RMP DSRM
Level 1: Match elements of scientific or similar method and identify appropriate actions for organizational strategy to create new capabilities.
Level 2: Apply scientific or similar method and develop action plans to translate organizational strategies to create new capabilities.
Level 3: Evaluate methodologies to optimize the development of organizational objectives.
Competence: Create new understandings and capabilities by using the scientific method (hypothesis, test/artefact, evaluation) or similar engineering methods to discover new approaches to create new knowledge and achieve research or organisational goals.

LO4.01 DSRM01
Level 1: Match elements of scientific or similar method to a given problem.
Level 2: Apply scientific method to create new understandings and capabilities.
Level 3: Evaluate various methods and predict which method can optimize the creation of new understandings and capabilities.
Competence: Create new understandings by using the research methods (including hypothesis, artefact/experiment, evaluation) or similar engineering research and development methods.

LO4.02 DSRM02
Level 1: Choose observable facts from an existing study for a better understanding.
Level 2: Apply systematic study toward a fuller knowledge or understanding of the observable facts.
Level 3: Combine several methods to discover new approaches to achieve organizational goals.
Competence: Direct systematic study toward understanding of the observable facts, and discover new approaches to achieve research or organisational goals.

LO4.03 DSRM03
Level 1: Formulate and test a hypothesis for a specified task or research question.
Level 2: Create a full experiment to test a hypothesis for a domain specific task or experiment.
Level 3: Analyse domain related models and propose analytics methods; identify and suggest new data or improve the quality of used data.
Competence: Analyse domain related research process model and analyse available data to identify research questions and/or organisational objectives and formulate a sound hypothesis.

LO4.04 DSRM04
Level 1: Show creativity under guidance of a senior staff in discovering and revising knowledge.
Level 2: Develop creative solutions using systematic investigation or experimentation to revise and discover knowledge.
Level 3: Adapt common systematic investigation to design and plan creative work to discover or revise knowledge.
Competence: Undertake creative work, making systematic use of investigation or experimentation, to discover or revise knowledge of reality, and use this knowledge to devise new applications and contribute to the development of organizational objectives.
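Several of the research-methods outcomes above (notably DSRM03) concern formulating and testing a hypothesis against collected data. As an illustrative sketch only, on synthetic data and with an assumed 0.05 significance threshold, a two-sample permutation test can be written with nothing beyond the Python standard library:

```python
import random

def permutation_test(a, b, n_perm=5000, seed=42):
    """Estimate the p-value for the hypothesis that samples a and b
    share a mean, by reshuffling group labels n_perm times."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Synthetic groups with clearly different means (invented numbers).
group_a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
group_b = [12.0, 11.7, 12.2, 11.9, 12.1, 11.8]
p = permutation_test(group_a, group_b)
print(p < 0.05)  # small p-value: reject the "same mean" hypothesis
```

A permutation test is chosen here because it requires no distributional assumptions, which keeps the hypothesis-testing logic of the experiment explicit in the code.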
LO4.05 DSRM05
Level 1: Illustrate outstanding ideas to solve complex problems.
Level 2: Identify non-standard solutions to solve complex problems.
Level 3: Recommend a cost effective solution to a complex problem.
Competence: Design experiments which include data collection (passive and active) for hypothesis testing and problem solving.

LO4.06 DSRM06
Level 1: Identify appropriate actions for a given project plan or experiment.
Level 2: Develop actions and an action plan to translate strategies into an actionable plan.
Level 3: Recommend effective action plans to translate strategies; suggest new data to improve effectiveness.
Competence: Develop and guide data driven projects, including project planning, experiment design, data collection and handling.

Business Process Management (DSBA)

LO5-BA DSDK
Level 1: Match elements of a mathematical framework to a given business problem and operate data support services for other organizational roles.
Level 2: Model business problems into an abstract mathematical framework and identify critical points which influence the development of organizational objectives.
Level 3: Evaluate various methods to predict which method can optimize solving business problems, and recommend strategies that optimize the development of organizational objectives.
Competence: Use domain knowledge (scientific or business) to develop relevant data analytics applications; adopt general Data Science methods to domain specific data types and presentations, data and process models, organisational roles and relations.
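LO5-BA asks learners to model a business problem in an abstract mathematical framework. A deliberately toy sketch, with entirely invented figures: choosing a product mix that maximises profit under a single capacity constraint, solved by exhaustive enumeration rather than a proper optimisation library:

```python
from itertools import product

# Hypothetical problem: profit per unit and hours per unit for two
# products are invented figures; capacity is 40 production hours.
PROFIT = {"A": 30, "B": 50}
HOURS = {"A": 2, "B": 4}
CAPACITY = 40

def best_mix(max_units=20):
    """Enumerate feasible (units_a, units_b) mixes and return the most
    profitable one together with its profit."""
    best, best_profit = (0, 0), 0
    for a, b in product(range(max_units + 1), repeat=2):
        if a * HOURS["A"] + b * HOURS["B"] <= CAPACITY:
            profit = a * PROFIT["A"] + b * PROFIT["B"]
            if profit > best_profit:
                best, best_profit = (a, b), profit
    return best, best_profit

print(best_mix())  # → ((20, 0), 600)
```

The point of the abstraction is that once profits, resource usage and capacity are expressed as data, the same code answers "what if" questions about the business problem without rewriting the model.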
LO5.01 DSBA01
Level 1: Match elements of a mathematical framework to a given business problem.
Level 2: Model an unstructured business problem into an abstract mathematical framework.
Level 3: Evaluate various methods and predict which method can optimize solving business problems.
Competence: Analyse information needs, assess existing data and suggest/identify new data required for a specific business context to achieve organizational goals, including using social network and open data sources.

LO5.02 DSBA02
Level 1: Match data to the specification of services.
Level 2: Analyze services to develop a data specification.
Level 3: Assess and improve the use of data in services.
Competence: Operationalise fuzzy concepts to enable key performance indicator measurement to validate the business analysis; identify and assess potential challenges.

LO5.03 DSBA03
Level 1: Identify appropriate actions for management and organizational decisions.
Level 2: Identify critical points which influence the development of organizational objectives.
Level 3: Recommend strategies that optimize the development of organizational objectives.
Competence: Deliver business focused analysis using appropriate BA/BI methods and tools; identify business impact from trends; make a business case as a result of organisational data analysis and identified trends.

LO5.04 DSBA04
Level 1: Operate data support services for other organizational roles.
Level 2: Develop data support services for other organizational roles.
Level 3: Optimize data support services for other organizational roles.
Competence: Analyse opportunities and suggest the use of historical data available at the organisation for organizational process optimization.

LO5.05 DSBA05
Level 1: Summarize customer data.
Level 2: Survey and visualize customer data.
Level 3: Recommend actions based on data analysis to improve customer relations.
Competence: Analyse customer relations data to optimise/improve interaction with specific user groups or in specific business sectors.

LO5.06 DSBA06
Level 1: Access and use external open data and social network data.
Level 2: Identify data that bring value to analytics used for marketing; use cloud based solutions.
Level 3: Suggest new marketing models based on existing and external data.
Competence: Analyse multiple data sources for marketing purposes; identify effective marketing actions.
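DSBA02 above is about operationalising fuzzy concepts into measurable key performance indicators. As a minimal sketch with fabricated sample records and a hypothetical KPI definition, "customer engagement" can be made computable by fixing an explicit threshold:

```python
# Hypothetical operationalisation: "customer engagement" is a fuzzy
# concept, so we define a concrete KPI — the share of customers with
# at least `threshold` interactions. Counts below are invented.
interactions = {"c1": 5, "c2": 1, "c3": 4, "c4": 0, "c5": 3}

def engagement_rate(counts, threshold=3):
    """Fraction of customers whose interaction count meets the threshold."""
    engaged = sum(1 for n in counts.values() if n >= threshold)
    return engaged / len(counts)

print(engagement_rate(interactions))  # → 0.6
```

Making the threshold an explicit parameter is exactly the "identify and assess potential challenges" step: the KPI's value, and any business conclusion drawn from it, depends on that modelling choice.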