Ref. Ares(2017)5087922 - 18/10/2017

SoBigData – 654024 www.sobigdata.eu

Project Acronym SoBigData

Project Title SoBigData Research Infrastructure Social Mining & Big Data Ecosystem

Project Number 654024

Deliverable Title Data Scientists Training Materials 1

Deliverable No. D4.4

Delivery Date 30 April 2017

Authors Giles Greenway (KCL), Tobias Blanke (KCL), Marco Braghieri (KCL)

SoBigData receives funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 654024

DOCUMENT INFORMATION

PROJECT

Project Acronym: SoBigData
Project Title: SoBigData Research Infrastructure Social Mining & Big Data Ecosystem
Project Start: 1st September 2015
Project Duration: 48 months
Funding: H2020-INFRAIA-2014-2015
Grant Agreement No.: 654024

DOCUMENT

Deliverable No.: D4.4
Deliverable Title: Data Scientists Training Materials
Contractual Delivery Date: 30 April 2017
Actual Delivery Date: 18 October 2017
Author(s): Giles Greenway (KCL), Tobias Blanke (KCL), Marco Braghieri (KCL)
Editor(s): Marco Braghieri (KCL), Valerio Grossi (CNR)
Reviewer(s): Valerio Grossi (CNR), Anna Monreale (UNIPI), Paolo Ferragina (UNIPI), Beatrice Rapisarda (CNR)
Contributor(s): Giles Greenway (KCL), Dominic Rout (USFD), Valerio Grossi (CNR), Anna Monreale (UNIPI)
Work Package No.: WP4
Work Package Title: Training
Work Package Leader: KCL
Work Package Participants: CNR, USFD, UNIPI, FRH, UT, IMT, LUH, KCL, SNS, AALTO, ETHZ, TUDelft
Dissemination: Public
Nature: Report
Version / Revision: V1.0
Draft / Final: Draft
Total No. Pages: 23 (including cover)
Keywords: Training Materials, Data Scientists

D4.4 Data Scientists Training Materials 1 Page 2 of 23 SoBigData – 654024 www.sobigdata.eu

DISCLAIMER

SoBigData (654024) is a Research and Innovation Action (RIA) funded by the European Commission under the Horizon 2020 research and innovation programme.

SoBigData proposes to create the Social Mining & Big Data Ecosystem: a research infrastructure (RI) providing an integrated ecosystem for ethic-sensitive scientific discoveries and advanced applications of social data mining on the various dimensions of social life, as recorded by “big data”. Building on several established national infrastructures, SoBigData will open up new research avenues in multiple research fields, including mathematics, ICT, and human, social and economic sciences, by enabling easy comparison, re-use and integration of state-of-the-art big social data, methods, and services, into new research.

This document contains information on SoBigData core activities, findings and outcomes and it may also contain contributions from distinguished experts who contribute as SoBigData Board members. Any reference to content in this document should clearly indicate the authors, source, organisation and publication date.

The document has been produced with the funding of the European Commission. The content of this publication is the sole responsibility of the SoBigData Consortium and its experts, and it cannot be considered to reflect the views of the European Commission. The authors of this document have taken every available measure to ensure that its content is accurate, consistent and lawful. However, neither the project consortium as a whole nor the individual partners that implicitly or explicitly participated in the creation and publication of this document accept any responsibility for consequences that might arise from using its content.

The European Union (EU) was established in accordance with the Treaty on the European Union (Maastricht). There are currently 27 member states of the European Union. It is based on the European Communities and the member states’ cooperation in the fields of Common Foreign and Security Policy and Justice and Home Affairs. The five main institutions of the European Union are the European Parliament, the Council of Ministers, the European Commission, the Court of Justice, and the Court of Auditors (http://europa.eu.int/).

Copyright © The SoBigData Consortium 2015. See http://project.sobigdata.eu/ for details on the copyright holders.

For more information on the project, its partners and contributors please see http://project.sobigdata.eu/. You are permitted to copy and distribute verbatim copies of this document containing this copyright notice, but modifying this document is not allowed. You are permitted to copy this document in whole or in part into other documents if you attach the following reference to the copied elements: “Copyright © The SoBigData Consortium 2015.”

The information contained in this document represents the views of the SoBigData Consortium as of the date they are published. The SoBigData Consortium does not guarantee that any information contained herein is error-free, or up to date. THE SoBigData CONSORTIUM MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, BY PUBLISHING THIS DOCUMENT.


GLOSSARY

ABBREVIATION DEFINITION

Python: Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes. It has interfaces to many system calls and libraries, as well as to various window systems, and is extensible in C or C++. It is also usable as an extension language for applications that need a programmable interface. Python runs on many Unix variants, on the Mac, and on Windows 2000 and later.

R: R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible.

GitHub: GitHub is a development platform. It allows users, ranging from open-source projects to businesses, to host and review code, manage projects, and build software together in a participative environment.

Jupyter: The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

GATE: GATE is open-source software which focuses on text processing and includes a desktop client for developers, a workflow-based web application, a Java library, an architecture and a process.

Apromore: Apromore is an online business process analytics platform which includes a wide range of process mining analytics. It is an open-source tool which is available both via the cloud and as a download.


TABLE OF CONTENT

DOCUMENT INFORMATION
DISCLAIMER
GLOSSARY
TABLE OF CONTENT
DELIVERABLE SUMMARY
EXECUTIVE SUMMARY
1 Relevance to SoBigData
1.1 Purpose of this document
1.2 Relevance to project objectives
1.3 SOBIGDATA project description
1.4 Relation to other workpackages
1.5 Structure of the document
2 Report on Training Activities: T2 – Training Modules
3 Interactive Training Environments
4 Notebooks
5 GATE Training Course Materials
6 Business Process Monitoring and Mining Materials
7 SoBigData Master Program Training Materials
7.1 Module Example: Information Retrieval
8 Conclusions


DELIVERABLE SUMMARY

This deliverable’s objective is to report on training materials that have been developed within Work Package 4. As set out in Task 4.2, Work Package 4 has the goal of developing interactive learning environments.

The first phase of this task is to survey existing training materials which have been developed by SoBigData partners in order to assess which materials may be adapted into the SoBigData research infrastructure and constitute a starting point to develop original training materials within the SoBigData project.

Hence, this phase of task 4.2 shall be followed by a two-step process. The first step shall be to integrate existing training and educational activities. This integration is aimed at promoting and developing existing training activities as open educational resources. The second step of task 4.2 is to create new modules specifically for groups that currently lack teaching material on social big data, such as social scientists and humanities researchers.

The training materials developed following task T4.2 within Work Package 4 of the SoBigData project will be published in open educational standards. This choice has been made in order to facilitate integration with relevant e-learning environments.

A specific group of modules aimed at data scientists is being developed. These modules will provide data scientists with best practices and guidelines to work on social big data resources provided by the SoBigData Research Infrastructure and exploit the SoBigData Research Infrastructure analysis tools.

Specifically, providers for each integrated national infrastructure are providing training materials in their respective thematic areas, whereas KCL is responsible for the development of tutorials aimed at social science and humanities researchers within the SoBigData Research Infrastructure. All other partners are contributing to the production of further materials focused on applied social big data mining projects.

The development phase detailed in this deliverable (D4.4) will be followed by a second phase, which will increase the focus on the production of multimedia environments to be used as learning and training resources. Moreover, there will be an increased focus on developing training materials based on advanced machine learning tools.


EXECUTIVE SUMMARY

Deliverable 4.4 is a Data Scientists Training Material deliverable scheduled for M20 of the SoBigData project.

The aim of the deliverable is to report on training materials which have been developed by SoBigData partners which may have an application in fulfilling task T4.2 of Work Package 4. The materials include online training materials and materials which have been developed within face to face training events. Where possible, we provide access statistics for online training materials.

Moreover, Work Package 4 has also focused on contemporary developments in open education for data science. Specifically, it appears that Python and R are the most relevant languages of choice for training data scientists. Hence, the creation of training materials has focused on environments based on these languages, which is in line with contemporary developments in the field of data science education.

Each section of this deliverable focuses on a specific set of training materials which have been developed by SoBigData partners within Work Package 4.

Section 2 provides insight into recent developments in open education for data science.

In order to provide a better integration with relevant e-learning environments, training materials have been developed in open educational standards. Moreover, a further group of modules is in development. These modules aim to provide best practices and guidelines to exploit social big data resources which are part of the SoBigData Research Infrastructure. The first phase described in this deliverable (D4.4) shall be followed by a second phase focused on multimedia content and advanced machine learning tools, which will be used as learning resources. In detail, Work Package 4 has developed training materials in open educational standards, using Python and R. This choice has been made as both Python and R are languages of choice for open education of data scientists.

Section 3 describes interactive training environments which have been developed for undergraduate and master level students based on R and Python.

KCL has developed a number of data science materials for both undergraduate and master level students. Using Swirl, a library developed in R, has allowed KCL to provide students with a question-and-answer learning environment. Users are guided in learning the R programming language and the direct usage of specific data science libraries. Moreover, KCL has developed tools in R that can be deployed in workshops and other learning occasions. Other themes which have been or will be featured are digital communities, social sensing and historical cultures along with statistics.

Section 4 describes the development of notebooks within the Jupyter framework, which has become a standard for presenting data, code, visualisations and results at data-science events.

KCL has been developing a complete set of tools around Jupyter notebooks. Jupyter is a web-based computing environment and has become the de facto standard for presenting data, code, visualisation and results at data science events. Jupyter notebooks allow code to be executed and modified in place, and can also contain text, mathematical equations, interactive components, images, videos and more.

Section 5 describes materials developed within GATE, which are based on various text analysis techniques, which include information extraction, deep learning and social media mining.

This deliverable also presents GATE training course materials. These materials focus on information extraction, deep learning and social media mining. They were developed for the 9th GATE training course, which took place in Sheffield, UK in June 2016. The GATE training course was organised around lectures and hands-on sessions. All material was then made available online. It includes training material on gathering social media data, crowd sourcing, machine learning, opinion mining and an introduction to JAPE, a specially developed pattern matching language for GATE. Overall, the directory pages and training materials were accessed online a total of 17,013 times.

Section 6 describes materials developed for Business Process Monitoring and Mining. These materials have been developed by University of Tartu.

These materials, which have been developed for a summer school, provide insight into different types of process mining tools. Business Process Monitoring and Mining focuses on data produced during business processes, aiming to extract useful insights. Materials focus on different process mining tools, ranging from offline methods to runtime process mining. Moreover, these materials provide a hands-on approach centred on an open-source business analytics platform, named Apromore.

Section 7 describes materials developed for the SoBigData Master Program in Pisa, Italy. These materials have been developed by the University of Pisa, Consiglio Nazionale delle Ricerche and Scuola Normale Superiore di Pisa.

These materials focus on a vast array of areas within big data analysis, ranging from ethics to data mining and machine learning to data visualisation and storytelling. The SoBigData Master Program in Pisa provides participants with both lectures and laboratory activities. Materials have been specifically developed and range from slides to hands-on activities and include the development of Python code. Moreover, this section provides a specific focus on one set of materials, centred on Information Retrieval. The Information Retrieval materials include the slides on which a series of ten lectures is based, followed by five laboratory sessions which provide a hands-on approach to the themes described in the lecture series.


1 RELEVANCE TO SOBIGDATA

Work Package 4, entitled Training, aims to establish a joint training and education initiative on social big data within the European Research Area. The Work Package explores and develops both conventional and unconventional training experiences for master students, PhD students and early career post-doctoral researchers as well as an academically interested general public. Likewise, Work Package 4 proposes campaigns aimed at high school students to promote interest in data science with special emphasis on gender issues. These experiences include a number of different activities such as summer schools, datathons, training courses and specific high-school oriented courses. Moreover, in this report we provide an in-depth analysis of the materials that have been produced for data scientists among the activities of WP4.

1.1 PURPOSE OF THIS DOCUMENT

This deliverable focuses on training materials which have been developed within Work Package 4 of the SoBigData project in the reporting period.

Reports on training materials are described individually. Each report focuses on the core aspects of development, outcome and usage of training materials.

1.2 RELEVANCE TO PROJECT OBJECTIVES

The development of interactive learning environments is a key part of activities within Work Package 4. The aim of Work Package 4, entitled Training, is to contribute to building a common training and education initiative within the European Research Area.

This deliverable focuses on training materials that partners have developed for data scientists. Overall, this report is organised by type of material. Each material is described in detail, alongside a breakdown of data regarding training material access.

1.3 SOBIGDATA PROJECT DESCRIPTION

The SoBigData project's aim is to create a pan-European research infrastructure. This infrastructure will integrate already established national infrastructures. To this end, the SoBigData project has defined a series of key priorities:

1. Better access to the best national research infrastructures

2. Training a new generation of mobile researchers

3. More effective national research systems

4. Optimal transnational cooperation

5. Accelerating innovation through partnerships with industry

6. Effective data, method, and knowledge sharing


1.4 RELATION TO OTHER WORKPACKAGES

WP2: as training activities fall under the legal and ethical framework of the SoBigData infrastructure

WP3: as training activities have a strong relationship with the dissemination and impact strategies developed for the whole SoBigData project

WP5: as training activities are connected to the innovation activities aimed at industry and other stakeholders

WP10: as training activities will present and educate on the new methodologies and technologies that SoBigData is developing

WP11: as training activities relate to the construction of a benchmarking and evaluation framework for big data analytics and social mining methods.

1.5 STRUCTURE OF THE DOCUMENT

The Data Scientists Training Materials deliverable is structured around each training material which has been developed within Work Package 4 of the SoBigData project. After providing insight into the current developments in open education on data science, the deliverable will describe specific environments developed within Work Package 4. The first will be environments developed in R and Python, followed by environments developed in Jupyter, materials developed within GATE, Business Process Monitoring and Mining Materials, and a review of materials produced for the SoBigData Master Program in Pisa, Italy.


2 REPORT ON TRAINING ACTIVITIES: T2 – TRAINING MODULES

A key component of the SoBigData training work is the preparation of interactive learning environments that especially target social science and humanities researchers and other groups currently underrepresented in training on social big data. Since the original conceptualisation of the SoBigData project, there has been a lot of movement on open education for data science. Numerous online courses have appeared, while universities across Europe now have a wide range of offerings to teach students. Online education tools now include commercial platforms offering subscriptions for professional data science education: https://www.datacamp.com/courses and https://www.dataquest.io/. Python and R dominate as the languages of choice for training data scientists. http://www.dataschool.io/teaching-data-science/ has an excellent summary of existing principles that should guide the development of data science educational modules:

• Using GitHub from the beginning is an important principle so that students can present their work to future employers.
• Learning by doing/coding will help students understand the material better than traditional frontal PowerPoint presentations.
• Using videos and other multimedia resources as teaching platforms will support students’ learning progress.
• Working with real-life data within interesting scenarios will help demonstrate the relevance of data science work for economy and society.

As a result, Work Package 4 has agreed to focus on open learning environments using two recent developments in interactive teaching tools. In a second step, we will produce more videos and other multimedia environments around learning and teaching resources and focus on advanced machine learning tools such as Weka (http://www.cs.waikato.ac.nz/ml/weka/), KNIME (https://www.knime.org/) or H2O (http://www.h2o.ai/).


3 INTERACTIVE TRAINING ENVIRONMENTS

We have been developing and using a wide variety of data science materials for both undergraduate and masters level teaching, based on R and Python. R is a de facto standard in statistical computing and visualisation, and our materials are specifically based on R's Swirl library (http://swirlstats.com/). Swirl provides an interactive question-and-answer learning environment, in which the user may be guided through learning the syntax of R, the use of specific data science libraries, or the exploration of individual data sets. Swirl courses are essentially archives of interactive content that combine instruction, data, code and even multimedia elements. To our knowledge, it is currently the only widely used interactive data science learning environment that is free to use. Courses can be downloaded for free from https://github.com/swirldev/swirl_courses#swirl-courses or tried out at DataCamp https://www.datacamp.com/community/open-courses/r-programming-with-swirl.

FIGURE 1. An interactive SWIRL Session

Swirl is best used within RStudio, a freely available desktop or browser-based R environment that can embed visualisations. To support our Swirl courses, we have provided an RStudio Docker image (https://github.com/kingsBSD/rstudio-kcl-ddh) and a VirtualBox appliance (https://github.com/kingsBSD/DDH-OneTrueBox) that include all the required R packages. This makes deployment for workshops, etc. a question of minutes rather than hours. So far we have released several modules on the introduction to R using, for instance, an exploration of a dataset relating to the use of the death penalty in the US. Figure 1 displays a dataset which comprises each US senator and his or her vote on each bill presented in the US Senate. It is currently held in a temporary repository at King's College London (https://github.com/kingsBSD/data-science-for-digital-humanities) and will be moved to the main SoBigData repository.

Other themes that have been explored and will be released include:

1 Digital Communities

2 Social Sensing

3 Historical Cultures

4 Time Series

5 Basic Statistics

We are currently testing the environment in a Master's degree on Big Data at King's College London.


4 NOTEBOOKS

We are developing complete stories around Jupyter notebooks that can form easy recipes for reproducible methods in social data science. Jupyter (http://jupyter.org/) is a web-based computing environment, mostly but not exclusively based on Python. Content is divided between a number of cells, which may contain code, visualisations and maps, interactive UI elements or rich HTML content. Jupyter has become another de facto standard for presenting data, code, visualisations and results at data-science events (http://pydata.org/).

Notebooks within the Jupyter framework allow embedding executable and modifiable code in an interactive and exploratory manner. On top of this foundation, the notebooks add a document-based workflow. Notebook documents can contain live code, descriptive text, mathematical equations, interactive user-interface components, images, videos, and arbitrary HTML. Such documents thus provide a complete and reproducible record of a computation and can be shared with others, version controlled and converted to a wide range of static formats (HTML, .pdf, slides, etc.).
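To make the document-based workflow more concrete, the following minimal sketch builds a two-cell notebook document by hand, using only Python's standard library. It assumes version 4 of the notebook JSON format; the helper names (`make_notebook`, `markdown_cell`, `code_cell`) and the cell contents are hypothetical and are not taken from the SoBigData materials.

```python
import json


def markdown_cell(text):
    """Build a descriptive-text cell (assumed nbformat 4 layout)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Build an unexecuted code cell (assumed nbformat 4 layout)."""
    return {
        "cell_type": "code",
        "execution_count": None,  # not run yet
        "metadata": {},
        "outputs": [],            # filled in when the cell is executed
        "source": source,
    }


def make_notebook(cells):
    """Wrap a list of cells in a minimal notebook document."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# Illustrative content only: one text cell followed by one code cell.
notebook = make_notebook([
    markdown_cell("# Votes per senator\nA short description of the analysis."),
    code_cell("votes = {'yes': 3, 'no': 1}\nprint(sum(votes.values()))"),
])

# This JSON text is what would be written to a .ipynb file.
notebook_json = json.dumps(notebook, indent=1)

# Reading the document back shows that text and code travel together.
reloaded = json.loads(notebook_json)
cell_types = [cell["cell_type"] for cell in reloaded["cells"]]
print(cell_types)
```

Because the whole document is plain JSON, it can be placed under version control or converted to other formats like any other text file.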

Notebooks saved in the .ipynb format are correctly rendered with all the rich content within GitHub. There are free notebook hosting services for academic projects. Jupyter provides a web-based shell, and also supports R. Crucially, both R and Python are supported by Apache Spark, the distributed computing environment and machine-learning library that is displacing Hadoop's map-reduce for many data-science applications.

A good example for the advanced use of notebooks is our Apache Spark teaching and experimentation environment. We have wrapped Apache Spark in an experimental containerised web application that allows novice users to participate in ad-hoc clusters via Jupyter and produced corresponding multimedia guides (https://www.youtube.com/watch?v=9xsiV9dUlgI; https://github.com/kingsBSD/ananke ). Figure 2, below, displays a Jupyter Notebook exploring collaboration networks in the US Senate.


FIGURE 2. A Jupyter Notebook developed by King’s College London


5 GATE TRAINING COURSE MATERIALS

Training course materials provide insight into using a wide variety of text analysis techniques developed within GATE, an open source software focused on text processing. Techniques include information extraction, deep learning and social media mining. Further training was provided for the use of the GATE Cloud, which is part of the SoBigData project.

The material was created for a live event and made available online afterwards. Currently the material is hosted at https://gate.ac.uk/wiki/TrainingCourseJune2016/. The online material covers all lectures and hands-on sessions. Hence, it provides a complete overview of all materials which have been produced for the training course. For example, Figure 3 (below) displays Annotations, one of the central structures in GATE.

The training material was developed for the GATE training course which was held between June 6 and June 10, 2016 at the University of Sheffield, UK. The training course has reached its 9th edition and is held on a yearly basis. The event was organised around lectures and hands-on sessions. The training course is attended by members of industry, research institutes and early career researchers who aim to become familiar with the GATE family of text engineering tools and related platforms.

The training course focused on various aspects, such as gathering social media data, crowd sourcing, machine learning, opinion mining and an introduction to JAPE, a specially developed pattern matching language for GATE. Moreover, participants were provided with the choice of a final focus either on application or on programming within the GATE environment.

Hands-on sessions focused on analysing social media content, such as Twitter messages and blog posts.


FIGURE 3. An excerpt of the Introduction to GATE Developer materials

As all the material developed for the GATE training course is online, it has been possible to provide data on online access to the materials. This data has been mined from Apache logs. In total, since the pages were created, there have been 17,013 hits to the directory pages and training materials. All the material is available both in .pdf (Portable Document Format) and .odp (OpenDocument Presentation, the OpenOffice Impress format) and is accessible without any type of restriction.
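As an illustration of how such access figures can be derived from web server logs, the sketch below counts hits per path from lines in the Apache Common Log Format. It is not the procedure used for this report, which is not described here: the sample log lines are invented, the function name `count_hits` is hypothetical, and counting only requests with status 200 is an assumption.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def count_hits(lines, prefix):
    """Count successful requests per path for paths starting with `prefix`."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not follow the expected format
        # Assumption: only status 200 responses count as hits.
        if match["status"] == "200" and match["path"].startswith(prefix):
            hits[match["path"]] += 1
    return hits


# Hypothetical sample lines, invented for illustration only.
sample_log = [
    '192.0.2.1 - - [10/Jun/2016:09:00:01 +0100] "GET /wiki/TrainingCourseJune2016/ HTTP/1.1" 200 5120',
    '192.0.2.2 - - [10/Jun/2016:09:00:05 +0100] "GET /wiki/TrainingCourseJune2016/module-1.pdf HTTP/1.1" 200 90210',
    '192.0.2.1 - - [10/Jun/2016:09:01:12 +0100] "GET /wiki/TrainingCourseJune2016/module-1.pdf HTTP/1.1" 200 90210',
    '192.0.2.3 - - [10/Jun/2016:09:02:30 +0100] "GET /wiki/TrainingCourseJune2016/missing.pdf HTTP/1.1" 404 312',
    '192.0.2.4 - - [10/Jun/2016:09:03:00 +0100] "GET /index.html HTTP/1.1" 200 1024',
]

hits = count_hits(sample_log, "/wiki/TrainingCourseJune2016/")
total_hits = sum(hits.values())
print(total_hits)
```

In practice the lines would be read from the server's access log file rather than from an in-memory list, and the counting rule (for example, whether to exclude crawlers) would need to be chosen to match the reporting purpose.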


6 BUSINESS PROCESS MONITORING AND MINING MATERIALS

Business Process Monitoring and Mining Materials provide insight into process mining methods and possible applications. They focus on both offline process discovery tools and runtime process mining, while providing a hands-on approach focused on an open-source process mining tool.

The material was created for an event and made available online. Currently the material is hosted on a platform named SlideShare https://www.slideshare.net/MarlonDumas/business-process-monitoring-and-mining. The online material comprises slides from a specifically developed lecture. This material was developed by the University of Tartu, Estonia and presented during a one-day lecture at the II Latin-American School in Business Process Monitoring, held in Bogotá, Colombia in June 2017. The event was based on courses, meetings and lectures centred on business process monitoring and mining. For example, Figure 4 describes a Business Process Monitoring process, from an event stream to process mining.

Business process monitoring focuses on data produced during business processes in order to extract performance-related insights. Materials focus on different process mining tools, ranging from offline methods to runtime process mining. Offline process discovery methods include automated process discovery, conformance analysis, performance mining, and variance and deviance mining. Runtime process mining tools include predictive and prescriptive process monitoring.
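To give a flavour of what automated process discovery works with, the sketch below derives a directly-follows graph from a small event log: for each case, events are ordered by time and each pair of consecutive activities is counted. This is one common building block in process mining, shown here as an assumption-laden illustration; it is not taken from the University of Tartu slides or from Apromore, and the event log, activity names and function name `directly_follows` are hypothetical.

```python
from collections import Counter, defaultdict


def directly_follows(event_log):
    """Count how often one activity is directly followed by another.

    event_log: iterable of (case_id, timestamp, activity) tuples.
    Returns a Counter keyed by (activity, next_activity).
    """
    # Group events by case, since ordering only makes sense within a case.
    traces = defaultdict(list)
    for case_id, timestamp, activity in event_log:
        traces[case_id].append((timestamp, activity))

    counts = Counter()
    for events in traces.values():
        ordered = [activity for _, activity in sorted(events)]
        for current, following in zip(ordered, ordered[1:]):
            counts[(current, following)] += 1
    return counts


# Hypothetical event log with two cases, invented for illustration only.
event_log = [
    ("case-1", 1, "Receive order"),
    ("case-1", 2, "Check stock"),
    ("case-1", 3, "Ship goods"),
    ("case-2", 1, "Receive order"),
    ("case-2", 2, "Check stock"),
    ("case-2", 3, "Reject order"),
]

dfg = directly_follows(event_log)
for (source, target), count in sorted(dfg.items()):
    print(f"{source} -> {target}: {count}")
```

The resulting edge counts show where cases share a path and where they diverge, which is the kind of structure that discovery and deviance analysis then build on.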

Business Process Monitoring and Mining Materials describe different mining methods and their applications. Moreover, these materials provide a hands-on approach centred on an open-source process mining tool, named Apromore. Apromore is an online business process analytics platform with a variety of features, which range from managing large process model collections to process mining analytics.

FIGURE 4. An excerpt of the slides developed by University of Tartu


The Business Process Monitoring and Mining Materials are online and it has been possible to provide data on online access. The material was published on June 27, 2017 on SlideShare and has since been viewed 469 times, with a total number of 17 downloads. All the material is in .pptx, the default presentation file format for Microsoft Office PowerPoint. The SlideShare platform requires registration in order to download the material.


7 SOBIGDATA MASTER PROGRAM TRAINING MATERIALS

SoBigData Master Program Training Materials have been developed within the Post-Graduate Master in Big Data Analysis and Social Mining held at the University of Pisa, Italy. The Post-Graduate Master is organised by the University of Pisa, the Istituto di Informatica e Telematica of the Italian National Research Council and the Istituto di Scienza e Tecnologie dell’Informazione of the Italian National Research Council.

The SoBigData Master Program focuses on a wide range of fields ranging from data mining and machine learning to data analysis and visualisation, complex systems science and networks, computational sociology and social simulation, ethics, data journalism and story-telling.

During the first phase of the SoBigData Master Program, participants take part in lectures and hands-on activities. A series of materials has been developed specifically for participants, ranging from slides and lectures to purpose-written Python code on the topics covered by the SoBigData Master Program.

The SoBigData Master Program comprises fourteen modules:

1. Big Data Ethics

2. Big Data Sources, Crowd-sensing

3. Data Driven Innovation

4. Data Journalism & Storytelling

5. Data Management for Business Intelligence

6. Data Mining & Machine Learning

7. Data Science for Quantitative Finance

8. Data Visualization & Visual Analytics

9. High Performance & Scalable Analytics, NO-SQL Big Data Platforms

10. Information Retrieval

11. Mobility Data Analysis

12. Text Analytics & Opinion Mining

13. Social Network Analysis

14. Web Mining

The modules developed within the SoBigData Master Program aim to provide participants with technical, analytical, narrative and ethical skills. Hence, the post-graduate master focuses on a diverse set of disciplines: data mining and machine learning, data analysis and visualisation, complex systems science and networks, computational sociology and social simulation, ethics, data journalism and story-telling.

All the materials are currently available online in a dedicated environment for SoBigData Master participants. The environment is organised using Moodle, a free, open-source learning platform. Moodle (an acronym for Modular Object-Oriented Dynamic Learning Environment) is both highly scalable and customisable and provides users with a range of features, including collaborative tools such as forums, wikis, chats and blogs. Moreover, Moodle is a web-based service, so it can be accessed across different devices and web browsers.

All the SoBigData Master Program materials are currently hosted online at http://wafi.iit.cnr.it:8083/login/index.php. The Moodle platform requires registration to access the materials. From 2015 to 2017, a total of 75 participants in the SoBigData Master Program accessed the materials, together with 30 teachers and tutors.

7.1 MODULE EXAMPLE: INFORMATION RETRIEVAL

We shall now provide insight into one of the modules of the SoBigData Master Program, namely Information Retrieval. The Information Retrieval module is centred on ten lectures, ranging from an introduction to the field through parsing, crawling, complex queries, sorting and text ranking to topic annotation and clustering. A set of slides has been created specifically for each lecture, providing both insight into the main topics and examples. Figure 5, below, offers a definition of Information Retrieval.

FIGURE 5. An excerpt of the SoBigData Master Program slides regarding Information Retrieval

The SoBigData Master Program is organised around different activities. Participants take part in teaching sessions, which are lecture cycles that focus on specific fields, such as Information Retrieval. Moreover, participants are encouraged to try out methods that have been featured in lectures in hands-on and laboratory sessions.

Specifically, a series of laboratory activities has been organised for the Information Retrieval module of the SoBigData Master Program. Five sets of slides have been created, covering text analysis, text representation as word vectors, semantic annotation, web crawling, and multi-field document indexing and search with Elasticsearch, a RESTful search and analytics engine.
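To make the idea of representing text as word vectors and ranking documents against a query more concrete, the following minimal sketch builds TF-IDF vectors and scores documents by cosine similarity. It is an illustration only and is not drawn from the laboratory slides: the sample documents, the smoothed IDF formula and the function names (`tokenize`, `build_vectors`, `cosine`, `search`) are assumptions made for this example, and it does not use or reproduce Elasticsearch's own scoring.

```python
import math
import re
from collections import Counter

# Hypothetical mini-collection, used only for illustration.
DOCUMENTS = {
    "d1": "web crawling collects pages from the web",
    "d2": "text ranking orders documents by relevance to a query",
    "d3": "clustering groups similar documents together",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vectors(documents):
    """Return sparse TF-IDF vectors per document and the IDF table."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    # One possible IDF variant (an assumption of this sketch).
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, vectors, idf):
    """Rank all documents by similarity to the query, best match first."""
    query_terms = Counter(tokenize(query))
    # Terms that never occur in the collection are ignored.
    query_vector = {t: tf * idf[t] for t, tf in query_terms.items() if t in idf}
    scores = {doc_id: cosine(query_vector, vec) for doc_id, vec in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vectors, idf = build_vectors(DOCUMENTS)
results = search("ranking documents", vectors, idf)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

A production search engine adds an inverted index, per-field analysis and more refined scoring; the sketch is only meant to show the vector-space idea behind ranking.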

The slides, which are in Italian, focus on providing a hands-on approach to themes and notions introduced in the lectures and are currently hosted on Google Docs, a web-based service provided by Google, at:

• https://docs.google.com/presentation/d/1fRyPSKIXe8d4olXBNbOqrTNpBJf3R9E9xmXMzNlDVb0/edit?usp=sharing
• https://docs.google.com/presentation/d/1cRm6YG0ax1zbOlIsxv7zKtiOSROxiHMW1lxpPy9yhJ8/edit?usp=sharing
• https://docs.google.com/presentation/d/12YmKmex_LPWXRqXHmJsjVjrm_s4aDZ96zGJh39GfwWY/edit?usp=sharing
• https://docs.google.com/presentation/d/11w05pWHxYgPSVDxPcETtDVTyR3EKM90-QvUeQ89zX9Q/edit?usp=sharing
• https://docs.google.com/presentation/d/1uzc_J9NupDsB1Y8GZ3jzOM9ls-bJy4ApI1MF3Pt6UKs/edit?usp=sharing

All slides can be downloaded in a variety of formats, including .pptx, the default presentation file format for Microsoft Office PowerPoint, and .pdf (Portable Document Format).

8 CONCLUSIONS

Deliverable 4.4 provides insight into materials which have been developed by different SoBigData partners within Work Package 4. At the current stage, the aim of the deliverable is to survey and assess the different types of materials that have been developed, in order to provide a starting point for a two-step process concerning task 4.2 of Work Package 4. This two-step process will first focus on adapting existing materials to the SoBigData infrastructure, followed by the development of original material which will be fully integrated into the SoBigData research infrastructure.

Hence, in order to provide an assessment of existing materials, a survey was distributed among SoBigData partners. The survey aimed to assess existing materials and to provide insight into their distribution practices.

This deliverable describes in detail different focus areas of each SoBigData partner.

King’s College London has developed data science materials based on the R and Python programming languages. These materials have been developed using a specific R library named Swirl, which provides an interactive learning environment that guides users through learning by doing. In addition, King’s College London has developed training materials in Jupyter, a web-based learning environment which may contain code, maps, interactive elements and HTML content.

Sheffield University has developed training materials centred on GATE, an open source software focused on text processing. Training materials explore information extraction, deep learning and social media mining. Moreover, these training materials provide insight into gathering social media data, crowd sourcing, machine learning, opinion mining and an introduction to JAPE, a specially developed pattern matching language for GATE.

University of Tartu has developed training materials centred on business process monitoring, an activity which focuses on data produced during business processes in order to extract performance-related insights. Specifically, training materials provide insight into different business process monitoring practices and a hands-on approach to an open-source process mining tool named Apromore, which has a variety of features, ranging from managing large process model collections to process mining analytics.

University of Pisa, Consiglio Nazionale delle Ricerche and Scuola Normale Superiore have developed a vast body of materials for the SoBigData Master Program in Pisa, Italy. These materials have been specifically produced for the master program and focus on diverse topics within big data. Each set of materials focuses on a specific field, from big data ethics and activities that can be performed with big data to storytelling, data visualisation and presentation. Specifically, this deliverable provides the example of the Information Retrieval module. Training materials developed for this specific field include slides for a series of ten lectures and five different laboratory activities which provide a hands-on approach to themes explored during the lecture cycle.

In conclusion, deliverable 4.4 provides a first assessment of training materials produced by SoBigData partners. This constitutes the first step towards the long-term goal of developing original training materials for groups that currently lack teaching material on social big data, such as social scientists and humanities researchers.
