Deliverable 1.3

Open Data technological study

Author(s): Paul Hermans (ProXML)
Editor(s): Paul Hermans (ProXML)
Responsible Organisation: ProXML
Version-Status: V1 Final
Submission date: 30/09/2016
Dissemination level: PU

Disclaimer This project has been funded with support from the European Commission. This deliverable reflects the views only of the author, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

This project has been funded with the support of the Erasmus+ programme of the European Union. Copyright © by the ODEdu Consortium.

Deliverable factsheet
Project Number: 562604-EPP-1-2015-1-EL-EPPKA2-KA
Project Acronym: ODEdu
Project Title: Innovative Open Data Education and Training based on PBL and Learning Analytics

Title of Deliverable: D1.3 – Open Data technological study

Work package: WP1 – Stakeholders needs regarding Open Data

Due date according to contract: 30/09/2016

Editor(s): Paul Hermans (ProXML)

Contributor(s): UOM

Reviewer(s): AcrossLimits

Approved by: All Partners

Abstract: This document presents an inventory of tools available for publishing and reusing Open Data. The study carried out towards this objective took into consideration the Open Data Lifecycle and the curriculum structure designed and proposed in the previous two deliverables of the current Work Package. Furthermore, the report assesses each technology and recommends tools for each target group (persona) considered in the context of the project. The most important finding is that the majority of the Open Data Lifecycle phases can easily be covered, often with multiple valid choices, so that curriculum building can leverage this.
Keyword List: Open data tool, open data portal, data wrangling, descriptive analytics, prescriptive analytics


Consortium

Role | Name | Short Name | Country
1. Coordinator, academic partner | University of Macedonia | UOM | Greece
2. Open Data expert | Open Data Institute | ODI | UK
3. Problem Based Learning expert | Aalborg University | AAU | Denmark
4. Technology enhanced learning expert | AcrossLimits | AcrossLimits | Malta
5. Dissemination partner | Association of Information Technology Companies of Northern Greece | SEPVE | Greece
6. Open / Linked Data technologies expert | ProXML | ProXML | Belgium
7. Local Authorities partner | Linked Organisation of Local Authorities | LOLA | Belgium


Revision History

Version | Date | Revised by | Reason
0.1 | 01/08/2016 | ProXML | Initial list of open data technologies and tools
0.2 | 03/08/2016 | ProXML | Description of tools
0.3 | 17/08/2016 | ProXML | First roundup
0.4 | 22/08/2016 | ProXML | Tool parts reviewed
0.4b | 24/08/2016 | UOM | Provision of feedback
0.5 | 26/08/2016 | ProXML | Integration of feedback
0.6 | 31/08/2016 | ProXML | More evaluation wording
0.7 | 19/09/2016 | ProXML | Integration of partners’ feedback
0.8 | 19/09/2016 | ProXML | Executive summary + conclusion
1 | 26/09/2016 | ProXML | Final
1 | 30/09/2016 | UOM | Submission of deliverable

Statement of originality: This deliverable contains original unpublished work except where clearly indicated otherwise. Acknowledgement of previously published material and of the work of others has been made through appropriate citation, quotation or both.

Disclaimer This project has been funded with support from the European Commission. This deliverable reflects the views only of the author, and the Commission cannot be held responsible for any use which may be made of the information contained therein.


Table of Contents

Deliverable Factsheet
Consortium
Revision History
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Executive Summary
1 Introduction
  1.1 Scope
  1.2 Audience
  1.3 Structure
2 Methodology
3 Definitions
4 Technologies and Tools Selection
  4.1 Introduction
    4.1.1 Lifecycle and curriculum based
    4.1.2 Technological constraints
    4.1.3 License constraints
  4.2 Open Data Lifecycle
  4.3 Technologies and tools related to the Open Data Lifecycle
    4.3.1 Publish
    4.3.2 Reuse
    4.3.3 Summary
  4.4 Curriculum skeleton
  4.5 Tools related to the Curriculum skeleton
    4.5.1 Introduction
    4.5.2 Filtering out sensitive data
    4.5.3 Deduplication of similar records/rows
    4.5.4 Dashboard software
    4.5.5 Storytelling
    4.5.6 Big Data
    4.5.7 Statistical Data
    4.5.8 Linked Data
5 Tools Description
  5.1 Open Data Portal Software
    5.1.1 Positioning in the Open Data Lifecycle
    5.1.2 CKAN
    5.1.3 OpenDataSoft Open Data Solution
    5.1.4 Socrata Open Data
  5.2 Tools to assess data quality
    5.2.1 Positioning in the Open Data Lifecycle
    5.2.2 DataProofer
    5.2.3 Csv Lint
  5.3 Tools for anonymizing data
    5.3.1 ARX
    5.3.2 Support in other tools
  5.4 Tools for data wrangling
    5.4.1 Positioning in the Open Data Lifecycle
    5.4.2 Introduction
    5.4.3 OpenRefine
    5.4.4 Trifacta Wrangler
    5.4.5 Talend Data Preparation Free Edition
    5.4.6 Exploratory
    5.4.7 Dataiku Data Science Studio Free Edition
  5.5 Tools to convert between data formats
    5.5.1 Positioning in the Open Data Lifecycle
    5.5.2 OpenRefine
    5.5.3 The DataTank
  5.6 Tools to analyse and visualize the data
    5.6.1 Positioning in the Open Data Lifecycle
    5.6.2 Introduction
    5.6.3 Data Voyager
    5.6.4 Dataseed
    5.6.5 Tableau Desktop Public
    5.6.6 Exploratory
    5.6.7 Dataiku Data Science Studio
  5.7 Tools to visualize the data
    5.7.1 Positioning in the Open Data Lifecycle
    5.7.2 Introduction
    5.7.3 Google Fusion Tables
    5.7.4 Vega and Vega-Lite
    5.7.5 Plotly
    5.7.6 Quadrigram
    5.7.7 Datawrapper
    5.7.8 Raw
  5.8 Tools for building dashboards and doing storytelling
    5.8.1 Positioning in the Open Data Lifecycle
    5.8.2 Tableau Public Desktop
    5.8.3 Plotly Online Dashboards
    5.8.4 Quadrigram
  5.9 Tools for predictive analytics
    5.9.1 Positioning in the Open Data Lifecycle
    5.9.2 Predictive analytics
    5.9.3 BigML
    5.9.4 DataScienceStudio
    5.9.5 SkyTree Express single-user Desktop
    5.9.6 RapidMiner Studio
  5.10 Linked Data tooling
    5.10.1 TopBraid Composer Free
    5.10.2 fluidOps Information Workbench
6 Target Group – Tool Mapping
  6.1 Private sector employee
  6.2 Student with no coding skills
  6.3 Student with coding skills
  6.4 Public sector employee
7 Conclusion

List of Figures

Figure 1. CKAN screenshot
Figure 2. OpenDataSoft screenshot
Figure 3. Socrata Open Data screenshot
Figure 4. DataProofer screenshot
Figure 5. CSV Lint screenshot
Figure 6. ARX screenshot
Figure 7. OpenRefine screenshot
Figure 8. Trifacta Wrangler screenshot
Figure 9. Talend Data Preparation screenshot
Figure 10. Exploratory screenshot
Figure 11. Data Science Studio screenshot
Figure 12. The DataTank screenshot
Figure 13. Data Voyager screenshot
Figure 14. Dataseed screenshot
Figure 15. Tableau screenshot
Figure 16. Google Fusion Table screenshot
Figure 17. Vega screenshot
Figure 18. Plotly screenshot
Figure 19. Quadrigram screenshot
Figure 20. Datawrapper screenshot
Figure 21. Raw screenshot
Figure 22. BigML screenshot
Figure 23. SkyTree Desktop Edition screenshot
Figure 24. RapidMiner screenshot
Figure 25. TopBraid Composer screendump
Figure 26. IWB screenshot

List of Tables

Table 1. Open Data Lifecycle
Table 2. Open Data Skills
Table 3. Open Data Portal positioning
Table 4. QA positioning
Table 5. Data wrangling positioning
Table 6. Format conversion positioning
Table 7. Analysing data positioning
Table 8. Visualisation positioning
Table 9. Dashboards/storytelling positioning
Table 10. Predictive analytics positioning
Table 11. Private sector employee mapping
Table 12. Student (no coding) mapping
Table 13. Student coding mapping
Table 14. Public sector employee as publisher mapping
Table 15. Public sector employee as re-user mapping

List of Abbreviations The following table presents the acronyms used in the deliverable in alphabetical order.

Abbreviation | Description
API | Application Programming Interface
CSV | Comma Separated Values
DCAT | Data Catalog Vocabulary
DCAT-AP | Data Catalog Vocabulary Application Profile
ETL | Extract, Transform, Load
JSON | JavaScript Object Notation
MB | Megabyte
ML | Machine Learning
OS | Operating System
QA | Quality Assurance
RDF | Resource Description Framework
TSV | Tab Separated Values
XML | Extensible Markup Language


Executive Summary

The objective of ODEdu is to establish a Knowledge Alliance between academia, business and the public sector that will boost Open Data education and training. WP1 aims to identify the knowledge and skills that university students and employees of private and public organizations need to acquire in order to publish and re-use Open Data, as well as to identify the existing technologies used in the Open Data lifecycle and determine innovative and engaging ways to incorporate their usage in the learning methods’ steps.

This deliverable, D1.3 “Open Data technological study”, is the last report of WP1. Its purpose is to identify and study existing technologies that are used for Open Data publication and re-use, and to assess how well these technologies fit the project’s scope.

The research carried out shows that a wide range of tools and technologies for publishing and/or reusing open data is available. In this study we give an overview of these tools and technologies. Guidelines for this collection were the Open Data Lifecycle and the curriculum structure as defined in Deliverable 1.1 “Stakeholders Needs Regarding Open Data”. An additional concern in the collection was the absence of technological and/or licensing constraints for installing and using the tools: we only retained tools that are accessible through a modern web browser or can be installed on most PCs (Windows and Mac), and only open source tools or tools that have a community or freemium version were included.

We describe every tool using the same structure: where it fits in the Open Data Lifecycle, its target groups, its functionalities, contact info, a general assessment and some notes based on our own experience. The most important finding is that most Open Data Lifecycle phases can easily be covered, often with multiple valid choices, so that curriculum building can leverage this.

Finally, we make recommendations on which tools are most appropriate to use in every phase of the Open Data Lifecycle for each of our target groups: private sector employees, students with no coding skills, students with coding skills, and public sector employees in both possible roles, as publisher and as re-user. For example, for the publishing phase we consider it essential to acquire data cleaning skills using OpenRefine and a thorough knowledge of CKAN, the most broadly used open data portal software. For reusing data the recommendations differ depending on the profile and coding skills of the user. Recommendations for all groups are OpenRefine for data wrangling, Voyager for data understanding, Google Fusion Tables for data visualisation and BigML for predictive analytics. Sometimes better alternatives are available, e.g. Trifacta Wrangler for data wrangling and Tableau Public for visualisation in the private sector, and Dataiku’s Data Science Studio (all-encompassing) and Plotly or Vega for visualisations when people have coding skills.


1 Introduction

1.1 Scope
This document is Deliverable 1.3 “Open Data technological study” of the ODEdu project. The main objective of D1.3 is to give an overview of available tools and technologies that support and/or can be used within the Open Data Lifecycle phases as defined in D1.1.

1.2 Audience
The document is intended for:
 the project partners,
 the European Commission,
 the Open Data community in its broadest sense.

1.3 Structure
The structure of the document is as follows:
 Section 2 covers the methodology used to find and describe the tools,
 Section 3 contains definitions of terms used to describe the tools,
 Section 4 describes how the tools have been selected,
 Section 5 contains the descriptions of the selected tools,
 Section 6 maps the needs of the different target groups to the tools described,
 Section 7 concludes the study.


2 Methodology The open data related tools have been collected by:

 following ‘open’ data evangelists and data journalists on Twitter,
 reviewing the following subject groups on Flipboard1: Open Data, Data Science, Machine Learning, Statistics, Visualization,
 searching Google with the following keywords: open data tools, data visualisation tools, data wrangling tools, data blending tools, data quality tools, open data portal software, machine learning tools, descriptive analytics tools, prescriptive analytics tools, dashboard software, storytelling software.
We deliberately did not investigate the scientific literature, since our experience tells us that tools developed in an R&D context rarely survive the end of the research project. From this long list we made a selection of those tools that are frequently used, supported and maintained by a community of users or by a commercial party. A further selection has been made using the criteria outlined in Section 4: Technologies and tools selection.

1 https://flipboard.com/


3 Definitions
This section includes definitions of some of the terms that will be used throughout the deliverable to ensure comprehension.

Dashboard
A dashboard is a collection of tables, visualizations and supporting information shown in a single place so one can compare and monitor a variety of data simultaneously.

Data blending
Bringing data together from multiple sources.

Data cleaning, cleansing, scrubbing
Three names for the process of detecting and correcting (or removing) inaccurate records from a dataset, table, or database.

Data wrangling
Data wrangling refers to any data transformation required to prepare a dataset for downstream analysis, visualization or other operational consumption.

Descriptive analytics
Analytics that describe historical data by giving:
 information on the distribution of a single variable, by calculating min, max, median, quartiles, mean and standard deviation and using graphics such as histograms and box plots,
 bivariate analysis results indicating correlation and covariance, using scatterplots, etc.

ETL
ETL stands for Extract, Transform, Load:
 extract stands for retrieving data from a source,
 transform stands for converting from the source to the target format,
 load stands for bringing the data to the receiving end.

Predictive analytics
Analytics done to predict what will happen in the future based on historical knowledge. Predictive analytics uses techniques from data mining, statistics, modeling, machine learning and artificial intelligence to analyze current data and make predictions about the future.

Storytelling
Data storytelling is the process of translating data analyses into layman's terms. Stories serve to show how facts are connected; they provide context, demonstrate how inputs relate to outcomes, and help build a case.
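To make the descriptive analytics terms above concrete, the following is a minimal sketch using pandas and matplotlib; the file name and column names are hypothetical.

```python
# Minimal descriptive analytics sketch (hypothetical file and columns).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("air_quality.csv")            # hypothetical dataset

# Univariate profile of a single variable: min, max, quartiles, mean, std
print(df["pm10"].describe())                    # count, mean, std, min, 25%, 50%, 75%, max

# Distribution graphics: histogram and box plot
df["pm10"].plot.hist(bins=30)
df.boxplot(column="pm10")
plt.show()

# Bivariate analysis: correlation, covariance and a scatter plot
print(df[["pm10", "temperature"]].corr())
print(df[["pm10", "temperature"]].cov())
df.plot.scatter(x="temperature", y="pm10")
plt.show()
```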


4 Technologies and tools selection

4.1 Introduction

4.1.1 Lifecycle and curriculum based
In D1.1 “Stakeholders Needs Regarding Open Data” we described the Open Data Lifecycle. On the one hand, it covers the tasks that need to be performed to publish data as Open Data; once the data are published, they can be reused by data consumers (entrepreneurs, students, researchers, citizens). Based on this lifecycle, we listed the Open Data related knowledge, skills and attitudes that are of relevance for university students, private sector employees and public sector employees. This resulted in a proposed curriculum structure consisting of units of learning for obtaining the related knowledge and skills. For carrying out some tasks within the lifecycle, supporting technologies and tools have been developed, and the skills to work with these tools need to be acquired. In this chapter we enumerate the existing technologies and tools according to both the lifecycle and the curriculum. The concrete description of the technologies and tools is given in the subsequent chapters.

4.1.2 Technological constraints
We have only included products/tools that are available:
 as a web service accessible by modern web browsers, or
 as a desktop application installable on Windows and macOS.
As an example, Qlik Sense Desktop, a free visualisation tool, was not retained in our list since it is only available on Windows. The reason is that we would like to reach as many students as possible without having to go through long and difficult installation procedures.

4.1.3 License constraints
We only take into account tools that are available as open source or in a free or community version/edition. Again, we do not want to exclude students due to licensing or financial constraints.


4.2 Open Data Lifecycle
Below is an overview of the Open Data Lifecycle, divided into two major phases:
 publish the data,
 reuse the data.
The Open Data Lifecycle used here is taken from D1.1 “Stakeholders Needs Regarding Open Data” and is based on an extensive literature review combined with our own experience in teaching “Open Data” content. The publish phase is further divided into stages involving collecting the data, preparing the data, publishing the data, and making sure the data can be found and discovered, while maintaining the whole publishing process. The reuse phase contains the steps of obtaining the data, followed by scrubbing the data into the form and format needed for one's own purposes; the data are then explored and visualised to gain insight and, if needed, used to make predictions. The result of this process can be communicated with dashboards and via storytelling.

Table 1. Open Data Lifecycle

Publish:
 Collect
 Prepare: QA and cleanse; transform to format
 Assign metadata: dataset level (including open license); data level
 Publish: bulk; API; portal
 Make discoverable
 Maintain

Re-use:
 Obtain: portal navigation and search; download / via API
 Scrub: cleanse (remove, change data types, handle missing data); transform; enhance, blend
 Explore: visualize (tables, graphs, plots); derive stats
 Model: explain, predict
 Interpret and communicate: storytelling


4.3 Technologies and tools related to the Open Data Lifecycle

4.3.1 Publish

4.3.1.1 Description
An Open Data publisher needs to prepare the datasets. One of the most cited reasons not to publish data as Open Data is uncertainty about the quality of the data. Hence, it is of utmost importance that publishers can evaluate the quality of their data and act upon this assessment accordingly. We therefore need to offer tools that:
1. allow users to detect potential issues in the data,
2. offer functionalities to remedy the detected problems.
Once the data are quality assured, they potentially need to be converted into other distribution formats (CSV, JSON, XML, RDF, etc.). Once the distribution formats are available, they need to be made available online, potentially using an open data portal. When the files are published, one must make sure that they are easily discovered. Adding them to open data portal catalogues by assigning descriptive metadata will surely help. A standard for doing this is DCAT2 and, more specifically in a European context, the DCAT-AP application profile3. When people retrieve a dataset they should get guidance on what the data themselves contain. A standard that is relevant in this context is the W3C Metadata Vocabulary for Tabular Data4.
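As an illustration of dataset-level metadata, the sketch below builds a minimal DCAT description with the rdflib Python library (assuming a recent rdflib version that ships the DCAT and DCTERMS namespaces); all URIs, titles and file locations are hypothetical, and a real catalogue entry would follow DCAT-AP more completely.

```python
# Minimal DCAT dataset description sketch using rdflib (hypothetical URIs/values).
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
dataset = URIRef("http://example.org/dataset/air-quality-2016")
distribution = URIRef("http://example.org/dataset/air-quality-2016/csv")

# dcat:Dataset with a title, an open license and one distribution
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Air quality measurements 2016", lang="en")))
g.add((dataset, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))
g.add((dataset, DCAT.distribution, distribution))

# dcat:Distribution pointing to the downloadable CSV file
g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.downloadURL, URIRef("http://example.org/files/air-quality-2016.csv")))
g.add((distribution, DCTERMS.format, Literal("text/csv")))

print(g.serialize(format="turtle"))
```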

4.3.1.2 Categories of tools needed
A data publisher can be supported by tools to:
 assess the data quality,
 enhance the data quality,
 convert the source data into multiple distribution formats,
 publish the datasets, including dataset metadata and metadata on the contained data themselves, using open data portal software.

2 https://www.w3.org/TR/vocab-dcat/
3 https://joinup.ec.europa.eu/asset/dcat_application_profile/asset_release/dcat-ap-v11
4 https://www.w3.org/TR/tabular-data-model/

4.3.1.3 Tools to assess Data Quality
The following list includes available tools to assess data quality:
 DataProofer
 Csv Lint
 OpenRefine
 Trifacta Wrangler
 Talend Data Preparation Free Desktop
 Exploratory
 DataScienceStudio

4.3.1.4 Tools to enhance Data Quality
The following list includes available tools to enhance data quality:
 OpenRefine
 Trifacta Wrangler
 Talend Data Preparation Free Desktop
 Exploratory
 DataScienceStudio

4.3.1.5 Tools to convert to other distribution formats
The following list includes available tools that convert data to other distribution formats:
 OpenRefine
 The DataTank

4.3.1.6 Open Data Portal software
The following list includes available Open Data Portal software:
 CKAN/DKAN
 OpenDataSoft
 Socrata

4.3.2 Reuse

4.3.2.1 Description
An Open Data re-user first of all needs to discover Open Data of interest: he needs to find his way around Open Data portals, navigating and searching for datasets of potential interest and, when found, downloading them. Once a dataset is obtained, the re-user can be supported by tools to:
 assess the data (quality),
 improve the data for his own needs, potentially blending it with other datasets, or
 convert the dataset to a format more appropriate to his own workflow and processes.
Once the data are in the required shape, the user will want to explore the data by calculating statistics and visualizing the data. Potentially, the data re-user wants to make predictions based on the data; he then needs tools to model the data for this purpose and to build prediction models. The following sections present lists of tools that support Open Data reuse for the aforementioned purposes.

4.3.2.2 Categories of tools needed
A data re-user can be supported by tools to:
 find relevant datasets by using open data portals,
 assess, cleanse and enhance (by blending) the data,
 convert the data into a more appropriate format,
 explore and visualise the data,
 model and make predictions based on the data.

4.3.2.3 Open Data Portal software
The following list includes available Open Data Portal software:
 CKAN
 OpenDataSoft
 Socrata

4.3.2.4 Tools to assess, cleanse and enhance (by blending) datasets
The following list includes available tools that support dataset assessment, cleansing and blending:
 OpenRefine
 Trifacta Wrangler
 Talend Data Preparation Free Desktop
 Exploratory
 DataScienceStudio Community Edition

4.3.2.5 Tools to convert to other distribution formats
 OpenRefine

4.3.2.6 Tools to explore the data
The following list includes available tools for data exploration. To explore the data, they also offer several visualisation features.
 Exploratory
 Tableau Public
 Data Voyager
 DataScienceStudio

4.3.2.7 Tools to visualize data
The following list includes available tools specifically aimed at making visualisations:
 Raw
 Vega
 Google Fusion Tables
 Datawrapper
 Dataseed
 Plotly
 Quadrigram

4.3.2.8 Tools to model and predict
The following list includes available tools that support modelling Open Data and performing predictive analysis:
 BigML
 Dataiku Data Science Studio Community Edition
 SkyTree
 RapidMiner

4.3.3 Summary
The research carried out showed that there are many tools of interest that can be used from both the publishing and the re-use perspective. These are the tools addressing data quality, data cleansing, data enhancement and blending, data format conversion and data visualisation. From the reuse side, powerful tools are needed for data exploration, visualisation and prediction. Open Data portals are of interest for both roles, but the relevant functionalities differ depending on the perspective.

4.4 Curriculum skeleton
The table below shows the collected knowledge and skill needs in the field of Open Data, as concluded in the D1.1 report of the project.

Table 2. Open Data Skills

OBTAINING DATA
 Where to find OD: Identify existing Open Data portals (local, regional, country level) (K); Explain how datasets have been tagged with metadata (K); Search an Open Data portal (S); Navigate an Open Data portal (S)
 Which data to use: Identify data of interest (K)
 How to download and explore a dataset: Download and explore a dataset (S)
 How to blend Open Data with other data: Identify suitable data for blending (K); Knowledge of tools that allow data blending (Alteryx, OpenRefine, …) (K); Handle interoperability issues and use existing standards (S)
 How to convert from one format to another: Convert data files into useful formats (S)
 What about licenses?: Verify that the license used allows for the re-use the user wants (S)
 How to retrieve data from data stores: Create queries that fetch data with minimum loading effort (S); Create queries that retrieve data from multiple data stores (S); Describe complex querying concepts (K)
 What are the common OD tools?: Identify available tools to obtain data (K)

SCRUBBING DATA
 How to filter out sensitive data: Employ anonymizing skills (S)
 How can I check the quality of my data: Discover what the original source of the data is (S); Reproduce evaluation criteria for good quality of data (K); Determine the quality level of Open Data (S); Untangle data (S); Identify if my data needs cleaning (S); Deal with unclear/incomplete data (S)
 Understand the validity of Open Data: Identify how correct the information in the datasets is (S); Discover where potential problems are, if any (S)
 How to clean data: Explain how to clean data (K); Perform cleaning methods (S); Understand the challenges of messy data (K)
 How to handle redundancy in data collection: Perform record deduplication and/or record merging (S)
 What are the common OD tools?: Identify available tools to clean my data (K)

EXPLORE DATA
 How to analyse data: Use analytics tools (S); Identify the proper analysis method to perform based on my objective (S/K); Perform practical data analysis (S); Identify different technologies that allow data analysis (K); Decide what type of analysis is more suitable for my data (S/K)
 How to exploit statistical Open Data: Search for statistical Open Data (S); Perform analytics on statistical data (S)
 What are the common OD tools?: Locate and use practical tools (K/S)
 How to handle a lot of data: Work with big data sets (S)

VISUALIZE DATA
 How to create a visualization: Identify available technologies for creating visualizations (K); Use visualization software(s) (S); Apply simple ranking of data, proportions and distributions etc. (S)
 Which visualization is best for the type of data/user/problem addressed: Identify the visualization best suited to a given data set (K/S); Navigate to a service that allows me to view datasets (S); Identify each type of visualization's added value and purpose (K)
 How to create visualizations with geographical data (maps): Identify methods / technologies required for map visualizations with Open Data (K)
 How to work with dashboards: Create a dashboard (S)
 What are the common OD tools?: Identify the OD tools for visualization (K)

MODEL DATA
 How to create predictions on important issues (e.g. pollution levels, radiation measurements etc.): Identify different data analytics methods for creating predictions (K); Explain machine learning and other analytics methods (K); Identify tools that can perform predictions through Open Data (K); Develop prediction services / applications with Open Data (S); Recall existing success stories (K)
 How to create visualizations that are meaningful (e.g. predictions, statistical analysis etc.): Identify methods and tools that create predictions with complex visualizations (K)
 What are the common OD tools?: Identify the OD tools for modelling (K)

INTERPRET DATA
 How to understand the data: Interpret data (S); Explain what data means and its value (K); Explain how data has been structured and organised (K)
 How to interpret findings: Interpret findings (S)
 How to validate output: Validate the output / results of e.g. OD visualizations (S)
 What are the common OD tools?: Identify the OD tools for interpreting (K)

PRESENT DATA
 How to present data: Present data (S); Determine what statistical methods to use for specific situations (K/S); Pick a visualization that cleanly and powerfully tells a story from the data (S); Perform information graphics (S)
 How to do storytelling based on my data: Link my story with data visualization (S)

4.5 Tools related to the Curriculum skeleton

4.5.1 Introduction
We use the curriculum skeleton to find out which specific tool-related skills are mentioned that are not yet addressed by the tools enumerated on the basis of the Open Data Lifecycle analysis. Tool-related skills not mentioned previously are:
 how to filter out sensitive data,
 how to remove duplicate entries/records (which can, however, be considered a specific function of the cleansing process),
 how to create dashboards,
 how to do storytelling with the data.
More specialised skills have to do with working with:
 Big Data,
 Statistical Data,
 Linked Data.

4.5.2 Filtering out sensitive data
Tools in this area are:
 ARX
 Data anonymizer plugin in Dataiku DSS
 Talend Data Preparation Desktop

4.5.3 Deduplication of similar records/rows
The following list includes available tools for deduplication:
 Trifacta Wrangler
 DataScienceStudio
 OpenRefine

4.5.4 Dashboard software Tools for building dashboards are:

 Tableau Public

 Plotly

4.5.5 Storytelling
Tools with storytelling functionalities are:
 Tableau Public
 Quadrigram

4.5.6 Big Data
We found no free software for handling big data that is aimed at non-programmers.

4.5.7 Statistical Data
These tools can be used for handling multi-dimensional statistical data:
 Exploratory
 Tableau Public

4.5.8 Linked Data
 fluidOps IWB
 TopBraid Composer Free Edition


5 Tools Description
This chapter presents the tools mentioned in the previous chapter and provides an overall overview of each tool along with its functionalities, technical characteristics and an assessment of whether and where it can fit in the curriculum. This assessment is done in a structured way for the open data portal software and for the tools that cover several phases of the Open Data Lifecycle. Tools that mainly focus on one functionality (e.g. anonymizing records) receive a more condensed treatment.

5.1 Open Data Portal Software

5.1.1 Positioning in the Open Data Lifecycle

Table 3. Open Data Portal positioning

Publish:
 Collect
 Prepare: QA and cleanse; transform to format
 Assign metadata: dataset level (including open license); data level
 Publish: bulk; API; portal
 Make discoverable
 Maintain

Re-use:
 Obtain: portal navigation and search; download / via API
 Scrub: cleanse (remove, change data types, handle missing data); transform; enhance, blend
 Explore: visualize (tables, graphs, plots); derive stats
 Model: explain, predict
 Interpret and communicate: storytelling


5.1.2 CKAN

5.1.2.1 What is CKAN stands for Comprehensive Knowledge Archive Network. It is a dataset cataloguing solution that aims to make data accessible by providing tools to publish, share, find and use datasets. An Open Data catalog lists datasets on the Web. Data catalogs are like directories (remember Yahoo). They know what open data exists, what it is about, where it is and how to get hold of it.

5.1.2.2 What it looks like
Figure 1 shows the CKAN environment as it is available to users online.

Figure 1. CKAN screenshot

5.1.2.3 Developer/maintainer Open Knowledge Foundation5.

5 https://okfn.org/about/


5.1.2.4 Used by
Many governments and organisations. The most visible ones are data.gov.uk, data.overheid.nl, europeandataportal.eu and data.gov, as well as the Open Data portals of the cities of Amsterdam, Copenhagen and Berlin.

5.1.2.5 Functionalities
Publishing functionalities include the ability to:
 enter metadata on datasets via a website form, API or bulk spreadsheet import,
 harvest datasets from other portals: CSW servers, other CKAN instances, ArcGIS, …,
 publish these metadata publicly or privately to authorized organisations,
 optionally store the data themselves in a data store,
 theme the look and feel of the portal to reflect one's own identity,
 extend the portal with additional features; more than 60 extensions are available.
Reuse functionalities include the ability to:
 search and discover datasets; the system offers search on metadata, full-text search, fuzzy matching and faceted search, as well as geospatial search and discovery,
 broadcast data to social media; there is an integration with Twitter, Facebook and Google+,
 get updates on dataset changes using an RSS/Atom feed,
 visualize the data managed in the data store as a table, graphic, map and/or image, depending on the nature and content of the data,
 investigate the history of edits and versions of a dataset,
 exploit the metadata and data via APIs (see the sketch below).
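The following is a minimal sketch of the CKAN Action API mentioned in the last bullet, run against the public demo instance referred to later in this section; the search term and the number of rows are arbitrary.

```python
# Minimal CKAN Action API sketch: search datasets on the public demo instance.
import requests

resp = requests.get(
    "https://demo.ckan.org/api/3/action/package_search",
    params={"q": "education", "rows": 5},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()["result"]

print("datasets found:", result["count"])
for pkg in result["results"]:
    # every package carries its metadata, including the available distributions
    print(pkg["title"], "-", [r.get("format") for r in pkg.get("resources", [])])
```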

5.1.2.6 Technical constraints
The software can easily be installed on 64-bit Ubuntu. It can also be built from source for other Unix flavours. There are dependencies on Python, Java, PostgreSQL, Solr and other technologies.

5.1.2.7 License
Open source, i.e. the GNU Affero GPL v3.0.

5.1.2.8 Evaluation

5.1.2.8.1 Context


Publish: + | Reuse: +

CKAN's aim is to publish metadata of the datasets, so that these can be found and reused.

5.1.2.8.2 Target groups

Private Sector Employee: + | Public Sector Employee: + | Student with coding skills: + | Student without coding skills: +

It is an important tool for all target groups, e.g. for public sector employees to learn how to publish datasets, and for all the others as the means to find and reuse datasets.

5.1.2.8.3 Functionalities

Functionality | Available
Dataset publishing | +
Data publishing | +
Dataset discovery | +
Data download | +
Dataset API | +
Data API | If data loaded
Data Quality Assessment | -
Data Cleansing | -
Data Blending/Enrichment | -
Data format conversion | -
Descriptive analytics | -
Visualisation | If data loaded
Dashboarding | -
Prescriptive analytics | -
Storytelling / Notebooks | -

CKAN's main purpose is to publish datasets and their metadata so that they can be easily found.


Support | Available
Big Data | ±
Multidimensional data (cubes) | ±
Linked Data | ± (via add-ons)

All types of dataset distributions are supported. However, it is unclear whether other functionalities such as visualisations will work with big and multidimensional data. Using add-ons, one is able to interrogate the system via a SPARQL endpoint.

5.1.2.8.4 Assessment

Aspect | Assessment
Ease of installation | -
Ease of use | ±
User centric versus technology/science centric | Mix

Installation needs to be done by ICT-knowledgeable people. Many users decide to put another system (e.g. Drupal) in front to improve the user experience.

5.1.2.8.5 Notes
Since this portal software is the most broadly used, it must be included in our curriculum. For training the publishing skills, it would be good to have our own instance of CKAN running, where publishers can test and exercise the publishing-related functionalities. Another possibility is to use the public CKAN demo site at http://demo.ckan.org/. For reuse training we recommend using http://www.europeandataportal.eu/, since it is a reference implementation managing more than 580,000 datasets from all over Europe.

5.1.2.9 Contact Website: http://ckan.org/ Download: http://ckan.org/developers/docs-and-download/


5.1.3 OpenDataSoft Open Data Solution

5.1.3.1 What is
OpenDataSoft is a platform that has been specifically designed for non-technical business users to share, publish and reuse structured data. It is more than a data catalog solution, because the platform also manages the data themselves, leading to additional functionalities such as showing the data as tables, maps and graphics, converting them to different output formats, and offering APIs for app developers.

5.1.3.2 What it looks like
Figure 2 shows the OpenDataSoft environment as it is available to users online.

Figure 2. OpenDataSoft screenshot

5.1.3.3 Developer/maintainer OpenDataSoft6.

6 https://www.opendatasoft.com/


5.1.3.4 Used by National, regional and local administrations. Companies and organisations in the domains of transport, energy & environment, agriculture, chemical, tourism, media.

5.1.3.5 Functionalities
Publishing functionalities include the ability to:
 load the data into the data store; during loading the data can be pre-processed, optimised, enriched and configured for optimal visualization,
 enter metadata on the dataset via a website form and API,
 harvest data from external APIs,
 publish the data in other formats than the uploaded one,
 publish the data publicly or privately according to access control rules based on users, groups and roles management,
 monitor the use of the datasets.
Reuse functionalities include the ability to:
 search and discover datasets; the system offers search on metadata, full-text search, fuzzy matching and faceted search, as well as search based on geographical coordinates,
 filter data within the dataset,
 visualize data as tables, maps, graphics, calendars and images, depending on the type and content of the data,
 download data in the chosen format,
 subscribe to a dataset,
 broadcast to social media such as Twitter, LinkedIn, Facebook and Google+,
 comment on datasets and data and post reuse proposals,
 exploit the datasets and data via API.

5.1.3.6 Technical constraints The software is offered as a service (SaaS). Its pay-as-you-use subscription fee is based on data volume, usage (number of UI/API queries) and SLAs.

5.1.3.7 License Commercial.


5.1.3.8 Evaluation

5.1.3.8.1 Context

Publish: + | Reuse: +

OpenDataSoft publishes both the metadata of the datasets and the data themselves. They can be found and reused, and the data can already be explored on the system itself without needing to download them.

5.1.3.8.2 Target groups

Private Sector Employee: + | Public Sector Employee: + | Student with coding skills: + | Student without coding skills: +

It is aimed at all target groups, e.g. at public sector employees to learn how to publish datasets and the data, and at all the others as the means to find, investigate and reuse datasets.

5.1.3.8.3 Functionalities

Functionality | Available
Dataset publishing | +
Data publishing | +
Dataset discovery | +
Data download | +
Dataset API | +
Data API | +
Data Quality Assessment | ±
Data Cleansing | +
Data Blending/Enrichment | +
Data format conversion | +
Descriptive analytics | -
Visualisation | +
Dashboarding | -
Prescriptive analytics | -
Storytelling / Notebooks | -

In addition to the pure data cataloguing features, OpenDataSoft offers additional capabilities to:
- clean and enhance the data,
- visualize the data,
- convert the data,
- make them accessible via API.
In this sense it offers many more functionalities than purer cataloguing solutions such as CKAN.

Support | Available
Big Data | -
Multidimensional data (cubes) | -
Linked Data | -

No indication of support found.

5.1.3.8.4 Assessment

Aspect | Assessment
Ease of installation | SaaS
Ease of use | +
User centric versus technology/science centric | User

Good emphasis on ease of use.

5.1.3.8.5 Notes OpenDataSoft is a very interesting product since it is more than a dataset catalogue. It adds:

 data cleaning, enhancement and transformation functions

 data visualization
 data conversion to different formats
 an elaborate data API.
It is possible to use the service for testing in a free mode.

5.1.3.9 Contact Website: https://www.opendatasoft.com/open-data-solutions/

5.1.4 Socrata Open Data

5.1.4.1 What is
Socrata Open Data is a cloud-based solution for managing and publishing data, offering much more than cataloguing services.

5.1.4.2 What it looks like
Figure 3 shows the Socrata environment as it is available online.

Figure 3. Socrata Open Data screenshot


5.1.4.3 Developer/maintainer Socrata7.

5.1.4.4 Used by City of New York, Melbourne, Bath (UK), Chicago, San Francisco, New Orleans, Boston, Las Vegas, Dallas

5.1.4.5 Functionalities
Publishing functionalities include the ability to:
 load the data into the data store; during loading the data can be pre-processed and configured,
 sync data with the master dataset,
 enter metadata on the dataset,
 edit the data and create snapshots,
 create additional views and visualisations,
 publish the data in other formats than the uploaded one,
 publish the data publicly or privately.
Reuse functionalities include the ability to:
 search and discover datasets by search and metadata filtering,
 filter data within the dataset,
 explore and create additional views (graphics, maps, calendars, dashboards),
 download data in the chosen format (including RDF),
 discuss datasets,
 exploit the datasets and data via API (see the sketch below),
 expose the data as OData for easier integration in the Microsoft ecosystem.
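As an illustration of the API bullet above, the following is a minimal sketch of retrieving rows from a Socrata-hosted dataset through its SODA endpoint; the portal domain and the dataset identifier are hypothetical placeholders.

```python
# Minimal SODA API sketch: read a few rows from a Socrata dataset.
import requests

domain = "data.example.gov"     # assumption: any Socrata-hosted portal domain
dataset_id = "xxxx-xxxx"        # placeholder for a real 4x4 dataset identifier

resp = requests.get(
    f"https://{domain}/resource/{dataset_id}.json",
    params={"$limit": 10},      # SoQL paging parameter
    timeout=30,
)
resp.raise_for_status()
rows = resp.json()              # list of records as JSON objects
print(len(rows), "rows retrieved")
```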

5.1.4.6 Technical constraints The software is offered as a service (SaaS). Pricing is unclear.

7 https://socrata.com/


5.1.4.7 License Commercial.

5.1.4.8 Evaluation

5.1.4.8.1 Context

Publish: + | Reuse: +

A tool for data publishers, so that others can find and reuse the data.

5.1.4.8.2 Target groups

Private Sector Employee: + | Public Sector Employee: + | Student with coding skills: + | Student without coding skills: +

It is relevant for all target groups, e.g. for public sector employees to learn how to publish datasets and the data themselves, and for all the others as the means to find, investigate and reuse datasets.

5.1.4.8.3 Functionalities

Functionality | Available
Dataset publishing | +
Data publishing | +
Dataset discovery | +
Data download | +
Dataset API | +
Data API | +
Data Quality Assessment | -
Data Cleansing | -
Data Blending/Enrichment | -
Data format conversion | +
Descriptive analytics | -
Visualisation | +
Dashboarding | -
Prescriptive analytics | -
Storytelling / Notebooks | -

Similar to OpenDataSoft, Socrata offers much more than a pure cataloguing solution.

Support | Available
Big Data | -
Multidimensional data (cubes) | -
Linked Data | + (as export format)

Socrata allows mapping the data into the RDF data model, so that the data can be used in Linked Data applications.

5.1.4.8.4 Assessment

Aspect | Assessment
Ease of installation | SaaS
Ease of use | +
User centric versus technology/science centric | User

5.1.4.8.5 Notes
Feature-wise this is very similar to the OpenDataSoft offering, which is a European product and much more widely used in Europe. We therefore advise including OpenDataSoft in the training rather than Socrata.

5.1.4.9 Contact Website: https://socrata.com/products/open-data/


5.2 Tools to assess data quality

5.2.1 Positioning in the Open Data Lifecycle

Table 4. QA positioning

Publish:
 Collect
 Prepare: QA and cleanse; transform to format
 Assign metadata: dataset level (including open license); data level
 Publish: bulk; API; portal
 Make discoverable
 Maintain

Re-use:
 Obtain: portal navigation and search; download / via API
 Scrub: cleanse (remove, change data types, handle missing data); transform; enhance, blend
 Explore: visualize (tables, graphs, plots); derive stats
 Model: explain, predict
 Interpret and communicate: storytelling

5.2.2 DataProofer

5.2.2.1 What is
Dataproofer is a cross-platform desktop app that runs a collection of tests over a data file supplied in xlsx, xls, CSV, TSV or PSV format.

5.2.2.2 What it looks like
Figure 4 shows the DataProofer platform.


Figure 4. DataProofer screenshot

5.2.2.3 Developer/maintainer Knight Foundation8 and Vocativ9.

5.2.2.4 Used by Vocativ.

5.2.2.5 Functionalities
Dataproofer indicates which of a series of tests passed. By default, it loads 15 tests that check for:
 string and numeric cells,
 empty and duplicate cells,
 outliers relative to the mean and median,
 incorrect geographical coordinates.
The set of tests can be extended.

8 http://www.knightfoundation.org/ 9 http://www.vocativ.com/pages/about/


5.2.2.6 Technical constraints None. Based on web technology.

5.2.2.7 License Open Source, GNU General Public License

5.2.2.8 Evaluation A very handy tool for getting a first indication of data quality for the most widely used tabular formats.

5.2.2.9 Contact Website: http://dataproofer.org/ Download: https://github.com/dataproofer/Dataproofer/releases

5.2.3 Csv Lint

5.2.3.1 What is
Csv Lint is an online service to verify the quality of CSV files. It can also be installed locally as a command-line application.

5.2.3.2 What it looks like
Figure 5 shows the Csv Lint environment.


Figure 5. CSV Lint screenshot

5.2.3.3 Developer/maintainer ODI10

5.2.3.4 Used by
Many individuals and organisations, as indicated by the log.

5.2.3.5 Functionalities
The software checks for common errors and warnings. In addition to the default list of checks, one can supply a table schema in JSON that declares additional constraints for data fields; for example, a field is required, a field value needs to be unique or needs to have a minimum or maximum value, or a value needs to follow a certain pattern. A report is generated enumerating the errors and warnings, if any.
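As an illustration, the following is a minimal sketch of such a schema expressed as a Python dictionary and saved as JSON; the field names are hypothetical and the exact constraint vocabulary should be checked against the Csv Lint documentation.

```python
# Minimal JSON table schema sketch with constraints (hypothetical fields).
import json

schema = {
    "fields": [
        {"name": "id",
         "constraints": {"required": True, "unique": True}},
        {"name": "year",
         "constraints": {"required": True, "minimum": "2000", "maximum": "2016"}},
        {"name": "country_code",
         "constraints": {"pattern": "^[A-Z]{2}$"}},
    ]
}

with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)  # reference this file when validating the CSV
```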

5.2.3.6 Technical constraints Runs as a service on the web or as a command-line tool.

10 http://theodi.org/


The submitted file may not be larger than 700 MB.

5.2.3.7 License Open Source MIT.

5.2.3.8 Evaluation Handy and extensible tool for CSV format, which is the most used and preferred open tabular data format.

5.2.3.9 Contact Website: http://csvlint.io/ Download: https://github.com/theodi/csvlint.rb

5.3 Tools for anonymizing data

5.3.1 ARX

5.3.1.1 What is
ARX is open source software for anonymizing sensitive personal data.


5.3.1.2 What it looks like
Figure 6 shows the ARX environment.

Figure 6. ARX screenshot

5.3.1.3 Developer/maintainer
Technische Universität München.

5.3.1.4 Used by Hundreds of users; the algorithms are also integrated and used in the Weka11 data mining tool.

5.3.1.5 Functionalities ARX comes with a cross-platform graphical tool, which supports data import and cleansing, wizards for creating transformation rules, ways for tailoring the anonymized dataset to the requirements and visualizations of data utility and risks. ARX is also available as a software library with an API that delivers data anonymization capabilities to any Java program.

11 http://www.cs.waikato.ac.nz/ml/weka/


ARX reads SQL databases, MS Excel and CSV files.

5.3.1.6 Technical constraints
The GUI is available on Windows, Mac and Linux.

5.3.1.7 License Open source (Apache License, Version 2.0)

5.3.1.8 Evaluation Very powerful environment but comes with a learning curve.

5.3.1.9 Contact Website: http://arx.deidentifier.org/ Download: http://arx.deidentifier.org/downloads/

5.3.2 Support in other tools The following tools offer data anonymization functionalities:

 RapidMiner Studio (Section 5.9.6)
 Data Science Studio with the freely available Data anonymizer plugin (Section 5.4.7)
 Talend Data Preparation (Section 5.4.5)
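To illustrate the basic idea of filtering out sensitive data, the following is a naive pandas sketch that removes direct identifiers and pseudonymises a key column before publication; it is not a substitute for the risk-based anonymisation (e.g. k-anonymity) that ARX performs, and the file and column names are hypothetical.

```python
# Naive illustration of removing/pseudonymising sensitive columns (hypothetical data).
import hashlib
import pandas as pd

df = pd.read_csv("patients.csv")                     # hypothetical source file

df = df.drop(columns=["name", "phone", "address"])   # remove direct identifiers

# replace the identifier by a salted one-way hash so rows stay linkable
SALT = "replace-with-a-secret-salt"
df["patient_id"] = df["patient_id"].apply(
    lambda v: hashlib.sha256((SALT + str(v)).encode()).hexdigest()[:16]
)

df.to_csv("patients_anonymised.csv", index=False)
```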

5.4 Tools for data wrangling

5.4.1 Positioning in the Open Data Lifecycle

Table 5. Data wrangling positioning

Publish:
 Collect
 Prepare: QA and cleanse; transform to format
 Assign metadata: dataset level (including open license); data level
 Publish: bulk; API; portal
 Make discoverable
 Maintain

Re-use:
 Obtain: portal navigation and search; download / via API
 Scrub: cleanse (remove, change data types, handle missing data); transform; enhance, blend
 Explore: visualize (tables, graphs, plots); derive stats
 Model: explain, predict
 Interpret and communicate: storytelling

5.4.2 Introduction
Data wrangling refers to the data transformations needed to prepare a dataset for visualization, downstream analysis or operational consumption. This activity consists of an iterative cycle of transformation and profiling. Profiling provides descriptive analytics on what is in the dataset and its fields/columns, enabling users to decide whether and how to transform the data and to evaluate whether an applied transformation achieved the sought-after effect. Two types of profiling are normally offered:
 type-based profiling, which indicates the intended type of a column (integer, URL, etc.) and the percentage of field values that satisfy the type and related constraints,
 distributional profiling, which enables the detection of values that are too far away from what can be expected.
Transformation is about the restructuring, cleaning and enrichment of a dataset to make it suited for further downstream processing. Restructuring refers to actions that change the structure of the dataset by splitting, collapsing or deleting columns. Enriching refers to the addition of fields that add new information to the dataset by computing over data already in the dataset; some examples are adding a percentage next to counts, or adding a sentiment analysis score next to a description field. Cleaning focuses on the values in the fields and makes sure that the values are valid according to certain constraints; some examples: a value needs to be a positive integer between 3 and 12, or a value needs to be one of the EU countries.
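The following is a minimal pandas sketch of the profile-then-transform cycle described above; the file and column names are hypothetical.

```python
# Minimal data wrangling sketch: profile, restructure, clean, enrich (hypothetical data).
import pandas as pd

df = pd.read_csv("inspections.csv")

# Profiling: inferred types, share of missing values, distribution of a numeric column
print(df.dtypes)
print(df.isna().mean())                         # fraction of missing values per column
print(df["score"].describe())                   # spot distributional outliers

# Restructuring: split a combined column (assumed "city;postcode"), drop the original
df[["city", "postcode"]] = df["location"].str.split(";", expand=True)
df = df.drop(columns=["location"])

# Cleaning: enforce types and constraints, deduplicate records
df["score"] = pd.to_numeric(df["score"], errors="coerce")
df = df[df["score"].between(0, 100)]
df = df.drop_duplicates()

# Enriching: derive a new field from data already in the dataset
df["pass"] = df["score"] >= 60

df.to_csv("inspections_clean.csv", index=False)
```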


5.4.3 OpenRefine

5.4.3.1 What is OpenRefine is a tool for working with messy data: cleaning it; transforming it from one format into another; and extending it via web services and external data.

5.4.3.2 What it looks like
Figure 7 shows the OpenRefine environment as it appears in the browser.

Figure 7. OpenRefine screenshot

5.4.3.3 Developer/maintainer
It was originally developed by Google for adding structured and cleaned data to Freebase, described as an "open, shared database of the world's knowledge". On October 2nd, 2012, Google stopped supporting the project, which has since been taken over by a group of volunteers.


5.4.3.4 Used by
This is the preferred tool in the Open Data space, being used in many open data curricula. Some examples: pILOD Netherlands, Open Data Day Flanders, ODI, TU Delft, Cooper-Hewitt National Design Museum, LSU Libraries, University of Texas.

 Librarians: DST4L – LODLAM

 Journalists: NYT, Chicago Tribune, Le Monde, The Guardian

 Open Data Communities: Sunlight Foundation, OKFN

 Educational tool: School of Data

5.4.3.5 Functionalities
With OpenRefine one can import the following data formats: delimited text files, fixed width, JSON, XML, ODS spreadsheets, Excel, RDF. Profiling functionalities are available but not shown automatically; the user needs to choose which columns/fields he wants to investigate. Per chosen facet the distribution of values (including NAs) is shown. Very elaborate clustering algorithms are available for merging entries with different spellings into a canonical one. There is a dedicated transformation language, GREL (General Refine Expression Language), which offers functions for transforming strings and arrays and for handling math, dates and booleans. Many of these functions can be called from the OpenRefine user interface. Furthermore, all actions and function calls are kept in a script that can be used to undo and redo actions and to replay a complete transformation. A lot of emphasis also went into offering facilities to reconcile values with entities found on the web. Extensions are available to add additional functionalities. At the end, the cleansed data can be exported to Excel, ODF, and tab- or comma-separated text, and, with the RDF extension, to triples to be used as Linked Open Data.

5.4.3.6 Technical constraints OpenRefine is available for Mac, Windows and Linux. It starts up as a local web server.

5.4.3.7 License Open Source.

5.4.3.8 Evaluation

5.4.3.8.1 Context of use


Publish: + | Reuse: +

The tool can be used both by publishers who want to make sure the datasets to be published are of good quality, and by re-users who can use the tool to improve a downloaded dataset so that it better fits their own purposes.

5.4.3.8.2 Target groups

Private Sector Employee: + | Public Sector Employee: + | Student with coding skills: + | Student without coding skills: +

OpenRefine can be used by all types of users.

5.4.3.8.3 Functionalities

Functionality | Available
Data Quality Assessment | +
Data Cleansing | +
Data Blending/Enrichment | +
Data format conversion | +
Descriptive analytics | Limited
Visualisation | -
Dashboarding | -
Prescriptive analytics | -
Storytelling / Notebooks | -

It covers a range of data treatment functions: QA assessment, cleansing, enrichment and blending, and format conversion.

Support | Available
Big Data | -
Multidimensional data (cubes) | -
Linked Data | +


One of the drawbacks of the tool is that the amount of data that can be treated is limited by the available memory, which causes some difficulties in working with larger datasets. On the other hand, an add-on is available to include RDF as one of the output formats.

5.4.3.8.4 Assessment

Aspect | Assessment
Ease of installation | ±
Ease of use | ±
User centric versus technology/science centric | User

It can be challenging to install OpenRefine, certainly when extensions are involved. Compared to other tools it lacks some visualisation features to easily explore the data.

5.4.3.8.5 Notes
It is a very popular data wrangling solution with, in the past, a very vivid community (30 contributors and many, many forks) and lots of training material available on the web. For some time, however, the level of commitment has been falling, which has prevented a final 2.6 release from being made. Since it is not always easy to install the product, certainly not with some of the extensions, in an educational context one can use the hosting service RefinePro, which prepares the necessary configuration for you as a hosted service12.

5.4.3.9 Contact
Website: http://openrefine.org/
Download: http://openrefine.org/download.html

12 http://refinepro.com/hosting/


5.4.4 Trifacta Wrangler

5.4.4.1 What is Trifacta Wrangler is a tool specifically made to help a non-programmer address all data wrangling tasks in a very interactive way. It is the commercial successor of Stanford’s Data Wrangler13.

5.4.4.2 How does it look like Figure 8 shows the Trifacta Wrangler environment.

Figure 8. Trifacta Wrangler screenshot

5.4.4.3 Developer/maintainer Trifacta14.

5.4.4.4 Used by The New York Times, Atlassian, McGrawHill Education, Google etc.

5.4.4.5 Functionalities
Trifacta Wrangler allows you to import delimited text files, JSON and Microsoft Excel files.

13 http://vis.stanford.edu/wrangler/ 14 https://www.trifacta.com/


Once loaded, the software offers several profiling functionalities. It infers the data type and other properties of each field/column. For each column it shows a bar indicating the quality and a histogram showing the distribution of the values. It has transformation functions for:
• Restructuring (splitting/collapsing columns, deleting fields, pivoting rows)
• Cleaning (replacing/deleting missing or mismatched values)
• Enriching (joining multiple data sources)
These transformations are suggested based on user actions (clicking a diagram, selecting text, …). All transformations are translated into a domain-specific declarative language called Wrangle (which can be edited by a tech-savvy user) and recorded in a script that can be rerun, leading to reproducible results. At the end of the process the software helps you to validate the wrangling script on the full dataset. Once satisfied, one can publish to CSV, JSON and Tableau.

5.4.4.6 Technical constraints Windows or OSX (Mac) with min 4GB RAM and 2 GB free disk space and an Internet connection. One needs to register.

5.4.4.7 License Commercial but the desktop version is free.

5.4.4.8 Evaluation

5.4.4.8.1 Context
Publish: + | Reuse: +
The tool can be used both by publishers who want to make sure that the datasets to be published are of good quality and by re-users who can use the tool to improve a downloaded dataset so that it better fits their own purposes.

5.4.4.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ +
The software is really aimed at business users, but can of course be used by everyone. OpenRefine potentially better suits the mindset of students and programmers.

5.4.4.8.3 Functionalities
Data Quality Assessment: +
Data Cleansing: +
Data Blending/Enrichment: +
Data format conversion: Limited
Descriptive analytics: +
Visualisation: -
Dashboarding: -
Prescriptive analytics: -
Storytelling / Notebooks: -
Everything a user needs to bring a dataset into the desired shape for further treatment.

Support:
Big Data: only in the enterprise version
Multidimensional data (cubes): -
Linked Data: -
Big Data support is available but at a price, and there is no support for linked data since it is not that popular in a business environment.

5.4.4.8.4 Assessment
Ease of installation: +
Ease of use: +
User centric versus technology/science centric: user
The leading tool in functionalities and user support for data wrangling.

5.4.4.8.5 Notes This is one of the most comprehensive and scientifically well-grounded data wrangling solutions with the most visual and interactive interface easily understandable by business users. Highly recommended.

5.4.4.9 Contact Website: https://www.trifacta.com/products/wrangler/ Download: https://www.trifacta.com/start-wrangling/

5.4.5 Talend Data Preparation Free Edition

5.4.5.1 What is Talend Data Preparation allows you to read a dataset, define a recipe with the necessary transformations and generate an output based on this preparation.

5.4.5.2 How does it look like Figure 9 shows the Talend Data Preparation environment.


Figure 9. Talend Data Preparation screenshot

5.4.5.3 Developer/maintainer Talend15.

5.4.5.4 Used by Not known.

5.4.5.5 Functionalities
Talend Data Preparation reads CSV and Microsoft Excel files. Once loaded, the software offers several profiling functionalities. It infers the data type and other properties of each field/column. For each column it shows a bar indicating the quality and a histogram showing the distribution of the values. It has transformation functions for:
• Restructuring (splitting columns, deleting fields)
• Cleaning (replacing/deleting missing or mismatched values)
• Enriching (joining multiple data sources)
These transformations are suggested depending on a row or column selection. Deduplication, however, is not yet supported. Transformations are kept in a script that can be rerun and exchanged. The results of the preparation can be exported to CSV and XLSX.

5.4.5.6 Technical constraints Windows and OSX with 1GB RAM and 5GB Hard disk space.

5.4.5.7 License Unknown.

5.4.5.8 Evaluation

5.4.5.8.1 Context of use

15 https://www.talend.com/


Publish: + | Reuse: +
The tool can be used both by publishers who want to make sure that the datasets to be published are of good quality and by re-users who can use the tool to improve a downloaded dataset so that it better fits their own purposes.

5.4.5.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ + +
The tool can be used by all users, but we can imagine that a person with programming skills would prefer a more tuneable solution.

5.4.5.8.3 Functionalities
Data Quality Assessment: +
Data Cleansing: +
Data Blending/Enrichment: +
Data format conversion: Limited
Descriptive analytics: +
Visualisation: -
Dashboarding: -
Prescriptive analytics: -
Storytelling / Notebooks: -
All functionalities needed to bring a dataset into the desired shape for further processing.

Support:
Big Data: ?
Multidimensional data (cubes): -
Linked Data: -
No evidence found for support of these more specialised data formats.


5.4.5.8.4 Assessment
Ease of installation: +
Ease of use: +
User centric versus technology/science centric: user

5.4.5.8.5 Notes It is a good wrangling environment but less elaborate and less interactive than Trifacta’s Wrangler.

5.4.5.9 Contact Website: https://www.talend.com/products/data-preparation Download: https://www.talend.com/download/talend-open-studio#t8

5.4.6 Exploratory

5.4.6.1 What is
A desktop application for data wrangling, visualization, and advanced analytics based on the R programming language.

5.4.6.2 How does it look like Figure 10 shows the Exploratory platform’s environment.


Figure 10. Exploratory screenshot

5.4.6.3 Developer/maintainer Exploratory16.

5.4.6.4 Used by
Unknown; the product is still in beta.

5.4.6.5 Functionalities
In this part we only focus on the data wrangling functionalities. Exploratory can read delimited and fixed width files, Excel, SPSS/SAS/STATA, R data files and JSON from the local file system. Supported remote data sources are: MongoDB, MySQL, Redshift, PostgreSQL, Google spreadsheets, BigQuery etc.

16 https://exploratory.io/


Once a data source is opened, the data are shown in a summary view indicating for each column the inferred data type, an indication of the empty values and a histogram showing the distribution of the values. Exploratory comes with a range of transformation functions provided by the R package dplyr, a grammar for data manipulation. These transformation functions can be called and run by choosing menu items in the interface. The dplyr package has transformation functions for: reshaping, subsetting rows and columns, summarizing, making new columns, grouping data and combining datasets.
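For readers who know Python rather than R, the following minimal sketch shows roughly how the same dplyr-style verbs (filter, select, mutate, group_by/summarise, arrange) look in pandas. It is an illustration only, not part of Exploratory or dplyr itself, and the dataframe and column names are invented:

    import pandas as pd

    # Hypothetical data, invented for the example
    df = pd.DataFrame({
        "city": ["Ghent", "Antwerp", "Ghent", "Antwerp"],
        "year": [2014, 2014, 2015, 2015],
        "visitors": [120, 340, 150, 360],
    })

    result = (
        df[df["year"] == 2015]                              # filter rows   (dplyr: filter)
        [["city", "visitors"]]                              # select columns (dplyr: select)
        .assign(thousands=lambda d: d["visitors"] / 1000)   # add a column  (dplyr: mutate)
        .groupby("city", as_index=False)                    # group         (dplyr: group_by)
        .agg(total=("visitors", "sum"))                     # summarise     (dplyr: summarise)
        .sort_values("total", ascending=False)              # order rows    (dplyr: arrange)
    )
    print(result)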

5.4.6.6 Technical constraints
Still in beta. Available on Mac and Windows and connects to the Exploratory website. It also installs R.

5.4.6.7 License Unclear. Commercial with a free plan. Limitations of the free plan are not known yet.

5.4.6.8 Evaluation

5.4.6.8.1 Context of use
Reuse: +
More aimed at exploratory analytics, hence more reuse centered.

5.4.6.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ +
Since this is in fact an interface on top of the dplyr library of the statistical programming language R, it is obviously more oriented to people with programming skills or a data science profile.

5.4.6.8.3 Functionalities
Data Quality Assessment: +
Data Cleansing: +
Data Blending/Enrichment: +
Data format conversion: -
Descriptive analytics: +
Visualisation: +
Dashboarding: -
Prescriptive analytics: + (via R)
Storytelling / Notebooks: -
Almost everything you need as a data scientist, thanks to the R integration.

Support:
Big Data: ?
Multidimensional data (cubes): + (via pivot tables)
Linked Data: -
No support for linked data; pivot tables are available for exploring multidimensional data.

5.4.6.8.4 Assessment
Ease of installation: +
Ease of use: ±
User centric versus technology/science centric: technology/science
Very good tool, but for a specific audience.

5.4.6.8.5 Notes
This is a very good environment for more technically oriented people to learn in a user-friendly way how to do data wrangling the R way, and it hence offers an easy step-up to the rest of the R world.

5.4.6.9 Contact Website: https://exploratory.io/ Download: https://exploratory.io/download


5.4.7 Dataiku Data Science Studio Free Edition

5.4.7.1 What is It is an integrated development platform for data professionals to turn raw data into predictions in a collaborative way.

5.4.7.2 How does it look like Figure 11 shows the Data Science Studio environment.

Figure 11. Data Science Studio screenshot

5.4.7.3 Developer/maintainer Dataiku17.

5.4.7.4 Used By Axa, l’Oréal, Cap Gemini, Coyote. DSS is one of the fastest growing products in the data science space.

17 http://www.dataiku.com/


5.4.7.5 Functionalities
This description is limited to the data wrangling field. The free version connects to file systems (local, via HTTP, FTP etc.) supporting delimited and fixed width text files, JSON and Excel. Connections to MySQL and PostgreSQL are also available. When opening a data source, it offers data profiling, automatically detecting the contained data types and indicating the percentages of empty and wrong values. Furthermore, the distributions per field can be shown for outlier detection, and clustering algorithms can be applied comparable to those used in OpenRefine (5.4.3). Next to the profiling functionalities, datasets can be sampled, split, grouped, joined with other datasets, reshaped (restructured), and the values of the cells can be transformed and cleansed. This is done with a very elaborate library containing 81 processors.

5.4.7.6 Technical constraints
Available on Mac OS, Windows, Linux and as a VMware or VirtualBox image. It starts up a local web server using a connection to the Dataiku server.

5.4.7.7 License Commercial with a limited free version.

5.4.7.8 Evaluation

5.4.7.8.1 Context of use
Reuse: +
Only for the data re-user with a data science profile.

5.4.7.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
± + +
Aimed at Data Science profiles, addressing both coders and non-coders.

5.4.7.8.3 Functionalities
Data Quality Assessment: +
Data Cleansing: +
Data Blending/Enrichment: +
Data format conversion: -
Descriptive analytics: +
Visualisation: +
Dashboarding: only in the enterprise version
Prescriptive analytics: +
Storytelling / Notebooks: +
Everything needed for doing data science and preparing the data accordingly.

Support:
Big Data: only in the enterprise version
Multidimensional data (cubes): -
Linked Data: -
Big Data is supported but at a cost.

5.4.7.8.4 Assessment
Ease of installation: +
Ease of use: ±
User centric versus technology/science centric: a mix
Too overwhelming for an occasional user, but one of the most comprehensive data science suites on the market.

5.4.7.8.5 Notes
Covers the whole field of data wrangling, but the integration within a larger framework aimed at machine learning with polyglot programming support can be daunting for occasional users.

5.4.7.9 Contact
Website: http://www.dataiku.com/dss/
Download: http://www.dataiku.com/dss/trynow/


5.5 Tools to convert between data formats

5.5.1 Positioning in the Open Data Lifecycle
Table 6. Format conversion positioning
Publish: Collect | Prepare (QA and cleanse; Transform to Format) | Assign Metadata, including open license (Dataset level; Data level) | Publish (Bulk; API) | Make Discoverable (Portal) | Maintain
Re-use: Obtain (Portal navigation and search; Download/via API) | Scrub (Cleanse: remove, change data types, handle missing data; Transform; Enhance, Blend) | Explore (Visualize: table, graphs, plots; Derive stats) | Model (Explain, Predict) | Interpret and communicate (Story telling)

5.5.2 OpenRefine
OpenRefine (cf. supra) takes TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents as input and is able to convert those into TSV, CSV, Excel (xls, xlsx), ODF spreadsheet and any text format using templates.

5.5.3 The DataTank

5.5.3.1 What is
It is a server that connects to source datasets, converts them into other formats and exposes them through a RESTful API.

5.5.3.2 How does it look like Figure 12 shows the environment of the DataTank tool.


Figure 12. The DataTank screenshot

5.5.3.3 Developer/maintainer Open Knowledge Belgium18.

5.5.3.4 Used by The Flemish Open Data portal, the cities of Antwerp, Ghent, Kortrijk.

5.5.3.5 Functionalities
It captures DCAT-AP compliant metadata of datasets. Import from: CSV, XLS, XML, JSON-LD, SHP, JSON, RDF, SPARQL stores, MySQL stores. Publishing as CSV, JSON, XML, RDF. Depending on the data content, the data can be presented as HTML or as a map.
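As an illustration of what consuming such a RESTful API looks like from code, the following minimal Python sketch fetches a dataset as JSON with the requests library. The host name and dataset path are hypothetical and not taken from an actual DataTank installation; the general pattern (a dataset URI to which a format extension is appended) is an assumption for the example:

    import requests

    # Hypothetical DataTank resource URI with a .json format extension
    url = "http://data.example.org/demography/population.json"

    response = requests.get(url)
    response.raise_for_status()      # fail loudly on HTTP errors

    data = response.json()           # the dataset as parsed JSON
    print(type(data))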

5.5.3.6 License Open source.

18 http://www.openknowledge.be/


5.5.3.7 Evaluation

5.5.3.7.1 Context of use
Publish: +
A potential addition to CKAN for offering data conversion facilities and a data API.

5.5.3.7.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ +
Only relevant for data publishers and for organisers of hackathons or other coders who want to quickly set up programming APIs.

5.5.3.7.3 Functionalities
Data Quality Assessment: -
Data Cleansing: -
Data Blending/Enrichment: -
Data format conversion: +
Descriptive analytics: -
Visualisation: +
Dashboarding: -
Prescriptive analytics: -
Storytelling / Notebooks: -
Focussing on a few publishing functionalities.

Support:
Big Data: +
Multidimensional data (cubes): -
Linked Data: +
It supports big data, and one can define conversions to RDF for integration in the Linked Open Data world.

5.5.3.7.4 Assessment
Ease of installation: ?
Ease of use: +
User centric versus technology/science centric: technology
Can be used as an add-on to CKAN for addressing its major shortcomings.

5.5.3.7.5 Notes System aimed at publishing datasets in a variety of open formats.

5.5.3.8 Contact Website: http://www.thedatatank.com/ Download: https://github.com/tdt/core.git

5.6 Tools to analyse and visualize the data

5.6.1 Positioning in the Open Data Lifecycle
Table 7. Analysing data positioning
Publish: Collect | Prepare (QA and cleanse; Transform to Format) | Assign Metadata, including open license (Dataset level; Data level) | Publish (Bulk; API) | Make Discoverable (Portal) | Maintain
Re-use: Obtain (Portal navigation and search; Download/via API) | Scrub (Cleanse: remove, change data types, handle missing data; Transform; Enhance, Blend) | Explore (Visualize: table, graphs, plots; Derive stats) | Model (Explain, Predict) | Interpret and communicate (Story telling)

5.6.2 Introduction
What is meant here by analysis is the ability to generate descriptive statistics on the dataset, with two major areas:
• Information on the distribution of a single variable, by calculating min, max, median, quartiles, mean and standard deviation and using graphics such as histograms and box plots
• Bivariate analysis results indicating correlation and covariance, using scatterplots, …
Other types of charts and graphics, e.g. maps for location related data and timelines for time related observations, clearly help in gaining understanding of the data. It is important to mention that the IDL research group (http://idl.cs.washington.edu/) offers a recommendation engine for suggesting the most appropriate visualisation based on the nature of your data and your exploration needs.
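As a small illustration of these descriptive statistics, the following Python/pandas sketch computes the univariate summary and a bivariate correlation for a hypothetical dataset; the file name and the column names "value" and "other_value" are assumptions made purely for the example:

    import pandas as pd

    df = pd.read_csv("dataset.csv")   # hypothetical downloaded open dataset

    # Univariate: count, mean, standard deviation, min, quartiles (50% = median), max
    print(df["value"].describe())

    # Distribution of a single variable as a histogram and a box plot
    df["value"].plot(kind="hist")
    df["value"].plot(kind="box")

    # Bivariate: correlation between two columns and a scatterplot
    print(df["value"].corr(df["other_value"]))
    df.plot(kind="scatter", x="value", y="other_value")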

5.6.3 Data Voyager

5.6.3.1 What is Data Voyager is a visualization browser for open-ended data exploration. It is built using Vega-Lite, a high-level visualization grammar.

5.6.3.2 How does it look like Figure 13 shows the Data Voyager environment.


Figure 13. Data Voyager screenshot

5.6.3.3 Developer/maintainer
University of Washington Interactive Data Lab, led by Jeffrey Heer, also co-founder of Trifacta19.

5.6.3.4 Used by
Not known. The underlying Vega technology is used in other products (cf. 5.7.4).

5.6.3.5 Functionalities
Voyager creates a gallery of automatically generated visualizations which can be navigated in an interactive way. After loading a dataset, the system detects the data types of the fields and calculates descriptive analytics: min, max, mean, standard deviation, median, a sample etc. For every field a visualization is proposed by a recommendation engine that uses the type and distribution of the data as input.

19 http://idl.cs.washington.edu/


In this overview one can select a field of interest, and Voyager then automatically updates the view with relevant visualizations of the chosen field and its relation to all the others. When you combine a field with another field, scatterplots are built. If the user sees a visualization of interest, it can be bookmarked for further use. The authors indicate that this type of interaction is better suited for free exploration of a dataset. When a specific analytics question needs to be addressed, they offer a companion product named Polestar20, which offers an approach comparable to Tableau.

5.6.3.6 Technical constraints Data Voyager is available as a web service. It can also be locally installed. There is a dependency then on node.js.

5.6.3.7 License Open Source.

5.6.3.8 Evaluation

5.6.3.8.1 Context of use
Reuse: +
Tool for exploratory analysis of data by data re-users.

5.6.3.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ | + | + | +
It is made to show all types of users the nature of the data, potential issues and correlations between fields.

5.6.3.8.3 Functionalities
Data Quality Assessment: +
Data Cleansing: -
Data Blending/Enrichment: -
Data format conversion: -
Descriptive analytics: +
Visualisation: +
Dashboarding: -
Prescriptive analytics: -
Storytelling / Notebooks: -
Offers optimised visualisations to 'explain' the data. It is aimed at the exploration of a dataset by offering graphics that tell what's inside the data, not at letting an end-user define and build a visualisation.

Support:
Big Data: ?
Multidimensional data (cubes): -
Linked Data: -
No indications found.

20 http://vega.github.io/polestar/

5.6.3.8.4 Assessment
Ease of installation: +
Ease of use: +
User centric versus technology/science centric: user
Very comprehensive solution to detect what's in the data.

5.6.3.8.5 Notes A very fresh approach to data exploration, making sure that all potential outliers, correlations, … are easily found.


5.6.3.9 Contact Website: https://vega.github.io/voyager/ Download: https://github.com/vega/voyager

5.6.4 Dataseed

5.6.4.1 What is Dataseed is an online platform for interactive data visualisation, analysis and reporting.

5.6.4.2 How does it look like Figure 14 presents the Dataseed environment.

Figure 14. Dataseed screenshot

5.6.4.3 Developer/maintainer Atchai21.

21 http://atchai.com/


5.6.4.4 Used by Harvest, Tailster, Resultsmark, hscic.

5.6.4.5 Functionalities
One can upload spreadsheet files or connect with Google Drive, GitHub and Dropbox. Data will be automatically aggregated and visualised. The charts are chosen based on the nature of the data and are clickable for filtering and further exploration. The automatically generated charts can be improved and adapted to personal preferences. The charts can be published and shared. There is an open-source toolkit that allows creating custom visualisations driven by the Dataseed back-end.

5.6.4.6 Technical constraints The open source toolkit has dependencies on nodeJS and npm.

5.6.4.7 License For the open source toolkit the GNU Affero General Public License.

5.6.4.8 Evaluation

5.6.4.8.1 Context of use
Reuse: +
Automatically builds graphics to get insight into the data. These can be further worked upon and shared.

5.6.4.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ +
Tool for non-coders to get an insight into the data. People with coding skills will look at alternatives. Public visualisations are free, hence easily incorporated in public sector open data analysis and/or publishing flows.

5.6.4.8.3 Functionalities
Data Quality Assessment: -
Data Cleansing: -
Data Blending/Enrichment: -
Data format conversion: -
Descriptive analytics: +
Visualisation: +
Dashboarding: -
Prescriptive analytics: -
Storytelling / Notebooks: -
More focus on descriptive analytics than the other graphics tools.

Support:
Big Data: -
Multidimensional data (cubes): -
Linked Data: -

5.6.4.8.4 Assessment
Ease of installation: +
Ease of use: +
User centric versus technology/science centric: user
Easy and good looking tool to get an overview of a dataset.

5.6.4.8.5 Notes Very good automatic tool for exploring a dataset in a visual way.

5.6.4.9 Contact Website: https://getdataseed.com/


Download: https://github.com/dataseed/dataseed-visualisation.js

5.6.5 Tableau Desktop Public

5.6.5.1 What is Tool to visualize and share data.

5.6.5.2 How does it look like Figure 15 shows the Tableau environment.

Figure 15. Tableau screenshot

5.6.5.3 Developer/maintainer Tableau22.

5.6.5.4 Used by More than 190.000 users are known.

22 http://www.tableau.com/


5.6.5.5 Functionalities
One can open Excel, text files and statistical files, and connect to Google Sheets and OData servers as data sources. The software automatically detects the data type of each field, and table restructuring functions are available. For each data source one is able to define multiple worksheet visualisations. The complete list of possibilities is: text tables, bar, line, pie, map, scatter plot, Gantt, bubble, histogram, heat, highlight, treemap, box-and-whisker plot. Once a visualisation is built, one can overlay it with analytics indicators such as an average line, the median with quartiles, a distribution band, a box plot, … and the latest version (v10.0) also offers clustering. It is also important to note that Tableau Desktop is able to work with multidimensional cubes as found in statistical datasets.

5.6.5.6 Technical constraints Available for Windows and Mac.

5.6.5.7 License
Commercial, but a free public version is available. The visualisations made by this version become publicly available on the Tableau cloud server.

5.6.5.8 Evaluation

5.6.5.8.1 Context of use
Reuse: +
Aimed at building tables, graphics, maps, … from data to gain insights and communicate those to others.

5.6.5.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ +
Can be used by anyone, but is more aimed at business users (without coding skills).


5.6.5.8.3 Functionalities
Data Quality Assessment:
Data Cleansing:
Data Blending/Enrichment: +
Data format conversion:
Descriptive analytics: +
Visualisation: +
Dashboarding: +
Prescriptive analytics:
Storytelling / Notebooks: +
Everything to build graphics for getting and communicating insights.

Support:
Big Data: - (available in the enterprise version)
Multidimensional data (cubes): +
Linked Data: -
Big Data, but at a cost. Very good support for multidimensional data.

5.6.5.8.4 Assessment
Ease of installation: +
Ease of use: +
User centric versus technology/science centric: user
A very good, relatively easy and nice looking tool for data visualisation and communication on all devices.

5.6.5.8.5 Notes One of the leading and immensely popular players in the visualisation/analysis field.

5.6.5.9 Contact Website: https://public.tableau.com/s/ Download: https://public.tableau.com/en-us/s/download/thanks

5.6.6 Exploratory

5.6.6.1 Info More info on the tool at Section 5.4.6.

5.6.6.2 Relevant functionalities
For single variables min, max, mean and median are calculated and the distribution is shown using histograms. The available chart types are: bar, line, area, histogram, scatter, boxplot, map, choropleth, heatmap and contour. Bivariate analysis can be done by constructing scatterplots.

5.6.7 Dataiku Data Science Studio

5.6.7.1 Info More info on the tool at Section 5.4.7

5.6.7.2 Relevant functionalities
Dataiku offers: min, max, mean, median, standard deviation, distinct values and a histogram of the distribution. Available chart types are: bars, bars 100%, histogram, stacked, stacked 100%, lines, stacked area, 100% stacked area, pie, donut, scatter plot, bubble, hexagon, grouped bubbles, scatter and grid maps.


5.7 Tools to visualize the data

5.7.1 Positioning in the Open Data Lifecycle
Table 8. Visualisation positioning
Publish: Collect | Prepare (QA and cleanse; Transform to Format) | Assign Metadata, including open license (Dataset level; Data level) | Publish (Bulk; API) | Make Discoverable (Portal) | Maintain
Re-use: Obtain (Portal navigation and search; Download/via API) | Scrub (Cleanse: remove, change data types, handle missing data; Transform; Enhance, Blend) | Explore (Visualize: table, graphs, plots; Derive stats) | Model (Explain, Predict) | Interpret and communicate (Story telling)

5.7.2 Introduction In this chapter we describe the main tools and frameworks to create graphics from data.

5.7.3 Google Fusion Tables

5.7.3.1 What is Google Fusion Tables is a web application to visualize and share data tables.

5.7.3.2 Developer/maintainer Google.

5.7.3.3 How does it look like Figure 16 shows the Google Fusion Table platform.


Figure 16. Google Fusion Table screenshot

5.7.3.4 Used by Users are: the Guardian, the Toronto Globe and Mail, UCSF Global Health Sciences, Honda, Texas Tribune etc.

5.7.3.5 Functionalities
The service allows users to:
• filter and summarize data
• combine data with other datasets
• visualize the data using a chart, map, network graph, or custom layout
• embed and share
• offer an API to the data

5.7.3.6 Technical constraints A modern web browser.

5.7.3.7 License Google’s terms of service.


5.7.3.8 Evaluation

5.7.3.8.1 Context of use
Reuse: +
Tool to reuse the data for building graphics.

5.7.3.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ + +
Easy to use for everyone, although we can imagine that people with coding skills prefer more low-level tools for better control.

5.7.3.8.3 Functionalities
Data Quality Assessment: -
Data Cleansing: -
Data Blending/Enrichment: +
Data format conversion: -
Descriptive analytics: -
Visualisation: +
Dashboarding: -
Prescriptive analytics: -
Storytelling / Notebooks: -

Support:
Big Data: -
Multidimensional data (cubes): -
Linked Data: -
No evidence found for support of these specialised types of data.

5.7.3.8.4 Assessment
Ease of installation: +
Ease of use: +
User centric versus technology/science centric: user
Easy to use tool.

5.7.3.8.5 Notes Free unlimited and powerful environment.

5.7.3.9 Contact
Website: https://support.google.com/fusiontables/answer/2571232
Download: https://chrome.google.com/webstore/detail/fusion-tables-experimenta/pfoeakahkgllhkommkfeehmkfcloagkl

5.7.4 Vega and Vega-Lite

5.7.4.1 What is
Vega is a declarative format for creating, saving, and sharing visualizations. With Vega, visualizations are described in JSON, and interactive views can be generated using either HTML5 Canvas or SVG. Vega-Lite provides a higher-level grammar for visual analysis that generates complete Vega specifications. Online text editors for both syntaxes are available:
• https://vega.github.io/vega-editor/?spec=bar for Vega
• https://vega.github.io/vega-editor/?mode=vega-lite for Vega-Lite
There is also an online design environment named LYRA that enables custom visualization design without writing any code (https://idl.cs.washington.edu/projects/lyra/app/).

5.7.4.2 How does it look like Figure 17 presents the Vega environment.


Figure 17. Vega screenshot

5.7.4.3 Developer/maintainer IDL23.

5.7.4.4 Used by The Vega family is used in Trifacta, DataVoyager, PoleStar. Vega can be used from Python, Julia, R and is integrated in ggvis, MediaWiki and Cedar.

5.7.4.5 Functionalities
Vega-Lite allows you to describe a visualization as a set of encodings that map data fields to the properties of graphical marks, using a JSON format.

23 http://idl.cs.washington.edu/


Vega-Lite supports data transformations such as aggregation, binning, filtering, and sorting and layout transformations including stacked layouts and faceting into small multiples.
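To give an impression of what such a declarative specification looks like, the following minimal sketch builds a Vega-Lite specification as a Python dictionary and prints it as JSON; the data values and field names are invented for illustration, and the output can be pasted into the online Vega-Lite editor mentioned above:

    import json

    # Illustrative Vega-Lite specification: a bar chart of hypothetical category counts
    spec = {
        "data": {"values": [
            {"category": "A", "count": 28},
            {"category": "B", "count": 55},
            {"category": "C", "count": 43},
        ]},
        "mark": "bar",
        "encoding": {
            "x": {"field": "category", "type": "nominal"},
            "y": {"field": "count", "type": "quantitative"},
        },
    }

    print(json.dumps(spec, indent=2))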

5.7.4.6 Technical constraints Client-side it is just using javascript and web components. Server-side there is a dependency on node.js.

5.7.4.7 License Open Source

5.7.4.8 Evaluation

5.7.4.8.1 Context of use
Publish: + | Reuse: +
A framework to be used by coders, either as consumers or as publishers.

5.7.4.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ +
Limited coding skills are needed. We can imagine that this framework also appeals to open data portal publishers who want to add graphics to the published datasets.

5.7.4.8.3 Functionalities
Data Quality Assessment: -
Data Cleansing: -
Data Blending/Enrichment: +
Data format conversion: -
Descriptive analytics: -
Visualisation: +
Dashboarding: -
Prescriptive analytics: -
Storytelling / Notebooks: -

Support:
Big Data: -
Multidimensional data (cubes): -
Linked Data: -
Unknown.

5.7.4.8.4 Assessment
Ease of installation: -
Ease of use: +
User centric versus technology/science centric: technology (grammar of graphics) focused
A framework certainly to be considered by people with coding skills.

5.7.4.8.5 Notes Very interesting approach since the library uses knowledge about how to render visualisations for analysis purposes.

5.7.4.9 Contact Website: https://vega.github.io/vega/ Download: https://github.com/vega/vega

5.7.5 Plotly

5.7.5.1 What is
Plotly is a cloud environment for the creation of data visualisations and dashboards with integrated collaboration facilities. Next to the cloud environment, high-level declarative charting libraries in R, Python, JavaScript and MATLAB are available. The JS library has been open sourced.


5.7.5.2 How does it look like Figure 18 shows the Plotly environment.

Figure 18. Plotly screenshot

5.7.5.3 Developer/maintainer Plotly24.

5.7.5.4 Used by Google, US Airforce, New York University, NetFlix a.o.

5.7.5.5 Functionalities Data can be imported. Data formats supported vary depending on the subscription level. The main focus is on high end visualization. The supported charts however once again depend on the subscription level.

24 https://plot.ly/


Plotly v2 also allows generating descriptive analytics per field: mean, median, quartiles, standard deviation and variance.
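As an illustration of the declarative charting libraries mentioned above, the following minimal sketch uses the Plotly Python library to build a simple bar chart offline; the data values and file name are invented, and details may vary between Plotly versions:

    import plotly.graph_objs as go
    from plotly.offline import plot

    # Illustrative data only
    trace = go.Bar(x=["A", "B", "C"], y=[28, 55, 43])
    layout = go.Layout(title="Illustrative bar chart")
    fig = go.Figure(data=[trace], layout=layout)

    # Writes a standalone, interactive HTML file without using the cloud service
    plot(fig, filename="example-bar.html")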

5.7.5.6 Technical constraints None for the cloud service except for a modern browser.

5.7.5.7 License Plotly is free for unlimited public use. The Javascript library is open source. API use can be limited according to the plan subscribed to.

5.7.5.8 Evaluation

5.7.5.8.1 Context of use
Publish: + | Reuse: +
Same as Vega.

5.7.5.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ | + | + | +
In addition to all kinds of support for programming in different languages, Plotly offers some entry points for people without coding skills.

5.7.5.8.3 Functionalities
Data Quality Assessment: -
Data Cleansing: -
Data Blending/Enrichment: -
Data format conversion: -
Descriptive analytics: +
Visualisation: +
Dashboarding: +
Prescriptive analytics: -
Storytelling / Notebooks: -

Support:
Big Data: ?
Multidimensional data (cubes): -
Linked Data: -

5.7.5.8.4 Assessment
Ease of installation: +
Ease of use: +
User centric versus technology/science centric: mix
For those with programming skills it is certainly worth a look. Publishers can also use Plotly to enhance their cataloguing solution by adding graphics; Data.gov, for example, is doing this.

5.7.5.8.5 Notes Very powerful environment. The free version however offers a more limited choice of available graphics.

5.7.5.9 Contact Website: https://plot.ly/ Download: https://plot.ly/api/

5.7.6 Quadrigram

5.7.6.1 What is A visual drag-and-drop data editor allowing to create interactive visualisations without coding. Multiple components can be combined for storytelling and everything made is shareable.


5.7.6.2 How does it look like Figure 19 shows the Quadrigram editor’s environment.

Figure 19. Quadrigram screenshot

5.7.6.3 Developer/maintainer Bestiario25.

5.7.6.4 Used by Mostly individuals.

5.7.6.5 Functionalities
One can load data and store it on Google Drive. Data can be filtered, aggregated and sorted. The following charts are available: bar, scatter, stacked bar, stacked area and map. Charts can be connected. The result can be published and shared over the internet.

25 http://www.bestiario.org/


The latest version also includes pivot tables for handling multidimensional values.

5.7.6.6 Technical constraints None.

5.7.6.7 License Terms of Use.

5.7.6.8 Evaluation

5.7.6.8.1 Context of use
Reuse: +
A tool for reusing data to build graphics and stories.

5.7.6.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ + +
Environment made for non-coders.

5.7.6.8.3 Functionalities
Data Quality Assessment: -
Data Cleansing: -
Data Blending/Enrichment: +
Data format conversion: -
Descriptive analytics: -
Visualisation: +
Dashboarding: +
Prescriptive analytics: -
Storytelling / Notebooks: +
Dashboarding and storytelling facilities are included.

Support:
Big Data: -
Multidimensional data (cubes): -
Linked Data: -

5.7.6.8.4 Assessment
Ease of installation: +
Ease of use: medium
User centric versus technology/science centric: user
An IDE for building graphics and stories, but not always trivial to use.

5.7.6.9 Contact Website: http://www.quadrigram.com/

5.7.7 Datawrapper

5.7.7.1 What is Datawrapper is a service to create and publish graphics.

5.7.7.2 What does it look like Figure 20 shows the Datawrapper service.


Figure 20. Datawrapper screenshot

5.7.7.3 Developer/maintainer Journalism++ Cologne26.

5.7.7.4 Used by Mostly used in the news and journalism domains: The Guardian, Washington Post, de Standaard.

5.7.7.5 Functionalities The available chart types are: line, bar, stacked bar, map, donut and table.

5.7.7.6 Technical constraints None.

5.7.7.7 License Free software under MIT license.

26 http://www.jplusplus.org/de/cologne/


5.7.7.8 Evaluation

5.7.7.8.1 Context of use
Reuse: +
Tool for easily building graphics from data.

5.7.7.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ + +
Tool aimed at non-coders.

5.7.7.8.3 Functionalities
Data Quality Assessment: -
Data Cleansing: -
Data Blending/Enrichment: -
Data format conversion: -
Descriptive analytics: -
Visualisation: +
Dashboarding: -
Prescriptive analytics: -
Storytelling / Notebooks: -

Support:
Big Data: -
Multidimensional data (cubes): -
Linked Data: -


5.7.7.8.4 Assessment
Ease of installation: +
Ease of use: +
User centric versus technology/science centric: user
Ease of use is good and the resulting graphics are also clean and clear.

5.7.7.9 Contact Website: https://www.datawrapper.de/ Download: https://github.com/datawrapper/datawrapper

5.7.8 Raw

5.7.8.1 What is
A web service for creating graphs which are not easily available in other tools: alluvial, bump, circle, circular dendrogram, cluster dendrogram, clustered force layout, convex hull, hexagonal bins, parallel coordinates, streamgraph, … There is hence no support for pie charts, histograms or line charts.

5.7.8.2 How does it look like Figure 21 shows the Raw environment.


Figure 21. Raw screenshot

5.7.8.3 Developer/maintainer Density Design Research Lab27.

5.7.8.4 Used by Unknown.

5.7.8.5 Functionalities
Choose a graphic type and customise the graphic. Once satisfied, one can export the graphic as SVG or PNG. Raw is highly extensible and is accessible by developers via an API.

5.7.8.6 Technical constraints
Raw can also be locally installed and then there are dependencies on , Bower and Python.

27 http://www.densitydesign.org/projects/


5.7.8.7 License Open license (LGPL license).

5.7.8.8 Evaluation

5.7.8.8.1 Context of use
Reuse: +
For making more exotic types of graphics from data.

5.7.8.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ + +
For non-coders who want to build graphics not available in other tools and services.

5.7.8.8.3 Functionalities
Data Quality Assessment: -
Data Cleansing: -
Data Blending/Enrichment: -
Data format conversion: -
Descriptive analytics: -
Visualisation: +
Dashboarding: -
Prescriptive analytics: -
Storytelling / Notebooks: -
Only for creating very specialised graphics.

Support:
Big Data: -
Multidimensional data (cubes): -
Linked Data: -
No support for these specific data types.

5.7.8.8.4 Assessment
Ease of installation: +
Ease of use: ±
User centric versus technology/science centric: focus on lesser known graphics types
Only for very specific uses.

5.7.8.8.5 Notes Is not well suited for very large datasets and doesn’t support all basic graphics.

5.7.8.9 Contact Website: http://raw.densitydesign.org/ Download: https://github.com/densitydesign/raw/

5.8 Tools for building dashboards and doing storytelling

5.8.1 Positioning in the Open Data Lifecycle
Table 9. Dashboards/Storytelling positioning
Publish: Collect | Prepare (QA and cleanse; Transform to Format) | Assign Metadata, including open license (Dataset level; Data level) | Publish (Bulk; API) | Make Discoverable (Portal) | Maintain
Re-use: Obtain (Portal navigation and search; Download/via API) | Scrub (Cleanse: remove, change data types, handle missing data; Transform; Enhance, Blend) | Explore (Visualize: table, graphs, plots; Derive stats) | Model (Explain, Predict) | Interpret and communicate (Story telling)

5.8.2 Tableau Public Desktop
Tableau allows building dashboards. A dashboard is a collection of several worksheets and supporting information shown in a single place so you can compare and monitor a variety of data simultaneously. When you create a dashboard, you can add views from any worksheet. You can also add a variety of supporting objects such as text areas, web pages, and images. From the dashboard, you can format, annotate, drill down, edit axes, and more. Tableau also supports story building. A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey information. One can create stories to show how facts are connected, provide context, demonstrate how decisions relate to outcomes, or simply make a compelling case.

5.8.3 Plotly Online Dashboards Dashboards.ly is an open source web application for arranging plotly graphs into web dashboards.

5.8.4 Quadrigram Quadrigram offers a canvas where multiple components (graphics, maps, shapes, text and media) can be combined.

5.9 Tools for predictive analytics

5.9.1 Positioning in the Open Data Lifecycle
Table 10. Predictive analytics positioning
Publish: Collect | Prepare (QA and cleanse; Transform to Format) | Assign Metadata, including open license (Dataset level; Data level) | Publish (Bulk; API) | Make Discoverable (Portal) | Maintain
Re-use: Obtain (Portal navigation and search; Download/via API) | Scrub (Cleanse: remove, change data types, handle missing data; Transform; Enhance, Blend) | Explore (Visualize: table, graphs, plots; Derive stats) | Model (Explain, Predict) | Interpret and communicate (Story telling)

5.9.2 Predictive analytics Predictive analytics uses data collected in the past to predict what will happen in the future.

5.9.3 BigML

5.9.3.1 What is BigML is a webservice that lets you build models and make predictions with these models.

5.9.3.2 How does it look like Figure 22 shows the BigML service.


Figure 22. BigML screenshot

5.9.3.3 Developer/maintainer BigML Inc.28

5.9.3.4 Used by Datatricks, Persontyle, Quintl etc.

5.9.3.5 Functionalities
BigML offers: decision trees, ensemble learning, clustering, anomaly detection and association discovery. Next to the web interface, there is a Mac app, an open source command-line tool and programming language bindings for Python, Java, Node.js, Clojure and Swift.
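As an illustration of the Python bindings mentioned above, the following minimal sketch follows the general pattern documented for the bigml package (source, dataset, model, prediction); the file name, field name and values are invented for the example:

    from bigml.api import BigML

    # Credentials are read from the BIGML_USERNAME and BIGML_API_KEY environment variables
    api = BigML()

    source = api.create_source("training_data.csv")   # upload a (hypothetical) CSV file
    dataset = api.create_dataset(source)               # derive a dataset from it
    model = api.create_model(dataset)                   # train a decision tree model

    # Ask for a prediction for one new input row; the field name is illustrative
    prediction = api.create_prediction(model, {"field_1": 42})
    api.pprint(prediction)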

5.9.3.6 Technical constraints For the web interface a modern browser.

28 https://bigml.com/


5.9.3.7 License See terms of service.

5.9.3.8 Evaluation

5.9.3.8.1 Context of use
Reuse: +
Reusing data for doing prescriptive analytics.

5.9.3.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ | + | + | +
Normally prescriptive analytics is thought to be the domain of data scientists, but this tool makes machine learning so easy that it can be understood by laymen and business users.

5.9.3.8.3 Functionalities
Data Quality Assessment: -
Data Cleansing: -
Data Blending/Enrichment: -
Data format conversion: -
Descriptive analytics: +
Visualisation: -
Dashboarding: -
Prescriptive analytics: +
Storytelling / Notebooks: -
Focusing on prescriptive analytics and machine learning.

Support:
Big Data: ?
Multidimensional data (cubes): -
Linked Data: -
No.

5.9.3.8.4 Assessment
Ease of installation: +
Ease of use: +
User centric versus technology/science centric: user
To our knowledge, one of the few prescriptive analytics products that can be used by non-data scientists.

5.9.3.8.5 Notes
BigML's ambition is to put the ability to do predictive analytics into the hands of business people, i.e. non-data scientists. In our experience they come close. It is also of interest that they offer an educational program.

5.9.3.9 Contact Website: https://bigml.com/ Download: https://bigml.com/tools

5.9.4 DataScienceStudio
Data Science Studio offers decision trees and clustering, leveraging ML technologies (scikit-learn, MLlib, XGBoost, etc.). One can build and optimise models in Python or R, integrate any external ML library through code APIs (H2O, Skytree, etc.), and get instant visual and statistical feedback on the performance of the model.
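To give a flavour of the kind of code a user with coding skills would write inside such an environment, the following minimal scikit-learn sketch trains and scores a decision tree; it is a generic illustration using a built-in toy dataset, not DSS-specific code:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # A built-in toy dataset stands in for a cleaned open dataset
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))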

5.9.5 SkyTree Express single-user Desktop

5.9.5.1 What is SkyTree is a machine learning environment.


5.9.5.2 How does it look like Figure 23 shows the SkyTree Desktop Edition environment.

Figure 23. SkyTree Desktop Edition screenshot

5.9.5.3 Developer/maintainer SkyTree29.

5.9.5.4 Used by Amex, PayPal, Thomson Reuters, Panasonic.

5.9.5.5 Functionalities A machine-learning Platform accessible via Python, Java, GUI or Command Line that automatically selects parameters and builds models, offers visualisations to explain the model results and automatically documents the whole process.

5.9.5.6 Technical constraints The single-user Desktop GUI version needs to be installed in a VirtualBox.

29 http://www.skytree.net/


Minimum hardware requirements: 8 GB RAM, 2 physical cores. The free version is limited to 100 million data elements.

5.9.5.7 License The attached license is valid for 1 year.

5.9.5.8 Evaluation

5.9.5.8.1 Context of use
Reuse: +
Reusing data for predictive analytics and machine learning.

5.9.5.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+
Coding skills are needed.

5.9.5.8.3 Functionalities
Data Quality Assessment: -
Data Cleansing: -
Data Blending/Enrichment: -
Data format conversion: -
Descriptive analytics: +
Visualisation: +
Dashboarding: -
Prescriptive analytics: +
Storytelling / Notebooks: -
Focus on machine learning and predictive analytics.


Support:
Big Data: +
Multidimensional data (cubes): -
Linked Data: -

5.9.5.8.4 Assessment
Ease of installation: -
Ease of use: ±
User centric versus technology/science centric: technology
Very powerful, but too difficult to install and use for non-ICT people.

5.9.5.8.5 Notes Difficult to install and aimed at real data scientists.

5.9.5.9 Contact
Website: http://www.skytree.net/
Download: http://pages.skytree.net/free-download-GUI.html?utm_medium=website&utm_source=skytree

5.9.6 RapidMiner Studio

5.9.6.1 What is RapidMiner is an open source data science platform of which RapidMiner Studio is the desktop version with a visual development environment.

5.9.6.2 How does it look like Figure 24 presents the RapidMiner Studio platform.


Figure 24. RapidMiner screenshot

5.9.6.3 Developer/maintainer RapidMiner Gmbh30.

5.9.6.4 Used by More than 100.000 users.

5.9.6.5 Functionalities The tool allows to connect to datasets, to profile the dataset, to cleanse and enrich. At any point descriptive analytics can be calculated and it offers a full range of visualisations. More than 120 modelling and prediction algorithms are available. It also has features to score and evaluate the models.

5.9.6.6 Technical constraints
It is available for Windows, Mac and Linux. The free version is limited to 10,000 data rows and 1 processor.

30 https://rapidminer.com/


5.9.6.7 License An open source core is published under AGPL-3.0. The source code is available on GitHub.

5.9.6.8 Evaluation

5.9.6.8.1 Context of use
Reuse: +
For building data pipelines with the purpose of doing predictive analytics.

5.9.6.8.2 Target groups
Private Sector Employee | Public Sector Employee | Student with coding skills | Student without coding skills
+ + +
Aimed at people who are using data for predictive analytics.

5.9.6.8.3 Functionalities
Data Quality Assessment: +
Data Cleansing: +
Data Blending/Enrichment: +
Data format conversion: +
Descriptive analytics: +
Visualisation: +
Dashboarding: -
Prescriptive analytics: +
Storytelling / Notebooks: -
Covers the whole data process, from cleansing over enrichment to descriptive analytics, including visualisation and prescriptive analytics.

Support:
Big Data: +
Multidimensional data (cubes): -
Linked Data: ±
Big Data is supported, and using a plugin one can access SPARQL endpoints and RDF.

5.9.6.8.4 Assessment
Ease of installation: +
Ease of use: ±
User centric versus technology/science centric: mix
Very elaborate toolset, but it comes with a learning curve.

5.9.6.9 Contact Website: https://rapidminer.com/products/studio/ Download: https://rapidminer.com/signup/

5.10 Linked Data tooling

5.10.1 TopBraid Composer Free

5.10.1.1 What is Topbraid Composer is an IDE for working with RDF triples and linked data.

5.10.1.2 How does it look like? Figure 25 shows the Topbraid Composer environment.


Figure 25. TopBraid Composer screendump

5.10.1.3 Developer/maintainer TopQuadrant31.

5.10.1.4 Used by KOOP, P&G, Mayo Clinic, Lockheed Martin, Thomson Reuters, AstraZeneca, UCB, Pearson, Lilly, Nasa, JPMorganChase etc.

5.10.1.5 Functionalities
TBC allows you to import RDF files, integrate them, edit triples, validate the triples against constraints, infer new triples based on ontologies and/or rules, and query the triples full-text and via SPARQL. The triples can be exported again into several serialisations. The standard edition, which is available for evaluation for 30 days, adds a lot, e.g. graphical representations of the resources and the model, many ways to convert legacy data such as TSVs and relational databases into RDF, and connections to the leading triple stores.
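To give an impression of what querying triples with SPARQL looks like from code, independent of TopBraid Composer itself, the following minimal Python sketch uses the rdflib library; the file name and query are illustrative only:

    from rdflib import Graph

    g = Graph()
    g.parse("example.ttl", format="turtle")   # hypothetical local Turtle file with triples

    query = """
        SELECT ?s ?label
        WHERE { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label . }
        LIMIT 10
    """

    for s, label in g.query(query):
        print(s, label)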

31 http://www.topquadrant.com


5.10.1.6 Technical constraints TBC runs in the Eclipse 4.3 platform and requires Java 8 (Oracle JRE/JDK). TBC is available for Windows, Mac and Linux.

5.10.1.7 License Closed source, commercial.

5.10.1.8 Evaluation
Although not really aimed at business users, since it is an extension of a programming IDE (Eclipse), it is still the best interface for working with plain triples.

5.10.1.9 Contact Website: http://www.topquadrant.com/tools/modeling-topbraid-composer-standard-edition/ Download: http://www.topquadrant.com/downloads/topbraid-composer-install/

5.10.2 fluidOps Information Workbench

5.10.2.1 What is The Information Workbench is a platform for Linked Data application development.

5.10.2.2 How does it look like? Figure 26 shows the IWB platform.


Figure 26. IWB screenshot

5.10.2.3 Developer/maintainer fluidOps32.

5.10.2.4 Used by Many cloud infrastructure and data center users.

5.10.2.5 Functionalities
IWB comes with facilities to convert legacy data in formats such as CSV, TSV, relational databases, XML and JSON into RDF. IWB allows you to integrate several data sources and to search and query them. For every resource one can define a wiki page. In such a wiki page several widgets can be plugged in for visualisation, social media integration, LOD integration, editing etc.

5.10.2.6 Technical constraints
IWB can be installed on Windows, Mac and Linux and then runs as a web service.

32 www.fluidops.com/en


5.10.2.7 License Free licenses are available for educational use.

5.10.2.8 Evaluation
For people familiar with wiki syntax and with knowledge of SPARQL, it is a very powerful and easy environment for building semantic applications.

5.10.2.9 Contact Website: https://www.fluidops.com/en/portfolio/information_workbench/ Download: http://appcenter.fluidops.com/resource/Download


6 Target group – tool mapping
In this section we map the described tools to the different target groups (personas) of our curriculum.

6.1 Private sector employee
Table 11. Private sector employee mapping
Obtain
  Portal navigation and search: CKAN
  Download/via API: CKAN
Scrub
  Cleanse (remove, change data types, handle missing data): Trifacta Wrangler
  Transform: Trifacta Wrangler
  Enhance, Blend: Trifacta Wrangler
Explore
  Visualize (table, graphs, plots): Voyager, Tableau Public
  Derive stats: Trifacta Wrangler
Model
  Explain, Predict: BigML
Interpret and communicate
  Story telling: Tableau Public

A private sector employee needs to be able to find and download interesting datasets. Datasets are to be found at open data portals, of which CKAN is the most broadly used. For data wrangling we propose Trifacta Wrangler, which is making big inroads in the private sector. It offers a business user oriented interface combined with machine learning and the capability to handle big data. We consider this a better choice for this audience than OpenRefine. For getting to know your data, we propose Voyager, which automatically generates the appropriate visualisations. If one wants to build one's own visualisation, we propose Tableau Public, the hugely popular business BI tool, which can also be used for dashboard building and storytelling.


According to us, the only predictive analytics tool easily usable by non-data scientists is BigML, hence its inclusion.

6.2 Student with no coding skills
Table 12. Student (no coding) mapping
Obtain
  Portal navigation and search: CKAN
  Download/via API: CKAN
Scrub
  Cleanse (remove, change data types, handle missing data): OpenRefine
  Transform: OpenRefine
  Enhance, Blend: OpenRefine
Explore
  Visualize (table, graphs, plots): Voyager, Google Fusion Tables
  Derive stats
Model
  Explain, Predict: BigML
Interpret and communicate
  Story telling

A student needs to be able to find and download interesting datasets. Datasets are to be found at open data portals, of which CKAN is the most broadly used. For data wrangling we propose OpenRefine. OpenRefine has been the tool of choice in the Open Data world, with lots of tutorials and supporting material to be found online. For getting to know your data, we propose Voyager, which automatically generates the appropriate visualisations. If one wants to build one's own visualisation, we propose Google Fusion Tables. According to us, the only predictive analytics tool easily usable by non-data scientists is BigML, hence its inclusion also for students. BigML offers specific educational programs.


6.3 Student with coding skills
Table 13. Student coding mapping
Obtain
  Portal navigation and search: CKAN
  Download/via API: CKAN
Scrub
  Cleanse (remove, change data types, handle missing data): OpenRefine, Data Science Studio
  Transform: OpenRefine, Data Science Studio
  Enhance, Blend: OpenRefine, Data Science Studio
Explore
  Visualize (table, graphs, plots): Voyager, Data Science Studio, Plotly, Vega
  Derive stats: Data Science Studio
Model
  Explain, Predict: Data Science Studio
Interpret and communicate
  Story telling

Also for this group, CKAN is the system for finding and downloading data. For everything else, Dataiku's Data Science Studio can be used, since it is an all-embracing and comprehensive environment for all data science related tasks. In addition, it is a polyglot programming environment with support for Python, R, SQL, Scala, Hive, Impala, … For students focused on programming in R, Exploratory is a viable alternative. Suggested for building interactive data visualisations are Plotly and Vega.

6.4 Public Sector employee
Table 14. Public sector employee as publisher mapping
Collect
Prepare
  QA and cleanse: Dataproofer, CSV Lint, OpenRefine
  Transform to Format: OpenRefine
Assign Metadata (including open license)
  Dataset level: CKAN
  Data level
Publish
  Bulk
  API: CKAN
Make Discoverable
  Portal: CKAN
Maintain: CKAN

Public sector employees with the primary concern of publishing open data should become familiar with the tools for assessing and improving the quality of the data to be published: Dataproofer, CSV Lint and OpenRefine. They also need to become familiar with the functionalities of Open Data portal software: CKAN et al.

Table 15. Public sector employee as reuser mapping
Obtain
  Portal navigation and search: CKAN
  Download/via API: CKAN
Scrub
  Cleanse (remove, change data types, handle missing data): OpenRefine
  Transform: OpenRefine
  Enhance, Blend: OpenRefine
Explore
  Visualize (table, graphs, plots): Voyager, Google Fusion Tables
  Derive stats
Model
  Explain, Predict: BigML
Interpret and communicate
  Story telling

For public sector employees re-using data, the recommendations are similar to those for non-coding students. If they want to visualize data on their Open Data portal, we encourage them to have a look at Plotly and Vega.


7 Conclusion

In this study we enumerated the tools which can be used to publish and to reuse open data. For each of the tools we described where it fits in the Open Data Lifecycle, its target groups, its functionalities, contact information, a general assessment and some notes based on our own experience. At the end we made recommendations on which tools to use in every phase of the Open Data Lifecycle for each of our target groups: private sector employees, students with no coding skills, students with coding skills, and public sector employees as publishers and re-users.

For the publishing phase we consider it essential to acquire data cleaning skills using OpenRefine and a thorough knowledge of CKAN, the most broadly used open data portal software.

For reusing data the recommendations differ depending on the profile and coding skills of the user. Recommendations for all groups are OpenRefine for data wrangling, Voyager for data understanding, Google Fusion Tables for data visualisation and BigML for prescriptive analytics. Sometimes better alternatives are available, e.g. Trifacta Wrangler for data wrangling and Tableau Public for visualisation in the private sector, and Dataiku's Data Science Studio (all-encompassing) and Plotly or Vega for visualisations when people have coding skills.

The most important finding is that most Open Data Lifecycle phases can be covered easily, with even multiple valid choices, so that curriculum building can leverage this.
