Big Data Technical Working Groups
White Paper
BIG 318062
Project Acronym: BIG
Project Title: Big Data Public Private Forum (BIG)
Project Number: 318062
Instrument: CSA
Thematic Priority: ICT-2011.4.4

D2.2.2 Final Version of Technical White Paper

Work Package: WP2 Strategy & Operations
Due Date: 28/02/2014
Submission Date: 14/05/2014
Start Date of Project: 01/09/2012
Duration of Project: 26 Months
Organisation Responsible of Deliverable: NUIG
Version: 1.0
Status: Final
Author name(s): Edward Curry (NUIG), Panayotis Kikiras (AGT), Andre Freitas (NUIG), John Domingue (STIR), Andreas Thalhammer (UIBK), Nelia Lasierra (UIBK), Anna Fensel (UIBK), Marcus Nitzschke (INFAI), Axel Ngonga (INFAI), Michael Martin (INFAI), Ivan Ermilov (INFAI), Mohamed Morsey (INFAI), Klaus Lyko (INFAI), Philipp Frischmuth (INFAI), Martin Strohbach (AGT), Sarven Capadisli (INFAI), Herman Ravkin (AGT), Sebastian Hellmann (INFAI), Mario Lischka (AGT), Tilman Becker (DFKI), Jörg Daubert (AGT), Tim van Kasteren (AGT), Amrapali Zaveri (INFAI), Umair Ul Hassan (NUIG)
Reviewer(s): Amar Djalil Mezaour (EXALEAD), Helen Lippell (PA), Marcus Nitzschke (INFAI), Axel Ngonga (INFAI), Michael Hausenblas (NUIG), Klaus Lyko (INFAI), Tim Van Kasteren (AGT)
Nature: R – Report; P – Prototype; D – Demonstrator; O – Other
Dissemination level: PU – Public; CO – Confidential, only for members of the consortium (including the Commission); RE – Restricted to a group specified by the consortium (including the Commission Services)
Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)
Revision history
Version | Date | Modified by | Comments
0.1 | 25/04/2013 | Andre Freitas, Aftab Iqbal, Umair Ul Hassan, Nur Aini (NUIG) | Finalized the first version of the whitepaper
0.2 | 27/04/2013 | Edward Curry (NUIG) | Review and content modification
0.3 | 27/04/2013 | Helen Lippell (PA) | Review and corrections
0.4 | 27/04/2013 | Andre Freitas, Aftab Iqbal (NUIG) | Fixed corrections
0.5 | 20/12/2013 | Andre Freitas (NUIG) | Major content improvement
0.6 | 20/02/2014 | Andre Freitas (NUIG) | Major content improvement
0.7 | 15/03/2014 | Umair Ul Hassan | Content contribution (human computation, case studies)
0.8 | 10/03/2014 | Helen Lippell (PA) | Review and corrections
0.91 | 20/03/2014 | Edward Curry (NUIG) | Review and content modification
0.92 | 06/05/2014 | Andre Freitas, Edward Curry (NUIG) | Added Data Usage and minor corrections
0.93 | 11/05/2014 | Axel Ngonga, Klaus Lyko, Marcus Nitzschke (INFAI) | Final review
1.0 | 13/05/2014 | Edward Curry (NUIG) | Corrections from final review
Copyright © 2012, BIG Consortium
The BIG Consortium (http://www.big-project.eu/) grants third parties the right to use and distribute all or parts of this document, provided that the BIG project and the document are properly referenced.
THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Table of Contents
1. Executive Summary
  1.1. Understanding Big Data
  1.2. The Big Data Value Chain
  1.3. The BIG Project
  1.4. Key Technical Insights
2. Data Acquisition
  2.1. Executive Summary
  2.2. Big Data Acquisition Key Insights
  2.3. Social and Economic Impact
  2.4. State of the Art
    2.4.1 Protocols
    2.4.2 Software Tools
  2.5. Future Requirements & Emerging Trends for Big Data Acquisition
    2.5.1 Future Requirements/Challenges
    2.5.2 Emerging Paradigms
  2.6. Sector Case Studies for Big Data Acquisition
    2.6.1 Health Sector
    2.6.2 Manufacturing, Retail, Transport
    2.6.3 Government, Public, Non-profit
    2.6.4 Telco, Media, Entertainment
    2.6.5 Finance and Insurance
  2.7. Conclusion
  2.8. References
  2.9. Useful Links
  2.10. Appendix
3. Data Analysis
  3.1. Executive Summary
  3.2. Introduction
  3.3. Big Data Analysis Key Insights
    3.3.1 General
    3.3.2 New Promising Areas for Research
    3.3.3 Features to Increase Take-up
    3.3.4 Communities and Big Data
    3.3.5 New Business Opportunities
  3.4. Social & Economic Impact
  3.5. State of the art
    3.5.1 Large-scale: Reasoning, Benchmarking and Machine Learning
    3.5.2 Stream data processing
    3.5.3 Use of Linked Data and Semantic Approaches to Big Data Analysis
  3.6. Future Requirements & Emerging Trends for Big Data Analysis
    3.6.1 Future Requirements
    3.6.2 Emerging Paradigms
  3.7. Sectors Case Studies for Big Data Analysis
    3.7.1 Public sector
    3.7.2 Traffic
    3.7.3 Emergency response
    3.7.4 Health
    3.7.5 Retail
    3.7.6 Logistics
    3.7.7 Finance
  3.8. Conclusions
  3.9. Acknowledgements
  3.10. References
4. Data Curation
  4.1. Executive Summary
  4.2. Big Data Curation Key Insights
  4.3. Introduction
    4.3.1 Emerging Requirements for Big Data: Variety & Reuse
    4.3.2 Emerging Trends: Scaling-up Data Curation
  4.4. Social & Economic Impact
  4.5. Core Concepts & State-of-the-Art
    4.5.1 Introduction
    4.5.2 Lifecycle Model
    4.5.3 Data Selection Criteria
    4.5.4 Data Quality Dimensions
    4.5.5 Data Curation Roles
    4.5.6 Current Approaches for Data Curation
  4.6. Future Requirements and Emerging Trends for Big Data Curation
    4.6.1 Introduction
    4.6.2 Future Requirements
  4.7. Emerging Paradigms
    4.7.1 Incentives & Social Engagement Mechanisms
    4.7.2 Economic Models
    4.7.3 Curation at Scale
    4.7.4 Human-Data Interaction
    4.7.5 Trust
    4.7.6 Standardization & Interoperability
    4.7.7 Data Curation Models
    4.7.8 Unstructured & Structured Data Integration
  4.8. Sectors Case Studies for Big Data Curation
    4.8.1 Health and Life Sciences
    4.8.2 Telco, Media, Entertainment
    4.8.3 Retail
  4.9. Conclusions
  4.10. Acknowledgements
  4.11. References
  4.12. Appendix 1: Use Case Analysis
5. Data Storage
  5.1. Executive Summary
  5.2. Data Storage Key Insights
  5.3. Social & Economic Impact
  5.4. State of the Art
    5.4.1 Hardware and Data Growth Trends
    5.4.2 Data Storage Technologies
    5.4.3 Security and Privacy
  5.5. Future Requirements & Emerging Trends for Big Data Storage
    5.5.1 Future Requirements
    5.5.2 Emerging Paradigms
  5.6. Sectors Case Studies for Big Data Storage
    5.6.1 Health Sector: Social Media Based Medication Intelligence
    5.6.2 Public Sector
    5.6.3 Finance Sector: Centralized Data Hub
    5.6.4 Media & Entertainment: Scalable Recommendation Architecture
    5.6.5 Energy: Smart Grid and Smart Meters
    5.6.6 Summary
  5.7. Conclusions
    5.7.1 References
  5.8. Overview of NoSQL Databases
6. Data Usage
  6.1. Executive Summary
  6.2. Data Usage Key Insights
  6.3. Introduction
    6.3.1 Overview
  6.4. Social & Economic Impact
  6.5. Data Usage State of the Art
    6.5.1 Big Data Usage Technology Stacks
    6.5.2 Decision Support
    6.5.3 Predictive Analysis
    6.5.4 Exploration
    6.5.5 Iterative Analysis
    6.5.6 Visualisation
  6.6. Future Requirements & Emerging Trends for Big Data Usage
    6.6.1 Future Requirements
    6.6.2 Emerging Paradigms
  6.7. Sectors Case Studies for Big Data Usage
    6.7.1 Health Care: Clinical Decision Support
    6.7.2 Public Sector: Monitoring and Supervision of On-line Gambling Operators
    6.7.3 Telco, Media & Entertainment: Dynamic Bandwidth Increase
    6.7.4 Manufacturing: Predictive Analysis
  6.8. Conclusions
  6.9. References
Index of Figures
Figure 1-1 The Data Value Chain
Figure 1-2 The BIG Project Structure
Figure 2-1: Data acquisition and the Big Data value chain.
Figure 2-2: Oracle’s Big Data Processing Pipeline.
Figure 2-3: The velocity architecture (Vivisimo, 2012)
Figure 2-4: IBM Big Data Architecture
Figure 2-5: AMQP message structure (Schneider, 2013)
Figure 2-6: Java Message Service
Figure 2-7: Memcached functionality (source: http://memcached.org/about)
Figure 2-8: Big Data workflow.
Figure 2-9: Architecture of the Storm framework.
Figure 2-10: A Topology in Storm. The dots in node represent the concurrent tasks of the spout/bolt.
Figure 2-11: Architecture of a S4 processing node.
Figure 2-12: Kafka deployment at LinkedIn
Figure 2-13: Schematic showing logical components in a flow. The arrows represent the direction in which events travel across the system.
Figure 2-14: Architecture of a Hadoop multi node cluster
Figure 3-1. The Big Data Value Chain.
Figure 4-1 Big Data value chain.
Figure 4-2 The long tail of data curation and the scalability of data curation activities.
Figure 4-3: The data curation lifecycle based on the DCC Curation Lifecycle Model and on the SURF foundation Curation Lifecycle Model.
Figure 4-4 RSC profile of a curator with awards attributed based on his/her contributions.
Figure 4-5 An example solution to a protein folding problem with Fold.it
Figure 4-6: PA Content and Metadata Pattern Workflow.
Figure 4-7: A typical data curation process at Thomson Reuters.
Figure 4-8: The NYT article classification curation workflow.
Figure 4-9 Taxonomy of products used by eBay to categorize items with help of crowdsourcing.
Figure 5-1: Database landscape (Source: 451 Group).
Figure 5-2: Technical challenges along the data value chain
Figure 5-3: Introduction of renewable energy at consumers changes the topology and requires the introduction of new measurement points at the leaves of the grid
Figure 5-4: Data growth between 2009 and 2020
Figure 5-5: Useful Big Data sources
Figure 5-6: The Untapped Big Data Gap (2012)
Figure 5-7: The emerging gap
Figure 5-8: Cost of storage
Figure 5-9: Data complexity and data size scalability of NoSQL databases
Figure 5-10: Source: CSA Top 10 Security & Privacy Challenges. Related challenges highlighted
Figure 5-11: Data encryption in “The Intel Distribution for Apache Hadoop”. Source: http://hadoop.intel.com/pdfs/IntelEncryptionforHadoopSolutionBrief.pdf
Figure 5-12: General purpose RDBMS processing profile
Figure 5-13: Paradigm Shift from pure data storage systems to integrated analytical databases
Figure 5-14: Treato Search Results for "Singulair", an asthma medication (Source: http://trato.com)
Figure 5-15: Drilldown showing number of Chicago crime incidents for each hour of the day
Figure 5-16: Datameer end-to-end functionality. Source: Cloudera
Figure 5-17: Netflix Personalization and Recommendation Architecture. The architecture distinguishes three “layers” addressing a trade-off between computational and real-time requirements (Source: Netflix Tech Blog)
Figure 6-1: Technical challenges along the data value chain
Figure 6-2: Big Data Technology Stacks for Data Access (source: TU Berlin, FG DIMA 2013)
Figure 6-3: The YouTube Data Warehouse (YTDW) infrastructure. Source: (Chattopadhyay, 2011)
Figure 6-4: Visual Analytics in action. A rich set of linked visualisations provided by ADVISOR include barcharts, treemaps, dashboards, linked tables and time tables. (Source: Keim et al., 2010, p.29)
Figure 6-5: Prediction of UK Big Data job market demand. Actual/forecast demand (vacancies per annum) for big data staff 2007–2017. (Source: e-skills UK/Experian)
Figure 6-6: UK demand for big data staff by job title status 2007–2012. (Source: e-skills UK analysis of data provided by IT Jobs Watch)
Figure 6-7: Dimensions of Integration in Industry 4.0 from: GE, Industrial Internet, 2012
Figure 6-8: Big Data in the context of an extended service infrastructure. W. Wahlster, 2013.
Index of Tables
Table 2-1 The potential data volume growth in a year.
Table 2-2 Sector/Feature matrix.
Table 4-1 Future requirements for data curation.
Table 4-2 Emerging approaches for addressing the future requirements.
Table 4-3 Data features associated with the curated data
Table 4-4: Critical data quality dimensions for existing data curation projects
Table 4-5: Existing data curation roles and their coverage on existing projects.
Table 4-6: Technological infrastructure dimensions
Table 4-7: Summary of sector case studies
Table 5-1: Calculation of the amount of data sampled by Smart Meters
Table 5-2: Summary of sector case studies
Table 5-3: Overview of popular NoSQL databases
Table 6-1: Comparison of Data Usage Technologies used in YTDW. Source: (Chattopadhyay, 2011)
Abbreviations and Acronyms
ABE: Attribute based encryption
ACID: Atomicity, Consistency, Isolation, Durability
BDaaS: Big Data as a service
BIG: The BIG project
BPaaS: Business processes as a service
CbD: Curating by Demonstration
CMS: Content Management System
CPS: Cyber-physical system
CPU: Central processing unit
CRUD: Create, Read, Update, Delete
DB: Database
DBMS: Database management system
DCC: Digital Curation Centre
DRAM: Dynamic random access memory
ETL: Extract-Transform-Load
HDD: Hard disk drive
HDFS: Hadoop Distributed File System
IaaS: Infrastructure as a service
KaaS: Knowledge as a service
ML: Machine Learning
MDM: Master Data Management
MIRIAM: Minimum Information Required In The Annotation of Models
MR: MapReduce
MRAM: Magneto-resistive RAM
NGO: Non Governmental Organization
NLP: Natural Language Processing
NAND: Type of flash memory named after NAND logic gate
OLAP: Online analytical processing
OLTP: Online transaction processing
NRC: National research council
PaaS: Platform as a service
PB: Petabyte
PbD: Programming by Demonstration
PCRAM: Phase change RAM
PPP: Public Private Partnerships
RAM: Random access memory
RDBMS: Relational database management system
RDF: Resource Description Framework
RPC: Remote procedure call
SaaS: Software as a service
SSD: Solid state drive
STTRAM: Spin-transfer torque RAM
SPARQL: Recursive acronym for SPARQL Protocol and RDF Query Language
SQL: Structured query language
SSL: Secure sockets layer
TB: Terabyte
TLS: Transport layer security
UnQL: Unstructured query language
W3C: World Wide Web Consortium
WG: Work group
XML: Extensible Markup Language
1. Executive Summary
The BIG Project (http://www.big-project.eu/) is an EU coordination and support action to provide a roadmap for Big Data within Europe. This whitepaper details the results from the Data Value Chain Technical Working Groups, describing the state of the art in each part of the chain together with emerging technological trends for exploiting Big Data.
1.1. Understanding Big Data
Big Data is an emerging field where innovative technology offers alternatives to resolve the inherent problems that appear when working with huge amounts of data, providing new ways to reuse and extract value from information. The ability to effectively manage information and extract knowledge is now seen as a key competitive advantage, and many companies are building their core business on their ability to collect and analyse their information to extract business knowledge and insight. As a result, Big Data technology adoption within industrial domains is not a luxury but an imperative for most organizations seeking competitive advantage. The main dimensions of Big Data are typically characterized by the 3 Vs: volume (amount of data), velocity (speed of data), and variety (range of data types and sources). The Vs of Big Data challenge the fundamentals of our understanding of existing technical approaches and require new forms of data processing that enable enhanced decision making, insight discovery and process optimization.
Volume: places scalability at the centre of all processing. Large-scale reasoning, semantic processing, data mining, machine learning and information extraction are required.
Velocity: this challenge has resulted in the emergence of the areas of stream data processing, stream reasoning and stream data mining to cope with high volumes of incoming raw data.
Variety: this may take the form of differing syntactic formats (e.g. spreadsheet vs. CSV), differing data schemas, or differing meanings attached to the same syntactic forms.
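As a minimal illustration of the variety dimension, the following Python sketch reconciles two hypothetical sources that differ in format (CSV text vs. parsed records), in schema (field names), and in meaning (a field called "temp" holds Fahrenheit in one source and Celsius in the other). The source contents, field names and unit conventions are assumptions made up for this example and do not come from the BIG project.

```python
import csv
import io

# Hypothetical source A: CSV text, "temp" is in degrees Fahrenheit.
SOURCE_A = "sensor,temp\ns1,68.0\n"
# Hypothetical source B: already-parsed records, "temp" is in degrees Celsius.
SOURCE_B = [{"sensor_id": "s2", "temp": 21.5}]


def from_source_a(text):
    """Parse CSV rows, rename fields and convert Fahrenheit to Celsius."""
    return [
        {"sensor_id": row["sensor"],
         "temp_c": round((float(row["temp"]) - 32) * 5 / 9, 1)}
        for row in csv.DictReader(io.StringIO(text))
    ]


def from_source_b(records):
    """Rename fields only; the values are already in Celsius."""
    return [{"sensor_id": r["sensor_id"], "temp_c": r["temp"]} for r in records]


# Both sources now share one schema and one unit.
unified = from_source_a(SOURCE_A) + from_source_b(SOURCE_B)
```

Once every source is mapped onto a single schema with explicit units, downstream analysis can treat the records uniformly.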
1.2. The Big Data Value Chain
Value chains have been used as a decision support tool to model the chain of activities that an organisation performs in order to deliver a valuable product or service to the market. The value chain categorizes the generic value-adding activities of an organization, allowing them to be understood and optimised. A value chain is made up of a series of subsystems, each with inputs, a transformation process, and outputs. As an analytical tool, the value chain can be applied to information systems to understand the value creation of data technologies.
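The idea of a chain of subsystems, each with an input, a transformation and an output, can be sketched in a few lines of Python. This is only an illustrative model under assumed names: the `Subsystem` class, the `run_chain` helper and the three example steps are hypothetical and are not part of the BIG project's tooling.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable, List


@dataclass
class Subsystem:
    """One link in a value chain: takes an input, transforms it, yields an output."""
    name: str
    transform: Callable[[Any], Any]


def run_chain(chain: List[Subsystem], raw_input: Any) -> Any:
    """Feed the output of each subsystem into the next one, in order."""
    return reduce(lambda data, step: step.transform(data), chain, raw_input)


# Hypothetical three-step chain over a list of sensor readings.
chain = [
    Subsystem("acquire", lambda rows: [r for r in rows if r is not None]),
    Subsystem("analyse", lambda rows: sum(rows) / len(rows)),
    Subsystem("use", lambda mean: f"mean reading: {mean:.1f}"),
]

result = run_chain(chain, [20.0, None, 22.0, 24.0])
```

Modelling each activity as a separate step makes it possible to inspect, replace or optimise one subsystem without touching the others, which is the analytical benefit the value chain is meant to provide.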
Figure 1-1 The Data Value Chain
The Data Value Chain, as illustrated in Figure 1-1, models the high-level activities that comprise an information system. The data value chain identifies the following activities:
Data Acquisition is the process of gathering, filtering and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.