Eindhoven University of Technology Department of Mathematics and Computer Science

Master’s thesis

Towards a Big Data Reference Architecture

13th October 2013

Author: Markus Maier [email protected]

Supervisor: dr. G.H.L. Fletcher g.h.l.fl[email protected]

Assessment committee: dr. G.H.L. Fletcher, dr. A. Serebrenik, dr.ir. I.T.P. Vanderfeesten

Abstract

Technologies and promises connected to ‘big data’ have received a lot of attention lately. Leveraging emerging ‘big data’ sources extends the requirements of traditional data management due to the large volume, velocity, variety and veracity of this data. At the same time, it promises to extract value from previously largely unused sources and to use insights from this data to gain a competitive advantage. To gain this value, organizations need to consider new architectures for their data management systems and new technologies to implement these architectures. In this master’s thesis I identify the additional requirements that result from these new characteristics of data, design a reference architecture combining several data management components to tackle these requirements and finally discuss current technologies which can be used to implement the reference architecture. The design of the reference architecture takes an evolutionary approach, building on traditional enterprise data warehouse architecture and integrating additional components aimed at handling the new requirements. Implementing these components involves technologies like the Hadoop ecosystem and so-called ‘NoSQL’ databases. A verification of the reference architecture finally demonstrates its correctness and relevance to practice. The proposed reference architecture and a survey of the current state of the art in ‘big data’ technologies guide designers in the creation of systems that create new value from existing, but previously under-used, data. They provide decision makers with entirely new insights from data to base decisions on. These insights can lead to enhancements in companies’ productivity and competitiveness, support innovation and even create entirely new business models.

Preface

This thesis is the result of the final project for my master’s program in Business Information Systems at Eindhoven University of Technology. The project was conducted over a period of seven months within the Web Engineering (formerly Databases and Hypermedia) group in the Mathematics and Computer Science department. I want to use this opportunity to mention and thank a couple of people. First, I want to express my greatest gratitude to my supervisor George Fletcher for all his advice and feedback, and for his engagement and flexibility. Second, I want to thank the members of my assessment committee, Irene Vanderfeesten and Alexander Serebrenik, for reviewing my thesis, attending my final presentation and giving me critical feedback. Finally, I want to thank all the people, family and friends, for their support during my whole studies and especially during my final project. You helped me through some stressful and rough times and I am very thankful to all of you.

Markus Maier, Eindhoven, 13th October 2013

1 Introduction

1.1 Motivation

Big Data has become one of the buzzwords in IT during the last couple of years. Initially it was shaped by organizations which had to handle fast growth rates of data like web data, data resulting from scientific or business simulations, or other data sources. Some of those companies’ business models are fundamentally based on indexing and using this large amount of data. The pressure to handle the growing amount of data on the web, for example, led Google to develop the Google File System [119] and MapReduce [94]. Efforts were made to rebuild those technologies as open source software. This resulted in Apache Hadoop and the Hadoop File System [12, 226] and laid the foundation for technologies summarized today as ‘big data’. With this groundwork in place, traditional information management companies stepped in and invested to extend their software portfolios and build new solutions especially aimed at Big Data analysis. Among those companies were IBM [27, 28], Oracle [32], HP [26], Microsoft [31], SAS [35] and SAP [33, 34]. At the same time start-ups like Cloudera [23] entered the scene. Some of the ‘big data’ solutions are based on Hadoop distributions, others are self-developed, and companies’ ‘big data’ portfolios are often blended with existing technologies. This is e.g. the case when big data gets integrated with existing data management solutions, but also for complex event processing solutions, which form the basis (but were further developed) for handling stream processing of big data 1. The effort taken by software companies to become part of the big data story is not surprising considering the trends analysts predict and the praises they sing about ‘big data’ and its impact on business and even society as a whole. IDC predicts in its ‘The Digital Universe’ study that the digital data created and consumed per year will grow to 40,000 exabytes by 2020, of which a third 2 promises value to organizations if processed using big data technologies [115]. IDC also states that in 2012 only 0.5% of potentially valuable data was analyzed, calling this the ‘Big Data Gap’. While the McKinsey Global Institute also predicts that the globally generated data is growing by around 40% per year, they furthermore describe big data trends in terms of monetary figures. They project the yearly value of big data analytics for the US health care sector to be around $300 billion. They also predict a possible value of around €250 billion for the European public sector and a potential improvement of margins in the retail industry by 60% [163].

1 e.g. IBM InfoSphere Streams [29]
2 around 13,000 exabytes


With these kinds of promises the topic got picked up by business and management journals to emphasize and describe the impact of big data on management practices. One of the terms coined in that context is ‘data-guided management’ [157]. In MIT Sloan Management Review Thomas H. Davenport discusses how organisations applying and mastering big data differ from organisations with a more traditional approach to data analysis and what they can gain from it [92]. Harvard Business Review published an article series about big data [58, 91, 166] in which they call the topic a ‘management revolution’ and describe how ‘big data’ can change management, how an organisational culture needs to change to embrace big data and what other steps and measures are necessary to make it all work.

But the discussion did not stop with business and monetary gains. There are also several publications stressing the potential of big data to revolutionize science and even society as a whole. A community whitepaper written by several US data management researchers states that a ‘major investment in Big Data, properly directed, can result not only in major scientific advances but also lay the foundation for the next generation of advances in science, medicine, and business’ [45]. Alex Pentland, who is director of MIT’s Human Dynamics Laboratory and considered one of the pioneers of incorporating big data into the social sciences, claims that big data can be a major instrument to ‘reinvent society’ and to improve it in the process [177]. While other researchers often talk about relationships in social networks when talking about big data, Alex Pentland focusses on location data from mobile phones, payment data from credit cards and so on. He describes this data as data about people’s actual behaviour and not so much about their choices for communication. From his point of view, ‘big data is increasingly about real behavior’ [177] and connections between individuals. In essence he argues that this allows the analysis of systems (social, financial etc.) on a more fine-granular level of micro-transactions between individuals and ‘micro-patterns’ within these transactions. He further argues that this will allow a far more detailed understanding and a far better design of new systems. This transformative potential to change the architecture of societies has also been recognized by mainstream media and brought into public discussion. The New York Times, for example, declared ‘The Age of Big Data’ [157]. Books have also been published that describe to a public audience how big data transforms the way ‘we live, work and think’ [165] and that present essays and examples of how big data can influence mankind [201].

However, the impact of ‘big data’ and where it is going is not without controversy. Chris Anderson, back then editor in chief of Wired magazine, started a discourse when he announced ‘the end of theory’ and the obsolescence of the scientific method due to big data [49]. In his essay he claimed that with massive data the scientific method - observe, develop a model and formulate hypotheses, test the hypotheses by conducting experiments and collecting data, analyse and interpret the data - would be obsolete. He argues that all models or theories are erroneous and that the use of enough data allows one to skip the modelling step and instead leverage statistical methods to find patterns without creating hypotheses first. In that sense he values correlation over causation. This becomes apparent in the following quote:

Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. [49]

Chris Anderson is not alone with his statement. While they do not consider it the ‘end of theory’ in general, Viktor Mayer-Schönberger and Kenneth Cukier also emphasize the importance of correlation and favour it over causation [165, pp. 50-72]. Still, this is a rather extreme position and it is questioned by several other authors. Boyd and Crawford, while not denying its possible value, published an article challenging an overly positive and simplified view of ‘big data’ [73]. One point they raise is that there are always connections and patterns in huge data sets, but not all of them are valid; some are just coincidental or biased. Therefore it is necessary to place data analysis

within a methodological framework and to question the framework’s assumptions and the possible biases in the data sets in order to identify the patterns that are valid and reasonable. Nassim N. Taleb agrees with them. He claims that an increase in data volume also leads to an increase in noise and that big data essentially means ‘more false information’ [218]. He argues that with enough data there are always correlations to be found, but a lot of them are spurious 3 (a small simulation below illustrates this effect). With this claim Boyd and Crawford, as well as Taleb, directly counter Anderson’s postulation of focussing on correlation instead of causation. Put differently, those authors claim that data and numbers do not speak for themselves, but that creating knowledge from data always includes critical reflection, and critical reflection also means putting insights and conclusions into some broader context - placing them within some theory. This also means that analysing data is always subjective, no matter how much data is available. It is a process of individual choices and interpretation. This process starts with creating the data4 and with deciding what to measure and how to measure it. It goes on with making observations within the data, finding patterns, creating a model and understanding what this model actually means [73]. It further goes on with drawing hypotheses from the model and testing them to finally prove the model or at least give strong indication for its validity. The potential to crunch massive data sets can support several stages of this process, but it will not render it obsolete. To draw valid conclusions from data it is also necessary to identify and account for flaws and biases in the underlying data sets and to determine which questions can be answered and which conclusions can validly be drawn from certain data. This is as true for large sets of data as it is for smaller samples. For one, having a massive set of data does not mean that it is a full set of the entire population or that it is statistically random and representative [73]. Different social media sites are an often-used data source for researching social networks and social behaviour. However, they are not representative of the entire human population. They might be biased towards certain countries, a certain age group or generally more tech-savvy people. Furthermore, researchers might not even have access to the entire population of a social network [162]. Twitter’s standard APIs, for example, do not retrieve all tweets but only a selection; they obviously only retrieve public tweets and the Search API only searches through recent tweets [73]. As another contribution to this discussion several researchers published short essays and comments as a direct response to Chris Anderson’s article [109]. Many of them argue along the lines of the arguments presented above and conclude that big data analysis will be an additional and valuable instrument to conduct science, but it will not replace the scientific method and render theories useless. While all these discussions talk about ‘big data’, this term can be very misleading as it puts the focus only onto data volume. Data volume, however, is not a new problem. Wal-Mart’s corporate data warehouse had a size of around 300 terabytes in 2003 and 480 terabytes in 2004. Data warehouses of that size were considered really big at that time and techniques existed to handle it 5.
The problem of handling large data is therefore not new in itself, and what ‘large’ means is actually a moving target that scales as the performance of modern hardware improves. To tackle the ‘Big Data Gap’, handling volume is not enough, though. What is new is what kind of data is analysed. While traditional data warehousing is very much focussed on analysing structured data modelled within the relational schema, ‘big data’ is also about recognizing value in unstructured sources6. These sources are still largely untapped. Furthermore, data gets created faster and faster and it is often necessary to process the data in almost real-time to maintain agility and competitive advantage.

3 e.g. due to noise
4 note that this is often outside the influence of researchers using ‘big data’ from these sources
5 e.g. the use of distributed databases
6 e.g. text, image or video sources
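To illustrate the point about spurious correlations, consider a minimal simulation (arbitrary example numbers, Python standard library only); it is an illustrative sketch, not a result from the literature discussed above:

    # Illustrative sketch: with enough candidate variables, purely random data
    # produces some strong-looking correlations by chance alone.
    import random

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / (vx * vy) ** 0.5

    random.seed(42)
    n_points, n_candidates = 30, 1000          # a short series, many candidate variables
    target = [random.gauss(0, 1) for _ in range(n_points)]
    candidates = [[random.gauss(0, 1) for _ in range(n_points)]
                  for _ in range(n_candidates)]

    strong = [i for i, c in enumerate(candidates)
              if abs(pearson(target, c)) > 0.5]
    print(f"{len(strong)} of {n_candidates} random series show |r| > 0.5 "
          "with an equally random target")

The more candidate series are generated, the more of these apparent patterns show up, even though none of them reflects a real relationship.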


Therefore big data technologies need not only to handle the volume of data but also its velocity7 and its variety. Gartner combined those three criteria of Big Data in the 3Vs model [152, 178]. Taken together, the 3Vs pose a challenge to data analysis which made it hard to handle respective data sets with traditional data management and analysis tools: processing large volumes of heterogeneous, structured and especially unstructured data in a reasonable amount of time to allow fast reaction to trends and events. These different requirements, as well as the number of companies pushing into the field, led to a variety of technologies and products labelled as ‘big data’. This includes the advent of NoSQL databases, which give up full ACID compliance for performance and scalability [113, 187]. It also comprises frameworks for massively parallel computing like Apache Hadoop [12], which is built on Google’s MapReduce paradigm [94], and products for handling and analysing streaming data without necessarily storing all of it. In general many of those technologies focus especially on scalability and a notion of scaling out instead of scaling up, which means the capability to easily add new nodes to the system instead of scaling up a single node. The downside of this rapid development is that it is hard to keep an overview of all these technologies. For system architects it can be difficult to decide which technology or product is best in which situation and to build a system optimized for the specific requirements.
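To make the MapReduce paradigm mentioned above concrete, the following minimal word-count sketch reproduces its core idea in plain Python; it is only an illustration of the programming model, not of Hadoop itself, which distributes these phases across a cluster. A map function emits key-value pairs, a shuffle groups them by key and a reduce function aggregates each group.

    # Minimal word-count sketch of the MapReduce programming model (plain Python).
    from collections import defaultdict

    def map_phase(document):
        # map: emit a (key, value) pair for every word
        for word in document.split():
            yield word.lower(), 1

    def shuffle(pairs):
        # shuffle: group all emitted values by their key
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # reduce: aggregate the values of one key
        return key, sum(values)

    documents = ["Big data is big", "data about data"]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts)   # {'big': 2, 'data': 3, 'is': 1, 'about': 1}

Because map and reduce are free of side effects, the framework can run many map and reduce tasks in parallel on different nodes, which is what makes the paradigm scale out.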

1.2 Problem Statement and Thesis Outline

Motivated by a current lack of clear guidance for approaching the field of ‘big data’, the goal of this master’s thesis is to functionally structure this space by providing a reference architecture. This reference architecture has the objective to give an overview of available technology and software within the space and to organize this technology by placing it according to the functional components in the reference architecture. The reference architecture shall also be suitable to serve as a basis for thinking and communicating about ‘big data’ applications and for giving some decision guidelines for architecting them. As the space of ‘big data’ is rather big and diverse, the scope needs to be narrowed to a smaller subspace to be feasible for this work. First, the focus will be on software rather than hardware. While parallelization and distribution are important principles for handling ‘big data’, this thesis will not contain considerations for the hardware design of clusters. Low-level software for mere cluster management is also out of scope. The focus will be on software and frameworks that are used for the ‘big data’ application itself. This includes application infrastructure software like databases, it includes frameworks to guide and simplify programming efforts and to abstract away from parallelization and cluster management, and it includes software libraries that provide functionality which can be used within the application. Deployment options, e.g. cloud computing, will be discussed briefly where they have an influence on the application architecture, but will not be the focus. Second, the uses of ‘big data’ technology and the resulting applications are very diverse. Generally, they can be categorized into ‘big transactional processing’ and ‘big analytical processing’. The first category focusses on adding ‘big data’ functionality to operational applications to handle huge amounts of very fast inflowing transactions. These applications can be as diverse as applications in general, and it is very difficult, if not infeasible, to provide an overarching reference architecture for them. Therefore I will focus on the second category, ‘analytical big data processing’. This will include general functions of analytical applications, e.g. typical data processing steps, and infrastructure software that is used within the application, like databases and frameworks as mentioned above.

7 Velocity refers to the speed of incoming data


Building the reference architecture will consist of four steps. The first step is to conduct a qualitative literature study to define and describe the space of ‘big data’ and related work (Sections 2.1 and 2.3.2) and to gather typical requirements for analytical ‘big data’ applications. This includes dimensions and characteristics of the underlying data like data formats and heterogeneity, data quality, data volume, distribution of data etc., but also typical functional and non-functional requirements, e.g. performance, real-time analysis etc. (Chapter 2.1). Based on this literature study I will design a requirements framework to guide the design of the reference architecture (Chapter 3).

The second step will be to design the reference architecture. To do so, I will first develop and describe a methodology from literature about designing software architectures, especially reference architectures (Sections 2.2.2 and 4.1). Based on the gathered requirements, the described methodology and design principles for ‘big data’ applications, I will then design the reference architecture in a stepwise approach (Section 4.2).

The third step will again be a qualitative literature study, aimed at gathering an overview of existing technologies and technological frameworks developed for handling and processing large volumes of heterogeneous data in reasonable time (see the 3Vs model [152, 178]). I will describe those different technologies, categorize them and place them within the reference architecture developed before (Section 4.3). The aim is to provide guidance on which technologies and products are beneficial in which situations, and a resulting reference architecture to place products and technologies in. The criteria for technology selection will again be based on the requirements framework and the reference architecture.

In a fourth step I will verify and refine the resulting reference architecture by applying it to case studies and mapping it against existing ‘big data’ architectures from academic and industrial literature. This verification (Chapter 5) will test whether existing architectures can be described by the reference architecture, and therefore whether the reference architecture is relevant for practical problems and suitable to describe concrete ‘big data’ applications and systems. Lessons learned from this step will be incorporated back into the framework. The verification demonstrates that this work was successful if the proposed reference architecture tackles requirements for ‘big data’ applications as they are found in practice and as gathered through the literature study, and that the work is relevant for practice as verified by its match to existing architectures. Indeed, the proposed reference architecture and the technology overview provide value by guiding reasoning about the space of ‘big data’ and by helping architects to design ‘big data’ systems that extract value from data and enable companies to improve their competitiveness through better and more evidence-based decision making.

2 Problem Context

In this chapter I will describe the general context of this thesis and of the reference architecture to be developed. First, I will give a definition of what ‘big data’ actually is and how it can be characterized (see Section 2.1). This is important to identify characteristics that define data as ‘big data’ and applications as ‘big data applications’ and to establish a proper scope for the reference architecture. I will develop this definition in Section 2.1.1. The definition will be based on five characteristics, namely data volume, velocity, variety, veracity and value. I will describe these different characteristics in more detail in Sections 2.1.2 to 2.1.6. These characteristics are important so that one can later extract concrete requirements from them in Chapter 3 and then base the reference architecture described in Chapter 4 on this set of requirements. Afterwards, in Section 2.2, I will describe what I mean when talking about a reference architecture. I will define the term and argue why reference architectures are important and valuable in Section 2.2.1, I will describe the methodology for the development of this reference architecture in Section 2.2.2 and I will decide on the type of reference architecture appropriate for the underlying problem in Section 2.2.3. Finally, I will describe related work that has been done on traditional data warehouse architecture (see Section 2.3.1) and on big data architectures in general (see Section 2.3.2).

2.1 Definition and Characteristics of Big Data

2.1.1 Definition of the term ‘Big Data’

As described in Section 1.1, the discussions about the topic in scientific and business literature are diverse, and so are the definitions of ‘big data’ and how the term is used. In one of the largest commercial studies, titled ‘Big data: The next frontier for innovation, competition, and productivity’, the McKinsey Global Institute (MGI) used the following definition:

Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data. [163]

With that definition MGI emphasizes that there is no concrete volume threshold for data to be considered ‘big’, but that it depends on the context. However, the definition uses the size or volume of data as the only criterion. As stated in the introduction (Section 1.1), this usage of the term ‘big data’ can

be misleading, as it suggests that the notion is mainly about the volume of data. If that were the case, the problem would not be new. The question of how to handle data considered large at a certain point in time is a long-existing topic in database research and led to the advent of parallel database systems with ‘shared-nothing’ architectures [99]. Therefore, considering the waves ‘big data’ creates, there must obviously be more to it than just volume. Indeed, most publications extend this definition. One of these definitions is given in IDC’s ‘The Digital Universe’ study:

IDC defines Big Data technologies as a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis. There are three main characteristics of Big Data: the data itself, the analytics of the data, and the presentation of the results of the analytics. [115]

This definition is based on the 3V’s model coined by Doug Laney in 2001 [152]. Laney did not use the term ‘big data’, but he predicted that one trend in e-commerce is that data management will become more and more important and difficult. He then identified the 3V’s - data volume, data velocity and data variety - as the biggest challenges for data management. Data volume means the size of data, data velocity the speed at which new data arrives, and data variety means that data is extracted from varied sources and can be unstructured or semi-structured. When the discussion about ‘big data’ came up, authors especially from business and industry adopted the 3V’s model to define ‘big data’ and to emphasize that solutions need to tackle all three to be successful [11, 178, 194][231, pp. 9-14]. Surprisingly, in the academic literature there is no such consistent definition. Some researchers use [83, 213] or slightly modify the 3V’s model. Sam Madden describes ‘big data’ as data that is ‘too big, too fast, or too hard’ [161], where ‘too hard’ refers to data that does not fit neatly into existing processing tools. Therefore ‘too hard’ is very similar to data variety. Kaisler et al. define Big Data as ‘the amount of data just beyond technology’s capability to store, manage and process efficiently’, but mention variety and velocity as additional characteristics [141]. Tim Kraska moves away from the 3V’s, but still acknowledges that ‘big data’ is more than just volume. He describes ‘big data’ as data for which ‘the normal application of current technology doesn’t enable users to obtain timely, cost-effective, and quality answers to data-driven questions’ [147]. However, he leaves open which characteristics of this data go beyond the ‘normal application of current technology’. Others still characterise ‘big data’ only based on volume [137, 196] or do not give a formal definition [71]. Furthermore, some researchers omit the term altogether, e.g. because their work focusses on single parts 1 of the picture. Overall, the 3V’s model or adaptations of it seem to be the most widely used and accepted description of what the term ‘big data’ means. Furthermore, the model clearly describes characteristics that can be used to derive requirements for respective technologies and products. Therefore I use it as the guiding definition for this thesis. However, given the problem statement of this thesis, there are still important issues left out of the definition. One objective is to dive deeper into the topic of data quality and consistency.
To better support this goal, I decided to add another dimension, namely veracity (or rather the lack of veracity). In industry, veracity is indeed sometimes used as a fourth V, e.g. by IBM [30, 118, 224][10, pp. 4-5]. Veracity refers to the trust in the data and is to some extent the result of data velocity and variety. The high speed at which data arrives and needs to be processed makes it hard to consistently cleanse it and conduct pre-processing to improve data quality. This effect gets stronger in the face of variety. First, data cleansing and ensuring consistency are also necessary for unstructured data. Second, the variety of many independent data sources can naturally lead to inconsistencies between them and makes it hard, if not impossible, to record metadata and lineage for each data item or even data set. Third, especially human-generated content and social media analytics are likely to contain inconsistencies because of human errors, ill intentions or simply because

1 e.g. solely tackling unstructuredness or processing streaming data

there is not one truth, as these sources are mainly about opinion and opinions differ. After adding veracity, there is still another issue with the set of characteristics used so far. All of them focus on the characteristics of the input data and impose requirements mainly on the management of the data and therefore on the infrastructure level. ‘Big data’ is, however, not only about the infrastructure, but also about algorithms and tools on the application level that are used to analyse the data, process it and thereby create value. Visualization tools, for example, are an important product family linked to ‘big data’. Therefore I emphasize another V - value - that aims at the application side, how data is processed there and what insights and results are achieved. In fact, this is already mentioned in IDC’s definition cited above, where they emphasize the ‘economic extraction of value’ from large volumes of varied and high-velocity data [115]. One important note is that, while each ‘big data’ initiative should provide some value and achieve a certain goal, the other four characteristics do not all need to be present at the same time for a problem to qualify as ‘big data’. Each combination of characteristics (volume, velocity, variety, veracity) that makes it hard or even impossible to handle a problem with traditional data management methods may suffice to consider that problem ‘big data’. In the following I will describe the mentioned characteristics in more detail.

2.1.2 Data Volume

As discussed in Chapters 1.1 and 2.1.1, handling volume is the obvious and most widely recognized challenge. There is, however, no clear or concrete quantification of when volume should be considered ‘big’. This is rather a moving target increasing with available computing power. While 300 terabytes were considered big ten years ago, today petabytes are considered big and the target is moving more and more towards exabytes and even zettabytes [115]. There are estimates that Walmart processes more than 2.5 petabytes per hour, all from customer transactions [166]. Google processes around 24 petabytes per day and this is growing [92, 165]. To account for this moving target, big volume can be considered as ‘data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time’ [137]. Furthermore, ‘big’ volume does not only depend on the available computing power, but also on other characteristics and the application of the data. In a paper describing the vision and execution plan for their ‘big data’ research, researchers from MIT claim, for example, that the handling of massive data sets for ‘conventional SQL analytics’ is well solved by data warehousing technology, while massive data is a bigger challenge for more complex analytics 2 [213]. It is also obvious that big volume problems are interdependent with velocity and variety. The volume of a data set might not be problematic if it can be bulk-loaded and a processing time of one hour is fine. Handling the same volume might be a really hard problem if it arrives fast and needs to be processed within seconds. At the same time, handling volume might get harder when the data set to be processed is unstructured. This adds the necessity to conduct pre-processing steps to extract the needed information out of the unstructured data and therefore leads to more complexity and a heavier workload for processing that data set. This exemplifies why volume, or any other of ‘big data’s’ characteristics, should not be considered in isolation, but dependent on other data characteristics. Looking at this interdependence, we can also try to explain the increase of data volume through variety. After all, variety also means that the number of sources organizations leverage, extract, integrate and analyse data from grows. Adding additional data sources to your pool also means increasing the volume of the total data you try to leverage. Both the number of potential data

2 e.g. machine learning workloads


sources and the amount of data they generate are growing. Sensors in technological artifacts3 or used for scientific experiments create a lot of data that needs to be handled. There is also a trend to ‘datafy’4 our lives. People increasingly use body sensors, for example, to guide their workout routines. Smartphones gather data while we use them or even just carry them with us. Alex Pentland describes some of the ways in which location data from smartphones can be used to get valuable insights [177]. However, it is not only additional data sources, but also a change in mindset that leads to increased data volume. Or, expressed better: it is that change of mindset that also leads to the urge to add ever new data sources. Motivated by the figures and promises outlined in Chapter 1.1 and some industrial success stories, companies nowadays consider data an important asset and its leverage a possible competitive differentiator [147]. This leads, as mentioned above, to an urge to unlock new sources of data and to utilize them within the organization’s analytics process. Examples are the analysis of clickstreams and logs for web page optimization and the integration of social media data and sentiment analysis into marketing efforts. Clickstreams from web logs were for a long time only gathered for operational issues. Now new types of analytics allow organisations to extract additional value from those data sets that were already available to them. Another example is Google Flu Trends, where Google used already available data (stored search queries) and applied it to another problem (predicting the development of flu pandemics) [24, 107, 122]. In a more abstract way, this means that data can have additional value beyond the value or purpose it was first gathered and stored for. Sometimes available data can just be reused and sometimes it provides additional insight when integrated with new data sets [165, pp. 101-110]. As a result, organisations start gathering as much data as they can and stop throwing unnecessary data away, as they might need it in the future [141][11, p. 7]. Furthermore, more data is simply considered to give better results, especially for more complex analytic tasks. Halevy et al. state that for tasks that incorporate machine learning and statistical methods, creating larger data sets is preferable to developing more sophisticated models or algorithms. They call this ‘the unreasonable effectiveness of data’ [127]. What they claim is that for machine learning tasks, large training sets of freely available but noisy and unannotated web data typically yield a better result than smaller training sets of carefully cleaned and annotated data combined with complicated models. They exemplify this with data-driven language translation services and state that simple statistical models based on large memorized phrase tables extracted from prior translations do a better job than models based on elaborate syntactic and semantic rules. A similar line of argumentation is followed by Jeff Jonas, chief scientist at IBM’s Entity Analytics Group, and Anand Rajaraman, vice president at Walmart Global eCommerce and teacher of a web-scale data mining class at Stanford University [140, 183, 184, 185].
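The ‘more data beats more sophisticated models’ argument can be illustrated with a deliberately simple, purely illustrative sketch on synthetic data (all values are arbitrary assumptions): a counting-based classifier whose accuracy keeps improving as the training set grows, without any change to the model itself.

    # Illustrative sketch: a simple per-bin majority-vote 'model' improves with more data.
    import random

    random.seed(1)
    N_BINS, LO, HI = 20, -4.0, 4.0

    def sample(n):
        # two overlapping classes: class 0 centred at -1.0, class 1 at +1.0
        data = []
        for _ in range(n):
            cls = random.randint(0, 1)
            data.append((random.gauss(-1.0 if cls == 0 else 1.0, 1.0), cls))
        return data

    def bin_of(x):
        return min(N_BINS - 1, max(0, int((x - LO) / (HI - LO) * N_BINS)))

    def fit(train):
        counts = [[0, 0] for _ in range(N_BINS)]   # per-bin class counts
        for x, y in train:
            counts[bin_of(x)][y] += 1
        return counts

    def predict(model, x):
        c0, c1 = model[bin_of(x)]
        return 0 if c0 >= c1 else 1                # majority class of the bin

    test = sample(5000)
    for n_train in (10, 100, 1000, 100000):
        model = fit(sample(n_train))
        accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
        print(f"training size {n_train:>6}: test accuracy {accuracy:.2f}")

The model never gets smarter; only the amount of training data grows, yet the test accuracy typically improves towards the best rate achievable for this noisy data.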

2.1.3 Data Velocity

Velocity refers to the speed of data. This can be twofold. First, it describes the rate at which new data flows in and existing data gets updated [83]. Agrawal et al. call this the ‘acquisition rate challenge’ [45]. Second, it corresponds to the time acceptable to analyse the data and act on it while it is flowing in, called the ‘timeliness challenge’ [45]. These are essentially two different issues that do not necessarily need to occur at the same time, but often they do. The first of these problems - the acquisition rate challenge - is what Tim Kraska calls ‘big throughput’ [147]. Typically the workload is transactional5 and the challenge is to receive, maybe filter, manage

3 e.g. in airplanes or machines
4 the term ‘datafication’ got coined by Viktor Mayer-Schönberger and Kenneth Cukier [165, pp. 73-97]
5 OLTP-like

and store fast and continuously arriving data6. So, the task is to update a persistent state in some database and to do that very fast and very often. Stonebraker et al. also suggest that traditional relational database management systems are not sufficient for this task, as they inherently incur too much overhead in the form of locking, logging, buffer pool management and latching for multi-threaded operation [213]. An example of this problem is the inflow of data from sensors or RFID systems, which typically create an ongoing stream and a large amount of data [83, 141]. If the measurements from several sensors need to be stored for later use, this is an OLTP-like problem involving thousands of write or update operations per second. Another example are massively multiplayer online games, where the commands of millions of players need to be received and handled while maintaining a consistent state for all players [213]. The challenge here lies in processing a huge number of often rather small write operations while maintaining a somehow consistent, persistent state. One way to handle the problem is to filter the data, dismiss unnecessary data and only store important data. This, however, requires an intelligent engine for filtering out data without missing important pieces. The filtering itself will also consume resources and time while processing the data stream. Furthermore, it is not always possible to filter data. Another necessity is to automatically extract and store metadata together with the streaming data. This is necessary to track data lineage, i.e. which data got stored and how it got measured [45]. The second problem regards the timeliness of information extraction, analysis - that is, identifying complex patterns in a stream of data [213] - and reaction to incoming data. This is often called stream analysis or stream mining [62]. McAfee and Brynjolfsson emphasize the importance of reacting to inflowing data and events in (near) real-time and state that this allows organisations to become more agile than the competition [166]. In many situations real-time analysis is indeed necessary to act before the information becomes worthless [45]. As mentioned, it is not sufficient to analyse the data and extract information in real-time; it is also necessary to react to it and apply the insight, e.g. to the ongoing business process. This cycle of gaining insight from data analysis and adjusting a process or the handling of the current case is sometimes called the feedback loop, and the speed of the whole loop (not of parts of it) is the decisive issue [11]. Strong examples of this are often customer-facing processes [92]. One of them is fraud detection in online transactions. Fraud is often conducted not by manipulating one transaction, but within a certain sequence of transactions. Therefore it is not sufficient to analyse each transaction by itself; rather it is necessary to detect fraud patterns across transactions and within a user’s history. Furthermore, it is necessary to detect fraud while the transactions are processed, in order to deny the transactions or at least some of them [45]. Another example is electronic trading, where data flows get analysed to automatically make buy or sell decisions [213]. Mining streams is, however, not only about speed. As Babcock et al. [55] and Aggarwal [42] point out, processing data streams differs from processing data at rest, both in the approach and in the algorithms used.
One important characteristic is that the data from streams evolves over time. Aggarwal [42] calls this ‘temporal locality’. This means that patterns found in a stream change over time and are therefore dependent on the time interval or ‘sliding window’ [55] of the streaming data that is considered during analysis. As streams are typically unbounded, it is often infeasible to do analysis over the whole history; instead, historical processing is limited up to some point in time or to some interval. Changing that interval can have an effect on the result of the analysis. On the other hand, recognizing changing patterns can also be an analysis goal in itself, e.g. to react in a timely manner to changing buying behaviour.

6 ‘drink from the firehose’
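The sliding-window behaviour just described can be sketched in a few lines of Python (the event stream and window size are made-up examples): the window only ever holds the most recent events, so the ‘pattern’ it exposes - here simply the most frequent item - changes as the stream evolves.

    # Illustrative sliding-window sketch over a hypothetical event stream.
    from collections import Counter, deque

    WINDOW_SIZE = 5
    window = deque(maxlen=WINDOW_SIZE)      # old events fall out automatically

    stream = ["shoes", "shoes", "books", "shoes", "books",
              "books", "games", "games", "books", "games"]

    for t, event in enumerate(stream, start=1):
        window.append(event)
        top_item, top_count = Counter(window).most_common(1)[0]
        print(f"t={t:2d}  window={list(window)}  trending: {top_item} ({top_count}x)")

Changing WINDOW_SIZE changes which patterns become visible, which is exactly the dependence on the chosen interval discussed above.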


Furthermore, to be feasible, streaming algorithms should ideally work with one pass over the data, that is, touching each data point just once while it flows in. Together with the above-mentioned unboundedness, but also the unpredictability and variance of the data itself and of the rate at which it enters the system, this makes stream processing reliant on approximation and sketching techniques as well as on adaptive query processing [42, 55]. Considering these differences, it can be necessary to have distinct functionality for both; e.g. just storing the streaming data in some intermediate, transient staging layer and processing it from there with periodic batch jobs might not be enough. This might be even more important if the data stream is not to be stored in its entirety, but data points get filtered out, e.g. for volume reasons or because they are noise or otherwise not necessary. While the coverage of streaming algorithms is not part of this thesis, which focusses more on the architectural view of the ‘big data’ environment as a whole, I refer to other literature for an overview and a more detailed description [41, 79]. While the processing of streaming data often takes place in a distinct component, it is typically still necessary to access stored data and join it with the data stream. Most of the time it is, however, not feasible to do this join, all the processing and the pattern recognition at run-time. It is often necessary to develop a model7 in advance, which can be applied to and updated by the streaming-in data. The run-time processing is thereby reduced to a more feasible amount of incremental processing. That also means that it is necessary to apply and integrate analytic models, which were created by batch-processing data at rest, into a rule engine for stream processing [45, 231].
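As one concrete example of such a one-pass, bounded-memory technique, the following sketch implements the well-known Misra-Gries summary for frequent items (the stream and the parameter k are illustrative assumptions). It touches each element exactly once and keeps at most k-1 counters, returning approximate counts that are guaranteed lower bounds.

    # Illustrative one-pass sketch: Misra-Gries summary for frequent stream items.
    def misra_gries(stream, k=3):
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                # decrement every counter; drop counters that reach zero
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters   # candidate heavy hitters with underestimated counts

    clicks = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "e", "a", "b"]
    print(misra_gries(clicks, k=3))   # any item occurring more than n/k times survives

Such summaries trade exactness for a fixed memory footprint, which is what makes them attractive for unbounded streams.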

2.1.4 Data Variety

One driver of ‘big data’ is the potential to use more diverse data sources, data sources that were hard to leverage before, and to combine and integrate data sources as a basis for analytics. There has been a rapid increase of publicly available, text-focussed sources due to the rise of social media several years ago. This encompasses blog posts, community pages and messages and images from social networks, but there is also a rather new (at least in its dimension) source of data from sensors, mobile phones and GPS [46, 166]. Companies, for example, want to combine sentiment analysis from social media sources with their customer master data and transactional sales data to optimize marketing efforts. Variety hereby refers to a general diversity of data sources. This not only implies an increased number of different data sources but obviously also structural differences between those sources. On a higher level this creates the requirement to integrate structured data8, semi-structured data9 and unstructured data10 [46, 83, 141]. On a lower level this means that, even if sources are structured or semi-structured, they can still be heterogeneous: the structure or schema of different data sources is not necessarily compatible, different data formats can be used and the semantics of data can be inconsistent [130, 152]. Managing and integrating this collection of multi-structured data from a wide variety of sources poses several challenges. One of them is the actual storage and management of this data in database-like systems. Relational database management systems (RDBMS) might not be the best fit for all types and formats of data. Stonebraker et al. state that they are particularly ill-suited, for example, for array or graph data [213]. Array-shaped data is often used for scientific problems, while graph data is important because connections in social networks are typically shaped as graphs, but also due to the Linked Open Data project’s [2] use of RDF and therefore graph-shaped data.

7 e.g. a machine learning model
8 data with a fixed schema, e.g. from relational databases or HTML tables
9 data with some structure, but a more flexible schema, e.g. XML or JSON data
10 e.g. plain text
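For illustration, the same (made-up) sales fact can be written down in each of the three structural forms distinguished above; only in the first two cases can an application access individual fields directly, while the third first requires information extraction.

    # Illustrative example of structured, semi-structured and unstructured data.
    import json

    # structured: fixed schema, e.g. a row (customer_id, product, amount) in a relational table
    row = (4711, "running shoes", 89.95)

    # semi-structured: self-describing and nested, the schema may vary per record (JSON)
    doc = json.loads('{"customer": {"id": 4711, "city": "Eindhoven"},'
                     ' "items": [{"product": "running shoes", "amount": 89.95}]}')

    # unstructured: plain text, structure must first be extracted before integration
    text = "Customer 4711 from Eindhoven bought running shoes for 89.95 euro."

    print(row)
    print(doc["items"][0]["product"], doc["customer"]["city"])
    print(text)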

11 CHAPTER 2. PROBLEM CONTEXT

Another challenge lies in the semi- and unstructuredness of data. Before this kind of data can truly be integrated and analysed to mine source-crossing patterns, it is necessary to impose some structure onto it [45, 46]. There are technologies available to extract entities, relationships and other information11 out of textual data. These lie mainly in the fields of machine learning, information extraction, natural language processing and text mining. While there are techniques available for text mining, there is other unstructured data which is not text. Therefore, there is also a need to develop techniques for extracting information from images, videos and the like [45]. Furthermore, Agrawal et al. expect that text mining will typically not be conducted with just one general extractor, but that several specialized extractors will be applied to the same text. Therefore they identify a need for techniques to manage and integrate different extraction results for a certain data source [46]. This is especially true when several textual sources need to be integrated, all of them structured by using some extractors. In the context of integrating different data sources, different data - whether it is initially unstructured, semi-structured or structured - needs to be harmonized and transformed to adhere to some structure or schema that can be used to actually draw connections between different sources. This is a general challenge of data integration and techniques for it are available, as there is an established, long-lasting research effort on data integration [46]. Broadly speaking there are two possibilities for the time at which information extraction and data harmonization can take place. One option is to conduct information extraction from unstructured sources and data harmonization as a pre-processing step and store the results as structured or semi-structured data, e.g. in an RDBMS or in a graph store. The second option is to conduct information extraction and data harmonization at the runtime of an analysis task. The first option obviously improves the runtime performance, while the second option is more flexible in the sense of using specialized extractors tailored to the analysis task at hand. It is also important to note that in the process of transforming unstructured to structured data only that information is stored which the information extractors were built for. The rest might be lost. Following the ‘Never-throw-information-away’ principle mentioned in Chapter 2.1.2, it might therefore be valuable to additionally store the original text data and use a combined solution. In that case information extraction runs as a pre-processing step, extracted information gets stored as structured data, but the original text data remains available and can be accessed at runtime if the extracted information is not sufficient for a particular analysis task. The obvious drawback of this combined approach is the larger amount of storage space needed. Additionally, a challenge lies in creating metadata along the extraction and transformation process to track the provenance of the data. Metadata should include which source data is from, how it got recorded there and what its semantics are, but also how it was processed during the whole analysis process, which information extractors were applied etc. This is necessary to give users an idea where the data used for analysis came from and how reliable the results therefore are [45].
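A minimal sketch of the combined approach just described might look as follows (the regular expression, field names and example note are hypothetical): information extraction runs as a pre-processing step, the structured result is stored together with provenance metadata, and the original text is kept so that later, more specialised extractors can still be applied.

    # Illustrative sketch: extract structured data, keep the original text and provenance.
    import re
    from datetime import datetime, timezone

    def extract_purchase(text, source):
        match = re.search(r"Customer (\d+) .* bought (.+?) for ([\d.]+) euro", text)
        record = None
        if match:
            record = {
                "customer_id": int(match.group(1)),
                "product": match.group(2),
                "amount": float(match.group(3)),
            }
        return {
            "extracted": record,            # structured view (may be None if nothing matched)
            "original_text": text,          # keep the raw data for later re-extraction
            "metadata": {                   # provenance / lineage information
                "source": source,
                "extractor": "purchase-regex-v1",
                "extracted_at": datetime.now(timezone.utc).isoformat(),
            },
        }

    note = "Customer 4711 from Eindhoven bought running shoes for 89.95 euro."
    print(extract_purchase(note, source="call centre notes"))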

Data Sources

As mentioned, with the growing ability to leverage semi- and unstructured data, the amount and variety of potential data sources is growing as well. This section is intended to give an overview of this variety. Soares classifies typical sources of ‘big data’ into five categories: web and social media, machine-to-machine data, big transaction data, biometrics and human-generated data [202, pp. 10-12, 143-209]:

11 e.g. the underlying sentiment


Web Data & Social Media

The web is a rich, but also very diverse source of data for analytics. For one, there are web sources to directly extract content - knowledge or public opinion - from, which are initially intended for a human audience. These human-readable sources include crawled web pages, online articles and blogs [83]. The main part of these sources is typically unstructured, including text, videos and images. However, most of these sources have some structure; they are, for example, related to each other through hyperlinks or provide categorization through tag clouds. Next, there is web content and knowledge structured to provide machine-readability. It is intended to enable applications to access the data, understand the data due to its semantics, integrate data from different sources, set it into context and infer new knowledge. Such sources are machine-readable metadata integrated into web pages12, initiatives such as the Linked Open Data project [2] using data formats from the semantic web standard13 [3], but also publicly available web services. This type of data is often graph-shaped and therefore semi-structured. Other web sources deliver navigational data, which provides information on how users interact with the web and how they navigate through it. This data encompasses logs and clickstreams gathered by web applications as well as search queries. Companies can use this information, for example, to gain insight into how users navigate through a web shop and optimize its design based on the buying behaviour. This data is typically semi-structured. A last type of web data is data from social interactions. This can be communicational data, e.g. from instant messaging services, or status updates on social media sites. On the level of single messages this data is typically unstructured text or images, but one can impose semi-structure on a higher level, e.g. indicating who is communicating with whom. Furthermore, social interaction also encompasses data describing a more structural notion of social connections, often called the social graph or the social network. An example of this kind are the ‘friendship’ relations on Facebook. This data is typically semi-structured and graph-shaped. One thing to note is that communicational data is exactly that: the information people publish about themselves on social media serves communication and self-presentation. It aims at prestige and can therefore be biased, flawed or simply a lie. This is why Alex Pentland prefers to work with more behavioural data like location data from phones, which he claims tells ‘what you’ve chosen to do’ and not ‘what you would like to tell’ [177]. A concrete example are location check-ins people post on Foursquare, as they often contain humorous locations that are used to express an opinion or make a statement [192]. Therefore one should be cautious about how much trust to put into this kind of data and which questions can be answered by it. It is also worth mentioning that these different types of web data are not necessarily exclusive. There can be several overlaps. Social media posts can be both human-readable publications of knowledge and communicational. The same goes for blog posts, which often include a comment function that can be used for discussion and is communicational. Another example is the Friend of a Friend (FOAF) project [1].
It is connected to the semantic web and linked open data initiatives and can be used to publish machine-readable data modelled as an RDF graph, but at the same time it falls into the category of structural social interactions.

Machine-to-machine data

Machine-to-machine communication describes systems communicating with technical devices that are connected via some network. The devices are used to measure a physical phenomenon like movement or temperature and to capture events within this phenomenon. Via the network the devices communicate with an application that makes sense of the measurements and captured events and

12 e.g. through HTML meta tags or microformats
13 RDF, RDFS and OWL

extracts information from them. One prominent example of machine-to-machine communication is the idea of the ‘internet of things’ [202, p. 11]. Devices used for measurements are typically sensors, RFID chips or GPS receivers. They are often embedded into some other system, e.g. sensors for technical diagnosis embedded into cars or smart meters in the context of ambient intelligence in houses. The data created by these systems can be hard to handle. The BMW group, for example, predicts that its Connected Drive cars will produce one petabyte per day in 2017 [168]. Another example are GPS receivers, often embedded into mobile phones but also into other mobile devices. The latter is an example of a device that creates location data, also called spatial data [197]. Alex Pentland emphasizes the importance of this kind of data as he claims it to be close to people’s actual behaviour [177]. Machine-to-machine data is typically semi-structured.

Big transaction data

Transactional data grew with the dimensions of the systems recording it and the massive amount of operations they conduct [83]. Transactions can, for example, be purchases of items from large web shops, call detail records from telecommunication companies or payment transactions from credit card companies. These typically create structured or semi-structured data. Furthermore, big transaction data can also refer to transactions that are accompanied or formed by human-generated, unstructured, mostly textual data. Examples here are call centre records accompanied by personal notes from the service agent, insurance claims accompanied by a description of the accident or health care transactions accompanied by diagnosis and treatment notes written by the doctor.

Biometrics

Biometric data in general is data describing a biological organism and is often used to identify individuals (typically humans) by their distinctive anatomical and behavioural characteristics and traits. Examples of anatomical characteristics are fingerprints, DNA or retinal scans, while behavioural refers, for example, to handwriting or keystroke analysis [202]. One important example of using large amounts of biometric data are scientific applications for genomic analysis.

Human-generated data

According to Soares, human-generated data refers to all data created by humans. He mentions emails, notes, voice recordings, paper documents and surveys [202, p. 205]. This data is mostly unstructured. It is also apparent that there is a strong overlap with two of the other categories, namely big transaction data and web data. Big transaction data that is categorized as such because it is accompanied by textual data, e.g. call centre agents’ notes, has an obvious overlap. The same goes for some web content, e.g. blog entries and social media posts. This shows that the categorization is not mutually exclusive, but data can belong to more than one category.

2.1.5 Data Veracity

According to a dictionary, veracity means “conformity with truth or fact”. In the context of ‘big data’, however, the term rather describes the absence of this characteristic. Put differently, veracity refers to the trust in the data, which might be impaired by the data being uncertain or imprecise [231, pp. 14-15]. There are several reasons for uncertainty and untrustworthiness of data. For one, when incorporating different data sources it is likely that the schemas of the data vary. The same attribute name or value might relate to different things, or different attribute names or values might relate to the same thing. Jeff Jonas, chief scientist at IBM’s Entity Analytics Group, therefore claims that ‘there is no such thing as a single version of the truth’ [139]. In fact, in the case of unstructured data there is not even a schema, and in the case of semi-structured data the schema of the data is not as

exact and clearly defined as it used to be in more traditional data warehousing approaches, where data is carefully cleaned, structured and adheres to a relational specification [130]. In the case of unstructured data, where information first needs to be extracted, this information is often extracted with some probability and is therefore not completely certain. In that sense, variety directly works against veracity. Furthermore, the data of an individual source might be fuzzy and untrustworthy as well. Boyd and Crawford state that in the face of ‘big data’ duplication, incompleteness and unreliability need to be expected. This is especially true for web sources and human-generated content [73]. Humans are often not telling the truth or withhold information, sometimes intentionally, sometimes just because of mistakes and error. Agrawal et al. give several examples of such behaviour. Patients decide to hold back information about risky or embarrassing behaviour and habits or just forget about a drug they took before. Doctors might mistakenly provide a wrong diagnosis [45]. If there are humans in a process, there might always be some error or inconsistency.

There are several possibilities to handle imprecise, unreliable, ambiguous or uncertain data. The first approach is typically used in traditional data warehousing efforts and implies a thorough data cleansing and harmonisation effort during the ETL process, that is at the time of extracting the data from its sources and loading it into the analytic system. That way data quality14 and trust are ensured up front and the data analysis itself rests on a trusted basis. In the face of ‘big data’ this is often not feasible, especially when hard velocity requirements are present, and sometimes simply impossible, as (automatic) information extraction from unstructured data is always based on probability. Given the variety of data it is likely that some incompleteness and errors remain in the data, even after data cleaning and error correction [45]. Therefore it is always necessary to handle some errors and uncertainty during the actual data analysis task and manage ‘big data’ in the context of noise, heterogeneity and uncertainty [45]. There are again essentially two options. The first option is to do a data cleaning and harmonization step directly before or during the analysis task. In that case, the pre-processing can be done more specifically for the analysis task at hand and can therefore often be leaner. Not every analysis task needs to be based on completely consistent data and retrieve completely exact results. Sometimes trends and approximated results suffice [130]. The second option for handling uncertain data during the analysis task is based on the notion that some business problems do not need exact results; results that are ‘good enough’, that is with a probability above some threshold, suffice [130]. So, uncertain data can be analysed without cleaning, but the results are presented with some probability or certainty value, which is also impacted by the trust in the underlying data sources and their data quality. This allows users to get an impression of how trustworthy the results are. For this option it is even more crucial to thoroughly track data provenance and its processing history [45].
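As a minimal illustration of this second option, the following Python sketch attaches a certainty value to each extracted fact by combining a hypothetical extraction confidence with a hypothetical trust score of the originating source and filters the results against a threshold. The numbers, source names and facts are invented for illustration; a real system would derive such scores from provenance and data quality metadata:

# Hypothetical trust scores per data source (0 = no trust, 1 = full trust).
source_trust = {"crm": 0.95, "web_forum": 0.6, "call_centre_notes": 0.7}

# Facts extracted from (mostly unstructured) sources, each with an extraction confidence.
extracted_facts = [
    {"fact": "customer_123 owns product A", "source": "crm", "extraction_confidence": 0.99},
    {"fact": "customer_123 is unhappy", "source": "web_forum", "extraction_confidence": 0.8},
    {"fact": "customer_123 wants to cancel", "source": "call_centre_notes", "extraction_confidence": 0.5},
]

THRESHOLD = 0.45  # the 'good enough' boundary is application dependent

for f in extracted_facts:
    # Combine extraction confidence with the trust in the underlying source.
    certainty = f["extraction_confidence"] * source_trust[f["source"]]
    if certainty >= THRESHOLD:
        print(f"{f['fact']} (certainty {certainty:.2f})")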

2.1.6 Data Value

While the other four characteristics were used to describe the underlying data itself, value refers to the processing of the data and the insights produced during analysis. Data is typically gathered with some immediate goal. Put differently, gathered data offers some immediate value due to the first-time use it was initially collected for. Of course, data value is not limited to a one-time use or to the initial analysis goal. The full value of data is determined by possible future analysis tasks, how they get realized and how the data is used over time. Data can be reused, extended and newly combined

14 e.g. consistency

with another data set [165, pp. 102-110]. This is the reason why data is more and more seen as an asset for organisations and why the trend is to collect potentially valuable data even if it is not needed immediately and to keep everything, assuming that it might offer value in the future [141].

One reason why data sets in the context of ‘big data’ offer value is simply that some of them are underused, because their volume, velocity or lack of structure made them difficult to leverage. They include information and knowledge which it just was not practical to extract before. Another reason for value in ‘big data’ sources is their interconnectedness, as claimed by Boyd and Crawford. They emphasize that data sets are often valuable because they relate to other data sets about a similar phenomenon or the same individuals and offer insights when combined which neither data set provides if analysed on its own. In that sense, value is created when pieces of data about the same or a similar entity or group of entities are connected across different data sets. Boyd and Crawford call this being ‘fundamentally networked’ [73].

According to the McKinsey Global Institute there are five different ways in which this data creates value. It can create transparency, simply by being more widely available due to the new potential to leverage and present it. This makes it accessible to more people who can get insights and draw value out of it [163].

It enables organisations to set up experiments, e.g. for process changes, and create and analyse large amounts of data from these experiments to identify and understand possible performance improvements [163].

‘Big data’ sets can be used and analysed to create a more detailed segmentation of customers or other populations to customize actions and tailor specific services. Of course, some fields are already accustomed to the idea of segmentation and clustering, e.g. market segmentation in marketing. They can gain additional value by conducting this segmentation at a more detailed micro-level or by doing it in real time. For other industries this approach might be new and provide an additional value driver [163].
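As an illustration of such micro-segmentation, the following Python sketch clusters customers by two invented features (purchases per month and average basket value) using k-means. The data and feature choice are hypothetical and only meant to show the mechanics, not a recommended model:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [purchases per month, average basket value in EUR]
customers = np.array([
    [1, 15], [2, 20], [1, 18],      # occasional buyers, small baskets
    [8, 25], [9, 30], [10, 22],     # frequent buyers, small baskets
    [2, 180], [3, 210], [1, 190],   # occasional buyers, large baskets
])

# Partition the customers into three segments.
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)
print(segments)  # assigns each customer to one of the segments 0, 1, 2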

Furthermore, the insights of ‘big data’ analysis can support human decision making by pointing to hidden correlations, potential effects of an action or hidden risks. An example are risk or fraud analysis engines for insurance companies. In some cases low-level decision making can even be delegated to and automated by such engines [163].

Finally, according to the McKinsey Global Institute ‘big data’ can enable new business models, products and services or improve existing ones. Data about how products or services are used can be leveraged to develop and improve new versions of the product. Another example is the advent of real-time location data, which led to completely new services and even business models [163].

To create this value the focus of ‘big data’ shifts to more complex, ‘deep’ analysis [83]. Stonebraker et al. also claim that conventional, SQL-driven analytics on massive data sets are available and well-solved by the data warehouse community, but that it is more complex analytics tasks on massive data sets that need attention in ‘big data’ research. They name predictive modelling of medical events or complex analysis tasks on very large graphs as examples [213].

In that sense, ‘big data’ is also connected with a shift to more sophisticated analysis methods compared to simple reports or OLAP exploration in traditional data warehouse approaches. This includes semantic exploration of semi- or unstructured data, machine learning and data mining methods, multivariate statistical analysis and multi-scenario analysis and simulation. It also includes visualization of the entire data set or parts of it, as well as of the results and insights gained by the above-mentioned advanced analysis methods [62].


2.2 Reference Architectures

2.2.1 Definition of the term ‘Reference Architecture’

Before defining the term ‘reference architecture’, we must first establish an understanding of the term ‘architecture’. Literature offers several definitions of this latter term. Some of the most widely adopted are the following:
Garlan & Perry 1995: The structure of the components of a program/system, their interrelationships, and principles and guidelines governing their design and evolution over time. [116]
IEEE Standard 1471-2000: Architecture is the fundamental organization of a system embodied in its components, their relationships to each other and to the environment and the principles guiding its design and evolution. [4]
Bass et al. 2012: The software architecture of a system is the set of structures needed to reason about the system, which comprise software elements, relations among them, and properties of both. [60, p. 4]
All these definitions have in common that they describe architecture to be about structure and that this structure is formed by components or elements and the relations or connectors between them. Indeed this is the common ground that is accepted in almost all publications [59, 123, 151, 193]. The term ‘structure’ points to the fact that an architecture is an abstraction of the system, described in a set of models. It typically describes the externally visible behaviour and properties of a system and its components [59], that is the general function of the components, the functional interaction between them by means of interfaces and between the system and its environment, as well as the non-functional properties of the elements and the resulting system15 [193]. In other words, the architecture abstracts away from the internal behaviour of its components and only shows the public properties and the behaviour visible through interfaces. However, an architecture typically has not only one, but several ‘structures’. Most recent definitions support this pluralism [59, 60, 123]. Different structures represent different views onto the system. These describe the system along different levels of abstraction and component aggregation, describe different aspects of the system or decompose the system and focus on subsystems. A view is materialized in one or several models. As mentioned, an architecture is abstract in terms of the system it describes, but it is concrete in the sense that it describes a concrete system. It is designed for a specific problem context and describes system components, their interaction, functionality and properties with concrete business goals and stakeholder requirements in mind. A reference architecture abstracts away from a concrete system, describes a class of systems and can be used to design concrete architectures within this class. Put differently, a reference architecture is an ‘abstraction of concrete software architectures in a certain domain’ and shows the essence of system architectures within this domain [52, 114, 172, 173]. A reference architecture shows which functionality is generally needed in a certain domain or to solve a certain class of problems, how this functionality is divided and how information flows between the pieces (called the reference model). It then maps this functionality onto software elements and the data flows between them [59, pp. 24-26][222, pp. 231-239]. Within this approach reference architectures incorporate knowledge about a certain domain, its requirements, necessary functionalities and their interaction, together with architectural knowledge about how to design software systems,

15 e.g. security and scalability

their structures, components and internal as well as external interactions for this domain, such that they fulfil the requirements and provide the functionalities (see Figure 2.1) [52, 173][222, pp. 231-239].

Figure 2.1: Elements of a Reference Architecture [222, p. 232]

The goal of bundling this kind of knowledge into a reference architecture is to facilitate and guide the future design of concrete system architectures in the respective domain. As a reference architecture is abstract and designed with generality in mind, it is applicable in different contexts, where the concrete requirements of each context guide the adaptation into a concrete architecture [52, 85, 172]. The level of abstraction can however differ between reference architectures, and with it the concreteness of guidance a reference architecture can offer [114].

2.2.2 Reference Architecture Methodology

While developing a reference architecture, it is important to keep some of the points mentioned in Section 2.2.1 in mind. The result should be relevant to a specific domain, that is, incorporate domain knowledge and fulfil domain requirements, while still being general enough to be applicable in different contexts. This means that the level of abstraction of the reference architecture and its concreteness of guidance need to be carefully balanced. Following a design method for reference architectures helps to accomplish that and provides the basis for a well-grounded and valid reference architecture as well as for rigour and relevance. However, research about reference architectures and the respective methodology is significantly rarer than that about concrete architectures. The choice of design methods in that space is therefore rather limited, and the one proposed by Galster and Avgeriou [114] is to the best of my knowledge the most extensive and best grounded of those. Therefore I decided to loosely follow the proposed development process, which consists of the following six steps, distributed across this thesis.

Step 1: Decide on the reference architecture type Deciding on a particular type of reference architecture helps to fix its purpose and the context to place it in. The characterisation of the reference architecture and its type will be described in Section 2.2.3. This guides the design and some overarching design decisions as described in the same section.

Step 2: Select the design strategy The second step is to decide whether the reference architecture will be designed from scratch (research-driven) or based on existing architecture artifacts within the domain (practice-driven). As Galster and Avgeriou [114] point out, the design strategy should be synchronized with the reference architecture type chosen in step 1. Therefore, the selection of the design strategy will be made at the end of Section 2.2.3.


Step 3: Empirical acquisition of data The third step is about identifying and collecting data and information from several sources. It is generally proposed to gather data from people (customers, stakeholders, architects of concrete architectures), systems (documentations) and literature [114]. As the scope of this thesis does not allow for comprehensive interviews or questionnaires, the reference architecture will mainly be based on the latter two. It will involve document study and content analysis of literature about ‘big data’, including industrial case studies, white papers, existing architecture descriptions and academic research papers. A first result of the literature study is the establishment of the requirements the resulting reference architecture will be based on. These requirements will be presented in Chapter 3.

Step 4: Construction of the reference architecture After the data acquisition, the next step is to construct the reference architecture, which will be described in Chapter 4. As pointed out in Section 2.2.1, an architecture consists of a set of models. Constructing the reference architecture therefore means developing these models. To structure the set of models, Galster and Avgeriou [114] agree with the general recommendation within the software architecture literature to use the notion of views [60, pp. 9-18,331-344][193, pp. 27-37][222, pp. 76-92]. According to the respective IEEE and ISO standards for the design of software architectures [4, 7], a view consists of one or several models that represent one or more aspects of the system relevant to a particular set of stakeholder concerns. In that sense a view targets a specific group of stakeholders and allows them to understand and analyse the system from their perspective, filtering out elements of the architecture which are of no concern to that specific group. This enhances comprehensibility by providing a set of concise, focussed and manageable models instead of putting every aspect of the system into one big, complex model which would be hard or impossible to understand. All views together describe the system in its entirety; the different views are related and should of course not be inconsistent with each other.

Step 5: Enabling the reference architecture with variability I will omit this step and will not add specific annotations, variability models or variability views. I consider the variability to be inherent in the abstractness of the reference architecture. I aim for completeness regarding the functional components, so variability can be implemented by choosing the functionality required for a concrete architecture based on its requirements, while leaving unwanted functionality out. Furthermore, the last step, the mapping to technology and software platforms, will not be a fixed 1:1 mapping, but will more loosely discuss several options and choices of technology and software to implement a functional component. It will also not be fixed towards specific industrial products. This provides the freedom to make this choice based on the concrete situation. This freedom is also necessary, considering that the whole ‘big data’ space is not completely mature yet, is under steady development and new technologies will still arise during the next couple of years.

Step 6: Evaluation of the reference architecture Unfortunately it will not be possible to evaluate the reference architecture within a concrete project situation, due to the scope of this work but also due to the lack of access to such a project situation.
The evaluation and verification will therefore rely on mapping the reference architecture to concrete ‘big data’ architectures described in research papers and industrial white papers and reports. This will be done in Chapter 5 and makes it possible to evaluate the completeness, correctness, compliance and validity of the reference architecture [114].


2.2.3 Classification of the Reference Architecture and general Design Strategy

As mentioned in Section 2.2.1, reference architectures can have different levels of abstraction. However, this is not the only major characteristic in which they can differ. To design a reference architecture it is important to first decide on its type, which is mainly driven by the purpose of the reference architecture. Galster and Avgeriou [114] mention the decision on the type of the reference architecture as the first step in its design and propose a classification method by Angelov et al. [51]. I will follow this proposition but will use a more recent publication by the same authors, in which they extend their initial work [52], to determine the type of the reference architecture. They base their framework on the three dimensions context, goals and design and describe complex interrelations between these dimensions (see Figure 2.2). The architecture goals limit the possible context of the architecture definition and impact its design. The other way round, architecture design and context dictate whether the goals can be achieved. Furthermore, design choices are made within a certain context and are therefore influenced by it. A design choice might also imply a certain context, while it would not be valid in another.

Figure 2.2: Interrelation between architecture goals, context and design [52]

All these dimensions have a couple of sub-dimensions. However, as hinted at above, not every combination of these dimensions is valid. Angelov et al. call a reference architecture ‘congruent’ if the goals fit into the context and both are adequately represented within the design. Reference architecture types are then valid, specific value combinations within this dimensional space [52].

Reference Architecture Goals This dimension classifies the general goal of a reference architecture and typically drives decisions about the other two dimensions. While goals in practice are quite diverse and could be classified in more detail, Angelov et al. postulate that a coarse-granular classification into reference architectures aimed at standardization of concrete architectures and those aimed at facilitation of concrete architectures is sufficient to describe the interplay with the context and design dimensions [52].

Reference Architecture Context The context of a reference architecture classifies the situation in which it gets designed and the possible situations in which it can get applied. First, it classifies the scope of its design and application, that is, whether it is designed and intended to be used within a single organization16 or in multiple organizations [52]. Second, it classifies the stakeholders that participate in either requirements definition or design of the reference architecture. These can be software organizations that intend to develop

16 In this case a reference architecture is sometimes also called a standard architecture for the respective organization


software based on the reference architecture, user organizations that apply software based on the reference architecture or independent organizations17 [52]. Third, the context also defines the time at which a reference architecture gets developed relative to the existence of relevant concrete systems and architectures. It is necessary to decide whether the reference architecture gets developed before any systems have implemented the architecture and its entire functionality in practice (preliminary) or as an accumulation of experience from existing systems (classical). Typically, reference architectures were based on knowledge from existing systems and their architectures and therefore on concepts proven in practice [85, 171]. That is, they are often classical reference architectures. However, a reference architecture can also be developed before respective technology or software systems actually exist, or it might enhance and innovate beyond the existing, concrete architectures in the domain. In that case, it is preliminary [52].

Reference Architecture Design The design of a reference architecture can be faced with a lot of design decisions, and the way it is designed can therefore differ in multiple ways. This dimension helps to classify some of the general design decisions. First, a reference architecture can be classified by the element types it defines. As stated in most of the definitions of the term ‘software architecture’ in Section 2.2.1, an architecture typically incorporates components, connectors between components and the interfaces used for communication. Another mentioned element type are policies and guidelines. Furthermore, a reference architecture can possibly also include descriptions of the protocols and algorithms used [52]. Second, there needs to be a decision on which level of detail the reference architecture should be designed at. Angelov et al. propose a broad classification into detailed, semi-detailed and aggregated elements, and the classification can be done individually for each element type mentioned above [52]. The level of detail refers to the number of different elements. While in a more detailed reference architecture different sub-systems are modelled as individual elements, in a more aggregated reference architecture sub-systems are not explicitly modelled. It is however difficult to provide a formal measure to distinguish between detailed, semi-detailed and aggregated reference architectures based on the number of elements. In a complex domain an aggregated reference architecture can still contain a lot of elements. The classification is therefore more an imprecise guideline, but Angelov et al. consider this sufficient for the purpose of their framework. It should also be noted that reference architectures can comprise different aggregation levels to support different phases of the design or for communication with different stakeholders. Third, the level of abstraction of the reference architecture can be classified. It is important to distinguish between abstraction and aggregation as described in the previous sub-dimension. While aggregation refers to how detailed sub-elements are modelled, abstraction refers to how concrete decisions about functionality and used technology are. This sub-dimension differentiates between abstract, semi-concrete and concrete reference architectures. While an abstract reference architecture specifies the nature of the elements in a very general way, e.g. general functionality, a concrete architecture describes very specific choices for each element, e.g.
a concrete vendor and software product. A semi-concrete reference architecture lies in between and couples elements to a class of products or technologies [52]. Fourth, reference architectures are classified according to the level of formalization of the specification. Informal reference architectures are specified in natural language or some graphical ad-hoc notation. Semi-formal specifications use an established modelling language with clearly defined semantics, but one that has no formal or mathematical foundation, e.g. UML. A formal specification uses a formal architecture description language, e.g. C2 or Rapide, that has a thorough mathematical foundation and strictly defined semantics [52].

17 Independent organizations can e.g. be research, standardization, non-profit or governmental organizations


Dimension                     Classification
G1: Goal                      Facilitation
C1: Scope                     Multiple Organizations
C2: Timing                    Classical
C3: Stakeholders              Independent Organization (Design), Software Organizations (Requirements), User Organizations (Requirements)
D1: Element Types             Components, Interfaces, Policies / Guidelines
D2: Level of Detail           Semi-detailed components and policies / guidelines; aggregated or semi-detailed interfaces
D3: Level of Abstraction      Abstract or semi-concrete elements
D4: Level of Formalization    Semi-formal element specifications

Table 2.1: Characteristics of Reference Architectures Type 3 [52]

Application of the Reference Architecture Framework The framework described above can be applied to guide the design of reference architectures. It does so by providing five architecture types placed in the classification space that the authors claim to be valid and congruent. Reference architectures that cannot be mapped to one of these types are considered incongruent. When designing a new reference architecture, these predefined types can be used as guidance for the general design decisions. The application of the framework starts with assessing the general goal and the contextual scope and timing of the planned reference architecture. The result of these decisions can then be mapped against the framework to determine the fitting reference architecture type. If no type fits the respective choices for goals and context, this is a strong indication that these choices should be revised. Otherwise, the next step after mapping the type is to ensure that input from the stakeholders specified in the chosen type is available. If this is not possible, the goals and context should again be revised to fit the available stakeholder input or the design effort should be stopped. If a match is found, the general design decisions can be taken as guidelines from the identified type [52].

As described in the problem statement (Chapter 1.2), this thesis aims to give an overview of existing technology within the ‘big data’ space, put it into context and help architects design concrete system architectures. Therefore, the general goal according to the framework is clearly facilitation. This rules out classical reference architectures aimed at standardization, both for multiple organizations (type 1) and within a single organization (type 2). The scope of the reference architecture will not be focused on one organization; it is intended to be general enough to be applicable in multiple organizations, making a classical facilitation architecture to be used within a single organization (type 4) a poor choice. Furthermore, there already exist multiple systems in the ‘big data’ space and much of the underlying technology is available and proven. According to the timing sub-dimension the reference architecture can thus be classified as classical. Mapped against the framework, a classical facilitation reference architecture to be used in multiple organizations (type 3) is therefore the fitting choice (see Table 2.1), and not a preliminary facilitation architecture (type 5), which aims to guide and move the design of future systems forward.
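To make this mapping step concrete, the following Python sketch encodes the type selection as it is characterised in this section and looks up the type matching the choices made above. The encoding is a simplification for illustration only; the authoritative definitions are those of Angelov et al. [52]:

def classify(goal, timing, scope):
    # Simplified lookup of the reference architecture types as characterised in the text.
    if goal == "standardization" and timing == "classical":
        return "type 1" if scope == "multiple organizations" else "type 2"
    if goal == "facilitation" and timing == "classical":
        return "type 3" if scope == "multiple organizations" else "type 4"
    if goal == "facilitation" and timing == "preliminary":
        return "type 5"
    # An unmatched combination indicates an incongruent choice of goal and context.
    return "incongruent - revise goals and context"

# The choices made for this thesis map to type 3.
print(classify("facilitation", "classical", "multiple organizations"))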

A type 3 reference architecture is a ‘classical, facilitation architecture designed for multiple organizations by an independent organization’ [52]. This kind of reference architecture is developed by an independent organization, typically a research center, based on existing knowledge and experience about respective architectures in research and industry. One critical point and possible weakness of

the resulting reference architecture is that it is not possible, due to the scope and timeline of the work, to actively involve user and software organizations. The requirements elicitation is therefore only based on a literature study. This can lead to overlooking requirements important for practice or to over-emphasizing requirements that are less important in practice. However, the reference architecture will be verified by mapping existing architectures from industry onto it. This may help to reduce this weakness.

From the design dimension we can now derive general design decisions. As this reference architecture intends to give an overview of necessary functionality, existing technologies, how the functionality can be distributed onto existing software packages and how different packages interact within the architecture, the use of components, connectors and interfaces intuitively makes sense for the design. Policies and guidelines will be used to give decision rules in cases where multiple options, e.g. multiple types of database systems, exist to implement a certain functionality. The design of the reference architecture will be conducted in multiple phases, starting with an aggregated and abstract view and then stepwise refining the models and adding more detail and concreteness. According to the framework, the design will not be too detailed and definitely not too concrete, e.g. specifying an element to be a graph-based database system but not explicitly to be , to allow for a broader applicability of the reference architecture. Views according to the different levels of detail and abstraction will be included. Following the framework’s recommendation to use a semi-formal way to specify the design elements, I will use UML and some of its diagram types.

In Section 2.2.3 I determined the timing context to be that of a classical reference architecture. From that point of view it is only logical to base the reference architecture on existing architectural artifacts and technology and therefore to decide for a practice-driven design strategy. Note however that this classification is to some extent arguable. It is true that the involved technologies are available. The Google File System [119], MapReduce [94] as well as their open-source implementations within Hadoop [12] are just examples. However, supporting tools, e.g. for administration and metadata management, are still not completely mature, even though there are Apache projects aiming at these gaps. Some software companies relying on Hadoop, e.g. Cloudera [23], try to fill these gaps with their own solutions, some of them open-source, others proprietary. Furthermore, most published architectures focus on parts of the ‘big data’ space, while there is, to the best of my knowledge, no published concrete or reference architecture that aims at the space as a whole. Therefore the design strategy is to some extent hybrid, based on existing architectural artifacts and technologies where possible and suggesting possible solutions where not.

2.3 Related Work

2.3.1 Traditional BI and DWH architecture

Business intelligence is a widely but rather ambiguously used term. It typically describes all technologies, software applications and tools used to create business insights and understanding and to support business decisions. That includes the whole data lifecycle from data acquisition to data analysis and the back-flow of analysis results to adjust and improve business processes. However, the term is often used to describe not only software tools, but a holistic, enterprise-wide approach to decision support. This broader definition additionally incorporates analysis and decision processes, organizational standards, e.g. standardized key performance indicators, and practices and strategies, e.g. knowledge management [61, pp. 13-14].


Considering this, it is obvious that ‘big data’ falls into the area of business intelligence, at least with its analytical part, which is the scope of this thesis. Traditionally, the data warehouse is the software tool within this area that is responsible for integrating data, structuring and preparing it and storing it in a way that supports and is optimized for analytical applications. In that, data warehousing has a big overlap in tasks with ‘big data’ solutions or, put differently, it is most influenced by the advent of ‘big data’ and its characteristics. Therefore it makes sense to give an overview of data warehousing principles and typical architectures. A data warehouse is an organization-wide, central repository for historical, physically integrated data from several sources, applications and databases. The data is prepared, structured and stored with the aim of facilitating access to this data for analysis and decision support purposes [61, 82, 100, 188]. Additionally, in its initial definition by Inmon [135], a data warehouse is subject-oriented (the data is structured to represent real-world objects, e.g. products and customers), time-variant (data is loaded in time intervals and stored with appropriate timestamps to allow comparison and analysis over time) and non-volatile (data is stable and is not changed or deleted once it was loaded). Yet, in practice the characteristics of Inmon’s definition were often criticized for being both too strict and not significant enough, e.g. by Bauer and Günzel [61, pp. 7-8] and Jiang [138], and the focus of a data warehouse is on the integrated view of data optimized for analysis purposes. Data warehousing then describes the whole technical process of extracting data from its sources, integrating, preparing and storing it. There is however an ambiguous use of the term data warehouse as either denoting an information system supporting the whole data warehousing process, including data transformation routines, or only the central repository used to store the data. Based on this definition, Figure 2.3 shows an early, general data warehousing architecture described by Chaudhuri and Dayal [82].

Figure 2.3: Traditional Data Warehousing Architecture [82]

According to this, a traditional data warehousing architecture encompasses the following components [82]:
• data sources as external systems and tools for extracting data from these sources
• tools for transforming, that is cleaning and integrating, the data
• tools for loading the data into the data warehouse
• the data warehouse as central, integrated data store
• data marts as extracted data subsets from the data warehouse oriented to specific business lines, departments or analytical applications
• a metadata repository for storing and managing metadata


• tools to monitor and administer the data warehouse and the extraction, transformation and loading process
• an OLAP (online analytical processing) engine on top of the data warehouse and data marts to present and serve multi-dimensional views of the data to analytical tools
• tools that use data from the data warehouse for analytical applications and for presenting it to end-users

This architecture exemplifies the basic idea of physically extracting and integrating mostly transactional data from different sources and storing it in a central repository, while providing access to the data in a multi-dimensional structure optimized for analytical applications [53, 100, 188]. However, the architecture is rather old and, while this basic idea is still intact, it remains vague and imprecise in several respects. First, most modern data warehousing architectures use a staging or acquisition area between the data sources and the actual data warehouse [53, 100, 228][61, pp. 55-56]. This staging area is part of the extract, transform and load process (ETL process). It temporarily stores extracted data and allows transformations to be done within the staging area, so source systems are directly decoupled and no longer strained. Second, the interplay between the data warehouse and the data marts in the storage area is not completely clear. Actually, in practice this is one of the biggest debates about data warehousing architecture, with two architectural approaches proposed by Bill Inmon and Ralph Kimball [74].

Inmon places his data warehousing architecture in a holistic modeling approach of all operational and analytical databases and information in an organization, the Corporate Information Factory (CIF). What he calls the atomic data warehouse is a centralized repository with a normalized, still transactional and fine-granular data model containing cleaned and integrated data from several operational sources [135]. Subsets of the data from the centralized atomic data warehouse can then be loaded into departmental data marts, where they are optimized and stored oriented at analysis purposes, typically online analytical processing (OLAP). Storing data OLAP-oriented means that it is transferred into a logical, multi-dimensional model, often called a data cube, where data is structured according to several dimensions, which represent real-world concepts, e.g. products, customers or time. In a relational database this multi-dimensional model is typically implemented either via the star or the snowflake schema. Both consist of a central fact table which contains different key figures as attributes and refers to surrounding dimension tables via foreign key relations. The dimension tables describe the real-life concepts that are used to structure the fact data, and their attributes. They are typically hierarchical, e.g. a time dimension containing the attributes hour, day, month and year. In the star schema the dimension tables are flat, while they are normalized in the snowflake schema. See Figure 2.4 for examples of both. Additionally, optimization for analysis purposes also includes the calculation of application-specific key figures, aggregation and view materialization, that is the pre-calculation and physical storage of data views that are often used by analytical applications and during analysis. An OLAP engine then provides access to this multi-dimensional data in the data marts, presents views of the data to front-end tools, e.g.
for reporting, dashboarding or ad-hoc OLAP querying, and allows navigation through the multi-dimensional model by translating OLAP queries into actual SQL queries that can be processed in the database18.
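To make the star schema and the role of such queries more tangible, the following self-contained Python sketch builds a tiny, hypothetical star schema in an in-memory SQLite database and runs a typical roll-up over it. Table and column names are invented for illustration; a real data mart would of course be far larger and typically live in a dedicated analytical database:

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A minimal star schema: one fact table referencing two flat dimension tables.
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER REFERENCES dim_product(product_id),
                          date_id    INTEGER REFERENCES dim_date(date_id),
                          units INTEGER, revenue REAL);
""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"), (2, "Novel", "Books")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)",
                [(1, "2013-01-15", "2013-01", 2013), (2, "2013-02-03", "2013-02", 2013)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 3, 2400.0), (2, 1, 10, 120.0), (1, 2, 1, 800.0)])

# A typical OLAP-style roll-up: revenue per product category and month.
for row in cur.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY p.category, d.month
"""):
    print(row)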

18This is only a brief summarization of these concepts, to get an idea of the building blocks in the overall data warehouse architecture. A complete description is out of scope and would be too extensive. For a definition of OLAP and its principles see the original paper by Codd et al. [86]. For a short, general introduction into all mentioned concepts see Chaudhuri and Dayal [82]. For an in-depth description of OLAP functionality, the multi-dimensional model and the star- and snowflake schema see e.g. Bauer and Günzel [61, pp. 114-130,201-338] or Kimball et al. [145, pp. 137-314].


Figure 2.4: Comparison of the star schema (left) and the snowflake schema (right) [82]

Inmon’s approach, also called the enterprise data warehouse architecture by Ariyachandra and Watson [53], is often considered a top-down approach, as it starts with building the centralized, integrated, enterprise-wide repository and then derives data marts from it to serve departmental analysis requirements. It is however possible to build the integrated repository and the derived data marts incrementally and in an iterative fashion. Kimball, on the other hand, proposes a bottom-up approach which starts with process and application requirements [142, 145]. With this approach, the data marts are designed first, based on the organization’s business processes, where each data mart represents data concerning a specific process. The data marts are constructed and filled directly from the staging area, while the transformation takes place between the staging area and the data marts. The data marts are analysis-oriented and multi-dimensional as described above. The data warehouse is then just the combination of all data marts, where the single data marts are connected and integrated with each other via the data bus and so-called conformed dimensions, that is, data marts use the same, standardized or ‘conformed’ dimension tables. If two data marts use the same dimension, they are connected and can be queried together via that identical dimension table. The data bus is then a net of data marts which are connected via conformed dimensions. This architecture (also called data mart bus architecture with linked dimensional data marts by Ariyachandra and Watson [53]) therefore forgoes a normalized, enterprise-wide data model and repository.

Based on some of these ideas, Bauer and Günzel [61, pp. 37-86] describe a more detailed reference architecture for data warehousing, see Figure 2.5. Note that they use the term data warehouse in a rather broad sense, where it encompasses the entire system including the extraction, transformation and load processes and procedures and not just the centralized repository. What becomes apparent in their reference architecture is the idea of data being processed and transformed in multiple stages, comparable to a pipeline, from the raw source data to data prepared and optimized for analysis. The pipeline processing gets triggered by the monitor, which tracks changes in the data sources. From there, the data first gets extracted into a temporary staging area, then cleaned and integrated in a first transformation step before it gets loaded into the basis database, which is a normalized, integrated and central repository comparable to Inmon’s atomic data warehouse. This represents the first part of the data warehousing pipeline, the integration area. Afterwards a second transformation step in the analysis area transforms the data into the multi-dimensional, analysis-oriented model and loads it into a central, derived database. Views from this central, multi-dimensional data store can then be loaded into analysis databases, that is data marts specific for departmental use or certain applications. Analysis tools are applied on top of these analysis databases / data marts. Compared to a typical Inmon architecture, Bauer and Günzel [61] therefore add the derived database as a central repository of multi-dimensional data before the data gets distributed into the data marts. In a sense, this central derived database can be seen as a version of Kimball’s data bus and it leads to

the fact that dimensions across data marts are standardized, which is not necessarily the case for Inmon architectures. The whole data warehousing pipeline gets controlled by a single data warehouse manager. It e.g. triggers the processing steps and rolls back or restarts failed tasks. Additionally, a metadata manager mediates the access to a central metadata repository, provides metadata to the processing steps where needed and extracts metadata along the processing pipeline.
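The staged character of such a pipeline can be sketched in a few lines of Python. The stage names loosely follow Bauer and Günzel's reference architecture, but the functions themselves are placeholders invented for illustration and do not implement real extraction or transformation logic:

# Illustrative skeleton of a staged data warehousing pipeline.
def extract_to_staging(sources):
    # Copy raw records from the sources into a temporary staging area.
    return [record for source in sources for record in source]

def integrate(staged_records):
    # First transformation step: cleaning and integration into the basis database.
    return [r for r in staged_records if r.get("valid", True)]

def build_derived(basis_records):
    # Second transformation step: derive the multi-dimensional, analysis-oriented model.
    cube = {}
    for r in basis_records:
        key = (r["product"], r["month"])
        cube[key] = cube.get(key, 0) + r["revenue"]
    return cube

sources = [[{"product": "Laptop", "month": "2013-01", "revenue": 800.0},
            {"product": "Laptop", "month": "2013-01", "revenue": 750.0, "valid": False}]]
print(build_derived(integrate(extract_to_staging(sources))))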

Figure 2.5: Data Warehouse Reference Architecture as adapted and translated from Bauer and Günzel [61]

2.3.2 Big Data architectures

To the best of my knowledge, there is currently no extensive and overarching reference architecture for analytical ‘big data’ systems available or proposed in the literature. One can however find several concrete, smaller-scale architectures. Some of them are industrial architectures and product-oriented, that is, they reduce the scope to the products of a certain company or of a group of companies. Some of them are merely technology-oriented or on a lower level. These typically omit a functional view and mappings of technology to functions. None of them really fits into the space of an extensive, functional reference architecture. To a large extent that is by definition, as these are typically concrete architectures.

One of those product-oriented architectures is the ‘HP Reference Architecture for MapR M5’ [25]. MapR is a company selling services around their Hadoop distribution. The reference architecture described in this white paper is then more an overview of the modules incorporated in MapR’s Hadoop distribution and of the deployment of this distribution on HP hardware. One could consider it a product-oriented deployment view, but it is definitely far from a functional reference architecture.


Oracle has also published several white papers on ‘big data’ architecture. In their first white paper [217], they described a very high-level ‘big data’ reference architecture along a processing pipeline with the steps ‘acquire’, ‘organize’, ‘analyze’ and ‘decide’. They keep it very close to a traditional information architecture based on a data warehouse, supplemented with unstructured data sources, distributed file systems or key value stores for data staging, MapReduce for the organization and integration of the data and additional sandboxes for experimentation. Their reference architecture is, however, just a mapping of technology categories to the high-level processing steps; they provide little information about interdependencies and interaction between modules. They do provide three architectural patterns, which can be useful even if they are somewhat trivial and mainly linked to Oracle products. These patterns are (1) mounting data from the Hadoop Distributed File System via virtual table definitions and mappings into a database management system so that it can be queried directly with SQL tools, (2) using a key value store to stage low-latency data and provide it to a streaming engine, while using Hadoop to calculate rule models and provide them to the streaming engine for processing the real-time data and raising alerts if necessary, and (3) using the Hadoop file system and key value stores as staging areas from which data is processed via MapReduce, either for advanced analytics applications (e.g. text analytics, data mining) or for loading the results into a data warehouse, where they can be further analyzed using in-database analytics or accessed by further business intelligence applications (e.g. dashboards). The key principles and best practices focussed on in the paper are, first, to integrate structured and unstructured data, traditional data warehouse systems and ‘big data’ solutions, that is, to use e.g. MapReduce as a pre- and post-processor for traditional, relational sources and link the results back. The second key principle they mention is to plan for and facilitate experimentation in a sandbox environment. In a second, more recent white paper [101], they refresh the idea of using distributed file systems and NoSQL databases, especially key value stores, for data acquisition and staging and MapReduce for data organization and integration, while results are written back to a relational data warehouse for in-database analytics and as a structured source for other analytical applications. They do this, however, in a very product-oriented way, mainly mapping Oracle products onto the different stages.
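The third pattern, staging raw data, aggregating it and loading the results into a relational warehouse, can be illustrated with a small, hypothetical Python sketch. It mimics the map and reduce phases in plain Python over a few staged log lines and writes the aggregates into an SQLite table standing in for the data warehouse; it is of course not a substitute for an actual Hadoop job:

import sqlite3
from collections import Counter

# Staged, semi-structured input: raw click log lines (hypothetical).
staged_lines = [
    "2013-07-01 /products/laptop 200",
    "2013-07-01 /products/novel 200",
    "2013-07-02 /products/laptop 404",
]

# 'Map' phase: emit the requested page per line; 'reduce' phase: sum the counts per page.
pages = (line.split()[1] for line in staged_lines)
page_counts = Counter(pages)

# Load the aggregated results into a relational table for further analysis.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE page_hits (page TEXT, hits INTEGER)")
con.executemany("INSERT INTO page_hits VALUES (?, ?)", page_counts.items())
print(con.execute("SELECT * FROM page_hits ORDER BY hits DESC").fetchall())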

A third paper from Oracle [77] takes the idea of incorporating ‘big data’ technologies and knowledge discovery through data mining into a traditional information architecture environment and covers in more detail two process approaches to organizationally arrive at an integrated architecture, starting from a traditional enterprise data warehouse system. Both of them add a knowledge discovery layer to the data warehouse architecture, which contains an ‘analytical discovery sandbox’. These sandboxes are then used to experiment with how ‘big data’ can be used to derive new knowledge. The derived knowledge as well as the used ‘big data’ capabilities are then incorporated into the enterprise data warehouse architecture, either into the ETL process or as a pool for unstructured data in the foundational layer linked to the basis data warehouse. Derived knowledge in the form of calculated rule models can also be incorporated into a complex event processing engine.

Another reference architecture is proposed by Soares [202, pp. 237-260], who describes it as part of his ‘big data’ governance framework (see Figure 2.6). Again, the proposed reference architecture is rather high-level, and while it provides a good overview of software modules and products applicable in ‘big data’ settings, it provides little information about interdependencies and interaction between these modules. Furthermore, the semantics are not clear. There is e.g. no explanation of what the three arrows mean. It is also not clear what the layers mean, e.g. whether there is a chronological interdependency or whether the layers are ordered by usage relations. They are also on different levels and there are some overlaps between layers. Data warehouses and data marts, e.g., are implemented using databases; that is, they are technically on different levels. A usage relation would be applicable, but this does not work for data warehouses and big data sources, as those are different systems and both on the same, functional level. An overlap exists e.g. between Hadoop Distributions and the Open Source


Foundational Components (HDFS, MapReduce, Hadoop Common, HBase). Therefore, the reference architecture can give some ideas about what functionality and software to take into account, but it is far from a functional reference architecture, which is the objective of this thesis.

Figure 2.6: A reference architecture for big data taken from Soares [202, p. 239]

In academia there is, to the best of my knowledge, also no proposal for an overarching reference architecture. Most research effort and architectural work is on a lower level, targeting specific platforms, concrete systems or software. One example of this is the ASTERIX project [47, 64, 71, 72]. It describes a concrete architecture consisting of a data management system that performs queries in a self-developed query language, AsterixQL, an algebraic abstraction layer which also serves as a virtual machine to ensure compatibility with other query languages, e.g. HiveQL, and a data parallelization platform called Hyracks as the foundation. The project aims at developing a parallel processing framework over different levels as an alternative to Hadoop, as the authors see several flaws in the architecture of the Hadoop stack. Furthermore, there are several publications describing best practices and patterns for ‘big data’ systems, e.g. by Kimball [143, 144]. Marz and Warren [164] also describe an architecture which they call the Lambda Architecture. What they describe is however more a general pattern for structuring an architecture based on the principles of immutability and human fault-tolerance. The architecture is based merely on technical considerations and characteristics and also does not incorporate a functional view along the data processing pipeline. It provides however a very useful pattern for separating high-latency batch processing from processing data for low-latency requirements, together with a corresponding distribution of data storage, with the aim to isolate and reduce complexity. These best practices and patterns will be incorporated into my ‘big data’ reference architecture in Chapter 4.
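As an illustration of the separation this pattern proposes, the following Python sketch keeps an immutable master list of events, a periodically recomputed batch view and a small real-time view for events that arrived after the last batch run; queries merge both views. Names and the event format are invented for illustration and greatly simplified compared to Marz and Warren's description:

from collections import Counter

# Immutable master data set: raw events are only ever appended, never updated.
master_events = [("page_a", 1), ("page_b", 1), ("page_a", 1)]

def recompute_batch_view(events):
    # Batch layer: recompute the view from scratch over all raw events (high latency).
    return Counter(page for page, _ in events)

batch_view = recompute_batch_view(master_events)

# Speed layer: incrementally maintained view for events not yet covered by the batch view.
realtime_view = Counter()

def handle_new_event(event):
    master_events.append(event)      # append to the immutable master data set
    realtime_view[event[0]] += 1     # low-latency incremental update

def query(page):
    # Serving layer: merge batch and real-time views at query time.
    return batch_view[page] + realtime_view[page]

handle_new_event(("page_a", 1))
print(query("page_a"))  # 2 from the batch view + 1 from the real-time view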

3 Requirements framework

To build the reference architecture on solid ground and to make evidence-based instead of arbitrary design decisions, it is necessary to base it on a set of requirements. Extracting, formulating and specifying requirements helps to understand a problem in detail and is therefore a prerequisite for designing a solution to that problem. This specification is the subject of this chapter. In Section 3.1 I will first define the term requirement and then discuss how requirements can be described and structured. I will also develop a specification template. Afterwards, in Section 3.2, I will use this template to describe the requirements the reference architecture needs to tackle. These will later be used in Chapter 4 to base design decisions on.

3.1 Requirements Methodology

Motivation for specifying requirements Understanding the problem and its requirements is important for designing a solution, but also for users afterwards to understand the solution, which problem it solves and how to apply it. To make matters more concrete, in this case the problem is handling different types of ‘big data’ and the solution is a reference architecture. At design time of this reference architecture, requirements help to understand ‘big data’ and the challenges it creates, to focus on the problem and to reason about design decisions. Of course, a reference architecture is an abstract construct and needs to be applicable in different situations, where it gets realized in the form of a concrete architecture for that particular situation. This part of realizing a concrete architecture is the application. When applying it in a concrete context, the requirements description helps to identify whether the reference architecture is applicable and feasible for that context, but it can also serve as inspiration or input for the concrete requirements of the respective situation. While a reference architecture is typically broad in order to be applicable in a variety of situations, a concrete architecture project normally selects the necessary parts and only implements those. Therefore, a subset of this abstract requirement set can be chosen and the individual requirements can be concretised by filling placeholders to match the concrete project situation. Placeholders will be referred to in the requirements description by , etc.

Definition of the term ‘requirement’ Requirements specify what a system should do, how it should behave, which qualities it should show along the way and within which borders or constraints this behaviour should take place to provide value. In this sense requirements are often categorised into functional requirements1, non-functional

1 actions the system performs


requirements2 and constraints on the development process or the design of the system [146, pp. 6-7][227, pp. 7-12][189, pp. 1-11].

Scope of the requirements specification One thing to note is that requirements are about what to implement and not how to implement it. The latter is about architecture and design decisions. Therefore the need for scalability e.g. creates a non-functional requirement, while the usage of parallel processing to ensure scalability is not a requirement but an architectural decision. Another point worth mentioning is that functional requirements are often at the application level, that is, they are about what functionality a system should deliver to the end-user. A functional requirement might e.g. be to calculate a sentiment score for a certain product based on tweets. As the reference architecture should not be application-specific, but focus on the infrastructure and support different application scenarios on top, functional requirements on the application level will not be part of this specification. However, there will be functional requirements on the infrastructure level, such as to manage metadata throughout the data processing steps. Furthermore, requirements should typically be very specific and measurable. This is however not feasible in the context of a reference architecture. The exact measure or fit criterion to judge fulfilment of a requirement is very situation-dependent. While it is important for the reference architecture to e.g. specify a requirement to ensure latency, it is very application-dependent whether the latency needs to be within microseconds or whether even 5 seconds are sufficient. Therefore, exact measures and fit criteria also need to be omitted.

Structure of the requirements specification As explained, requirements for a reference architecture are typically more abstract. Therefore it is not feasible to exactly adopt a template for requirements specification3. I used the Volere template as an inspiration, but adjusted the attributes used to describe individual requirements. The specification will be structured hierarchically for clarity reasons. First, the requirements will be organized below five high-level goals. These goals are directly derived from the characteristics of ‘big data’ described in Chapter 2.1 and are therefore:
VOL Handle data volume
VEL Handle data velocity
VAR Handle data variety
VER Handle data veracity
VAL Create data value
All requirements will be grouped by these high-level goals and will be identified accordingly, e.g. as ‘VOL1’. Of course, sometimes a requirement might support several of those goals. In that case, all related goals will be listed, but the requirement identifier will follow the one where the relation is strongest. Furthermore, requirements themselves can be hierarchical, that is, a requirement can be supported by several sub-requirements. The identifiers of those sub-requirements will be grouped accordingly. A sub-requirement of ‘VOL1’ might e.g. have the identifier ‘VOL1.1’. Beyond this structure, requirements can have three more relationship structures. One is a link to supporting literature, that is a reference to literature which mentions the requirement. The second relationship structure are dependencies on other requirements. Those other requirements do not need to be within the same hierarchy. If a requirement is dependent on another, this means that they

2 qualities or properties of the system
3 e.g. the Volere Template [189, pp. 393-472]

are either complementary or even strongly dependent, that is one cannot be implemented without implementing the other. The third relationship type is conflicts with other requirements. If a requirement conflicts with another requirement, implementing one of them makes it harder or even impossible to implement the other. These relationships help deepen the understanding of the problem at hand and of how different aspects play together. Once a concrete architecture has been developed, they also help to guide requirements evolution. If a certain requirement changes, they point to related requirements that might also need to be adjusted or at least thought about.

This leads to the following attributes for individual requirements:

Req. ID: Identifier of the requirement
Req. Type: Functional / Non-functional
Parent Req.: Parent requirement in the requirements hierarchy (if any)
Goals: High-level goals the requirement is supporting
Description: A one-sentence specification of the requirement, including placeholder values for concretization
Rationale: Reasoning or justification for the requirement
Dependencies: Dependencies to other requirements and their type (complementary / strongly dependent)
Conflicts: Conflicts with other requirements and their type (competing / impossible)
Literature: References to literature that supports this requirement
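To make the template more tangible, the following is a minimal sketch (in Python, purely illustrative) of how these attributes could be captured as a data structure when managing the requirement set of a concrete project; the class, the field names and the example values are assumptions for illustration and not part of the reference architecture itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Requirement:
    """One entry of the requirements template above (attribute names follow the template)."""
    req_id: str                    # e.g. 'VOL1'
    req_type: str                  # 'Functional: ...' or 'Non-functional: ...'
    goals: List[str]               # high-level goals: VOL, VEL, VAR, VER, VAL
    description: str               # one-sentence specification, with placeholders filled in
    rationale: str = ""
    parent: Optional[str] = None   # parent requirement in the hierarchy, if any
    dependencies: Dict[str, str] = field(default_factory=dict)  # id -> 'complementary' / 'strongly dependent'
    conflicts: Dict[str, str] = field(default_factory=dict)     # id -> 'competing' / 'impossible'
    literature: List[str] = field(default_factory=list)

# Hypothetical instantiation for a concrete project, with placeholders already filled in:
vol1 = Requirement(
    req_id="VOL1",
    req_type="Functional: Data Storage",
    goals=["VOL", "VAR"],
    description="The system shall store data up to a volume of 500 TB "
                "in the formats CSV, JSON and plain text.",
    dependencies={"VAR1": "strongly dependent", "VAL4.1": "complementary"},
    conflicts={"VOL2": "competing", "VOL3": "competing"},
)
```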

3.2 Requirements Description

This section will now instantiate the requirements template with concrete requirements and is structured after the overarching goals to handle data volume, data velocity, data variety and data veracity as well as to create value, as described in Chapter 3.1. Each section starts with an overview diagram of the requirements categorized under the respective goal and their dependencies and conflicts with each other, but also with requirements corresponding to other goals. The relationships between requirements are displayed as plus, equal or minus signs. A plus sign indicates complementary requirements, that is implementing one of the requirements will make it easier to fulfil the other one or will support its fulfilment. An equal sign shows that two requirements are strongly dependent on each other and that variables specified in both should be synchronized. It e.g. makes no sense to build a system that can extract data of a certain format but cannot process or store it. Finally, a minus sign refers to competing requirements, that is implementing one of them will make it harder to implement the other one. For such requirements there is typically the need to establish some trade-off. Requirements are furthermore coloured to depict their categorization. See Figure 3.1 for a legend of those colours. The left side of the figure shows the colours of the overarching goals, while the right side shows the colours of related requirements. Afterwards the different requirements will be listed and described in more detail.


Figure 3.1: Requirements Visualization: Legend

3.2.1 Requirements aimed at Handling Data Volume

Volume Requirements Overview

Handling a growing volume of data obviously requires the ability to store that data4 and to process it with the aim of getting valuable results. While requirements specifying the concrete processing functionality are mainly part of Chapter 3.2.5 and described in the context of creating value, some requirements need to be fulfilled to enable this value-creating processing in the first place. First, the system should provide the necessary query performance5, that is respond to queries within a certain time frame or with a certain latency, given a static data volume. Second, the system needs to be scalable6, in the sense that additional resources can be added to keep that query performance stable as the data volume grows [45, 132, 141]. Storage, performance and scalability all influence each other. If the system stores and uses a lot of data, this volume makes it harder to provide a reasonable query performance, while scalability helps to meet the performance requirement when a larger data volume is stored in the system. This also means that a large data volume and a high-volume storage requirement put pressure on the scalability requirement, as the system needs to be more scalable to process that amount of stored data.

4 see requirement VOL1
5 see requirement VOL3
6 see requirement VOL2


Figure 3.2: Requirements Visualization: Data Volume View

Furthermore, there is typically a trade-off between the required amount of storage and administrative or system management requirements. The latter often require storing additional metadata and therefore increase the storage need. Another factor is the pre-calculation of intermediate results, rules and models, which can again help to increase the query performance, but also to handle velocity requirements. The latter is also true for an improvement of performance in itself. On the other hand, it is also possible to take measures and formulate respective requirements that help to manage and decrease the necessary storage. This is typically part of the notion of data lifecycle management, mainly data archiving and data compression. Another option is to filter out unnecessary data during the extraction process and only store what is needed. In general, one should note that the requirements for data extraction and data storage should be synchronized. This is especially true because data storage is labelled as a volume requirement, but also has a variety component. In most cases it does not make sense to require the storage of particular data formats if data sources of this format are not to be extracted. There might be exceptions, e.g. data that gets extracted and is directly processed for an analytics task without this data being persisted in the meantime. However, these parameters should be synchronized and if there is a gap between both, there should be a specific and documented reason for it.
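As a back-of-the-envelope illustration of the storage considerations above, the following sketch estimates the required ‘hot’ storage when extraction-time filtering, compression, archiving after a retention period and a metadata overhead are combined; every figure is a hypothetical example value, not a recommendation.

```python
# Illustrative storage sizing; every number below is a hypothetical example value.
raw_ingest_per_day_tb = 2.0   # raw data volume extracted per day
filter_keep_ratio     = 0.6   # share of extracted data kept after filtering (cf. VAR1.1)
compression_ratio     = 0.3   # compressed size relative to the original (data lifecycle management)
hot_retention_days    = 90    # data older than this gets archived to cheaper storage
metadata_overhead     = 0.05  # extra storage for metadata (cf. VAR4.1), relative to stored data

stored_per_day_tb = raw_ingest_per_day_tb * filter_keep_ratio * compression_ratio
hot_storage_tb = stored_per_day_tb * hot_retention_days * (1 + metadata_overhead)

print(f"Stored per day: {stored_per_day_tb:.2f} TB")
print(f"Hot storage needed for {hot_retention_days} days: {hot_storage_tb:.1f} TB")
```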


Volume Requirements Specification

Table 3.1: Requirement VOL1 - Storing the Data

Req. ID: VOL1
Req. Type: Functional: Data Storage
Parent Req.: -
Goals: VOL, VAR
Description: The system shall store data up to a volume of in the following formats .
Rationale: This requirement is directly related to data volume and data variety as described in Chapters 2.1.2 and 2.1.4. Obviously, a system aiming at analysing large amounts of data also needs to store this data and the results. If data storage involves a lot of different formats, it makes sense to use this as a general requirement and create a sub-requirement for each data format that needs to be stored.
Dependencies: VAR1: Extracted data obviously needs to get stored. The formats specified in both requirements should be consistent. VAR1.1: Filtering out data decreases the storage need. VAL4.1: Compressing data decreases the data volume and therefore the amount of storage needed. VAL4.2: Archiving data decreases the data volume and therefore the amount of storage needed. VEL1.2: Similar to the extraction of data-at-rest, streaming data acquired for later use needs to get stored. If streaming data needs to get acquired, requirements should be synchronized and the respective data format considered for storage.
Conflicts: VOL2, VOL3: The more data needs to be stored, the more scalable the system needs to be to handle the data and the harder it gets to provide performance. VAR4.1: Storing additional metadata increases the storage need. VEL1.1.1.2: Pre-calculated intermediate results, models or rules need to be stored and therefore increase the storage volume required.
Literature: -

Table 3.2: Requirement VOL2 - Scaling with Growing Data Volume and Workload

Req. ID: VOL2
Req. Type: Non-functional: Scalability
Parent Req.: -
Goals: VOL, VEL
Description: The system shall be scalable, in the sense that the processed data volume per time unit can be improved by adding hardware resources while making use of the additional resources in a linear manner with a factor of at least .
Rationale: Fulfilling this requirement ensures that the system can be enhanced with hardware resources to keep the response time constant on the level specified in VEL1, while the data volume is increasing over time. As described in Chapter 2.1.2, data volume is expected to be growing and it is necessary to efficiently (see the scaling factor) scale the system with that data volume.
Dependencies: VOL3: Scalability allows to keep performance while growing the data volume.
Conflicts: VOL1: The more data needs to be stored, the more scalable the system needs to be to handle the data.
Literature: [45, 132, 141]
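The linear scaling factor in VOL2 can be read as a lower bound on how much of each added node's capacity is effectively usable, i.e. throughput(n) >= factor * n * throughput(1). A minimal sketch of that interpretation, with purely hypothetical numbers:

```python
import math

# Hypothetical figures for illustration only.
throughput_single_node = 1.0   # data volume one node can process per time unit (e.g. TB/h)
scaling_factor = 0.8           # the linear scaling factor required by VOL2 (placeholder value)
target_volume_per_hour = 24.0  # workload the system must sustain to keep the required response time

def min_nodes(target: float, per_node: float, factor: float) -> int:
    """Smallest node count n with factor * n * per_node >= target."""
    return math.ceil(target / (factor * per_node))

print(min_nodes(target_volume_per_hour, throughput_single_node, scaling_factor))  # -> 30
```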


Table 3.3: Requirement VOL3 - Providing Sufficient Performance when Answering Queries

Req. ID: VOL3
Req. Type: Non-functional: Query Performance
Parent Req.: -
Goals: VOL, VEL
Description: The system shall respond to a query that involves within a response time of , while the system runs on .
Rationale: This requirement also supports timeliness mentioned in Chapter 2.1.3, but it refers to query performance in general. While timeliness refers to analysing and getting results directly when data flows in, query performance is also applicable for batch processing of data and processing stored data at a later point in time. Performance is, of course, one of the supporting factors to achieve timeliness, but reasonable performance is also necessary for batch and ad-hoc processing to ensure user satisfaction. If necessary, VOL3 can be decomposed into sub-requirements, which specify the necessary query performance dependent on the analysis tasks.
Dependencies: VOL2: Ensuring a constant response time with growing data volume requires scalability. VEL1.1.1.1: Query performance supports the timely analysis of inflowing data. VEL1.1.1.2: Pre-calculated intermediate results, models or rules can, if they are applicable, improve the performance of queries in general. VAL2.3: The abstraction away from implementation details in declarative query languages typically allows query translation and execution in those languages to be highly optimized, both logically and physically. Therefore it frees programmers from doing this optimization in lower-level code and prevents the usage of unoptimized code.
Conflicts: VOL1: The more data needs to be stored, the harder it gets to provide performance.
Literature: -

3.2.2 Requirements aimed at Handling Data Velocity

Velocity Requirements Overview

Requirements aiming at velocity mainly tackle the idea of stream processing [62], that is to process data directly while it flows in. The challenge here lies in the speed of the incoming data and is therefore best reflected by a couple of non-functional requirements, which pose constraints on the rate of processing of incoming data. As described in Section 3.2.2, two parts of that challenge can be distinguished. One is to create insights from the inflowing data and react to it in time [166, 213]. That is the timeliness challenge7 [45]. This can be broken down into the two phases of the feedback loop, analysing inflowing data8 and reacting according to these insights9 [11, 231]. As the reference architecture presented in this thesis focusses on the analytical side of ‘big data’, the reaction itself will be out of scope and part of an operational system. It is however required to communicate with that system and to trigger the reaction. Handling the timeliness challenge typically does not allow for a deep analysis.

7 see requirement VEL1.1 and sub-requirements
8 see requirements VEL1.1.1 and sub-requirements
9 see requirement VEL1.1.2 and sub-requirement


Figure 3.3: Requirements Visualization: Data Velocity View

Such a deep analysis would compare inflowing data against a large amount of historical data; timeliness instead requires the pre-calculation of models or rules the streaming data can be matched against10. The other challenge is handling the acquisition rate11. This refers to acquiring data from data streams and storing it in the system [45, 147, 213], and typically to processing a large amount of rather small transactions while maintaining a persistent state. There is a natural interaction between velocity and volume requirements. It is intuitively clear that the faster data flows in, the bigger the data volume gets. The overarching goals of volume and velocity are therefore already interwoven. On the requirements level this leads to an obvious conflict between the acquisition rate and data storage. The higher the acquisition rate, the more data will be stored over time and the harder it will get to store all of it. The pre-calculation of models also slightly stretches the storage requirements, as those models need to be stored. On the other hand, these models can improve the general query performance if they are applicable during query processing. Furthermore, the acquisition rate challenge can be made easier by filtering out data and decreasing the effective inflow rate, that is the rate of inflowing data that actually needs to be stored. On the other hand, requiring the extraction of metadata during the inflow takes time, slows down the process and makes it harder to conform to the required acquisition rate.

10 see requirement VEL1.1.1.2
11 see requirement VEL1.2 and sub-requirement
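A minimal sketch of how the two velocity challenges interact: a model pre-computed from data at rest (here simply per-sensor thresholds) is applied to inflowing events to trigger a timely reaction, while a filter decides which events get acquired for storage. The event format, names and thresholds are invented for illustration.

```python
# Hypothetical pre-computed model (cf. VEL1.1.1.2): thresholds derived from historical data at rest.
precomputed_thresholds = {"sensor-a": 75.0, "sensor-b": 60.0}

def process_stream(events, store, alert):
    """Apply the pre-computed model to each inflowing event (cf. VEL1.1.1)
    and acquire a filtered subset for later analysis (cf. VEL1.2, VAR1.1)."""
    for event in events:                      # event: dict with 'sensor' and 'value'
        threshold = precomputed_thresholds.get(event["sensor"])
        if threshold is not None and event["value"] > threshold:
            alert(event)                      # communicate the insight to an operational system (cf. VEL1.1.2)
        if event["value"] > 0:                # acquisition filter: drop obviously useless readings
            store(event)                      # persist for later batch analysis (cf. VOL1)

# Example usage with in-memory sinks:
stored, alerts = [], []
process_stream(
    [{"sensor": "sensor-a", "value": 80.0}, {"sensor": "sensor-b", "value": 10.0}],
    store=stored.append, alert=alerts.append,
)
```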


Velocity Requirements Specification

Table 3.4: Requirement VEL1 - Handling Streaming Data while it is flowing in

Req. ID: VEL1
Req. Type: Non-Functional: Handle Streaming Data
Parent Req.: -
Goals: VEL
Description: The system shall handle data while it is flowing in with a rate of up to .
Rationale: This requirement is directly related to data stream processing and handling velocity as described in Chapter 2.1.3. It is a parent requirement that contains timeliness and acquisition rate as the main challenges of stream processing.
Dependencies: -
Conflicts: -
Literature: [11, 45, 62, 83, 92, 141, 147, 166, 213, 231]

Table 3.5: Requirement VEL1.1 - Reacting on-time to Streaming Data

Req. ID: VEL1.1
Req. Type: Functional: Analyse Streaming Data
Parent Req.: VEL1
Goals: VEL, VOL
Description: The system shall conduct analysis tasks on streaming data, create insights as specified in VEL1.1.1 and react to them as specified in VEL1.1.2 while data is flowing in with a rate as specified in VEL1.
Rationale: This requirement is directly related to the timeliness problem and the feedback loop as described in Chapter 2.1.3. It is an overarching requirement, which contains the analysis of streaming data and the reaction to it and represents the feedback loop as a whole. It allows the system to react directly to inflowing business transactions and increases business agility.
Dependencies: -
Conflicts: -
Literature: [11, 45, 62, 92, 166, 213, 231]

Table 3.6: Requirement VEL1.1.1 - Creating Insights from Streaming Data

Req. ID: VEL1.1.1
Req. Type: Functional: Create Insights from Data Streams
Parent Req.: VEL1.1
Goals: VEL, VOL
Description: The system shall conduct analysis tasks on streaming data and create insights as specified in while data is flowing in.
Rationale: This requirement is directly related to the timeliness problem as described in Chapter 2.1.3. It addresses the first part of the feedback loop, that is to timely create insights on inflowing data, and is a parent requirement to functional requirements , which specify the single steps and methods for analysing streaming data.
Dependencies: VEL1.1.2: Creating insights from streaming data is just one part of the feedback loop. These insights need to be applied and reacted on. : Add dependencies to the functional requirements that specify the analysis tasks.
Conflicts: -
Literature: [11, 45, 62, 92, 166, 213, 231]


Table 3.7: Requirement VEL1.1.1.1 - Creating Insights from Streaming Data - Timeliness

Req. ID: VEL1.1.1.1
Req. Type: Non-Functional: Timeliness Analysis
Parent Req.: VEL1.1.1
Goals: VEL, VOL
Description: The system shall create insights from streaming data as specified in VEL1.1.1 within a time frame of while data is flowing in with a rate as specified in VEL1.
Rationale: This requirement is directly related to the timeliness problem as described in Chapter 2.1.3. It addresses the first part of the feedback loop, that is to timely create insights on inflowing data, and specifies the time horizon that is acceptable for creating those insights.
Dependencies: VOL3: Timely analysis of streaming data gets supported by query performance in cases where it involves querying existing data within the system. VEL1.1.1.2: Pre-calculated models or rules can directly be applied to inflowing data and rapidly improve the analysis time compared to analysing it against all stored data. VEL1.1.2.1: The timely analysis and insight creation from streaming data is a necessary prerequisite for reacting on streaming data in time; the faster insights are created, the faster they can be applied.
Conflicts: -
Literature: [11, 45, 62, 92, 166, 213, 231]

Table 3.8: Requirement VEL1.1.1.2 - Creating Insights from Streaming Data - Pre-computation

Req. ID: VEL1.1.1.2
Req. Type: Functional: Pre-calculate models
Parent Req.: VEL1.1.1
Goals: VEL, VOL
Description: The system shall pre-compute the following rules and models from the persistent data available in the system and apply these models to process streaming-in data.
Rationale: The analysis of inflowing data often involves comparing the new data to a large amount of similar, historical data. It is not feasible to do this in real-time. Therefore it is necessary to create rules or models that are applicable to the streaming data to create the required insights, but also to create suitable index structures to quickly find relevant historical data for comparison. This combines at-rest analysis, which processes large amounts of available data to compute applicable models, with real-time analysis, which applies these models to inflowing data.
Dependencies: VEL1.1.1.1: Pre-calculated models or rules can directly be applied to inflowing data and rapidly improve the analysis time compared to analysing it against all stored data. VOL3: Pre-calculated intermediate results, models or rules can, if they are applicable, improve the performance of queries in general.
Conflicts: VOL1: Pre-calculated intermediate results, models or rules need to be stored and therefore increase the storage volume required.
Literature: [45][231, pp. 9-14]


Table 3.9: Requirement VEL1.1.2 - Using Insights from Streaming Data

Req. ID: VEL1.1.2
Req. Type: Functional: Apply Insights from Data Streams
Parent Req.: VEL1.1
Goals: VEL, VOL, VAL
Description: The system shall communicate the insights from VEL1.1.1 as specified in .
Rationale: This requirement is directly related to the timeliness problem as described in Chapter 2.1.3. It addresses the second part of the feedback loop, that is to react to inflowing data in (near) real-time. Note that the actual reaction to inflowing data will not happen within the analytical system, which is in scope of this reference architecture, but the insights will be sent back to operational systems or as alerts to employees to be handled. The actual reaction is out of scope of this reference architecture. This requirement is a parent requirement to functional requirements , which specify formats, steps and partners of the communication.
Dependencies: VEL1.1.1: To communicate, apply and react on insights from streaming data, these insights obviously need to be created first. : Add dependencies to the functional requirements that specify how, in which formats and to which systems / recipients to communicate the insights.
Conflicts: -
Literature: [11, 45, 62, 92, 166, 213, 231]

Table 3.10: Requirement VEL1.1.2.1 - Using Insights from Streaming Data - Timeliness

Req. ID: VEL1.1.2.1
Req. Type: Non-Functional: Timeliness Reaction
Parent Req.: VEL1.1.2
Goals: VEL, VOL, VAL
Description: The system shall communicate the insights from VEL1.1.1 as specified in VEL1.1.2 within a time frame of while data is flowing in with a rate as specified in VEL1.
Rationale: This requirement is directly related to the timeliness problem as described in Chapter 2.1.3. It addresses the second part of the feedback loop, that is to timely react to insights on inflowing data, and specifies the time horizon that is acceptable for communicating those insights to operational systems.
Dependencies: VEL1.1.1.1: The timely analysis and insight creation from streaming data is a necessary prerequisite for reacting on streaming data in time; the faster insights are created, the faster they can be applied.
Conflicts: -
Literature: [11, 45, 62, 92, 166, 213, 231]


Table 3.11: Requirement VEL1.2 - Acquiring and Processing Streaming Data

Req. ID: VEL1.2
Req. Type: Functional: Process Streaming Data
Parent Req.: VEL1
Goals: VEL, VOL
Description: The system shall process and acquire data as specified in and store it.
Rationale: This requirement is directly related to the acquisition rate challenge as described in Chapter 2.1.3. It allows the system to acquire very fast inflowing data for later use and creates a basis for large OLTP-like workloads. This requirement is a parent requirement to functional requirements , which specify acquisition and pre-processing of streaming data before it gets stored. Note that pre-processing only refers to steps that are necessary before storing the data for later analytical use and not to operational processing of streaming data, which is out of scope of this reference architecture.
Dependencies: VOL1: Similar to the extraction of data-at-rest, streaming data acquired for later use needs to get stored. If streaming data needs to get acquired, requirements should be synchronized and the respective data format should be considered for storage.
Conflicts: -
Literature: [45, 83, 141, 147, 213]

Table 3.12: Requirement VEL1.2.1 - Acquiring and Processing Streaming Data - Acquisition Rate

Req. ID: VEL1.2.1
Req. Type: Non-Functional: Acquisition Rate
Parent Req.: VEL1.2
Goals: VEL, VOL
Description: The system shall acquire, pre-process and store streaming data as specified in VEL1.2 with an acquisition rate of while data is flowing in with a rate as specified in VEL1.
Rationale: This requirement is directly related to the acquisition rate challenge as described in Chapter 2.1.3 and could also be referred to as ‘write performance’. It specifies the necessary acquisition rate. The acquisition rate should typically be equal to the rate of inflowing data, or smaller in cases where not all inflowing data needs to be acquired.
Dependencies: VAR1.1: Filtering out data decreases the amount of data to store and therefore decreases the necessary acquisition rate and write performance.
Conflicts: VAR4.2: Extracting additional metadata during the data acquisition process prolongs the acquisition process and makes it harder to achieve the acquisition rate.
Literature: [45, 83, 141, 147, 213]


3.2.3 Requirements aimed at handling data variety

To keep the requirements visualization readable, it is split into two parts. The first part describes the requirements that aim at extracting information from varied sources and integrating it12. The second part describes managing metadata to describe the data and its sources and to keep them manageable and understandable13.

Variety Requirements Overview - Part 1

Figure 3.4: Requirements Visualization: Data Variety View 1 - Information Extraction and Integration

As described in Chapter 2.1.4, variety refers to deriving information from several, heterogeneous data sources [45, 83, 166][202, pp. 10-12, 143-209]. Doing this can be regarded as a three-step process. First, data needs to get extracted from these sources and loaded into the ‘big data’ system14. One important measure to consider for data extraction is data filtering15 [45]. This can help to satisfy several other requirements, namely to decrease the storage needs, to decrease the acquisition rate needs and to decrease uncertainty.

12 see Figure 3.4
13 see Figure 3.5
14 see requirement VAR1
15 see requirement VAR1.1


As a second step, in the case of unstructured or sometimes semi-structured data16, there is a need to impose some structure that can later be used for analysing the data17. Information extraction is the process of extracting this structure, e.g. in the form of entities mentioned in a text or a sentiment expressed by the author [45, 46, 104]. Typically there is a need to manage and apply several extractors to extract different kinds of information18. Identifying the sentiment of a text can be quite different from resolving entities [45, 46]. The third step is to integrate data from different sources, so they can be queried and analysed together to create overarching insights that are not available from analysing the different sources in isolation. On a higher level this means integrating multi-structured data [83, 141], after the necessary information extraction has been done as a pre-step. On a lower level, this means the general integration of heterogeneous data based on schema and entity integration19. Schema integration refers to defining a general, overarching schema and mapping the different data sources to it. This includes mappings and transformations on the field level, e.g. splitting a ‘name’ field into two fields ‘first name’ and ‘surname’ or calculating a field ‘salary after taxes’ from a field ‘salary before taxes’. There is also a great dependency between schema integration and metadata management. Schema integration requires high-quality metadata and information about the schemas of the different data sources to finally map those schemas onto an integrated one. Entity integration refers to identifying unique entities on the row level and collapsing information that is spread over several rows and data sources, but is actually about the same entity or object. Examples are collapsing information about two people ‘M. Maier’ and ‘Markus Maier’ or identifying that two keys refer to the same entity based on a mapping table. As described in Chapter 2.1.4, it is however important to note that it can be necessary to allow data analysts to access the raw source data20. If this is the case, raw data needs to be stored in the system in the same form as it gets extracted from the sources. The decision is then either to build a virtual, integrated schema, which can be queried and which transparently transforms queries into sub-queries on the different raw data sources based on a set of mapping and transformation rules, or to run information extraction and integration tasks as a regular batch job and store a persistent, integrated schema in addition to the raw data. This is obviously a trade-off between a decreased storage need (first option) and a performance gain for all analysis tasks that can run directly on the integrated schema (second option).
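To make the two integration levels concrete, the following sketch shows a field-level schema mapping (splitting a ‘name’ field) and a deliberately naive entity-resolution rule that collapses records about the same person; the mapping rules, the toy tax calculation and the records are invented examples, and a real system would use far more robust matching.

```python
# Schema integration (cf. VAR3.1): map a source record onto a hypothetical global schema.
def to_global_schema(source_record):
    first, _, last = source_record["name"].partition(" ")
    return {"first_name": first,
            "surname": last or first,
            "salary_after_taxes": source_record["salary_before_taxes"] * 0.7}  # toy tax rule

# Entity integration (cf. VAR3.2): collapse records that refer to the same entity.
def same_entity(a, b):
    # naive rule: same surname and matching first initial ('M. Maier' vs 'Markus Maier')
    return a["surname"] == b["surname"] and a["first_name"][0] == b["first_name"][0]

def collapse(records):
    merged = []
    for rec in records:
        match = next((m for m in merged if same_entity(m, rec)), None)
        if match:
            match.update({k: v for k, v in rec.items() if v})  # prefer non-empty values
        else:
            merged.append(dict(rec))
    return merged

rows = [to_global_schema({"name": "M. Maier", "salary_before_taxes": 50000}),
        to_global_schema({"name": "Markus Maier", "salary_before_taxes": 50000})]
print(collapse(rows))  # -> a single, merged tuple
```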

Variety Requirements Specification - Part 1

Table 3.13: Requirement VAR1 - Extracting Data from Varied Sources

Req. ID: VAR1
Req. Type: Functional: Data Extraction
Parent Req.: -
Goals: VAR
Description: The system shall extract data in the following formats from the following sources .

16 see requirement VAR2
17 see requirement VAR2.1
18 see requirements VAR2.1.1 and VAR2.1.2
19 see requirements VAR3.1 and VAR3.2
20 see requirement VAR3.3


Rationale: This requirement is directly related to data variety as described in Chapter 2.1.4. The requirement specifies the sources the system needs to acquire data from and the formats of this data. Acquiring data is the obvious pre-step of every data analysis. If data extraction involves a lot of different formats and sources, it makes sense to use this as a general requirement and create a sub-requirement for each source from which data needs to be acquired.
Dependencies: VOL1: Data that gets extracted should obviously be able to be stored. The formats specified in both requirements should be consistent.
Conflicts: -
Literature: [45, 83, 166][202, pp. 10-12, 143-209]

Table 3.14: Requirement VAR1.1 - Filtering Data during the Extraction

Req. ID: VAR1.1
Req. Type: Functional: Data Filtering
Parent Req.: VAR1
Goals: VOL, VEL, VAR, VER
Description: The system shall filter out data extracted from sources based on the following rules .
Rationale: It is often the case that only parts of the data from a source are needed or valuable for the analysis. In these cases it is reasonable to filter out unnecessary parts of the data. This can help to decrease the storage need and handle the acquisition rate challenge. It can also be reasonable to filter out very untrustworthy data.
Dependencies: VOL1: Filtering out data decreases the storage need. VEL1.2.1: Filtering out data decreases the amount of data to store and therefore decreases the necessary acquisition rate and write performance. VER1: Filtering out untrustworthy data can help to decrease the total uncertainty of the data within the system.
Conflicts: -
Literature: [45][203, pp. 138-139]

Table 3.15: Requirement VAR2 - Handling Data in Multiple Structures and Formats

Req. ID: VAR2
Req. Type: Non-Functional: Multi-structured data
Parent Req.: -
Goals: VAR
Description: The system shall handle multi-structured data in the following formats and from the following source , where handling means to store them, manage them, extract the necessary information and apply analysis tasks as specified in the other requirements.
Rationale: This requirement is directly related to data variety as described in Chapter 2.1.4 and addresses the diversity of structures. It is a parent requirement for several other requirements that tackle the challenge of data formats with different levels of structure. If a lot of different formats are involved, it makes sense to use this as a general requirement and create a sub-requirement for each source from which data needs to be acquired. can refer to structured, semi-structured (e.g. XML files possibly connected to a schema definition, but also concrete HTML pages) and unstructured (text, video and audio) data sources.


Dependencies: VOL1: Handling multi-structured data is the more general requirement, while the storage of the data is one of the necessary requirements to accomplish that. The formats specified in both requirements should be consistent. VAR1: It obviously only makes sense to extract data if it can also be handled. The formats specified in both requirements should be consistent.
Conflicts: -
Literature: [45, 46, 83, 130, 141, 166]

Table 3.16: Requirement VAR2.1 - Extracting Information from Unstructured Data

Req. ID: VAR2.1
Req. Type: Functional: Information Extraction
Parent Req.: VAR2
Goals: VAR, VAL
Description: The system shall extract (additional) machine-readable information from the following formats and from the following source and therefore impose structure onto this data.
Rationale: This requirement is directly related to data variety as described in Chapter 2.1.4 and is a sub-requirement of VAR2. To generally handle multi-structured data it is first necessary to extract machine-processable information out of the data and set it in relation, thereby imposing some structure.
Dependencies: VAR2: Information extraction is a sub-requirement of generally handling multi-structured data. should be a subset of the data formats specified in VAR2, as there might be sources that do not require information extraction to be processed (e.g. relational, but also semi-structured data like XML files).
Conflicts: -
Literature: [45, 46, 104]

Table 3.17: Requirement VAR2.1.1 - Applying Multiple Extractors for Information Extraction

Req. ID: VAR2.1.1
Req. Type: Functional: Apply extractors
Parent Req.: VAR2.1
Goals: VAR, VAL
Description: The system shall host and manage several extractors using different information extraction techniques and algorithms and apply a configurable subset of them to data from different sources.
Rationale: As Agrawal et al. point out [46], it is to be expected that a variety of extractors will be used to extract information from one and the same source. This can be valuable to tune the information extraction to a particular data source or the analysis task at hand. It is also the case that different information extractors aim at different types of information (e.g. entities, relationships or sentiment). It can make sense to create additional sub-requirements that list the necessary extractors and map them to data sources. This also depends on how closely data sources and information extractors are coupled and how flexible the linking should be.
Dependencies: VAR2.1: Information extraction is one necessary step to handle multi-structured data. VAR2.1.2: The information extracted by independent extractors needs to be stored and accessible in an integrated manner.
Conflicts: -
Literature: [46]


Table 3.18: Requirement VAR2.1.2 - Storing and Managing Multiple Extractors for Information Extraction

Req. ID: VAR2.1.2
Req. Type: Functional: Store extractors
Parent Req.: VAR2.1
Goals: VAR, VAL
Description: The system shall store the information extracted by different extractors according to requirement VAR2.1.1 in a structured form in one integrated data model and related to the source data.
Rationale: Information extracted from semi- and unstructured source data needs to be stored in an integrated way, so it is easier to keep an overview over the available information as well as how and from which source data it got created. This information also needs to be available for further analysis tasks and for end-users to search through.
Dependencies: VAR2.1: Information extraction is one necessary step to handle multi-structured data. VAR2.1.1: The information extracted by independent extractors needs to be stored and accessible in an integrated manner.
Conflicts: -
Literature: [45, 46]

Table 3.19: Requirement VAR3 - Integrating Data from Varied Sources

Req. ID: VAR3
Req. Type: Functional: Data Integration
Parent Req.: -
Goals: VAR, VER, VAL
Description: The system shall integrate heterogeneous data from the following sources and with the following formats into a unified view.
Rationale: As described in Chapter 2.1.4, analysis tasks in the ‘big data’ environment often span data from several sources involving several data formats and schemas. To do this in a meaningful way it is necessary to impose a global schema on top of the different sources and to harmonize data semantics.
Dependencies: VAR1, VOL1: The data that gets integrated needs first to be extracted and stored. The data sources and formats specified in VAR3 should therefore be a subset of those specified in VAR1 and VOL1, as it might be the case that not all extracted sources need to be integrated, e.g. if data from one source is only needed for a separate analysis task. VAR2.1: Information extraction is a necessary pre-step for integrating multi-structured data. Unstructured sources and formats that are listed in VAR3 should also be listed in VAR2.1. VER1: Data integration can resolve some of the issues of uncertainty, e.g. by harmonizing entities and adopting values from other data sources. VER1.1: An improved data quality by completing and correcting values can make it easier to do the data integration step.
Conflicts: -
Literature: [45, 67, 83, 128, 130, 169, 212]


Table 3.20: Requirement VAR3.1 - Integrating the Schema of Data from Varied Sources

Req. ID: VAR3.1
Req. Type: Functional: Schema Integration
Parent Req.: VAR3
Goals: VAR, VER, VAL
Description: The system shall maintain a global schema and a semantic mapping of the schemas of the different data sources specified in VAR3 onto that global schema. The global schema should be .
Rationale: As described in Chapter 2.1.4 and Chapter 2.1.6, data needs to be combined from different sources to provide overarching insights and value. However, even in the case of structured or semi-structured sources, their structure or schema typically differs. A global schema is a unique schema, visible to the users, which is mapped to the schemas of the source data. This schema can be either virtual, that is queries to the global schema are translated according to the mapping and passed to the original data sources at runtime, or persistent, that is the original source data is transformed according to the mapping rules and stored in the global schema before runtime. In a concrete architecture, it can make sense to add requirements to fix the (most important parts of the) global schema and the (most important) mapping rules.
Dependencies: VAR2.1, VAR2.1.2: Extracting information and imposing structure onto unstructured data is a necessary pre-step to integrate that data and map it to a global schema. VAR3.2: If both schema integration and entity integration are to be persistent, they can use the same intermediate storage layer for their persistence.
Literature: [45, 73, 83, 128, 130, 169, 212]

Table 3.21: Requirement VAR3.2 - Integrating Entities from Data from Varied Sources

Req. ID: VAR3.2
Req. Type: Functional: Entity Integration
Parent Req.: VAR3
Goals: VAR, VER, VAL
Description: The system shall be able to maintain rules to resolve and match similar entities, that is to identify tuples and field values that refer to the same entity and collapse them. This should be done .
Rationale: As described in Chapter 2.1.4, data needs to be combined from different sources to provide overarching insights. However, it is very probable that different sources refer to the same entity or object with different names or keys. There can even be duplicates of objects which are referred to by different names or keys within a single source. The system should be able to recognize these and integrate tuples which refer to the same object into one tuple to provide an integrated and more consistent view onto this object. This can not only happen for identifiers of tuples, but also for attribute values within a tuple. It also typically happens for information extracted from unstructured data, as those data sources typically do not use a unique key to refer to an entity. The resolution can either be virtual, that is the rules are applied at runtime, or persistent, that is entities are collapsed before runtime and stored as a single tuple.
Dependencies: VAR2.1, VAR2.1.2: Extracting information and entities from unstructured data is a necessary pre-step to integrate these entities with objects from other sources. VAR3.1: If both schema integration and entity integration are to be persistent, they can use the same intermediate storage layer for their persistence.


Conflicts: -
Literature: [45, 67, 83, 110, 130, 212, 225]

Table 3.22: Requirement VAR3.3 - Providing Access to Original Source Data

Req. ID: VAR3.3
Req. Type: Functional: Original Sources
Parent Req.: VAR3
Goals: VAR, VAL
Description: The system shall enable users and analysis tasks to directly access the original source data without any pre-processing or integration.
Rationale: In some cases, inconsistencies and noise in the data sources are part of the analysis task or can provide additional insights. In cases where schema and entity integration are not persistent, it can sometimes also be necessary to avoid the virtual integration step for performance reasons. In those cases, the necessary integration and the handling of connections between different data sources need to be done within the analysis task.
Dependencies: VAR3.1, VAR3.2: Accessing original sources means bypassing the functionality specified in schema and entity integration.
Conflicts: -
Literature: [45, 83, 110, 130, 225]

Variety Requirements Overview - Part 2

Figure 3.5: Requirements Visualization: Data Variety View 2 - Metadata Management


The management of metadata21 is necessary to allow users to get an overview of the available data in the system, to understand that data and its semantics, to work with it, but also to administrate the system as a whole [45, 96, 97]. It also helps users to understand the certainty and quality of the data, to determine which questions they can ask and to estimate how much they can rely on a certain analysis result. Therefore metadata management supports the handling of uncertainty in the data. Furthermore, as mentioned above, it also supports the integration of data from different sources. Metadata management mainly consists of four parts. Metadata obviously needs to get created during the normal data extraction process22 and it needs to get stored in relation to the actual data23. Metadata also poses some drawbacks, though. It can slightly slow down the extraction process and conflicts with the acquisition rate challenge. It also increases the storage need. On the other hand, several pieces of information necessary to fulfil other requirements, namely the tracking of trustworthiness, also get stored as metadata. Furthermore, metadata is necessary to enable users to conduct an effective, ad-hoc discovery of data and data sources. However, metadata does not only need to be extracted and stored, it also needs to be collected along the whole data processing chain. It should be possible for users to track the data provenance and to trace analysis results all the way back to identify from which data sources the involved data was extracted and which data integration, information extraction and other steps have been conducted on it24 [45]. It should also be possible for administrators and users to inspect the stored metadata and modify it, if needed25.
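A minimal sketch of how provenance metadata could be carried alongside a data item, so that every processing step appends to its history and users can later trace results back to their sources; the structure and field names are illustrative assumptions, not a prescribed metadata model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List

@dataclass
class ProvenanceStep:
    operation: str                 # e.g. 'information extraction', 'schema integration'
    timestamp: str
    details: Dict[str, Any] = field(default_factory=dict)

@dataclass
class TrackedItem:
    payload: Any
    source: str                    # where the data was extracted from (cf. VAR4.2)
    history: List[ProvenanceStep] = field(default_factory=list)  # processing history (cf. VAR4.3)

    def record(self, operation: str, **details) -> None:
        """Append one processing step to the item's provenance trail."""
        self.history.append(ProvenanceStep(
            operation, datetime.now(timezone.utc).isoformat(), details))

# Example: metadata gets collected along the processing chain.
item = TrackedItem(payload={"text": "great product!"}, source="twitter-stream")
item.record("information extraction", extractor="sentiment-v1")
item.record("entity integration", rule="name-collapse")
print([step.operation for step in item.history])
```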

Variety Requirements Specification - Part 2

Table 3.23: Requirement VAR4 - Managing Metadata of Data from Varied Sources and during Processing it

Req. ID: VAR4 Req. Type: Functional: Manage Metadata Parent Req.: - Goals: VAR,VER,VAL Description: The system shall gather and store metadata to describe:

• the data source structure, schema and recording method
• the data structures used within the system
• analysis techniques and processing steps within the system
• for data items, from which source they originate and which processing steps have already been conducted (data provenance)
• operational information about the analysis processes

Rationale: As described in Chapter 2.1.4, the management of metadata is important to keep an overview of the data available to and within the system and its structure; it is important for users to put analysis results into context and to judge the reliability of those results; it is important for integrating data with different structure and schema; it is important for query developers to determine which data they can use and which semantics this data has; and it is important for operators to administrate the system.
Dependencies: VER1: Managing metadata helps to handle uncertainty, as metadata describes the available data, its structure and how it got recorded and as it allows users to set the data into context and judge its trustworthiness.

21 see requirement VAR4 and sub-requirements
22 see requirement VAR4.2
23 see requirement VAR4.1
24 see requirement VAR4.3
25 see requirement VAR4.4


VAL1.3.1: The data discovery and navigation through the available data and data sources relies on the stored metadata about this data and these data sources. The availability of metadata is a necessary prerequisite.
Conflicts: -
Literature: [45, 169, 202]

Table 3.24: Requirement VAR4.1 - Storing Metadata Related to the Data it Describes

Req. ID: VAR4.1
Req. Type: Functional: Store Metadata
Parent Req.: VAR4
Goals: VAR, VER, VAL
Description: The system shall store metadata using the following data structures and directly related to the data or structures it describes.
Rationale: Metadata of different data and structures needs to get stored in an integrated way, so it is easier to administrate, navigate through and use the metadata. It should be directly linked to the data or structures it describes, as typically data and its metadata get used together.
Dependencies: VER1.2: When the metric to track trustworthiness is computed, it needs to be stored as metadata within the metadata structures. VAR4.2, VAR4.3: Metadata automatically extracted from sources and during provenance tracking needs to be stored within the metadata storage structures. VAR4.4: To inspect the metadata it is necessary to access the metadata storage structures. When metadata gets manipulated during the inspection, the changes need to be written back. VAL1.4: The data discovery and navigation through the available data and data sources relies on the stored metadata about this data and these data sources.
Conflicts: VOL1: Storing metadata additionally to the data itself increases the storage need.
Literature: [45, 169]

Table 3.25: Requirement VAR4.2 - Extracting Metadata during Data Extraction from Varied Sources

Req. ID: VAR4.2
Req. Type: Functional: Extract Metadata
Parent Req.: VAR4
Goals: VAR, VER, VAL
Description: The system shall additionally extract the following metadata in the following formats when extracting data from:

• Data source
• Structure of the data source
• Recording method of the data in the data source


Rationale: To work with data it is important to know where it comes from, how it was recorded or measured there and in which structure it was stored before it got transformed to the data structures used in the system. It allows users to put analysis results into context and to judge the reliability of those results, it allows query developers to determine what semantics and structure their input data has, it gives an overview about which data sources are already leveraged within the system and it provides information for integrating data from different sources.
Dependencies: VAR3.1, VAR3.2, VAR3.3: Extracting metadata from the sources provides necessary information to process schema and entity integration and it gives information about the structure for directly accessing the original source data. VAR4.1: Metadata automatically extracted from data sources needs to be stored within the metadata storage structures.
Conflicts: VEL1.2.1: Extracting additional metadata during the data extraction process decreases the performance and makes it harder to achieve the acquisition rate.
Literature: [45, 169]

Table 3.26: Requirement VAR4.3 - Tracking Provenance and Processing History for Data

Req. ID: VAR4.3
Req. Type: Functional: Track Provenance
Parent Req.: VAR4
Goals: VAR, VER, VAL
Description: The system shall collect metadata during the processing of data to track the provenance of data sets within the system, that is their source and which processing steps were conducted on them.
Rationale: According to [45] it is important to track the provenance of data during the analysis process to resolve dependencies to other steps in the case of an error in one processing step and to allow users to put analysis results into context and to judge the reliability of those results.
Dependencies: VER1.2: The computation of a metric to track trustworthiness of data relies on information about the provenance of this data. VAR4.1: Metadata about data provenance automatically extracted during data processing needs to be stored within the metadata storage structures.
Conflicts: -
Literature: [45, 169]


Table 3.27: Requirement VAR4.4 - Enabling Manual Inspection and Adjustment of Metadata

Req. ID: VAR4.4
Req. Type: Functional: Inspect Metadata
Parent Req.: VAR4
Goals: VAR, VER, VAL
Description: The system shall allow administrators and users to inspect the available metadata for each data source, data item, data structure etc. and to adjust metadata they consider to be incorrect.
Rationale: To provide an overview of available data and data sources it is necessary for users and administrators to inspect the metadata about them. In some cases the extraction of metadata for a source can be faulty or incomplete. In these cases it makes sense for administrators to adjust this metadata.
Dependencies: VAR4.1: To inspect the metadata it is necessary to access the metadata storage structures. When metadata gets manipulated during the inspection, the changes need to be written back.
Conflicts: -
Literature: [45]

3.2.4 Requirements aimed at handling data veracity

Veracity Requirements Overview

Figure 3.6: Requirements Visualization: Data Veracity View


Veracity requirements aim at managing uncertainty26. As described in Chapter 2.1.5, uncertainty is inherent in most ‘big data’ sources [45, 130, 139, 231]. There are generally two strategies to manage it. One is to reduce uncertainty by improving the quality of the underlying data27. Improving the data quality also has the effect that it typically supports data integration. Improving the data quality comprises two basic tasks, correcting incorrect values28 and completing missing ones29. There are several techniques available to do this, e.g. machine learning techniques to identify incorrect values and derive values for empty fields, simple mapping rules based on other fields or populating values from other data sources during or after the data integration step. The second strategy is to allow for and expect quality problems in the data. Typically, that is even the case after data cleansing and quality improvements. In that case, it is necessary to give users the information needed to estimate the trustworthiness of the data. It is then even more important to provide them with information about data provenance. It can help to define a metric for trustworthiness and to calculate it based on the source the underlying data originates from and on the trust in the different preprocessing steps30. After all, some of the preprocessing steps, e.g. information extraction, are based on statistical or machine learning techniques and therefore on probability or estimation [45, 130]. Obviously, tracking the trustworthiness of data has a strong dependency on metadata management and especially the tracking of data provenance, as these provide the necessary input for the calculation.
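One simple way to realize such a trustworthiness metric is sketched below, under the assumption that every source gets a base trust value and every preprocessing step a confidence factor; the scale and the multiplicative combination are illustrative choices, not a prescription of the reference architecture.

```python
# Hypothetical base trust per source and confidence per preprocessing step, all in [0, 1].
source_trust = {"erp-system": 0.95, "twitter-stream": 0.6}
step_confidence = {"sentiment extraction": 0.8, "entity resolution": 0.9, "value completion": 0.85}

def trustworthiness(source: str, applied_steps) -> float:
    """Combine the source's base trust with the confidence of every step applied to the data."""
    score = source_trust[source]
    for step in applied_steps:
        score *= step_confidence[step]
    return score

# A tweet-based result that went through extraction and entity resolution:
print(round(trustworthiness("twitter-stream", ["sentiment extraction", "entity resolution"]), 3))
# -> 0.432
```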

Veracity Requirements Specification

Table 3.28: Requirement VER1 - Managing Uncertain Data

Req. ID: VER1
Req. Type: Functional: Manage Uncertainty
Parent Req.: -
Goals: VER, VAR
Description: The system shall handle uncertain data, that is give meaningful results in the face of uncertain data and put those results into context.
Rationale: This requirement directly concerns veracity and data uncertainty as discussed in Chapter 2.1.5. It is a parent requirement to group together sub-requirements that tackle uncertainty. Data of a source itself might be fuzzy and untrustworthy, but in the case of multiple sources it is probable that schemas and semantics differ and are not always resolvable in an integration step, or that the sources even contain inconsistent and conflicting data.
Dependencies: VAR1.1: Filtering out untrustworthy data can help to decrease the total uncertainty of the data within the system. VAR3: Data integration can resolve some of the issues of uncertainty, e.g. by harmonizing entities and adopting values from other data sources. VAR4: Managing metadata helps to handle uncertainty, as metadata describes the available data, its structure and how it got recorded and as it allows users to set the data into context and judge its trustworthiness.
Conflicts: -
Literature: [45, 130, 139, 231]

26 see requirement VER1
27 see requirement VER1.1
28 see requirement VER1.1.2
29 see requirement VER1.1.1
30 see requirement VER1.2


Table 3.29: Requirement VER1.1 - Improving Data Quality to Decrease Uncertainty

Req. ID: VER1.1
Req. Type: Functional: Improve Data Quality
Parent Req.: VER1
Goals: VER, VAR
Description: The system shall improve data quality by cleaning data from the following sources and in the following formats .
Rationale: As pointed out in Section 2.1.5, it is often the case that data of a data source is uncertain because of poor data quality. Cleaning imprecise, erroneous or incomplete data before analysing it can improve the data quality, decrease data uncertainty and improve the quality of analysis results.
Dependencies: VAR3: An improved data quality by completing and correcting values can make it easier to do the data integration step. VER1.1.1, VER1.1.2: Cleaning data consists of value completion and value correction.
Conflicts: -
Literature: [45, 212]

Table 3.30: Requirement VER1.1.1 - Improving Data Quality - Complete Empty Values

Req. ID: VER1.1.1
Req. Type: Functional: Value Completion
Parent Req.: VER1.1
Goals: VER, VAR
Description: The system shall use the following techniques to fill empty fields with estimated values under the following conditions .
Rationale: It is often the case that fields important for data analysis are not populated. In that case it can be reasonable to fill these fields with some technique (e.g. a machine learning technique, a simple transformation from other fields or adoption of the value from another data source). However, not every field is important and there are situations where an empty field is meaningful. Therefore it should be specified under which conditions (e.g. only for specific fields) values should be derived.
Dependencies: VAR3.1, VAR3.2: If the schema and entities among data sources are integrated, empty fields in one data set can be derived from another data set. VER1.1: Value completion is a child requirement of Improve Data Quality.
Conflicts: -
Literature: [45]

Table 3.31: Requirement VER1.1.2 - Improving Data Quality - Correct Wrong Values

Req. ID: VER1.1.2
Req. Type: Functional: Value Correction
Parent Req.: VER1
Goals: VER, VAR
Description: The system shall use the following conditions to identify untrustworthy or incorrect data and apply the following techniques to resolve the untrustworthiness.
Rationale: Conditions like error models or conflicting data between different sources can often be used to identify erroneous data values, and these errors can be resolved with some technique (e.g. a machine learning technique, a simple transformation from other fields or adoption of the value from another data source).


Dependencies: VAR3.1, VAR3.2: If the schema and entities among data sources are integrated, it is possible to identify conflicting information and resolve these conflicts by comparing different data sets. VER1: Value correction is a child requirement of Manage Uncertainty.
Conflicts: -
Literature: [45]

Table 3.32: Requirement VER1.2 - Informing Users about Trustworthiness of Data

Req. ID: VER1.2
Req. Type: Functional: Track Trustworthiness
Parent Req.: VER1
Goals: VER, VAR
Description: The system shall track the trustworthiness of data on the level of and calculate a trustworthiness metric in the following way .
Rationale: As described in Chapter 2.1.5, even after integrating data, completing and correcting it as well as possible, it is still likely that some errors remain. This is even more true in cases where data correction and integration are based on probability (e.g. when using some machine learning approaches). This uncertainty needs to be visible to the user, so that the user can evaluate the analysis results in context. Providing a metric of the trustworthiness of the underlying data can help in this process. This metric can be calculated on several levels (e.g. per attribute, entity or data source) and in many ways (e.g. giving a trust value to the data source and manipulating that trust value for all processing steps conducted on the data).
Dependencies: VAR1, VAR2.1, VAR3.1, VAR3.2, VER1.1.1, VER1.1.2: The sources data is extracted from, the way information is extracted from this data and the techniques for data integration, data completion and data correction can all influence the trustworthiness metric for that data. VAR4.1: When computing the metric to track trustworthiness of data it is necessary to access and use the stored metadata. Once the metric is computed, it needs to be stored within the metadata structures.
Conflicts: -
Literature: [45, 130]

3.2.5 Requirements aimed at creating value

To keep the requirements visualization for value readable, it is split into two parts. The first part describes the requirements that aim at the typical analysis tasks and the final interaction of the user with the system31. The second part describes additional services or tasks the system must conduct to maintain or increase the value of the data32.

Value Requirements Overview - Part 1

Value created from data mainly originates from the analysis tasks conducted over the data33. The data analysis can broadly be classified into three categories. The first category comprises classical

31see Figure 3.7 32see Figure 3.8 33see requirement VAL1


Figure 3.7: Requirements Visualization: Data Value View 1 - User Interaction

business reporting functionality. This is very similar to traditional data warehousing front-ends and refers to predefined and pre-calculated, often static reports and dashboards as well as classical OLAP navigation functionality34. The second category refers to ad-hoc or interactive processing for power users to dig into the available data and conduct deeper analysis [45]. It consists of mere discovery and navigation through the available data and metadata35 and of ad-hoc analytics or querying based on a declarative language, scripts or on some analytics package36, e.g. R or SAS. This analysis is conducted ad-hoc and interactively, therefore response time is an issue. Finally, the third category involves deep analytics tasks, that is complex analysis tasks over large amounts or all of the available data37. These tasks typically take a lot of time and are processed as batch jobs. While the data analysis requirements mentioned above mainly describe direct functionality, several usability requirements can be considered that allow users to interact with the system more effectively38. First, the use of visualization techniques makes it easier for users to interpret results39 and is typically prominent in dashboards and reports [45, 132]. Therefore it directly supports business reporting, but it can also be applied to ad-hoc analytics results, e.g. visualization functionality embedded in

34see requirement VAL1.2 35see requirement VAL1.3.1 36see requirement VAL1.3.2 37see requirement VAL1.1 38see requirement VAL2 and sub-requirements 39see requirement VAL2.1

analytics packages, and even for deep analytics results. Another requirement is to provide appropriate programming models and frameworks to enhance the productivity of programming analysis tasks within the system40 [132]. This directly supports deep analytics tasks, as those are typically developed on demand and need to be changed when analysis requirements change. It can, however, also support ad-hoc analytics and queries, in case they are formulated with scripting languages. Another, probably the most wide-spread, possibility is the use of declarative languages41 to formulate ad-hoc queries. The obvious example is SQL, but there are also new languages which were developed within the ‘big data’ ecosystem, e.g. Pig or HiveQL. Declarative languages are in most cases also highly optimized and provide a level of optimization that is hard and laborious to achieve in non-declarative languages. Therefore they can have a positive effect on query performance. Finally, it can be considered to integrate and use analytics packages or libraries, which provide frequently used functionality42, e.g. machine learning algorithms. An example is . This supports programmer productivity, and these packages can often also be used to quickly formulate some ad-hoc analysis. Another requirement, which is tightly connected with ad-hoc analytics, is the support of experimentation43. Working with data and identifying or optimizing appropriate analysis techniques and the parametrization of those techniques require experimentation by the data analysts. Furthermore, data from new sources might need to be experimented with before a decision is made whether they are to be acquired on a regular basis. This requires a sandbox approach, which allows analysts and programmers to ‘play’ with data in an isolated area. Experimentation is supported by all usability requirements and by effective ad-hoc analytics possibilities.

Value Requirements Specification - Part 1

Table 3.33: Requirement VAL1 - Analysing the Data

Req. ID: VAL1 Req. Type: Functional: Data Analysis Parent Req.: - Goals: VAL Description: The system shall conduct analysis tasks as specified in sub-requirements VAL1.1, VAL1.2 and VAL1.3 over data from the following sources . Rationale: The whole system is built to create insights by analysing data. This requirement is therefore obvious and the most important one, at least given that the scope of the reference architecture excludes operational and transactional processing of ‘big data’. The analysis of data can be classified into three categories, which are represented by the sub-requirements. Also note that does not necessarily refer to original data sources, but can refer to data sets, either persisted or virtual, resulting from pre-processing steps, e.g. a data set integrated from several original sources. Dependencies: VAL1.1, VAL1.2, VAL1.3: Data Analysis can be classified into Deep Analytics, Reporting and Ad-hoc Analytics, represented by these sub-requirements. Conflicts: - Literature: [45, 62, 63, 83, 169, 195, 213]

40see requirement VAL2.2 41see requirement VAL2.3 42see requirement VAL2.4 43see requirement VAL3


Table 3.34: Requirement VAL1.1 - Batch-Processing the Data for Deep Analytics Tasks

Req. ID: VAL1.1 Req. Type: Functional: Deep Analytics Parent Req.: VAL1 Goals: VAL Description: The system shall conduct the following deep analytics tasks as batch jobs over the following data sources . Rationale: As described in Section 2.1.6, ‘big data’ is also connected to a shift to more complex analysis methods including data mining, machine learning and simulation over large amounts of data. These tasks are typically too complex and take too much time to be done in an interactive fashion and must therefore be conducted as regular batch jobs. Dependencies: VAL2.2: Deep Analytics tasks can typically not be formulated using declarative methods, but need to be programmed in a lower-level language. Methods to enhance programmer productivity, e.g. abstraction from lower-level concepts or from parallelisation, can therefore simplify and accelerate development of deep analytics tasks. Note that the data sources mentioned in should be a subset of the data sources specified in VAL1. Conflicts: - Literature: [45, 62, 63, 83, 169, 213]

Table 3.35: Requirement VAL1.2 - Creating Standard Reports from Data

Req. ID: VAL1.2 Req. Type: Functional: Reporting Parent Req.: VAL1 Goals: VAL Description: The system shall provide users with pre-calculated standard reports as specified in with traditional OLAP-like navigation functionality using data from the following sources . Rationale: Not all end-users and decision makers who use results and insights from ‘big data’ analysis have the necessary skills or the time to formulate ad-hoc queries and directly interpret deep analytics results. Therefore, the system needs to create easy-to-understand standard reports and dashboards for frequently used data and analysis results. Note that the data sources mentioned in should be a subset of the data sources specified in VAL1. Dependencies: VAL2.1: Standard reports and dashboards are typically targeted at business end-users and often use visualization techniques to present the data. Conflicts: - Literature: [103]

Table 3.36: Requirement VAL1.3 - Enabling Users to formulate Ad-hoc Processing Tasks

Req. ID: VAL1.3 Req. Type: Functional: Ad-hoc Processing Parent Req.: VAL1 Goals: VAL Description: The system shall enable users to interactively work with the available data and to query and analyse it in an ad-hoc manner.


Rationale: Power users often work with ‘big data’ by iteratively running queries, analysing the results, drawing conclusions, and adjusting and re-running the queries to check assumptions, test hypotheses and experiment with the data. This needs to be done in an ad-hoc manner, and response times as found in batch processing are typically not acceptable. Dependencies: VAL3: The ability for users to quickly discover available data and to quickly write and process ad-hoc queries for basic analysis tasks supports the experimentation with data. Conflicts: - Literature: [45, 63, 111, 195]

Table 3.37: Requirement VAL1.3.1 - Enabling Users to Interactively Discover Data

Req. ID: VAL1.3.1 Req. Type: Functional: Data Discovery Parent Req.: VAL1.3 Goals: VAL, VAR Description: The system shall provide users with an overview of, and enable them to navigate through, all available data sources, data available in the system and already computed results for analysis tasks and single processing steps, all together with the related metadata. Rationale: To work with data it is obviously necessary for users to be able to discover the available data. They need to be able to navigate through the original raw data, through data along the different pre-processing steps and through the corresponding metadata. This allows them to identify data available to them, view example data and look up the format of this data. Dependencies: VAR4: The data discovery and navigation through the available data and data sources relies on the stored metadata about this data and these data sources. The availability of metadata is a necessary prerequisite. Conflicts: - Literature: -

Table 3.38: Requirement VAL1.3.2 - Enabling Users to Perform Ad-hoc Analysis of Data

Req. ID: VAL1.3.2 Req. Type: Functional: Ad-hoc Analytics Parent Req.: VAL1.3 Goals: VAL Description: The system shall provide users with an end-point to formulate and process ad-hoc queries and processing tasks over data from the following sources using the following methods . Rationale: As mentioned in the parent requirement VAL1.3, querying and analysing data in an ad-hoc manner is important for power users to interactively work with data. Ad-hoc analysis can either be free or guided. Free methods specified in can e.g. involve declarative query languages, usage of analytical packages and tools such as SAS, R and Weka Explorer, but also scripting languages, e.g. Python. Guided ad-hoc analysis largely refers to multi-dimensional navigation through a data cube as represented by traditional OLAP tools. Note that the data sources mentioned in should be a subset of the data sources specified in VAL1.


Dependencies: VAL2.3, VAL2.4: Easy-to-use declarative languages and analytical packages enable users to quickly and easily write ad-hoc queries and to conduct ad-hoc analysis tasks. VAL2.2: Programmer productivity measures can support ad-hoc analysis in those cases in which scripting languages are used, e.g. by the inclusion of libraries for that scripting language. Conflicts: - Literature: [45, 63, 102, 111, 195]

Table 3.39: Requirement VAL2 - Providing Functionalities in an Easy-to-Use Manner

Req. ID: VAL2 Req. Type: Non-Functional: Usability Parent Req.: - Goals: VAL Description: The system shall make it easy for users to explore available data, analysis results as well as results of intermediate steps and should support users in understanding and interpreting those results. Rationale: This is a high-level requirement to comprise and structure the different usability requirements. Dependencies: VAL3: A system that easily allows users to quickly write scripts or prototypes, to formulate queries directly in some declarative language and to use analytical packages and tools for their tasks supports the experimentation with data and analysis techniques. Conflicts: - Literature: [45, 132]

Table 3.40: Requirement VAL2.1 - Visualizing Data and Analysis Results

Req. ID: VAL2.1 Req. Type: Functional: Visualization Parent Req.: VAL2 Goals: VAL Description: The system shall visualize the following data and analysis results as specified in . Rationale: Visualization allows users to grasp data and analysis results faster and more intuitively and supports them in interpreting these results. Dependencies: VAL1.2: Standard reports and dashboards are typically targeted at business end-users and often use visualization techniques to present the data. : Add dependencies to the functional requirements that specify the visualization tasks Conflicts: - Literature: [45, 169, 213][108, pp. 65-68]

Table 3.41: Requirement VAL2.2 - Supporting Programmer Productivity

Req. ID: VAL2.2 Req. Type: Non-Functional: Programmer Productivity Parent Req.: VAL2 Goals: VAL Description: The system shall support programmer productivity by allowing programmers to focus on the application logic while abstracting low-level implementation and infrastructure details away.


Rationale: Programming ‘big data’ systems can involve a lot of low-level implementation and infrastructure details, such as handling parallelization. These tasks, if done manually, typically involve a lot of effort and require experienced programmers with a very specific skill set. This can make the development of functions for such a system rather expensive. Using a framework which takes over responsibility for and abstracts away from infrastructure functionality, e.g. a MapReduce framework for handling distribution, lets programmers focus on the actual application logic, increases productivity and cuts down development cost. Increasing programming productivity also involves the inclusion and usage of software libraries in the system which implement lower-level algorithms, e.g. Mahout or Weka. Dependencies: VAL1.1: Deep Analytics tasks can typically not be formulated using declarative methods, but need to be programmed in a lower-level language. Methods to enhance programmer productivity, e.g. abstraction from lower-level concepts or from parallelisation, can therefore simplify and accelerate development of deep analytics tasks. VAL1.3.1: Programmer productivity measures can support ad-hoc analysis in those cases in which scripting languages are used, e.g. by the inclusion of libraries for that scripting language. VAL2.4: Analytical packages often allow included functions to be accessed programmatically via APIs and can be used similarly to libraries to increase programmer productivity, compare e.g. the distinction between the Weka library and the Weka Explorer UI on top. Conflicts: - Literature: [132]

Table 3.42: Requirement VAL2.3 - Enabling Users to Formulate Queries and Analysis Tasks Using Declarative Query Languages

Req. ID: VAL2.3 Req. Type: Non-Functional: Declarative Query Languages Parent Req.: VAL2 Goals: VAL Description: The system shall provide an end-point that allows users to formulate queries and analysis tasks in the following declarative query language(s) and process those queries over the following data sources . Rationale: Declarative languages are a powerful tool for doing analysis in an ad-hoc manner as they abstract away from implementation details. They are typically easier to learn and use for non-technical users compared to a lower level . It is typically also faster to formulate a query in a declarative language than to program it in a lower-level language. Note that it might not be feasible to query all data sources in a declarative manner; this might e.g. be difficult for unstructured text data. Therefore, the data sources that can be queried in this way are to be specified in . Dependencies: VAL1.3.2: Easy-to-use declarative languages enable users to quickly and easily write ad-hoc queries for analysis tasks and are the main building block for ad-hoc analysis.


VOL3: The abstraction away from implementation details in declarative query languages typically allows query translation and execution in those languages to be highly optimized, both logically and physically. Therefore it frees programmers from doing this optimization in lower level code and prevents the usage of unoptimized code. Conflicts: - Literature: [45]

Table 3.43: Requirement VAL2.4 - Providing Analytical Packages to Formulate Processing Tasks

Req. ID: VAL2.4 Req. Type: Functional: Analytical Packages Parent Req.: VAL2 Goals: VAL Description: The system shall incorporate the following analytical packages to be used by users to analyse data from the following sources . Rationale: Analytical tools, such as SAS or SPSS, are powerful tools for users to explore data and analyse it using more advanced methods, e.g. statistical analysis or predictive modeling. Note that some analytical packages are very specialized and optimized for a certain type of task. Therefore, it can make sense to provide users with several of those tools to use the one best suited for the task at hand. Dependencies: VAL1.3.2: Easy-to-use analytical packages enable users to quickly and easily conduct ad-hoc analysis tasks. Conflicts: - Literature: [83]

Table 3.44: Requirement VAL3 - Supporting Users in Experimenting with Data and Analysis Methods

Req. ID: VAL3 Req. Type: Non-Functional: Support Experimentation Parent Req.: - Goals: VAL Description: The system shall enable and support users to experiment with data. Rationale: As already mentioned in requirement VAL1.3, power users work interactively with data, making hypotheses, testing these by experimenting with the data, drawing conclusions and refining their hypotheses. They also need to experiment with new analysis methods, data mining and machine learning techniques, the parametrisation of those, etc. This kind of experimentation needs to be supported by the system, e.g. by making it easy to extract samples out of the data and to work with them in a sandbox. Dependencies: VAL2: A system that easily allows users to quickly write scripts or prototypes, to formulate queries directly in some declarative language and to use analytical packages and tools for their tasks supports the experimentation with data and analysis techniques. Conflicts: - Literature: [95, 186, 191]


Figure 3.8: Requirements Visualization: Data Value View 2 - Additional Services

Value Requirements Overview - Part 2

Finally, there are some requirements that need to be satisfied to maintain and govern the value creation of data. The first of those is the application of data lifecycle management44 [202]. Data lifecycle management helps to keep the data volume within the system under control by archiving outdated data that is no longer needed on a regular basis45. Furthermore, it reduces the amount of storage needed by compressing the stored data46. Compressing data can also create performance improvements in some cases. Another important issue is to ensure data privacy47. The system needs to conform to governmental regulations and policies and needs to be in line with internal privacy standards. This is even more important in the case of ‘big data’, where the large-scale integration of data from different sources can lead to privacy breaches: data that is anonymized in each single source can be re-identified and personal information assigned to it when those data sources are integrated and put together [160].

Value Requirements Specification - Part 2

Table 3.45: Requirement VAL4 - Managing Data along its Lifecycle

Req. ID: VAL4 Req. Type: Functional: Data Lifecycle Management Parent Req.: - Goals: VAL Description: The system shall manage the lifecycle of the stored data, that is, track data records and documents from their creation and determine, based on the following rules , whether data is still to be retained or has become stale and is to be archived or deleted.

44see requirement VAL4 45see requirement VAL4.2 46see requirement VAL4.1 47see requirement VAL5


Rationale: Data is typically stored because it is considered valuable. Not every data item has the same value, though. With a restricted IT budget and growing data volumes, it makes sense to prioritize data, focusing effort, e.g. for data cleaning and integration, onto higher priority data and to save storage by discarding stale data from the system, either through archiving or deleting it. These actions, especially the deletion of stale data, however need to comply with governmental laws and regulations, e.g. the Sarbanes-Oxley Act, which regulates which records need to be stored and for how long. Having a system that can conduct data lifecycle management tasks automatically and rule-based ensures compliance with such regulations while reducing the effort needed from the IT department. Dependencies: : Add dependencies to the functional requirements that specify the data lifecycle rules Conflicts: - Literature: [202, pp. 133-140]

Table 3.46: Requirement VAL4.1 - Compressing Data to Save Storage Space

Req. ID: VAL4.1 Req. Type: Functional: Data Compression Parent Req.: VAL4 Goals: VAL,VOL Description: The system shall compress data where possible using the following compression technique . Rationale: With growing data volumes it makes sense to compress data where possible, to decrease the required storage space. There are also cases where compression leads to better performance. Dependencies: VOL1: Compressing data decreases the data volume and therefore the amount of storage needed. Conflicts: - Literature: [202, pp. 137-138]

Table 3.47: Requirement VAL4.2 - Archiving Data when it is no longer Needed

Req. ID: VAL4.2 Req. Type: Functional: Data Archiving Parent Req.: VAL4 Goals: VAL, VOL Description: The system shall archive or delete data based on the rules specified in requirement VAL4 using the following archiving method . Rationale: With growing data volumes it makes sense to archive stale data where possible, to decrease the required storage space. Dependencies: VOL1: Archiving data decreases the data volume and therefore the amount of storage needed. Conflicts: - Literature: [202, pp. 137-138]


Table 3.48: Requirement VAL5 - Ensuring Privacy and Security of Data

Req. ID: VAL5 Req. Type: Non-Functional: Privacy Parent Req.: - Goals: VAL Description: The system shall ensure the privacy of data, that is, data must not be accessible to people who are not authorized to access it. The system shall therefore conduct the following measures . Rationale: People whose information is stored and processed in an information system, e.g. users, customers or employees, have particular expectations about how their data is stored and used and how its privacy is protected. Furthermore, there are governmental regulations that define how data can be used and what privacy measures need to be taken. Privacy breaches can have a huge impact on a company’s business. As described above, this is even more important in the case of integrating many different sources, because cross-references between those sources can be used to de-anonymize data which is anonymized in the single sources. Privacy measures specified in can include authentication and authorization, tracking data access by users and anonymization techniques. Dependencies: : Add dependencies to the functional sub-requirements that specify privacy measures. Conflicts: VAR3: The integration of data across sources can make it possible to identify and assign personal information which is not identifiable in each single source. Literature: [81, 160]

4 Reference Architecture

In this chapter I will develop and describe the ‘big data’ reference architecture. This will be based on the groundwork from the previous chapters, namely the definition of ‘big data’ in Section 2.1, the definition of reference architectures as well as the characterization of the reference architecture at hand in Section 2.2 and finally the requirements specified in Chapter 3. The reference architecture is intended to give an overview of the ‘big data’ space, the involved functionality and technology and to guide and reason about design decisions for a respective concrete architecture. First, in Section 4.1, I will describe and reason about the methodology I use to develop the reference architecture. Afterwards, I will step-wise develop and refine the functional reference architecture in Section 4.2 and finally, I will map existing technology to the functional components of the reference architecture in Section 4.3.

4.1 Architectural Methodology

As discussed in Section 2.2.2 under ‘Step 4: Construction of the reference architecture’, developing a reference architecture means building a set of models structured according to defined views. To guide the decision on how to separate the concerns of system architecture design into different views, several models or frameworks are available. One of the earliest was Kruchten’s 4+1 view model for architecture [150]. Rozanski and Woods [193, pp. 211-360] developed a viewpoint catalogue based on Kruchten’s 4+1 model. Here, a viewpoint describes guidelines and principles for a particular view and identifies the respective stakeholders and their concerns. As this reference architecture aims at describing the general functionality needed within the ‘big data’ space and how this functionality can be mapped onto available technologies and software modules, I will pick the functional and the development viewpoint out of this catalogue. The other viewpoints are less applicable due to the abstraction of the reference architecture and its placement on the system level. The information viewpoint is not applicable as the reference architecture does not describe concrete data models of a respective domain. The concurrency viewpoint is not directly applicable at the system level of the reference architecture, but is addressed on a lower level within its subsystems and modules (e.g. within a massively parallel processing framework). The deployment view is not applicable, as the reference architecture focusses on software, not on its deployment and hardware. As a guideline and to structure the design process itself, I will use the process of stepwise refinement as described and applied by Grefen et al. [125]. According to this approach, and in line with Galster and Avgeriou [114], the design of an architecture is based on a set of requirements. Building on the

requirements identified in Section 3.2, I will build a high-level model of functionality in Section 4.2.1 and then step-wise refine that functional model. In a first step I will break down the clusters of functionality to a more detailed level. The result can be seen as the functional view of the architecture. Afterwards I will map the functionality to general software components and add development-related information, e.g. general data stores. This represents a general development view of the reference architecture. This last refinement step is described in Section 4.2.2. In a final step, I will add concreteness by mapping these general software components to concrete technology and software platforms, e.g. to more concrete database technology. This can also mean that I will identify and mark gaps where no, or no mature, software modules are available yet. See Section 4.3 for this last step.

4.2 Reference Architecture Design

In this section, I will develop the proposed reference architecture. As described in Section 4.1, this will be a step-wise process. Section 4.2.1 will describe the functional, more high-level view of the reference architecture, while in Section 4.2.2 I will concretize it into a more implementation-oriented view.

4.2.1 Reference Architecture Functional View

The high-level reference architecture shown in Figure 4.1 aims at showing the main functionality needed in the system. This can be seen as a process of steps or functions that get conducted on the data, from preparing the data to analysing it and presenting the results. In that, the ‘big data’ process is not unlike traditional data warehousing, where data gets sequentially processed in different layers or phases of the extract-transform-load process before it gets analysed, e.g. within an OLAP engine, and presented to the end-user1. To acknowledge this process characteristic and to provide further structure, I am using and incorporating the ‘Big Data Analysis Pipeline’ as proposed by Agrawal et al. [45]. Agrawal et al. split the analysis of ‘big data’ into distinct phases of a sequential processing pipeline: Acquisition / Recording, Extraction / Cleaning / Annotation, Integration / Aggregation / Representation, Analysis / Modeling and Interpretation. I organize the functionality the system needs to provide along the different steps of this pipeline. In other words, the different functions are assigned to different phases of the analysis pipeline and therefore to different phases of a ‘big data’ analysis process. The system functions are depicted as components of a UML component diagram, while the pipeline is shown as packages. Keep in mind, however, that these are general functions and not identical to software components. They can be mapped, but one function might be distributed across several software components or a software component might implement several functions. For the high-level diagram, the functional components and the pipeline steps proposed by Agrawal et al. [45] are quite similar and have a large overlap. I have also omitted interfaces to external systems, e.g. data sources, at this point. This will change and get more complicated the further the architecture process proceeds and the more detailed the functional components are broken down. The functional components are very closely tied to the functional requirements as specified in Section 3.2 and focus on depicting the higher-level requirements. In that sense it is very coarse-grained, but it gives a first overview, serves as a starting point of the architecture process and guides the further

1see Section 2.3.1 for an overview of traditional data warehouse architecture


Figure 4.1: Reference Architecture: High-level functional view

design. The mapping of the requirements onto functional components is quite straightforward and can be seen in Table 4.1.

Looking at the reference architecture, it becomes apparent that there are functional components which are placed and create value along the analysis pipeline. These can be seen as steps within a data analysis process which in the end lead to a result presented to the user.

However, these steps are not necessarily mandatory. This is depicted by the structure of the connectors (arrows). It can e.g. make sense to go from data extraction directly to data cleaning or even data integration if the source data is in a structured or well-understood semi-structured format. Another example are data discovery applications where end-users can directly navigate through the source data. In the case of strict timeliness needs it can also be necessary to leave out data cleansing and to accept more approximate results in exchange for lower latency. This can be especially important for stream processing.

Furthermore, some of these steps are not necessarily sequential, but can be iterative. This is the case for data cleaning and data integration, which can have a rather complex interrelation. Doing the cleaning step first, as Agrawal et al. [45] propose, means cleaning data from each source in isolation. Identifying incorrect data or filling in empty attributes for a data set can make it easier to integrate it with another data set. On the other hand, integrating two or more data sets can lead to quality issues in the resulting set, e.g. new duplicates or inconsistencies between both sources, which again need to be cleaned. Rahm and Do [182] e.g. point out that data cleansing needs increase significantly when several data sources are merged. Furthermore, data integration can also enable data cleansing activities which were not possible before. As an example, it is possible to conduct data validation across data sets to detect inconsistencies and incorrect data and to fill empty attributes from another data set.


Functional Component | Requirements
Data Extraction | VAR1 (Data Extraction) and sub-requirements
Stream Processing | VEL1 (Handle Data Streams) and sub-requirements
Information Extraction | VAR2.1 (Information Extraction) and sub-requirements
Manage Data Quality / Uncertainty | VER1 (Manage Uncertainty) and sub-requirements
Data Integration | VAR3 (Data Integration) and sub-requirements
Data Analysis | VAL1 (Data Analysis) and sub-requirements
Data Distribution | VAL2 (Usability) and sub-requirements
Data Storage | VOL1 (Data Storage)
Metadata Management | VAR4 (Manage Metadata) and sub-requirements
Data Lifecycle Management | VAL4 (Data Lifecycle Management) and sub-requirements
Privacy | VAL5 (Privacy) and sub-requirements

Table 4.1: Reference Architecture: High-level, functional view

Data Extraction

Data that is to be analysed naturally needs to get into the system. This is the task of data extraction (derived from requirement VAR1) and to some extent stream processing (see next component). The difference between data extraction and stream processing is, broadly speaking, that data extraction aims at acquiring data at rest, while stream processing aims at acquiring data in motion. The data extraction component is therefore more batch-related and includes connectors to the source systems to acquire the data, but also management components, e.g. to monitor changes in the data source and handle delta loads. The extraction can e.g. be realized using standard or proprietary database interfaces (e.g. ODBC), APIs provided by the source application, but also file-based interfaces. Data extraction can also refer to crawling data from web pages. Referring to requirement VAR2 (Multi-structured Data), data can be extracted from structured, semi-structured and unstructured data sources [45, 46, 141]. The distinction between different types of data sources is derived from and just depicts the variety of in-flowing data as described in Section 2.1.4. The type of data can have some influence on the processing pipeline. Unstructured data e.g. necessitates an information extraction step, while structured data does not. Data extraction can also involve data filtering [45]. This sub-function is directly derived from requirement VAR1.1 (Data Filtering). Sometimes data can or needs to be filtered if it is not necessary for further analysis or if it is not suitable for it due to privacy concerns. The filtering can be conducted rule- or attribute-based or based on some machine-learning model, e.g. to classify relevant data items.
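
As a minimal sketch of rule-based filtering during a file-based extraction (in Python; the field names, the sample data and the value threshold are purely illustrative and not prescribed by the reference architecture):

    import csv
    import io

    # Illustrative extract; in practice this would come from an ODBC connection,
    # a source system API or a file-based interface.
    raw_extract = io.StringIO("order_id,order_value\n1,250.0\n2,40.0\n3,180.0\n")

    def extract_filtered(source, keep):
        """Read records from a CSV extract and keep only rows satisfying the filter rule."""
        for row in csv.DictReader(source):
            if keep(row):
                yield row

    # Hypothetical rule-based filter: only orders above a value threshold are relevant.
    relevant = list(extract_filtered(raw_extract, lambda r: float(r["order_value"]) > 100.0))
    # 'relevant' now contains the orders with ids 1 and 3

The same structure applies when the filter is a machine-learning classifier instead of a fixed rule; only the keep predicate changes.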

Stream Processing

The stream processing component is directly derived from requirement VEL1 and its sub-requirements. As mentioned above, it aims at acquiring and analysing data from streams in near real-time.

Stream Processing - Data Stream Acquisition: The data stream acquisition sub-function is directly derived from requirement VEL1.2 (Process Streaming Data) and its sub-requirements. It refers to the acquisition rate challenge as described in Section 2.1.3 and aims at acquiring data from in-flowing data streams and at storing relevant data (after filtering) [213]. The acquisition includes capabilities to ‘subscribe’ to a data stream or to simulate the stream via quickly repeating API

calls. Storing means putting the streaming data into some temporary data store from where it gets incorporated into the regular processing pipeline, so it can be analysed together with data at rest. It can additionally be loaded into temporary views, so it can be analysed by end-users in a more timely manner if the time spent waiting for the data to run through the processing pipeline is prohibitive, mainly because the time interval between the regular transformation and loading processes is too long. In that case the data can be deleted from the temporary views once it is loaded into the regular data stores. Again, data stream acquisition can also include data filtering, similar to the data extraction component.
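
A small sketch of such an acquisition step (in Python, with a hypothetical polled source standing in for a real stream subscription; field names and the filtering rule are illustrative):

    import time
    from collections import deque

    # Bounded temporary store; from here the data is picked up by the regular pipeline
    # or loaded into temporary views.
    temporary_store = deque(maxlen=10000)

    def poll_source():
        """Hypothetical stand-in for a stream subscription or a quickly repeated API call."""
        return [{"sensor": "s1", "value": 42.0, "ts": time.time()},
                {"sensor": "s2", "value": None, "ts": time.time()}]

    def acquire_once():
        """Poll the source once, filter out irrelevant events and buffer the rest."""
        for event in poll_source():
            if event["value"] is not None:      # simple filtering rule
                temporary_store.append(event)

    acquire_once()    # in a real system this would run continuously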

Stream Processing - Data Stream Analysis: The functional sub-component for Data Stream Analysis is directly derived from requirement VEL1.1 (Analyse Streaming Data) and its sub-requirements. It refers to the timeliness challenge; the task is to automatically analyse streaming data while it is flowing and to react accordingly to the analysis results, as described in Section 2.1.3. This is different from analysis in a temporary view as described under Data Stream Acquisition. Loading streaming data into temporary views can only supplement manual analysis tasks (e.g. reporting or ad-hoc analytics), but due to its manual nature cannot ensure a timely reaction to the in-flowing data. This needs to be done via an automatic analysis, which sends a response back to the source application, so it can adjust the process, or sends an alert into a report or dashboard. By its nature, stream analysis is however fundamentally different from analysing data at rest. The data is always incomplete in the sense that there is not a fixed data set as input, but the data set is growing during the analysis and ‘unknown before execution’ [214]. Stream analysis is ongoing and incremental and requires ‘one-go’ analysis. Techniques like sketching and approximation are therefore fundamental [126], while they play no or only a minor role in analysis at rest. An extensive description of these techniques and special stream mining algorithms is out of the scope of this thesis, as the focus is on the general architecture for ‘big data’ and an overview of infrastructure software to support it. For an introduction and deeper discussion of these concepts I refer to the respective literature [41, 42, 43, 55, 69, 79, 214]. Data Stream Analysis can however use partial results, models and rules pre-calculated during a deep analytics task at rest to accelerate the analysis [45][231, pp. 9-14], hence the backflow from the data analysis component in the diagram in Figure 4.1. Data Stream Analysis also includes the triggering of actions based on analysis results, the communication of results back to the transactional application or alerts to users.
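
To illustrate the incremental, ‘one-go’ character of stream analysis, the following sketch maintains a moving average over a sliding window and triggers an alert when it exceeds a threshold (in Python; the window size and threshold are assumed parameters, not values from the thesis):

    from collections import deque

    class MovingAverageAlert:
        """Incremental analysis over a sliding window of the most recent events."""

        def __init__(self, window=100, threshold=10.0):
            self.window = deque(maxlen=window)
            self.running_sum = 0.0
            self.threshold = threshold

        def update(self, value):
            """Incorporate one new event and report whether an alert should be raised."""
            if len(self.window) == self.window.maxlen:
                self.running_sum -= self.window[0]   # value about to drop out of the window
            self.window.append(value)
            self.running_sum += value
            average = self.running_sum / len(self.window)
            return average > self.threshold          # True means: trigger an action or alert

    detector = MovingAverageAlert(window=5, threshold=3.0)
    alerts = [detector.update(v) for v in [1, 2, 8, 9, 7, 1]]   # later values trigger alerts

Each event is processed exactly once and only a bounded amount of state is kept, which is the defining property of this kind of analysis; sketching techniques generalize the idea to more complex statistics.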

Information Extraction

Information extraction is derived from requirement VAR2.1 and its sub-requirements. It aims at all functionality to extract information from semi-structured and unstructured content and thereby impose structure on it, that is, storing the extracted information in structured form [45, 46, 57]. This can include simple parsing and structure extraction from semi-structured sources, e.g. XML or HTML pages [200]. It also refers to using more complex techniques like text analytics and natural language processing for classification, entity recognition and relation extraction from unstructured data [57]. Using ontologies, e.g. expressed in OWL, can also be helpful, e.g. for identifying and classifying entities, while, conversely, information extraction can be used to populate ontologies.

Information Extraction - Classification: Classification refers to all functions that aim at classifying unstructured data along different dimensions, e.g. extracting a topic or performing sentiment analysis.


Information Extraction - Entity Recognition: Entity recognition is used to identify unique entities in unstructured data. This also includes entity matching to merge identical entities and therefore needs to consider synonyms, that is different names referring to the same entity, e.g. due to term evolution, and homonyms, that is different entities sharing the same name. An entity recognition component furthermore needs to categorize the identified entities, e.g. as persons or products. This is easier if domain knowledge can be used and the categorization is based on a set of known and named entity types. It is harder if general entities, whose possible types are not known in advance, need to be extracted [57].

Information Extraction - Relation Extraction: Building on entity recognition it is also possible to extract facts or relations about entities, e.g. in the form of triples comparable to RDF. These facts can either be attributes and attribute values, e.g. the age of a person, or general relationships between entities, e.g. a person ‘is born’ in a city, where both the person and the city are extracted entities. Relation extraction is often limited to a context where possible entity types are named and known in advance, as the extraction of attributes necessitates knowledge about which attributes can be present. For the task of extracting general relationships it is furthermore especially useful to use ontologies that specify possible relationships between entity types. Furthermore, there is an interdependence between extracting attributes for an entity and the entity matching task during entity recognition, as identical or similar attribute values are a strong indication for entities to be identical as well [57].

Information Extraction - Structure Extraction: This function aims to extract information out of semi-structured data by using the structure that is given. This can include parsers to parse and extract metadata from binary objects, splitting an object of comma-separated values into single attributes, or parsing and extracting relevant information from documents specified using some markup language, e.g. translating XML or JSON files into a relational schema or extracting relevant information from the HTML code of crawled web pages [200].
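
As a sketch of this kind of structure extraction (in Python; the document layout and the chosen target fields are hypothetical), a JSON document can be flattened into a relational-style row:

    import json

    def json_to_row(document):
        """Extract the relevant fields of a semi-structured document into a flat record."""
        parsed = json.loads(document)
        return {
            "order_id": parsed.get("id"),
            "customer": parsed.get("customer", {}).get("name"),
            "total": sum(item.get("price", 0.0) for item in parsed.get("items", [])),
        }

    row = json_to_row('{"id": 1, "customer": {"name": "M. Maier"}, "items": [{"price": 9.5}]}')
    # row == {'order_id': 1, 'customer': 'M. Maier', 'total': 9.5}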

Manage Data Quality / Uncertainty

This component is derived from VER1 and its sub-requirements. It partially refers to data cleaning, that is, correcting errors in the data, completing empty attributes and optionally identifying and eliminating noise and outliers [182]. It therefore refers to handling data quality problems that originate from a single data set, contrary to data integration, which refers to harmonizing different data sources and handling quality issues and inconsistencies that occur when combining data from different sources. Data cleaning techniques can generally be classified as data scrubbing, that is, using domain knowledge and rules to identify and correct errors (e.g. defining a threshold for an attribute value), and data auditing, that is, using statistical, data mining and machine learning techniques to do the same. A simple example for the latter is to use the average of an attribute value and to identify tuples that exceed a tolerance interval around that average as outliers. One can also distinguish between several sub-functions: value completion, outlier detection & smoothing, duplicate filtering and inconsistency correction [61, pp. 101-105]. Note, however, that data can only be cleaned to a certain extent and the resulting data quality is still largely dependent on the quality of the source data. Another part of managing data quality refers to giving users an indication of the credibility of the data and an estimate of how reliable results based on it are. This is related to metadata management (see below), mainly provenance tracking, presenting users with the necessary information and calculating a

trustworthiness score. This cannot, however, happen as a separate function within the data processing pipeline. At this point the processing is simply not complete, and information from processing steps further down the pipeline also needs to be incorporated into the score. Therefore, this needs to happen integrated into other functions, namely metadata management (provenance tracking) and data analysis as well as the user interface (calculating and presenting the trustworthiness score).
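
One simple way to compute such a score, sketched here under the assumption that every processing step is assigned a confidence factor (the factors and the starting value are purely illustrative), is to start from a per-source trust value and discount it for every step applied to the data:

    def propagate_trust(source_trust, step_factors):
        """Multiply the initial trust of a data source by one factor per processing step.

        Each factor in (0, 1] expresses how much confidence a cleaning, extraction or
        integration step preserves; the product is the score presented to the user.
        """
        trust = source_trust
        for factor in step_factors:
            trust *= factor
        return trust

    # Illustrative values: a fairly reliable source, followed by information extraction
    # and probabilistic entity matching, each introducing some uncertainty.
    score = propagate_trust(0.9, [0.95, 0.8])   # 0.684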

Manage Data Quality - Value Completion: Value Completion is directly derived from requirement VER1.1.1 (Value Completion). It includes all functionality to fill incomplete and empty attribute values. This can be based on simple constants, derivations of values from other attributes or statistical and machine learning techniques. A simple example for the latter would again be to use the average of all observed attribute values.
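
As a minimal sketch of that average-based variant (in Python; the attribute name and sample records are illustrative):

    def complete_with_mean(records, attribute):
        """Fill missing values of one attribute with the mean of the observed values."""
        observed = [r[attribute] for r in records if r[attribute] is not None]
        mean = sum(observed) / len(observed)
        for r in records:
            if r[attribute] is None:
                r[attribute] = mean
        return records

    data = [{"age": 30}, {"age": None}, {"age": 50}]
    complete_with_mean(data, "age")   # the empty field is filled with 40.0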

Manage Data Quality - Outlier Detection & Smoothing: Outlier detection is derived from requirement VER1.1.2 (Value Correction). The goal is to identify outliers and errors and to correct or smooth them. The identification can be based on rules, thresholds or on statistical or machine learning models. As a reaction, a tuple which contains incorrect attribute values can either be simply flagged, discarded or passed to a separate error handling area where it can be manually inspected and corrected, or it can be automatically corrected or smoothed. Correcting an attribute value is similar to value completion and can share functionality. Generally, an empty attribute value can be seen as a special version of an outlier.
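
The tolerance-interval approach mentioned above could be sketched as follows (in Python; the interval width k and the sample values are assumptions for illustration):

    import statistics

    def flag_outliers(values, k=3.0):
        """Flag values that lie outside mean +/- k standard deviations."""
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values)
        return [abs(v - mean) > k * stdev for v in values]

    flags = flag_outliers([10, 11, 9, 10, 10, 11, 9, 10, 120], k=2.0)
    # only the last value is flagged as an outlier

Whether a flagged tuple is then discarded, smoothed or routed to an error handling area is a separate policy decision.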

Manage Data Quality - Duplicate Filtering: The function of this component is obvious: filtering out multiple data items that actually refer to one identical object or entity. This is easy if one tuple is just duplicated using the same key or at least identical values of the other attributes. It gets more complicated if attribute values differ between duplicates and involves record matching techniques in those cases [110]. Additionally, this also means that two possible attribute values need to be reduced to one valid value. Using an error handling area for a manual decision can again be a good option if the issue does not occur too often and is feasible to correct manually. Note also that there is some overlap between Duplicate Filtering and Entity Matching during Data Integration, as both involve identifying and resolving identical entities.
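
A small sketch of approximate duplicate detection based on string similarity (in Python; the similarity threshold and the compared name strings are illustrative, and real record matching would combine several attributes):

    from difflib import SequenceMatcher

    def is_probable_duplicate(a, b, threshold=0.85):
        """Treat two records as duplicates if their name attributes are sufficiently similar."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    is_probable_duplicate("Markus Maier", "Markus  Maier")   # True despite the extra blank
    is_probable_duplicate("Markus Maier", "Peter Smith")     # False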

Manage Data Quality - Inconsistency Correction: There are often constraints on single attributes or on a combination of attributes, e.g. a ‘late delivery’ flag needs to be set if the interval between ‘order date’ and ‘delivery date’ is too large. There is also the issue of referential integrity between relations. If these constraints are violated, e.g. because they are not checked and imposed in the data source, the violations need to be identified and corrected if possible. Note again that there is some overlap with error detection. Note further that especially cases of referential integrity, e.g. a tuple of one relation referencing a tuple of another relation which does not exist, are often impossible to correct automatically. In those cases the only options are either to discard the reference or the whole tuple, to create a dummy tuple in the referenced relation or to have the correction done manually.

Data Integration

Data Integration is derived from requirement VAR3 and its sub-requirements. It aims at all measures to persistently or virtually harmonize data from different sources and transform it into one overarching schema, so it can be queried together. This mainly includes schema integration and entity resolution [106, 128, 212].


Data Integration - Schema Integration: Schema Integration, also called schema mapping, is derived from requirement VAR3.1 (Schema Integration). Integration at this level refers to all harmonization that needs to be done because the data in heterogeneous sources is modelled differently. The goal is to convert the original schemas of two or more data sources into one global schema and to transform the data accordingly so it can be queried together based on that global schema [106]. This can include field or attribute mappings from the source schemas to the global schema [128], the harmonization of data formats (e.g. converting from type ‘number’ to type ‘date’), representation and coding harmonization (e.g. converting flags ‘male’/‘female’, ‘0’/‘1’ or ‘m’/‘f’ to one global representation), string harmonization (e.g. converting ‘Markus Maier’ to ‘Maier, Markus’) as well as harmonization and transformation to one valid unit of measure [61, pp. 95-101]. In its simplest form, data integration can be implemented via a set of fixed mapping and transformation rules, but there are also efforts to automatically generate probabilistic schema mappings, e.g. based on machine learning techniques or fuzzy string matching [128, 212]. These latter approaches can also be used to bootstrap the integration rules for later refinement [90]. Another approach can be to use semantic web technology and ontologies to describe the different schemas and then use inference and ontology mappings for the integration [48, 169].
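
The simplest, rule-based form can be sketched as a set of fixed field mappings and value transformations (in Python; the source and target field names and the coding table are illustrative):

    # Illustrative mapping rules from one source schema to the global schema.
    FIELD_MAP = {"cust_name": "customer_name", "gndr": "gender"}
    GENDER_CODES = {"m": "male", "f": "female"}

    def to_global_schema(source_record):
        """Rename fields and harmonize codings according to the fixed mapping rules."""
        target = {FIELD_MAP[k]: v for k, v in source_record.items() if k in FIELD_MAP}
        if "gender" in target:
            target["gender"] = GENDER_CODES.get(str(target["gender"]).lower(), target["gender"])
        return target

    to_global_schema({"cust_name": "Maier, Markus", "gndr": "m"})
    # {'customer_name': 'Maier, Markus', 'gender': 'male'}

Probabilistic schema mapping would replace the fixed FIELD_MAP by mappings learned from the data, but the transformation step stays structurally the same.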

Data Integration - Entity Resolution: Entity Resolution is derived from VAR3.2 (Entity Integration) and aims at integration on the entity level, that is, identifying data that refers to the same entity and integrating it (e.g. matching ‘Markus Maier’ and ‘M. Maier’). This can be done simply using key harmonization, that is, using mapping tables to map keys used in different data sources to surrogate keys in a global, common key space. However, creating such mapping tables can be tedious, especially considering large volumes of data, and it is often also infeasible to identify identical entities just based on keys [61, pp. 96-97]. Therefore, statistical and probabilistic record linkage approaches are used, also called entity matching or entity resolution [83, 106, 110, 174, 212, 225]. Note again that there is some overlap with Duplicate Filtering during the data cleaning phase, but during the data integration phase the emphasis is not on just filtering out records referring to the same entity, but on merging them, as they typically hold different information and attributes depending on the data source they are coming from. This merge can be easy if the sets of attributes in both sources do not overlap, while it gets more involved if both sources share attributes with contradicting values. Merging in the latter case is treated in the research field of ‘data fusion’ [68, 106].
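
The key-harmonization variant can be sketched as a mapping table that assigns surrogate keys in a common key space and merges the attributes of matched records (in Python; the source names, keys and attributes are illustrative):

    # Illustrative mapping of source-specific keys to surrogate keys in a global key space.
    KEY_MAP = {("crm", "C-47"): 1001, ("webshop", "u-8812"): 1001}

    def merge_entities(records):
        """Group records from different sources by surrogate key and merge their attributes."""
        merged = {}
        for source, key, attributes in records:
            surrogate = KEY_MAP.get((source, key))
            if surrogate is None:
                continue              # unmatched records would go to an error handling area
            merged.setdefault(surrogate, {}).update(attributes)
        return merged

    merge_entities([
        ("crm", "C-47", {"name": "Markus Maier"}),
        ("webshop", "u-8812", {"last_order": "2013-09-30"}),
    ])
    # {1001: {'name': 'Markus Maier', 'last_order': '2013-09-30'}}

Probabilistic record linkage replaces the fixed KEY_MAP by a similarity-based matching step; contradicting attribute values would additionally require data fusion rules.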

Data Analysis

The data analysis component is derived from VAL1 and its sub-requirements. It contains all functionality that directly aims at deriving meaning and insights from data. Generally, one can distinguish several categories of data analysis functionality. This functionality is either directly end-user facing, or it prepares information, calculates intermediate results and creates insights, which are then stored together with and enrich their input data and are themselves input for further, directly end-user focussed processing and presentation. This is the case for value deductions and deep analytics tasks. The reasons to separate these functions from the presentation in time can be performance reasons, which necessitate pre-computation, especially in cases which typically involve computation in batch mode, or organizational reasons, e.g. to centralize the definition of critical, enterprise-wide key figures in one place, which can then be used by different analysis tools. On the other hand, in these cases key figure definition is less flexible as a trade-off. User-facing analysis can be categorized by the level of interactivity and freedom for the user. On the one end, there are rather fixed and pre-defined reporting & dashboarding solutions; on the other end, there is free ad-hoc analysis, which allows users to freely query data and define their own computations.


Guided ad-hoc analysis (e.g. traditional OLAP approaches) and data discovery & search lie somewhere in between the two, as both allow different levels of free navigation while typically limiting the definition of own computations. Note that these distinctions can be somewhat fuzzy, with dashboards gaining capabilities for navigation and filtering, moving them closer to guided ad-hoc analysis. The distinction between free and guided ad-hoc analysis is fuzzy in itself, depending on where to draw the line with respect to freedom and query flexibility. Furthermore, free ad-hoc analysis also allows for statistical and predictive analysis tools sharing similar techniques with deep analytics. There the differences depend very much on the available hardware performance, scalability and the amount of data involved, which dictate whether the timeliness of the analysis requires batch computation or allows interactive analysis. One characteristic is that users during ad-hoc analysis often need to rely on sampling the available data, while deep analytics can incorporate all available data, with the possible size of the sample and input data depending on the complexity of the computation (as well as on the available hardware and the performance and scalability of the underlying system).

Data Analysis - Value Deductions: Deducing values is the simplest of all analysis tasks. It means adding additional information with a single record or a small number of records (e.g. if a business transaction is represented by one header and several item records) as input. This can refer to calculating key figures (e.g. deriving the profit from revenue and costs) or enriching master data information (e.g. adding a contact person based on a given company). These value deductions are typically computed when the data gets transformed for and loaded into an analysis data store and can then be accessed by the front-end analysis tools. Most of the time, the pre-computation is not done for performance reasons, as it is not very complex and could normally be done in real-time without adding performance issues. The reason is typically to compute critical, enterprise-wide key figures in one place, so they adhere to the same definition every time they are used in some front-end tool.
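
Such a deduction is a simple per-record computation applied at load time; as a sketch (in Python, with illustrative attribute names):

    def add_profit(record):
        """Derive the 'profit' key figure from revenue and costs when loading the record."""
        record["profit"] = record["revenue"] - record["costs"]
        return record

    add_profit({"order_id": 1, "revenue": 120.0, "costs": 85.0})
    # {'order_id': 1, 'revenue': 120.0, 'costs': 85.0, 'profit': 35.0}

Centralizing this calculation in the load process is what keeps the key figure definition consistent across all front-end tools.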

Data Analysis - Deep Analytics: Deep Analytics is directly derived from requirement VAL1.1 (Deep Analytics) and refers to all data analysis tasks that are too complex or involve too much input data to be conducted interactively and are therefore batch computed. Often this includes more complex techniques like machine learning, data mining and statistical analysis [62, 83], but batch computation can also be necessary for simple computations (e.g. computing ratios) where the input data is very voluminous. Where interactive analysis often uses some sort of sampling and only takes in parts of the available data, deep analytics tasks can involve extensive parts or even all of it. Deep Analytics tasks typically compute and store insights, rules, models or extracted relationships between data items, which are later utilized and navigated through using end-user facing analysis functions, e.g. reporting and ad-hoc analysis. Results can also be incorporated into other tasks of the processing pipeline, allowing for a feedback loop, e.g. by incorporating mined classification models into value correction during data cleaning.

Data Analysis - Reporting & Dashboarding: Reporting & Dashboarding is derived from requirement VAL1.2 (Reporting). It describes the definition of fixed reports and dashboards, which largely limit navigation and flexibility of the end-users. Dashboarding additionally incorporates visualization techniques, e.g. traffic lights and diagrams, to allow for easier and faster grasping of the presented data and key figures. They are often used for executive personnel or general employees who have little time to spend and often lack the lower-level skills for sifting through data, but need a fast and comprehensive overview of available information, e.g. the company’s performance in different areas. Reporting & Dashboarding often relies on pre-calculated values or incorporates deep analytics


results, but can also compute simple key figures on the fly, especially if those key figures are only needed for a single report.

Data Analysis - Guided Ad-hoc Analysis: Guided Ad-hoc Analysis is derived from VAL1.3.2 (Ad-hoc Analytics). It refers to tool-based ad-hoc analytics with fixed navigational options (e.g. traditional OLAP navigation). Fixed navigational options mean that users can typically navigate and filter through dimensions and master data attributes, with the values of the corresponding key figures being adjusted. This allows users to investigate results by drilling down into the data and adding detail (e.g. switching from showing the revenue per country to showing the revenue per single location or per country and product) or switching the perspective (e.g. switching from showing revenue per country and product to showing revenue per country and customer). The users are, however, often limited to a pre-defined set of key figures and cannot define or change their calculation. It is also not possible to do deeper analysis of the data, e.g. by using statistical methods. Furthermore, the input data sources to navigate through are also fixed.

Data Analysis - Free Ad-hoc Analysis: Free Ad-hoc Analysis is also derived from VAL1.3.2 (Ad-hoc Analytics) and distinguishes itself from guided ad-hoc analysis by the level of flexibility it provides to its users. While users are granted a higher level of freedom, this also means that the respective tools require a higher level of skill and knowledge about the underlying analysis concepts and a larger time effort to sift through the data and use these tools. In exchange for this larger skill and time requirement, users can freely query available data sources, usually based on directly formulating queries in some declarative language (e.g. SQL, PigLatin or HiveQL). Tools for advanced or predictive analytics using statistical and data mining techniques (e.g. SAS or SPSS) also fall into this category. The user communicates with these tools interactively, formulating a query and getting rapid responses, to sift through the data and allow analytical reasoning over the data at usage time [45, 111, 195].
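
As a sketch of the declarative, free-form case (in Python, using an in-memory SQLite database with made-up sales data as a stand-in for the analytical data store; the schema and query are illustrative):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (country TEXT, product TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                     [("NL", "A", 120.0), ("NL", "B", 80.0), ("DE", "A", 200.0)])

    # An ad-hoc query formulated directly in a declarative language (SQL).
    for row in conn.execute(
            "SELECT country, SUM(revenue) FROM sales GROUP BY country ORDER BY 2 DESC"):
        print(row)
    conn.close()

In a ‘big data’ setting the same interaction pattern would run against an engine such as Hive or Pig rather than SQLite, but from the analyst’s perspective the workflow of iteratively formulating and refining declarative queries is the same.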

Data Analysis - Data Discovery & Search: Data Discovery & Search is derived from requirement VAL1.3.1 (Data Discovery). It aims less at analysing data for new insights and more at discovering and sifting through available data (structured and unstructured) and possible data sources. It is largely based on metadata to provide users with an overview of what data is available and to help them in deciding which available data to use, e.g. for free ad-hoc analysis, and how to sample it. It also includes functionality to search for specific data items or documents utilizing free-text search.

Data Distribution

The data distribution component is partly derived from requirement VOL2 and its sub-requirements. It contains all functionality that aims at distributing and making data and analysis results available either to human users through a user interface or to other applications that further work with them through an application interface.

Data Distribution - User Interface: One or several user interfaces are used for visualization and presentation of data and results to end-users. This can e.g. include portals or distributing reports as PDF by e-mail. It decouples presentation, which is typically done at a client, from analytical functionality, which is done closer to the data in an application server layer or even in-database.


Data Distribution - Application Interface: Data stored in a ‘big data’ system landscape and analysis results are often provided to other applications. This can include technical interfaces and APIs to access data and processing results from the system as well as message passing and brokering functionality. Note that data consuming applications can very well be identical to data sources, in which case the communication represents a feedback loop. Furthermore, there are also supporting functions, which support several of those steps and cannot be assigned to a single phase. These functions are shown horizontally; while they can potentially interact with each of the functional steps, this is not included in the diagram for readability reasons, as drawing a connector from each functional step to each supporting function would clutter and complicate the diagram.

Data Storage

The data storage component is directly derived from requirement VOL1. It includes all data stores, either temporary for caching or persistent, along the data processing pipelines. There can be several of those stores, and besides the processed data itself, data storage is also important for metadata. On the other hand, data storage is a supporting function and has no direct influence on the analysis results. From a high-level view, storage is therefore a vertical function that supports several functional components, either by storing their input, their results or both. In a detailed view this will be broken down into concrete data stores along the processing pipeline. Generally, several sub-functions of data storage can be distinguished: staging, data management oriented storage, sandboxing and application optimized storage.

Data Storage - Staging: Staging refers to temporarily storing data after it has been extracted from its source to allow cleaning, integration and transformation routines to run on this data. It effectively leads to a faster de-coupling from the source systems, decreasing the strain put onto them. It also enables faster access to the raw data, so it can already be used while the cleaning, integration and transformation processes are still under way and before it has been loaded into a durable data store.

Data Storage - Data Management oriented Storage: Data management oriented storage refers to long-term data stores, which are heavily managed, e.g. regarding available metadata and schema. The data is typically heavily cleaned and integrated, providing a centralized, enterprise-wide realization of the idea of a ‘single version of the truth’. The concept is very comparable to a traditional Enterprise Data Warehouse, especially the basis database2. It is often used to distribute data to other, distributed data stores.

Data Storage - Sandboxing: Sandboxing can directly be derived from requirement VAL3 (Support Experimentation). It refers to temporary data stores that are created for single users, user groups or departments to play and experiment with data, data processing and analysis techniques. The data is copied from various sources along the ‘big data’ system, e.g. from the data management oriented data store or from source applications, just for experimentation. The users of the sandbox are free in what processing techniques they want to use for experimentation, typically incorporating free ad-hoc analysis and deep analytics functionality, but also ad-hoc cleaning and integration. The sandbox allows them to experiment with this data as well as with new ideas of processing the data and to manipulate the data without risking negative impacts on other analytical applications. Some of

2 compare Section 2.3.1


the gained insights and results can be incorporated back into the processing pipeline (e.g. machine learning techniques and models which were found valuable). Furthermore, a sandbox environment is a good place for data analysis tasks that are non-recurring, as the source data can be copied raw for one-time use and processed ad-hoc without imposing high effort for modelling it, extending a global schema or creating cleaning and integration procedures.

Data Storage - Application Optimized Storage: Application optimized storage is comparable to the concept of data marts in traditional data warehouse architecture2. Data is typically copied from the data management oriented storage area and loaded into the application optimized storage area, with the data being transformed and stored in a form optimized for the respective analysis objective and application. This can either refer to schema transformation, e.g. transforming the data into a star schema2, or to the database technologies used to store the data, e.g. the usage of column-oriented storage3. Furthermore, an application optimized data store only holds the subset of all available data necessary for the respective application, and the data is enriched during the transformation and loading process using information from value deductions or deep analytics results (e.g. a classification of data records based on some machine learning model).
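As an illustration of such a transformation into an application optimized structure, here is a minimal sketch of splitting integrated records into a heavily simplified, hypothetical star-schema-like layout (one dimension table, one fact table):

    # Integrated records as they might come from the data management oriented store.
    integrated = [
        {"order_id": 1, "customer": "ACME", "country": "NL", "revenue": 100.0},
        {"order_id": 2, "customer": "ACME", "country": "NL", "revenue": 150.0},
    ]

    # Split into one dimension table and one fact table referencing it by surrogate key.
    customer_dim, fact_sales = {}, []
    for record in integrated:
        key = (record["customer"], record["country"])
        if key not in customer_dim:
            customer_dim[key] = {"customer_key": len(customer_dim) + 1,
                                 "customer": record["customer"],
                                 "country": record["country"]}
        fact_sales.append({"order_id": record["order_id"],
                           "customer_key": customer_dim[key]["customer_key"],
                           "revenue": record["revenue"]})

    print(list(customer_dim.values()), fact_sales, sep="\n")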

Metadata Management

The metadata management component is directly derived from requirement VAR4 and its sub-requirements. Metadata management refers to the extraction, creation, storage and management of structural, process and operational metadata. The first type describes structure, schema and formats of the stored data and its containers, e.g. tables. Process metadata describes the different data processing and transformation steps, data cleaning, integration and analysis techniques. Operational metadata describes the provenance of each data item: when and from which sources it was extracted, which processing steps have been conducted and which are still to follow. Structural and process metadata needs to be provided for all data structures and steps along the processing pipeline and operational metadata needs to be collected accordingly. Metadata management is therefore a vertical function consisting of metadata extraction, metadata storage, provenance tracking and metadata access.

Metadata Management - Extract Metadata: This function is directly derived from requirement VAR4.2 (Extract Metadata). To store, manage and provide metadata information it is obviously first necessary to extract metadata. Extraction of metadata occurs for all data sources and storage areas within the ‘big data’ system [45, 169][203, pp. 74,77]. This can involve different techniques based on the nature of the data source, e.g. extracting it from the metadata section of microformats within an HTML page, extracting an additionally published RDF file or another schema file, extracting schema definitions from a relational database or using the XML Metadata Interchange (XMI) format. Sometimes metadata about data sources also needs to be manually prepared, e.g. adding information about expected timeliness and completeness of data from a certain source, but also adding enterprise-wide term definitions to data sources, data and key figures [203, pp. 68-71,74].
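As a purely illustrative sketch of one of these techniques, schema extraction from a relational source, the following reads table and column definitions from an SQLite database (the database file name is a hypothetical assumption):

    import sqlite3

    def extract_structural_metadata(db_path: str) -> dict:
        """Read table and column definitions from a relational source (SQLite here)."""
        conn = sqlite3.connect(db_path)
        metadata = {}
        # List all user tables in the source database.
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        for (table,) in tables:
            # PRAGMA table_info returns (cid, name, type, notnull, default, pk) per column.
            columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
            metadata[table] = [
                {"column": name, "type": col_type, "primary_key": bool(pk)}
                for _, name, col_type, _, _, pk in columns
            ]
        conn.close()
        return metadata

    # Example: metadata = extract_structural_metadata("crm_source.db")

Other source types (HTML microformats, RDF, XMI files) would need their own extractors feeding the same metadata repository.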

Metadata Management - Store Metadata: This is directly derived from requirement VAR4.1 (Store Metadata). The purpose is to provide a centralized metadata repository and storage area. All metadata that is extracted or gathered during system operation is stored there and can be centrally

3 compare Section 4.3

queried and accessed. This also means that an overarching schema and model of the metadata needs to be established. Standard models or ontologies can help here, e.g. the Dublin Core standard [169].

Metadata Management - Track Provenance: Provenance tracking is derived from requirement VAR4.3 (Track Provenance). This function describes the extraction and collection of administrative and operational metadata during the different data processing steps. This includes job logging of processing steps, run time of components and information about volume and timeliness of loaded data. It furthermore includes user statistics, data access and reporting logs [203, pp. 75-77]. Especially the logging of the processing steps data items went through, of probabilities and confidence intervals for statistical and machine learning techniques and of the timeliness of loaded data allows end-users to trace analysis results back to the data sources and make educated estimations about the reliability and relevance of data and analysis results [45, 169]. It is also the basis for calculating a trustworthiness score in analytical applications as stated in requirement VER1.2 (Track Trustworthiness) [45].
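A minimal, illustrative sketch of collecting such operational metadata per processing step could look as follows; the step name and record structure are hypothetical assumptions, and in a real system the log would be written to the metadata repository:

    import time
    from datetime import datetime, timezone

    provenance_log = []  # would be persisted in the metadata repository in practice

    def tracked_step(step_name):
        """Decorator that records operational metadata for a processing step."""
        def decorator(func):
            def wrapper(records, *args, **kwargs):
                started = datetime.now(timezone.utc)
                t0 = time.perf_counter()
                result = func(records, *args, **kwargs)
                provenance_log.append({
                    "step": step_name,
                    "started_at": started.isoformat(),
                    "runtime_seconds": round(time.perf_counter() - t0, 3),
                    "records_in": len(records),
                    "records_out": len(result),
                })
                return result
            return wrapper
        return decorator

    @tracked_step("data_cleaning")
    def clean(records):
        # hypothetical cleaning rule: drop records without a customer id
        return [r for r in records if r.get("customer_id") is not None]

    cleaned = clean([{"customer_id": 1}, {"customer_id": None}])
    print(provenance_log)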

Metadata Management - Metadata Access: Metadata access is derived from VAR4.4 (Inspect Metadata) and allows users and administrators to access stored metadata, to manipulate it and to enhance it. The latter is often necessary for business metadata, which cannot be extracted from technical sources, e.g. adding a business glossary, term and key figure definitions [203, pp. 68-71].

Data Lifecycle Management

The data lifecycle management component is directly derived from requirement VAL4 and its sub-requirements. It includes all activity aimed at the management of data across its lifecycle, from its creation to its discarding. It is typically based on a rules engine (Rule-based Data and Policy Tracking), which determines the value of data items and accordingly automatically triggers data compression, data archiving or discarding of stale data [202, pp. 133-140]. It is also used to move data between different storage solutions and to trigger processing steps either according to direct rules, e.g. scheduled jobs, or according to the rule-derived value of data, e.g. moving data from an in-memory database to a disk-based database if it is used less frequently. Data lifecycle management needs to be conducted along the processing pipeline without having direct influence on analysis results and is therefore a vertical function.

Data Lifecycle Management - Rule-based Data and Policy Tracking: This function is derived from requirement VAL4 (Data Lifecycle Management). Data Lifecycle Management tasks should be automated as much as possible and allow the definition of processing schedules as well as rules and triggers, e.g. based on the rule-derived relevance of data or on the volume of a certain data store. The relevance of data can itself be determined based on rules, e.g. based on the age of the data. These rules should comply with company policies and governmental regulations, they should be monitored automatically and actions should be triggered accordingly [203, pp. 134-137].
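A minimal sketch of such a rule-based trigger mechanism (illustrative only; the age and access thresholds as well as the action names are assumptions and would in practice be derived from company policies and regulations):

    from datetime import datetime, timezone

    def lifecycle_action(record_age_days: int, access_count_last_90d: int) -> str:
        """Toy rules engine deciding a lifecycle action for a data item."""
        if record_age_days > 7 * 365:
            return "discard"          # retention period exceeded
        if record_age_days > 2 * 365 and access_count_last_90d == 0:
            return "archive"          # old and unused: move to the archive store
        if access_count_last_90d < 5:
            return "compress"         # rarely used: keep, but compressed
        return "keep"

    # Example evaluation for one data item.
    created_at = datetime(2020, 1, 1, tzinfo=timezone.utc)
    age_days = (datetime.now(timezone.utc) - created_at).days
    print(lifecycle_action(age_days, access_count_last_90d=0))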

Data Lifecycle Management - Data Compression: This function is derived from requirement VAL4.1 (Data Compression) and includes techniques to compress data in the different storage areas. The compression itself can be triggered based on rules, see Rule-based Data and Policy Tracking [202, pp. 137-138].


Data Lifecycle Management - Data Archiving: Data Archiving is derived from requirement VAL4.2 (Data Archiving). This includes storage areas for archiving data, procedures to do the archiving and triggers to run these procedures based on rules [202, pp. 137-138].

Privacy

The privacy component is directly derived from requirement VAL5. It includes all methods and techniques used to ensure privacy of the stored data, namely authentication and authorization, access tracking and data anonymization. To be effective, these techniques obviously need to be applied during all steps and at all data stores along the processing pipeline, with a special emphasis on the data distribution component, which is typically the point of access to data and functions within the system. Privacy is therefore a vertical function. [81][203, pp. 79-99]

Privacy - Anonymization: Anonymization is derived from requirement VAL5 (Privacy). Often, governmental regulations and company policies require data about customers, employees etc. to be anonymized before it gets presented to end-users. This is even harder if several data sources are merged, as cross references can be used to de-anonymize data [160]. Therefore sensitive data needs to be identified, tagged in the metadata repository and accordingly anonymized or de-identified. Options are to manipulate certain data fields by replacing or even deleting values, to aggregate data so it does not refer to single persons anymore, to add noise to the data or to swap values between records [160][203, pp. 84-86].
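A minimal sketch of three of these options, suppression, generalization and noise addition (field names and the noise scale are hypothetical assumptions, not prescribed by the reference architecture):

    import random

    def anonymize(record: dict) -> dict:
        """Suppress direct identifiers, generalize a date and perturb a sensitive value."""
        anonymized = dict(record)
        # Suppression: remove or mask fields that directly identify a person.
        anonymized.pop("name", None)
        anonymized["customer_id"] = "***"
        # Generalization: reduce a birth date to the year of birth only.
        if "birth_date" in anonymized:
            anonymized["birth_year"] = anonymized.pop("birth_date")[:4]
        # Noise addition: perturb the salary so single records are less re-identifiable.
        if "salary" in anonymized:
            anonymized["salary"] = round(anonymized["salary"] + random.gauss(0, 500), 2)
        return anonymized

    print(anonymize({"customer_id": 4711, "name": "J. Doe",
                     "birth_date": "1985-06-01", "salary": 42000.0}))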

Privacy - Authentication & Authorization: This function is also derived from requirement VAL5 (Privacy). It adds centralized services to identify users, e.g. based on a password system, and to selectively grant them authorization to sensitive data only to the extent which is necessary for their job function and which is unproblematic [81, 169].

Privacy - Access Tracking: Access tracking is derived from requirement VAL5 (Privacy). Access tracking is based on Authentication & Authorization and logs who requests access to which data sets and items, whether that access request is valid and if and how data is manipulated during each access [81][203, pp. 98-99].

Comparison to traditional data warehousing functionality

From its principal structure and the idea of pipelining data, the reference architecture looks similar to data warehousing architectures and processes as described in Chapter 2.3.1. In those, functionality is also broken down into different phases or stages, which the data sequentially passes through. This is very similar to the processing pipeline used above. Compare for example the architecture proposed by Bauer and Günzel [61, pp. 37-86]. Some of the components in their data warehouse architecture can easily be mapped to the ‘big data’ functions proposed here. Both contain a data extraction component, and the first data transformation step in the data warehouse architecture can be mapped to the data cleaning and data integration functions. Others have more subtle similarities: the different data stores in the data warehouse architecture (staging area, basis database, derived database and analysis database) are e.g. all subsumed in the vertical data storage function, but they exemplify the same idea of data being stored along the processing pipeline. Furthermore, the components in the upper part of the data warehouse architecture, namely


the second transformation step and the loading into the derived database and the analysis database, are all subsumed in the data analysis function of the ‘big data’ reference architecture. The analysis database itself and the final analysis component are both partly included in the data analysis and the data distribution components of the ‘big data’ reference architecture. Furthermore, there are also some differences that should be emphasized. In the ‘big data’ reference architecture, the processing pipeline is coupled more loosely. While the sequence in traditional data warehouse architectures is strict and data is typically persisted after each step, for ‘big data’ several steps can be skipped and the data volume does not always allow persisting data in each layer. Furthermore, traditional data warehousing does not include stream processing components4 including a feedback loop back to the data source. Of course there are also differences within the single components. The data extraction and storage components e.g. need to consider unstructured data, which is not the case in traditional data warehousing. This also means that the information extraction component is new, as it especially aims at handling unstructured data. One should however note that there are indeed efforts to build an ‘unstructured data warehouse’, see e.g. Inmon’s Data Warehouse 2.0 approach [136]. The similarities and differences get even more apparent once detail is added to the functional reference architecture (see Figure 4.2).

Figure 4.2: Reference Architecture: Detailed functional view

The diagram also depicts functions that are new (e.g. data stream analysis) or fundamentally more important (e.g. deep analytics and sandboxing) compared to traditional data warehouse architectures, coloured blue. Of course, this comparison is to some extent uneven, as the data warehouse architecture from Bauer and Günzel [61, pp. 37-86] describes a more implementation-oriented view, while the view discussed here emphasizes general functions. However, the comparison shows that there is a large overlap

4 even though there are efforts for real-time data warehousing, which typically include direct access to the data source with virtual data cubes


between traditional data warehouse architecture and the newly emerging ‘big data’ architecture. This is the reason why I believe that an Enterprise Data Warehouse still has a core place in a ‘big data’ architecture. This will become more apparent in Section 4.2.2, where I present the implementation-oriented view of the reference architecture. Several patterns from the data warehouse architecture will also be present there, but extended with components to tackle the new requirements of ‘big data’. In general, one can see this as an evolutionary, rather than a revolutionary rip-and-replace development.

4.2.2 Reference Architecture Development View

After preparing a view onto the system functions needed in the last section, in this section I will map them to more concrete and implementation-oriented system components. An architecture diagram is depicted in Figure 4.3 and will be explained afterwards. While designing the architecture I considered and incorporated several design principles taken from literature. First I will shortly introduce those principles; later, during the architecture description, I will point back to them where they were applied.

One very basic idea, picked up and expressed by Marz and Warren [164, pp. 8-10], is the distinction between basic and derived data and the corresponding conclusions for the architecture of ‘big data’ systems. Basic data cannot be derived from any other data item and is often a transaction or event that triggers the data recording. The sales amount of a product is e.g. derived from the basic data of each single purchase transaction, or the ‘friends’ count on a social media application is derived from the single ‘friend’ relationships, which themselves are derived from the basic data of ‘add friend’ and ‘delete friend’ transactions. This idea leads to the design principle of immutability. The basic idea here is to keep basic data in a separate store, where data is never updated, only added; the basic data is therefore immutable. Consider e.g. the social media ‘friend’ example from above, where a ‘delete friend’ record is added instead of deleting or updating the initial ‘add friend’ record. In cases where data sources provide updated records, these are just added with a timestamp, e.g. if the address of a customer is changed. All other information can then be derived from this basic data and will be stored in another data store, which is either regularly and entirely recomputed or indeed enables updates. Where applicable, this idea provides several advantages. First, due to timestamping and keeping all basic transactions, it enables analytical applications to take the entire history into account and to do comparisons over time and time series analysis. Considering the social media example again, it allows applications to take into account that somebody had a ‘friend’ relationship with somebody else before, but does not anymore. This is already more information than the pure absence of said relationship. Second, it provides (human) fault tolerance. The basic data cannot be deleted as it is immutable, and if derived data gets accidentally deleted it can always be recomputed from the basic data without the need to extract it from the source system again. This is especially important considering that source systems might not hold and provide data forever. Third, it takes away the need for random writes in the database system and therefore reduces complexity; locking mechanisms are e.g. not necessary anymore [164, pp. 33-36]. (A small sketch of this principle follows the description of the Raw Data Archive below.)

Another design principle is to design the architecture with a ‘data highway’ [144]. This is especially useful for incorporating streaming data into manual analysis. The basic idea here is to use a series of temporary data stores (Kimball calls them caches), which are connected via ETL processes for information extraction, data cleaning and integration. The data quality therefore increases along the ‘data highway’, but the latency of the data does so too. Analytical applications and their users can then use the data store best suited for the particular need, considering the data quality / timeliness trade-off.
Data used within the streaming analysis applications is therefore the one extreme being real-time but raw and not integrated, while the Enterprise Data Warehouse would be the other extreme being

highly integrated and cleaned, but with data being updated in batch, often only once a day.

Another principle, which was already adopted in requirement VEL1.1.1.2, is to pre-compute results where possible. This is especially useful for real-time stream processing. The idea is to compute intermediate results and models within a backend function over data acquired from a stream on a regular basis, e.g. every 5 minutes, and make them available to a frontend function, which uses them to analyse or respond to the data that is just flowing in from a stream in real-time [170].

There is also a set of principles proposed by Begoli and Horey [63], namely the support of a variety of analysis methods, the need to process structured and unstructured data in different data stores as well as to incorporate batch processing and data preparation steps, and finally the accessibility of data by different user groups and target applications. I believe again that all these principles have already been considered during the requirements specification.

Figure 4.3 shows a diagram of the proposed reference architecture. As discussed in Section 4.2.1, there are several components that are similar to traditional data warehousing architectures. These are again coloured blue. Note also the different notations for software components that represent processing and transformation of data and for data stores. From a structural point of view one can roughly interpret the diagram as depicting a data flow from left (data sources) to right (analysis applications), while the upper part of the diagram targets processing of data at rest and the lower part targets processing of streaming data.

The distinction between Structured Data Sources, Semi-Structured Data Sources and Unstructured Data Sources is taken over from the functional architecture, as the type of data source impacts the later data flow. Of course, there can be several data sources of each type; they are however modelled as one for the sake of simplicity. Technically, data sources can be quite diverse5 and should not be limited when considering a reference architecture. Most often, however, they refer to different types of databases. For structured data these are typically relational; for semi-structured or unstructured data these can be XML or document stores, key-value stores, column-family stores or graph databases3. In other cases they can e.g. refer to web sites or applications that expose data via APIs, but they are not limited to that.

For each data source, there typically is one Extraction and one Monitor component, together implementing the Data Extraction function from the functional architecture. The monitor’s task is to identify changes in the data source so they can be incrementally extracted and propagated to later functions. There might be situations where a monitor is not necessary for data extraction from a source, as the data volume in the source is small enough to allow for a full extraction and load of the data with old data being entirely overwritten. Most of the time this is however infeasible due to the data volume and its effect on the performance of the ETL process. Therefore monitors are needed for an incremental extraction of the respective data sources. The extraction component then takes the task of reading changed data from the data source and loading it into the ‘big data’ system environment; which store exactly depends on the type of data.
It can either be directly triggered by the monitor in case of (a certain threshold of) changes or be regularly scheduled, just using the information of the monitor to decide about the delta of data to extract [61, pp. 87-94] (a minimal sketch of a timestamp-based delta extraction is given below). The implementation and functionality of both monitor and extraction components depend very much on the nature of the underlying data source and which techniques it supports. Monitoring can e.g. be trigger-based, with the data source informing the monitor about changes, based on examining log files to identify new and changed data, or based on timestamping the data. Sometimes this requires adjustments to the data source, e.g. a timestamping mechanism or a mechanism to message the monitor with information about data changes. The data extraction is obviously dependent on the

5compare to Section 2.1.4

interface the source offers. This can mean (but is not limited to) extracting data via a standard database interface (typically ODBC), via calling a proprietary database or application interface (e.g. in the case of some so-called ‘NoSQL’ databases3), via lower-level file system operations in the case of file interfaces, or even crawling web sites [61, pp. 87-94]. After the extraction, data gets either transformed and loaded into a traditional Enterprise Data Warehouse environment or loaded into the Raw Data Archive, depending on the nature and relevance of the data.

The Enterprise Data Warehouse6 will keep its place in a ‘big data’ environment for several reasons and fulfil a couple of functions. First, it fulfils the function of a data management oriented storage area to persist relevant results of the processing steps Information Extraction, Manage Data Quality and Data Integration. In that, it still fulfils the traditional function of a data warehouse as the central storage area for carefully cleaned and enterprise-wide integrated data, therefore said to present the ‘one version of the truth’. In a ‘big data’ environment this role is however constrained, as the Enterprise Data Warehouse can only realize it for directly relevant and structured data, where structured data also includes structure extracted from semi- or unstructured data during the Information Extraction step. Furthermore, the Enterprise Data Warehouse has a distributional function, as it provides and is the single source of cleaned and integrated, structured data for analytical applications. The on-going importance of the Enterprise Data Warehouse is also backed up by a study of the Data Warehousing Institute from 2011, which shows high business commitment to a central data warehouse, but also good potential growth for other technologies to augment the data warehouse for workloads it is not designed and suitable for [194].

The process of getting data into the Enterprise Data Warehouse is very comparable to an ETL process in traditional data warehousing architecture. This is reflected by the components Staging Area, Data Cleaning, Data Integration and Data Load, and the ETL process as well as the role of the staging area very much compare to the explanations in Section 2.3.1. They obviously map to the according functions in the functional view from Section 4.2.1: Manage Data Quality, Data Integration and Staging. Note at this point that especially the Data Integration, but also the Extraction component can be a bottleneck, not regarding performance, but regarding how quickly new data sources can be connected to the system. As discussed in Section 4.2.1, manually creating mapping rules takes time, but this is still the most widely used practice. There are two possible solutions, which were also mentioned when describing the Schema Integration and Entity Resolution functions. One is to use bootstrapping methods, that is, to automatically create integration rules using probabilistic schema mappings. Another is to use ontology-based schema mappings. Furthermore, note the backflow from the Enterprise Data Warehouse to the Data Integration and Data Cleansing components. Already loaded data can e.g. be used as a basis for Outlier Detection and probabilistic Record-Linkage methods within those components. A difference to traditional architectures is that the Staging Area can also be loaded from the Raw Data Archive, optionally using the Information Extraction component.
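Coming back to the Monitor and Extraction components described above, the following is a minimal, purely illustrative sketch of a timestamp-based monitor and delta extraction; the source database, table and column names are hypothetical assumptions:

    import sqlite3

    def extract_delta(source_db: str, last_extracted_at: str):
        """Read only records changed since the last extraction run (timestamp-based monitoring).
        Assumes the hypothetical source table 'orders' carries an 'updated_at' column."""
        conn = sqlite3.connect(source_db)
        conn.row_factory = sqlite3.Row
        rows = conn.execute(
            "SELECT * FROM orders WHERE updated_at > ? ORDER BY updated_at",
            (last_extracted_at,),
        ).fetchall()
        conn.close()
        return [dict(r) for r in rows]

    # The extraction job remembers the high-water mark of the previous run
    # and loads only the delta into the staging area or the raw data archive.
    delta = extract_delta("source_system.db", last_extracted_at="2013-10-01T00:00:00")

Trigger-based or log-based monitoring would replace the timestamp comparison, but the overall flow of extracting only a delta stays the same.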
The Raw Data Archive is also a persistent storage area, but a counterpart to the Enterprise Data Warehouse in the sense that it acts as a storage area and archive for all raw data and for data that is not directly loaded into the Enterprise Data Warehouse. The reason can be that this data does not fit into the Enterprise Data Warehouse (e.g. unstructured data) or that it is not considered relevant enough to put the effort into cleaning and integrating it, but should still be stored for future reference7. In this sense, it has a function as a lower-cost (compared to the Enterprise Data Warehouse) data sink and archive. Note that often the Raw Data Archive is also combined with the Staging Area: all extracted data gets loaded into the Raw Data Archive, while highly relevant data gets transformed, cleaned and integrated within it before it gets further loaded into the Enterprise Data Warehouse.

6 compare Section 2.3.1
7 compare to the discussion in Section 2.1.2

Figure 4.3: Reference Architecture: Detailed implementation-oriented view


The Raw Data Archive also fulfils the function of persistently storing and managing unstructured and semi-structured data and of integrating it with the structured data in the Enterprise Data Warehouse. This integration can be done in two ways. First, by using a common, logical key space, where data from both storage areas can reference each other. An insurance claim transaction stored in the Enterprise Data Warehouse could e.g. reference a textual claim description stored in the Raw Data Archive. Additionally, data from the Raw Data Archive and the Enterprise Data Warehouse can be further integrated during data analysis, either in the according analysis-oriented data storage areas or directly within analysis tools. Second, the Information Extraction component fulfils the according function from the functional architecture, that is, extracting structured information from the unstructured data stored in the Raw Data Archive. This extracted information can then be loaded into the Staging Area, incorporated into the general ETL process and therefore integrated with other structured data. Note that there can also be a back-flow of extracted information to accompany the raw data stored in the Raw Data Archive. Furthermore, in the sense of storing and providing long-term access to raw data, the Raw Data Archive supports requirement VAR3.3 (Original Sources).
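To illustrate the immutability principle introduced at the beginning of this section as applied to the Raw Data Archive, here is a minimal sketch (illustrative only; the event structure is an assumption): basic ‘add friend’ / ‘delete friend’ events are only ever appended, and a derived view such as the current friend count is recomputed from them.

    from datetime import datetime, timezone

    raw_archive = []  # append-only store of basic events; existing entries are never updated

    def record_event(user: str, friend: str, action: str) -> None:
        """Append a basic 'add_friend' or 'delete_friend' event with a timestamp."""
        raw_archive.append({
            "user": user,
            "friend": friend,
            "action": action,                       # "add_friend" or "delete_friend"
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        })

    def current_friend_count(user: str) -> int:
        """Derived data: recomputable at any time from the immutable basic events."""
        friends = set()
        for event in raw_archive:                   # events are stored in arrival order
            if event["user"] != user:
                continue
            if event["action"] == "add_friend":
                friends.add(event["friend"])
            elif event["action"] == "delete_friend":
                friends.discard(event["friend"])
        return len(friends)

    record_event("alice", "bob", "add_friend")
    record_event("alice", "carol", "add_friend")
    record_event("alice", "bob", "delete_friend")
    print(current_friend_count("alice"))            # -> 1, while the full history is preserved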

The main function of both Enterprise Data Warehouse and Raw Data Archive is the storage and management of data. Analysis applications, however, often only require an extract of all available data, do not necessarily need data at the highest level of detail but aggregated, and work more efficiently on top of an optimized structure, e.g. a star schema. Tackling this function of application optimized storage, as described in Section 4.2.1, is the task of the Analysis Data Stores. These store physically persistent views, which are built from the data in both the Enterprise Data Warehouse and the Raw Data Archive and loaded into the analysis data stores via the Transformation component. In that, the Analysis Data Stores can be seen as applications of the principle to pre-calculate intermediate results.

Furthermore, there are two more storage areas located within the system. One of these is the Sandboxing Area. Sandboxes obviously aim at fulfilling the sandboxing function from the functional architecture and therefore support requirement VAL3 (Support Experimentation). They can be created and filled as required by loading data from the Enterprise Data Warehouse and the Raw Data Archive. The data in a sandbox is typically temporary and based on a certain experiment or analysis project. Using a sandbox can be necessary for several reasons, most of the time because there is no analysis data store available which holds the data modelled in the way required for an experiment, and it is not suitable to build one for the task at hand, e.g. because it is a one-time analysis and there is no need for on-going support of that task. Another reason is that the data needs to be manipulated within the experiment, which would interfere with the usage of an analysis data store by other applications.

The last data store is the Search Index, built over the data in the Raw Data Archive using the Indexing component. It supports the Data Discovery & Search function by indexing unstructured data from the Raw Data Archive and therefore allowing users to search through it based on free-text queries. Typically this is achieved by using an inverted index extended with additional information, e.g. the author of a text or other document-related metadata. The Search Index is then used by the Data Discovery & Search component, which implements the actual, user-initiated search process through the data. The index can however also be used by other applications, e.g. as input for a deep analytics task.
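The inverted index mentioned here can be sketched as follows (a toy illustration with hypothetical documents; real indexing systems add tokenization, ranking and document metadata):

    from collections import defaultdict

    documents = {   # hypothetical documents from the raw data archive
        "claim-17": "water damage in the kitchen after pipe burst",
        "claim-18": "car damage after collision on the highway",
    }

    # Build the inverted index: each term maps to the documents containing it.
    inverted_index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            inverted_index[term].add(doc_id)

    def search(query: str) -> set:
        """Return the ids of documents containing all query terms."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(inverted_index.get(terms[0], set()))
        for term in terms[1:]:
            result &= inverted_index.get(term, set())
        return result

    print(search("damage after"))   # -> {'claim-17', 'claim-18'}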

Analytical applications, representing the different sub-functions of Data Analysis in the functional architecture view, sit on top of these different data stores and access the stored data. This involves the directly user-facing analysis applications Reporting & Dashboarding, Guided Ad-hoc Analysis and Free Ad-hoc Analysis as well as Data Discovery & Search, but also Deep Analytics tasks. The distinction between those different applications has already been explained with the respective sub-functions in Section 4.2.1, so a deeper explanation shall be omitted here. Additionally, data from the persistent

data stores, i.e. the Enterprise Data Warehouse, the Raw Data Archive and the Analysis Data Stores, can be made available to other applications via the Data Access API.

Deep Analytics applications can either be scheduled, recurring tasks or they can be built as part of a one-time experiment or analysis project. In both cases they typically do not include direct presentation of results to users, but results, mined rules and models are written back and stored in the respective storage areas, where they can be accessed by other analysis applications for discovery and interpretation. In the first case, the input data is taken from the persistent storage areas of the Analysis Data Stores, the Raw Data Archive and possibly its Search Index, while results can be written back to the Raw Data Archive and the Analysis Data Stores, but also to the Enterprise Data Warehouse if they are considered relevant enough to become part of the integrated core data. In the second case, input data is taken from a sandbox, while results are written back to the same sandbox. These results are typically not incorporated into the persistent data stores, but insights from experimentation can lead to the derivation and development of recurring deep analytics tasks based on them.

The user-facing analysis applications, Reporting & Dashboarding, Guided Ad-hoc Analysis and Free Ad-hoc Analysis, are all based on respective Analysis Data Stores. Considering this connection, it makes sense to design Analysis Data Stores in a way that they can support several analysis applications, to make governance and management of these stores easier. However, if an analysis application has very special requirements, this can lead to the design of an exclusive Analysis Data Store. Additionally, Free Ad-hoc Analysis applications can also be based on a sandbox for experimentation, while Data Discovery & Search applications are obviously based on the Search Index.

Considering the design of the architecture, especially the different storage areas, but also the flow of data from extraction to the analysis applications, one can identify the principles of basic and derived data and of immutability as described by Marz and Warren [164] and above. The basic or raw data is stored in the Raw Data Archive, allowing (parts of) all other data stores to be reconstructed if necessary. Data is derived and manipulated on multiple levels, first by using information extraction, cleaning and integrating the data before loading it into the Enterprise Data Warehouse. Data is however not aggregated, so data in both stores is on the same level of detail, allowing cross-referencing of data between both stores. While data in the Raw Data Archive is always immutable and append-only, the data in the Enterprise Data Warehouse typically is as well. Data can however be deleted from the Enterprise Data Warehouse, or timestamped versions of the same data item can be collapsed if the information is not relevant enough anymore. The next level of derivation is when data is transformed and loaded into the Analysis Data Stores. Data there is typically not immutable, as aggregates are regularly changed when data items are added to them. Furthermore, Analysis Data Stores can always be reconstructed from the Enterprise Data Warehouse and the Raw Data Archive due to their derived nature.

Besides analysing data at rest as described above, there is also the need to process data in motion.
As described before8, there are two general tasks: first, analysing in-flowing data in real-time and reacting to it accordingly, and second, acquiring streaming data and incorporating it into the regular data processing pipeline. The in-flowing data itself gets generated in some transactional system, the Streaming Data Source. The nature of possible data sources and in-flowing data can be quite diverse, ranging from systems that record machine-generated data from sensors to social media streams and high-volume transactional business systems9.

8 see Section 3.2.2, especially requirement VEL1 and sub-requirements, and Section 4.2.1
9 for an overview of possible data sources see Section 2.1.4


The Stream Acquisition component’s task is to acquire the data stream from these sources and load it into the ‘big data’ system. Stream acquisition can e.g. involve subscribing to a publishing service or streaming API, simulating the data stream by repeatedly calling an API, or even extending the source application to create a message stream and push it to the Stream Acquisition component.

From the Stream Acquisition component, in-flowing data gets loaded into the Stream Staging Area. Its function is similar to that of the regular Staging Area: to temporarily hold the data until it is further processed. This occurs in two ways. First, the data is written into the Raw Data Archive, from where it gets, via the Information Extraction component, into the Staging Area and into the regular processing pipeline. Typically, the processing in the regular pipeline however happens in periodic intervals, e.g. overnight. Sometimes this waiting time is not acceptable. To avoid it, the streaming data (or important parts of it) can be preliminarily transformed using the Stream Data Transformation component and then be loaded into Analysis Delta Stores. This transformation and loading can either be an on-going process, or data can be gathered for a short period (e.g. a minute) with a periodically occurring transformation. Additionally, because of timing constraints, the transformation typically does not involve as thorough a data cleaning and integration as the regular ETL process. In that, it trades off data quality and integration against timeliness of available data. The data in the Analysis Delta Stores can then be accessed by different analysis applications and via the Direct Access API, and it can be joined with more historic data from the Analysis Data Stores. Once the respective streaming data has gone through the regular ETL process and is available in an Analysis Data Store, it can be deleted from the Analysis Delta Stores.
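A minimal sketch of this micro-batch style transformation from a stream staging area into an analysis delta store (window length, record structure and the aggregation are assumptions chosen for illustration):

    from collections import defaultdict

    stream_staging_area = [   # hypothetical raw events gathered over a short period
        {"sensor": "s1", "value": 21.5},
        {"sensor": "s1", "value": 22.0},
        {"sensor": "s2", "value": 19.0},
    ]
    analysis_delta_store = {}  # lightly transformed data, available within minutes

    def transform_micro_batch(events):
        """Aggregate the raw events of one micro-batch; only light cleaning, no full integration."""
        sums, counts = defaultdict(float), defaultdict(int)
        for event in events:
            if event.get("value") is None:          # minimal cleaning only
                continue
            sums[event["sensor"]] += event["value"]
            counts[event["sensor"]] += 1
        return {sensor: sums[sensor] / counts[sensor] for sensor in sums}

    # Periodically (e.g. every minute) load the transformed batch and clear the staging area.
    analysis_delta_store.update(transform_micro_batch(stream_staging_area))
    stream_staging_area.clear()
    print(analysis_delta_store)   # -> {'s1': 21.75, 's2': 19.0}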

The second task is analysing streaming data and reacting to it. It is best practice to place this component as close as possible to, and ideally within, the operational system that processes the data in the first place and is responsible for handling these transactions, e.g. business transactions. The need to communicate with another system over a network is detrimental to applying analysis results in real time and reacting to them, that is, adjusting the operational processing of the respective transaction or just giving feedback to the user. Therefore, the Stream Analysis component should be placed within the operational application and integrated with it, if this is possible, rendering it an external component from the viewpoint of the ‘big data’ system.

However, this is not always possible, as there are use cases where data from several sources needs to be streamed in, the data streams need to be joined and analysed in combination and results need to be streamed back to different Operational Applications. Note that these can be identical with the Streaming Data Source, but do not need to be. In these cases it is necessary to have a central Stream Analysis component, which can be placed within the ‘big data’ system. In those cases, streaming data gets acquired as before and the Stream Acquisition component directly forwards the acquired data streams into the Stream Analysis component. The data streams should be directly forwarded instead of the Stream Analysis component accessing them from the Stream Staging Area, as persisting the data in between takes time and works against the ‘real-time’ requirement.

In both cases, placing the Stream Analysis component within the data source or placing it within the ‘big data’ system with several data streams flowing in, it has several interfacing points with other components of the ‘big data’ system. First, models, rules and intermediate results created from deep analytics tasks can be made available and sent to the Stream Analysis component as required, where they can be joined with and applied to the data streams. Again, this refers back to the principle of pre-computing intermediate results. Second, results from the Stream Analysis component can be written to the Stream Staging Area to be joined back with the actual streaming data from the Stream Acquisition component. Third, the Stream Analysis can lead to alerts for human actors. It can make sense to send these alerts and integrate them into reports and dashboards for an according

topic. Fourth, streaming analysis results can be made available through the Data Access API, to which applications outside the ‘big data’ system can subscribe to get this result stream forwarded.

Additionally, note that there are several parts of the functional view which do not directly show up in the implementation-oriented view. These are mostly horizontal functions, which need to be connected to almost every other component of the reference architecture. One of them is Metadata Management, another one is Data Lifecycle Management. There is also no general Monitoring and Management component. These are left out of the diagram above mainly for clarity reasons. The Metadata Management component is connected to every data processing component, as operational metadata and provenance information gets extracted while data is processed and the respective jobs run. Additionally it is connected to every data store to extract structural metadata. The Metadata Management component manages the extraction of metadata, stores it within a Metadata Repository and is the single point of access to this repository. It offers metadata out of the repository via an API to analytical and data discovery components, but possibly also to external applications. Data Lifecycle Management functions need to be supported by every data store component. They need to support data compression and archiving (which can involve moving data from the Enterprise Data Warehouse or Analysis Data Stores to the Raw Data Archive) internally, which can then be globally triggered by a Data Lifecycle Management component. This component receives data from the Metadata Management component, uses it to check rules in a rules engine and triggers appropriate lifecycle tasks within the single data stores. A similar statement applies to Privacy. Authentication and authorization as well as access tracking functions need to be supported by every single user-facing component and by the different data stores. Anonymization tasks can be conducted during Data Transformation before loading data into an Analysis Data Store, where it can be accessed by end-users. However, this might not be enough. Often it is not suitable to store or process non-anonymized data anywhere in the ‘big data’ system. In these cases, anonymization needs to be done as part of the Data Extraction component or as a distinct component directly following Data Extraction and Stream Acquisition.

One principle that is followed by the structure of the stream processing components is the principle of the ‘data highway’ proposed by Kimball [144] and described above. The first cache is within the streaming application itself, based on the Stream Analysis component, and allows the immediate and automatic analysis of and reaction to the raw data, with the results flowing back into the operational systems and being accessible through the Data Access API or within Dashboarding & Reporting applications. The next cache are the Analysis Delta Stores, which include partly cleaned and integrated data, which is not real-time but possibly available within minutes and between the periodic ETL runs. Finally, the last cache are the Analysis Data Stores, which provide thoroughly cleaned, integrated and transformed data, but only with a time lag depending on the scheduled period for ETL runs, typically within hours or even only after a nightly load. Additionally, it is possible to add another cache by allowing access to the Stream Staging Area by analysis applications.
This would allow manual access to raw data from streams that is slightly more timely, possibly available within seconds, compared to the Analysis Delta Stores, but not cleaned or integrated at all.


4.3 Reference Architecture Technologies

This section will describe and discuss technologies that are connected with ‘big data’. The goal is to identify strengths and weaknesses of each discussed technology and to place them into the reference architecture10. The discussion is structured such that the focus will first be on infrastructural technologies that are widely recognized and used to build ‘big data’ systems: database technologies, covering relational as well as so-called ‘NoSQL’ databases, and Apache Hadoop and its ecosystem. Keep in mind, however, that this can by no means be an exhaustive list of all possible options. Given the hype around ‘big data’ and its technologies, but also around ‘NoSQL’ databases, the space is in a state of flux with new products constantly being released. Therefore I will focus on general techniques, ideas and models, as a discussion of each single product would first exceed the scope and space of this thesis and second become stale very soon.

4.3.1 General Developments and Influences onto Database Technology

Divergence in the database industry and ‘polyglot persistence’: The database industry has seen a lot of movement during the last couple of years with the advent of the family of ‘NoSQL’ databases, more specialized relational databases sometimes referred to as ‘NewSQL’, but also the influence the Hadoop ecosystem had on data management. Both motivated the formation of a big start-up scene around these concepts. This is especially notable considering that before, except for academic research, the market was pretty stable, with on-going development of existing systems but without many new products being released. The obvious choice for most applications was either one of the big relational, multi-purpose databases (mainly the Oracle, IBM DB2 and Microsoft SQL Server product lines) or one of the smaller-scale relational open-source systems (mainly MySQL and PostgreSQL). The ideas of diverging technologies and products were mostly integrated into the former systems or did not create a lot of industry traction, e.g. in the case of independent XML stores [199, pp. 981-1023] and object-oriented databases [199, pp. 945-980]. The most notable exception to this rule were special-purpose, shared-nothing distributed data warehouse systems in the likes of Teradata, Netezza, Greenplum and others. One reason for this new development originated from web applications and the growth of data and workload, with the need for distribution to scale with a transactional workload of mixed read/write operations. Another reason was the need for complex analytical and read-heavy computations over a large volume of data. Besides these main reasons, there are still several smaller ones, which I will describe in the discussion of the respective sub-categories below.

Another idea that co-emerged with this technological divergence, drove it but is also driven by it, is the idea of ‘one size does not fit all’ or ‘polyglot persistence’. The idea of ‘one size does not fit all’ was first introduced by Stonebraker and Cetintemel [204] and got further explained and developed in a couple of papers [204, 206, 209, 210]. Initially focussed merely on relational database systems, the idea predicted that the time of large multi-purpose systems would be over, in favour of special-purpose systems with a leaner architecture optimized for certain workload characteristics. Namely, they predicted four application areas that databases would be specially designed for: text databases, data warehouse systems, stream processing or high-frequency transactional systems and scientific applications. They further predicted that each of these special-purpose systems would outperform multi-purpose systems by a factor of 10 in its respective field [209]. They further elaborated on some unique characteristics of data warehouse systems and stream processing systems and how the architecture of special-purpose systems would be optimized for those [204, 206]. Some examples

10see Section 4.2.2

they mentioned for data warehouse systems were the use of column-oriented11 instead of row-oriented storage, an emphasis on materialized views and favouring bitmap indexes over B-trees. With the advent of ‘NoSQL’ systems and new data models in those systems, the very similar concept of ‘polyglot persistence’ was introduced. The idea was again that different applications within an organization would rely on different database systems, depending on which data model and architectural characteristics best fit the particular application, but even more, that one application would use several database systems based on the different tasks the application has to perform [187, pp. 7][113, pp. 133-140,147-152]. This is very different from the previous approach of using a certain relational database product for almost every data storage need. It is also very applicable to the ‘big data’ reference architecture considering the different storage areas involved: Staging Area, Raw Data Archive, Enterprise Data Warehouse, Analysis Data Stores, but also Stream Staging Area and Analysis Delta Stores. In the remainder of this section, I will introduce several types of database systems and map them to the mentioned storage areas based on the characteristics of the former and the requirements of the latter. Before doing this, I will however explain several concepts that have an influence on the architecture and therefore on the characteristics of these systems.

CAP theorem: Another idea that is often used by ‘NoSQL’ manufacturers and proponents to argue for their architectural choices [113, pp. 53-56] is the CAP theorem, first introduced in a keynote by Brewer [76] and subsequently formalized and proven by Gilbert and Lynch [120]. While the theorem was stated in 2000 and in the context of distributed systems in general, it has lately also received a lot of attention in the database community, as the need for scalability due to data volume and workload leads to more distributed database architectures. This motivated an on-going discussion about the theorem, which is e.g. reflected by an issue of IEEE Computer12 which contained several articles especially focussing on and discussing the CAP theorem [37, 66, 75, 121].

In its original form, the CAP theorem stated that in a distributed system it is only possible to achieve two out of three desirable properties: consistency, availability and partition tolerance. In this context Gilbert and Lynch [120] defined consistency using a linearizable consistency model, which states that there must be a ‘total order on all operations such that each operation looks as if it were completed at a single instant’ [120]. More intuitively, this means that a request to a distributed system needs to be atomic and behave the same way as if it were executed on a single node, or, in a database context, as having a ‘single up-to-date copy of the data’ [75]. Later, Gilbert and Lynch [121] generalized and relaxed this definition, requiring that a response to a request is the right one depending on the specification of the operating service. Availability means that whenever a non-failing node receives a request, this node must eventually be able to respond. And finally, partition tolerance means that the system should be able to cope with network partitions and allow for an arbitrary number of messages to be lost [120, 121].

There are, however, several things to note. First, consistency in the context of the CAP theorem has a smaller scope than consistency as defined within the ACID properties. While the latter covers consistency over the schema and the state after an operation, adhering to database and foreign key constraints, consistency within CAP means that read and write operations always occur on the latest version of a data item [75]. Second, availability is only influenced by non-failing nodes being able to respond. Failing nodes do not impact availability in this scenario, as long as not all nodes fail that hold replicas of a certain data item. Third, partition tolerance is different from the other two properties. While a system architecture can be developed in favour of, and therefore influence, either availability or consistency, it has no influence on network or node failures that result in separated partitions.

11 see below for a further discussion of the concept
12 Computer 45(2), Feb 2012


Network partitions are therefore a given if they occur, and the system must cope with them, with the probability of network and node failures increasing as the cluster grows bigger [37, 75, 121]. This third observation leads to the conclusion that the CAP theorem mainly implies that a system needs to decide between availability and consistency in the situation of an occurring partition. One can think of this in an intuitive way. Consider that a partition occurs and nodes in several partitions hold a replica of a data item. If a node receives a read or write request, it now needs to decide whether it processes this request or not. In the former case, the node cannot be aware whether the respective data item was already updated on a node in another partition, therefore consistency is not ensured. In the latter case the system sacrifices availability [75]. This essentially assumes that the system replicates data over several nodes. If it does not, this would however imply that it gives up availability, as a non-reachable node makes it impossible to respond to any request that refers to data items that particular node holds.

One implication of this fact is that, according to the CAP theorem, there does not need to be a trade-off while the network is not partitioned. During all other times a distributed system can actually maintain both, even if many ‘NoSQL’ systems tend to give up strong consistency even during times in which the network is completely functional [44]. Defending this choice with CAP is one of the theorem’s big misinterpretations, but there can still be good reasons to do so. One of them is to improve latency and scalability. Maintaining consistent states over several replicas can be expensive in terms of time, therefore requiring a trade-off between consistency and latency13. For this reason, the CAP theorem falls somewhat short as a guideline to reason about these properties. Abadi [37] therefore proposed another means to think about them. This resulted in the formulation of PAC/ELC, which should be interpreted as follows: if a system faces a network (p)artition, does it choose (a)vailability or (c)onsistency? (E)lse, does it choose (l)atency or (c)onsistency? Obviously this is valid for choosing databases within a ‘big data’ system as well.

Implications of the CAP theorem and PAC/ELC onto database systems: Based on the observations above, several databases, especially in the ‘NoSQL’ field14, are architected around weaker consistency models [56], therefore trading off consistency to gain either availability, latency or both. In this case databases are often referred to as following the BASE model, which means they are (b)asically (a)vailable, but can be in a (s)oft state and only provide (e)ventual consistency [76, 179]. Keep in mind, however, that BASE is not as rigidly defined as ACID, especially considering the meaning of ‘basically available’ and ‘soft state’ [113, p. 56]. In general, basically available means that availability is the primary goal for these systems and both read and write operations should be possible most of the time; soft state means that the database has not one concrete state, but the state is in flux, that is, at points in time one replica of a data item might differ from a replica on another node. Finally, eventual consistency means that these conflicts will ‘eventually’ be resolved and that the system guarantees that, if a data item receives no more updates, current updates will ‘eventually’ be propagated to all replicas. [76, 179, 223][199, pp. 852-853] The first implementation change when dropping ACID is to give up on distributed transactions. Distributed transactions as implemented in most distributed, relational databases rely on commit protocols to ensure that all nodes that are touched by a transaction (either due to replication or due to sharding) commit or roll back a transaction together, and on distributed locking [199, pp. 830-847]. Distributed transactions are however expensive in terms of overhead and hurt latency and scalability.

13 Further reasoning why this trade-off is necessary can be found in Abadi [37]
14 While it would be possible to trade off consistency and drop full ACID requirements in a relational database, to the best of my knowledge there is no product, neither open-source nor commercial, that does so


Therefore, many ‘NoSQL’ systems do not allow distributed transactions, but only transactions that touch one data shard on one node or even only transactions over a single data object. Considering the replication model, eventual consistency is typically implemented by allowing each replica to accept and conduct both read and write operations. Writes are then asynchronously propagated to the other replicas and, once a network partition is resolved, they will eventually reach all replicas for that particular data item. The time until writes are propagated is the inconsistency window. Note that if other replicas also update the same data item during the inconsistency window, write/write conflicts can occur and need to be resolved by merging the updates. Typically this needs to be done within the application layer, where the database just informs about these conflicts [199, p. 853]. There are however also several other options how to relax consistency, based on different replication and distribution models with different impact on latency and availability. That is why Brewer [76] talked about the decision between ACID and BASE not as a binary choice, but as a spectrum where a trade-off needs to be established. Lloyd et al. [156] e.g. describe COPS, a system that guarantees causal consistency. This consistency model is stronger than eventual consistency but weaker than strong, ACID-like consistency. The trade-off in this sense is based on the implemented distribution model. Options are e.g. master-slave or quorum-based replication models15. For example, a quorum-based protocol provides better consistency than what is described above if the number of replicas required to agree on a write (W) overlaps with the number of replicas required for a read (R), considering the total number of replicas (N). This is the case for (W + R > N). In this case, a read operation always touches at least one replica that is aware of the latest write operation. On the other hand, availability is decreased, because if after a partition or after a failure of several nodes there are not enough replicas available to fulfil the read or write quorum, the system is not able to conduct the respective operation. For the case described above, which could essentially be seen as a special instance of a quorum-based model with (R = 1), (W = 1) and (R + W ≤ N), one replica within a partition is enough to serve both read and write requests. Latency is also lower, as there is no overhead of several replicas needing to agree. Considering this example, it also gets apparent that one can fine-tune a distributed database not only by choosing a certain distribution model, but also by configuring its parameters. Coming back to the quorum example, one can trade off between read availability/latency and write availability/latency by shifting the required quorum sizes to either the read or the write side, while (W + R) remains larger than N. A quorum can e.g. be configured Read-One, Write-All to favour read latency and availability over write latency and availability. Many databases allow these settings to be configured and fine-tuned. Cassandra can e.g. be configured using one of the consistency settings ONE (one replica is enough for reads and writes), QUORUM (the majority of replicas is required for reads and writes) and ALL (all replicas are required for reads and writes) [113, pp. 103-104]. Zhu et al. [230] even describe an extension to Cassandra that allows defining a latency limit, with the system optimizing consistency settings while making sure the latency limit is not exceeded.
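As a minimal sketch of the quorum arithmetic above (in Python, with invented helper names), the following function checks which guarantees a given N/R/W configuration provides:

```python
# Minimal sketch of the N/R/W quorum reasoning described above.
def quorum_properties(n: int, r: int, w: int) -> dict:
    """n = total replicas, r = read quorum, w = write quorum."""
    return {
        # Read and write quorums overlap, so a read sees the latest write.
        "read_sees_latest_write": r + w > n,
        # Two write quorums overlap, so conflicting concurrent writes are prevented.
        "no_conflicting_writes": 2 * w > n,
    }

# Read-One, Write-All: favours read latency/availability over writes.
print(quorum_properties(n=3, r=1, w=3))
# R = 1, W = 1: the eventually consistent special case from the text.
print(quorum_properties(n=3, r=1, w=1))
```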

Application of the CAP theorem to the reference architecture: When applying the CAP theorem or the PAC/ELC model to the reference architecture, it is necessary to consider the importance of consistency, latency and availability of a database for the different storage areas. One observation is that analytical workloads are read-heavy, often with periodic updates. This is at least true for the analysis of data at rest, the upper part of the reference architecture diagram in Figure 4.3.

15 An extensive discussion of different distribution and replication models would go beyond the scope here. A further discussion of different models and how they impact consistency, availability and latency can be found in Abadi [37], Fowler and Sadalage [113, pp. 37-45,57-59] or Silberschatz et al. [199, pp. 839-853]


Strong consistency guarantees are therefore not as important for the Analysis Data Stores, because write/write conflicts practically do not occur in those regular batch jobs. With large batch jobs to update the Analysis Data Stores, write latency is also less of an issue; write throughput is however more important, especially if the window for running the batch job is rather small. Read latency is however highly important for Analysis Data Stores that serve interactive analysis tasks. The same goes for availability. Scalability is definitely an issue, and that goes for both being scalable with data volume and with read workload. Sharding (for scaling with data volume and intra-query parallelization) and replication (for availability and read workload balancing) are therefore both important. Similar observations apply when looking at the Enterprise Data Warehouse and the Raw Data Archive. While a consistent state is very important for the Enterprise Data Warehouse, this mostly concerns logical consistency, that is, the database should enforce constraints in the data model. The Raw Data Archive does not need this level of logical consistency and constraint enforcement, considering that it stores raw, un-cleaned and only partly integrated data. Write/write conflicts are again improbable due to the periodic loading through batch jobs. This is even more true if the Enterprise Data Warehouse is implemented around the ‘immutable data’ and ‘append only’ principle as described in Section 4.2.2. Read and write latency are both not that important, again because neither read nor write operations are conducted interactively at a larger scale. Read and write throughput, on the other hand, do matter, especially if the loading time window is small. Availability is important. The situation is different for the analysis of streaming data. With data streaming in with a high frequency of rather small write operations, write/write and also read/write conflicts are a lot more likely. Additionally, a high frequency of rather small write operations puts more focus onto write latency compared to writes in large batch jobs, which merely require write throughput. This is also important, as streaming data should be available for analysis in the Streaming Staging Area and the Analysis Delta Stores as soon as possible. Read latency is also important for Analysis Delta Stores, but also for the Stream Staging Area if they can be queried interactively. Availability for reads and writes is important, with read availability being even more important than write availability. The premise is that analysis results should be accessible and acted on quickly. If the data is not available, this mainly results in a waiting time until network or node failures are resolved and can therefore be seen as a very high latency impact. Consistency, on the other hand, can be slightly relaxed. As described before, the idea is to make data accessible quickly with the premise to clean and integrate it less thoroughly compared to the processing of data at rest. Some (logical) inconsistencies can therefore occur anyway and are to be expected.

4.3.2 Relational Databases

The databases that are traditionally most widely used are those that follow the relational model. Data is modelled using tables (relations), which can be seen as representing a type of objects (e.g. customers). Each column in the table refers to an attribute that is applicable to objects of that type. Each row (tuple) refers to a distinct object of that type (e.g. a certain customer) and describes the relationship of the attribute values of that object. Tuples of different tables can reference each other based on attribute relationships. The relational model allows defining constraints on both relationships of attributes within one tuple and relationships between tuples of different tables. The database system enforces that these constraints are not violated. The data is accessed using a relational query language, almost always SQL. [199, pp. 39-55] Relational databases therefore require data to be modelled and parsed into the relational model while it is loaded into the database. This greatly reduces the flexibility of data that can be loaded and

requires the schema of the data to be known in advance. High record-to-record variability can either break the schema or lead to very sparse tables. On the other hand, it increases query flexibility, as the data is modelled in a general way and not with certain queries or applications in mind. Furthermore, the need for parsing the data before loading it into the database and for constraint checking leads to an increased loading time [161, 176], again in exchange for faster querying and application performance, as data fields do not need to be parsed at application runtime. This also means that errors in the data (e.g. attribute type or constraint violations) pop up when loading the data, while an application on top of the database can be sure that the data is consistent with those constraints and violations do not pop up at application runtime. Additionally, the use of a high-level, declarative language like SQL provides several advantages. High-level languages normally require fewer lines of code to solve a problem, and they are typically also easier to read and to learn by non-programmers. The latter is especially important as many data analysts use SQL to interact with data stores, but have no extensive programming background. Additionally, SQL query optimizers benefit from years of extensive research effort and are very hard for programmers to beat, and only with high effort. Using ODBC or JDBC to interact with the database and run SQL queries can have a significant overhead, but this overhead can often be avoided by using stored procedures [208].
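As a minimal illustration of schema and constraint enforcement at load time, the following Python sketch uses the built-in sqlite3 module; the table and column names are invented for this example and are not part of the reference architecture.

```python
# Minimal sketch: constraints are checked when data is loaded, not at
# application runtime. Table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""CREATE TABLE customer (
                    customer_id INTEGER PRIMARY KEY,
                    name        TEXT NOT NULL)""")
conn.execute("""CREATE TABLE orders (
                    order_id    INTEGER PRIMARY KEY,
                    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
                    amount      REAL CHECK (amount >= 0))""")

conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.90)")

try:
    # Violates the foreign key constraint: customer 42 does not exist.
    conn.execute("INSERT INTO orders VALUES (11, 42, 12.50)")
except sqlite3.IntegrityError as err:
    print("rejected at load time:", err)
```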

Massively Parallel Processing Databases: Lately, the term Massively Parallel Processing (MPP) databases came up to describe databases based on a distributed architecture for high scalability. The idea and technology is however not new. Typically, these databases use a ‘shared nothing’ architecture, where each node has its own CPU(s), memory and disk. From a scalability perspective ‘shared nothing’ is superior to ‘shared memory’ or ‘shared disk’, as for the latter two the interconnection bus to either the memory or the disk becomes a bottleneck hurting scalability [62, 99, 205, 208][199, pp. 781-784]. The data is then distributed over these nodes. Essentially, there are two forms of distribution. First, different tables are distributed across nodes. Second, tables are automatically partitioned (sharded) and the tuples of one table are distributed over nodes. Scalability is then provided due to parallelism [62, 208]. Inter-query parallelism, that is different queries hitting different nodes, balances workloads and allows for scalability concerning the workload as well as for an increased query/transaction throughput. Intra-query parallelism, that is one query touching data on different nodes with each touched node doing parts of the work, speeds up query runtimes [199, pp. 802-815]. Query parallelism depends, however, also on the partitioning technique. Hash partitioning, where a hash function distributes the tuples based on the value of some attribute, is good for point queries on the partitioning attribute. Range partitioning, where data is distributed depending on some attribute value being within a certain range interval, is also good for smaller range queries. The risk of range partitioning is, however, that attribute values are concentrated within a certain range, which hurts workload balance between the nodes [199, pp. 798-802]. Additionally, data partitions are typically not only stored on a single node, but replicated over several. This is for availability and failure-handling reasons if one node goes down. Additionally it can provide even more inter-query parallelism, as queries that read the same data can be run in parallel on different nodes [62]. For the user, however, partitioning and replication over multiple machines is transparent [176]. Relational MPP databases typically support full ACID properties. This can hurt scalability, especially for write-intensive workloads, as distributed transactions and locking add overhead and are expensive. This is one of the reasons why these systems can scale to tens or hundreds of nodes, but might hit a limit at thousands of nodes [36]. An exception to some extent are data warehouse applications and analytical databases. These are very read-heavy with only periodical writes. This makes consistency and transactional conflicts less of an issue and allows for architectures that are even more scalable. Several commercial databases, e.g. Teradata, Greenplum, Netezza or Vertica,

do this successfully and report that they are able to handle multi-petabyte databases [78, 161]. Note however, that the latter are declarations of the respective database vendors and should therefore be taken with a grain of salt.
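The two partitioning schemes mentioned above can be sketched in a few lines of Python; the node count, range boundaries and helper names below are invented for illustration.

```python
# Minimal sketch: mapping tuples to nodes of a 'shared nothing' cluster.
NODES = 4

def hash_partition(key: str) -> int:
    """Hash partitioning: good for point queries on the partitioning attribute."""
    return hash(key) % NODES

def range_partition(value: int, boundaries=(100, 200, 300)) -> int:
    """Range partitioning: good for range queries, but skew can unbalance nodes."""
    for node, upper in enumerate(boundaries):
        if value < upper:
            return node
    return len(boundaries)

print(hash_partition("customer-4711"))  # node chosen by the hash of the key
print(range_partition(142))             # node 1, since 100 <= 142 < 200
```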

Column-wise Storage: Another concept, mainly used in analytical databases, is the idea of column-wise storage. The idea is rather simple. While traditional, relational databases physically store data row by row (one complete tuple after another), column-wise databases physically store data column by column. Assume a, b, c refer to rows, while 1, 2, 3 refer to attributes. A simplified traditional database system would store tuples using the structure (a1,a2,a3; b1,b2,b3; c1,c2,c3), while a column-wise database system would store (a1,b1,c1; a2,b2,c2; a3,b3,c3). Note however, that this only affects physical storage. Logically both implement the relational model. [38, 39, 206, 208] Both storage schemes have advantages and disadvantages. First, column-wise storage achieves higher compression rates, as most compression schemes are based on homogeneity within the data. Typically, values within one attribute (that is, within one column) are more homogeneous. Compression also has the side-effect that more data can be stored in main memory, with less copying between disk and memory and also better CPU cache utilization. Second, consider the star schema in data warehouse applications. Typically, fact tables are very broad with many different key figures stored as attributes. Queries however often only touch few of these key figures, but aggregate them over many rows. These queries are faster on column-wise storage, as they only need to read the attribute values of the included attributes, while they need to read the complete rows on row-oriented storage. Third, column-wise storage is more flexible in cases where the data model needs to be changed by adding columns to a table. [38, 39, 206, 208] On the other hand, column-wise databases gain this advantage by trading in write performance. In a row-oriented storage scheme a new tuple can just be added at the end. With column-wise storage a tuple needs to be broken down to attribute values, and these cannot just be added at the end, but need to be inserted in several places. [38, 39, 206, 208] Considering the reference architecture, the advantages for read performance make column-wise databases a very good fit for Analysis Data Stores. The disadvantage considering write performance is no problem here. Analysis Delta Stores are another story. While they would definitely benefit from increased read performance for analytical queries, they would also suffer from the decreased write performance. Some examples of column-wise databases are C-Store, MonetDB, SybaseIQ, Vertica, Infobright and Paraccel.
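The row-wise versus column-wise layouts from the a/b/c example above can be illustrated with a minimal Python sketch; the concrete values are invented.

```python
# Minimal sketch of row-oriented vs. column-oriented physical layout.
rows = {"a": (1, 2, 3), "b": (4, 5, 6), "c": (7, 8, 9)}  # invented values

# Row-oriented: complete tuples, one after another (a1,a2,a3; b1,b2,b3; ...).
row_store = [value for row in rows.values() for value in row]

# Column-oriented: one attribute after another (a1,b1,c1; a2,b2,c2; ...).
column_store = [row[i] for i in range(3) for row in rows.values()]

# Aggregating a single attribute only scans one contiguous slice of the
# column store instead of touching every complete row.
attribute_2 = column_store[3:6]
print(sum(attribute_2))  # 2 + 5 + 8 = 15
```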

In-memory: Another technology that gets attention lately is in-memory databases [158, 159]. They can be significantly faster than disk-based databases, not only because they hold all data in memory, but because they are designed around the assumption that they do so. While a disk-oriented database would also cache all data in memory as long as memory is bigger than the data volume, it still has to handle the overhead of synchronizing data in memory with data on disk. In-memory databases often get rid of memory buffer management, latching and logging. If they are properly architected this way, they can use main memory more efficiently compared to disk-oriented databases. Note however, that data is still flushed to disk from time to time for availability reasons. Main memory is after all volatile, and restarting a machine would lead to data loss. [62, 208] The obvious advantage of memory-based database systems is performance, both for reads and writes. The disadvantage is cost. Even if proponents of in-memory databases claim that main memory gets cheaper and cheaper and that it therefore gets feasible to hold all data in memory, disk is still and will probably always be more cost-efficient. This is even more true in cases where a company’s data

volume grows faster than memory actually gets cheaper. In this case it is just infeasible to store all data in memory. Considering the reference architecture, the storage areas that make most use of in-memory technology are those that profit most from the performance improvement. These are all storage areas that can be queried interactively by end-users and analytical applications. In-memory technology can also be used by storage areas that are queried by deep analytics batch jobs, as those jobs still benefit from eliminating I/O waiting times for loading data from disk. However, considering the batch nature, it might make less of an impact compared to interactive analytics. On the other hand, it might enable running applications that were only feasible in batch mode before in an interactive manner. Still, considering this, in-memory databases are an ideal choice for Analysis Data Stores and especially Analysis Delta Stores. The latter even more so, as they do not only benefit from increased read performance, but also from increased write performance due to high-frequency writes from streaming data. Both of them also only hold parts of the data, which makes it easier and less costly to fit all data into memory compared to the large volumes of more detailed data in the Enterprise Data Warehouse and especially the Raw Data Archive. This, again, is even more true for Analysis Delta Stores, as they only hold data for a certain interval until the next ETL run for data at rest. Note also, that in-memory techniques are of course not limited to relational database systems. They are just listed here, as they are recently more and more applied to those, e.g. to the ‘NewSQL’ systems described below.

In-database Analytics: Another idea is to move computation and application tasks to the data, instead of exporting data from the database to an application and conducting application logic there. Again, this is not a concept solely applicable to relational database systems. The idea is also inherent in Hadoop’s architecture as described below and in other systems. However, it gets more and more applied to relational systems, shaping the term ‘Extreme SQL’. Pavlo et al. [176] e.g. state that almost every parallel processing task can be formulated as a set of SQL queries in combination with user defined functions. The idea is to automatically use the parallelism of distributed databases and to profit from the database system’s query optimizer. Additionally, query outputs are typically smaller than the query inputs. Processing application logic close to the data and only responding with the results therefore means less network traffic and decreased network latency. Based on this idea several efforts were made to implement data mining, machine learning and statistical techniques either directly within a database or based on SQL [161]. One of the efforts to do this in SQL is the MADlib project. It aims at implementing statistical methods using SQL scripts and user defined functions [87, 131]. Considering the use case for the reference architecture, in-database analytics obviously makes sense to be included in Analysis Data Stores and Analysis Delta Stores. Another instantiation of this idea is the inclusion of the MapReduce programming model16 also into relational databases. One example is Greenplum. Note however, that MapReduce systems typically require some overhead to start a job and are therefore mainly suitable for batch-oriented and less for interactive tasks.
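A minimal sketch of the idea, again using Python’s built-in sqlite3 (table and function names are invented): the aggregation and a user defined function run inside the database, and only the small result set crosses the database/application boundary.

```python
# Minimal sketch: push a user defined function and an aggregation into the
# database instead of shipping all rows to the application.
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)",
                 [("s1", 2.0), ("s1", 8.0), ("s2", 3.0)])

# Register a user defined function that the database calls per row.
conn.create_function("log_value", 1, math.log)

# Only the aggregated result is returned to the application.
for row in conn.execute(
        "SELECT sensor, AVG(log_value(value)) FROM measurements GROUP BY sensor"):
    print(row)
```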

‘NewSQL’: So-called ‘NewSQL’ databases combine several of the technologies described above. Their development was driven by the idea of ‘one size does not fit all’ and the goal of defining special-purpose databases for write-heavy, OLTP-like workloads with high-frequency, but rather small write operations. These efforts aim at building OLTP systems with high scalability, but without giving up full ACID compliance and transactions. They also strive to provide better per-node performance compared to some systems that focus merely on scalability [78, 129, 210]. However, these systems largely favour small-scope transactions over few nodes. Transactions and joins over a large number of

16A broader discussion of the MapReduce paradigm is described below when discussing Hadoop


nodes are still possible, however these are less scalable and efficient and therefore suffer a performance penalty. In other words, these databases support full ACID compliance, while only suffering from corresponding scalability impacts if transactions span multiple nodes [78]. Considering the characteristics of the workload, with highly frequent, but rather small requests, this seems to be an appropriate choice for specialized OLTP systems. Contrary to analytical systems, large, node-spanning joins are rather rare. To achieve high node performance and scalability most of these systems are completely in-memory. They additionally get rid of logging, locking, latching of shared data structures and buffer management. Harizopoulos et al. [129], Stonebraker and Cattell [208] claim that this overhead accounts for almost 80% of typical transaction runtime. Logs need to be written to disk to be durable if a node fails. Instead, ‘NewSQL’ systems do node fail-over and recovery from data replicas on other nodes. This is also necessary, as a purely in-memory design would just not allow writing logs to disk. The buffer pool is completely avoided by the database being solely in-memory. Locking is avoided by using techniques like the two-phase commit protocol for distributed, node-spanning transactions (which notably decreases scalability in those cases, as mentioned above), ordering through timestamps and multi-version concurrency control. Finally, they avoid latching by using improved B-trees for indexing and by often being single-threaded per CPU core, with main memory being divided into buckets, one for each CPU core. [78, 207, 208] Considering the reference architecture, the optimization for OLTP and high-frequency, low-volume transactions places ‘NewSQL’ databases as an option for the stream acquisition components. However, due to the performance penalty for distributed joins they might not be an ideal choice for Analysis Delta Stores, but they seem very suitable for the Stream Staging Area. Both the support of ACID-like consistency and the scalability and performance for high-frequency write operations make them a good fit. However, data needs to conform to the relational schema. If the model of the data items in the stream is not known in advance or is highly variable, this might lead to problems. Some examples of ‘NewSQL’ systems are VoltDB, MySQL Cluster, ScaleDB, Clustrix, ScaleBase, NimbusDB, Google Megastore, MemSQL, NuoDB, SolidDB and TimesTen [78].

4.3.3 ‘NoSQL’ Databases

‘NoSQL’ databases have received quite some hype lately. The term is, however, ill-defined, with the databases labelled that way being very different in their architecture. The biggest similarity is probably that none of them is built on the relational model, though the incorporated data models can be rather different. A classification of these data models and a selection of databases conforming to the respective model can be found below. ‘NoSQL’ databases also do not support SQL, but often proprietary query languages, some declarative, some not. This is notable, as most problems they try to solve are not caused by SQL as a high-level language, but by the relational model itself or traditional database architecture. It is also notable that some of them actually added high-level languages that are similar to SQL. While ‘NoSQL’ databases are often categorized and discussed according to the data model they support (key-value stores, document stores, column-family stores, graph databases17 [70, 126][113, pp. 20-23,26-28][187, pp. 3-7]), there are still major differences between single databases within these categories. This refers to the architecture, replication and distribution models they use and therefore the trade-offs they make considering consistency, latency and availability [134]. These differences make it difficult to discuss ‘NoSQL’ databases in a general way. Take the following discussion therefore with a grain of salt; many points are valid for a lot of these systems, but not necessarily for all.

17For a discussion of the different data models see below


One goal ‘NoSQL’ databases are designed for is to simplify application development on top of them and to improve programmer productivity. First, they aim at resolving the impedance mismatch that results from the difference between data structures within applications and the relational model. They avoid the translation between both by allowing application objects to be stored directly within the database, e.g. as binary large objects (BLOBs) or using nested data objects. Many ‘NoSQL’ databases e.g. use JSON documents for internal storage. JSON can be easily imported and serialized into objects or data structures in several programming languages [113, pp. 5-6]. Additionally, they allow for a flexible schema. Data can be added to BLOBs or new attributes added to JSON documents without the need to change the data model of the database [78, 126]. This is often described as ‘schemaless’ [70]. However, data is rarely completely schemaless. An implicit schema is typically inherent in the data and just hidden within a BLOB as it contains different attribute values, e.g. separated by commas. It is then on the application to assume a certain structure within the data, impose this schema at application runtime and extract data out of a BLOB. [113, pp. 28-30] ‘Schema-at-read’ is therefore typically a better term to describe this concept. Another goal of ‘NoSQL’ databases is to provide a high level of scalability [70][113, pp. 8-9]. They are often based on a ‘shared nothing’ architecture with automatic sharding of data, distribution of these shards across all nodes and automatic balancing of shards across nodes as new records or nodes are added to the system. Additionally, data is typically replicated for availability and workload balancing [78, 112]. As indicated above, many of them also relax consistency guarantees to achieve this goal. They differ in how much consistency they give up, but many implement eventual consistency as described above [44, 70, 126]. Very few of them support ACID compliance and transactions across objects18. Often, they allow neither distributed joins (or even no joins at all) nor complex queries, but only provide a rather simple CRUD interface [78, 134]. Instead, many of them rely on data models that Fowler and Sadalage [113, pp. 13-24] call ‘aggregate oriented’. The idea here is that, while only simple operations on objects are allowed, these objects can have a rather complex, often implicit structure internally. An object can include several nested sub-objects and attributes. In this sense an object, or aggregate, groups together sub-objects that are related and are likely to be accessed together. This is different from the relational model, where these sub-objects would typically be separated over several tables with foreign key relationships to reference each other. An example is to group a customer’s orders within the customer object. This makes it easy for applications that need to access all orders of a particular customer, but it makes it difficult to e.g. query all orders that include a certain product. Aggregate-oriented databases are therefore modelled with a focus on a certain application and give up query flexibility [134]. Aggregate-oriented databases are therefore a poor fit if the data is based on a lot of relationships and operations across aggregates are necessary. On the other hand, as mentioned above, aggregates can often be imported more easily into the respective applications, as they can just be serialized as a data structure or object in many programming languages.
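The customer/orders example above can be sketched in a few lines of Python (field names invented), showing both the convenience of an aggregate and its main limitation:

```python
# Minimal sketch of an aggregate-oriented customer document with nested orders.
import json

customer = {
    "customer_id": 4711,
    "name": "Alice",
    "orders": [  # related sub-objects that are typically accessed together
        {"order_id": 1, "items": ["book", "pen"], "total": 21.40},
        {"order_id": 2, "items": ["lamp"], "total": 34.99},
    ],
}

# The whole aggregate is (de)serialized as one unit, no joins needed to fetch
# a customer together with all orders ...
blob = json.dumps(customer)

# ... but a query such as 'all orders that include a certain product' now has
# to scan the aggregates in the application layer.
orders_with_lamp = [o for o in json.loads(blob)["orders"] if "lamp" in o["items"]]
print(orders_with_lamp)
```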
Additionally, aggregates are often a good unit for sharding and replication, with data that is typically accessed together being stored on the same node. This can improve scalability and latency, as it avoids distributed joins (or joins at all), which are not possible anyway. As indicated above, while not offering transactions and ACID compliance across aggregates or objects, many ‘NoSQL’ databases allow ACID transactions within one aggregate. [113, pp. 13-24] With this approach, consistency handling and joins or more complex queries are merely pushed to the application level. With that in mind, the scalability gained should be taken with a grain of salt, as the scalability issues are just moved to the application layer if those things are needed. One can think about this as a turning away from a classical 3-tier architecture, with tasks of the database tier

18One exception of a ‘NoSQL’ database that does so is RavenDB [134]

being moved to the application tier. The database tier still conducts file operations and management, but higher-level tasks (as mentioned, processing more complex queries, transaction processing and sometimes secondary indexes to access non-key attributes) are moved to the application tier. Often, this makes the problem more difficult there. As mentioned when discussing relational databases, it is hard for programmers and takes a lot of effort to equal or beat well-implemented query optimizers. While it can provide better scalability for the database tier, scalability in the application tier can suffer. Additionally, for large analytical queries the network between application and database can become a bottleneck, as input data for those queries is typically larger than output data. Therefore, aggregate-oriented databases, if implemented this way, are more suitable for high-frequency, low-volume transactions with few inter-object relationships. In these cases they work well. [148] One option several of those databases offer to work around the lack of joins is the use of materialized views. These are often based on the MapReduce programming model integrated into the database16, e.g. in MongoDB [70]. Joins and queries across aggregates can then be pre-calculated, stored and are therefore accessible by applications. While this helps to some extent, interactive querying is still prohibitive, as materialized views need to be defined in advance and their calculation is typically done in batch mode. The definition of materialized views is therefore still focussed on expected queries and query flexibility is still limited. [113, pp. 30-31,67-78] There is however one big exception to the points discussed above that should be noted. Graph databases are often mentioned as one category of ‘NoSQL’ databases, though they are fundamentally different from aggregate-oriented databases. First, they support a data model of small records and complex interactions. Relationships are supported as first-class citizens and graph databases are therefore ideal for applications heavily based on relationships between data items and on traversing these relationships. While they also provide a flexible schema and loading of variable and complex data, they additionally provide high query flexibility across relationships. They often also provide stronger consistency guarantees compared to other ‘NoSQL’ databases, sometimes full ACID compliance, but their data model is very difficult to shard or partition. Therefore they suffer in scalability and most graph databases are actually deployed on a single server. A deeper discussion of graph databases and other ‘NoSQL’ data models can be found below. [113, pp. 25-26][187, p. 6][190, pp. 3-7,25-39]

Key-Value Stores: The most basic data model among ‘NoSQL’ systems is provided by the category of so-called key-value stores. They are aggregate oriented, so the discussion above applies. Aggregates are accessed by a key and consist of a payload (the value) that is completely opaque to the database system. For accessing data, typically a simple interface is provided, mainly allowing create, read and delete operations. Updates are often not possible, as the value is opaque to the database. In those cases, old data is simply overwritten or both values are stored using a versioning scheme [78]. [187, p. 4] [113, pp. 20,81-88] Internally the value can be almost everything: a BLOB, text, JSON, XML. As it is opaque to the system, the system cannot enforce any constraints on the data. While this provides high schema flexibility and allows unstructured and semi-structured data to be stored easily, it also means that it is completely the responsibility of the application to parse and interpret the data. This also means that data is only accessible by primary keys. There is no possibility to select data based on some other attribute. Again, this ensures read performance and scalability, but renders query flexibility practically non-existent. In cases where data needs to be selected on some attribute within the aggregate, with key-value stores all data would need to be loaded to the application and selected there. This is clearly not feasible and one big issue limiting the applicability of key-value stores to only a few situations. [187, p. 4] [113, pp. 20,81-88]


Considering consistency, typically only operations on a single key are atomic. Transactions across keys are not supported and most key-value stores implement eventual consistency; however, they are often tunable (e.g. by using quorums to trade off read against write performance). Due to the relaxation of consistency, most key-value stores instead provide high availability for both reads and writes. If write conflicts occur, most of them provide the option to either let the newest write (based on a timestamp) win or to store both key-value pairs and then return both to the client application to be resolved there. Sharding and replication is done over the key, often using hash partitioning. Range queries can therefore also be tricky, focussing the use case onto single-key accesses. [187, p. 4] [113, pp. 20,81-88] Additionally, some key-value stores are strictly in-memory and only periodically push snapshots to disk, e.g. Redis or Memcached. This provides another use case, with those databases used as simple caching servers or even as a cache integrated within an application and running on the same machine. Considering the reference architecture, the latter can be a valuable use case to cache data within analytical applications. Apart from that, key-value stores are hardly applicable for any of the storage areas due to their opaqueness, which limits data management capabilities. Due to their write performance and as they implement the append-only principle, they could be used as a staging area for unstructured and even semi-structured data. Another possibility is to use key-value stores as one part of the Raw Data Archive. They can be used to store unstructured data there, especially text, and can be combined with a more structured database which references single key-value pairs. Additionally, by indexing the value into the Search Index before it is stored, opaqueness can be worked around, as the Search Index can be used to search through the data, e.g. using a full text search. Some examples of key-value stores are Redis [70, 78][187, pp. 261-306], Riak [78, 134][187, pp. 51-92], Tokyo Cabinet [78, 134], Voldemort [78, 134], DynamoDB [134] and Memcached [78].
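A minimal Python sketch of such a narrow interface (class and method names invented) illustrates both the simplicity and the opaqueness of the value:

```python
# Minimal sketch of a key-value interface; values are opaque byte strings.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # No in-place updates: an existing value is simply overwritten.
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", b'{"user": "alice", "cart": ["lamp"]}')
print(store.get("session:42"))
# Selecting by an attribute inside the value is impossible without loading
# and parsing every value in the application layer.
```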

Document-Oriented Databases: Similar to key-value stores, document-oriented databases (also called document stores) are aggregate oriented, so again the general discussion above applies. Aggregates (called documents) are also associated with a key. However, the difference is that documents are not opaque to the database. They typically use a self-describing, hierarchical tree structure. Most of them use JSON, some use XML and some use a proprietary format, e.g. BSON (binary JSON) in the case of MongoDB, which is a binary version of JSON optimized for storage space and scan speed. Documents are then grouped in collections. One can imagine a collection as a table in the relational model and a document as a row. There could e.g. be collections for customers or products, with each single customer’s or product’s data being stored in a document. The main difference is that documents have a flexible and more complex structure. Flexible in the sense that documents within a collection do not need to adhere to each other concerning their internal structure and attributes, and the database system does not enforce a schema or constraints. This makes it easy to add attributes to a single object without changing a schema first. ‘Empty’ attributes can simply be left out, which saves storage space in situations of very sparse data. Complex means that documents can include nested attributes, lists of attributes and even sub-documents. As described above in the discussion about aggregate orientation, a document holding customer data can e.g. hold several orders of this customer. This makes document stores a good fit to hold semi-structured data, especially considering that data sources (e.g. web data sources, web services, APIs) often provide data directly as JSON documents. However, most document stores do not allow documents to include pointers to other documents, at least not in a way that the database directly uses them, e.g. for joins. [78][187, pp. 5-6,309-310][113, pp. 20,89-98] As noted above, the internal structure of documents is not opaque to the database system. Besides basic CRUD operations, all of them therefore allow querying and selecting documents based on their attributes. They also allow for secondary indexes over document attributes to speed up those queries.


To formulate queries, most document stores use proprietary query languages; some of them use JavaScript or a proprietary language based on JavaScript. Additionally some, e.g. CouchDB, integrate MapReduce16 to compute complex queries with document-crossing joins into views, which can themselves be queried. The views can either be materialized in advance or completely computed at query runtime. Note however, that the latter case can be a latency issue, especially as MapReduce computations are typically batch-oriented with some start-up overhead and high latency. With that, document stores provide more query flexibility than key-value stores, but applications are dependent on the data being modelled according to their needs, as joins across documents are mostly not possible. [78][187, pp. 5-6,309-310][113, pp. 20,89-98]
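Attribute-based selection and the role of a secondary index can be sketched generically in Python; the collection layout and field names are invented and not tied to any particular document store.

```python
# Minimal sketch: selecting documents by attribute, with a secondary index
# to avoid scanning the whole collection.
from collections import defaultdict

orders = {  # document key -> JSON-like document
    "o1": {"customer": "alice", "total": 21.40, "items": ["book", "pen"]},
    "o2": {"customer": "bob", "total": 34.99},  # flexible schema: no 'items'
}

# Secondary index over the 'customer' attribute.
by_customer = defaultdict(list)
for key, doc in orders.items():
    by_customer[doc["customer"]].append(key)

# Query: all orders of customer 'alice', answered via the index.
print([orders[k] for k in by_customer["alice"]])
```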

Considering consistency, most document stores fit into the general impression of ‘NoSQL’ databases as discussed above. The distribution models differ; almost all provide sharding, while often using a master-slave or quorum-based replication model. Typically they relax consistency to achieve high availability and better scalability, and mostly provide eventual consistency. But some are tunable, e.g. based on quorums. Transactions are atomic within one document, which makes consistency, similar to query flexibility, dependent to some extent on the data being modelled according to the application’s needs. Documents provide a unit for sharding and provide good read and write scalability, assuming the application’s needs fit how the data is modelled within documents. Read scalability and possibly write scalability (depending on the number of writeable replicas or on quorum settings) is additionally improved by replication, though there is the possibility that reads from replicas are stale due to the eventual consistency model. [78][187, pp. 5-6,309-310][113, pp. 20,89-98]

Considering the reference architecture, document stores make little sense for analysis-oriented storage areas, as query possibilities and performance largely depend on the model of the aggregates and are not flexible enough for ad-hoc queries. They can however be an option for staging semi-structured data and, due to their possible write performance and scalability, also for the Stream Staging Area, e.g. to cache semi-structured event streams or data from web logs. Keep, however, the consistency model in mind. Depending on how much a certain document store actually relaxes consistency, write/write conflicts might occur, or data might get lost, e.g. if the master node crashes in a master-slave model before data is propagated to any slave. They are therefore ideal for streaming workloads that are append-only, which is typically the case for event logs. Additionally, they might be an option to store semi-structured data in the Raw Data Archive. Considering that some integrate the MapReduce programming model, some deep analytics tasks on the Raw Data Archive could then directly be implemented within the database. Some examples of document stores are MongoDB [78][187, pp. 135-176], CouchDB [78][187, pp. 177-218], RavenDB [134], SimpleDB [78] and Terrastore [78].

Column Family Stores: Column-family stores are databases modelled after Google’s BigTable [80] and Amazon’s Dynamo or DynamoDB. Discussing them in general is however a bit tricky. While in most other ‘NoSQL’ categories the data model of the included systems is very similar or even identical, with only the implementation as well as the distribution and replication models differing, the data models of column-family stores can differ slightly. This depends mainly on whether they are implemented based on Google’s BigTable paper or based on Amazon’s Dynamo and DynamoDB. Both of them share the common concepts of rows, column families and columns, but differ in how they nest rows and column families.

The general concept of both is that a table is implemented as a multi-dimensional map of rows and columns, with columns that are related to each other or commonly accessed together grouped into column families. While the terminology sounds very similar to the relational model, the main difference is that columns are not prescribed by a schema. Most of the time column families do however need to be registered with the database, but this does not prescribe which concrete columns are grouped into

a column family. Columns can therefore vary between rows. Sparse data, that is attributes without a value, does therefore not incur storage cost. Some column family stores allow however to register common columns as metadata about a column family without enforcing them. This is e.g. necessary if secondary indexes need to be defined over these columns. [78][113, pp. 21-22,99-110][187, pp. 5,309] Columns consist of a column key and a value, and values are accessed using this column key together with the respective row key, similar to accessing a two-dimensional map. Values are typically uninterpreted strings or BLOBs, but some column family stores also support basic data types, e.g. integers or dates, and even lists or sets. Remember that they are not necessarily registered with a schema to prescribe a data type. Additionally, in many cases column values within a row are timestamped. A column within a row can therefore hold several values with different timestamps. These are e.g. used to resolve write conflicts and identify stale data, but also to automatically expire data. If a row/column combination which has several timestamped values is accessed, the response typically just includes the latest value. [78][113, pp. 21-22,99-110][187, pp. 5,309] Some column-family stores additionally know the concept of super columns. A super column is then itself a map of columns. With this concept columns can be nested, essentially increasing the dimensionality of the map. Most databases however only allow one level of nesting, that is, super columns can only contain normal columns, but not super columns themselves. Additionally, columns within a super column cannot be accessed at a fine granularity, but only the super column as a whole. [113, pp. 21-23,99-110] The difference between the models mentioned above is that in databases modelled after Google’s BigTable (e.g. Apache HBase) column families are nested inside rows, typically in the way that they are merely used as prefixes to column names to indicate which columns are grouped together. A table therefore has several rows, and a row has several columns, which are grouped into column families via name prefixes. Again, while column families need to be registered in a schema and each row of a table uses the same column families, the columns within these column families differ from row to row. [187, pp. 5,93-105] In databases modelled after Amazon’s Dynamo and DynamoDB the nesting is the other way around, with rows being nested inside column families. A column family therefore has several rows, which again have several columns. Data is always accessed against a certain column family with a row and a column key. In this sense column families are therefore comparable to tables in the relational model, and most systems with this data model do not support an additional table structure around column families. Databases using this second column family model are however those that typically support super columns as described above. These are less used in the first model. [113, pp. 21-22,99-110] As different rows and column families can contain differing columns, both models provide schema flexibility and allow storing data with variable attributes. They, however, impose more structure onto the data than document stores and allow for less complexity of objects, as nesting is limited. As mentioned above, data can be accessed via a combination of row and column key.
It is also possible not to specify a column key, but to retrieve an entire column family. Additionally, it is often possible to define secondary indexes over columns, so data can be retrieved by querying column values. Note however, that secondary indexes are typically not allowed for columns within a super column. To formulate queries, most column family stores implement some proprietary query language, e.g. CQL in the case of Apache Cassandra. Query flexibility is however limited, because column family stores normally do not allow joins and the discussion about aggregate-oriented databases applies: they need to be modelled based on application needs. It would e.g. be possible to model customers as rows with a column family for orders and each order being modelled as a column within this column family. The data about that order is then simply stored as a BLOB value of these columns. This makes

sense if an application commonly accesses orders grouped by customers. If an application typically accesses single orders, it would make more sense to model them as rows. [113, pp. 21-22,99-110][187, pp. 5,309]
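The multi-dimensional map behind this data model can be sketched in Python as nested dictionaries with timestamped column values; all row, family and column names below are invented.

```python
# Minimal sketch: row key -> column family -> column key -> [(timestamp, value)].
import time

table = {
    "customer:4711": {
        "profile": {                       # column family registered up front
            "name": [(1700000000, b"Alice")],
        },
        "orders": {                        # columns vary from row to row
            "order:1": [(1700000100, b"{...order payload...}")],
        },
    },
}

def put(row, family, column, value):
    cell = table.setdefault(row, {}).setdefault(family, {}).setdefault(column, [])
    cell.append((time.time(), value))      # several timestamped versions per column

def get(row, family, column):
    versions = table[row][family][column]
    return max(versions)[1]                # a read returns the latest value

put("customer:4711", "orders", "order:2", b"{...}")
print(get("customer:4711", "profile", "name"))  # b'Alice'
```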

Most column family stores provide good scalability and read/write performance by using partitioning across both dimensions, rows and column families. Rows are typically sharded on the primary key using range partitioning, with rows themselves being partitioned into column families and sharded across nodes. The unit of distribution is therefore the combination of row and column family. This makes sense considering that column families group columns that are typically used together. From a modelling viewpoint it can additionally make sense to use a naming scheme for row keys, so that related rows are within the same range and therefore stored on the same nodes. Otherwise, rows are just ordered alphabetically over the keys. Additionally, most of them use replication and typically relax consistency to eventual consistency to gain scalability and availability when facing network partitions. This can often be tuned, e.g. in Cassandra with the consistency settings ONE (standard mode, read from one replica even if data is stale, write to one replica), QUORUM (the majority of replicas needs to respond to reads and writes) and ALL (all replicas need to respond to reads and writes). Transactions are often atomic on the row level, but not across rows. Additionally, many column family stores buffer writes in commit logs in memory and only periodically flush them to disk. This can lead to data loss if a node goes down before updates have been written to disk, especially if writes can be submitted to a single node. On the other hand, this largely increases write performance. [78][113, pp. 21-22,99-110][187, pp. 5,309]

Considering the reference architecture, it makes little sense to use column family stores for analysis-oriented storage areas. They do not allow the necessary query flexibility, as the data modelling needs to be optimized for expected queries, and they do not provide joins for more complex queries. They are often reported to be a good fit for read- and write-heavy transactional systems with a somewhat variable and flexible data model or sparse data. Examples are event logging, content management systems or some other web applications [113, pp. 107-108]; Facebook e.g. uses HBase for its messaging solution. None of these use cases really apply to the reference architecture. They could be used for the Stream Staging Area to stage event streams and maybe even for storing event data in the Raw Data Archive, as their data model fits event data well. Note however, that it is possible that data can be lost if a node fails before data updates are flushed to disk. Therefore they should not be used to stage highly relevant and important data. Some column family stores are e.g. Apache Cassandra [9, 78][113, pp. 99-110], Apache HBase (which is also part of the Hadoop Ecosystem) [14, 78][187, pp. 93-133] and Hypertable, which is somewhat an exception in that it chooses stronger consistency over availability in the face of partitions [70, 78].

Graph Stores: The last category of ‘NoSQL’ databases is very different from the categories and data models described above. This again shows how ill-defined the term ‘NoSQL’ actually is. Whereas key-value stores, document databases and column family stores rely on an aggregate-oriented data model, graph stores work on small records (nodes) with complex interactions and relationships between these records (edges). Put differently, the data model is all about nodes connected to each other by edges, all together forming a graph. Nodes represent real-world entities and edges represent relationships between them.

There are, however, several graph models based on nodes and edges that differ slightly. The one most widely used in graph stores is called the property graph. In this model, nodes and edges are both first-class citizens. Entities (nodes) can have multiple properties (e.g. a name, the age of a person), where a property can be seen as a key-value pair. Relationships (edges) have one start and one end node, a type/label (e.g. is_friends_with) and can also have multiple properties (e.g. since).


Additionally, relationships (edges) are directional and the direction has significance (e.g. if Paul loves Mary, that does not necessarily imply that Mary loves Paul). [113, pp. 25-26,111-122][187, p. 6][190, pp. 5-7,32-39,48-51] Other graph data models are the hypergraph, which additionally allows edges to have multiple start and end nodes, and RDF, which originated in the Semantic Web community, is a W3C standard and often the underlying data model of so-called triple stores. In RDF data is modelled in triples of a subject, a predicate and an object. Subject and object can be seen as nodes (start and end node) which are connected by the predicate as an edge. Nodes and edges are identified by URI references, which can also be used to link to RDF data from other sources19. One difference between RDF and the property graph model is that properties in RDF are modelled as literal objects, with a predicate connecting the subject to the object. In other words, properties are nodes themselves, with an edge connecting the entity node to the property node. The label of the edge identifies the property, while the literal node holds the property value. In the property graph, properties are simply attached to the nodes and are not nodes themselves. Furthermore, in the property graph edges can have properties themselves, while they cannot in RDF. [5][190, pp. 48-51,77-79] Additionally, the physical storage model of a graph can differ, using e.g. linked lists or indexes. Describing the concrete physical storage patterns would however take too much space here. I therefore refer to Robinson et al. [190, pp. 77-87] for a deeper discussion. A general remark is that different storage architectures make it easier for triple stores to shard the data, distribute it across machines and therefore improve horizontal scalability. The storage architecture in graph stores like Neo4J typically allows no automatic sharding at the database level. If sharding is necessary for scalability reasons, effort needs to be made within the application to partition the graph based on domain knowledge and to manually distribute the partitions to separate databases on different machines. Horizontal scalability is therefore largely limited. This is however a trade-off, and as a benefit, traversal in those graph stores is typically faster than in triple stores, providing lower query latency. [113, p. 119][190, pp. 78-79] Additionally, the graph in a graph store can still be replicated over several machines to improve availability and scalability with read workload by adding more read slaves. Replication is also done for availability and fail-over. Most graph stores use master/slave replication, where the master node is necessary to respond to writes. Write requests can be sent to slaves, but are conducted in a synchronous transaction with the master node, where the respective slave node only commits after the master node has committed. The writes are then propagated to the other slaves. This way write/write conflicts are avoided and write consistency is increased. Reads can be stale though, if they are made against a slave and an according update has not been propagated yet. If the master fails, typically an elected hand-off to one of the slaves is conducted. Additionally, most graph databases support ACID transactions within the database node. Keep in mind that distributed transactions are not necessary, as graph databases do not support sharding. [113, pp. 114-115,119] Schema flexibility is generally high.
New nodes or new edges to current nodes can easily be added at runtime. Schema definition is typically not necessary. It is, however, difficult to update several entities at once. Considering query flexibility, graph stores are very well suited for problems that are heavily based on relationships and require traversing a graph, that is, starting at one node and following its edges. This is very fast in graph stores, much faster than in relational database systems, where this kind of query requires a lot of joins. It is possible to create indexes over nodes, edges and properties, and it is also possible to select single entities or relationships based on their properties or to select sub-graphs based on graph patterns. Another use case is path finding. Analytical queries that involve any form of aggregation over properties of several entities or relationships are, however, typically not

19See the Semantic Web and the Linked Open Data project [2]


supported. For such tasks, there are several graph processing languages available. Some graph stores even support several and provide the choice to the user. Neo4J e.g. supports Cypher, which is a declarative graph query language, and Gremlin, which is more low-level and imperative. Triple stores based on RDF support SPARQL as a query language, which is also a W3C standard and the official RDF query language [6]. [113, pp. 115-119][190, pp. 55-59,91-92,179-205] Considering the reference architecture, a good fit for graph stores are Analysis Data Stores and Analysis Delta Stores that are used for applications which can be formulated as graph, path-finding, routing or dispatching problems and require traversal of complex relationships, e.g. distribution and routing optimization in logistics [113, pp. 120-121][190, pp. 142, 164-178]. Graph stores can also be used to store a network of complex relationships or another link-rich domain (e.g. social connections, product preferences or eligibility rules) within the Raw Data Archive. Additionally, if one data source is RDF data from the Linked Open Data project or generally the semantic web, RDF triple stores are the natural choice to store that data in the Raw Data Archive. Another use case would be to use graph traversal for recommendations, e.g. in the sense of ‘your friends also liked’ (traversing first the ‘friend’ and then the ‘like’ relationship) [113, pp. 120-121][190, pp. 141,145-156]. In this case, the graph traversal would, however, be best suited as an Analysis component within the operational application that makes the recommendation to the end user. Apart from RDF triple stores, the most widely used graph store is Neo4J [187, pp. 219-260].
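To make the property graph model and the ‘friends also liked’ traversal more concrete, the following plain-Java sketch models nodes and directed, typed edges that both carry key-value properties and then walks the IS_FRIENDS_WITH and LIKES relationships. It illustrates the data model only; all names are made up for illustration and it does not use the API of any particular graph store such as Neo4J.

```java
import java.util.*;

// A minimal, illustrative property graph: nodes and edges are first-class
// citizens, both carry key-value properties, and edges are directed and typed.
class Node {
    final long id;
    final Map<String, Object> properties = new HashMap<>();
    final List<Edge> outgoing = new ArrayList<>();
    Node(long id) { this.id = id; }
}

class Edge {
    final String type;                 // e.g. "IS_FRIENDS_WITH", "LIKES"
    final Node start, end;             // direction matters
    final Map<String, Object> properties = new HashMap<>();
    Edge(Node start, String type, Node end) {
        this.start = start; this.type = type; this.end = end;
        start.outgoing.add(this);
    }
}

public class FriendsAlsoLiked {
    // Traverse first the IS_FRIENDS_WITH edges, then the LIKES edges,
    // collecting the items the user's friends liked.
    static Set<Node> friendsAlsoLiked(Node user) {
        Set<Node> result = new HashSet<>();
        for (Edge friendship : user.outgoing) {
            if (!friendship.type.equals("IS_FRIENDS_WITH")) continue;
            for (Edge like : friendship.end.outgoing) {
                if (like.type.equals("LIKES")) result.add(like.end);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node paul = new Node(1); paul.properties.put("name", "Paul");
        Node mary = new Node(2); mary.properties.put("name", "Mary");
        Node book = new Node(3); book.properties.put("name", "Some book");

        Edge friendship = new Edge(paul, "IS_FRIENDS_WITH", mary);
        friendship.properties.put("since", 2010);   // a property on a relationship
        new Edge(mary, "LIKES", book);

        for (Node n : friendsAlsoLiked(paul)) System.out.println(n.properties.get("name"));
    }
}
```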

4.3.4 Apache Hadoop

Apache Hadoop Core

Apache Hadoop is an umbrella for several open-source software projects, which are, as the name suggests, all organized under the roof of the Apache Software Foundation [12, 15, 226]. The initial project was created by Doug Cutting with the goal of building a distributed computing framework and programming model that eases the development of distributed applications. The philosophy is to provide scale-out scalability over large clusters of rather cheap, commodity hardware. Its creation was motivated and is largely based on papers published by Google that describe some of their internal systems, namely the Google File System [119] and Google MapReduce [93]. Accordingly, Hadoop’s core consists of the Hadoop Distributed File System (HDFS) [198] and Hadoop MapReduce (MR)20. Note that this section can only provide an overview of these components and a discussion of their match with the reference architecture. For a deeper description of Hadoop’s architecture and internals, I refer to White [226] and Hadoop’s documentation [12]. Typically, both of them, HDFS and MapReduce, are deployed together in a cluster. Input data for MapReduce needs to be stored in an HDFS instance on the same cluster and outputs are written back there. As HDFS’s Datanodes and MapReduce’s worker nodes, called TaskTrackers, are running on the same physical nodes, MapReduce’s job management (JobTracker) component can take locality of data into account. That means it tries to schedule tasks to TaskTrackers that run on the same physical node as the HDFS Datanode that holds the necessary input data. Computation is pushed to the data, not the other way round. If this is not possible, e.g. because all TaskTrackers local to the necessary input data are already busy with another job, data needs to be sent from the Datanode that holds it to the node running the TaskTracker that executes the particular task. As this requires loading data over the network, it obviously slows down job execution considerably. [226, p. 28]

20Note that this discussion focuses on Hadoop 1.x. There is another thread of development versioned as Hadoop 2.x, which recently (25th August) evolved from alpha to beta status. The stable release is, however, still considered to be Hadoop 1.2.1. I will list some of the changes and architectural differences after the discussion of Hadoop 1.x.


Note, however, that HDFS and MapReduce do not necessarily need to be deployed together. HDFS can generally be used as a distributed file system without MapReduce on top. On the other hand, Hadoop can integrate several distributed file systems; HDFS is just one of them, KFS e.g. is another one. MapReduce can work on top of any of these file systems, though using another one can lead to decreased performance. Not all of the usable file systems allow for data locality optimization, and HDFS is in general designed and optimized to work together with Hadoop MapReduce. [226, pp. 47-49]

Hadoop Common: Hadoop Common contains a set of components, libraries and interfaces that mainly support other Hadoop sub-projects [12]. As these components do not have any functional implications, I will skip a deeper discussion.

Hadoop Distributed File System: The Hadoop Distributed File System is, as the name obviously suggests, a distributed file system modelled after the Google File System [119]. As mentioned above, it is the primary distributed data storage component used by other Hadoop applications, but can also serve as a stand-alone file system. HDFS makes distribution completely transparent to users. To end users it provides the view of a traditional file system over a hierarchy of directories and files. Only internally does it map and distribute files to different nodes in the cluster and translate directory and file operations to operations distributed over several nodes. [12, 198][226, pp. 41-73] In general, HDFS is optimized to handle very large files and a write-once read-many workload. It can have performance issues when operating on a large number of rather small files. HDFS files are broken into chunks of 64 megabytes each, and these chunks are distributed and replicated over nodes. The replication factor is typically set to ‘3’, but can be freely configured to adjust for the required level of availability. HDFS knows two types of nodes based on a master-slave distribution model. Typically there exists one master node, the so-called Namenode21, and several slave nodes, so-called Datanodes. [12, 198][226, pp. 41-73] The Namenode holds all metadata about the file system and the stored files. It manages the directory tree, maintains the mapping of data blocks to Datanodes and handles all namespace operations (e.g. renaming files) as well as file access by clients. Datanodes store the data blocks, serve client requests and provide access to data. They also execute operations on the data blocks that get triggered by the Namenode, e.g. deletion. If applications need to access data in HDFS, they use the HDFS client, a library that transparently provides them with an interface for file operations. The HDFS client manages the data access. It first requests metadata about a certain file from the Namenode. This metadata includes information on which nodes the blocks of the specific file are stored, and the client can use this information to directly request these blocks, which are then streamed into the client application. If a client wants to write data, it also needs to first send a request to the Namenode, which then provides it with node locations to write to. Using this scheme, HDFS achieves a decoupling of data and metadata, which allows for better load balancing and avoids that workload spikes in data access interfere with metadata operations. [12, 198][226, pp. 41-73] Considering this single-Namenode design, HDFS can become limited as data volumes grow. With metadata being stored in a single node’s memory, there is a limit to how many metadata objects can be stored, which can only be scaled up by adding more memory to the Namenode server. Additionally, this is a potential performance bottleneck if the workload of metadata operations gets too large and overwhelms the single Namenode server, and it is also an availability issue considering the Namenode as a single point of

21This is at least true for Hadoop 1.x. Hadoop 2.x adds Namenode replication to HDFS. It also adds the possibility of federated file systems consisting of several HDFS sub-systems, each with its own Namenode [16]

failure. The use of a so-called Backupnode, which acts as a journal store for the Namenode and holds an additional, up-to-date image of the file system, can soften the availability issue, as the Backupnode can take over in case of a failing Namenode. Nevertheless, special maintenance effort aimed at the Namenode server is required. These issues are, however, recognized and addressed with Hadoop 2.x, which allows for a multi-Namenode design. [12, 16, 198][226, pp. 41-73] Regarding data storage capacity and data access workload, HDFS is designed for linear scalability. If more nodes are added to the cluster, these can function as Datanodes, increase storage capacity and balance data read workload. Data blocks are replicated over several Datanodes, so HDFS is also fault-tolerant and has built-in recovery capabilities. If a Datanode fails, the data blocks it holds are typically available on another Datanode (assuming that not all replicas fail at once), the system can continue to serve requests and the respective data blocks can be replicated to another Datanode to achieve the configured replication factor again. Considering read and write performance, HDFS is optimized for batch processing, favouring overall throughput over individual operations’ latency. One reason for this is e.g. the large block size and the fact that data blocks need to be read as a whole. HDFS also does not allow random writes or updates, but only appends. [12, 198][226, pp. 41-73] In general, HDFS provides low-cost storage that is good for sequentially reading whole data blocks, so for tasks that require sequential data access or conduct operations over a large range of sequential data points. It is not good for searching for a single point of information in the data. After all, HDFS is just a file system and not a database. It does not provide any structure over the data and does not support indexing, at least not on its own. Considering the reference architecture, this makes HDFS a good fit for the Raw Data Archive, given its ability to store large volumes of unstructured data with a low cost-per-terabyte ratio. It is definitely not suitable for data stores that serve interactive queries and analysis, due to its batch nature. Additionally, HDFS gives little support with respect to data management. It simply provides access to a collection of files. It is the responsibility of client applications and users to ensure consistency of those files, do data maintenance and ensure compatibility of client applications as data evolves over time [98, 161].
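As a small illustration of the client view described above, the following sketch writes a file to HDFS and streams it back sequentially using Hadoop’s Java FileSystem API. The path is hypothetical, and the snippet assumes the Hadoop client libraries and a reachable cluster configuration (core-site.xml / hdfs-site.xml) on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write-once, read-sequentially access against HDFS. The client library hides
// the Namenode/Datanode interaction: metadata lookups go to the Namenode,
// while the actual block streaming happens directly against Datanodes.
public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Create a new file (HDFS supports no random updates, only create/append).
        Path logFile = new Path("/raw_data_archive/clickstream/2013-10-13.log"); // hypothetical path
        try (FSDataOutputStream out = fs.create(logFile)) {
            out.writeBytes("user=42\taction=view\titem=123\n");
        }

        // Read the file back sequentially as a stream.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(logFile)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```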

Hadoop MapReduce: Hadoop MapReduce is a programming framework modelled after Google’s MapReduce paper [93] and typically deployed over an HDFS instance. It aims to simplify the development of distributed applications and data processing. It does this by providing developers with a programming model to define and orchestrate distributed processing steps. The programming model defines two basic functions, map and reduce, which need to be implemented by the programmer. MapReduce then manages breaking jobs down into different map and reduce tasks and the distribution of each of these tasks over nodes in the cluster and parts of the input data. [12, 133][226, pp. 18-39,153-174] During the map phase, the system breaks input data from HDFS into independent chunks of key-value pairs and distributes them to the map tasks implemented by a programmer, each map task getting one chunk of the data. These tasks get processed in parallel over the independent data chunks, with the results again being represented as key-value pairs. The system then sorts the output of the map tasks based on their key and submits them to reduce tasks, which are again distributed over several nodes. The distribution to the reduce tasks works in such a way that all key-value pairs (from the map tasks’ output set) with an identical key get submitted to the same reduce task. The reduce tasks, again implemented by a programmer, combine and merge the different key-value pairs into one final output set. [12, 133][226, pp. 18-39,153-174] Considering the distribution model, MapReduce uses a master-slave architecture. The master, a so-called JobTracker, runs on a single node, while slaves, so-called TaskTrackers, run on all remaining nodes. A MapReduce library manages the communication between the client application and the different nodes. Before the MapReduce client can submit a job to the JobTracker, it needs to store all

necessary input data into the underlying file system (most typically HDFS) and compute input splits for the job. Input splits identify partitions of the input data, each of which is submitted to a single map task. The JobTracker then sets up the data structures required to manage the state of the job, retrieves the input splits from the file system and breaks the submitted job down into several tasks, one map task for each input split, plus a number of reduce tasks that is defined in the job configuration. Finally, it sends the tasks to the TaskTrackers which are most local to the required input data stored in HDFS. The TaskTrackers retrieve the input data from HDFS and run the map tasks over it. Afterwards, the respective TaskTrackers shuffle the output of the map tasks to the already set-up reduce tasks. Once the reduce tasks are finished, results are written back to HDFS and the respective TaskTrackers send a message to the JobTracker, which finally reports the job successful to the client application. [12, 133][226, pp. 18-39,153-174]
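The canonical word-count example below sketches what the two user-supplied functions look like in Hadoop’s Java MapReduce API: the map function emits (word, 1) pairs, the framework sorts and groups them by key, and the reduce function sums the counts per word. Input and output paths point to HDFS and are passed as command-line arguments.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each call receives one line of the input split and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all values for an identical key arrive at the same reducer.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) sum += value.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);      // local pre-aggregation on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```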

Hadoop MapReduce, and Hadoop in general, has a number of advantages and disadvantages. First, Hadoop is typically rather cheap and fast in deployment and maintenance [211]. The software itself is open source and free, and Hadoop runs on commodity servers. This gives Hadoop a good cost-per-terabyte ratio for storing and processing data. Additionally, the use of commodity hardware, automatic data and task distribution and respective optimizations provide Hadoop with good scalability. Data locality for MapReduce tasks reduces the overhead of large data transports through the network, which could otherwise render the network a bottleneck and affect total throughput. Hardware can be added or removed to adjust for storage needs.

Considering performance, Hadoop MapReduce works well for trivially parallelizable data processing tasks, that is, tasks where the input data can be easily split into chunks which can then be processed in parallel. It is less suitable for iterative workloads and for workloads where single output records involve computation over all or large parts of the input data or where parallel tasks need to communicate with each other. It is possible to chain several map-reduce jobs together into a processing pipeline to simulate communication [113, pp. 72-77], and algorithms can be adjusted or replaced by others that better fit the MapReduce model [84, 153, 154]. However, neither of these is ideal. One problem with chaining map-reduce stages together is that data is automatically persisted after each reduce task and needs to be accessed in HDFS as input for the next map-reduce stage. This creates I/O overhead and can slow down the processing. It is also not possible to keep an overarching state or buffer data in memory between chained tasks. On the other hand, loading data into Hadoop and storing it in HDFS is very fast as it does not require any translation into a certain schema or any data parsing. Data is just stored in its native format [176].
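A minimal sketch of such a chained pipeline is shown below: the first job persists its output to an intermediate HDFS directory, which the second job then reads as its input. The mapper and reducer classes are left out here (the framework defaults are identity functions), and all paths are hypothetical; the point is only to show where the extra HDFS round trip between the stages comes from.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Two-stage pipeline: stage 1 writes to an intermediate HDFS directory,
// stage 2 reads that directory as its input. The intermediate persist step
// is the I/O overhead discussed above.
public class ChainedJobsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // hypothetical intermediate directory
        Path output = new Path(args[2]);

        Job stage1 = new Job(conf, "stage 1: extract");
        // ... set mapper/reducer classes for the first processing step ...
        FileInputFormat.addInputPath(stage1, input);
        FileOutputFormat.setOutputPath(stage1, intermediate);
        if (!stage1.waitForCompletion(true)) System.exit(1);   // block until stage 1 is done

        Job stage2 = new Job(conf, "stage 2: aggregate");
        // ... set mapper/reducer classes for the second processing step ...
        FileInputFormat.addInputPath(stage2, intermediate);    // reads stage 1 output from HDFS
        FileOutputFormat.setOutputPath(stage2, output);
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}
```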

One limitation of Hadoop and MapReduce is their batch nature. They are simply not designed for highly interactive analytical workloads. The biggest reason is that MapReduce has an inherent contention and start-up time to create and distribute tasks and possibly data over the different nodes. Another reason is the lack of indexes to support selective queries and the need to always read entire data blocks and sequential ranges of data. This renders it infeasible for any latency-dependent tasks, interactive analysis or near real-time analysis of streaming data. [112, 170, 176]

Considering the reference architecture, Hadoop is a good fit for Deep Analytics components, due to their match in being batch-oriented. To help with the implementation, there are several libraries of machine learning and data mining algorithms implemented on Hadoop, e.g. Apache Mahout [8]. Hadoop MapReduce is also a good fit for Information Extraction as well as Data Cleaning, Data Integration and Data Transformation components, or, put differently, for ETL processes in general [155, 211]. These typically do not read in data selectively, but sequentially read in and process whole data chunks. Additionally, they typically do not create output results from a large range of input data points. Of course, using Hadoop MapReduce for such workloads works especially well where data is already stored in HDFS, so e.g. for processing data stored in the Raw Data Archive. This

also emphasizes the idea of combining Raw Data Archive and Staging Area: all extracted data is directly loaded into the Raw Data Archive, cleaned, transformed and integrated there using MapReduce, with relevant data further loaded into the Enterprise Data Warehouse, while the entire data set is kept in the Raw Data Archive.

Apache Hadoop Ecosystem

The ecosystem of sub-projects related to Hadoop or usable in its context is large and growing. As organizations tackle some of the data management issues and extend Hadoop, a variety of sub-projects get created and open-sourced. This makes the ecosystem rather complex and makes it hard to keep an overview. In the following, I list some of these MapReduce extensions and give an indication of what they can be used for. Note that this is only an overview, considering the scope of this thesis, and by no means a deep technical discussion of the internals of those tools. For this kind of discussion I refer to literature and papers that describe certain technologies in depth. Additionally, White [226, pp. 12-13,301-403] gives an overview of some of the sub-projects and describes some of the most important ones, namely Pig, HBase and ZooKeeper.

Structure on top of HDFS and integration of database functionality : HBase applies a column-family data model22 on top of Hadoop and HDFS, so it is essentially a non-relational database that runs on top of HDFS. This allows data access based on a more fine-granular structure, adding access to specific data items based on their key and to certain attributes and columns within a data set instead of mere sequential scanning of data. It also allows users to update, insert and delete data items. Additionally, it adds transactional capabilities to Hadoop. Considering the reference architecture, it can be used on top of HDFS within a Raw Data Archive to add more fine-granular access and data management capabilities. It also provides good read and write performance and can be used as a Stream Staging Area. [14, 56] HadoopDB is a project that combines relational databases with Hadoop. It places database instances, namely PostgreSQL, on the nodes of a Hadoop cluster next to TaskTracker and Datanode. These database instances can be used to store structured data within a Hadoop cluster, provide indexing and therefore increase the performance of selective queries. HadoopDB further provides a Database Connector for MapReduce to access data in those database instances, a translator to transform SQL queries into MapReduce jobs and back, which can also be used to query Hadoop using SQL, and a catalog to store meta-information about the database instances. [36, 40] Hive adds a virtual structure and table schemas on top of HDFS and maps this structure to data on the Datanodes of HDFS. It can then be queried using a query language similar to SQL, called HiveQL, and translates the queries into MapReduce jobs. While Hive is optimized for scalability, due to the translation of queries and their execution as MapReduce jobs it still suffers the respective overhead and similar latency issues. It therefore mainly provides comfort to formulate MapReduce jobs in a higher-level language and typically simplifies the development of these jobs. Additionally, the metadata Hive creates about the table schemas can be shared with other components and be used for data exploration. Considering the reference architecture, Hive can be used for every storage area where HDFS is used, to increase programmer productivity for MapReduce jobs and to provide metadata for data exploration. [13, 219, 220]
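To illustrate the fine-granular, key-based access that HBase adds on top of HDFS, the following sketch writes and reads a single cell using the HBase Java client API of roughly that era (the HTable-based interface, later superseded by a connection-based one). The table ‘clickstream’ with column family ‘raw’ is a made-up example and would need to exist beforehand.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Key-based access on top of HDFS: single cells of a column-family table
// can be written and read without scanning whole files.
public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        HTable table = new HTable(conf, "clickstream");      // hypothetical table with family "raw"

        // Insert (or update) a single cell, addressed by row key, family and qualifier.
        Put put = new Put(Bytes.toBytes("user42#2013-10-13T12:00"));
        put.add(Bytes.toBytes("raw"), Bytes.toBytes("action"), Bytes.toBytes("view"));
        table.put(put);

        // Point lookup by row key -- no MapReduce job and no full scan required.
        Result result = table.get(new Get(Bytes.toBytes("user42#2013-10-13T12:00")));
        System.out.println(
            Bytes.toString(result.getValue(Bytes.toBytes("raw"), Bytes.toBytes("action"))));

        table.close();
    }
}
```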

22See Section 4.3.3 for a description of the column-family model

109 CHAPTER 4. REFERENCE ARCHITECTURE

Metadata in Hadoop : HCatalog is part of the Hive project and responsible for providing the metadata layer and maintaining the table schemas within Hive. It also allows sharing metadata and interoperating across Hadoop tools, e.g. with Pig, but also with applications outside of Hadoop through a Representational State Transfer (REST) interface, e.g. with a general Metadata Management component for the entire ‘big data’ system. [17]

Pipelining MapReduce Jobs : Pig is another abstraction layer on top of MapReduce. It provides a platform and execution framework for complex data flows for ETL processing and data analysis. Pig internally generates MapReduce jobs and chains them together to execute the data flow. To formulate the data flows, Pig uses its own language, Pig Latin, and provides primitives for common operations. Similar to Hive, it allows developing MapReduce jobs and workflows in a higher-level language and increases programmer productivity. For the reference architecture, it can be used wherever Hadoop MapReduce is used, especially for ETL-type use cases, to create the necessary workflows. [117, 170]

Programming models and high-level languages on top of Hadoop MapReduce : Pig Latin is the high-level data flow language on top of Pig. It abstracts away from the low-level MapReduce Java implementation. It provides common operations, e.g. groupings, joins and filters, and can be extended with user-defined functions written in Java. A Pig Latin script defines input data sets, operations to run on these input data sets and the output sources to write results to. These operations are connected with each other in a graph, and Pig automatically finds the optimal data flow through this graph. [117, 175]
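The following sketch shows what such a Pig Latin data flow can look like when submitted from Java via Pig’s PigServer interface; the input path, schema and output path are hypothetical. Pig compiles the registered statements into one or more chained MapReduce jobs once the store call triggers execution.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// A small ETL-style data flow expressed in Pig Latin and submitted from Java.
public class PigFlowSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);   // ExecType.LOCAL is useful for testing

        // Load raw log lines, filter and group them, count events per user.
        pig.registerQuery("logs = LOAD '/raw_data_archive/clicks' USING PigStorage('\\t') "
                + "AS (user:chararray, action:chararray, item:long);");   // hypothetical input and schema
        pig.registerQuery("views = FILTER logs BY action == 'view';");
        pig.registerQuery("grouped = GROUP views BY user;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group AS user, COUNT(views) AS view_count;");

        pig.store("counts", "/staging/view_counts_per_user");   // triggers job execution
    }
}
```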

HiveQL is the high-level, declarative query language on top of Hive. Programming in HiveQL is very similar to SQL.
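As an illustration of how close HiveQL is to SQL, the sketch below runs an aggregation over a hypothetical Hive table through the standard JDBC interface; it assumes a HiveServer2 endpoint, which roughly corresponds to the Hive versions of this period. Behind the scenes, Hive compiles the query into MapReduce jobs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// HiveQL looks and feels like SQL, but each query is compiled into MapReduce
// jobs behind the scenes -- convenient, yet still batch-latency.
public class HiveQlSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server:10000/default", "analyst", "");   // hypothetical host/user
             Statement stmt = conn.createStatement()) {

            // An aggregation over a Hive table mapped onto files in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT user, COUNT(*) AS views FROM clickstream "
                + "WHERE action = 'view' GROUP BY user");
            while (rs.next()) {
                System.out.println(rs.getString("user") + "\t" + rs.getLong("views"));
            }
        }
    }
}
```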

Jaql is another declarative language on top of Hadoop that translates into MapReduce jobs. It provides a flexible, semi-structured data model based on JSON and is e.g. used in IBM’s InfoSphere BigInsights product, which is also based on Hadoop. [65]

Additionally, there are several other programming models that can be used with Hadoop and translate into MapReduce jobs. Cascading is a high-level Java API that abstracts from MapReduce complexities using more intuitive pipe and data flow models. Scalding is a Scala API on top of Cascading that provides pre-implemented, common operations similar to the data flow abstractions in Pig. Finally, Cascalog is an API on top of Cascading, adding logic programming concepts similar to Datalog.

Querying data in HDFS without MapReduce : Drill allows scanning through smaller data sets very quickly. It is inspired by Google’s publication about their internally used query engine Dremel [167]. Drill scans data in parallel and uses different algorithms to speed up scans, executing filtering operations, aggregation etc. in parallel. It is still a full-data-scan tool, but it does not impose the task management overhead that MapReduce causes. It is especially fast for tasks such as data aggregation, sorting and top-x measurements, and its use cases are interactive and analytical queries. Considering the reference architecture, it can again be used where HDFS is used, to accelerate some of these operations, especially for the Deep Analytics components. Apache Drill is still in the incubator phase [18].


Hadoop Management and Administration : Oozie is a workflow management and scheduling system to coordinate jobs written in different languages and originating from different tools, e.g. Pig, Hive and native MapReduce. It allows linking of jobs and specifying order, triggers and dependencies between them, but also scheduling of jobs based on the existence of certain data in HDFS. [19]

Note that this is only a selection and there are still several other projects related to Hadoop. Sqoop, e.g., connects HDFS with relational databases to load data from one into the other. There are also several projects that try to integrate the statistical platform and language R with Hadoop, e.g. Ricardo [89] and RHadoop. Apache Mahout provides a library of machine learning and data mining algorithms implemented with Hadoop MapReduce [8].

Additionally, software for stream acquisition is important for the reference architecture. There are several Apache projects to support that as well. Flume [20] is a system for distributed log collection from many sources. Flume consists of several agents deployed across an IT environment to collect logging data and send it to HDFS. Chukwa [21, 180] is a similar system, but focused more on acquiring data for periodic real-time analysis (within minutes), while Flume emphasizes continuous real-time analysis (within seconds) and enabling batch analysis on the acquired data. Kafka [22, 149] is a distributed publish-subscribe system that can be used to publish in-streaming data and forward it to applications subscribed to a certain data stream.
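To give an impression of the publish-subscribe style of stream acquisition, the following sketch publishes a single event to a Kafka topic using Kafka’s newer Java producer API (which postdates the versions discussed here, but follows the same principle); broker address, topic name and payload are made up for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Publishing an in-streaming event to a Kafka topic; downstream consumers
// (stream analysis, staging into HDFS, operational applications) subscribe
// to the topic and receive the events independently of each other.
public class StreamPublisherSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");   // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = user id (controls partitioning), value = the raw event payload.
            producer.send(new ProducerRecord<>("clickstream-events", "user42",
                    "{\"action\":\"view\",\"item\":123}"));
        }
    }
}
```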

4.3.5 Comparing Storage Options

In general, there are three classes of storage options to choose from for implementing the reference architecture: distributed, relational databases, ‘NoSQL’ databases and finally HDFS. Relational databases mostly use SQL for data access, processing and analytics. ‘NoSQL’ databases use some proprietary API for data access and possibly MapReduce for data processing. Still, processing and analytical tasks typically, and more so than in relational database systems, require the data to be accessed and transferred into an analytical application and the processing to be conducted there. Finally, processing of data stored in HDFS mostly involves Hadoop MapReduce.

Unfortunately, there are no overarching benchmark results available in the literature that would allow a close performance comparison of these systems. While there are benchmark reports available, they are scattered, involve just a small collection of products and are often not comparable, as they use different hardware configurations and different algorithms and measurements to test the performance. Therefore, some of the following discussion should be taken with a grain of salt.

However, several authors have published benchmark results, which seem to agree that distributed, relational systems generally still have a slight performance advantage. This is especially true for complex analytical queries that involve joins and aggregation, or for very selective data access. Some reasons for relational databases outperforming Hadoop are their support of indexing, Hadoop’s overhead for starting and managing tasks (especially with smaller data sets) and the better use of buffering data in memory. Additionally, Hadoop needs to parse data at run time to extract attribute values, while relational databases already do this at load time. Some of these points also indicate that relational database systems are especially favourable for smaller data sets that largely fit in memory and for selective queries, while Hadoop catches up as data sets get bigger and sequential scans are the norm, that is, in typical batch workloads [226, pp. 4-6]. Together with their query flexibility, this makes relational databases a good fit for Analysis Data Stores, which typically only contain smaller, aggregated parts of the data. On the other hand, as Hadoop does not need to parse data at load time, check constraints and fit the data into a pre-defined structure, the write performance of HDFS is better than in

relational databases. This also means that HDFS can store unstructured and semi-structured data, which does not directly fit into relational database management systems. [36, 98, 112, 176, 211] Comparing relational and ‘NoSQL’ databases, a somewhat similar argumentation applies. Again, enforcing a structure at load time hurts schema flexibility and write performance of relational databases, but this is largely a trade-off to achieve query flexibility and query performance. While ‘NoSQL’ databases are as fast or even faster for simple data access via a key or a range of keys, relational databases have a performance advantage for complex analytical queries that require joins and aggregation. For most ‘NoSQL’ databases, this requires shipping the input data sets completely to the application to join them there. Typically, however, the output set of a query is smaller than its input set, especially where aggregation and query conditions are involved. The higher communication volume therefore leads to a performance disadvantage for ‘NoSQL’ databases. Additionally, the query optimizer of distributed, relational databases typically considers network traffic and data locality, essentially favouring local joins where possible, e.g. by replicating small tables over all nodes. Again, this decreases communication overhead, and it remains very difficult for application programmers to implement a certain join or query as efficiently as a highly developed query optimizer does. This point also applies to Hadoop. Mapping this to the reference architecture, as mentioned above, relational databases are a good fit for the Enterprise Data Warehouse, as they impose structure onto the data, enforce data constraints and therefore help to manage the data and keep it clean. They are also a good fit for Analysis Data Stores due to their good query performance and flexibility. However, their schema is inflexible and they cannot store large volumes of unstructured data. This is where HDFS can shine, which makes it a good candidate for the Raw Data Archive, also considering that the Raw Data Archive is expected to have a larger data volume than the Enterprise Data Warehouse, and given the scalability and low cost-per-terabyte ratio of HDFS. It also works well as a Staging Layer, with MapReduce jobs conducting ETL tasks like Information Extraction, Data Integration and Data Cleaning. ‘NoSQL’ databases, on the other hand, typically have very good write and read performance on keys. This makes them a good choice for capturing streaming data, e.g. implementing the Stream Staging Area, also considering that streaming data is often semi-structured, which ‘NoSQL’ systems are well prepared to handle.

5 Verification of the Reference Architecture

After developing the reference architecture, it is important to verify whether it is relevant for practice and whether it fits concrete architectures. There exist several frameworks for software architecture verification and evaluation, the best known being the Scenario-Based Architecture Analysis Method (SAAM), the Architecture Level Modifiability Analysis (ALMA), the Performance Assessment of Software Architecture (PASA) and the Architecture Trade-off Analysis Method (ATAM) [54, 105]. However, all of them are designed for evaluating concrete and not reference architectures, and the level of abstraction of the latter makes it hard to apply them in this context. One reason is that most of them are heavily based on stakeholder interviews and scenario analysis. According to Angelov et al. [50], due to their abstractness, reference architectures do not have a clearly defined group of stakeholders, which makes it difficult to generate a concrete set of scenarios. To verify a reference architecture, Galster and Avgeriou [114] therefore suggest either applying the reference architecture in concrete projects and conducting case studies, conducting reference implementations and prototyping, or deriving the validation from concrete architectures by mapping them onto the reference architecture, with the latter method being especially useful for classical reference architectures based on concrete systems. Due to the scope of this master’s thesis and the broadness of the proposed reference architecture, conducting an extensive reference implementation is not feasible. Additionally, I have access neither to a concrete project in this space nor to practitioners for interviews. However, as described in Section 2.2.3, the reference architecture is a classical one and needs to reflect knowledge from existing systems and be based on concepts proven in practice. Therefore, I decided to follow the third path and validate the reference architecture by mapping it to concrete ones. In this chapter, I map concrete ‘big data’ architectures from industry against the proposed reference architecture to evaluate whether there is a match with proven concepts and to give some indication whether the reference architecture is complete, that is, whether it can be used to describe a variety of systems in the space. I base the verification on concrete architecture descriptions by Facebook [221] and LinkedIn [216]. Additionally, I will check if the architectural patterns presented in an Oracle white paper [217] can be described with the reference architecture. This verification mainly tackles the implementation-oriented view of the reference architecture. The functional view takes its justification from the set of requirements specified in Section 3.2. I acknowledge that this is a rather small selection, biased towards end-user-facing web companies. To the best of my knowledge, there are no publications of concrete ‘big data’ architectures for companies or enterprises in other industries available. I consider this a serious limitation and propose that future work is necessary to make the verification more thorough. This can involve mapping concrete ‘big data’ architectures from companies in other industries, presenting the reference architecture to practitioners and interviewing them about how

they judge its applicability, as well as doing case studies and applying the reference architecture in concrete project contexts.

5.1 Data Warehousing and Analytics Infrastructure at Facebook

Facebook’s architecture for data warehousing and analytics is described by Thusoo et al. [221]. The main software products they use within the architecture are Scribe for log collection, HDFS for storage and Hadoop MapReduce together with Hive for analytics, both batch and ad-hoc. The architecture is depicted in Figure 5.1. The description is from 2010, so the architecture has certainly evolved and does not look the same any more at this point in time. However, Facebook is one of the most data-intensive companies in the world; the described architecture served a 15 PB data warehouse with a daily increase in data volume of about 60 TB. From this point of view, the data management problems Facebook faced three years ago might still be similar to the problems other companies face today.

Figure 5.1: Concrete ‘Big Data’ Architecture: Data Flow Architecture at Facebook [221]

The analytical system in this case has two major data sources. The first is a structured data source, a tier of federated MySQL instances that contain all data about Facebook’s website and the data that is operationally served to its users. This data gets loaded daily. The second data source is the web servers that generate log data about how the website is used. This data is semi-structured and constantly streaming in as the website is used, making this a streaming data source. The data from the federated MySQL tier is scraped daily and loaded to the Hive-Hadoop clusters. This scraping can be mapped to the Extraction component. The log data from the web servers streams into a cluster of Hadoop-Scribe servers. Scribe aggregates the in-flowing data and stores the aggregated logs in HDFS, from where they are periodically compressed and loaded to the Production Hive-Hadoop cluster. Mapping this to the reference architecture, the aggregation in Scribe refers to the Stream Acquisition component, whereas HDFS on the same cluster serves as Stream Staging Area. The log data in the Hive-Hadoop clusters gets published either hourly or daily. With this delay it does not qualify as an Analysis Delta Store, but best maps to the Raw Data Archive or the Enterprise Data Warehouse in the reference architecture. The mapping of the Production Hive-Hadoop cluster and the Adhoc Hive-Hadoop cluster is not completely clear. However, as far as the paper describes it, there seems to be no further data cleansing and logical data integration. Therefore I tend to map both,


Production Hive-Hadoop cluster and Adhoc Hive-Hadoop cluster, to the Raw Data Archive as the combination of the HDFS instances in both clusters. Essentially, the Raw Data Archive is divided into two storage sub-areas (the Production and the Adhoc Hive-Hadoop cluster), which are physically separated but kept synchronized by a replication process. Additionally, they are used to serve different analysis tasks. The Production Hive-Hadoop cluster executes highly important batch jobs with strict deadlines, while the Adhoc Hive-Hadoop cluster executes lower-priority batch jobs. Both of them qualify as Deep Analytics tasks which are executed over data in the Raw Data Archive and written back to the same. They are also written back to the federated MySQL tier, which is a backflow to the data sources. Additionally, users can conduct ad-hoc analysis using HiPal or the Hive CLI over tables defined in Hive and mapped to data in HDFS. HiPal and the Hive CLI can therefore be mapped as Free Ad-hoc Analysis applications, while Hive (or the logical structure Hive imposes over HDFS) can be mapped as an Analysis Data Store. Considering this discussion, Figure 5.2 shows how the concrete architecture from Facebook can be mapped to and expressed using the reference architecture. While it was possible to map it to the reference architecture and it was generally a good fit, there are, however, some things that cannot be easily mapped or expressed with the reference architecture, and some observations that should be noted. First, the division of the Raw Data Archive (or any other storage area) into several physically distinct parts to serve different classes of jobs is not representable with the components of the reference architecture. This is, though, not an issue affecting the correctness and utility of the reference architecture. The different components are not defined to be atomic and can themselves contain sub-components. This is mainly a consequence of the abstractness of reference architectures in general. It is, however, important to note that this is actually the case. Second, the reference architecture does not include a back-flow to general data sources, just a back-flow from the Stream Analysis component. The data is available through the Data Access API and can be accessed by external applications, but this is a pull rather than a push mechanism. Third, Analysis Data Stores and Raw Data Archive or Enterprise Data Warehouse do not necessarily need to be physically distinct. Here, the difference is just the logical table structure Hive layers over low-level HDFS data. In other situations, free ad-hoc analysis might even be possible directly over the data in the Raw Data Archive.

Figure 5.2: Mapping to the Reference Architecture: Data Flow Architecture at Facebook


5.2 The ‘Big Data’ Ecosystem at LinkedIn

LinkedIn’s ‘big data’ ecosystem is described by Sumbaly et al. [216]. The ecosystem ingests both streaming activity data and data at rest from relational databases. It ensures that the data adheres to some structure guidelines, runs analytical batch jobs over the data and makes it accessible in several ways. First, it streams data back into operational systems. Second, it loads data into databases for application and user access. Third, it creates data cubes and allows for OLAP operations over the data. The technologies it uses are Kafka for stream processing and for streaming results back, HDFS for storing the data, Pig and Hadoop MapReduce for batch analytics, Azkaban to define workflows, Voldemort as a database to provide results back to applications and finally Avatara for cube construction and access. Azkaban is a workflow scheduling tool developed by LinkedIn to manage, trigger, execute and monitor workflows with several jobs formulated either in Pig, Hive or natively in Hadoop MapReduce.

Figure 5.3: Concrete ‘Big Data’ Architecture: Big Data Ecosystem at LinkedIn [216]

LinkedIn generally has two types of data sources. First, there is activity data that is constantly flowing in from online applications as Streaming Data Sources. Second, there is data from operational, relational databases as Structured Data Sources. The latter is directly loaded into a ‘Hadoop for ETL’ cluster and stored in HDFS. Here the data runs through an ETL workflow and is then distributed to other storage areas. HDFS on the ‘Hadoop for ETL’ cluster can therefore be mapped to a Staging Area. The activity data is acquired from the online services and propagated into the ‘big data’ system using Kafka. Kafka is an Apache project and essentially a publish-subscribe service. Online services publish the activity data in Kafka, where it is grouped into topics, which other applications can subscribe to. LinkedIn uses two separate Kafka clusters. The primary cluster is used to receive activity data, check its schema against a schema registry, filter out data whose schema does not fit the definitions and publish it back to other operational sources that might need it. A mirror process keeps the secondary cluster in sync, which publishes the data to the ‘big data’ ecosystem. Mapping this to the reference architecture, the primary Kafka cluster resembles the Stream Acquisition

component, including a filtering function. The secondary Kafka cluster holds the data and provides it to the ‘big data’ system, therefore resembling the Stream Staging Area. Activity data from this second Kafka cluster is then pulled every 10 minutes and loaded into the ‘Hadoop for ETL’ cluster mentioned above. An Azkaban/MapReduce workflow on this cluster checks for new data and pulls it into its HDFS directory. This part of the workflow can be seen as a Load component. Once the data is loaded, the workflow runs an aggregator job to combine and deduplicate data and another job to enforce retention policies. This part of the workflow can be seen as a combined Data Cleaning and Data Integration component. Once this workflow has run and the data is available in the ETL HDFS instance, it is replicated to two more instances, ‘Hadoop for development workflows’ and ‘Hadoop for production workflows’. The Hadoop development instance is used by workflow developers to deploy new workflows, test them and get them reviewed. Only afterwards can they be deployed to the Hadoop production instance. The HDFS instance on the development cluster therefore clearly qualifies as a Sandbox, with the newly tested workflows being preliminary Deep Analytics components over the data. In the ‘Hadoop for production workflows’ cluster, HDFS holds the data to be served to analytical workflows (again using Azkaban to manage Pig, Hive and native MapReduce jobs), once they are tested and reviewed, to derive value from it, e.g. in the form of some predictive application. Considering the ETL processing and schema matching done before, one can map this HDFS instance to the Enterprise Data Warehouse of the reference architecture, with the analytical workflows representing Deep Analytics components. The results are then written back into the HDFS instance as a derived data set, but also directly pushed to several other systems. First, analysis results can be streamed back to online applications by publishing them in the first Kafka cluster, from where they can be accessed by applications that subscribe to the respective topic and streamed into the application, e.g. in the form of a news feed. To some extent this is a backflow of data to the data source, which resembles the observation from discussing Facebook’s data warehouse architecture above, that this is a missing connection in the reference architecture. It also does not fit the backflow of an analytical model into a Stream Analysis component, as it aims more at providing the data for presentation in an operational application. On the other hand, it is not only a backflow to the data source, but the data is made available to all applications subscribed to the respective topic in Kafka. Comparing the intention, it fits the Data Access API in the reference architecture. The reference architecture, however, does not include an intermediary (Kafka in this case) between data access and Enterprise Data Warehouse, but the API directly accesses the latter. Additionally, Kafka as used in LinkedIn’s infrastructure combines the propagation of data to operational systems with the extraction of data from the operational systems into the ‘big data’ system. It therefore fulfils more of a two-way communication function. Additionally, there are two other access and usage scenarios for data out of the ‘big data’ ecosystem. For these, data is loaded into multiple Project Voldemort instances.
Project Voldemort is a key-value store initially developed and open-sourced by LinkedIn and in this case extended with optimizations for bulk-loading read-only data [215]. Some of these Voldemort instances can be directly accessed by operational applications to read data and analysis results. Other Voldemort instances are used to serve Avatara, an OLAP system developed by LinkedIn for scalable cube construction and materialization, for which it internally leverages Hadoop, as well as for query serving [229]. Dashboarding applications can then query the Avatara cubes to present and visualize the data and to enable end users to navigate through it. Avatara and its serving Voldemort instances can therefore clearly be mapped to Analysis Data Stores, with the dashboard applications on top as Guided Ad-hoc Analysis components. The directly accessible Voldemort instances could also be identified as Analysis Data Stores, with Voldemort’s internal API as a Data Access API to provide data for all kinds of operational systems.


It is, however, arguable whether these Voldemort instances and the data within them fit the criteria of being analysis- or application-optimized. It could be argued that they do, due to Voldemort being extended and highly optimized for read access.

Additionally, there is a Staging Voldemort instance in the LinkedIn ecosystem, which is loaded from the ‘Hadoop for production workflows’ cluster. It is used for debugging and to understand results and outputs of workflows. This is a component that is completely missing in the reference architecture.

Figure 5.4: Mapping to the Reference Architecture: Big Data Ecosystem at LinkedIn

Again, it was largely possible to map the concrete architecture to the reference architecture. Figure 5.4 shows how LinkedIn’s system can be modelled using the reference architecture. However, as with Facebook’s data warehousing and analytics infrastructure, there were some components that did not completely fit. First, as with Facebook’s architecture, a backflow of data to the sources cannot easily be mapped, as the reference architecture only considers an immediate backflow of Stream Analysis results. The Data Access API does not completely make up for this, as it is defined to enable pull but not push data delivery. Second, the reference architecture does not include a serving area where data is temporarily stored to be accessed by operational applications. This shows, as the mapping of the Voldemort instances for data serving to an Analysis Data Store is not a complete fit: there is no transformation to optimize the data structure for data analysis, and additionally the data is not necessarily used for further analysis, but just included and presented in operational applications. The third point is connected to this, as the data streaming into the primary Kafka instance, the intermediate storage there and the provision of the data back to online services (giving it the function of a two-way communication layer) cannot directly be modelled using the initial reference architecture. Fourth, the Staging Voldemort instance fulfils a debugging and output retracing function that is not considered in any component of the reference architecture, even if it serves the same intention as requirement VAR4.3, that is, to make results understandable and traceable.


5.3 Oracle ‘Big Data’ Architecture Patterns

In an Oracle whitepaper, Sun and Heller [217] describe several patterns for ‘big data’ processing, which should be included in and be representable by the reference architecture. Figure 5.5a shows Pattern #1 - Initial Data Exploration. The idea is to define a virtual table structure with a mapping to the underlying HDFS structure and to mount this structure into a relational database management system. First, the reference architecture is more abstract, so it does not prescribe concrete technology choices like HDFS. However, the pattern can be applied to map data from the Raw Data Archive, which can be implemented using HDFS, to an Analysis Data Store, which can then be accessed using a Free Ad-hoc Analysis component, which SQL actually is. The mapping of potentially semi-structured raw data in HDFS into relational tables is essentially a virtual Data Transformation task. Figure 5.5b shows the modelling of the pattern using the reference architecture.

(a) Big Data Pattern #1 [217] (b) Mapping to the Reference Architecture

Figure 5.5: Mapping to the Reference Architecture: Big Data Pattern #1 - Mount HDFS into a RDBMS

Figure 5.6a shows Pattern #2 - Big Data for Complex Event Processing. Here, the concept is to acquire streaming data from different applications to do stream analysis within a complex event processing engine. Additionally, streaming data is captured into a write-optimized ‘NoSQL’ database. Now data streaming into the complex event processing (CEP) engine can be joined with streaming data that was flowing in before during a certain time window and got captured in the ‘NoSQL’ database, but also with data from different data streams. Additionally, Hadoop MapReduce is used to pre-calculate models (risk profiles in this example) from detailed data stored in an HDFS cluster and to send them to the complex event processing engine.

(a) Big Data Pattern #2 [217] (b) Mapping to the Reference Architecture

Figure 5.6: Mapping to the Reference Architecture: Big Data Pattern #2 - Complex Event Processing

Again, the pattern is more concrete than the reference architecture concerning technology choices, but it is also less concrete concerning the function of the different components. Mapping the ‘NoSQL’ instance to a Stream Staging Area component and the CEP engine to a Stream Analysis

component is rather straightforward. Mapping the HDFS instance is more difficult, as the whitepaper does not give much information about the characteristics of the data in there. This generally leaves room for mapping it to either a Raw Data Archive or an Enterprise Data Warehouse. Considering the technology, however, using a plain distributed file system without a (relational) database management system on top makes a mapping to a Raw Data Archive more likely.

Additionally, mapping the pattern to the reference architecture implies several components that are not in the pattern diagram in Figure 5.6a, but are described in the whitepaper. First, data from HDFS is not just integrated into the CEP engine; rather, Hadoop MapReduce is used to create profiles, which are sent to and can be used in the latter. This clearly maps to a Deep Analytics component. Second, the diagram shows no equivalent for stream acquisition, but the text implies that different data streams are streamed into the CEP engine and into the ‘NoSQL’ database. This is the function of the Stream Acquisition component, which is therefore added to the mapping. Third, the ‘Alerts’ component in the diagram actually includes several systems, which need to be mapped to different reference architecture components. The BPEL engine refers to a business process management system and can therefore be mapped to an external Operational Application. The Dashboard straightforwardly maps to a Reporting & Dashboarding component. The Mobile interface is a bit more tricky, as the text mentions that it is part of the dashboarding solution to push alerts to mobile devices. It could therefore be just included in the Reporting & Dashboarding component, which then externally pushes data to the mobile device. It could, however, also be modelled via a Data Access API, which is used by the phone to pull this information. The decision of how to model this in the reference architecture can therefore not directly be derived from the pattern description, but depends on the concrete implementation. Finally, the ‘Alerts’ component of the pattern diagram contains a database, and the description implies that the output of the CEP engine is pushed there to enable further processing and analysis. This can be modelled as a backflow into the Stream Staging Area, from which this data is loaded into the Raw Data Archive and then integrated into the processing pipeline.

Pattern #3 - Big Data for Combined Analytics - is more complex and extensive than the other two. It is about combining structured with unstructured (and possibly streaming) data by using a Hadoop cluster together with an Enterprise Data Warehouse. In the whitepaper the pattern is shown in two different diagrams, first a data flow diagram (see Figure 5.7a) and second a conceptual diagram (see Figure 5.7b). However, the latter is rather ambiguous, as the interplay between Data Warehouse, In-Database Analytics and Data Marts in the second box is not explicitly shown. Therefore, I will focus the mapping on the data flow diagram. Additionally, both diagrams of the whitepaper are to some extent inconsistent, as the ‘NoSQL’ component in the conceptual diagram does not show up in the data flow diagram and there is no explanation in the text. Therefore this requires some speculation.

Considering the data flow diagram, there are some components that can be mapped in a rather straightforward manner. First, there is a Hadoop cluster for sensor, weather and location data. This Hadoop cluster is fed with data from the ‘Big Data Sources’. The text identifies this data as sensor data that needs to be stored in flexible structures; therefore I assume it to be mainly semi-structured. Considering the reference architecture, HDFS on the Hadoop cluster can consequently be mapped to a Raw Data Archive with data extraction from Semi-Structured Data Sources. This resembles the HDFS component in the conceptual diagram.

Figure 5.7: Big Data Pattern #3 - Big Data for Combined Analytics. (a) Data Flow [217]; (b) Conceptual [217]

Additionally, data is extracted from Enterprise Applications into a Data Warehouse. Neither the text nor the data flow diagram gives any indication that the Enterprise Data Warehouse includes an explicit Staging Area; however, considering general best practices in data warehouse architecture and the need to integrate data from several sources, I assume this as a given. Both the text and the data flow diagram also mention that the Data Warehouse component includes Data Marts. The reference architecture is more fine-grained in this respect: the Data Warehouse component described in the pattern can be modelled with a Staging Area, an Enterprise Data Warehouse and Analysis Data Stores, with the latter resembling the Data Marts. The data is extracted and loaded into the Staging Area from transactional, master and reference data, that is, from Structured Data Sources. This corresponds to the Data Warehouse and Data Marts in the conceptual diagram of the pattern.

The data marts then provide the data to two kinds of analytical applications, the first for Traditional Reporting and Alerts and the second for Advanced Analytics. The second type can be directly mapped to Free Ad-hoc Analysis components. The first, however, includes a dashboard application and a BPEL engine, according to the data flow diagram, as well as a backflow to the Enterprise Applications. While the dashboard can again easily be mapped to a Reporting & Dashboarding component, the BPEL engine and the backflow are odd. The reason is that a BPEL engine indicates a reaction to data analysis within some business processes. Even if the Big Data Sources and the Enterprise Applications produce streaming data, there can hardly be an on-time response if the data first passes through the Hadoop cluster and/or the Enterprise Data Warehouse. The function of the BPEL engine is not described in the text, while it is mentioned that end results are integrated with the Enterprise Applications. From this point of view, the best guess is to model this with a Data Access API through which these applications access the data.
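As an illustration of such a pull-oriented Data Access API, the following is a minimal sketch; the endpoint name, the result format and the in-memory ‘store’ are assumptions made purely for illustration and are not part of the pattern.

    # Minimal sketch of a pull-oriented Data Access API that enterprise
    # applications could call to retrieve end results; endpoint, payload and
    # in-memory store are illustrative assumptions only.
    from flask import Flask, abort, jsonify

    app = Flask(__name__)

    # Stand-in for results materialized by the analytics pipeline.
    RESULT_STORE = {
        "sensor-1": {"mean_reading": 22.3, "max_reading": 23.0},
        "sensor-2": {"mean_reading": 19.0, "max_reading": 19.0},
    }

    @app.route("/results/<device_id>")
    def get_result(device_id):
        # Enterprise applications pull results on demand instead of being pushed to.
        result = RESULT_STORE.get(device_id)
        if result is None:
            abort(404)
        return jsonify(device=device_id, profile=result)

    if __name__ == "__main__":
        app.run(port=8080)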

Furthermore, the text states that MapReduce is used to identify patterns and trending insights in the sensor data loaded into the Hadoop cluster and that these are then integrated with the structured data in the Data Warehouse. The diagram shows that a component called Big Data Connector is used to transform the results, which are written back to the Hadoop cluster, into a structured form and to load them into the Data Warehouse. With this in mind, these MapReduce jobs (which are also shown in the conceptual diagram) can be mapped to Deep Analytics components, while the Big Data Connector resembles an Information Extraction component that imposes structure and loads the data into the Enterprise Data Warehouse.
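A minimal sketch of this hand-over from Hadoop output to the warehouse side is given below. SQLite merely stands in for the Enterprise Data Warehouse and the tab-separated result layout is an assumption, so this only illustrates the role of an Information Extraction component in general, not Oracle's Big Data Connector itself.

    # Illustrative stand-in for the Information Extraction / connector step:
    # parse tab-separated MapReduce output and load it into a relational table.
    # SQLite replaces the warehouse; the file layout is an assumption.
    import csv
    import sqlite3

    def load_profiles(result_file, db_path="warehouse.db"):
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS device_profile ("
            "device_id TEXT PRIMARY KEY, mean_reading REAL, max_reading REAL)"
        )
        with open(result_file, newline="") as f:
            rows = [(r[0], float(r[1]), float(r[2]))
                    for r in csv.reader(f, delimiter="\t") if len(r) == 3]
        conn.executemany(
            "INSERT OR REPLACE INTO device_profile VALUES (?, ?, ?)", rows)
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        load_profiles("part-00000.tsv")  # hypothetical MapReduce output file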

Figure 5.8: Mapping to the Reference Architecture: Big Data Pattern #3 - Combined Analytics

The remaining parts are less clear to map. As the Data Warehouse and the Data Marts are modelled as one component, it is not clear whether all the sub-components within Traditional Reporting and Alerts and within Advanced Analytics build on top of the data marts or whether some applications directly access the Data Warehouse. Considering typical data warehouse architecture, I assume the former. Additionally, it is not clear how exactly the Hadoop cluster interacts with the Advanced Analytics applications. The edge in the data flow diagram indicates that it does, while the conceptual diagram implies that it does not. I therefore assume that the Hadoop cluster includes an additional structural mapping layer on top of HDFS, e.g. Hive, which can then be accessed and queried by the Advanced Analytics applications. This can be modelled as a virtual Analysis Data Store.
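To make this virtual Analysis Data Store concrete, the sketch below defines an external Hive table over (assumed) tab-separated sensor data in HDFS and runs an aggregate query against it; the table name, HDFS path, column layout and the use of the Hive command line are illustrative assumptions rather than part of the pattern.

    # Sketch of a 'virtual' Analysis Data Store: Hive imposes a table schema on
    # files that remain in HDFS, so analytics tools can query them via SQL-like
    # HiveQL. Table name, HDFS path and columns are assumptions.
    import subprocess

    HIVEQL = """
    CREATE EXTERNAL TABLE IF NOT EXISTS sensor_events (
      device_id  STRING,
      event_time STRING,
      reading    DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/sensor_events';

    -- Advanced Analytics applications query the raw files as if they were a table.
    SELECT device_id, AVG(reading) FROM sensor_events GROUP BY device_id;
    """

    with open("analysis_store.hql", "w") as f:
        f.write(HIVEQL)

    # Requires a Hive installation pointing at the cluster; 'hive -f' runs a script.
    subprocess.run(["hive", "-f", "analysis_store.hql"])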

Another component shown in the conceptual diagram is a NoSQL database, which the text mentions is used to capture low-latency data with flexible structure. This database is not included at all in the data flow diagram. Considering the description, I assume it is used to acquire and store streaming data from the sensors mentioned in the Big Data Sources. I therefore model it as a Stream Staging Area, whose data is then integrated with the Hadoop cluster. The complete modelling of the pattern using the reference architecture is shown in Figure 5.8.
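Purely as an illustration of such a Stream Staging Area (the whitepaper does not name a product), the sketch below appends incoming sensor events to a key-value store, here Redis, keyed by day; a periodic job could later drain these keys into the Hadoop cluster. The key layout and event fields are assumptions.

    # Illustrative Stream Staging Area: low-latency sensor events are appended
    # to a key-value store (Redis, purely as an example) and later drained in
    # batches towards the Hadoop cluster. Key layout and fields are assumptions.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def stage_event(device_id, reading, day="2013-10-13"):
        event = {"device_id": device_id, "reading": reading}
        r.rpush("sensor-events:%s" % day, json.dumps(event))

    def drain_day(day="2013-10-13"):
        # A periodic job would write these events to HDFS and clear the key.
        key = "sensor-events:%s" % day
        return [json.loads(e) for e in r.lrange(key, 0, -1)]

    stage_event("sensor-1", 21.5)
    print(drain_day())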

Considering the mapping of all three patterns, I come to the conclusion that, in general, all of them can be described using the reference architecture. Put differently, all of the patterns' components are present in the reference architecture. Pattern #3 was more difficult to map, but mainly because it is described ambiguously and incompletely. There is, however, one shortcoming. The backflow into Enterprise Applications in this pattern is mapped using a Data Access API component. Assuming that the backflow consists largely of alerts, this is probably a push mechanism, while the Data Access API is defined as a pull mechanism through which applications access the different data stores in the ‘big data’ system. This observation is congruent with the mapping of Facebook's and LinkedIn's architectures, where the backflow of data was also an issue.

Additionally, the whitepaper lists several best practices. Most of them are, however, related to data governance or data management. The two architecture-related best practices are ‘Top Payoff is Aligning Unstructured with Structured Data’ and ‘Plan Your Sandbox For Performance’. Both of them are supported in the reference architecture. The first is represented by the different Data Source components and by the Information Extraction component, which extracts structure from semi-structured and unstructured data and uses this structure to integrate the data into the Enterprise Data Warehouse. The second is supported by the Sandbox component.


5.4 Verification Summary

In conclusion, the reference architecture was generally a good fit to describe the presented concrete architectures. This is a good indication of the correctness and utility of the reference architecture, of its compliance with concrete systems, and of the fact that it can be effectively adapted and instantiated into a concrete architecture. There are, however, still some adjustments to make based on the insights above. The biggest issue of the reference architecture is how the backflow or propagation of data into other systems is modelled. A merely pull-oriented Data Access API does not seem to be enough. Considering the information from the concrete architectures, this can best be fixed by adding an optional serving storage area in front of the Data Access API. Additionally, the definition of the Data Access API needs to be changed to include push-like data propagation, and a publish-subscribe messaging system needs to be included as an alternative to the Data Access API. Another necessary adjustment is to change the definition of Analysis Data Stores and possibly Analysis Delta Stores to allow them to be virtual, that is, to be structural mapping layers on top of the Enterprise Data Warehouse or the Raw Data Archive. This idea arose in particular from the usage of Hive as a query layer which imposes structure on top of HDFS. These changes are reflected in the implementation-oriented view of the reference architecture. An adjusted diagram can be found in Figure 5.9.
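As an illustration of the publish-subscribe alternative to the Data Access API, the sketch below uses Apache Kafka through the kafka-python client to push results to downstream subscribers; the topic name, broker address and payload are assumptions made only for illustration, not part of the adjusted reference architecture itself.

    # Illustrative publish-subscribe data propagation (Kafka via kafka-python):
    # the 'big data' system publishes results, downstream applications
    # subscribe, so data is pushed rather than pulled. Topic, broker address
    # and payload are assumptions.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    # Producer side: publish analysis results as they become available.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("analysis-results", {"device_id": "sensor-1", "alert": "overheating"})
    producer.flush()

    # Consumer side: an operational application subscribes to the result stream.
    consumer = KafkaConsumer(
        "analysis-results",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.value)  # react to the pushed result, e.g. raise an alert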

Figure 5.9: Adjusted Reference Architecture: Detailed implementation-oriented view

6 Conclusion

6.1 Conclusion and Outlook

In this master's thesis I provided an overview of the ‘big data’ space. This is a rather broad and complex environment, considering the five characteristics of data volume, velocity, variety, veracity and value. I started by researching common requirements from literature, classified them based on these five characteristics and identified relationships between those requirements. Based on this specification I developed a reference architecture, including a functional and a more implementation-oriented view, and discussed different technologies and how they can be applied to components within the reference architecture. Finally, the verification gave a good indication that the reference architecture is a proper fit to the space, relevant, and useful for designing systems within it. One observation from the reference architecture is that traditional data warehousing systems build the core and still play a large role. The architecture is therefore more evolutionary than revolutionary, adding components and technology to the core of enterprise data warehousing to tackle the new challenges resulting from the newly emerging data characteristics mentioned above. This involves, for example, using heavily distributed and scalable software to tackle data volume, integrating new data sources and using a raw data archive for storing semi-structured and unstructured data to tackle data variety, and adding a ‘data highway’ to tackle data velocity. This observation is supported by a study published by the Data Warehousing Institute in 2011 [194]. The study predicts growth in advanced analytics and data mining, predictive analytics, text analytics and real-time reports, all use cases associated with ‘big data’. It also shows a growing interest in related technologies discussed in this work, e.g. Hadoop and MapReduce, in-database analytics and in-memory databases. Besides that, the study still shows a strong industry commitment to a central enterprise data warehouse and to analytics processed within this enterprise data warehouse. To put this into perspective, I want to note the possible bias of the Data Warehousing Institute due to its involvement in traditional data warehousing. For organizations this is good news. They can build on their existing investments into an analytical infrastructure based on an enterprise data warehouse and design a roadmap to extend it step by step to tackle their specific ‘big data’ related requirements and their particular situation. This is especially important considering the high amount of movement in this space. This includes the industrial side, where a plethora of start-ups is emerging around ‘big data’ technologies, developing new kinds of systems, and where major software companies are expanding their business into this direction. It also includes academia, where researchers study, for example, characteristics of Hadoop and propose extensions, but also similar frameworks designed from scratch to tackle some of Hadoop's shortcomings.


I also believe that the reference architecture proposed here can help to discuss existing and newly emerging technologies in the space, to reason about them, categorize them, map them to requirements and functional components, and guide their application. Therefore, and based on the verification, I consider this work successful within its scope and under the limitations discussed in the following section.

6.2 Limitations and Future Work

While the verification showed the proposed reference architecture to be relevant and useful, there are still a number of limitations that should be noted. The biggest issue, which runs through the whole work, is the absence of direct contact with practice. This shows in several points. The requirements had to be taken from literature without concrete stakeholders involved. It can therefore be questioned whether the specification completely captures all relevant requirements or whether certain points are overemphasized. Additionally, it was not possible to double-check with stakeholders during the design phase to make slight adjustments based on practical experience. As the reference architecture should be based on best practices and proven concepts from practice, this is definitely an issue that cannot be completely compensated for by taking those best practices from literature. Finally, this did not allow for a more rigorous, scenario- and interview-based verification of the reference architecture to check for relevance, or for applying the reference architecture in a concrete project situation.

Possible follow-up work should therefore aim at reviewing the different parts, the requirements specification, the reference architecture itself and the verification, with practitioners who can contribute deep implementation experience. The verification in particular should be extended to include this experience. Possibilities are reviews of the document and the description of the reference architecture by practitioners, and interviews to focus on certain points of the reference architecture and to gather scenarios to which the reference architecture can be applied to test its fit. Application within a concrete project situation or a reference application would also be a good measure to check suitability and applicability.

Other obvious limitations lie in the scope of this work and what could be covered within it. First, the technology description is rather coarse-grained and emphasizes general concepts (e.g. column-oriented storage for databases) and product families (e.g. Key-Value Stores) rather than concrete products. With this I give a starting point, some general advantages and disadvantages that can be expected of products implementing certain concepts and belonging to a certain product family, and some indication of how those fit into the reference architecture. However, different products can combine those concepts in different ways and can have diverse characteristics (e.g. regarding scalability and consistency) even within a product family, based on how they are implemented (e.g. using different distribution models).

A very useful anchor point for further work would therefore be to deepen the discussion into a more fine-grained product dimension. This would involve the development of a fine-grained criteria catalogue and placing different products within this criteria space, considering the concepts they implement and the characteristics they offer. It would also involve benchmarking these products. That could mean developing a general benchmark and running experiments over it with comparable hardware configurations. While there is literature that benchmarks different products, e.g. by Rabl et al. [181], it is largely scattered and lacks a standard benchmark. TPC-H, for example, is not a candidate, as it immediately excludes all systems that are not ACID-compliant and therefore a large share of the data management systems discussed in this work. Benchmarks in the literature are typically focused on a small number of systems and difficult to compare, as they use different benchmarking methods (e.g. different workloads and different analytical algorithms) and as they are run with different hardware configurations.
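A minimal sketch of what a uniform benchmark harness could look like is given below; the workload, the ‘systems under test’ and the repetition count are placeholders, the point being only that the same workload and measurement procedure are applied to every system so that results remain comparable.

    # Minimal sketch of a uniform benchmark harness: every system under test
    # runs the same workload, measured the same way, so results stay
    # comparable. Workload functions and repetition count are placeholders.
    import statistics
    import time

    def run_benchmark(systems, workload, repetitions=5):
        results = {}
        for name, execute in systems.items():
            timings = []
            for _ in range(repetitions):
                start = time.perf_counter()
                execute(workload)                 # same workload for every system
                timings.append(time.perf_counter() - start)
            results[name] = (statistics.mean(timings), statistics.stdev(timings))
        return results

    if __name__ == "__main__":
        # Placeholder 'systems': in a real study these would submit the workload
        # to e.g. a key-value store, a column store or a Hadoop cluster.
        systems = {
            "system-a": lambda workload: sum(workload),
            "system-b": lambda workload: sorted(workload),
        }
        for name, (mean, stdev) in run_benchmark(systems, list(range(100000))).items():
            print("%s: %.4fs +/- %.4fs" % (name, mean, stdev))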


Second, I focus on infrastructure software. There are, however, several other aspects of a ‘big data’ system. One would be to discuss which components of the reference architecture are best suited for certain analytical methods and algorithms. MapReduce, for example, is good for algorithms that are ‘trivially parallelizable’; Granville [124] provides some examples of cases where it is not. There is research available that discusses which analytical algorithms can be implemented using MapReduce, how they are best implemented and for which algorithms MapReduce is not optimal [84, 154]. There is also a large body of literature about parallel algorithms and statistical models over massive amounts of data in general. A recently published joint report by several researchers and academic committees offers a good overview of this research direction [88]. It would be valuable to put this research into perspective and create an overview. It would also be valuable to consolidate methods and algorithms for implementing some of the reference architecture's components, e.g. for data integration and information extraction, and to create an overview of their advantages and disadvantages. These methods can also have an influence on the architecture. Manually creating data integration mappings, for example, is a bottleneck for data source adoption; using methods such as bootstrapping and ontology-based mappings can therefore improve the evolvability of the architecture. Additionally, I leave out considerations about hardware, deployment and hardware-related software. Especially as most of the infrastructure software I discuss runs on clusters and aims at parallel computation, hardware architecture considerations are an important piece of the puzzle. This can include a discussion of cloud-based deployment and for which components of the architecture it is suitable. It also includes a discussion of more low-level software components, e.g. for cluster management and monitoring.

Bibliography

[1] The Friend of a Friend (FOAF) project. Website, FOAF project. URL http://www. foaf-project.org/. Accessed: 05-05-2013. [2] Linked Data - Connect Distributed Data across the Web. Website, Linked Open Data Project. URL http://linkeddata.org/. Accessed: 02-05-2013. [3] W3C Semantic Web Activity. Website, W3C. URL http://www.w3.org/2001/sw/.Accessed: 05-05-2013. [4] IEEE Recommended Practice for Architectural Description of Software-Intensive Systems. Standard Specification IEEE 1471-2000, IEEE Computer Society, 2000u. [5] RDF Primer. W3c recommendation, W3C, February 2004. URL http://www.w3.org/TR/ rdf-primer/. [6] SPARQL Query Language for RDF. W3c recommendation, W3C, January 2008. URL http: //www.w3.org/TR/rdf-sparql-query/. [7] Systems and software engineering — Architecture description. Standard Specification ISO/IEC/IEEE 42010:2011, ISO/IEC/IEEE, 2011. URL http://www.iso-architecture. org/ieee-1471/. [8] What is Apache Mahout? Online documentation, Apache Software Foundation, 2012. URL http://mahout.apache.org/. Accessed: 17-09-2013. [9] Apache Cassandra 1.2 Documentation. Online documentation, Datastax, 2012. URL http: //www.datastax.com/docs/1.2/index. Accessed: 27-03-2013. [10] Global Technology Outlook 2012. Whitepaper, IBM Research, 2012. URL http://www.zurich. ibm.com/pdf/isl/infoportal/GTO_2012_Booklet.pdf. [11] Big Data Now: 2012 Edition. Techreport, O’Reilly Media, Inc., 2012. [12] Hadoop 1.2.1 Documentation. Online documentation, Apache Software Foundation, 2013. URL http://hadoop.apache.org/docs/r1.2.1/index.html. Accessed: 27-03-2013. [13] Hive Documentation. Online documentation, Apache Software Foundation, 2013. URL https: //cwiki.apache.org/confluence/display/Hive/Home. Accessed: 27-03-2013.


[14] The Apache HBase Reference Guide. Online documentation, Apache Software Foundation, 2013. URL http://hbase.apache.org/book/book.html. Accessed: 27-03-2013. [15] Welcome to Apache™ Hadoop®! Online documentation, Apache Software Foundation, 2013. URL http://hadoop.apache.org/. Accessed: 27-03-2013. [16] Apache Hadoop 2.1.0-beta Documentation. Online documentation, Apache Software Foundation, 2013. URL http://hadoop.apache.org/docs/r1.2.1/index.html. Accessed: 17-09-2013. [17] Welcome to HCatalog! Online documentation, Apache Software Foundation, 2013. URL http://hive.apache.org/hcatalog/index.html. Accessed: 17-09-2013. [18] Drill Overview. Online documentation, Apache Software Foundation, 2013. URL http://incubator.apache.org/drill/drill_overview.html. Accessed: 17-09-2013. [19] Workflow Scheduler for Hadoop. Online documentation, Apache Software Foundation, 2013. URL http://incubator.apache.org/drill/drill_overview.html. Accessed: 17-09-2013. [20] Welcome to Apache Flume. Online documentation, Apache Software Foundation, 2013. URL http://flume.apache.org/. Accessed: 27-03-2013. [21] Welcome to Chukwa! Online documentation, Apache Software Foundation, 2013. URL http://incubator.apache.org/chukwa/. Accessed: 27-03-2013. [22] Apache Kafka - A high-throughput distributed messaging system. Online documentation, Apache Software Foundation, 2013. URL http://kafka.apache.org/. Accessed: 27-03-2013. [23] What the Enterprise Requires: The Platform for Big Data. Company website, Cloudera, Inc., 2013. URL http://www.cloudera.com/content/cloudera/en/products.html. Accessed: 05-04-2013.

[24] Google Flutrends. Website, Google, 2013. URL http://www.google.org/flutrends/.Ac- cessed: 05-04-2013. [25] HP Reference Architecture for MapR M5. Whitepaper, Hewlett-Packard, January 2013.

[26] Big Data Solutions. Company website, Hewlett-Packard, 2013. URL http://www8.hp.com/ us/en/business-solutions/big-data.html. Accessed: 05-04-2013. [27] InfoSphere Platform: Big Data Analytics. Company website, IBM, 2013. URL http://www-01. ibm.com/software/data/infosphere/bigdata-analytics.html. Accessed: 05-04-2013. [28] The Big Data Hub. Company website, IBM, 2013. URL http://www.ibmbigdatahub.com/. Accessed: 05-04-2013.

[29] IBM Stream Computing. Company website, IBM, 2013. URL http://www-01.ibm.com/ software/data/infosphere/stream-computing/. Accessed: 05-04-2013. [30] What is big data? Company website, IBM, 2013. URL http://www-01.ibm.com/software/ data/bigdata/. Accessed: 05-04-2013. [31] Big Data. Company website, Microsoft, 2013. URL http://www.microsoft.com/enterprise/ it-trends/big-data/. Accessed: 05-04-2013. [32] Oracle and Big Data: Big Data for the Enterprise. Company website, Oracle, 2013. URL http://www.oracle.com/us/technologies/big-data/index.html. Accessed: 05-04-2013.


[33] Put big data to work for your business – with SAP solutions and technology. Company website, SAP, 2013. URL http://www54.sap.com/solutions/big-data/software/overview.html. Accessed: 05-04-2013. [34] SAP HANA integrates predictive analytics, text and big data in a single pack- age. Company website, SAP, 2013. URL http://www54.sap.com/solutions/tech/ in-memory-computing-hana/software/analytics/big-data.html. Accessed: 05-04-2013. [35] Big Data - What Is It? Company website, SAS, 2013. URL http://www.sas.com/big-data/. Accessed: 05-04-2013. [36] Daniel J. Abadi. Tradeos between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis. In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management, SSDBM ’10, pages 1–3, Berlin, Heidelberg, 2010. Springer-Verlag. ISBN 3-642-13817-9, 978-3-642-13817-1. URL http://dl.acm.org/ citation.cfm?id=1876037.1876039. [37] Daniel J. Abadi. Consistency Tradeos in Modern Distributed Database System Design: CAP is Only Part of the Story. Computer, 45(2):37–42, February 2012. doi: 10.1109/MC.2012.33. [38] Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-Stores vs. Row-Stores: How Dierent Are They Really? In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 967–980, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-102-6. doi: 10.1145/1376616.1376712. [39] Daniel J. Abadi, Peter A. Boncz, and Stavros Harizopoulos. Column-Oriented Database Systems. In Proceedings of the VLDB Endowment, volume 2 of PVLDB, pages 1664–1665. VLDB Endowment, August 2009. URL http://dl.acm.org/citation.cfm?id=1687553.1687625. [40] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, and Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. In Proceedings of the VLDB Endowment, volume 2 of PVLDB, pages 922– 933. VLDB Endowment, August 2009. URL http://dl.acm.org/citation.cfm?id=1687627. 1687731. [41] Charu C. Aggarwal, editor. Data Streams: Models and Algorithms, volume 31 of Advances in Database Systems. Springer US, 2007. doi: 10.1007/978-0-387-47534-9. [42] Charu C. Aggarwal. An Introduction to Data Streams. In Charu C. Aggarwal, editor, Data Streams: Models and Algorithms, volume 31 of Advances in Database Systems, pages 1–8. Springer US, 2007. ISBN 978-0-387-28759-1. doi: 10.1007/978-0-387-47534-9_1. [43] Charu C. Aggarwal and Philip S. Yu. A Survey of Synopsis Construction in Data Streams. In Charu C. Aggarwal, editor, Data Streams: Models and Algorithms, volume 31 of Advances in Database Systems, pages 169–207. Springer US, 2007. ISBN 978-0-387-28759-1. doi: 10.1007/ 978-0-387-47534-9_9. [44] Vijay Srinivas Agneeswaran. Big-Data – Theoretical, Engineering and Analytics Perspective. In Big Data Analytics, volume 7678 of Lecture Notes in Computer Science, pages 8–15. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-35541-7. doi: 10.1007/978-3-642-35542-4_2. [45] Divyakant Agrawal, Philip Bernstein, Elisa Bertino, Susan Davidson, Umeshwar Dayal, Michael Franklin, Johannes Gehrke, Laura Haas, Alon Halevy, Jiawei Han, H. V. Jagadish, Alexandros Labrinidis, Sam Madden, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, Kenneth Ross, Cyrus Shahabi, Dan Suciu, Shiv Vaithyanathan, and Jennifer Widom. Challenges


and Opportunities with Big Data: A community white paper developed by leading researchers across the United States. Whitepaper, Computing Community Consortium, March 2012. URL http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf. [46] Rakesh Agrawal, Anastasia Ailamaki, Philip A. Bernstein, Eric A. Brewer, Michael J. Carey, Surajit Chaudhuri, Anhai Doan, Daniela Florescu, Michael J. Franklin, Hector Garcia-Molina, Johannes Gehrke, Le Gruenwald, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, Hank F. Korth, Donald Kossmann, Samuel Madden, Roger Magoulas, Beng Chin Ooi, Tim O’Reilly, Raghu Ramakrishnan, Sunita Sarawagi, Michael Stonebraker, Alexander S. Szalay, and Gerhard Weikum. The Claremont Report on Database Research. Communications of the ACM, 52(6):56–65, June 2009. ISSN 0001-0782. doi: 10.1145/1516046. 1516062. [47] Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Nicola Onose, Pouria Pirzadeh, Rares Vernica, and Jian Wen. ASTERIX: An Open Source System for “Big Data” Management and Analysis (Demo). In Proceedings of the VLDB Endowment, volume 5 of PVLDB, pages 1898–1901. VLDB Endowment, August 2012. URL http://dl.acm.org/citation.cfm?id=2367502.2367532. [48] Amineh Amini, Hadi Saboohi, and Nasser B. Nemat. A RDF-based Data Integration Framework. The Computing Research Repository, abs/1211.6273, 2012. [49] Chris Anderson. The End of Theory: The Data Deluge Makes the Scientific Method Obso- lete. Wired Magazine, 16.07, July 2008. URL http://www.wired.com/science/discoveries/ magazine/16-07/pb_theory. [50] Samuil Angelov, Jos J.M. Trienekens, and Paul Grefen. Towards a Method for the Evaluation of Reference Architectures: Experiences from a Case. In Ron Morrison, Dharini Balasubramaniam, and Katrina Falkner, editors, Software Architecture, volume 5292 of Lecture Notes in Computer Science, pages 225–240. Springer Berlin Heidelberg, 2008. ISBN 978-3-540-88029-5. doi: 10.1007/978-3-540-88030-1_17. [51] Samuil Angelov, Paul Grefen, and Da Greefhorst. A Classification of Software Reference Archi- tectures: Analyzing Their Success and Eectiveness. In Joint Working IEEE/IFIP Conference on Software Architecture and European Conference on Software Architecture, WICSA/ECSA ’09, pages 141–150, 2009. doi: 10.1109/WICSA.2009.5290800. [52] Samuil Angelov, Paul Grefen, and Danny Greefhorst. A framework for analysis and design of software reference architectures. Information and Software Technology, 54(4):417–431, April 2012. doi: 10.1016/j.infsof.2011.11.009. URL http://www.sciencedirect.com/science/article/ pii/S0950584911002333. [53] Thilini Ariyachandra and Hugh Watson. Key organizational factors in data warehouse ar- chitecture selection. Decision Support Systems, 49(2):200–212, 2010. ISSN 0167-9236. doi: http://dx.doi.org/10.1016/j.dss.2010.02.006. URL http://www.sciencedirect.com/science/ article/pii/S0167923610000436. [54] Muhammad A. Babar and Ian Gorton. Comparison of Scenario-Based Software Architecture Evaluation Methods. In Proceedings of the 11th Asia-Pacific Software Engineering Conference, APSEC ’04, pages 600–607, 2004. doi: 10.1109/APSEC.2004.38. [55] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and Issues in Data Stream Systems. In Proceedings of the Twenty-First ACM SIGMOD-


SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’02, pages 1–16, New York, NY, USA, 2002. ACM. ISBN 1-58113-507-6. doi: 10.1145/543613.543615. [56] Kapil Bakshi. Considerations for Big Data: Architecture and Approach. In Proceedings of the IEEE Aerospace Conference. IEEE, March 2012. doi: 10.1109/AERO.2012.6187357. [57] Wolf-Tilo Balke. Introduction to Information Extraction: Basic Notions and Current Trends. Datenbank-Spektrum, 12(2):81–88, 2012. ISSN 1618-2162. doi: 10.1007/s13222-012-0090-x. [58] Dominic Barton and David Court. Making Advanced Analytics Work for You. Harvard Business Review, October 2012:78–84, October 2012. [59] Len Bass, Paul Clements, and Rick Kazman. Software Architecture in Practice. SEI Series in Software Engineering. Addison-Wesley, 2nd edition, 2003. [60] Len Bass, Paul Clements, and Rick Kazman. Software Architecture in Practice. SEI Series in Software Engineering. Addison-Wesley, 3rd edition, 2012. [61] Andreas Bauer and Holger Günzel, editors. Data Warehouse Systeme: Architektur, Entwicklung, Anwendung. dpunkt.verlag GmbH, 4th edition, 2013. [62] Edmon Begoli. A Short Survey on the State of the Art in Architectures and Platforms for Large Scale Data Analysis and Knowledge Discovery from Data. In Joint Working IEEE/IFIP Conference on Software Architecture and European Conference on Software Architecture, WICSA/ECSA ’12, pages 177–183, New York, NY, USA, 2012. ISBN 978-1-4503-1568-5. doi: 10.1145/2361999.2362039. [63] Edmon Begoli and James Horey. Design Principles for Effective Knowledge Discovery from Big Data. In Joint Working IEEE/IFIP Conference on Software Architecture and European Conference on Software Architecture, WICSA/ECSA ’12, pages 215–218, New York, NY, USA, 2012. doi: 10.1109/WICSA-ECSA.212.32. [64] Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Raman Grover, Chen Li, Nicola Onose, Rares Vernica, Alin Deutsch, Yannis Papakonstantinou, and Vassilis J. Tsotras. ASTERIX: towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29:185–216, 2011. ISSN 0926-8782. doi: 10.1007/s10619-011-7082-y. [65] Kevin S. Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed Y. Eltabakh, Carl-Christian Kanne, Fatma Özcan, and Eugene J. Shekita. Jaql: A Scripting Language for Large Scale Semistructured Data Analysis. In Proceedings of the VLDB Endowment, volume 4 of PVLDB, pages 1272–1283, 2011. [66] Kenneth P. Birman, Daniel A. Freedman, Qi Huang, and Patrick Dowell. Overcoming CAP with Consistent Soft-State Replication. Computer, 45(2):50–58, February 2012. [67] Christian Bizer, Peter Boncz, Michael L. Brodie, and Orri Erling. The Meaningful Use of Big Data: Four Perspectives – Four Challenges. SIGMOD Record, 40(4):56–60, January 2011. doi: 10.1145/2094114.2094129. [68] Jens Bleiholder and Felix Naumann. Data Fusion. ACM Computing Surveys, 41(1):1:1–1:41, January 2009. ISSN 0360-0300. doi: 10.1145/1456650.1456651. [69] Dario Bonino and Luigi De Russis. Mastering Real-Time Big Data With Stream Processing Chains. XRDS, 19(1):83–86, September 2012. ISSN 1528-4972. doi: 10.1145/2331042.2331050. [70] L. Bonnet, A. Laurent, M. Sala, B. Laurent, and N. Sicard. Reduce, You Say: What NoSQL Can Do for Data Aggregation and BI in Large Repositories. In Proceedings of the 22nd International


Workshop on Database and Expert Systems Applications, DEXA ’11, pages 483–488, 29 2011-sept. 2 2011. doi: 10.1109/DEXA.2011.71. [71] Vinayak Borkar, Michael J. Carey, and Chen Li. Inside "Big Data management": Ogres, Onions, or Parfaits? In Proceedings of the 15th International Conference on Extending Database Technology, EDBT ’12, pages 3–14, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-0790-1. doi: 10.1145/2247596.2247598. [72] Vinayak R. Borkar, Michael J. Carey, and Chen Li. Big Data Platforms: What’s Next? XRDS, 19(1):44–49, September 2012. doi: 10.1145/2331042.2331057. [73] Danah Boyd and Kate Crawford. Six Provocations for Big Data. In A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, September 2011. doi: 10.2139/ssrn. 1926431. [74] Mary Breslin. Data Warehousing Battle of the Giants: Comparing the Basics of the Kimball and Inmon Models. Business Intelligence Journal, 9(1):6–20, 2004. [75] E. Brewer. CAP Twelve Years Later: How the “Rules” Have Changed. Computer, 45(2): 23–29, February 2012. doi: 10.1109/MC.2012.37. URL http://ieeexplore.ieee.org/ xpl/articleDetails.jsp?reload=true&arnumber=6133253&contentType=Journals+%26+ Magazines. [76] Eric A. Brewer. Towards Robust Distributed Systems. In Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing, PODC ’00, New York, NY, USA, 2000. ACM. ISBN 1-58113-183-6. doi: 10.1145/343477.343502. URL http://www.cs.berkeley. edu/~brewer/cs262b-2004/PODC-keynote.pdf. [77] Doug Cackett. Information Management and Big Data: A Reference Architecture. Whitepaper, Oracle, October 2012. [78] Rick Cattell. Scalable SQL and NoSQL Data Stores. SIGMOD Record, 39(4):12–27, May 2011. ISSN 0163-5808. doi: 10.1145/1978915.1978919. [79] Amit Chakrabarti. CS85: Data Stream Algorithms - Lecture Notes. Dartmouth College, Decem- ber 2011. URL http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/Notes/lecnotes. pdf. [80] Fay Chang, Jerey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems (TOCS), 26(2): 4:1–4:26, June 2008. ISSN 0734-2071. doi: 10.1145/1365815.1365816. [81] Surajit Chaudhuri. What Next? A Half-Dozen Data Management Research Goals for Big Data and the Cloud. In Proceedings of the 31st Symposium on Principles of Database Systems, PODS ’12, New York, NY, USA, 2012. ACM. doi: 10.1145/2213556.2213558. URL http: //doi.acm.org/10.1145/2213556.2213558. [82] Surajit Chaudhuri and Umeshwar Dayal. An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, 26(1):65–74, March 1997. ISSN 0163-5808. doi: 10.1145/248603. 248616. [83] Jinchuan Chen, Yueguo Chen, Xiaoyong Du, Cuiping Li, Jiaheng Lu, Suyun Zhao, and Xuan Zhou. Big data challenge: a data management perspective. Frontiers of Computer Science,7 (2):157–164, April 2013. doi: 10.1007/s11704-013-3903-7.


[84] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-Reduce for Machine Learning on Multicore. In Bernhard Schölkopf, John C. Platt, and Thomas Hoffman, editors, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, NIPS ’06, pages 281–288. MIT Press, 2006.

[85] Robert Cloutier, Gerrit Muller, Dinesh Verma, Roshanak Nilchiani, Eirik Hole, and Mary Bone. The Concept of Reference Architectures. Systems Engineering, 13(1):14–27, 2010. doi: 10.1002/sys.20129. URL http://dx.doi.org/10.1002/sys.20129.

[86] E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. E. F. Codd and Associates, 1993.

[87] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. MAD Skills: New Analysis Practices for Big Data. In Proceedings of the VLDB Endowment, volume 2 of PVLDB, pages 1481–1492. VLDB Endowment, August 2009. URL http://dl.acm.org/citation.cfm?id=1687553.1687576.

[88] Committee on the Analysis of Massive Data, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences, and National Research Council. Frontiers in Massive Data Analysis. The National Academies Press, 2013. ISBN 9780309287784. URL http://www.nap.edu/openbook.php? record_id=18374.

[89] Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. Ricardo: Integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages 987–998, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0032-2. doi: 10.1145/1807167.1807275.

[90] Anish Das Sarma, Xin Dong, and Alon Halevy. Bootstrapping Pay-As-You-Go Data Integration Systems. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 861–874, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-102-6. doi: 10.1145/1376616.1376702.

[91] Thomas H. Davenport and D.J. Patil. Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, October 2012:70–76, October 2012.

[92] Thomas H. Davenport, Paul Barth, and Randy Bean. How ‘Big Data’ Is Different. MIT Sloan Management Review, Fall 2012, July 2012. URL http://sloanreview.mit.edu/article/how-big-data-is-different/.

[93] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the Sixth Symposium on Operating Systems Design and Implementation, volume 6 of OSDI ’04, Berkeley, CA, USA, 2004. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1251254.1251264.

[94] Jeffrey Dean and Sanjay Ghemawat. MapReduce: A Flexible Data Processing Tool. Communications of the ACM, 53(1):72–77, January 2010. ISSN 0001-0782. doi: 10.1145/1629175.1629198.

[95] Tom Deutsch. Experimentation as a Corporate Strategy for Big Data. Blog Entry, October 2012. URL http://ibmdatamag.com/2012/10/ experimentation-as-a-corporate-strategy-for-big-data/. Accessed: 17-09-2013.

[96] Barry Devlin. The Big Data Zoo - Taming the Beasts. Technical report, 9sight Consulting, October 2012.


[97] Dr. Barry Devlin, Shawn Rogers, and John Myers. Big Data Comes of Age. Research report, EMA Inc. and 9sight Consulting, November 2012.

[98] David DeWitt. MapReduce: A major step backwards. Blog Entry, January 2008. URL http: //homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html.Ac- cessed: 13-08-2013. [99] David DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, 35(6):85–98, June 1992. ISSN 0001-0782. doi: 10.1145/129888.129894. [100] Maria M. Dias, Tania C. Tait, André Luís A. Menolli, and Roberto C.S. Pacheco. Data Warehouse Architecture through Viewpoint of Information System Architecture. In Proceedings of the 2008 International Conference on Computational Intelligence for Modelling Control Automation, CIMCA ’08, pages 7–12, 2008. doi: 10.1109/CIMCA.2008.129. [101] Jean-Pierre Dijcks. Oracle: Big Data for the Enterprise. June, Oracle, 2013. [102] Thomas W. Dinsmore. Analytic Applications (Part One). Blog Entry, January 2013. URL http://portfortune.wordpress.com/2013/01/04/analytic-applications-part-one/. Accessed: 17-09-2013. [103] Thomas W. Dinsmore. Analytic Applications (Part Two): Managerial Analytics. Blog Entry, January 2013. URL http://portfortune.wordpress.com/2013/01/10/ analytic-applications-part-two-managerial-analytics/. Accessed: 17-09-2013. [104] AnHai Doan, Jerey F. Naughton, Raghu Ramakrishnan, Akanksha Baid, Xiaoyong Chai, Fei Chen, Ting Chen, Eric Chu, Pedro DeRose, Byron Gao, Chaitanya Gokhale, Jiansheng Huang, Warren Shen, and Ba-Quy Vuong. Information Extraction Challenges in Managing Unstructured Data. SIGMOD Record, 37(4):14–20, March 2008. ISSN 0163-5808. doi: 10.1145/ 1519103.1519106. URL http://doi.acm.org/10.1145/1519103.1519106. [105] L. Dobrica and E. Niemela. A Survey on Software Architecture Analysis Methods. IEEE Transactions on Software Engineering, 28(7):638–653, July 2002. ISSN 0098-5589. doi: 10.1109/ TSE.2002.1019479. [106] X.L. Dong and D. Srivastava. Big Data Integration. In Proceedings of the 29th IEEE International Conference on Data Engineering, ICDE ’13, pages 1245–1248, 2013. doi: 10.1109/ICDE.2013. 6544914. [107] Andrea Freyer Dugas, Yu-Hsiang Hsieh, Scott R. Levin, Jesse M. Pines, Darren P. Mareiniss, Amir Mohareb, Charlotte A. Gaydos, Trish M. Perl, and Richard E. Rothman. Google Flu Trends: Correlation With Emergency Department Influenza Rates and Crowding Metrics. Clinical Infectious Diseases, 54(4):463–469, January 2012. doi: 10.1093/cid/cir883. [108] Edd Dumbill. Planning for Big Data: A CIO’s Handbook to the Changing Data Landscape. Technical report, O’Reilly Media, Inc., 2012. [109] George Dyson, Kevin Kelly, Stewart Brand, W. Daniel Hillis, Sean Carroll, Jaron Lanier, Joseph Traub, John Horgan, Bruce Sterling, Douglas Rushko, Oliver Morton, Daniel Everett, Gloria Origgi, Lee Smolin, and Joel Garreau. On Chris Anderson’s ‘The End of Theory’. Edge.org Comments, June 2008. URL http://www.edge.org/discourse/the_end_of_theory.html. Accessed: 12-04-2013. [110] Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. Reasoning about Record Matching Rules. In


Proceedings of the VLDB Endowment, volume 2 of PVLDB, pages 407–418. VLDB Endowment, August 2009. URL http://dl.acm.org/citation.cfm?id=1687627.1687674.

[111] Danyel Fisher, Rob DeLine, Mary Czerwinski, and Steven Drucker. Interactions with Big Data Analytics. interactions, 19(3):50–59, May 2012. ISSN 1072-5520. doi: 10.1145/2168931.2168943.

[112] Avrilia Floratou, Nikhil Teletia, David J. DeWitt, Jignesh M. Patel, and Donghui Zhang. Can the Elephants Handle the NoSQL Onslaught? In Proceedings of the VLDB Endowment, volume 5 of PVLDB, pages 1712–1723. VLDB Endowment, August 2012. URL http://dl.acm.org/ citation.cfm?id=2367502.2367511.

[113] Martin Fowler and Pramod J. Sadalage. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley, 2012.

[114] Matthias Galster and Paris Avgeriou. Empirically-grounded Reference Architectures: A Proposal. In Proceedings of the Joint ACM SIGSOFT Conference on Quality of Software Architectures and ACM SIGSOFT Symposium on Architecting Critical Systems, QoSA-ISARCS ’11, pages 153–158. ACM, 2011. doi: 10.1145/2000259.2000285.

[115] John Gantz and David Reinsel. THE DIGITAL UNIVERSE IN 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. Study report, IDC, December 2012. URL www.emc.com/leadership/digital-universe/index.htm.

[116] David Garlan and Dewayne E. Perry. Introduction to the Special Issue on Software Architecture. IEEE Transactions on Software Engineering, 21(4):269–274, April 1995. ISSN 0098-5589. URL http://dl.acm.org/citation.cfm?id=205313.205314.

[117] Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayana- murthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. Building a High-Level Dataflow System on top of MapReduce: The Pig Experience. In Pro- ceedings of the VLDB Endowment, volume 2 of PVLDB, pages 1414–1425. VLDB Endowment, August 2009. URL http://dl.acm.org/citation.cfm?id=1687553.1687568.

[118] Anne E. Gattiker, Fade H. Gebara, Ahmed Gheith, H. Peter Hofstee, Damir A. Jamsek, Jian Li, Evan Speight, Ju Wei Shi, Guan Cheng Chen, and Peter W. Wong. Understanding System and Architecture for Big Data. Research Report RC25281 (AUS1204-004), IBM Research Division, April 2012.

[119] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 29–43, New York, NY, USA, October 2003. ACM. ISBN 1-58113-757-5. doi: 10.1145/945445.945450.

[120] Seth Gilbert and Nancy Lynch. Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. SIGACT News, 33(2):51–59, June 2002. ISSN 0163-5700. doi: 10.1145/564585.564601.

[121] Seth Gilbert and Nancy A. Lynch. Perspectives on the CAP Theorem. Computer, 45(2):30–36, February 2012. doi: 10.1109/MC.2011.389.

[122] Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and Larry Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457:1012–1014, February 2009. URL http://www.nature.com/nature/journal/v457/n7232/ full/nature07634.html.


[123] Ian Gorton. Essential Software Architecture. Springer Heidelberg Dordrecht London New York, 2011.

[124] Vincent Granville. What MapReduce can’t do. Blog Entry, January 2013. URL http: //www.analyticbridge.com/profiles/blogs/what-mapreduce-can-t-do. Accessed: 17-09- 2013.

[125] Paul Grefen, Nikolay Mehandjiev, Giorgos Kouvas, Georg Weichhart, and Rik Eshuis. Dynamic business network process management in instant virtual enterprises. Computers in Industry, 60 (2):86–103, February 2009. ISSN 0166-3615. doi: http://dx.doi.org/10.1016/j.compind.2008.06. 006. URL http://www.sciencedirect.com/science/article/pii/S0166361508000675.

[126] Rajeev Gupta, Himanshu Gupta, and Mukesh Mohania. Cloud Computing and Big Data Ana- lytics: What Is New from Databases Perspective? In Srinath Srinivasa and Vasudha Bhatnagar, editors, Big Data Analytics, volume 7678 of Lecture Notes in Computer Science, pages 42–61. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-35541-7. doi: 10.1007/978-3-642-35542-4_5.

[127] A. Halevy, P. Norvig, and F. Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 24(2):8–12, March 2009. ISSN 1541-1672. doi: 10.1109/MIS.2009.36.

[128] Alon Halevy, Anand Rajaraman, and Joann Ordille. Data Integration: The Teenage Years. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06, pages 9–16. VLDB Endowment, September 2006. URL http://dl.acm.org/citation.cfm? id=1182635.1164130.

[129] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker. OLTP Through the Looking Glass, and What We Found There. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 981–992, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-102-6. doi: 10.1145/1376616.1376713.

[130] Pat Helland. If You Have Too Much Data, then ‘Good Enough’ Is Good Enough. Communications of the ACM, 54(6):40–47, June 2011. doi: 10.1145/1953122.1953140.

[131] Joseph M. Hellerstein, Christoper Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. The MADlib Analytics Library or MAD Skills, the SQL. In Proceedings of the VLDB Endowment, volume 5 of PVLDB, pages 1700–1711. VLDB Endowment, August 2012. URL http://dl.acm. org/citation.cfm?id=2367502.2367510.

[132] Yin Huai, Rubao Lee, Simon Zhang, Cathy H. Xia, and Xiaodong Zhang. DOT: A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC ’11, pages 4:1–4:14, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0976-9. doi: 10.1145/2038916.2038920.

[133] S. Humbetov. Data-Intensive Computing with Map-Reduce and Hadoop. In Proceedings of the 6th International Conference on Application of Information and Communication Technologies, AICT ’12, pages 1–5, October 2012. doi: 10.1109/ICAICT.2012.6398489.

[134] M. Indrawan-Santiago. Database Research: Are We at a Crossroad? Reflection on NoSQL. In Proceedings of the 15th International Conference on Network-Based Information Systems, NBiS’2012, pages 45– –51, September 2012. doi: 10.1109/NBiS.2012.95.

[135] William H. Inmon. Building the Data Warehouse. John Wiley & Sons, Inc., New York, NY, USA, 1992. ISBN 0471569607.


[136] William H. Inmon and K. Krishnan. Building the Unstructured Data Warehouse.Technics Publications, LLC, jan 2011. [137] Adam Jacobs. The Pathologies of Big Data. ACM Queue, 52(8):36–44, August 2009. [138] Bin Jiang. Is Inmon’s Data Warehouse Definition Still Accurate? Blog Entry, May 2012. URL http://www.b-eye-network.com/view/16066. Accessed: 31-07-2013. [139] JeJonas. There Is No Such Thing As A Single Version of Truth. Blog Entry, March 2006. URL http://jeffjonas.typepad.com/jeff_jonas/2006/03/there_is_no_suc.html. Accessed: 05-05-2013. [140] JeJonas. Data Beats Math. Blog Entry, April 2011. URL http://jeffjonas.typepad.com/ jeff_jonas/2011/04/data-beats-math.html. Accessed: 05-04-2013. [141] Stephen Kaisler, Frank Armour, J. Alberto Espinosa, and William Money. Big Data: Issues and Challenges Moving Forward. In Proceedings of the 46th Hawaii International Conference on System Sciences, HICSS ’13, pages 995–1004, 2013. [142] Ralph Kimball. The Data Warehouse Toolkit. John Wiley & Sons, In, 1996. ISBN 9780471153375. URL http://books.google.nl/books?id=VlBqcgAACAAJ. [143] Ralph Kimball. The Evolving Role of the Enterprise Data Ware- house in the Era of Big Data Analytics. Whitepaper, Kimball Group, April 2011. URL http://www.kimballgroup.com/2011/04/29/ the-evolving-role-of-the-enterprise-data-warehouse-in-the-era-of-big-data-analytics/. [144] Ralph Kimball. Newly Emerging Best Practices for Big Data. Whitepaper, Kimball Group, September 2012. URL http://www.kimballgroup.com/2012/09/30/ newly-emerging-best-practices-for-big-data/. [145] Ralph Kimball, Laura Reeves, Margy Ross, and Warren Thornthwaite. The Data Warehouse Lifecycle Toolkit. John Wiley & Sons, Inc., 1998. [146] Gerald Kotonya and Ian Sommerville. Requirements Engineering - Process and Techniques. John Wiley & Sons, Ltd, 1998. [147] T. Kraska. Finding the Needle in the Big Data Systems Haystack. IEEE Internet Computing, 17(1):84–86, 2013. ISSN 1089-7801. doi: 10.1109/MIC.2013.10. [148] T. Kraska and B. Trushkowsky. The New Database Architectures. IEEE Internet Computing, 17(3):72–75, 2013. ISSN 1089-7801. doi: 10.1109/MIC.2013.56. [149] Jay Kreps, Neha Narkhede, and Jun Rao. Kafka: A Distributed Messaging System for Log Processing. In Proceedings of the NetDB, NetDB ’11, Athens, Greece, June 2011. ACM. [150] P.B. Kruchten. The 4+1 View Model of Architecture. IEEE Software, 12(6):42–50, 1995. ISSN 0740-7459. doi: 10.1109/52.469759. [151] Karl E. Kurbel. Information Systems Architecture. In The Making of Information Systems, pages 95–154. Springer Berlin Heidelberg, 2008. ISBN 978-3-540-79260-4. doi: 10.1007/ 978-3-540-79261-1_3. [152] Doug Laney. 3D Data Management: Controlling Data Volume, Veloc- ity and Variety. Technical report, META Group, Inc (now Gartner, Inc.), February 2001. URL http://blogs.gartner.com/doug-laney/files/2012/01/ ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.


[153] Jimmy Lin. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! The Computing Research Repository, abs/1209.2191, 2012.

[154] Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, Sep 2010.

[155] Xiufeng Liu, Christian Thomsen, and Torben Bach Pedersen. MapReduce-based Dimensional ETL Made Easy. In Proceedings of the VLDB Endowment, volume 5 of PVLDB, pages 1882– 1885. VLDB Endowment, August 2012. URL http://dl.acm.org/citation.cfm?id=2367502. 2367528.

[156] Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11, pages 401–416, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0977-6. doi: 10.1145/2043556.2043593.

[157] Steve Lohr. The Age of Big Data. The New York Times, February 12th, 2012:SR1, February 2012. URL http://www.nytimes.com/2012/02/12/sunday-review/ big-datas-impact-in-the-world.html.

[158] Peter Loos, Jens Lechtenbörger, Gottfried Vossen, Alexander Zeier, Jens Krüger, Jürgen Müller, Wolfgang Lehner, Donald Kossmann, Benjamin Fabian, Oliver Günther, and Robert Winter. In-memory Databases in Business Information Systems. Business & Information Systems Engineering, 3(6):389–395, 2011. doi: 10.1007/s12599-011-0188-y.

[159] Peter Loos, Stefan Strohmeier, Gunther Piller, and Reinhard Schütte. Comments on “In-Memory Databases in Business Information Systems”. Business & Information Systems Engineering,4 (4):213–223, 2012. doi: 10.1007/s12599-012-0222-8.

[160] Ashwin Machanavajjhala and Jerome P. Reiter. Big Privacy: Protecting Confidentiality in Big Data. XRDS, 19(1):20–23, September 2012. ISSN 1528-4972. doi: 10.1145/2331042.2331051.

[161] Sam Madden. From Databases to Big Data. IEEE Internet Computing, 16(3):4–6, 2012.

[162] Lev Manovich. Trending: The Promises and the Challenges of Big Social Data. In Matthew K. Gold, editor, Debates in the Digital Humanities. The Uni- versity of Minnesota Press, 2011. URL http://lab.softwarestudies.com/2011/04/ new-article-by-lev-manovich-trending.html.

[163] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. Big data: The next frontier for innova- tion, competition, and productivity. Analyst report, McKinsey Global Institute, May 2011. URL http://www.mckinsey.com/insights/mgi/research/technology_and_ innovation/big_data_the_next_frontier_for_innovation.

[164] Nathan Marz and James Warren. Big Data - Principles and best practices of scalable realtime data systems. Manning Publications, manning early access program - big data version 9 edition, 2013.

[165] Viktor Mayer-Schönberger and Kenneth Cukier. Big Data - A Revolution That Will Transform How We Live, Work and Think. John Murray (Publishers), 2013.

[166] Andrew McAfee and Erik Brynjolfsson. Big Data: The Management Revolution. Harvard Business Review, October 2012:60–68, October 2012.


[167] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Georey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: Interactive Analysis of Web-Scale Datasets. Communica- tions of the ACM, 54(6):114–123, June 2011. ISSN 0001-0782. doi: 10.1145/1953122.1953148. URL http://doi.acm.org/10.1145/1953122.1953148. [168] Camille Mendler. M2M and big data. Website, The Economist: Intelligence Unit, 2013. URL http://digitalresearch.eiu.com/m2m/from-sap/m2m-and-big-data. Accessed: 05- 05-2013. [169] H.G. Miller and P. Mork. From Data to Decisions: A Value Chain for Big Data. IT Professional, 15(1):57–59, 2013. ISSN 1520-9202. doi: 10.1109/MITP.2013.11. [170] Gilad Mishne, JeDalton, Zhenghua Li, Aneesh Sharma, and Jimmy Lin. Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture. The Computing Research Repository, abs/1210.7350:http://arxiv.org/abs/1210.7350, October 2012. URL http://arxiv.org/abs/1210.7350. [171] Gerrit Muller and Piërre Laar. Researching Reference Architectures. In Pierre Van de Laar and Teade Punter, editors, Views on Evolvability of Embedded Systems, Embedded Systems, pages 107–119. Springer Netherlands, 2011. ISBN 978-90-481-9848-1. doi: 10.1007/978-90-481-9849-8_ 7. [172] Elisa Yumi Nakagawa and Lucas Bueno Ruas de Oliveira. Using Systematic Review to Elicit Requirements of Reference Architectures. In Anais do WER11 - Workshop em Engenharia de Requisitos, Rio de Janeiro-RJ, Brasil, April 2011. [173] Elisa Yumi Nakagawa, Martin Becker, and José Carlos Maldonado. A Knowledge-based Framework for Reference Architectures. In Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC ’12, pages 1197–1202, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-0857-1. doi: 10.1145/2231936.2231964. [174] Mathias Niepert. Statistical Relational Data Integration for Information Extraction. In Sebastian Rudolph, Georg Gottlob, Ian Horrocks, and Frank Harmelen, editors, Reasoning Web. Semantic Technologies for Intelligent Data Access, volume 8067 of Lecture Notes in Computer Science, pages 251–283. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-39783-7. doi: 10.1007/978-3-642-39784-4_7. [175] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1099–1110, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-102-6. doi: 10.1145/1376616.1376726. [176] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09, pages 165–178, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-551-2. doi: 10.1145/1559845.1559865. [177] Alex Pentland. Reinventing Society in the Wake of Big Data. Edge.org Conversation, August 2012. URL http://www.edge.org/conversation/ reinventing-society-in-the-wake-of-big-data. Accessed: 11-04-2013. [178] Christy Pettey and Laurence Goasdu. Gartner Says Solving ’Big Data’ Challenge Involves More Than Just Managing Volumes of Data. Press Release, June 2011. URL http://www. gartner.com/newsroom/id/1731916. Accessed: 27-02-2011.

[179] Dan Pritchett. BASE: An Acid Alternative. Queue, 6(3):48–55, May 2008. ISSN 1542-7730. doi: 10.1145/1394127.1394128.

[180] Ariel Rabkin and Randy Katz. Chukwa: a system for reliable large-scale log collection. In Proceedings of the 24th International Conference on Large Installation System Administration, LISA ’10, pages 1–15, Berkeley, CA, USA, 2010. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1924976.1924994.

[181] Tilmann Rabl, Sergio Gómez-Villamor, Mohammad Sadoghi, Victor Muntés-Mulero, Hans-Arno Jacobsen, and Serge Mankovskii. Solving Big Data Challenges for Enterprise Application Performance Management. In Proceedings of the VLDB Endowment, volume 5, pages 1724–1735. VLDB Endowment, August 2012. URL http://dl.acm.org/citation.cfm?id=2367502.2367512.

[182] Erhard Rahm and Hong Hai Do. Data Cleaning: Problems and Current Approaches. IEEE Data Engineering Bulletin, 23, 2000.

[183] Anand Rajaraman. More data usually beats better algorithms. Blog Entry, March 2008. URL http://anand.typepad.com/datawocky/2008/03/more-data-usual.html. Accessed: 05-04-2013.

[184] Anand Rajaraman. More data usually beats better algorithms, Part 2. Blog Entry, April 2008. URL http://anand.typepad.com/datawocky/2008/04/data-versus-alg.html. Accessed: 05-04-2013.

[185] Anand Rajaraman. More data beats better algorithm at predicting Google earnings. Blog Entry, April 2008. URL http://anand.typepad.com/datawocky/2008/04/more-data-beats.html. Accessed: 05-04-2013.

[186] Thomas C. Redman. In a Big Data World, Don’t Forget Experimentation. Blog Entry, May 2013. URL http://blogs.hbr.org/2013/05/in-a-big-data-world-dont-forge/. Accessed: 17-09-2013.

[187] Eric Redmond and Jim R. Wilson. Seven Databases in Seven Weeks. Pragmatic Bookshelf, Pragmatic Programmers, LLC, 1st edition, 2012.

[188] Mohammad Rifaie, K. Kianmehr, R. Alhajj, and M.J. Ridley. Data Warehouse Architecture and Design. In Proceedings of the 2008 IEEE International Conference on Information Reuse and Integration, IRI ’08, pages 58–63, 2008. doi: 10.1109/IRI.2008.4583005.

[189] Suzanne Robertson and James Robertson. Mastering the Requirements Process, 3rd Edition. Pearson Education, Inc, 2012.

[190] Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases. O’Reilly Media, Inc, early release edition, 2013. Early Release: raw & unedited.

[191] Kim Rose. Hadoop and the age of experimentation. Blog Entry, July 2013. URL http://hortonworks.com/big-data-insights/hadoop-and-the-age-of-experimentation/. Accessed: 17-09-2013.

[192] Mattias Rost, Louise Barkhuus, Henriette Cramer, and Barry Brown. Representation and Communication: Challenges in Interpreting Large Social Media Datasets. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, CSCW ’13, pages 357–362, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1331-5. doi: 10.1145/2441776.2441817. URL http://doi.acm.org/10.1145/2441776.2441817.

[193] Nick Rozanski and Eoin Woods. Software Systems Architecture - Working With Stakeholders Using Viewpoints and Perspectives. Pearson Education, 2005.

[194] Philip Russom. Big Data Analytics. Best practices report, The Data Warehousing Institute, 2011.

[195] Philip Russom. Analytic Databases for Big Data. Technical report, The Data Warehousing Institute, October 2012.

[196] Michael Saecker and Volker Markl. Big Data Analytics on Modern Hardware Architectures: A Technology Survey. In Marie-Aude Aufaure and Esteban Zimányi, editors, Business Intelligence, volume 138 of Lecture Notes in Business Information Processing, pages 125–149. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-36317-7. doi: 10.1007/978-3-642-36318-4_6.

[197] Shashi Shekhar, Viswanath Gunturi, Michael R. Evans, and KwangSoo Yang. Spatial Big-Data Challenges Intersecting Mobility and Cloud Computing. In Proceedings of the Eleventh ACM International Workshop on Data Engineering for Wireless and Mobile Access, MobiDE ’12, pages 1–6, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1442-8. doi: 10.1145/2258056.2258058.

[198] K. Shvachko, Hairong Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies, MSST ’10, pages 1–10, 2010. doi: 10.1109/MSST.2010.5496972.

[199] Abraham Silberschatz, Henry F. Korth, and S. Sudarshan. Database System Concepts. McGraw-Hill, 6th edition, 2011.

[200] Hassan A. Sleiman and Rafael Corchuelo. Information Extraction Framework. In Juan M. Corchado Rodríguez, Javier Bajo Pérez, Paulina Golinska, Sylvain Giroux, and Rafael Corchuelo, editors, Trends in Practical Applications of Agents and Multiagent Systems, volume 157 of Advances in Intelligent and Soft Computing, pages 149–156. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-28794-7. doi: 10.1007/978-3-642-28795-4_18.

[201] Rick Smolan and Jennifer Erwitt. The Human Face of Big Data. Against All Odds Productions, January 2013.

[202] Sunil Soares. Big Data Governance - An Emerging Imperative. MC Press Online, LLC, 1st edition, 2012.

[203] Sunil Soares. IBM InfoSphere: A Platform for Big Data Governance and Process Data Governance. MC Press Online, LLC, February 2013.

[204] M. Stonebraker and U. Çetintemel. “One Size Fits All”: An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering, ICDE ’05, pages 2–11, April 2005. doi: 10.1109/ICDE.2005.1.

[205] Michael Stonebraker. The Case for Shared Nothing. Database Engineering, 9:4–9, 1986.

[206] Michael Stonebraker. Technical Perspective - One Size Fits All: An Idea Whose Time has Come and Gone. Communications of the ACM, 51(12):76–76, December 2008. ISSN 0001-0782. doi: 10.1145/1409360.1409379.

[207] Michael Stonebraker. SQL Databases v. NoSQL Databases. Communications of the ACM, 53(4):10–11, April 2010. ISSN 0001-0782. doi: 10.1145/1721654.1721659.

[208] Michael Stonebraker and Rick Cattell. 10 Rules for Scalable Performance in ‘Simple Operation’ Datastores. Communications of the ACM, 54(6):72–80, June 2011. ISSN 0001-0782. doi: 10.1145/1953122.1953144.

[209] Michael Stonebraker, Chuck Bear, Uğur Çetintemel, Mitch Cherniack, Tingjian Ge, Nabil Hachem, Stavros Harizopoulos, John Lifter, Jennie Rogers, and Stan Zdonik. One Size Fits All? – Part 2: Benchmarking Results. In Proceedings of the Conference on Innovative Data Systems Research, CIDR ’07, pages 173–184, January 2007.

[210] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The End of an Architectural Era (It’s Time for a Complete Rewrite). In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pages 1150–1160. VLDB Endowment, September 2007. ISBN 978-1-59593-649-3. URL http://dl.acm.org/citation.cfm?id=1325851.1325981.

[211] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin. MapReduce and Parallel DBMSs: Friends or Foes? Communications of the ACM, 53(1):64–71, January 2010. ISSN 0001-0782. doi: 10.1145/1629175.1629197.

[212] Michael Stonebraker, Daniel Bruckner, Ihab Ilyas, George Beskales, Mitch Cherniack, Stan Zdonik, Alexander Pagan, and Shan Xu. Data Curation at Scale: The Data Tamer System. In Proceedings of the Conference on Innovative Data Systems Research, CIDR ’13, January 2013. URL http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper28.pdf.

[213] Michael Stonebraker, Sam Madden, and Pradeep Dubey. Intel “Big Data” Science and Technology Center Vision and Execution Plan. SIGMOD Record, 42(1):44–49, March 2013.

[214] Zhiquan Sui and Shrideep Pallickara. A Survey of Load Balancing Techniques for Data Intensive Computing. In Borko Furht and Armando Escalante, editors, Handbook of Data Intensive Computing, pages 157–168. Springer New York, 2011. ISBN 978-1-4614-1414-8. doi: 10.1007/978-1-4614-1415-5.

[215] Roshan Sumbaly, Jay Kreps, Lei Gao, Alex Feinberg, Chinmay Soman, and Sam Shah. Serving Large-scale Batch Computed Data with Project Voldemort. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST ’12, Berkeley, CA, USA, 2012. USENIX Association. URL http://dl.acm.org/citation.cfm?id=2208461.2208479.

[216] Roshan Sumbaly, Jay Kreps, and Sam Shah. The “Big Data” Ecosystem at LinkedIn. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 1125–1134, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2037-5. doi: 10.1145/2463676.2463707.

[217] Helen Sun and Peter Heller. Oracle Information Architecture: An Architect’s Guide to Big Data. Oracle, August 2012.

[218] Nassim N. Taleb. Beware the Big Errors of ‘Big Data’. Wired Opinion, February 2013. URL http://www.wired.com/opinion/2013/02/big-data-means-big-errors-people/. Accessed: 12-04-2013.

[219] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. In Proceedings of the VLDB Endowment, volume 2 of PVLDB, pages 1626–1629. VLDB Endowment, August 2009. URL http://dl.acm.org/citation.cfm?id=1687553.1687609.

[220] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. Hive – A Petabyte Scale Data Warehouse Using Hadoop. In Proceedings of the 26th IEEE International Conference on Data Engineering, ICDE ’10, pages 996–1005, Los Alamitos, CA, USA, 2010. IEEE Computer Society. ISBN 978-1-4244-5445-7. doi: 10.1109/ICDE.2010.5447738.

[221] Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur, Namit Jain, Joydeep Sen Sarma, Raghotham Murthy, and Hao Liu. Data Warehousing and Analytics Infrastructure at Facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages 1013–1020, New York, NY, USA, June 2010. ACM. ISBN 978-1-4503-0032-2. doi: 10.1145/1807167.1807278.

[222] Oliver Vogel, Ingo Arnold, Arif Chughtai, and Timo Kehrer. Software Architecture: A Comprehensive Framework and Guide for Practitioners. Springer-Verlag Berlin Heidelberg, 2011.

[223] Werner Vogels. Eventually Consistent. Communications of the ACM, 52(1):40–44, January 2009.

[224] Michael Walker. Data Veracity. Blog Entry, November 2012. URL http://www.datasciencecentral.com/profiles/blogs/data-veracity. Accessed: 05-04-2013.

[225] Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. Entity Resolution with Iterative Blocking. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09, pages 219–232, New York, NY, USA, June 2009. ACM. ISBN 978-1-60558-551-2. doi: 10.1145/1559845.1559870.

[226] Tom White. Hadoop: The Definitive Guide. O’Reilly Media, Inc, 2009.

[227] Karl E. Wiegers. Software Requirements, 2nd Edition. Microsoft Press, 2003.

[228] Thorsten Winsemann, Veit Köppen, and Gunter Saake. A Layered Architecture for Enterprise Data Warehouse Systems. In Marko Bajec and Johann Eder, editors, Advanced Information Systems Engineering Workshops, volume 112 of Lecture Notes in Business Information Processing, pages 192–199. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-31068-3. doi: 10.1007/978-3-642-31069-0_17.

[229] Lili Wu, Roshan Sumbaly, Chris Riccomini, Gordon Koo, Hyung Jin Kim, Jay Kreps, and Sam Shah. Avatara: OLAP for Webscale Analytics Products. In Proceedings of the VLDB Endowment, volume 5 of PVLDB, pages 1874–1877. VLDB Endowment, August 2012. URL http://dl.acm.org/citation.cfm?id=2367502.2367525.

[230] Yuqing Zhu, Philip S. Yu, and Jianmin Wang. Latency Bounding by Trading off Consistency in NoSQL Store: A Staging and Stepwise Approach. The Computing Research Repository, abs/1212.1046, December 2012.

[231] Paul C. Zikopoulos, Dirk deRoos, Krishnan Parasuraman, Thomas Deutsch, David Corrigan, and James Giles. Harness the Power of Big Data: The IBM Big Data Platform. McGraw-Hill, 2013.
