Open Source Platforms for Big Data Analytics

Jorge Filipe Cândido Nereu
October 2017

Dissertation submitted for the degree of Master in Informatics Engineering, Specialization in Information and Knowledge Systems

Supervisor: Ana Maria Neves de Almeida
Co-supervisor: Jorge Fernandes Rodrigues Bernardino

Porto, October 2017

Resumo (Summary)

The concept of Big Data has had a major impact on the field of technology, particularly on the management and analysis of enormous volumes of information. Organizations currently regard Big Data as an opportunity to manage and exploit their data as fully as possible, in order to support their decisions across their different operational areas. It is therefore necessary to analyse several concepts related to Big Data and Big Data Analytics, including definitions, characteristics, advantages and challenges. Business Intelligence (BI) tools, together with knowledge generation, are fundamental concepts for the decision-making process and for the transformation of information. By investigating Big Data platforms, current industrial practices and trends in the research world, it is possible to understand the impact of Big Data Analytics on small organizations.

This work aims to propose solutions for micro, small and medium-sized enterprises (SMEs), which have a major impact on the Portuguese economy, since they account for the majority of the country's businesses. Open source platforms for Big Data Analytics offer SMEs a great opportunity for innovation. This research work presents a comparative analysis of the platforms' features and characteristics and the steps to be taken for a deeper comparative analysis.

After the comparative analysis, we present an evaluation and selection of Big Data Analytics (BDA) platforms, using and adapting the QSOS (Qualification and Selection of software Open Source) methodology for qualifying and selecting open source software. This evaluation and selection resulted in two platforms being chosen for the experimental tests. The same dataset and the same hardware and software configuration were used on both open source BDA platforms. The comparison of the two platforms showed that the HPCC Systems Platform is more efficient and reliable than the Hortonworks Data Platform. In particular, Portuguese SMEs should regard BDA platforms as an opportunity to gain competitive advantage and improve their processes and, consequently, to define an IT and business strategy. Finally, this is a work on Big Data that is expected to serve as an invitation and motivation for further research.

Keywords: Big Data, Big Data Analytics, BI, Big Data Platforms.

Abstract

The concept of Big Data has had a great impact in the field of technology, particularly in the management and analysis of huge volumes of information. Nowadays, organizations look to Big Data as an opportunity to manage and explore their data as fully as they can, with the objective of supporting decisions within their different operational areas. Thus, it is necessary to analyse several concepts about Big Data and Big Data Analytics, including definitions, features, advantages and disadvantages. Business intelligence, along with the generation of knowledge, are fundamental concepts for the process of decision-making and transformation of information. By investigating today's big data platforms, current industrial practices and related trends in the research world, it is possible to understand the impact of Big Data Analytics on small organizations.
This research intends to propose solutions for micro, small or medium enterprises (SMEs), which have a great impact on the Portuguese economy since they represent approximately 90% of the companies in Portugal. Open source platforms for Big Data Analytics offer a great opportunity for SMEs. This research work presents a comparative analysis of those platforms' features and functionalities and the steps that will be taken for a more profound comparative analysis. After the comparative analysis, we present an evaluation and selection of Big Data Analytics (BDA) platforms using and adapting the Qualification and Selection of software Open Source (QSOS) method. This evaluation resulted in the selection of two platforms for the empirical experiments and tests. The same testbed and dataset were used on the two open source Big Data Analytics platforms. When comparing the two BDA platforms, HPCC Systems Platform is found to be more efficient and reliable than Hortonworks Data Platform. In particular, Portuguese SMEs should consider BDA platforms an opportunity to obtain competitive advantage and improve their processes and, consequently, define an IT and business strategy. Finally, this is a research work on Big Data; it is hoped that it will serve as an invitation and motivation for new research.

Keywords: Big Data, Big Data Analytics, BI, Big Data Platforms.

Table of Contents

1 Introduction
  1.1 Problem
  1.2 Objectives
  1.3 Document structure
2 Value Analysis
  2.1 Value Networks
  2.2 Value Proposition
  2.3 Canvas Model
3 Context
  3.1 Context of the work
  3.2 SMEs
    3.2.1 Definition of SMEs
    3.2.2 Portuguese SMEs
    3.2.3 SMEs Innovation as opportunity to grow
  3.3 Related Work
4 Big Data Concepts
  4.1 Big Data
    4.1.1 Types of Big Data
    4.1.2 Big Data Characteristics
  4.2 Big Data Storage and Management
    4.2.1 Non-relational databases
    4.2.2 In-Memory Databases
  4.3 Big Data Analytics
    4.3.1 In-Memory analytics
    4.3.2 Real Time analytics
    4.3.3 Big Data Analytical Methods and Decision Making
  4.4 Big Data Ecosystems
5 Open Source Big Data Platforms
  5.1 Apache Hadoop
    5.1.1 MapReduce
    5.1.2 Hadoop Distributed File System (HDFS)
  5.2 Cloudera
  5.3 Hortonworks Data Platform (HDP)
  5.4 HPCC System
  5.5 Apache Apex
  5.6 Apache Storm
  5.7 Apache Drill
  5.8 Apache Solr
  5.9 Apache Spark
  5.10 OS Big Data Platforms Comparison
  5.11 Summary
6 Methodology
  6.1 Design Method
    6.1.1 Research Method
