Tools for Big Data Analysis


Masaryk University
Faculty of Informatics

Master’s Thesis
Bc. Martin Macák
Brno, Spring 2018

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Martin Macák

Advisor: doc. Ing. RNDr. Barbora Bühnová, Ph.D.

Acknowledgements

I would like to thank my supervisor, doc. Ing. RNDr. Barbora Bühnová, Ph.D., for offering me the opportunity to work on this thesis. Her support, guidance, and patience greatly helped me to finish it. I would also like to thank her for introducing me to the great team of people in the CERIT-SC Big Data project. From this team, I would especially like to thank RNDr. Tomáš Rebok, Ph.D., who many times found time to give me useful advice, and Bruno Rossi, PhD, who gave me the opportunity to present the results of this thesis at the LaSArIS seminar.

I would also like to express my gratitude for the support of my family: my parents, Jana and Alexander, and the best sister, Nina. My thanks also go to my supportive friends, mainly Bc. Tomáš Milo, Bc. Peter Kelemen, Bc. Jaroslav Davídek, Bc. Štefan Bojnák, and Mgr. Ondřej Gasior. Lastly, I would like to thank my girlfriend, Bc. Iveta Vidová, for her patience and support.

Abstract

This thesis focuses on the design of a Big Data tool selection diagram, which can help to choose the right open source tools for a given Big Data problem. The thesis classifies the tools into components and proposes a Big Data tool architecture for a general Big Data problem, which illustrates the communication between those components.
The thesis then examines some of those components in more detail, creating an overview of current Big Data tools. Based on this overview, the initial version of the Big Data tool selection diagram, which contains storage and processing tools, is created. The thesis then proposes a process for validating the diagram and provides a set of tests as examples. Each test compares the relevant results of a solution that uses the tool chosen by the diagram with those of a solution that uses another tool.

Keywords

Big Data, Big Data tools, Big Data architecture, Big Data storage, Big Data processing

Contents

1 Introduction
2 Big Data
  2.1 Characteristics
  2.2 Big Data system requirements
    2.2.1 Scalability
    2.2.2 Distribution models
    2.2.3 Consistency
3 State of the Art in Big Data Tools
4 Big Data Tools Architecture
  4.1 Related work
  4.2 Classification
  4.3 Proposed architecture
5 Big Data Storage Systems
  5.1 Relational database management systems
    5.1.1 Data warehouse databases
    5.1.2 NewSQL database management systems
    5.1.3 Summary
  5.2 NoSQL database management systems
    5.2.1 Key-value stores
    5.2.2 Document stores
    5.2.3 Column-family stores
    5.2.4 Graph databases
    5.2.5 Multi-model databases
    5.2.6 Summary
  5.3 Time-series database management systems
    5.3.1 InfluxDB
    5.3.2 Riak TS
    5.3.3 OpenTSDB
    5.3.4 Druid
    5.3.5 SiriDB
    5.3.6 TimescaleDB
    5.3.7 Prometheus
    5.3.8 KairosDB
    5.3.9 Summary
  5.4 Distributed file systems
    5.4.1 Hadoop Distributed File System
    5.4.2 SeaweedFS
    5.4.3 Perkeep
    5.4.4 Summary
6 Big Data Processing Systems
  6.1 Batch processing systems
    6.1.1 Apache Hadoop MapReduce
    6.1.2 Alternatives
  6.2 Stream processing systems
    6.2.1 Apache Storm
    6.2.2 Alternatives
  6.3 Graph processing systems
    6.3.1 Apache Giraph
    6.3.2 Alternatives
  6.4 High-level representation tools
    6.4.1 Apache Hive
    6.4.2 Apache Pig
    6.4.3 Summingbird
    6.4.4 Alternatives
  6.5 General-purpose processing systems
    6.5.1 Apache Spark
    6.5.2 Apache Flink
    6.5.3 Alternatives
  6.6 Summary
7 Tool Selection Diagram
  7.1 Validation
8 Attachments
9 Conclusion
  9.1 Future directions
Bibliography

List of Tables

5.1 Basic summary of relational database management systems
5.2 Basic summary of NoSQL database management systems
5.3 Basic summary of time-series database management systems
5.4 Basic summary of distributed file systems
6.1 Basic summary of processing systems
7.1 Results of the first test
7.2 Results of the second test
7.3 Results of the extended second test

1 Introduction

Nowadays, we are surrounded by Big Data in many forms. Big Data can be seen in several domains, such as the Internet of Things, social media, medicine, and astronomy [1]. They are used, for example, in data mining, machine learning, predictive analytics, and statistical techniques. Big Data brings many problems to developers, who have to build systems that can handle this type of data and its properties, such as huge volume, heterogeneity, or generation speed. Currently, open source solutions are very popular in this domain, and multiple open source Big Data tools have been created for working with this type of data. However, their enormous number, specific aims, and fast evolution make it confusing to choose the right solution for a given Big Data problem.

We believe that creating a Big Data tool selection diagram would be a valid response to this issue. Such a diagram should be able to recommend the set of tools that should be used for a given Big Data problem. The elements of the output set should be based on the properties of this problem. As a complete diagram is beyond the scope of a master’s thesis, this thesis creates the initial version of the Big Data tool selection diagram, which is expected to be updated and extended in the future.

This thesis is organized as follows. Fundamental information about the Big Data domain and its specifics is introduced in chapter 2. Chapter 3 describes the challenges in Big Data tools. The proposed architecture of Big Data tools is described in chapter 4. Chapter 5 contains the overview of Big Data storage tools, and chapter 6 contains the overview of Big Data processing tools. The tool selection diagram and its validation are presented in chapter 7. Contents attached to this thesis are described in chapter 8. Chapter 9 concludes the thesis.

2 Big Data

This chapter contains the fundamental information about the Big Data domain. It should give the reader the knowledge necessary to understand the following chapters.

2.1 Characteristics

Big Data are typically defined by five properties, known as the "5 Vs of Big Data" [2].

∙ Volume: The data are so large that they cannot fit into a single server, or the performance of analysis on those data on a single server is low. Data growth over time is also a relevant factor. Therefore, systems that work with Big Data have to be scalable.

∙ Variety: The structure of the data can be heterogeneous. Data can be classified by their structure into three categories: structured data with a defined structure (for example, CSV files and spreadsheets), semi-structured data with a flexible structure (for example, JSON and XML), and unstructured data without a structure (for example, images and videos) [3].

∙ Velocity: Data sources generate real-time data at a fast rate. For example, on Facebook, 136,000 photos are uploaded every minute [4]. So the system has to be able to handle lots of data at a reasonable speed.

∙ Veracity: Some data may be of lower quality and cannot be considered trustworthy. Technologies should handle this kind of data too.

∙ Value: This property refers to the ability to extract value from the data. Therefore, systems have to provide useful benefits from the acquired data.

Many other definitions have emerged, including a five-part definition [5], 7 Vs [6], 10 Vs [7, 8], and 42 Vs [9]. However, the 5 Vs definition is still considered a popular standard.

2.2 Big Data system requirements

2.2.1 Scalability

Scalability is the ability of a system to manage increased demands. This ability is very relevant because of the volume of Big Data. Scaling can be either vertical or horizontal [10].

Vertical scaling involves adding more processors, more memory, or faster hardware, typically to a single server. Most software can then benefit from it. However, vertical scaling requires high financial investment, and there is a limit to how far a single server can be scaled.

Horizontal scaling means adding more servers to a group of cooperating servers, called a cluster. These servers may be cheap commodity machines, so the financial investment is relatively low. When this method is used, the system can scale as much as needed. However, it brings many complexities that the software has to handle, which is reflected in the limited amount of software that can run on such systems.
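
As a rough illustration of horizontal scaling and of the complexity it introduces, the following minimal sketch spreads records across the servers of a cluster by hashing their keys. It is not taken from the thesis: the server names, the key format, and the simple modulo placement scheme are assumptions chosen only for illustration, and real systems typically use more elaborate placement strategies.

```python
import hashlib
from collections import Counter


def server_for(key: str, servers: list[str]) -> str:
    """Pick the server responsible for a key (hypothetical modulo scheme)."""
    # A cryptographic hash gives a stable, evenly spread number for each key.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def distribute(keys: list[str], servers: list[str]) -> Counter:
    """Count how many keys each server in the cluster would store."""
    counts = Counter({server: 0 for server in servers})
    for key in keys:
        counts[server_for(key, servers)] += 1
    return counts


# Hypothetical workload and cluster sizes.
keys = [f"record-{i}" for i in range(10_000)]
small_cluster = ["node-1", "node-2"]
large_cluster = ["node-1", "node-2", "node-3", "node-4"]

small = distribute(keys, small_cluster)
large = distribute(keys, large_cluster)

# Keys whose responsible server changes when the cluster grows.
moved = sum(
    1 for key in keys
    if server_for(key, small_cluster) != server_for(key, large_cluster)
)

print("load per server, 2 nodes:", dict(small))
print("load per server, 4 nodes:", dict(large))
print("keys that must move after scaling out:", moved)
```

Adding servers lowers the load on each one, which is the benefit described above. The count of moved keys hints at one of the complexities: with this naive scheme, growing the cluster forces a large share of the data to be relocated, which the software has to handle.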