Lambda Architecture for Distributed Stream Processing in the Fog


DIPLOMA THESIS

submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur in Software Engineering & Internet Computing

by Matthias Schrabauer, BSc
Registration Number 01326214

to the Faculty of Informatics at the TU Wien

Advisor: Associate Prof. Dr.-Ing. Stefan Schulte

Vienna, 2nd February, 2021

Matthias Schrabauer    Stefan Schulte

Technische Universität Wien, A-1040 Wien, Karlsplatz 13, Tel. +43-1-58801-0, www.tuwien.at

Declaration of Authorship

Matthias Schrabauer, BSc

I hereby declare that I have written this thesis independently, that I have fully specified all sources and resources used, and that I have clearly marked as borrowed, with reference to the source, all parts of the work (including tables, maps, and figures) that are taken from other works or from the Internet, whether verbatim or in substance.

Vienna, 2nd February, 2021
Matthias Schrabauer

Acknowledgements

I want to use this opportunity to thank all the people who supported and motivated me during the writing of this work. Especially, I want to thank my advisor Associate Prof. Dr.-Ing. Stefan Schulte, who supervised and reviewed this thesis. I am especially thankful for his numerous helpful suggestions and the constructive criticism he offered me throughout the writing of this thesis.

I would also like to thank my fellow students, who have always been helpful when I needed a second opinion or someone to discuss technical details.

Finally, I would like to thank my parents, whose support made my studies possible in the first place.

Abstract

The digital transformation is leading to a constantly increasing volume of data. With the growth of big data, there is a rising demand for analyzing and making use of those large piles of data (referred to as batch processing). To do that, programming models, frameworks, platforms, and tools such as the Apache Hadoop ecosystem and the MapReduce programming model have been established. Such systems have been designed for batch processing and are therefore not suitable for (real-time) stream processing.

With application scenarios like smart cities and autonomous driving emerging, there is a growing need to process continuous streams of data close to real-time. For this purpose, distributed stream processing frameworks such as Apache Storm or Apache Flink are used to analyze data streams. However, such frameworks usually operate in the cloud within a local cluster with low latency. For Internet of Things (IoT) applications, this centralized approach often leads to high latency, since data streams (e.g., sensor data) must first be sent to the cloud in order to be processed. To address this issue and to efficiently process IoT data on a large scale, the cloud alone is no longer sufficient. There is an increasing trend to push the processing of data closer to the edge of the network, where the data is generated and stored.

In order to take advantage of both batch and stream processing, the lambda architecture design pattern has been introduced. This architectural style is based on three layers, which make it possible to efficiently process massive volumes of historical data (batch and serving layer) while simultaneously using stream processing to provide a real-time analysis of continuous data streams (speed layer). The goal of this work is to design and implement a solution approach which makes use of the lambda architecture as well as fog computing to process data streams in real-time.

The evaluation focuses on how well fog-based stream processing topologies perform compared to a traditional cloud approach. Common metrics in the field of data processing are used for a quantitative evaluation (latency, round-trip time of data packets). The evaluation of the solution approach shows that using distributed stream processing in the fog can be a very promising alternative to traditional data processing in the cloud. Overall, the results show a decrease in round-trip times, especially if the latency to the cloud is over 50 ms or the data packet size is quite large.

Contents

Kurzfassung
Abstract
Contents
1 Introduction
  1.1 Motivation and Problem Statement
  1.2 Aim of the Work
  1.3 Methodology and Approach
  1.4 Structure
2 Background
  2.1 Internet of Things
  2.2 Fog Computing
  2.3 Big Data Analytics
  2.4 Lambda Architecture
3 Related Work
  3.1 Distributed Stream Processing for the IoT and Fog Computing
  3.2 Lambda Architecture for Distributed Stream Processing
  3.3 Fog Computing Infrastructure
  3.4 Conclusion
4 Requirements Analysis and Design
  4.1 Requirements
  4.2 Architecture
5 Implementation
  5.1 Infrastructure Setup
  5.2 Development Operations
  5.3 Implementation of the Lambda Architecture
  5.4 Implementation of Non-Functional Requirements
  5.5 Limitations
6 Evaluation
  6.1 Data Sets
  6.2 Motivational Scenario
  6.3 Testbed
  6.4 Topology
  6.5 Benchmarks
  6.6 Summary
7 Conclusion and Future Work
  7.1 Discussion
  7.2 Future Work
List of Figures
List of Tables
Acronyms
Bibliography

CHAPTER 1
Introduction

1.1 Motivation and Problem Statement

With the growth of big data, there is a rising demand for