Apache Samza

Total Page:16

File Type:pdf, Size:1020Kb

Apache Samza Apache Samza Martin Kleppmann Definition vehicles, or the writes of records to a database. Apache Samza is an open source frame- Stream processing jobs are long- work for distributed processing of high- running processes that continuously volume event streams. Its primary design consume one or more event streams, goal is to support high throughput for a invoking some application logic on wide range of processing patterns, while every event, producing derived output providing operational robustness at the streams, and potentially writing output massive scale required by Internet com- to databases for subsequent querying. panies. Samza achieves this goal through While a batch process or a database a small number of carefully designed ab- query typically reads the state of a stractions: partitioned logs for messag- dataset at one point in time, and then ing, fault-tolerant local state, and cluster- finishes, a stream processor is never based task scheduling. finished: it continually awaits the arrival of new events, and it only shuts down when terminated by an administrator. Many tasks can be naturally ex- Overview pressed as stream processing jobs, for example: Stream processing is playing an increas- • aggregating occurrences of events, ingly important part of the data man- e.g., counting how many times a agement needs of many organizations. particular item has been viewed; Event streams can represent many kinds • computing the rate of certain events, of data, for example, the activity of users e.g., for system diagnostics, report- on a website, the movement of goods or ing, and abuse prevention; 1 2 Martin Kleppmann • enriching events with information the scalability of Samza is directly at- from a database, e.g., extending user tributable to the choice of these founda- click events with information about tional abstractions. The remainder of this the user who performed the action; article provides further detail on those • joining related events, e.g., joining an design decisions and their practical con- event describing an email that was sequences. sent with any events describing the user clicking links in that email; • updating caches, materialized views, and search indexes, e.g., maintaining Partitioned Log Processing an external full-text search index over text in a database; A Samza job consists of a set of Java • using machine learning systems Virtual Machine (JVM) instances, called to classify events, e.g., for spam tasks, that each processes a subset of filtering. the input data. The code running in each JVM comprises the Samza framework Apache Samza, an open source and user code that implements the re- stream processing framework, can be quired application-specific functionality. used for any of the above applications The primary API for user code is the (Kleppmann and Kreps, 2015; Noghabi Java interface StreamTask, which de- et al, 2017). It was originally developed fines a method process(). Figure 1 at LinkedIn, then donated to the Apache shows two examples of user classes im- Software Foundation in 2013, and plementing the StreamTask interface. became a top-level Apache project in Once a Samza job is deployed and 2015. Samza is now used in production initialized, the framework calls the at many Internet companies, including process() method once for every LinkedIn (Paramasivam, 2016), Netflix message in any of the input streams. (Netflix Technology Blog, 2016), Uber The execution of this method may have (Chen, 2016; Hermann and Del Balso, various effects, including querying 2017), and TripAdvisor (Calisi, 2016). or updating local state and sending Samza is designed for usage scenar- messages to output streams. This model ios that require very high throughput: of computation is closely analogous to in some production settings, it pro- a map task in the well-known MapRe- cesses millions of messages per second duce programming model (Dean and or trillions of events per day (Feng, Ghemawat, 2004), with the difference 2015; Paramasivam, 2016; Noghabi that a Samza job’s input is typically et al, 2017). Consequently, the design never-ending (unbounded). of Samza prioritizes scalability and Similarly to MapReduce, each Samza operational robustness above most other task is a single-threaded process that it- concerns. erates over a sequence of input records. The core of Samza consists of sev- The inputs to a Samza job are partitioned eral fairly low-level abstractions, on top into disjoint subsets, and each input par- of which high-level operators have been tition is assigned to exactly one process- built (Pathirage et al, 2016). However, ing task. More than one partition may be the core abstractions have been carefully assigned to the same processing task, in designed for operational robustness, and Apache Samza 3 class SplitWords implements StreamTask { class CountWords implements StreamTask, InitableTask { static final SystemStream WORD_STREAM = new SystemStream("kafka", "words"); private KeyValueStore<String, Integer> store; public void process( public void init(Config config, IncomingMessageEnvelope in, TaskContext context) { MessageCollector out, store = (KeyValueStore<String, Integer>) TaskCoordinator _) { context.getStore("word-counts"); } String str = (String) in.getMessage(); public void process( for (String word : str.split(" ")) { IncomingMessageEnvelope in, out.send( MessageCollector out, new OutgoingMessageEnvelope( TaskCoordinator _) { WORD_STREAM, word, 1)); } String word = (String) in.getKey(); } Integer inc = (Integer) in.getMessage(); } Integer count = store.get(word); if (count == null) count = 0; store.put(word, count + inc); } } Fig. 1 The two operators of a streaming word-frequency counter using Samza’s StreamTask API (Image source: Kleppmann and Kreps (2015), c 2015 IEEE, reused with permission) Partition 0 “hello world” “hello samza” Split “hello” “hello” “interesting” Count Partition 1 “samza is interesting” Split “world” “samza” “samza” “is” Count Kafka topic strings Samza job Kafka topic words Samza job SplitWords CountWords Fig. 2 A Samza task consumes input from one partition, but can send output to any partition (Image source: Kleppmann and Kreps (2015), c 2015 IEEE, reused with permission) Output stream “hello” “hello” “interesting” Count —hello“ : 1 —hello“ : 2 —interesting“ : 1 Output stream “world” “samza” “samza” “is” Count —world“ : 1 —samza“ : 1 —samza“ : 2 —is“ : 1 Kafka topic words Samza job CountWords Kafka topic word_counts Fig. 3 A task’s local state is made durable by emitting a changelog of key-value pairs to Kafka (Image source: Kleppmann and Kreps (2015), c 2015 IEEE, reused with permission) 4 Martin Kleppmann which case the processing of those par- this log available for applications to titions is interleaved on the task thread. consume (Das et al, 2012; Qiao et al, However, the number of partitions in the 2013). Processing the stream of writes input determines the job’s maximum de- to a database enables jobs to maintain gree of parallelism. external indexes or materialized views The log interface assumes that each onto data in a database and is especially partition of the input is a totally ordered relevant in conjunction with Samza’s sequence of records and that each record support for local state (see Section is associated with a monotonically in- “Fault-tolerant Local State”). creasing sequence number or identifier While every partition of an input (known as offset). Since the records in stream is assigned to one particular task each partition are read sequentially, a of a Samza job, the output partitions job can track its progress by periodically are not bound to tasks. That is, when a writing the offset of the last read record StreamTask emits output messages, to durable storage. If a stream processing it can assign them to any partition of task is restarted, it resumes consuming the output stream. This fact can be the input from the last recorded offset. used to group related data items into Most commonly, Samza is used the same partition: for example, in the in conjunction with Apache Kafka word-counting application illustrated (see separate article on Kafka). Kafka in Figure 2, the SplitWords task provides a partitioned, fault-tolerant log chooses the output partition for each that allows publishers to append mes- word based on a hash of the word. This sages to a log partition, and consumers ensures that when different tasks en- (subscribers) to sequentially read the counter occurrences of the same word, messages in a log partition (Wang et al, they are all written to the same output 2015; Kreps et al, 2011; Goodhope partition, from where a downstream job et al, 2012). Kafka also allows stream can read and aggregate the occurrences. processing jobs to reprocess previously When stream tasks are composed seen records by resetting the consumer into multistage processing pipelines, the offset to an earlier position, a fact that is output of one task becomes the input to useful during recovery from failures. another task. Unlike many other stream However, Samza’s stream interface is processing frameworks, Samza does not pluggable: besides Kafka, it can use any implement its own message transport storage or messaging system as input, layer to deliver messages between provided that the system can adhere to stream operators. Instead, Kafka is used the partitioned log interface. By default, for this purpose; since Kafka writes all Samza can also read files from the messages to disk, it provides a large Hadoop Distributed Filesystem (HDFS) buffer between stages of the processing as input, in a way that parallels MapRe- pipeline, limited only by the available duce jobs, at competitive performance disk space on the Kafka brokers. (Noghabi et al, 2017). At LinkedIn, Typically, Kafka is configured to re- Samza is commonly deployed with tain several days or weeks worth of mes- Databus inputs: Databus is a change sages in each topic. Thus, if one stage data capture technology that records the of a processing pipeline fails or begins log of writes to a database and makes to run slow, Kafka can simply buffer the Apache Samza 5 input to that stage while leaving ample point, an additional consumer can be at- time for the problem to be resolved.
Recommended publications
  • Large-Scale Learning from Data Streams with Apache SAMOA
    Large-Scale Learning from Data Streams with Apache SAMOA Nicolas Kourtellis1, Gianmarco De Francisci Morales2, and Albert Bifet3 1 Telefonica Research, Spain, [email protected] 2 Qatar Computing Research Institute, Qatar, [email protected] 3 LTCI, Télécom ParisTech, France, [email protected] Abstract. Apache SAMOA (Scalable Advanced Massive Online Anal- ysis) is an open-source platform for mining big data streams. Big data is defined as datasets whose size is beyond the ability of typical soft- ware tools to capture, store, manage, and analyze, due to the time and memory complexity. Apache SAMOA provides a collection of dis- tributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It fea- tures a pluggable architecture that allows it to run on several distributed stream processing engines such as Apache Flink, Apache Storm, and Apache Samza. Apache SAMOA is written in Java and is available at https://samoa.incubator.apache.org under the Apache Software Li- cense version 2.0. 1 Introduction Big data are “data whose characteristics force us to look beyond the traditional methods that are prevalent at the time” [18]. For instance, social media are one of the largest and most dynamic sources of data. These data are not only very large due to their fine grain, but also being produced continuously. Furthermore, such data are nowadays produced by users in different environments and via a multitude of devices. For these reasons, data from social media and ubiquitous environments are perfect examples of the challenges posed by big data.
    [Show full text]
  • DSP Frameworks DSP Frameworks We Consider
    Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica DSP Frameworks Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini DSP frameworks we consider • Apache Storm (with lab) • Twitter Heron – From Twitter as Storm and compatible with Storm • Apache Spark Streaming (lab) – Reduce the size of each stream and process streams of data (micro-batch processing) • Apache Flink • Apache Samza • Cloud-based frameworks – Google Cloud Dataflow – Amazon Kinesis Streams Valeria Cardellini - SABD 2017/18 1 Apache Storm • Apache Storm – Open-source, real-time, scalable streaming system – Provides an abstraction layer to execute DSP applications – Initially developed by Twitter • Topology – DAG of spouts (sources of streams) and bolts (operators and data sinks) Valeria Cardellini - SABD 2017/18 2 Stream grouping in Storm • Data parallelism in Storm: how are streams partitioned among multiple tasks (threads of execution)? • Shuffle grouping – Randomly partitions the tuples • Field grouping – Hashes on a subset of the tuple attributes Valeria Cardellini - SABD 2017/18 3 Stream grouping in Storm • All grouping (i.e., broadcast) – Replicates the entire stream to all the consumer tasks • Global grouping – Sends the entire stream to a single task of a bolt • Direct grouping – The producer of the tuple decides which task of the consumer will receive this tuple Valeria Cardellini - SABD 2017/18 4 Storm architecture • Master-worker architecture Valeria Cardellini - SABD 2017/18 5 Storm
    [Show full text]
  • Comparative Analysis of Data Stream Processing Systems
    Shah Zeb Mian Comparative Analysis of Data Stream Processing Systems Master’s Thesis in Information Technology February 23, 2020 University of Jyväskylä Faculty of Information Technology Author: Shah Zeb Mian Contact information: [email protected] Supervisors: Oleksiy Khriyenko, and Vagan Terziyan Title: Comparative Analysis of Data Stream Processing Systems Työn nimi: Vertaileva analyysi Data Stream-käsittelyjärjestelmistä Project: Master’s Thesis Study line: All study lines Page count: 48+0 Abstract: Big data processing systems are evolving to be more stream oriented where data is processed continuously by processing it as soon as it arrives. Earlier data was often stored in a database, a file system or other form of data storage system. Applications would query the data as needed. Stram processing is the processing of data in motion. It works on continuous data retrieved from different resources. Instead of periodically collecting huge static data, streaming frameworks process data as soon as it becomes available, hence reducing latency. This thesis aims to conduct a comparative analysis of different streaming processors based on selected features. Research focuses on Apache Samza, Apache Flink, Apache Storm and Apache Spark Structured Streaming. Also, this thesis explains Apache Kafka which is a log-based data storage widely used in streaming frameworks. Keywords: Big Data, Stream Processing,Batch Processing,Streaming Engines, Apache Kafka, Apache Samza Suomenkielinen tiivistelmä: Big data-käsittelyjärjestelmät ovat tällä hetkellä kehittymässä stream-orientoituneiksi, eli data käsitellään heti saapuessaan. Perinteisemmin data säilöt- tiin tietokantaan, tiedostopohjaisesti tai muuhun tiedonsäilytysjärjestelmään, ja applikaatiot hakivat datan tarvittaessa. Stream-pohjainen järjestelmä käsittelee liikkuvaa dataa, jatkuva- aikaista dataa useasta lähteestä. Sen sijaan, että haetaan ajoittain dataa, stream-pohjaiset frameworkit pystyvät käsittelemään i dataa heti kun se on saatavilla, täten vähentäen viivettä.
    [Show full text]
  • Network Traffic Profiling and Anomaly Detection for Cyber Security
    Network traffic profiling and anomaly detection for cyber security Laurens D’hooge Student number: 01309688 Supervisors: Prof. dr. ir. Filip De Turck, dr. ir. Tim Wauters Counselors: Prof. dr. Bruno Volckaert, dr. ir. Tim Wauters A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Master of Science in Information Engineering Technology Academic year: 2017-2018 Acknowledgements This thesis is the result of 4 months work and I would like to express my gratitude towards the people who have guided me throughout this process. First and foremost I’d like to thank my thesis advisors prof. dr. Bruno Volckaert and dr. ir. Tim Wauters. By virtue of their knowledge and clear communication, I was able to maintain a clear target. Secondly I would like to thank prof. dr. ir. Filip De Turck for providing me the opportunity to conduct research in this field with the IDLab research group. Special thanks to Andres Felipe Ocampo Palacio and dr. Marleen Denert are in order as well. Mr. Ocampo’s Phd research into big data processing for network traffic and the resulting framework are an integral part of this thesis. Ms. Denert has been the go-to member of the faculty staff for general advice and administrative dealings. The final token of gratitude I’d like to extend to my family and friends for their continued support during this process. Laurens D’hooge Network traffic profiling and anomaly detection for cyber security Laurens D’hooge Supervisor(s): prof. dr. ir. Filip De Turck, dr. ir. Tim Wauters Abstract— This article is a short summary of the research findings of a creation of APT2.
    [Show full text]
  • Optimizing Resource Utilization in Distributed Computing Systems For
    THESE` DE DOCTORAT DE L’ETABLISSEMENT´ UNIVERSITE´ BOURGOGNE FRANCHE-COMTE´ PREPAR´ EE´ A` L’UNIVERSITE´ DE FRANCHE-COMTE´ Ecole´ doctorale n°37 Sciences Pour l’Ingenieur´ et Microtechniques Doctorat d’Informatique par ANTHONY NASSAR Optimizing Resource Utilization in Distributed Computing Systems for Automotive Applications Optimisation de l’utilisation des ressources dans les systemes` informatiques distribues´ pour les applications automobiles These` present´ ee´ et soutenue publiquement le 04-02-2021 a` Belfort, devant le Jury compose´ de : MR CERIN CHRISTOPHE Professeur a` l’Universite´ Sorbonne Paris Nord President´ MR CHBEIR RICHARD Professeur a` l’Universite´ de Pau et des Pays de l’Adour Rapporteur MME BENBERNOU SALIMA Professeur a` l’Universite´ Paris-Descartes Rapporteur MR MOSTEFAOUI AHMED Maˆıtre de conferences´ a` l’Universite´ de Franche-Comte´ Directeur de these` MR DESSABLES FRANC¸ OIS Ingenieur´ chez Groupe PSA Codirecteur de these` DOCTORAL THESIS OF THE UNIVERSITY BOURGOGNE FRANCHE-COMTE´ INSTITUTION PREPARED AT UNIVERSITE´ DE FRANCHE-COMTE´ Doctoral school n°37 Engineering Sciences and Microtechnologies Computer Science Doctorate by ANTHONY NASSAR Optimizing Resource Utilization in Distributed Computing Systems for Automotive Applications Optimisation de l’utilisation des ressources dans les systemes` informatiques distribues´ pour les applications automobiles Thesis presented and publicly defended in Belfort, on 04-02-2021 Composition of the Jury : CERIN CHRISTOPHE Professor at Universite´ Sorbonne Paris Nord President
    [Show full text]
  • Storage and Ingestion Systems in Support of Stream Processing
    Storage and Ingestion Systems in Support of Stream Processing: A Survey Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María Pérez-Hernández, Radu Tudoran, Stefano Bortoli, Bogdan Nicolae To cite this version: Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María Pérez-Hernández, Radu Tudoran, et al.. Storage and Ingestion Systems in Support of Stream Processing: A Survey. [Technical Report] RT-0501, INRIA Rennes - Bretagne Atlantique and University of Rennes 1, France. 2018, pp.1-33. hal-01939280v2 HAL Id: hal-01939280 https://hal.inria.fr/hal-01939280v2 Submitted on 14 Dec 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Storage and Ingestion Systems in Support of Stream Processing: A Survey Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María S. Pérez-Hernández, Radu Tudoran, Stefano Bortoli, Bogdan Nicolae TECHNICAL REPORT N° 0501 November 2018 Project-Team KerData ISSN 0249-0803 ISRN INRIA/RT--0501--FR+ENG Storage and Ingestion Systems in Support of Stream Processing: A Survey Ovidiu-Cristian Marcu∗, Alexandru
    [Show full text]
  • A Study of Incremental Checkpointing in Distributed Stream Processing Systems
    A Study of Incremental Checkpointing in Distributed Stream Processing Systems A Thesis submitted to the designated by the General Assembly of Special Composition of the Department of Computer Science and Engineering Examination Committee by Aristidis Chronarakis in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE IN COMPUTER SCIENCE WITH SPECIALIZATION IN COMPUTER SYSTEMS University of Ioannina 2019 Examining Committee: • Kostas Magoutis, Assistant Professor, Department of Computer Science and Engineering, University of Ioannina (Supervisor) • Vassilios V. Dimakopoulos, Associate Professor, Department of Computer Sci- ence and Engineering, University of Ioannina • Evaggelia Pitoura, Professor, Department of Computer Science and Engineer- ing, University of Ioannina Dedication Dedicated to my family. Acknowledgements I would like to thank my advisor Prof. Kostas Magoutis for his guidance and support throughout my studies on the department, from the undergraduate level till the graduate. Special thanks to Prof. Vassilios Dimakopoulos and Prof. Evaggelia Pitoura for their participation as members of the examination committee. Finally, I would like to thank my family for the support and my friends for all the good moments we spent. Table of Contents List of Figures iii Abstract v Εκτεταμένη Περίληψη vi 1 Introduction 1 1.1 Objectives ................................... 2 1.2 Structure of this dissertation ......................... 3 2 Background 4 2.1 General concepts ............................... 4 2.2 Checkpoint-rollback methodology ..................... 7 2.3 Continuous eventual checkpointing (CEC) ................. 8 2.4 Apache Samza ................................ 9 2.4.1 Streams ................................ 9 2.4.2 Applications, Tasks, Containers ................... 10 2.4.3 State .................................. 11 2.4.4 Fault tolerance of stateful applications ............... 12 2.4.5 Message (tuple) replay and semantics ..............
    [Show full text]
  • Evaluating the Impact of Streaming Systems Design on Application Performance Alessio Pagliari
    Evaluating the impact of streaming systems design on application performance Alessio Pagliari To cite this version: Alessio Pagliari. Evaluating the impact of streaming systems design on application performance. Data Structures and Algorithms [cs.DS]. Université Côte d’Azur, 2021. English. NNT : 2021COAZ4011. tel-03273377 HAL Id: tel-03273377 https://tel.archives-ouvertes.fr/tel-03273377 Submitted on 29 Jun 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. THÈSE DE DOCTORAT Évaluer l'impact de la conception des systèmes de streaming sur la performance des applications Alessio PAGLIARI Laboratoire d’Informatique, Signaux et Systèmes de Sophia Antipolis (I3S) Présentée en vue de l’obtention Devant le jury, composé de : du grade de docteur en Informatique Jean-Marc Pierson, Professeur, Université Paul Sabatier Toulouse 3 d’Université Côte d’Azur Guillaume Pierre, Professeur, Université de Rennes 1 Pietro Michiardi, Professeur, Eurecom Dirigée par : Fabrice Huet / Fabrice Huet, Professeur, Université Côte d’Azur Guillaume Urvoy-Keller, Professeur,
    [Show full text]
  • An Evaluation of Real-Time Processing of Call Detail Records Using Stream Processing
    UNIVERSITY OF NAIROBI COLLEGE OF BIOLOGICAL AND PHYSICAL SCIENCES SCHOOL OF COMPUTING AND INFORMATICS An Evaluation of Real-Time Processing of Call Detail Records Using Stream Processing CATHERINE KITHUSI WAMBUA P53/73389/2014 A research project report submitted to the School of Computing and Informatics in partial fulfillment of the requirements for the award of the Degree of Masters of Science in Distributed Computing Technology at the University of Nairobi December 2017. DECLARATION I certify that this research project report to the best of my knowledge, is my original authorial work except as acknowledged therein and has not been submitted for any other degree or professional qualification award in this or any other University. Signature: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Date: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Catherine Kithusi Wambua (P53/73389/2014) This research report has been submitted in partial fulfillment of the requirements for the Degree of Master of Science in Distributed Computing Technology at the University of Nairobi with my approval as the University supervisor. Signature: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Date: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Dr. Christopher Chepken i | P a g e DEDICATION To my beloved parents, for their unrelenting dedication to ensuring that my siblings and I acquired the best education despite all odds.
    [Show full text]
  • Code Smell Prediction Employing Machine Learning Meets Emerging Java Language Constructs"
    Appendix to the paper "Code smell prediction employing machine learning meets emerging Java language constructs" Hanna Grodzicka, Michał Kawa, Zofia Łakomiak, Arkadiusz Ziobrowski, Lech Madeyski (B) The Appendix includes two tables containing the dataset used in the paper "Code smell prediction employing machine learning meets emerging Java lan- guage constructs". The first table contains information about 792 projects selected for R package reproducer [Madeyski and Kitchenham(2019)]. Projects were the base dataset for cre- ating the dataset used in the study (Table I). The second table contains information about 281 projects filtered by Java version from build tool Maven (Table II) which were directly used in the paper. TABLE I: Base projects used to create the new dataset # Orgasation Project name GitHub link Commit hash Build tool Java version 1 adobe aem-core-wcm- www.github.com/adobe/ 1d1f1d70844c9e07cd694f028e87f85d926aba94 other or lack of unknown components aem-core-wcm-components 2 adobe S3Mock www.github.com/adobe/ 5aa299c2b6d0f0fd00f8d03fda560502270afb82 MAVEN 8 S3Mock 3 alexa alexa-skills- www.github.com/alexa/ bf1e9ccc50d1f3f8408f887f70197ee288fd4bd9 MAVEN 8 kit-sdk-for- alexa-skills-kit-sdk- java for-java 4 alibaba ARouter www.github.com/alibaba/ 93b328569bbdbf75e4aa87f0ecf48c69600591b2 GRADLE unknown ARouter 5 alibaba atlas www.github.com/alibaba/ e8c7b3f1ff14b2a1df64321c6992b796cae7d732 GRADLE unknown atlas 6 alibaba canal www.github.com/alibaba/ 08167c95c767fd3c9879584c0230820a8476a7a7 MAVEN 7 canal 7 alibaba cobar www.github.com/alibaba/
    [Show full text]
  • A Novel Cloud Broker-Based Resource Elasticity Management and Pricing for Big Data Streaming Applications
    A Novel Cloud Broker-based Resource Elasticity Management and Pricing for Big Data Streaming Applications by Olubisi A. Runsewe Thesis submitted to the Faculty of Graduate and Postdoctoral Studies in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electronic Business School of Electrical Engineering and Computer Science Faculty of Engineering University of Ottawa c Olubisi A. Runsewe, Ottawa, Canada, 2019 Abstract The pervasive availability of streaming data from various sources is driving todays' enterprises to acquire low-latency big data streaming applications (BDSAs) for extracting useful information. In parallel, recent advances in technology have made it easier to collect, process and store these data streams in the cloud. For most enterprises, gaining insights from big data is immensely important for maintaining competitive advantage. However, majority of enterprises have difficulty managing the multitude of BDSAs and the complex issues cloud technologies present, giving rise to the incorporation of cloud service brokers (CSBs). Generally, the main objective of the CSB is to maintain the heterogeneous quality of service (QoS) of BDSAs while minimizing costs. To achieve this goal, the cloud, al- though with many desirable features, exhibits major challenges | resource prediction and resource allocation | for CSBs. First, most stream processing systems allocate a fixed amount of resources at runtime, which can lead to under- or over-provisioning as BDSA demands vary over time. Thus, obtaining optimal trade-off between QoS violation and cost requires accurate demand prediction methodology to prevent waste, degradation or shutdown of processing. Second, coordinating resource allocation and pricing decisions for self-interested BDSAs to achieve fairness and efficiency can be complex.
    [Show full text]
  • Parte I Studio Delle Tecnologie Utili Per L'analisi, L'elaborazione E L'interrogazione Di Big Data
    UNIVERSITÀ DEGLI STUDI DI MODENA E REGGIO EMILIA Dipartimento di Scienze Fisiche, Informatiche e Matematiche Corso di Laurea in Informatica Titolo Tesi Progettazione e sviluppo di un’applicazione Big Data per l’analisi e l’elaborazione di tweet in real-time RELATORE CANDIDATO Chiar.mo Professore Alessandro Pillo Riccardo Martoglia MATR. 111759 Anno Accademico 2018/2019 Indice Introduzione ………………………………….………………………………… pag. 1 Parte I Studio delle tecnologie utili per l’analisi, l’elaborazione e l’interrogazione di Big Data Capitolo I - “Ecosistema Hadoop” 1.1 Big Data ….………………………………….……………….………… pag. 3 1.2 Analisi derivate dalla figura ………………….…………….…….…… pag. 5 1.3 Processing layer …………………….……………………….………….. pag. 6 1.4 Distributed data processing & programming .…………….….…………. pag. 7 1.5 Sistemi analizzati in tabella .…………….……………..…….……..…… pag. 11 1.5.1 Apache Hadoop ……….….……………..………………………. pag. 11 1.5.2 Apache Apex ……….……….…………..………..……….….….. pag. 17 1.5.3 Apache Beam ……….…………………..…………………….…. pag. 20 1.5.4 Apache Flink……….…………………………………………..… pag. 24 1.5.5 Apache Samza …….……………..……………..……………….. pag. 30 1.5.6 Apache Spark .…….……………..……………………………… pag. 34 1.5.7 Apache Storm…….……………..……………………………….. pag. 38 1.5.8 Apache Tez .………….……………..……………………….…… pag. 40 1.5.9 Google MillWheel….……………..…………………………….. pag. 41 1.5.10 Google Cloud Dataflow…………..……………………………… pag. 41 1.5.11 IBM InfoSphere Streams ..………..………………………..…… pag. 43 1.5.12 Twitter Heron ………..………………..………………………… pag. 44 !I Capitolo II - “Machine Learning” 2.1 Introduzione al ML ………………………….………………………….. pag. 46 2.2 Algoritmi di machine learning …….…………………………………….. pag. 46 2.2.1 Training e Test Dataset ……….………………………………….. pag. 46 2.2.2 Fitting del modello: underfitting e overfitting ……………….. pag. 47 2.2.3 Apprendimento ………………………………………………….. pag. 49 2.3 Machine learning: “tradizionale” e “online” ………………………….… pag. 50 Parte II Studio di un’applicazione reale Capitolo III - “Applicazione reale: progetto” 3.1 Descrizione della realtà da analizzare ………………………………..….
    [Show full text]