Managing and Understanding Distributed Stream Processing

Total Page:16

File Type:pdf, Size:1020Kb

Managing and Understanding Distributed Stream Processing DISS. ETH No. 25928 Managing and understanding distributed stream processing A thesis submitted to attain the degree of Doctor of Sciences of ETH Zurich (Dr. sc. ETH Zurich) presented by Moritz Hoffmann MSc ETH in Computer Science, ETH Zurich born on January 3rd, 1986 citizen of the Federal Republic of Germany accepted on the recommendation of Prof. Dr. Timothy Roscoe (ETH Zurich), examiner Prof. Dr. Gustavo Alonso (ETH Zurich), co-examiner Prof. Dr. Peter Pietzuch (Imperial College London), co-examiner Dr. Frank McSherry (ETH Zurich), co-examiner 2019 Abstract Present-day computing systems have to deal with a continuous growth of data rate and volume. Processing these workloads should introduce as little latency as possible. Today’s stream processors promise to handle large volumes of data while providing low-latency query results. In practice however, their computational model, variations of the workload, and the lack of tools for programmers can lead to situations where the latency increases significantly. The reason for this lies in the design of today’s stream processing systems. Specifically, stream processors do not supply meaningful information for de- bugging root causes of latency problems. Additionally, they have inadequate controllers to automate resource management based on workload properties and requirements of the computation. Lastly, their reconfiguration mechanisms are not compatible with low-latency query processing and cause a significant latency increase while reconfiguring. Solving these problems is crucial to enable users, programmers and operators to maximize the effectiveness of stream processing. To solve these problems, we present three complementing solutions. We first propose a method to identify factors influencing query latency, using detailed internal measurements combined with knowledge of the computational model of the stream processor. Then, we use system-intrinsic measurements to design and implement an automatic scaling controller for scale-out distributed stream processors. It makes fast and accurate scaling decisions with minimum delay. Lastly, we propose a scaling mechanism that reduces downtime during reconfigurations by orders of magnitude. The mechanism achieves this by interleaving fine-grained configuration updates and data processing. The solutions presented in this thesis help designers of stream processors to better optimize for low-latency processing, and users to increase their query performance by providing better metrics and automating operational aspects. We think this is an important step towards efficient stream processing. i Zusammenfassung Die Datenmenge und die Datenfrequenz, mit der aktuelle Computersysteme umgehen müssen, nimmt ständig zu, während zugleich Abfragen auf diese Datenströme mit geringster Latenz bearbeitet werden sollen. Moderne Stream- Prozessoren versprechen zwar, große Datenvolumen mit geringer Abfragelatenz bearbeiten zu können. In der Realität führen die internen Strukturen der Daten- verarbeitung, Schwankungen der Arbeitslast und ein Mangel an Hilfsmitteln für Programmierer jedoch zu Situationen, in denen die Abfragelatenz signifikant ansteigt. Der Grund dafür liegt im Design moderner Stream-Prozessoren. Diese stel- len keine aussagekräftigen Informationen über den Status der Abfragen sowie den Ursprung von Latenzproblemen zur Verfügung. Ihnen fehlen Kontroll- mechanismen, die eine automatisierte Ressourcenzuordnung auf Grundlage der Arbeitslast und der Abfrageeigenschaften möglich machen könnten. Auch bieten sie keine Möglichkeiten, Konfigurationsänderungen ohne eine deutliche Erhöhung der Latenz anzuwenden. Um diese Probleme zu beheben, stellen wir drei sich ergänzende Lösungen vor. Als erstes zeigen wir eine Methode, die es unter Nutzung der internen Struk- tur des Stream-Prozessors und mit Hilfe detaillierter Messungen ermöglicht, genau zu identifizieren, wo Latenz entsteht. Auf der Basis von systemspezi- fischen Messungen entwickeln wir zweitens eine automatische Steuerung zur Skalierung verteilter Stream-Prozessoren, die schnell und präzise skaliert, und dabei die Abfragelatenz möglichst wenig erhöht. Schließlich stellen wir einen Mechanismus vor, der Konfigurationsänderungen in kleine Teile aufspaltet und während der Berechnung von Abfragen anwendet, anstatt der üblichen Methode, bei der die Berechnung angehalten und neu gestartet werden muss. Dieser Ansatz reduziert die Ausfallzeit während der Änderung von Konfigurationen stark. iii Zusammenfassung Die in dieser Arbeit präsentierten Lösungen helfen Entwicklern von Stream- Prozessoren, diese auf geringe Latenz zu optimieren, und erlauben Nutzern, die Latenz ihrer Anfragen durch bessere Messdaten und Automatisierung zu optimieren. Wir sind überzeugt, dass sie wichtige Beiträge zur Steigerung der Effizienz von Stream-Prozessoren darstellen. iv Acknowledgments The work presented in this dissertation would not have been possible without the collaboration and interactions with wonderful friends and colleagues. I would like to thank my advisor Timothy Roscoe for always supporting me, believing in me, and guiding me during the course of my doctoral studies. I would also like to thank Gustavo Alonso for co-advising my dissertation. Thanks to you, Peter Pietzuch, for taking part in my committee and the valuable feedback that you have given me. Thank you, Frank McSherry, for hours of interesting discussions, brain-storming sessions and your positive attitude to turn interesting ideas into successful research projects. During my internship I had the opportunity to collaborate with many incred- ibly smart collegues, which helped me to shape the structure of my dissertation: Adrian L. Shaw, Alexander Richardson, Chris Dalton, Dejan Milojicic, Geof- frey Ndu, Paolo Faraboschi, and Robert N. M. Watson. A special thanks goes to all friends and collaborators from the Systems Group and the Strymon project: Andrea, Claude, Desi, Gerd, Ghislain, Ingo, John, Lefteris, Lukas, Matthew, Michal, Pratanu, Pravin, Renato, Reto, Roni, Sebastian, Simon, Stefan, Vasia, and Zaheer. You all made it a pleasure to be part of the group during the last five years. Also thanks to all my friends for bearing with me during the time of writing, and especially to Florian for designing the great cover page. For many great things in my life, I would like to thank my parents Brigitte and Hartmut. Your support, motivation and courage helped me through all my studies and allowed me to pursue my path. Finally, special thanks to you, Marina, for standing by my side and your patience during difficult times. v Contents Abstracti Zusammenfassung iii Acknowledgmentsv 1 Introduction1 1.1 Structure of the dissertation . .3 1.2 Notes on collaborative work . .5 2 Background and motivation7 2.1 Stream processing concepts . .7 2.2 Expressing queries as dataflows . .8 2.3 A critical history of stream processing . 11 2.3.1 First generation stream processors . 12 2.3.2 Second generation stream processors . 12 2.3.3 MapReduce: large-scale computations . 15 2.3.4 Third-generation stream processors . 16 2.4 Timely dataflow concepts . 20 2.5 Related work: Distributed systems performance analysis . 21 2.6 Related work: Scaling controllers for distributed stream pro- cessors . 24 2.7 Related work: Applying configuration updates . 28 2.7.1 State migration in streaming systems . 29 2.7.2 Live migration in database systems . 31 2.7.3 Live migration for streaming dataflows . 31 3 Snailtrail 33 3.1 Critical path analysis background . 36 vii Contents 3.2 Online critical path analysis . 38 3.2.1 Transient critical paths . 38 3.2.2 Critical participation (CP metric) . 41 3.2.3 Comparison with existing methods . 44 3.3 Applicability to dataflow systems . 45 3.3.1 Activity types . 46 3.3.2 Instrumenting specific systems . 47 3.3.3 Model assumptions . 48 3.3.4 Instrumentation requirements . 49 3.4 Program activity graph construction . 50 3.5 Snailtrail system implementation . 53 3.6 CP-based performance summaries . 56 3.7 Evaluation . 59 3.7.1 Experimental setting . 59 3.7.2 Instrumentation overhead . 60 3.7.3 Snailtrail performance . 61 3.7.4 Comparison with existing methods . 62 3.7.5 Snailtrail in practice . 64 3.8 Critical participation: conclusion . 68 4 DS2: Controlling distributed streaming dataflows 69 4.1 Background and motivation . 71 4.2 The DS2 model . 72 4.2.1 Problem definition . 73 4.2.2 Performance model . 73 4.2.3 Assumptions . 79 4.2.4 Properties . 80 4.3 Implementation and deployment . 82 4.3.1 Instrumentation requirements . 82 4.3.2 Integration with stream processors . 83 4.3.3 DS2 and execution models . 86 4.4 Experimental evaluation . 87 4.4.1 Setup . 87 4.4.2 DS2 compared to Dhalion on Heron . 88 4.4.3 DS2 on Flink . 89 4.4.4 Convergence . 91 4.4.5 Accuracy . 92 4.4.6 Instrumentation overhead . 93 viii Contents 4.5 DS2: conclusion . 97 5 Megaphone 99 5.1 State migration design . 101 5.1.1 Migration formalism and guarantees . 101 5.1.2 Configuration updates . 103 5.1.3 Megaphone’s mechanism . 104 5.1.4 Example . 106 5.2 Implementation . 108 5.2.1 Megaphone’s operator interface . 109 5.2.2 State organization . 110 5.2.3 Timely Dataflow instantiation . 113 5.2.3.1 Monitoring output frontiers . 113 5.2.3.2 Capturing Timely Dataflow idioms . 113 5.2.4 Discussion . 114 5.3 Evaluation . 116 5.3.1 NEXMark benchmark . 117 5.3.2 Overhead of the interface . 124 5.3.3 Migration micro-benchmarks . 125 5.3.3.1 Number of bins vary . 129 5.3.3.2 Number of keys vary . 131 5.3.3.3 Number of keys and bins vary proportionally 132 5.3.3.4 Throughput versus processing latency . 133 5.3.3.5 Memory
Recommended publications
  • Network Traffic Profiling and Anomaly Detection for Cyber Security
    Network traffic profiling and anomaly detection for cyber security Laurens D’hooge Student number: 01309688 Supervisors: Prof. dr. ir. Filip De Turck, dr. ir. Tim Wauters Counselors: Prof. dr. Bruno Volckaert, dr. ir. Tim Wauters A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Master of Science in Information Engineering Technology Academic year: 2017-2018 Acknowledgements This thesis is the result of 4 months work and I would like to express my gratitude towards the people who have guided me throughout this process. First and foremost I’d like to thank my thesis advisors prof. dr. Bruno Volckaert and dr. ir. Tim Wauters. By virtue of their knowledge and clear communication, I was able to maintain a clear target. Secondly I would like to thank prof. dr. ir. Filip De Turck for providing me the opportunity to conduct research in this field with the IDLab research group. Special thanks to Andres Felipe Ocampo Palacio and dr. Marleen Denert are in order as well. Mr. Ocampo’s Phd research into big data processing for network traffic and the resulting framework are an integral part of this thesis. Ms. Denert has been the go-to member of the faculty staff for general advice and administrative dealings. The final token of gratitude I’d like to extend to my family and friends for their continued support during this process. Laurens D’hooge Network traffic profiling and anomaly detection for cyber security Laurens D’hooge Supervisor(s): prof. dr. ir. Filip De Turck, dr. ir. Tim Wauters Abstract— This article is a short summary of the research findings of a creation of APT2.
    [Show full text]
  • Optimizing Resource Utilization in Distributed Computing Systems For
    THESE` DE DOCTORAT DE L’ETABLISSEMENT´ UNIVERSITE´ BOURGOGNE FRANCHE-COMTE´ PREPAR´ EE´ A` L’UNIVERSITE´ DE FRANCHE-COMTE´ Ecole´ doctorale n°37 Sciences Pour l’Ingenieur´ et Microtechniques Doctorat d’Informatique par ANTHONY NASSAR Optimizing Resource Utilization in Distributed Computing Systems for Automotive Applications Optimisation de l’utilisation des ressources dans les systemes` informatiques distribues´ pour les applications automobiles These` present´ ee´ et soutenue publiquement le 04-02-2021 a` Belfort, devant le Jury compose´ de : MR CERIN CHRISTOPHE Professeur a` l’Universite´ Sorbonne Paris Nord President´ MR CHBEIR RICHARD Professeur a` l’Universite´ de Pau et des Pays de l’Adour Rapporteur MME BENBERNOU SALIMA Professeur a` l’Universite´ Paris-Descartes Rapporteur MR MOSTEFAOUI AHMED Maˆıtre de conferences´ a` l’Universite´ de Franche-Comte´ Directeur de these` MR DESSABLES FRANC¸ OIS Ingenieur´ chez Groupe PSA Codirecteur de these` DOCTORAL THESIS OF THE UNIVERSITY BOURGOGNE FRANCHE-COMTE´ INSTITUTION PREPARED AT UNIVERSITE´ DE FRANCHE-COMTE´ Doctoral school n°37 Engineering Sciences and Microtechnologies Computer Science Doctorate by ANTHONY NASSAR Optimizing Resource Utilization in Distributed Computing Systems for Automotive Applications Optimisation de l’utilisation des ressources dans les systemes` informatiques distribues´ pour les applications automobiles Thesis presented and publicly defended in Belfort, on 04-02-2021 Composition of the Jury : CERIN CHRISTOPHE Professor at Universite´ Sorbonne Paris Nord President
    [Show full text]
  • Evaluating the Impact of Streaming Systems Design on Application Performance Alessio Pagliari
    Evaluating the impact of streaming systems design on application performance Alessio Pagliari To cite this version: Alessio Pagliari. Evaluating the impact of streaming systems design on application performance. Data Structures and Algorithms [cs.DS]. Université Côte d’Azur, 2021. English. NNT : 2021COAZ4011. tel-03273377 HAL Id: tel-03273377 https://tel.archives-ouvertes.fr/tel-03273377 Submitted on 29 Jun 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. THÈSE DE DOCTORAT Évaluer l'impact de la conception des systèmes de streaming sur la performance des applications Alessio PAGLIARI Laboratoire d’Informatique, Signaux et Systèmes de Sophia Antipolis (I3S) Présentée en vue de l’obtention Devant le jury, composé de : du grade de docteur en Informatique Jean-Marc Pierson, Professeur, Université Paul Sabatier Toulouse 3 d’Université Côte d’Azur Guillaume Pierre, Professeur, Université de Rennes 1 Pietro Michiardi, Professeur, Eurecom Dirigée par : Fabrice Huet / Fabrice Huet, Professeur, Université Côte d’Azur Guillaume Urvoy-Keller, Professeur,
    [Show full text]
  • Apache Samza
    Apache Samza Martin Kleppmann Definition vehicles, or the writes of records to a database. Apache Samza is an open source frame- Stream processing jobs are long- work for distributed processing of high- running processes that continuously volume event streams. Its primary design consume one or more event streams, goal is to support high throughput for a invoking some application logic on wide range of processing patterns, while every event, producing derived output providing operational robustness at the streams, and potentially writing output massive scale required by Internet com- to databases for subsequent querying. panies. Samza achieves this goal through While a batch process or a database a small number of carefully designed ab- query typically reads the state of a stractions: partitioned logs for messag- dataset at one point in time, and then ing, fault-tolerant local state, and cluster- finishes, a stream processor is never based task scheduling. finished: it continually awaits the arrival of new events, and it only shuts down when terminated by an administrator. Many tasks can be naturally ex- Overview pressed as stream processing jobs, for example: Stream processing is playing an increas- • aggregating occurrences of events, ingly important part of the data man- e.g., counting how many times a agement needs of many organizations. particular item has been viewed; Event streams can represent many kinds • computing the rate of certain events, of data, for example, the activity of users e.g., for system diagnostics, report- on a website, the movement of goods or ing, and abuse prevention; 1 2 Martin Kleppmann • enriching events with information the scalability of Samza is directly at- from a database, e.g., extending user tributable to the choice of these founda- click events with information about tional abstractions.
    [Show full text]
  • A Novel Cloud Broker-Based Resource Elasticity Management and Pricing for Big Data Streaming Applications
    A Novel Cloud Broker-based Resource Elasticity Management and Pricing for Big Data Streaming Applications by Olubisi A. Runsewe Thesis submitted to the Faculty of Graduate and Postdoctoral Studies in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electronic Business School of Electrical Engineering and Computer Science Faculty of Engineering University of Ottawa c Olubisi A. Runsewe, Ottawa, Canada, 2019 Abstract The pervasive availability of streaming data from various sources is driving todays' enterprises to acquire low-latency big data streaming applications (BDSAs) for extracting useful information. In parallel, recent advances in technology have made it easier to collect, process and store these data streams in the cloud. For most enterprises, gaining insights from big data is immensely important for maintaining competitive advantage. However, majority of enterprises have difficulty managing the multitude of BDSAs and the complex issues cloud technologies present, giving rise to the incorporation of cloud service brokers (CSBs). Generally, the main objective of the CSB is to maintain the heterogeneous quality of service (QoS) of BDSAs while minimizing costs. To achieve this goal, the cloud, al- though with many desirable features, exhibits major challenges | resource prediction and resource allocation | for CSBs. First, most stream processing systems allocate a fixed amount of resources at runtime, which can lead to under- or over-provisioning as BDSA demands vary over time. Thus, obtaining optimal trade-off between QoS violation and cost requires accurate demand prediction methodology to prevent waste, degradation or shutdown of processing. Second, coordinating resource allocation and pricing decisions for self-interested BDSAs to achieve fairness and efficiency can be complex.
    [Show full text]
  • Configuring Data Stream Processing
    Numéro National de Thèse : 2019LYSEN050 THESE de DOCTORAT DE L’UNIVERSITE DE LYON opérée par l’École Normale Supérieure de Lyon École Doctorale N◦512 Ecole Doctorale en Informatique et Mathématiques de Lyon Discipline : Informatique Soutenue publiquement le 23/09/2019, par : Alexandre DA SILVA VEITH Quality of Service Aware Mechanisms for (Re)Configuring Data Stream Processing Applications on Highly Distributed Infrastructure Mécanismes prenant en compte la qualité de service pour la (re)configuration d’applications de traitement de flux de données sur une infrastructure hautement distribuée Devant le jury composé de : Omer RANA Professeur Université de Cardiff (UK) Rapporteur Patricia STOLF Maître de Conférences Université Paul Sabatier Rapporteure Hélène COULLON Maître de Conférences IMT Atlantique Examinatrice Frédéric DESPREZ Directeur de Recherche Inria Examinateur Guillaume PIERRE Professeur Université Rennes 1 Examinateur Laurent LEFEVRE Chargé de Recherche Inria Directeur Marcos DIAS DE ASSUNCAO Chargé de Recherche Inria Co-encadrant ii Contents Acknowledgments vii French Abstract ix 1 Introduction 1 1.1 Challenges in Data Stream Processing (DSP) Applications Deployment . .3 1.1.1 Challenges in Edge Computing . .4 1.1.2 Challenges in Cloud Computing . .4 1.1.3 Challenges in DSP Operator Placement . .4 1.1.4 Challenges in DSP Application Reconfiguration . .4 1.2 Research Problem and Objectives . .5 1.3 Evaluation Methodology . .5 1.4 Thesis Contribution . .6 1.5 Thesis Organisation . .6 2 State-of-the-art and Positioning 9 2.1 Introduction . .9 2.2 Data Stream Processing Architecture and Elasticity . 10 2.2.1 Online Data Processing Architecture . 10 2.2.2 Data Streams and Models .
    [Show full text]
  • A Comprehensive Solution for Research-Oriented Cloud Computing
    A Comprehensive Solution for Research-Oriented Cloud Computing B Mevlut A. Demir( ), Weslyn Wagner, Divyaansh Dandona, and John J. Prevost The University of Texas at San Antonio, One UTSA Circle, San Antonio, TX 78249, USA {mevlut.demir,jeff.prevost}@utsa.edu, {weslyn.wagner,ubm700}@my.utsa.edu Abstract. Cutting edge research today requires researchers to perform computationally intensive calculations and/or create models and simu- lations using large sums of data in order to reach research-backed con- clusions. As datasets, models, and calculations increase in size and scope they present a computational and analytical challenge to the researcher. Advances in cloud computing and the emergence of big data analytic tools are ideal to aid the researcher in tackling this challenge. Although researchers have been using cloud-based software services to propel their research, many institutions have not considered harnessing the Infras- tructure as a Service model. The reluctance to adopt Infrastructure as a Service in academia can be attributed to many researchers lacking the high degree of technical experience needed to design, procure, and manage custom cloud-based infrastructure. In this paper, we propose a comprehensive solution consisting of a fully independent cloud automa- tion framework and a modular data analytics platform which will allow researchers to create and utilize domain specific cloud solutions irrespec- tive of their technical knowledge, reducing the overall effort and time required to complete research. Keywords: Automation · Cloud computing · HPC Scientific computing · SolaaS 1 Introduction The cloud is an ideal data processing platform because of its modularity, effi- ciency, and many possible configurations. Nodes in the cloud can be dynamically spawned, configured, modified, and destroyed as needed by the application using pre-defined application programming interfaces (API).
    [Show full text]
  • Configuring Data Stream Processing Applications on Highly Distributed Infrastructure Alexandre Da Silva Veith
    Quality of Service Aware Mechanisms for (Re)Configuring Data Stream Processing Applications on Highly Distributed Infrastructure Alexandre da Silva Veith To cite this version: Alexandre da Silva Veith. Quality of Service Aware Mechanisms for (Re)Configuring Data Stream Processing Applications on Highly Distributed Infrastructure. Distributed, Parallel, and Cluster Com- puting [cs.DC]. Université de Lyon, 2019. English. NNT : 2019LYSEN050. tel-02385744v2 HAL Id: tel-02385744 https://hal.inria.fr/tel-02385744v2 Submitted on 20 Jan 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Numéro National de Thèse : 2019LYSEN050 THESE de DOCTORAT DE L’UNIVERSITE DE LYON opérée par l’École Normale Supérieure de Lyon École Doctorale N◦512 Ecole Doctorale en Informatique et Mathématiques de Lyon Discipline : Informatique Soutenue publiquement le 23/09/2019, par : Alexandre DA SILVA VEITH Quality of Service Aware Mechanisms for (Re)Configuring Data Stream Processing Applications on Highly Distributed Infrastructure Mécanismes prenant en compte la qualité de service
    [Show full text]
  • A Survey of State Management in Big Data Processing Systems
    A Survey of State Management in Big Data Processing Systems Quoc-Cuong To1 Juan Soto1,2 Volker Markl1,2 1 German Research Center for 2 Technische Universität Berlin Artificial Intelligence (DFKI) FG DIMA, Sekr. EN-7 Raum 728, Alt-Moabit 91c Einsteinufer 17 10559 Berlin, Germany 10587 Berlin, Germany [email protected] [email protected] [email protected] ABSTRACT graphs or trees (in the absence of iterations or shared results). From The concept of state and its applications vary widely across big data this perspective, the analysis results are the roots, operators are the processing systems. This is evident in both the research literature intermediate nodes, and data are the leaves. Each operator node and existing systems, such as Apache Flink, Apache Heron, Apache performs an operation that transforms inputs flowing through it into Samza, Apache Spark, and Apache Storm. Given the pivotal role outputs. Data flows from the leaves through the operator nodes to that state management plays, particularly, for iterative batch and the roots. stream processing, in this survey, we present examples of state as Operators come in two varieties. Stateless operators are an enabler, discuss the alternative approaches used to handle and purely functional and they produce output, solely based on their implement state, capture the many facets of state management, and input. Examples of stateless operators include relational selection, highlight new research directions. Our aim is to provide insight into relational projection without duplicate elimination, or merging two disparate state management techniques, motivate others to pursue inputs. In contrast, stateful operators compute their output on a research in this area, and draw attention to open problems.
    [Show full text]
  • The Evolution of Smart Stream Processing Architecture Enhancing the Kappa Architecture for Stateful Streams
    WHITEPAPER The Evolution of Smart Stream Processing Architecture Enhancing the Kappa Architecture for Stateful Streams Background Real world streaming use cases are pushing the boundaries of traditional data analytics. Up until now, the shift from bounded to unbounded was occurring at a much slower pace. 5G, however, is promising to revolutionize the connectivity of not just people but also things such as industrial machinery, residential appliances, medical devices, utility meters and more. By the year 2021, Gartner estimates that a total of 25.1 billion edge devices will be installed worldwide. With sub-millisecond latency guarantees and over 1MM devices/km2 stipulated in the 5G standard, companies are being forced to operate in a hyper-connected world. The capabilities of 5G will enable a new set of real-time applications that will now leverage the low latency ubiquitous networking to create many new use cases around Industrial & Medical IoT, AR/VR, immersive gaming and more. All this will put a massive load on traditional data management architectures. The primary reason for the break in existing architecture patterns stems from the fact that we are moving from managing and analyzing bounded data sets, to processing and making decisions on unbounded data streams. Typically, bound data has a known ending point and is relatively fixed. An example of bounded data would be subscriber account data at a Communications Service Provider (CSP). Analyzing and acting on bounded datasets is relatively easy. Traditional relational databases can solve most bounded data problems. On the other hand, unbound data is infinite and not always sequential. For example, monitoring temperature and pressure at various points in a manufac- turing facility is unbounded data.
    [Show full text]
  • A Quick and Flexible Stream Processing Application Prototype Generator Alessio Pagliari, Fabrice Huet, Guillaume Urvoy-Keller
    NAMB: A Quick and Flexible Stream Processing Application Prototype Generator Alessio Pagliari, Fabrice Huet, Guillaume Urvoy-Keller To cite this version: Alessio Pagliari, Fabrice Huet, Guillaume Urvoy-Keller. NAMB: A Quick and Flexible Stream Pro- cessing Application Prototype Generator. The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2020, Melbourne, Australia. hal-02483008 HAL Id: hal-02483008 https://hal.archives-ouvertes.fr/hal-02483008 Submitted on 18 Feb 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. NAMB: A Quick and Flexible Stream Processing Application Prototype Generator Alessio Pagliari, Fabrice Huet, Guillaume Urvoy-Keller Universite Cote d’Azur, CNRS, I3S falessio.pagliari, fabrice.huet, [email protected] Abstract—The importance of Big Data is nowadays established, implementation designs may be costly and time-consuming. both in industry and research fields, especially stream processing Current works commonly use test applications as benchmarks for its capability to analyze continuous data streams and provide or mocks of production applications. Most of the available test statistics in real-time. Several data stream processing (DSP) platforms exist like the Storm, Flink, Spark Streaming and applications tend to be bounded to the platform or scenario Heron Apache projects, or industrial products such as Google they are evaluated on.
    [Show full text]
  • DEPENDABLE CLOUD RESOURCES for BIG-DATA BATCH PROCESSING & STREAMING FRAMEWORKS by Bara Abusalah
    DEPENDABLE CLOUD RESOURCES FOR BIG-DATA BATCH PROCESSING & STREAMING FRAMEWORKS by Bara Abusalah A Dissertation Submitted to the Faculty of Purdue University In Partial Fulfillment of the Requirements for the degree of Doctor of Philosophy School of Electrical and Computer Engineering West Lafayette, Indiana May 2021 THE PURDUE UNIVERSITY GRADUATE SCHOOL STATEMENT OF COMMITTEE APPROVAL Prof. Arif Ghafoor, Chair Electrical and Computer Engineering Department Prof. Patrick Eugster Computer Science Department Prof. Samuel Midkiff Electrical and Computer Engineering Department Prof. Walid Aref Computer Science Department Approved by: Prof. Dimitrios Peroulis 2 TABLE OF CONTENTS LIST OF FIGURES .................................... 7 ABSTRACT ........................................ 9 1 INTRODUCTION ................................... 11 1.1 Failures in Cloud Computing & Big Data Applications ............ 12 1.2 Observations/Motivations ............................ 13 Streaming & Batch Frameworks Recovery Time ............ 13 Limitations of Checkpointing ...................... 13 Redundant Work ............................. 14 Byzantine vs Crash Failures ....................... 15 Models of Consistency in Streaming Frameworks ........... 15 Cheaper Hardware & Idle Machines ................... 16 1.3 Objectives based on Observations ........................ 17 1.3.1 Multi Framework Approach ....................... 17 1.3.2 Minimum Recovery Time Possible ................... 18 1.3.3 Flexibility, Customizability & Reusability ............... 20
    [Show full text]