Kafka: The Definitive Guide - Real-Time Data and Stream Processing at Scale

Total pages: 16
File type: PDF, size: 1020 KB

Kafka: The Definitive Guide - Real-Time Data and Stream Processing at Scale, Second Edition
By Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty

With Early Release ebooks, you get books in their earliest form (the authors' raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles.

Copyright © 2022 Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty. All rights reserved. Printed in the United States of America. Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Jess Haberman
Development Editor: Gary O'Brien
Production Editor: Kate Galloway
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

July 2017: First Edition
October 2021: Second Edition

Revision History for the Early Release:
2020-05-22: First Release
2020-06-22: Second Release
2020-07-22: Third Release
2020-09-01: Fourth Release
2020-10-21: Fifth Release
2020-11-20: Sixth Release
2021-02-04: Seventh Release
2021-03-29: Eighth Release
2021-04-13: Ninth Release
2021-06-15: Tenth Release
2021-07-20: Eleventh Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492043089 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Kafka: The Definitive Guide, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

ISBN: 978-1-492-04301-0

Table of Contents

1. Meet Kafka
   Publish/Subscribe Messaging; How It Starts; Individual Queue Systems; Enter Kafka; Messages and Batches; Schemas; Topics and Partitions; Producers and Consumers; Brokers and Clusters; Multiple Clusters; Why Kafka?; Multiple Producers; Multiple Consumers; Disk-Based Retention; Scalable; High Performance; The Data Ecosystem; Use Cases; Kafka's Origin; LinkedIn's Problem; The Birth of Kafka; Open Source; Commercial Engagement; The Name; Getting Started with Kafka

2. Installing Kafka
   Environment Setup; Choosing an Operating System; Installing Java; Installing Zookeeper; Installing a Kafka Broker; Broker Configuration; General Broker; Topic Defaults; Hardware Selection; Disk Throughput; Disk Capacity; Memory; Networking; CPU; Kafka in the Cloud; Kafka Clusters; How Many Brokers?; Broker Configuration; OS Tuning; Production Concerns; Garbage Collector Options; Datacenter Layout; Colocating Applications on Zookeeper; Summary

3. Kafka Producers: Writing Messages to Kafka
   Producer Overview; Constructing a Kafka Producer; Sending a Message to Kafka; Sending a Message Synchronously; Sending a Message Asynchronously; Configuring Producers; client.id; acks; Message Delivery Time; linger.ms; compression.type; batch.size; max.in.flight.requests.per.connection; max.request.size; receive.buffer.bytes and send.buffer.bytes; enable.idempotence; Serializers; Custom Serializers; Serializing Using Apache Avro; Using Avro Records with Kafka; Partitions; Headers; Interceptors; Quotas and Throttling; Summary

4. Kafka Consumers: Reading Data from Kafka
   Kafka Consumer Concepts; Consumers and Consumer Groups; Consumer Groups and Partition Rebalance; Static Group Membership; Creating a Kafka Consumer; Subscribing to Topics; The Poll Loop; Configuring Consumers; fetch.min.bytes; fetch.max.wait.ms; fetch.max.bytes; max.poll.records; max.partition.fetch.bytes; session.timeout.ms and heartbeat.interval.ms; max.poll.interval.ms; default.api.timeout.ms; request.timeout.ms; auto.offset.reset; enable.auto.commit; partition.assignment.strategy; client.id; client.rack; group.instance.id; receive.buffer.bytes and send.buffer.bytes; offsets.retention.minutes; Commits and Offsets; Automatic Commit; Commit Current Offset; Asynchronous Commit; Combining Synchronous and Asynchronous Commits; Commit Specified Offset; Rebalance Listeners; Consuming Records with Specific Offsets; But How Do We Exit?; Deserializers; Custom Deserializers; Using Avro Deserialization with the Kafka Consumer; Standalone Consumer: Why and How to Use a Consumer Without a Group; Summary

5. Managing Apache Kafka Programmatically
   AdminClient Overview; Asynchronous and Eventually Consistent API; Options; Flat Hierarchy; Additional Notes; AdminClient Lifecycle: Creating, Configuring, and Closing; client.dns.lookup; request.timeout.ms; Essential Topic Management; Configuration Management; Consumer Group Management; Exploring Consumer Groups; Modifying Consumer Groups; Cluster Metadata; Advanced Admin Operations; Adding Partitions to a Topic; Deleting Records from a Topic; Leader Election; Reassigning Replicas; Testing; Summary

6. Kafka Internals
   Cluster Membership; The Controller; KRaft: Kafka's New Raft-Based Controller; Replication; Request Processing; Produce Requests; Fetch Requests; Other Requests; Physical Storage; Tiered Storage; Partition Allocation; File Management; File Format; Indexes; Compaction; How Compaction Works; Deleted Events; When Are Topics Compacted?; Summary

7. Reliable Data Delivery
   Reliability Guarantees; Replication; Broker Configuration; Replication Factor; Unclean Leader Election; Minimum In-Sync Replicas; Keeping Replicas In Sync; Persisting to Disk; Using Producers in a Reliable System; Send Acknowledgments; Configuring Producer Retries; Additional Error Handling; Using Consumers in a Reliable System; Important Consumer Configuration Properties for Reliable Processing; Explicitly Committing Offsets in Consumers; Validating System Reliability; Validating Configuration; Validating Applications; Monitoring Reliability in Production; Summary

8. Exactly Once Semantics
   Idempotent Producer; How Does the Idempotent Producer Work?; Limitations of the Idempotent Producer; How Do I Use the Kafka Idempotent Producer?; Transactions; Use Cases; What Problems Do Transactions Solve?; How Do Transactions Guarantee Exactly Once?; What Problems Aren't Solved by Transactions?; How Do I Use Transactions?; Transactional IDs and Fencing; How Transactions Work; Performance of Transactions; Summary

9. Building Data Pipelines
   Considerations When Building Data Pipelines; Timeliness; Reliability; High and Varying Throughput; Data Formats; Transformations; Security; Failure Handling; Coupling and Agility; When to Use Kafka Connect Versus Producer and Consumer; Kafka Connect; Running Connect; Connector Example: File Source and File Sink; Connector Example: MySQL to Elasticsearch; Single Message Transformations; A Deeper Look at Connect; Alternatives to Kafka Connect; Ingest Frameworks for Other Datastores; GUI-Based ETL Tools; Stream-Processing Frameworks; Summary

10. Cross-Cluster Data Mirroring
   Use Cases of Cross-Cluster Mirroring; Multicluster Architectures; Some Realities of Cross-Datacenter Communication; Hub-and-Spokes Architecture; Active-Active Architecture; Active-Standby Architecture; Stretch Clusters; Apache Kafka's MirrorMaker; How to Configure; Multicluster Replication Topology; Securing MirrorMaker; Deploying MirrorMaker in Production; Tuning MirrorMaker; Other Cross-Cluster Mirroring Solutions; Uber uReplicator; LinkedIn Brooklin; Confluent Cross-Datacenter Mirroring Solutions; Summary

11. Securing Kafka
   Locking Down Kafka; Security Protocols; Authentication; SSL; SASL; Re-authentication; Security Updates Without Downtime; Encryption; End-to-End Encryption; Authorization; AclAuthorizer; Customizing Authorization; Security Considerations; Auditing; Securing ZooKeeper; SASL; SSL; Authorization; Securing the Platform; Password Protection; Summary

12. Administering Kafka
   Topic Operations; Creating a New Topic; Listing All Topics in a Cluster; Describing Topic Details; Adding Partitions; Reducing Partitions; Deleting a Topic; Consumer Groups; List and Describe Groups
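
Chapters 3 and 4 center on the producer and consumer client APIs. As a quick orientation to the sections listed above, here is a minimal sketch of the send patterns from Chapter 3 ("Sending a Message Synchronously" and "Sending a Message Asynchronously"), using the standard kafka-clients Java library; the broker address and topic name are placeholders, not values from the book.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class ProducerSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("demo-topic", "key", "value"); // placeholder topic

                // Asynchronous send: the callback runs once the broker responds.
                producer.send(record, (RecordMetadata md, Exception e) -> {
                    if (e != null) e.printStackTrace();
                });

                // Synchronous send: get() blocks until the write is acknowledged.
                RecordMetadata md = producer.send(record).get();
                System.out.println("partition=" + md.partition() + " offset=" + md.offset());
            }
        }
    }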
Recommended publications
  • Large-Scale Learning from Data Streams with Apache SAMOA
    Large-Scale Learning from Data Streams with Apache SAMOA Nicolas Kourtellis1, Gianmarco De Francisci Morales2, and Albert Bifet3 1 Telefonica Research, Spain, [email protected] 2 Qatar Computing Research Institute, Qatar, [email protected] 3 LTCI, Télécom ParisTech, France, [email protected] Abstract. Apache SAMOA (Scalable Advanced Massive Online Anal- ysis) is an open-source platform for mining big data streams. Big data is defined as datasets whose size is beyond the ability of typical soft- ware tools to capture, store, manage, and analyze, due to the time and memory complexity. Apache SAMOA provides a collection of dis- tributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It fea- tures a pluggable architecture that allows it to run on several distributed stream processing engines such as Apache Flink, Apache Storm, and Apache Samza. Apache SAMOA is written in Java and is available at https://samoa.incubator.apache.org under the Apache Software Li- cense version 2.0. 1 Introduction Big data are “data whose characteristics force us to look beyond the traditional methods that are prevalent at the time” [18]. For instance, social media are one of the largest and most dynamic sources of data. These data are not only very large due to their fine grain, but also being produced continuously. Furthermore, such data are nowadays produced by users in different environments and via a multitude of devices. For these reasons, data from social media and ubiquitous environments are perfect examples of the challenges posed by big data.
  • DSP Frameworks
    Università degli Studi di Roma "Tor Vergata", Dipartimento di Ingegneria Civile e Ingegneria Informatica. DSP Frameworks, Corso di Sistemi e Architetture per Big Data, A.A. 2017/18, Valeria Cardellini.
    DSP frameworks we consider:
    • Apache Storm (with lab)
    • Twitter Heron: from Twitter, as Storm, and compatible with Storm
    • Apache Spark Streaming (lab): reduces the size of each stream and processes streams of data in micro-batches (micro-batch processing)
    • Apache Flink
    • Apache Samza
    • Cloud-based frameworks: Google Cloud Dataflow, Amazon Kinesis Streams
    Apache Storm is an open-source, real-time, scalable streaming system that provides an abstraction layer for executing DSP applications; it was initially developed by Twitter. A topology is a DAG of spouts (sources of streams) and bolts (operators and data sinks).
    Stream grouping in Storm answers the data-parallelism question: how are streams partitioned among multiple tasks (threads of execution)? The options are illustrated in the sketch below.
    • Shuffle grouping: randomly partitions the tuples
    • Field grouping: hashes on a subset of the tuple attributes
    • All grouping (i.e., broadcast): replicates the entire stream to all the consumer tasks
    • Global grouping: sends the entire stream to a single task of a bolt
    • Direct grouping: the producer of the tuple decides which task of the consumer will receive this tuple
    Storm architecture: master-worker architecture.
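    To make the grouping options concrete, here is a hedged wiring sketch using Storm's Java TopologyBuilder API; the spout and bolt classes (SentenceSpout, SplitBolt, CountBolt, MetricsBolt) are hypothetical placeholders, not part of Storm or the slides.

        import org.apache.storm.topology.TopologyBuilder;
        import org.apache.storm.tuple.Fields;

        public class GroupingSketch {
            public static void main(String[] args) {
                TopologyBuilder builder = new TopologyBuilder();

                // Spout: the source of the stream (hypothetical class), 2 tasks.
                builder.setSpout("sentences", new SentenceSpout(), 2);

                // Shuffle grouping: tuples are randomly partitioned across 4 splitter tasks.
                builder.setBolt("split", new SplitBolt(), 4)
                       .shuffleGrouping("sentences");

                // Field grouping: hashes on the "word" attribute, so equal words
                // always reach the same counting task.
                builder.setBolt("count", new CountBolt(), 4)
                       .fieldsGrouping("split", new Fields("word"));

                // All grouping (broadcast): every tuple is replicated to all metrics tasks.
                builder.setBolt("metrics", new MetricsBolt(), 2)
                       .allGrouping("count");
            }
        }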
  • Comparative Analysis of Data Stream Processing Systems
    Shah Zeb Mian, Comparative Analysis of Data Stream Processing Systems. Master's Thesis in Information Technology, February 23, 2020, University of Jyväskylä, Faculty of Information Technology. Contact: [email protected]. Supervisors: Oleksiy Khriyenko and Vagan Terziyan. Finnish title: Vertaileva analyysi Data Stream-käsittelyjärjestelmistä. Page count: 48+0. Abstract: Big data processing systems are evolving to be more stream oriented, where data is processed continuously, as soon as it arrives. Earlier, data was often stored in a database, a file system, or another form of data storage system, and applications would query the data as needed. Stream processing is the processing of data in motion: it works on continuous data retrieved from different sources. Instead of periodically collecting huge static datasets, streaming frameworks process data as soon as it becomes available, hence reducing latency. This thesis conducts a comparative analysis of different stream processors based on selected features. The research focuses on Apache Samza, Apache Flink, Apache Storm, and Apache Spark Structured Streaming. The thesis also explains Apache Kafka, a log-based data store widely used in streaming frameworks. Keywords: Big Data, Stream Processing, Batch Processing, Streaming Engines, Apache Kafka, Apache Samza. Finnish summary (translated): Big data processing systems are currently evolving to be stream oriented, that is, data is processed as soon as it arrives. More traditionally, data was stored in a database, in files, or in some other storage system, and applications fetched the data as needed. A stream-based system processes data in motion: continuous data from multiple sources. Instead of periodically fetching data, stream-based frameworks can process data as soon as it becomes available, thereby reducing latency.
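    The consume-as-it-arrives model the abstract describes is visible in the shape of a Kafka consumer: an endless poll loop rather than a one-shot query. A minimal sketch using the standard kafka-clients Java API follows; the broker address, group id, and topic name are placeholders.

        import java.time.Duration;
        import java.util.List;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.ConsumerRecords;
        import org.apache.kafka.clients.consumer.KafkaConsumer;

        public class PollLoopSketch {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
                props.put("group.id", "stream-demo");             // placeholder group
                props.put("key.deserializer",
                        "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer",
                        "org.apache.kafka.common.serialization.StringDeserializer");

                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(List.of("events")); // placeholder topic
                    // A stream processor is never "finished": it keeps polling for
                    // new records and handles each one as soon as it is delivered.
                    while (true) {
                        ConsumerRecords<String, String> records =
                                consumer.poll(Duration.ofMillis(500));
                        for (ConsumerRecord<String, String> record : records) {
                            System.out.printf("offset=%d key=%s value=%s%n",
                                    record.offset(), record.key(), record.value());
                        }
                    }
                }
            }
        }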
  • Network Traffic Profiling and Anomaly Detection for Cyber Security
    Network traffic profiling and anomaly detection for cyber security. Laurens D'hooge, student number 01309688. Supervisors: Prof. dr. ir. Filip De Turck and dr. ir. Tim Wauters. Counselors: Prof. dr. Bruno Volckaert and dr. ir. Tim Wauters. A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Master of Science in Information Engineering Technology. Academic year: 2017-2018. Acknowledgements: This thesis is the result of four months of work, and I would like to express my gratitude towards the people who have guided me throughout this process. First and foremost, I'd like to thank my thesis advisors, prof. dr. Bruno Volckaert and dr. ir. Tim Wauters. By virtue of their knowledge and clear communication, I was able to maintain a clear target. Secondly, I would like to thank prof. dr. ir. Filip De Turck for providing me the opportunity to conduct research in this field with the IDLab research group. Special thanks to Andres Felipe Ocampo Palacio and dr. Marleen Denert are in order as well. Mr. Ocampo's PhD research into big data processing for network traffic, and the resulting framework, are an integral part of this thesis. Ms. Denert has been the go-to member of the faculty staff for general advice and administrative dealings. The final token of gratitude I'd like to extend to my family and friends for their continued support during this process. Abstract: This article is a short summary of the research findings of a creation of APT2...
  • Projects – Other Than Hadoop! (Created by Samarjit Mahapatra, [email protected])
    Mostly compatible with Hadoop/HDFS:
    • Apache Drill - provides low-latency ad hoc queries to many different data sources, including nested data. Inspired by Google's Dremel, Drill is designed to scale to 10,000 servers and query petabytes of data in seconds.
    • Apache Hama - a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS for massive scientific computations such as matrix, graph, and network algorithms.
    • Akka - a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM.
    • ML-Hadoop - Hadoop implementation of machine learning algorithms.
    • Shark - a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.
    • Apache Crunch - a Java library that provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
    • Azkaban - a batch workflow job scheduler created at LinkedIn to run their Hadoop jobs.
    • Apache Mesos - a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.
  • Reflexión Académica En Diseño & Comunicación
    ISSN 1668-1673. XXXII • 2017. Año XVIII, Vol. 32, Noviembre 2017, Buenos Aires, Argentina. Reflexión Académica en Diseño & Comunicación. IV Congreso de Creatividad, Diseño y Comunicación para Profesores y Autoridades de Nivel Medio, "Interfaces Palermo". Published by Universidad de Palermo, Facultad de Diseño y Comunicación, Centro de Estudios en Diseño y Comunicación, Mario Bravo 1050, C1175ABT, Ciudad Autónoma de Buenos Aires, Argentina (www.palermo.edu, [email protected]). Director: Oscar Echevarría. Publication coordinator: Diana Divasto. Rector of Universidad de Palermo: Ricardo Popovsky. Dean of the Facultad de Diseño y Comunicación: Oscar Echevarría. Editorial committee: Lucia Acar (Universidade Estácio de Sá, Brazil), Gonzalo Javier Alarcón Vital (Universidad Autónoma Metropolitana, Mexico), Mercedes Alfonsín (Universidad de Buenos Aires, Argentina), Fernando Alberto Alvarez Romero (Pontificia Universidad Católica del Ecuador, Ecuador), Gonzalo Aranda Toro (Universidad Santo Tomás, Chile), Christian Atance (Universidad de Lomas de Zamora, Argentina), Mónica Balabani (Universidad de Palermo, Argentina), Alberto Beckers Argomedo (Universidad Santo Tomás, Chile), Renato Antonio Bertao (Universidade Positivo, Brazil), Allan Castelnuovo (Market Research Society, United Kingdom), Jorge Manuel Castro Falero (Universidad de la Empresa, Uruguay), Raúl Castro Zuñeda (Universidad de Palermo, Argentina), Michael Dinwiddie (New York University, USA), Mario Rubén Dorochesi Fernandois (Universidad Técnica Federico Santa María, Chile), Adriana Inés Echeverria (Universidad de la Cuenca del Plata, Argentina), Jimena Mariana García Ascolani (Universidad Comunera, Paraguay), Marcelo Ghio (Instituto San Ignacio, Peru), Clara Lucia Grisales Montoya (Academia Superior de Artes, Colombia), Haenz Gutiérrez Quintana (Universidad Federal de Santa Catarina, Brazil), José Korn Bruzzone (Universidad Tecnológica de Chile, Chile), Zulema Marzorati (Universidad de Buenos Aires, Argentina), Denisse Morales...
  • PDF Download Scaling Big Data with Hadoop and Solr
    Scaling Big Data with Hadoop and Solr - PDF, EPUB, EBOOK. Hrishikesh Vijay Karambelkar | 166 pages | 30 Apr 2015 | Packt Publishing Limited | 9781783553396 | English | Birmingham, United Kingdom. Excerpts from the book: The default duration between two heartbeats is 3 seconds. Some other SQL-based distributed query engines to certainly bear in mind and consider for your use cases are listed as well. This mode can be turned off manually by running the following command. Solr has the notion of parent-child document relationships; these exist as separate documents within the index, limiting their aggregation functionality in deeply nested data structures. This step will actually create an authorization key with ssh, bypassing the passphrase check, as shown in the following screenshot. Fields may be split into individual tokens and indexed separately. Any key starting with a will go in the first region, with c the third region and z the last region. After the jobs are complete, the results are returned to the remote client via HiveServer2. Finally, Hadoop can accept data in just about any format, which eliminates much of the data transformation involved with data processing. The difference in ingestion performance between Solr and Rocana Search is striking. These tables support most of the common data types that you know from the relational database world.
  • Classifying, Evaluating and Advancing Big Data Benchmarks
    Classifying, Evaluating and Advancing Big Data Benchmarks. Dissertation for the degree of Doctor of Natural Sciences, submitted to Department 12 (Informatics) of the Johann Wolfgang Goethe-Universität in Frankfurt am Main, by Todor Ivanov from Stara Zagora. Frankfurt am Main, 2019 (D 30). Accepted as a dissertation by Department 12 (Informatics) of the Johann Wolfgang Goethe-Universität. Dean: Prof. Dr. Andreas Bernig. Reviewers: Prof. Dott.-Ing. Roberto V. Zicari and Prof. Dr. Carsten Binnig. Date of defense: 23.07.2019. Abstract: The main contribution of the thesis is in helping to understand which software system parameters most affect the performance of Big Data platforms under realistic workloads. In detail, the main research contributions of the thesis are:
    1. Definition of the new concept of heterogeneity for Big Data architectures (Chapter 2);
    2. Investigation of the performance of Big Data systems (e.g., Hadoop) in virtualized environments (Section 3.1);
    3. Investigation of the performance of NoSQL databases versus Hadoop distributions (Section 3.2);
    4. Execution and evaluation of the TPCx-HS benchmark (Section 3.3);
    5. Evaluation and comparison of Hive and Spark SQL engines using benchmark queries (Section 3.4);
    6. Evaluation of the impact of compression techniques on SQL-on-Hadoop engine performance (Section 3.5);
    7. Extensions of the standardized Big Data benchmark BigBench (TPCx-BB) (Sections 4.1 and 4.3);
    8. Definition of a new benchmark, called ABench (Big Data Architecture Stack Benchmark), that takes into account the heterogeneity of Big Data architectures (Section 4.5).
    The thesis is an attempt to re-define system benchmarking, taking into account the new requirements posed by Big Data applications.
  • Storage and Ingestion Systems in Support of Stream Processing
    Storage and Ingestion Systems in Support of Stream Processing: A Survey. Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María S. Pérez-Hernández, Radu Tudoran, Stefano Bortoli, Bogdan Nicolae. Technical Report RT-0501, INRIA Rennes - Bretagne Atlantique and University of Rennes 1, France, November 2018, pp. 1-33. Project-Team KerData. ISSN 0249-0803, ISRN INRIA/RT--0501--FR+ENG. HAL Id: hal-01939280v2, https://hal.inria.fr/hal-01939280v2, submitted on 14 Dec 2018. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
  • A Study of Incremental Checkpointing in Distributed Stream Processing Systems
    A Study of Incremental Checkpointing in Distributed Stream Processing Systems. A thesis submitted to the examination committee designated by the General Assembly of Special Composition of the Department of Computer Science and Engineering, by Aristidis Chronarakis, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science with specialization in Computer Systems. University of Ioannina, 2019. Examining committee: Kostas Magoutis, Assistant Professor, Department of Computer Science and Engineering, University of Ioannina (supervisor); Vassilios V. Dimakopoulos, Associate Professor, Department of Computer Science and Engineering, University of Ioannina; Evaggelia Pitoura, Professor, Department of Computer Science and Engineering, University of Ioannina. Dedicated to my family. Acknowledgements: I would like to thank my advisor, Prof. Kostas Magoutis, for his guidance and support throughout my studies in the department, from the undergraduate level to the graduate. Special thanks to Prof. Vassilios Dimakopoulos and Prof. Evaggelia Pitoura for their participation as members of the examination committee. Finally, I would like to thank my family for the support and my friends for all the good moments we spent. Table of contents: List of Figures; Abstract; Extended Abstract (in Greek); 1 Introduction (1.1 Objectives; 1.2 Structure of this dissertation); 2 Background (2.1 General concepts; 2.2 Checkpoint-rollback methodology; 2.3 Continuous eventual checkpointing (CEC); 2.4 Apache Samza: 2.4.1 Streams; 2.4.2 Applications, Tasks, Containers; 2.4.3 State; 2.4.4 Fault tolerance of stateful applications; 2.4.5 Message (tuple) replay and semantics...)
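    In Samza, the checkpointing and state-durability machinery the thesis studies is driven largely by job configuration rather than application code. A hedged sketch of the relevant properties follows, based on Samza's documented low-level configuration; the store name ("view-counts") and its changelog topic are placeholders.

        # How often Samza commits (checkpoints) consumed input offsets.
        task.commit.ms=60000

        # A local key-value store, made fault tolerant by logging writes to a
        # Kafka changelog topic that is replayed to rebuild state on recovery.
        stores.view-counts.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
        stores.view-counts.changelog=kafka.view-counts-changelog
        stores.view-counts.key.serde=string
        stores.view-counts.msg.serde=integer

        # Serde registry entries referenced by the store configuration above.
        serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
        serializers.registry.integer.class=org.apache.samza.serializers.IntegerSerdeFactory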
  • Apache Samza
    Apache Samza. Martin Kleppmann. Definition: Apache Samza is an open source framework for distributed processing of high-volume event streams. Its primary design goal is to support high throughput for a wide range of processing patterns, while providing operational robustness at the massive scale required by Internet companies. Samza achieves this goal through a small number of carefully designed abstractions: partitioned logs for messaging, fault-tolerant local state, and cluster-based task scheduling. Overview: Stream processing is playing an increasingly important part of the data management needs of many organizations. Event streams can represent many kinds of data, for example, the activity of users on a website, the movement of goods or vehicles, or the writes of records to a database. Stream processing jobs are long-running processes that continuously consume one or more event streams, invoking some application logic on every event, producing derived output streams, and potentially writing output to databases for subsequent querying. While a batch process or a database query typically reads the state of a dataset at one point in time and then finishes, a stream processor is never finished: it continually awaits the arrival of new events, and it only shuts down when terminated by an administrator. Many tasks can be naturally expressed as stream processing jobs, for example:
    • aggregating occurrences of events, e.g., counting how many times a particular item has been viewed;
    • computing the rate of certain events, e.g., for system diagnostics, reporting, and abuse prevention;
    • enriching events with information from a database, e.g., extending user click events with information about...
    The scalability of Samza is directly attributable to the choice of these foundational abstractions.
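    As an illustration of the first example above (counting how many times an item has been viewed), here is a hedged sketch using Samza's classic low-level StreamTask API with a fault-tolerant local key-value store; the store name, output stream, and message layout are placeholders, and job wiring (configuration, serdes, deployment) is omitted.

        import org.apache.samza.config.Config;
        import org.apache.samza.storage.kv.KeyValueStore;
        import org.apache.samza.system.IncomingMessageEnvelope;
        import org.apache.samza.system.OutgoingMessageEnvelope;
        import org.apache.samza.system.SystemStream;
        import org.apache.samza.task.InitableTask;
        import org.apache.samza.task.MessageCollector;
        import org.apache.samza.task.StreamTask;
        import org.apache.samza.task.TaskContext;
        import org.apache.samza.task.TaskCoordinator;

        /** Counts item views per item id (placeholder store and stream names). */
        public class ViewCountTask implements StreamTask, InitableTask {
            private KeyValueStore<String, Integer> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(Config config, TaskContext context) {
                // Fault-tolerant local state: backed by a changelog for recovery.
                store = (KeyValueStore<String, Integer>) context.getStore("view-counts");
            }

            @Override
            public void process(IncomingMessageEnvelope envelope,
                                MessageCollector collector,
                                TaskCoordinator coordinator) {
                String itemId = (String) envelope.getMessage(); // assumes string messages
                Integer count = store.get(itemId);
                int updated = (count == null ? 0 : count) + 1;
                store.put(itemId, updated);
                // Emit the updated count to a derived output stream.
                collector.send(new OutgoingMessageEnvelope(
                        new SystemStream("kafka", "item-view-counts"), itemId, updated));
            }
        }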
  • HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack
    HPC-ABDS: High Performance Computing Enhanced Apache Big Data Stack. Geoffrey C. Fox, Judy Qiu, Supun Kamburugamuve (School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA; {gcf, xqiu, skamburu}@indiana.edu); Shantenu Jha, Andre Luckow (RADICAL, Rutgers University, Piscataway, NJ 08854, USA; [email protected], [email protected]). Abstract: We review the High Performance Computing Enhanced Apache Big Data Stack (HPC-ABDS) and summarize the capabilities in 21 identified architecture layers. These cover Message and Data Protocols, Distributed Coordination, Security & Privacy, Monitoring, Infrastructure Management, DevOps, Interoperability, File Systems, Cluster & Resource Management, Data Transport, File Management, NoSQL, SQL (NewSQL), Extraction Tools, Object-Relational Mapping, In-Memory Caching and Databases, Inter-Process Communication, Batch Programming Model and Runtime, Stream Processing, High-Level Programming, Application Hosting and PaaS, Libraries and Applications, Workflow and Orchestration. We summarize the status of these layers, focusing on issues of importance for data analytics. We highlight areas where HPC and ABDS have good opportunities for integration. From the introduction: ...systems as they illustrate key capabilities and often motivate open source equivalents. The software is broken up into layers so that one can discuss software systems in smaller groups. The layers where there is especial opportunity to integrate HPC are colored green in the figure. We note that data systems that we construct from this software can run interoperably on virtualized or non-virtualized environments aimed at key scientific data analysis problems. Most of ABDS emphasizes scalability but not performance, and one of our goals is to produce high performance environments. Here there is clear need for better node performance and support of accelerators like...