Kafka: The Definitive Guide - Real-Time Data and Stream Processing at Scale

Total pages: 16
File type: PDF, size: 1020 KB

Kafka: The Definitive Guide - Real-Time Data and Stream Processing at Scale, Second Edition
By Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty

With Early Release ebooks, you get books in their earliest form (the authors' raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles.

Copyright © 2022 Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty. All rights reserved. Printed in the United States of America. Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Jess Haberman
Development Editor: Gary O'Brien
Production Editor: Kate Galloway
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

July 2017: First Edition
October 2021: Second Edition

Revision History for the Early Release:
2020-05-22: First Release
2020-06-22: Second Release
2020-07-22: Third Release
2020-09-01: Fourth Release
2020-10-21: Fifth Release
2020-11-20: Sixth Release
2021-02-04: Seventh Release
2021-03-29: Eighth Release
2021-04-13: Ninth Release
2021-06-15: Tenth Release
2021-07-20: Eleventh Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492043089 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Kafka: The Definitive Guide, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

ISBN: 978-1-492-04301-0

Table of Contents

1. Meet Kafka
   Publish/Subscribe Messaging; How It Starts; Individual Queue Systems; Enter Kafka; Messages and Batches; Schemas; Topics and Partitions; Producers and Consumers; Brokers and Clusters; Multiple Clusters; Why Kafka?; Multiple Producers; Multiple Consumers; Disk-Based Retention; Scalable; High Performance; The Data Ecosystem; Use Cases; Kafka's Origin; LinkedIn's Problem; The Birth of Kafka; Open Source; Commercial Engagement; The Name; Getting Started with Kafka

2. Installing Kafka
   Environment Setup; Choosing an Operating System; Installing Java; Installing Zookeeper; Installing a Kafka Broker; Broker Configuration; General Broker; Topic Defaults; Hardware Selection; Disk Throughput; Disk Capacity; Memory; Networking; CPU; Kafka in the Cloud; Kafka Clusters; How Many Brokers?; Broker Configuration; OS Tuning; Production Concerns; Garbage Collector Options; Datacenter Layout; Colocating Applications on Zookeeper; Summary

3. Kafka Producers: Writing Messages to Kafka
   Producer Overview; Constructing a Kafka Producer; Sending a Message to Kafka; Sending a Message Synchronously; Sending a Message Asynchronously; Configuring Producers; client.id; acks; Message Delivery Time; linger.ms; compression.type; batch.size; max.in.flight.requests.per.connection; max.request.size; receive.buffer.bytes and send.buffer.bytes; enable.idempotence; Serializers; Custom Serializers; Serializing Using Apache Avro; Using Avro Records with Kafka; Partitions; Headers; Interceptors; Quotas and Throttling; Summary

4. Kafka Consumers: Reading Data from Kafka
   Kafka Consumer Concepts; Consumers and Consumer Groups; Consumer Groups and Partition Rebalance; Static Group Membership; Creating a Kafka Consumer; Subscribing to Topics; The Poll Loop; Configuring Consumers; fetch.min.bytes; fetch.max.wait.ms; fetch.max.bytes; max.poll.records; max.partition.fetch.bytes; session.timeout.ms and heartbeat.interval.ms; max.poll.interval.ms; default.api.timeout.ms; request.timeout.ms; auto.offset.reset; enable.auto.commit; partition.assignment.strategy; client.id; client.rack; group.instance.id; receive.buffer.bytes and send.buffer.bytes; offsets.retention.minutes; Commits and Offsets; Automatic Commit; Commit Current Offset; Asynchronous Commit; Combining Synchronous and Asynchronous Commits; Commit Specified Offset; Rebalance Listeners; Consuming Records with Specific Offsets; But How Do We Exit?; Deserializers; Custom Deserializers; Using Avro Deserialization with the Kafka Consumer; Standalone Consumer: Why and How to Use a Consumer Without a Group; Summary

5. Managing Apache Kafka Programmatically
   AdminClient Overview; Asynchronous and Eventually Consistent API; Options; Flat Hierarchy; Additional Notes; AdminClient Lifecycle: Creating, Configuring, and Closing; client.dns.lookup; request.timeout.ms; Essential Topic Management; Configuration Management; Consumer Group Management; Exploring Consumer Groups; Modifying Consumer Groups; Cluster Metadata; Advanced Admin Operations; Adding Partitions to a Topic; Deleting Records from a Topic; Leader Election; Reassigning Replicas; Testing; Summary

6. Kafka Internals
   Cluster Membership; The Controller; KRaft: Kafka's New Raft-Based Controller; Replication; Request Processing; Produce Requests; Fetch Requests; Other Requests; Physical Storage; Tiered Storage; Partition Allocation; File Management; File Format; Indexes; Compaction; How Compaction Works; Deleted Events; When Are Topics Compacted?; Summary

7. Reliable Data Delivery
   Reliability Guarantees; Replication; Broker Configuration; Replication Factor; Unclean Leader Election; Minimum In-Sync Replicas; Keeping Replicas In Sync; Persisting to Disk; Using Producers in a Reliable System; Send Acknowledgments; Configuring Producer Retries; Additional Error Handling; Using Consumers in a Reliable System; Important Consumer Configuration Properties for Reliable Processing; Explicitly Committing Offsets in Consumers; Validating System Reliability; Validating Configuration; Validating Applications; Monitoring Reliability in Production; Summary

8. Exactly Once Semantics
   Idempotent Producer; How Does the Idempotent Producer Work?; Limitations of the Idempotent Producer; How Do I Use the Kafka Idempotent Producer?; Transactions; Use Cases; What Problems Do Transactions Solve?; How Do Transactions Guarantee Exactly Once?; What Problems Aren't Solved by Transactions?; How Do I Use Transactions?; Transactional IDs and Fencing; How Transactions Work; Performance of Transactions; Summary

9. Building Data Pipelines
   Considerations When Building Data Pipelines; Timeliness; Reliability; High and Varying Throughput; Data Formats; Transformations; Security; Failure Handling; Coupling and Agility; When to Use Kafka Connect Versus Producer and Consumer; Kafka Connect; Running Connect; Connector Example: File Source and File Sink; Connector Example: MySQL to Elasticsearch; Single Message Transformations; A Deeper Look at Connect; Alternatives to Kafka Connect; Ingest Frameworks for Other Datastores; GUI-Based ETL Tools; Stream-Processing Frameworks; Summary

10. Cross-Cluster Data Mirroring
   Use Cases of Cross-Cluster Mirroring; Multicluster Architectures; Some Realities of Cross-Datacenter Communication; Hub-and-Spokes Architecture; Active-Active Architecture; Active-Standby Architecture; Stretch Clusters; Apache Kafka's MirrorMaker; How to Configure; Multicluster Replication Topology; Securing MirrorMaker; Deploying MirrorMaker in Production; Tuning MirrorMaker; Other Cross-Cluster Mirroring Solutions; Uber uReplicator; LinkedIn Brooklin; Confluent Cross-Datacenter Mirroring Solutions; Summary

11. Securing Kafka
   Locking Down Kafka; Security Protocols; Authentication; SSL; SASL; Re-authentication; Security Updates Without Downtime; Encryption; End-to-End Encryption; Authorization; AclAuthorizer; Customizing Authorization; Security Considerations; Auditing; Securing ZooKeeper; SASL; SSL; Authorization; Securing the Platform; Password Protection; Summary

12. Administering Kafka
   Topic Operations; Creating a New Topic; Listing All Topics in a Cluster; Describing Topic Details; Adding Partitions; Reducing Partitions; Deleting a Topic; Consumer Groups; List and Describe Groups
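
Chapters 3 and 4 center on the producer and consumer client APIs. As a quick orientation to the sections listed above, here is a minimal sketch of the send patterns from Chapter 3 ("Sending a Message Synchronously" and "Sending a Message Asynchronously"), using the standard kafka-clients Java library; the broker address and topic name are placeholders, not values from the book.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class ProducerSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("demo-topic", "key", "value"); // placeholder topic

                // Asynchronous send: the callback runs once the broker responds.
                producer.send(record, (RecordMetadata md, Exception e) -> {
                    if (e != null) e.printStackTrace();
                });

                // Synchronous send: get() blocks until the write is acknowledged.
                RecordMetadata md = producer.send(record).get();
                System.out.println("partition=" + md.partition() + " offset=" + md.offset());
            }
        }
    }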
Recommended publications
  • Large-Scale Learning from Data Streams with Apache SAMOA
    Large-Scale Learning from Data Streams with Apache SAMOA Nicolas Kourtellis1, Gianmarco De Francisci Morales2, and Albert Bifet3 1 Telefonica Research, Spain, [email protected] 2 Qatar Computing Research Institute, Qatar, [email protected] 3 LTCI, Télécom ParisTech, France, [email protected] Abstract. Apache SAMOA (Scalable Advanced Massive Online Anal- ysis) is an open-source platform for mining big data streams. Big data is defined as datasets whose size is beyond the ability of typical soft- ware tools to capture, store, manage, and analyze, due to the time and memory complexity. Apache SAMOA provides a collection of dis- tributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It fea- tures a pluggable architecture that allows it to run on several distributed stream processing engines such as Apache Flink, Apache Storm, and Apache Samza. Apache SAMOA is written in Java and is available at https://samoa.incubator.apache.org under the Apache Software Li- cense version 2.0. 1 Introduction Big data are “data whose characteristics force us to look beyond the traditional methods that are prevalent at the time” [18]. For instance, social media are one of the largest and most dynamic sources of data. These data are not only very large due to their fine grain, but also being produced continuously. Furthermore, such data are nowadays produced by users in different environments and via a multitude of devices. For these reasons, data from social media and ubiquitous environments are perfect examples of the challenges posed by big data.
  • DSP Frameworks
    Università degli Studi di Roma "Tor Vergata", Dipartimento di Ingegneria Civile e Ingegneria Informatica. DSP Frameworks, Corso di Sistemi e Architetture per Big Data, A.A. 2017/18, Valeria Cardellini.
    DSP frameworks we consider:
    • Apache Storm (with lab)
    • Twitter Heron: from Twitter, as Storm, and compatible with Storm
    • Apache Spark Streaming (lab): reduces the size of each stream and processes streams of data in micro-batches (micro-batch processing)
    • Apache Flink
    • Apache Samza
    • Cloud-based frameworks: Google Cloud Dataflow, Amazon Kinesis Streams
    Apache Storm is an open-source, real-time, scalable streaming system that provides an abstraction layer for executing DSP applications; it was initially developed by Twitter. A topology is a DAG of spouts (sources of streams) and bolts (operators and data sinks).
    Stream grouping in Storm answers the data-parallelism question: how are streams partitioned among multiple tasks (threads of execution)? The options are illustrated in the sketch below.
    • Shuffle grouping: randomly partitions the tuples
    • Field grouping: hashes on a subset of the tuple attributes
    • All grouping (i.e., broadcast): replicates the entire stream to all the consumer tasks
    • Global grouping: sends the entire stream to a single task of a bolt
    • Direct grouping: the producer of the tuple decides which task of the consumer will receive this tuple
    Storm architecture: master-worker architecture.
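    To make the grouping options concrete, here is a hedged wiring sketch using Storm's Java TopologyBuilder API; the spout and bolt classes (SentenceSpout, SplitBolt, CountBolt, MetricsBolt) are hypothetical placeholders, not part of Storm or the slides.

        import org.apache.storm.topology.TopologyBuilder;
        import org.apache.storm.tuple.Fields;

        public class GroupingSketch {
            public static void main(String[] args) {
                TopologyBuilder builder = new TopologyBuilder();

                // Spout: the source of the stream (hypothetical class), 2 tasks.
                builder.setSpout("sentences", new SentenceSpout(), 2);

                // Shuffle grouping: tuples are randomly partitioned across 4 splitter tasks.
                builder.setBolt("split", new SplitBolt(), 4)
                       .shuffleGrouping("sentences");

                // Field grouping: hashes on the "word" attribute, so equal words
                // always reach the same counting task.
                builder.setBolt("count", new CountBolt(), 4)
                       .fieldsGrouping("split", new Fields("word"));

                // All grouping (broadcast): every tuple is replicated to all metrics tasks.
                builder.setBolt("metrics", new MetricsBolt(), 2)
                       .allGrouping("count");
            }
        }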
  • Comparative Analysis of Data Stream Processing Systems
    Shah Zeb Mian, Comparative Analysis of Data Stream Processing Systems. Master's Thesis in Information Technology, February 23, 2020, University of Jyväskylä, Faculty of Information Technology. Contact: [email protected]. Supervisors: Oleksiy Khriyenko and Vagan Terziyan. Finnish title: Vertaileva analyysi Data Stream-käsittelyjärjestelmistä. Page count: 48+0. Abstract: Big data processing systems are evolving to be more stream oriented, where data is processed continuously, as soon as it arrives. Earlier, data was often stored in a database, a file system, or another form of data storage system, and applications would query the data as needed. Stream processing is the processing of data in motion: it works on continuous data retrieved from different sources. Instead of periodically collecting huge static datasets, streaming frameworks process data as soon as it becomes available, hence reducing latency. This thesis conducts a comparative analysis of different stream processors based on selected features. The research focuses on Apache Samza, Apache Flink, Apache Storm, and Apache Spark Structured Streaming. The thesis also explains Apache Kafka, a log-based data store widely used in streaming frameworks. Keywords: Big Data, Stream Processing, Batch Processing, Streaming Engines, Apache Kafka, Apache Samza. Finnish summary (translated): Big data processing systems are currently evolving to be stream oriented, that is, data is processed as soon as it arrives. More traditionally, data was stored in a database, in files, or in some other storage system, and applications fetched the data as needed. A stream-based system processes data in motion: continuous data from multiple sources. Instead of periodically fetching data, stream-based frameworks can process data as soon as it becomes available, thereby reducing latency.
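    The consume-as-it-arrives model the abstract describes is visible in the shape of a Kafka consumer: an endless poll loop rather than a one-shot query. A minimal sketch using the standard kafka-clients Java API follows; the broker address, group id, and topic name are placeholders.

        import java.time.Duration;
        import java.util.List;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.ConsumerRecords;
        import org.apache.kafka.clients.consumer.KafkaConsumer;

        public class PollLoopSketch {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
                props.put("group.id", "stream-demo");             // placeholder group
                props.put("key.deserializer",
                        "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer",
                        "org.apache.kafka.common.serialization.StringDeserializer");

                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(List.of("events")); // placeholder topic
                    // A stream processor is never "finished": it keeps polling for
                    // new records and handles each one as soon as it is delivered.
                    while (true) {
                        ConsumerRecords<String, String> records =
                                consumer.poll(Duration.ofMillis(500));
                        for (ConsumerRecord<String, String> record : records) {
                            System.out.printf("offset=%d key=%s value=%s%n",
                                    record.offset(), record.key(), record.value());
                        }
                    }
                }
            }
        }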
  • Network Traffic Profiling and Anomaly Detection for Cyber Security
    Network traffic profiling and anomaly detection for cyber security. Laurens D'hooge, student number 01309688. Supervisors: Prof. dr. ir. Filip De Turck and dr. ir. Tim Wauters. Counselors: Prof. dr. Bruno Volckaert and dr. ir. Tim Wauters. A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Master of Science in Information Engineering Technology. Academic year: 2017-2018. Acknowledgements: This thesis is the result of four months of work, and I would like to express my gratitude towards the people who have guided me throughout this process. First and foremost, I'd like to thank my thesis advisors, prof. dr. Bruno Volckaert and dr. ir. Tim Wauters. By virtue of their knowledge and clear communication, I was able to maintain a clear target. Secondly, I would like to thank prof. dr. ir. Filip De Turck for providing me the opportunity to conduct research in this field with the IDLab research group. Special thanks to Andres Felipe Ocampo Palacio and dr. Marleen Denert are in order as well. Mr. Ocampo's PhD research into big data processing for network traffic, and the resulting framework, are an integral part of this thesis. Ms. Denert has been the go-to member of the faculty staff for general advice and administrative dealings. The final token of gratitude I'd like to extend to my family and friends for their continued support during this process. Abstract: This article is a short summary of the research findings of a creation of APT2...
  • Projects – Other Than Hadoop! (Created by Samarjit Mahapatra, [email protected])
    Mostly compatible with Hadoop/HDFS:
    • Apache Drill - provides low-latency ad hoc queries to many different data sources, including nested data. Inspired by Google's Dremel, Drill is designed to scale to 10,000 servers and query petabytes of data in seconds.
    • Apache Hama - a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS for massive scientific computations such as matrix, graph, and network algorithms.
    • Akka - a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM.
    • ML-Hadoop - Hadoop implementation of machine learning algorithms.
    • Shark - a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.
    • Apache Crunch - a Java library that provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
    • Azkaban - a batch workflow job scheduler created at LinkedIn to run their Hadoop jobs.
    • Apache Mesos - a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.
  • Reflexión Académica En Diseño & Comunicación
    ISSN 1668-1673. XXXII • 2017. Año XVIII, Vol. 32, Noviembre 2017, Buenos Aires, Argentina. Reflexión Académica en Diseño & Comunicación. IV Congreso de Creatividad, Diseño y Comunicación para Profesores y Autoridades de Nivel Medio, "Interfaces Palermo". Published by Universidad de Palermo, Facultad de Diseño y Comunicación, Centro de Estudios en Diseño y Comunicación, Mario Bravo 1050, C1175ABT, Ciudad Autónoma de Buenos Aires, Argentina (www.palermo.edu, [email protected]). Director: Oscar Echevarría. Publication coordinator: Diana Divasto. Rector of Universidad de Palermo: Ricardo Popovsky. Dean of the Facultad de Diseño y Comunicación: Oscar Echevarría. Editorial committee: Lucia Acar (Universidade Estácio de Sá, Brazil), Gonzalo Javier Alarcón Vital (Universidad Autónoma Metropolitana, Mexico), Mercedes Alfonsín (Universidad de Buenos Aires, Argentina), Fernando Alberto Alvarez Romero (Pontificia Universidad Católica del Ecuador, Ecuador), Gonzalo Aranda Toro (Universidad Santo Tomás, Chile), Christian Atance (Universidad de Lomas de Zamora, Argentina), Mónica Balabani (Universidad de Palermo, Argentina), Alberto Beckers Argomedo (Universidad Santo Tomás, Chile), Renato Antonio Bertao (Universidade Positivo, Brazil), Allan Castelnuovo (Market Research Society, United Kingdom), Jorge Manuel Castro Falero (Universidad de la Empresa, Uruguay), Raúl Castro Zuñeda (Universidad de Palermo, Argentina), Michael Dinwiddie (New York University, USA), Mario Rubén Dorochesi Fernandois (Universidad Técnica Federico Santa María, Chile), Adriana Inés Echeverria (Universidad de la Cuenca del Plata, Argentina), Jimena Mariana García Ascolani (Universidad Comunera, Paraguay), Marcelo Ghio (Instituto San Ignacio, Peru), Clara Lucia Grisales Montoya (Academia Superior de Artes, Colombia), Haenz Gutiérrez Quintana (Universidad Federal de Santa Catarina, Brazil), José Korn Bruzzone (Universidad Tecnológica de Chile, Chile), Zulema Marzorati (Universidad de Buenos Aires, Argentina), Denisse Morales...
  • PDF Download Scaling Big Data with Hadoop and Solr
    Scaling Big Data with Hadoop and Solr - PDF, EPUB, EBOOK. Hrishikesh Vijay Karambelkar | 166 pages | 30 Apr 2015 | Packt Publishing Limited | 9781783553396 | English | Birmingham, United Kingdom. Excerpts from the book: The default duration between two heartbeats is 3 seconds. Some other SQL-based distributed query engines to certainly bear in mind and consider for your use cases are listed as well. This mode can be turned off manually by running the following command. Solr has the notion of parent-child document relationships; these exist as separate documents within the index, limiting their aggregation functionality in deeply nested data structures. This step will actually create an authorization key with ssh, bypassing the passphrase check, as shown in the following screenshot. Fields may be split into individual tokens and indexed separately. Any key starting with a will go in the first region, with c the third region and z the last region. After the jobs are complete, the results are returned to the remote client via HiveServer2. Finally, Hadoop can accept data in just about any format, which eliminates much of the data transformation involved with data processing. The difference in ingestion performance between Solr and Rocana Search is striking. These tables support most of the common data types that you know from the relational database world.
  • Classifying, Evaluating and Advancing Big Data Benchmarks
    Classifying, Evaluating and Advancing Big Data Benchmarks. Dissertation for the degree of Doctor of Natural Sciences, submitted to Department 12 (Informatics) of the Johann Wolfgang Goethe-Universität in Frankfurt am Main, by Todor Ivanov from Stara Zagora. Frankfurt am Main, 2019 (D 30). Accepted as a dissertation by Department 12 (Informatics) of the Johann Wolfgang Goethe-Universität. Dean: Prof. Dr. Andreas Bernig. Reviewers: Prof. Dott.-Ing. Roberto V. Zicari and Prof. Dr. Carsten Binnig. Date of defense: 23.07.2019. Abstract: The main contribution of the thesis is in helping to understand which software system parameters most affect the performance of Big Data platforms under realistic workloads. In detail, the main research contributions of the thesis are:
    1. Definition of the new concept of heterogeneity for Big Data architectures (Chapter 2);
    2. Investigation of the performance of Big Data systems (e.g., Hadoop) in virtualized environments (Section 3.1);
    3. Investigation of the performance of NoSQL databases versus Hadoop distributions (Section 3.2);
    4. Execution and evaluation of the TPCx-HS benchmark (Section 3.3);
    5. Evaluation and comparison of Hive and Spark SQL engines using benchmark queries (Section 3.4);
    6. Evaluation of the impact of compression techniques on SQL-on-Hadoop engine performance (Section 3.5);
    7. Extensions of the standardized Big Data benchmark BigBench (TPCx-BB) (Sections 4.1 and 4.3);
    8. Definition of a new benchmark, called ABench (Big Data Architecture Stack Benchmark), that takes into account the heterogeneity of Big Data architectures (Section 4.5).
    The thesis is an attempt to re-define system benchmarking, taking into account the new requirements posed by Big Data applications.
  • Storage and Ingestion Systems in Support of Stream Processing
    Storage and Ingestion Systems in Support of Stream Processing: A Survey. Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María S. Pérez-Hernández, Radu Tudoran, Stefano Bortoli, Bogdan Nicolae. Technical Report RT-0501, INRIA Rennes - Bretagne Atlantique and University of Rennes 1, France, November 2018, pp. 1-33. Project-Team KerData. ISSN 0249-0803, ISRN INRIA/RT--0501--FR+ENG. HAL Id: hal-01939280v2, https://hal.inria.fr/hal-01939280v2, submitted on 14 Dec 2018. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
  • A Study of Incremental Checkpointing in Distributed Stream Processing Systems
    A Study of Incremental Checkpointing in Distributed Stream Processing Systems. A thesis submitted to the examination committee designated by the General Assembly of Special Composition of the Department of Computer Science and Engineering, by Aristidis Chronarakis, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science with specialization in Computer Systems. University of Ioannina, 2019. Examining committee: Kostas Magoutis, Assistant Professor, Department of Computer Science and Engineering, University of Ioannina (supervisor); Vassilios V. Dimakopoulos, Associate Professor, Department of Computer Science and Engineering, University of Ioannina; Evaggelia Pitoura, Professor, Department of Computer Science and Engineering, University of Ioannina. Dedicated to my family. Acknowledgements: I would like to thank my advisor, Prof. Kostas Magoutis, for his guidance and support throughout my studies in the department, from the undergraduate level to the graduate. Special thanks to Prof. Vassilios Dimakopoulos and Prof. Evaggelia Pitoura for their participation as members of the examination committee. Finally, I would like to thank my family for the support and my friends for all the good moments we spent. Table of contents: List of Figures; Abstract; Extended Abstract (in Greek); 1 Introduction (1.1 Objectives; 1.2 Structure of this dissertation); 2 Background (2.1 General concepts; 2.2 Checkpoint-rollback methodology; 2.3 Continuous eventual checkpointing (CEC); 2.4 Apache Samza: 2.4.1 Streams; 2.4.2 Applications, Tasks, Containers; 2.4.3 State; 2.4.4 Fault tolerance of stateful applications; 2.4.5 Message (tuple) replay and semantics...)
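    In Samza, the checkpointing and state-durability machinery the thesis studies is driven largely by job configuration rather than application code. A hedged sketch of the relevant properties follows, based on Samza's documented low-level configuration; the store name ("view-counts") and its changelog topic are placeholders.

        # How often Samza commits (checkpoints) consumed input offsets.
        task.commit.ms=60000

        # A local key-value store, made fault tolerant by logging writes to a
        # Kafka changelog topic that is replayed to rebuild state on recovery.
        stores.view-counts.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
        stores.view-counts.changelog=kafka.view-counts-changelog
        stores.view-counts.key.serde=string
        stores.view-counts.msg.serde=integer

        # Serde registry entries referenced by the store configuration above.
        serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
        serializers.registry.integer.class=org.apache.samza.serializers.IntegerSerdeFactory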
  • Apache Samza
    Apache Samza. Martin Kleppmann. Definition: Apache Samza is an open source framework for distributed processing of high-volume event streams. Its primary design goal is to support high throughput for a wide range of processing patterns, while providing operational robustness at the massive scale required by Internet companies. Samza achieves this goal through a small number of carefully designed abstractions: partitioned logs for messaging, fault-tolerant local state, and cluster-based task scheduling. Overview: Stream processing is playing an increasingly important part of the data management needs of many organizations. Event streams can represent many kinds of data, for example, the activity of users on a website, the movement of goods or vehicles, or the writes of records to a database. Stream processing jobs are long-running processes that continuously consume one or more event streams, invoking some application logic on every event, producing derived output streams, and potentially writing output to databases for subsequent querying. While a batch process or a database query typically reads the state of a dataset at one point in time and then finishes, a stream processor is never finished: it continually awaits the arrival of new events, and it only shuts down when terminated by an administrator. Many tasks can be naturally expressed as stream processing jobs, for example:
    • aggregating occurrences of events, e.g., counting how many times a particular item has been viewed;
    • computing the rate of certain events, e.g., for system diagnostics, reporting, and abuse prevention;
    • enriching events with information from a database, e.g., extending user click events with information about...
    The scalability of Samza is directly attributable to the choice of these foundational abstractions.
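    As an illustration of the first example above (counting how many times an item has been viewed), here is a hedged sketch using Samza's classic low-level StreamTask API with a fault-tolerant local key-value store; the store name, output stream, and message layout are placeholders, and job wiring (configuration, serdes, deployment) is omitted.

        import org.apache.samza.config.Config;
        import org.apache.samza.storage.kv.KeyValueStore;
        import org.apache.samza.system.IncomingMessageEnvelope;
        import org.apache.samza.system.OutgoingMessageEnvelope;
        import org.apache.samza.system.SystemStream;
        import org.apache.samza.task.InitableTask;
        import org.apache.samza.task.MessageCollector;
        import org.apache.samza.task.StreamTask;
        import org.apache.samza.task.TaskContext;
        import org.apache.samza.task.TaskCoordinator;

        /** Counts item views per item id (placeholder store and stream names). */
        public class ViewCountTask implements StreamTask, InitableTask {
            private KeyValueStore<String, Integer> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(Config config, TaskContext context) {
                // Fault-tolerant local state: backed by a changelog for recovery.
                store = (KeyValueStore<String, Integer>) context.getStore("view-counts");
            }

            @Override
            public void process(IncomingMessageEnvelope envelope,
                                MessageCollector collector,
                                TaskCoordinator coordinator) {
                String itemId = (String) envelope.getMessage(); // assumes string messages
                Integer count = store.get(itemId);
                int updated = (count == null ? 0 : count) + 1;
                store.put(itemId, updated);
                // Emit the updated count to a derived output stream.
                collector.send(new OutgoingMessageEnvelope(
                        new SystemStream("kafka", "item-view-counts"), itemId, updated));
            }
        }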
  • HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack
    HPC-ABDS: High Performance Computing Enhanced Apache Big Data Stack. Geoffrey C. Fox, Judy Qiu, Supun Kamburugamuve (School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA; {gcf, xqiu, skamburu}@indiana.edu); Shantenu Jha, Andre Luckow (RADICAL, Rutgers University, Piscataway, NJ 08854, USA; [email protected], [email protected]). Abstract: We review the High Performance Computing Enhanced Apache Big Data Stack (HPC-ABDS) and summarize the capabilities in 21 identified architecture layers. These cover Message and Data Protocols, Distributed Coordination, Security & Privacy, Monitoring, Infrastructure Management, DevOps, Interoperability, File Systems, Cluster & Resource Management, Data Transport, File Management, NoSQL, SQL (NewSQL), Extraction Tools, Object-Relational Mapping, In-Memory Caching and Databases, Inter-Process Communication, Batch Programming Model and Runtime, Stream Processing, High-Level Programming, Application Hosting and PaaS, Libraries and Applications, Workflow and Orchestration. We summarize the status of these layers, focusing on issues of importance for data analytics. We highlight areas where HPC and ABDS have good opportunities for integration. From the introduction: ...systems as they illustrate key capabilities and often motivate open source equivalents. The software is broken up into layers so that one can discuss software systems in smaller groups. The layers where there is especial opportunity to integrate HPC are colored green in the figure. We note that data systems that we construct from this software can run interoperably on virtualized or non-virtualized environments aimed at key scientific data analysis problems. Most of ABDS emphasizes scalability but not performance, and one of our goals is to produce high performance environments. Here there is clear need for better node performance and support of accelerators like...