Kafka Streams and Ksqldb Building Real-Time Data Systems by Example
Total Page:16
File Type:pdf, Size:1020Kb
Mastering Kafka Streams and ksqlDB Building Real-Time Data Systems by Example Compliments of Mitch Seymour ksqlDB + Confluent Build and launch stream processing applications faster with fully managed ksqlDB from Confluent Simplify your stream processing Enable your business to Empower developers to start architecture in one fully innovate faster and operate in building apps faster on top of managed solution with two real-time by processing data in Kafka with ksqlDB’s simple components: Kafka and ksqlDB motion rather than data at rest SQL syntax USE ANY POPULAR CLOUD PROVIDER Try ksqlDB on Confluent Free for 90 Days TRY FREE New signups get up to $200 off their first three monthly bills, plus use promo code KSQLDB2021 for an additional $200 credit! Claim your promo code in-product Mastering Kafka Streams and ksqlDB Building Real-Time Data Systems by Example Mitch Seymour Beijing Boston Farnham Sebastopol Tokyo Mastering Kafka Streams and ksqlDB by Mitch Seymour Copyright © 2021 Mitch Seymour. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Acquisitions Editor: Jessica Haberman Indexer: Ellen Troutman-Zaig Development Editor: Jeff Bleiel Interior Designer: David Futato Production Editor: Daniel Elfanbaum Cover Designer: Karen Montgomery Copyeditor: Kim Cofer Illustrator: Kate Dullea Proofreader: JM Olejarz February 2021: First Edition Revision History for the First Edition 2021-02-04: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781492062493 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Mastering Kafka Streams and ksqlDB, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Confluent. See our statement of editorial inde‐ pendence. 978-1-098-10982-0 [LSI] Table of Contents Foreword. xiii Preface. xv Part I. Kafka 1. A Rapid Introduction to Kafka. 1 Communication Model 2 How Are Streams Stored? 6 Topics and Partitions 9 Events 11 Kafka Cluster and Brokers 12 Consumer Groups 13 Installing Kafka 15 Hello, Kafka 16 Summary 19 Part II. Kafka Streams 2. Getting Started with Kafka Streams. 23 The Kafka Ecosystem 23 Before Kafka Streams 24 Enter Kafka Streams 25 Features at a Glance 27 Operational Characteristics 28 v Scalability 28 Reliability 29 Maintainability 30 Comparison to Other Systems 30 Deployment Model 30 Processing Model 31 Kappa Architecture 32 Use Cases 33 Processor Topologies 35 Sub-Topologies 37 Depth-First Processing 39 Benefits of Dataflow Programming 41 Tasks and Stream Threads 41 High-Level DSL Versus Low-Level Processor API 44 Introducing Our Tutorial: Hello, Streams 45 Project Setup 46 Creating a New Project 46 Adding the Kafka Streams Dependency 47 DSL 48 Processor API 51 Streams and Tables 53 Stream/Table Duality 57 KStream, KTable, GlobalKTable 57 Summary 58 3. Stateless Processing. 61 Stateless Versus Stateful Processing 62 Introducing Our Tutorial: Processing a Twitter Stream 63 Project Setup 65 Adding a KStream Source Processor 65 Serialization/Deserialization 69 Building a Custom Serdes 70 Defining Data Classes 71 Implementing a Custom Deserializer 72 Implementing a Custom Serializer 73 Building the Tweet Serdes 74 Filtering Data 75 Branching Data 77 Translating Tweets 79 Merging Streams 81 Enriching Tweets 82 vi | Table of Contents Avro Data Class 83 Sentiment Analysis 85 Serializing Avro Data 87 Registryless Avro Serdes 88 Schema Registry–Aware Avro Serdes 88 Adding a Sink Processor 90 Running the Code 91 Empirical Verification 91 Summary 94 4. Stateful Processing. 95 Benefits of Stateful Processing 96 Preview of Stateful Operators 97 State Stores 98 Common Characteristics 99 Persistent Versus In-Memory Stores 101 Introducing Our Tutorial: Video Game Leaderboard 102 Project Setup 104 Data Models 104 Adding the Source Processors 106 KStream 106 KTable 107 GlobalKTable 109 Registering Streams and Tables 110 Joins 111 Join Operators 112 Join Types 113 Co-Partitioning 114 Value Joiners 117 KStream to KTable Join (players Join) 119 KStream to GlobalKTable Join (products Join) 120 Grouping Records 121 Grouping Streams 121 Grouping Tables 122 Aggregations 123 Aggregating Streams 123 Aggregating Tables 126 Putting It All Together 127 Interactive Queries 129 Materialized Stores 129 Accessing Read-Only State Stores 131 Table of Contents | vii Querying Nonwindowed Key-Value Stores 131 Local Queries 134 Remote Queries 134 Summary 142 5. Windows and Time. 143 Introducing Our Tutorial: Patient Monitoring Application 144 Project Setup 146 Data Models 147 Time Semantics 147 Timestamp Extractors 150 Included Timestamp Extractors 150 Custom Timestamp Extractors 152 Registering Streams with a Timestamp Extractor 153 Windowing Streams 154 Window Types 154 Selecting a Window 158 Windowed Aggregation 159 Emitting Window Results 161 Grace Period 163 Suppression 163 Filtering and Rekeying Windowed KTables 166 Windowed Joins 167 Time-Driven Dataflow 168 Alerts Sink 170 Querying Windowed Key-Value Stores 170 Summary 173 6. Advanced State Management. 175 Persistent Store Disk Layout 176 Fault Tolerance 177 Changelog Topics 178 Standby Replicas 180 Rebalancing: Enemy of the State (Store) 180 Preventing State Migration 181 Sticky Assignment 182 Static Membership 185 Reducing the Impact of Rebalances 186 Incremental Cooperative Rebalancing 187 Controlling State Size 189 Deduplicating Writes with Record Caches 195 viii | Table of Contents State Store Monitoring 196 Adding State Listeners 196 Adding State Restore Listeners 198 Built-in Metrics 199 Interactive Queries 200 Custom State Stores 201 Summary 202 7. Processor API. 203 When to Use the Processor API 204 Introducing Our Tutorial: IoT Digital Twin Service 205 Project Setup 208 Data Models 209 Adding Source Processors 211 Adding Stateless Stream Processors 213 Creating Stateless Processors 214 Creating Stateful Processors 217 Periodic Functions with Punctuate 221 Accessing Record Metadata 223 Adding Sink Processors 225 Interactive Queries 225 Putting It All Together 226 Combining the Processor API with the DSL 230 Processors and Transformers 231 Putting It All Together: Refactor 235 Summary 236 Part III. ksqlDB 8. Getting Started with ksqlDB. 239 What Is ksqlDB? 240 When to Use ksqlDB 241 Evolution of a New Kind of Database 243 Kafka Streams Integration 243 Connect Integration 246 How Does ksqlDB Compare to a Traditional SQL Database? 247 Similarities 248 Differences 249 Architecture 251 ksqlDB Server 251 Table of Contents | ix ksqlDB Clients 253 Deployment Modes 255 Interactive Mode 255 Headless Mode 256 Tutorial 257 Installing ksqlDB 257 Running a ksqlDB Server 258 Precreating Topics 259 Using the ksqlDB CLI 259 Summary 262 9. Data Integration with ksqlDB. 263 Kafka Connect Overview 264 External Versus Embedded Connect 265 External Mode 266 Embedded Mode 267 Configuring Connect Workers 268 Converters and Serialization Formats 270 Tutorial 272 Installing Connectors 272 Creating Connectors with ksqlDB 273 Showing Connectors 275 Describing Connectors 276 Dropping Connectors 277 Verifying the Source Connector 277 Interacting with the Kafka Connect Cluster Directly 278 Introspecting Managed Schemas 279 Summary 279 10. Stream Processing Basics with ksqlDB. 281 Tutorial: Monitoring Changes at Netflix 281 Project Setup 284 Source Topics 284 Data Types 285 Custom Types 287 Collections 288 Creating Source Collections 289 With Clause 291 Working with Streams and Tables 292 Showing Streams and Tables 292 Describing Streams and Tables 294 x | Table of Contents Altering Streams and Tables 295 Dropping Streams and Tables 295 Basic Queries 296 Insert Values 296 Simple Selects (Transient Push Queries) 298 Projection 299 Filtering 300 Flattening/Unnesting Complex Structures 302 Conditional Expressions 302 Coalesce 303 IFNULL 303 Case Statements 303 Writing Results Back to Kafka (Persistent Queries) 304 Creating Derived Collections 304 Putting It All Together 308 Summary 309 11. Intermediate and Advanced Stream Processing with ksqlDB. 311 Project Setup 312 Bootstrapping an Environment from a SQL File 312 Data Enrichment 314 Joins 314 Windowed Joins 319 Aggregations 322 Aggregation Basics 323 Windowed Aggregations 325 Materialized Views 331 Clients 332 Pull Queries 333 Curl 335 Push Queries 336 Push Queries via Curl 336 Functions and Operators 337 Operators 337 Showing Functions 338 Describing Functions 339 Creating