Getting Started with Kudu: Perform Fast Analytics on Fast Data

Jean-Marc Spaggiari, Mladen Kovacevic, Brock Noland & Ryan Bosshart


Getting Started with Kudu by Jean-Marc Spaggiari, Mladen Kovacevic, Brock Noland, and Ryan Bosshart. Copyright © 2018 Jean-Marc Spaggiari, Mladen Kovacevic, Brock Noland, Ryan Bosshart. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Colleen Cole
Copyeditor: Dwight Ramsey
Proofreaders: Charles Roumeliotis and Octal Publishing, Inc.
Indexer: Judy McConville
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Melanie Yarbrough
Technical Reviewers: David Yahalom, Andy Stadtler, Attila Bukor, and Peter Paton

July 2018: First Edition

Revision History for the First Edition
2018-07-09: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491980255 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Getting Started with Kudu, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-98025-5

To all those data people who day-in and day-out lead hair-pulling, brain-teasing, late-night lives architecting, developing, or consulting on software that appears to have gone rogue and deliberately misbehaves.

To our families, who may not even care about technology, yet still allowed us to give up time and energy to dedicate to this project with an enormous amount of patience and support, without which none of this was possible. We love you!

Table of Contents

Preface

1. Why Kudu?
   Why Does Kudu Matter?
   Simplicity Drives Adoption
   New Use Cases
   IoT
   Current Approaches to Real-Time Analytics
   Real-Time Processing
   Hardware Landscape
   Kudu’s Unique Place in the Big Data Ecosystem
   Comparing Kudu with Other Ecosystem Components
   Big Data—HDFS, HBase, Cassandra
   Conclusion

2. About Kudu
   Kudu High-Level Design
   Kudu Roles
   Master Server
   Tablet Server
   Kudu Concepts and Mechanisms
   Hotspotting
   Partitioning

3. Getting Up and Running
   Installation
   Apache Kudu Quickstart VM
   Using Cloudera Manager
   Building from Source
   Packages
   Cloudera Quickstart VM
   Quick Install: Three Minutes or Less
   Conclusion

4. Kudu Administration
   Planning for Kudu
   Master and Tablet Servers
   Write-Ahead Log
   Data Servers and Storage
   Replication Strategies
   Deployment Considerations: New or Existing Clusters?
   New Kudu-Only Cluster
   New Hadoop Cluster with Kudu
   Add Kudu to Existing Hadoop Cluster
   Web UI of Tablet and Master Servers
   Master Server UI and Tablet Server UI
   Master Server UI
   Tablet Server UI
   The Kudu Command-Line Interface
   Cluster
   Filesystem
   Tablet Replica
   Consensus Metadata
   Adding and Removing Tablet Servers
   Adding Tablet Servers
   Removing a Tablet Server
   Security
   A Simple Analogy
   Kudu Security Features
   Basic Performance Tuning
   Kudu Memory Limits
   Maintenance Manager Threads
   Monitoring Performance
   Getting Ahead and Staying Out of Trouble
   Avoid Running Out of Disk Space
   Disk Failures Tolerance
   Backup
   Conclusion

5. Common Developer Tasks for Kudu
   Client API
   Kudu Client
   Kudu Table
   Kudu DDL
   Kudu Scanner Read Modes
   C++ API
   Python API
   Preparing the Python Development Environment
   Python Kudu Application
   Java
   Java Application
   Spark
   Impala with Kudu

6. Table and Schema Design
   Schema Design Basics
   Schema for Hybrid Transactional/Analytical Processing
   Lambda Architecture
   OLTP/OLAP Split
   Primary Key and Column Design
   Other Column Schema Considerations
   Partitioning Basics
   Range Partitioning
   Hash Partitioning
   Schema Alteration
   Best Practices and Tips
   Partitioning
   Large Objects
   decimal
   Unique Strings
   Compression
   Object Names
   Number of Columns
   Binary Types
   Network Packet Example
   Conclusion

7. Kudu Use Cases
   Real-Time Internet of Things Analytics
   Predictive Modeling
   Mixed Platforms Solution

Index

Preface

Choosing a storage engine is one of the most important decisions anyone embarking on a big data project makes, and it is one of the most expensive to change. Apache Kudu is an entirely new storage manager for the Hadoop ecosystem. Its flexibility makes applications faster to build and easier to maintain. For a Hadoop developer, Kudu is a critical skill in your big data toolbox. It addresses common problems in big data that are difficult or impossible to solve with current-generation Hadoop storage technologies.

In this book, you will learn key concepts of Kudu’s design and how to architect applications against it, resulting in Kudu applications that are fast, scalable, and reliable. Through hands-on examples, you will learn how Kudu integrates with other Hadoop ecosystem components like Spark, SparkSQL, and Impala.

This book assumes some limited experience with Hadoop ecosystem components like HDFS, Hive, Spark, or Impala. Basic programming experience using Java and/or Scala, experience with SQL and traditional RDBMS systems, and familiarity with the shell are also assumed.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
   Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
   Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
   Shows commands or other text that should be typed literally by the user.

Constant width italic
   Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/kudu-book/getting-started-kudu.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Getting Started with Kudu by Jean-Marc Spaggiari, Mladen Kovacevic, Brock Noland, Ryan Bosshart (O’Reilly). Copyright 2018 Jean-Marc Spaggiari, Mladen Kovacevic, Brock Noland, Ryan Bosshart, 978-1-491-98025-5.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others. For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/getting-started-with-kudu. To comment or ask technical questions about this book, send email to bookquestions@oreilly.com. For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

We would like to thank the people of the Apache Kudu community for their help. This includes the creators, committers, contributors, early adopters, and users of Apache Kudu. Thank you to our technical reviewers David Yahalom, Andy Stadtler, and Attila Bukor for their careful attention to detail and feedback. Thank you to the unofficial technical reviewers as well, including Nipun Parasrampuria, Mac Noland, Sandish Kumar, Tony Foerster, Mike Rasmussen, Jordan Birdsell, and Gunaranjan Sundararajan. Ryan Bosshart and Brock Noland would like to thank their colleagues at phData and Cloudera for their support and input in this book. Mladen Kovacevic would like to thank his Cloudera colleagues who include solutions architects, engineers, support, product management, and others for the enthusiasm and support. Mladen is likewise grateful to his family for their patience, support, and encouragement while writing—it could not have been done without them! Jean-Marc Spaggiari would like to thank everyone who supported him over this experience.

CHAPTER 1
Why Kudu?

Why Does Kudu Matter?

As big data platforms continue to innovate and evolve, whether on-premises or in the cloud, it’s no surprise that many are feeling some fatigue at the pace of new open source big data project releases. After working with Kudu for the past year with large companies and real-world use cases, we’re more convinced than ever that Kudu matters and that it’s very much worthwhile to add yet another project to the open source big data world. Our reasoning boils down to three essential points:

1. Big data is still too difficult—as the audience and appetite for data grows, Hadoop and big data platforms are still too difficult, and much of this complexity is driven by limitations in storage. At our office, long-winded architecture discussions are now being cut short with the common refrain, “Just use Kudu and be done with it.”
2. New use cases need Kudu—the use cases Hadoop is being called upon to serve are changing—this includes an increasing focus on machine-generated data and real-time analytics. To demonstrate this complexity, we walk through some architectures for real-time analytics using existing big data storage technologies and discuss how Kudu simplifies these architectures.
3. The hardware landscape is changing—many of the fundamental assumptions about hardware upon which Hadoop was built are changing. There are fresh opportunities to create a storage manager with improved performance and workload flexibility.

In this chapter we discuss the aforementioned motivations in detail. If you’re already familiar with the motivation for Kudu, you can skip to the latter part of this chapter

where we discuss some of Kudu’s goals and how Kudu compares to other big data storage systems. We finish up by summarizing why the world needs another big data storage system.

Simplicity Drives Adoption

Distributed systems used to be expensive and difficult. We worked for a large media and information provider in the mid-2000s building a platform for the ingest, processing, search, storage, and retrieval of hundreds of terabytes of online content. Building such a platform was a gargantuan effort involving hundreds of engineers and teams dedicated to building different parts of the platform. We had separate teams dedicated to distributed systems for ingest, storage, search, content retrieval, and so on. To scale our platform, we sharded our relational stores, search indexes, and storage systems and then built elaborate metadata systems on top in order to keep everything sorted out. The platform was built on expensive proprietary technologies that acted as a barrier to ward off smaller competing companies wanting to do the same thing.

Around the same time, Doug Cutting and Mike Cafarella were building Apache Hadoop. Thanks to their work and the work of the entire Hadoop ecosystem community, building scale-out infrastructure no longer requires many millions of dollars and huge teams of specialized distributed systems engineers. One of Hadoop’s first advancements was that a software engineer with very modest knowledge of distributed systems had access to scale-out data platforms. It made distributed computing for the software engineer easier.

Although the software engineers rejoiced, that was just a slice of the total population of people wanting access to big data. Hence, the church of big data continued to grow and diversify. Starting with Hive, and then Impala, and then all the other SQL-on-Hadoop engines, analysts and SQL engineers with no programming background have been able to process and analyze their data at scale using Hadoop. By providing a familiar interface, SQL has allowed a huge group of users waiting on the doorstep of Hadoop to be let in. SQL-on-Hadoop mattered because it made data processing and analysis easier and faster versus a programming language. That’s not to say that SQL made Hadoop “easy.” Ask anyone coming from the relational database management system (RDBMS) world and they will tell you that even though SQL certainly made Hadoop more usable, there are plenty of gaps and caveats to watch out for. In addition, especially in the early days, engineers and analysts needed to decide which specific engine or format best fit their particular use case. We talk about the specific challenges later in this chapter, but for now, let’s say that SQL-on-Hadoop made distributed computing easy for the SQL-savvy data engineer.

At the time of this writing, Hive Kudu support is still pending in HIVE-12971. However, it’s still possible to create Hive tables, which Impala accesses.

More users want in on this scalable goodness. With the rise of machine learning and data science, big data platforms are now serving statisticians, mathematicians, and a broader audience of data analysts. Productivity for these users is driven by fast iteration and performance. The effectiveness of the data scientist’s model is often driven by data volume, making Hadoop an obvious choice. These users are familiar with SQL but are not “engineers” and therefore not interested in dealing with some of Hadoop’s nuances. They expect even better usability, simplicity, and performance. In addition, people from traditional enterprise business intelligence and analytics systems also want in on the Hadoop action, leading to even more demand for performant and mature SQL systems.

Hadoop democratized big data, both technically and economically, and made it possible for a programmer or software engineer with little knowledge of the intricacies of distributed systems to ingest, store, process, and serve huge amounts of data. Over time, Hadoop has evolved further and can now serve an even broader audience of users. As a result, we’ve all been witness to a fairly obvious fact: the easier a data platform is to use, the more adoption it will gain, the more users it can serve, and the more value it can provide. This has certainly been true of Hadoop over the past 10 years. As Hadoop became simpler and easier to use, we’ve seen increased adoption and value (Figure 1-1).

Figure 1-1. Hadoop adoption and simplicity

Kudu’s simplicity grows Hadoop’s “addressable market.” It does this by providing functionality and a data model closer to what you’d see in an RDBMS. Kudu provides a relational-like table construct for storing data and allows users to insert, update, and delete data, in much the same way that you can with a relational database. This model is familiar to just about everyone who interacts with data—software engineers, analysts, Extract, Transform, and Load (ETL) developers, data scientists, statisticians, and so on. In addition, it also aligns with the use cases that Hadoop is being asked to solve today.

New Use Cases

Hadoop is being stretched in terms of the use cases it’s being expected to solve. These use cases are driven by the combined force of several factors. One is the macro-trends of data, like the rise of real-time analytics and the Internet of Things (IoT). These workloads are complex to design properly and can be a brick wall for even experienced Hadoop developers. Another factor is the changing user audience of the platform, as discussed in the previous section. With new users come new use cases and new ways to use the platform. Another factor is increasing expectations of end users of data. Five to ten years ago, batch was okay. After all, we couldn’t even process this much data before! But today, being able to scalably store and process data is table stakes; users expect real-time results and speedy performance for a variety of workloads. Let’s look at some of these use cases and the demands they place on storage.

IoT

By 2020, there are projected to be 20 to 30 billion connected devices in the world. These devices come in every shape and form you can imagine. There are massive “connected cargo and passenger ships” instrumented with sensors monitoring everything from fuel consumption to engines and location. There are now connected cars, connected mining equipment, and connected refrigerators. Implanted medical devices responsible for delivering life-saving therapies like pacing, defibrillation, and neuro-stimulation are now able to emit data that can be used to recognize when a patient is having a medical episode or when there are issues with a device that needs maintenance or replacement.

Billions of connected devices create an obvious scale problem and make Hadoop a good choice, but the precise architectural solution is less obvious than you might at first think. Suppose that we have a connected medical device and we want to analyze data from that device. Let’s begin with a simple set of requirements: we want to be able to stream events from connected devices into a storage layer, whatever that might be, and then be able to query the data in just a couple of ways. First, we want to be able to see what’s happening with the device right now. For example, after rolling out a

software update to our device, we want to be able to look at up-to-the-second signals coming off the device to understand if our update is having the desired effect or encountering any issues. The other access pattern is analytics; we have data analysts who are looking for trends in the data to gain new insights and to understand and report on device performance, studying, for example, things like battery usage and optimization. To serve these basic access patterns (Figure 1-2), the storage layer needs the following capabilities:

Row-by-row inserts
   When the application server or gateway device receives an event, it needs to be able to save that event to storage, making it immediately available for readers.

Low-latency random reads
   After deploying an update to some devices, we need to analyze the performance of a subset of devices over a window of time. This means being able to efficiently look up a small range of rows.

Fast analytical scans
   To serve reporting and ad hoc analytics needs, we need to be able to scan large volumes of data efficiently from storage.

Figure 1-2. IoT access patterns

Current Approaches to Real-Time Analytics

Let’s take a look at a simple example of what’s required to successfully implement a real-time streaming analytics use case without Kudu. In this example, we have a data source producing continuous streams of events (Figure 1-3). We need to store these events in near real time, as our users require these events to be made available to them, because the value of this data is highest when it’s first created. The architecture consists of a producer that takes events from

the source and then saves them to some yet-to-be-determined storage layer so that the events can be analyzed by business users via SQL, and by data scientists and developers using Spark. The concept of the producer is generic; it could be a Flume agent, a Spark Streaming job, or a standalone application.

Figure 1-3. Simple real-time analytics flow

Choosing a storage engine for this use case is a surprisingly thorny decision. Let’s take a look at our options using traditional Hadoop storage engines.

Iteration 1: Hadoop Distributed File System

In our first iteration, we’re going to try to keep things simple and save our data in the Hadoop Distributed File System (HDFS) as Avro files. While we’re picking on HDFS in this example, these same principles apply when using other storage layers like Amazon Web Services Simple Storage Service (Amazon S3). Avro is a row-based data format. Row-based formats work well with streaming systems because large numbers of rows need to be buffered in memory when writing columnar formats such as Parquet. We create a Hive table, partitioned by date, and from our producer, write micro-batches of data as Avro files in HDFS. Because users want access to data soon after creation, the producer is frequently writing new files to HDFS. For example, if our producer were a Spark Streaming application running a micro-batch every 10 seconds, the application would also save a new batch every 10 seconds in order to make the data available to consumers. We deploy our new system to the production cluster. The high-level data flow now looks like Figure 1-4.

Figure 1-4. Real-time data flow using HDFS
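To make the small-files mechanics concrete, here is a minimal, hypothetical sketch of such a producer written in Java against the Avro and Hadoop filesystem APIs. The schema, paths, and batch cadence are invented for illustration; the point is simply that every micro-batch creates a brand-new file in HDFS.

    import java.time.LocalDate;
    import java.util.List;
    import java.util.UUID;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AvroMicroBatchWriter {
      // Hypothetical event schema for this sketch.
      static final Schema SCHEMA = SchemaBuilder.record("Event").fields()
          .requiredString("event_id")
          .requiredLong("event_ts")
          .requiredDouble("value")
          .endRecord();

      // Called once per micro-batch (for example, every 10 seconds).
      public static void writeBatch(List<GenericRecord> events) throws Exception {
        // Each call writes a new file under today's date partition.
        Path file = new Path(String.format(
            "hdfs:///data/events/landing/day=%s/batch-%s.avro",
            LocalDate.now(), UUID.randomUUID()));
        FileSystem fs = FileSystem.get(new Configuration());
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
          writer.create(SCHEMA, fs.create(file));  // one new HDFS file per batch
          for (GenericRecord event : events) {
            writer.append(event);
          }
        }
      }
    }

With a 10-second cadence, this single producer creates 8,640 files per day, which is exactly the small-files buildup described next.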

A couple of days after deploying the application, the tickets begin flowing. The Operations Team is receiving user reports that performance is slow and their jobs won’t complete. After looking into some of the jobs, we see that our HDFS directory has tens of thousands of tiny files. As it turns out, each of these tiny files ends up requiring HDFS to do a disk seek, and that destroys the performance of Spark jobs and Impala queries. Unfortunately, the only way to solve this issue is by reducing the number of small files, which is what we do next.

Iteration 2: HDFS + Compactions

After some googling, we find out this is a problem with a well-known solution: adding an HDFS compaction process. The previously mentioned parts of the architecture remain mostly in place; however, because the ingestion process is rapidly creating small files, there is a new offline compaction process. The compaction process takes the many small files and rewrites them into a smaller number of large files. Although the solution seems easy enough, there are a number of complicating factors. For example, the compaction process will run over a partition after it’s no longer active. That compaction process can’t overwrite results into that same partition or you’ll momentarily lose data and active queries can fail. You could write the result into a separate location and then switch them out using HDFS commands, but even then, two HDFS commands are required and consistency cannot be achieved. The complications don’t stop there. The final HDFS-based solution ends up requiring two separate “landing” directories in HDFS: an active version and a passive version, and a “base” directory for compacted data. The architecture now looks like Figure 1-5 (courtesy of the Cloudera Engineering Blog). This architecture contains multiple HDFS directories, multiple Hive/Impala tables, and several views. Developers must create orchestration to switch writers between the two active/passive landing directories, use the compaction process to move data from “landing” to “base” tables, modify the tables and metadata with the new compacted partitions, clean up the old data, and utilize several view modifications to ensure readers are able to see consistent data.

Figure 1-5. Real-time data flow using HDFS with compactions

Even though you’ve created an architecture capable of handling large-scale data with lower latency, the solution still has many compromises. For one, the solution is still only pseudo “real time.” This is because there are still small files, just fewer of them. The more frequently data is written to HDFS, the greater the number of small files, and the efficiency of processing jobs plummets, leading to lower cluster efficiency and job performance. As a result, you might write results to disk only once per minute, so calling this architecture “real time” requires a broad definition of the term. In addition, because HDFS has no notion of a primary key and many stream processing systems have the potential to produce duplicates (i.e., at-least-once versus at-most-once semantics), we need a mechanism that we can use for de-duplication. The result is that our compaction process and table views need to also do de-duplication. Finally, if we have late arriving data, it will not fall within the current day’s compaction and will result in more small files.

Iteration 3: HBase + HDFS

The complexity and shortcomings mentioned in the previous examples come mostly as a result of trying to get HDFS to do things for which it wasn’t optimized or designed. There is yet another and perhaps better-known option: instead of trying to optimize a single storage layer for nonoptimal performance and usage characteristics, we can marry two storage layers based on their respective strengths. This idea is similar to the Lambda architecture, in which you have a “speed layer” and a “batch layer.” The speed layer ingests streaming data and provides storage capable of point lookups on recent data, mutability, and crucially, low-latency reads and writes. For clients needing an up-to-the-second view of the data, the speed layer makes this data available. The traditional Hadoop options for the speed layer are HBase or its

“big table” cousin Cassandra. HBase and Cassandra thrive in online, real-time, highly concurrent environments; however, their Achilles’ heel is that they do not provide the fast analytical scan performance of HDFS and Parquet. Thus, to enable fast scans and analytics, you must shift data out of the speed layer and into the batch layer of HDFS and Parquet. Data is now being streamed into our speed layer of HBase or Cassandra. When “enough” data has accumulated, a flusher process comes along and moves data from the speed layer to the batch layer. Usually this is done after enough data has accumulated in the speed layer to fill an HDFS partition, so the flusher process is responsible for reading data from the speed layer, rewriting it as Parquet, adding a new partition to HDFS, and then alerting Hive and Impala of the new partition. Because data is now stored in two separate storage managers, clients for this data either need to be aware of the data being in two places, or you must add a service layer to stitch the two layers together to abstract users from this detail. In the end, our architecture looks something like that shown in Figure 1-6.

Figure 1-6. Real-time data flow using HBase and HDFS

This architecture represents a scale-out solution with the advantages of fresh, up-to-the-second data, mutability, and fast scans. However, it fails when it comes to simplicity of development, operations, and maintenance. Developers are tasked with developing and maintaining code for not only data ingest, but also the flusher process to move data out of the speed layer into the batch layer. Operators must now maintain another storage manager, if they aren’t already, and must monitor, maintain, and troubleshoot processes for both data ingestion and the process to move data from the speed layer to the batch layer. Lastly, with new and historical data spread between two storage managers, there isn’t an easy way for clients to get a unified view from both the speed and batch layers.
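To give a sense of that development burden, here is a minimal, hypothetical sketch of such a flusher in Java: it scans a completed key range out of an HBase “speed” table and rewrites it as a Parquet file in the HDFS “batch” layer. The table name, column family layout, and paths are invented for illustration, and a real flusher would also need to register the new partition with Hive/Impala and prune the flushed rows from the speed layer.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class SpeedToBatchFlusher {
      public static void main(String[] args) throws Exception {
        Schema avroSchema = SchemaBuilder.record("Event").fields()
            .requiredString("event_id")
            .requiredLong("event_ts")
            .requiredDouble("value")
            .endRecord();

        Configuration conf = new Configuration();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table speed = conn.getTable(TableName.valueOf("events_speed"));
             ParquetWriter<GenericRecord> writer = AvroParquetWriter
                 .<GenericRecord>builder(
                     new Path("hdfs:///data/events/base/day=2018-07-09/part-0.parquet"))
                 .withSchema(avroSchema)
                 .build()) {

          // Scan only the row-key range covering the window being flushed.
          Scan scan = new Scan();
          scan.setStartRow(Bytes.toBytes("2018-07-09"));
          scan.setStopRow(Bytes.toBytes("2018-07-10"));
          for (Result r : speed.getScanner(scan)) {
            GenericRecord rec = new GenericData.Record(avroSchema);
            rec.put("event_id", Bytes.toString(r.getRow()));
            rec.put("event_ts", Bytes.toLong(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("ts"))));
            rec.put("value", Bytes.toDouble(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("val"))));
            writer.write(rec);
          }
        }
        // A production flusher would now add the partition to the Hive Metastore
        // (for example, ALTER TABLE ... ADD PARTITION) and delete the flushed
        // rows from the speed layer.
      }
    }

And this is only the happy path; it still needs orchestration, failure handling, and de-duplication before it is safe to run unattended.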

These approaches are sophisticated solutions to a seemingly simple problem: scalable, up-to-the-second analytics on rapidly produced data. These solutions aren’t easily developed, maintained, or operated. As a result, they tend to be implemented only for use cases in which the value is sure to justify the high level of effort. It is our observation that, due to this complexity, the aforementioned solutions see limited adoption and, in most cases, users settle for a simpler, batch-based solution.

Real-Time Processing

Scalable stream processing is another trend gaining steam in the Hadoop ecosystem. As evidence, you can just look at the number of projects and companies building products in this space. The idea behind stream processing is that rather than saving data to storage and processing the data in batches afterward, events are processed in flight. Unlike batch processing, in which there is a defined beginning and end of a batch, in stream processing the processor is long-lived and operates constantly on small batches of events as they arrive. Stream processing is often used to look at data and make immediate decisions on how to transform, aggregate, filter, or alert on data. Real-time processing can be useful in model scoring, complex event processing, data enrichment, or many other types of processing. These types of patterns apply in a variety of domains for which the “time-value” of the data is high, meaning that the data is most valuable soon after creation and then diminishes thereafter. Many types of fraud detection, healthcare analytics, and IoT use cases fit this pattern. A major factor limiting adoption of real-time processing is the challenging demands it places on data storage.

Stream processing often requires external context. That context can take many different forms. In some cases, you need historical context and you want to know how the recent data compares to data points in history. In other cases, referential data is required. Fraud detection, for example, relies heavily on both historical and referential data. Historical data will include features like the number of transactions in the past 24 hours or the past week. Referential features might include things like a customer’s account information or the location of an IP address. Although processing frameworks like Storm, Spark Streaming, and Flink provide the ability to read and process events in real time, they rely on external systems for storage and access of external context. For example, when using Spark Streaming, you could read micro-batches of events from Kafka every few seconds. If you wanted to be able to save results, read external context, calculate a risk score, and update a patient profile, you now have a diverse set of storage demands:

Row-by-row inserts
   As events are being generated, they need to be immediately saved and available for analysis by other tools or processes.

Low-latency random reads
   When a streaming engine encounters an event, it might need to look up specific reference information related to that event. For example, if the event represents data from a medical device, specific contextual information about the patient might be needed, such as the name of their clinic or particular information related to their condition.

Fast analytical scans
   How does this event relate to history? Being able to run fast analytical scans and SQL in order to gain historical context is often important.

Updates
   Contextual information can change. For example, contextual information being fed from Online Transactional Processing (OLTP) applications might populate reference data like contact information, or a patient’s risk score might be updated as new information is computed.

When examining these use cases, you’ve surely noticed the theme of having “real-time” capabilities. And you might correctly point out that many organizations are successfully accomplishing the use cases we just mentioned on Hadoop without Kudu. This is true; taken separately, the Hadoop storage layers can handle fast inserts, low-latency random reads, updates, and fast scans. Low-latency reads and writes of your data? Yep, got HBase for that. Fast analytics? HDFS can scan your data like nobody’s business—what else you got? Looking for something simple to use? Sure, we can batch ingest your files into HDFS with one command! The trouble comes when you ask for all those things: row-by-row inserts, random reads, fast scans, and updates—all in one. This leads to complex, often difficult to maintain architectures as demonstrated in Figure 1-7.

Figure 1-7. Real-time data flow

Handling this diverse list of storage characteristics is possible in many relational databases but immensely difficult using Hadoop and big data. The brutal reality of Hadoop is that these use cases are difficult because the platform forces you to choose storage layers based on a subset of these characteristics.

Hardware Landscape

Hadoop was designed with specific hardware performance and cost considerations in mind. If we go back 15 or so years, when the ideas for Hadoop originated, a reasonably priced server contained some CPU, let’s say 8 or 16 GB of DRAM, and exclusively spinning disks. Hadoop was ingeniously designed to balance cost and performance, minimizing the use of expensive DRAM and avoiding precious disk seeks to maximize throughput. The steady march of Moore’s Law has ensured that more bits can fit on our memory chips as well. Commodity servers now have several hundreds of gigabytes of DRAM, tens of terabytes of disk, and increasingly, new types of nonvolatile storage.

This brings us to one of the most dramatic changes happening in hardware today. An average server today is increasingly likely to contain some solid-state, or NAND, storage, which brings new potential. Solid-state drive (SSD) storage brings huge changes in performance characteristics and the removal of many traditional bottlenecks in computing. Specifically, NAND-based storage can handle more I/O, delivers higher throughput, and has lower-latency seeks versus the traditional hard-disk drive (HDD). Newer trends like three-dimensional NAND storage are exaggerating these performance changes even further. An L1 cache reference takes about half a nanosecond, DRAM is about 200 nanoseconds, 3D XPoint (3D NAND from Intel) takes about 7,000 nanoseconds, an SSD drive around 100,000 nanoseconds, and a disk seek is 10 million nanoseconds. To understand the scale of the performance difference between these mediums, suppose that the L1 cache reference is one second. In equivalent terms, a DRAM read would be about six minutes, 3D NAND (3D XPoint, in this case) would be around four hours, an SSD would be a little more than two days, and an HDD seek would be around seven months.

Speed and performance have profound implications on system design. Think about how the speed of transport enabled by automobiles reshaped the world’s cities and suburbs in the 20th century. With the birth of the car, people were suddenly able to live further away from city centers while still having access to the amenities and services of a city via automobile. Cities sprawled and suburbs were born. However, with all these fast cars on the road, a new bottleneck arose, and suddenly there was traffic! As a result of all the fast cars, we had to create larger and more efficient roads. In the context of Hadoop and Kudu, the shift to NAND storage dramatically lowers the computational cost of a disk seek, meaning a storage system can do more of them. If you’re using SSD for storage, you expect to be able to reasonably serve new

workloads with random read/write for increased flexibility. SSDs also bring improvements in input/output operations per second (IOPS) and throughput, so as data is being brought to the CPU more efficiently, this sets up the potential for a new bottleneck—the CPU. As we look further into the future and these trends become further exaggerated, data platforms should be able to take advantage of improved random I/O performance and should be CPU efficient. As you might expect, if you purchase one (or one thousand) of these servers, you expect your software to be able to fully utilize its advantages. Practically speaking, this means that if your servers now have hundreds of gigabytes of RAM, you are able to scale your heap to serve more data from memory and see reduced latency. The hardware landscape continues to evolve, and the bottlenecks, cost considerations, and performance of hardware are vastly different today than they were 20 years ago when the ideas behind Hadoop first began. These trends continue to change rapidly and will keep doing so.

Kudu’s Unique Place in the Big Data Ecosystem

Like other parts of Hadoop, Kudu is designed to be scalable and fault tolerant. Kudu is explicitly a storage layer; therefore, it is not meant to process data and instead relies on the external processing engines of Hadoop, like MapReduce, Spark, or Impala, for that functionality. Although it integrates with many Hadoop components, Kudu can also run as a self-contained, standalone storage engine and does not depend on other frameworks like HDFS or ZooKeeper. Kudu stores data in its own columnar format natively in the underlying Linux filesystem and does not utilize HDFS in any way, unlike HBase, for instance.

Kudu’s data model will be familiar to anyone coming from an RDBMS background. Even though Kudu is not a SQL engine itself, its data model is similar to that of a database. Tables are created with a fixed number of typed columns, and a subset of those columns will make up your primary key. As in an RDBMS, the primary key enforces uniqueness for that row. Kudu utilizes a columnar format on disk; this enables efficient encoding and fast scans on a subset of columns. To achieve scale and fault tolerance, Kudu tables are broken up into horizontal chunks, called tablets, and writes are replicated among tablets using a consensus algorithm, called Raft.

Kudu’s design is fueled by a keen understanding of complex architectures and limitations in present-day Hadoop, as well as the stark trade-offs developers and architects have to make when choosing a storage engine. Kudu acts as a moderate choice with the goal of having a strong breadth of capabilities and good performance for a variety of workloads. Specifically, Kudu blends low-latency random access, row-by-row inserts, updates, and fast analytical scans into a single storage layer. As discussed, the venerable storage engines HDFS, HBase, and Cassandra have each of these capabilities in

isolation; none have all of these capabilities themselves. This difference in workload characteristics between existing Hadoop storage engines is referred to as the analytical gap and is illustrated in Figure 1-8.

Figure 1-8. Kudu fills the storage gap

HDFS is an append-only filesystem; it performs best with large files for which a processing engine can scan huge amounts of data sequentially. On the other end of the spectrum is Apache HBase or its big table cousins like Cassandra. HBase and Cassandra brought real-time reads and writes and other features needed for online, OLTP-like workloads. HBase thrives in online, real-time, highly concurrent environments with mostly random reads and writes or short scans. You’ll notice in the illustration that Kudu doesn’t claim to be faster than HBase or HDFS for any one particular workload. Kudu has high-throughput scans and is fast for analytics. Kudu’s goal is to be within two times of HDFS with Parquet or ORCFile for scan performance. Like HBase, Kudu has fast, random reads and writes for point lookups and updates, with the goal of one-millisecond read/write latencies on SSD.

If we revisit our earlier real-time analytics use case, this time using Kudu, you’ll notice that our architecture is dramatically simpler (Figure 1-9). The first thing to note is that there is a single storage layer, and there are no user-initiated compactions or storage optimization processes. For the operators, there is only one system to monitor and maintain, and they don’t need to mess with cron jobs to shuffle data between speed and batch layers, or among Hive partitions. Developers don’t need to deal with writing data maintenance code or handle special cases for late-arriving data, and updates are handled with ease. For the users, they have immediate access to their real-time and historical data, all in one place.
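To make the simplification concrete, here is a minimal, hypothetical sketch of the same flow using the Kudu Java client: the producer inserts each event directly into a Kudu table, and the very same table can be scanned immediately, with column projection and predicates pushed down to Kudu. The table name, columns, and master address are invented for this example.

    import java.util.Arrays;
    import org.apache.kudu.client.Insert;
    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.client.KuduPredicate;
    import org.apache.kudu.client.KuduScanner;
    import org.apache.kudu.client.KuduSession;
    import org.apache.kudu.client.KuduTable;
    import org.apache.kudu.client.PartialRow;
    import org.apache.kudu.client.RowResult;
    import org.apache.kudu.client.SessionConfiguration;

    public class KuduRealTimeFlow {
      public static void main(String[] args) throws Exception {
        KuduClient client =
            new KuduClient.KuduClientBuilder("kudu-master-1:7051").build();
        try {
          KuduTable table = client.openTable("events");  // hypothetical table

          // Producer side: row-by-row inserts, batched in the background.
          KuduSession session = client.newSession();
          session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
          Insert insert = table.newInsert();
          PartialRow row = insert.getRow();
          row.addString("device_id", "device-42");
          row.addLong("event_ts", 1531180800000L);
          row.addDouble("value", 98.6);
          session.apply(insert);
          session.flush();  // make sure the write reached the tablet servers

          // Analyst side: the same table is immediately scannable; only the
          // projected columns are read, and the predicate is evaluated by Kudu.
          KuduScanner scanner = client.newScannerBuilder(table)
              .setProjectedColumnNames(Arrays.asList("event_ts", "value"))
              .addPredicate(KuduPredicate.newComparisonPredicate(
                  table.getSchema().getColumn("device_id"),
                  KuduPredicate.ComparisonOp.EQUAL, "device-42"))
              .build();
          while (scanner.hasMoreRows()) {
            for (RowResult result : scanner.nextRows()) {
              System.out.println(result.getLong("event_ts") + " " + result.getDouble("value"));
            }
          }
        } finally {
          client.close();
        }
      }
    }

Because the storage layer itself supports inserts, updates, and scans, there is no flusher or compaction job left to write or operate.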

Figure 1-9. Architecture with Kudu

Comparing Kudu with Other Ecosystem Components

Kudu is a new storage system. With the hundreds of databases that have been created over the past 10 years, we personally feel fatigued. Why another storage system? In this section, we continue to answer that question by comparing Kudu against the landscape of traditional ecosystem components. The easiest comparison is with the SQL engines such as Hive, Impala, and SparkSQL. Kudu doesn’t execute SQL queries in whole. There are parts of the SQL query that can be pushed down to the storage layer, such as projection and predicates, which are indeed executed in Kudu. However, a SQL engine has parsers, a planner, and a method for executing the queries. Kudu functions only during query execution and provides input to the planner. Furthermore, Kudu must have projection and predicates pushed down or communicated to it by the SQL engine to participate in the execution of the query and act as more than a simple storage engine.

Now let’s compare Kudu against a traditional relational database. To discuss this, we need to define the types of relational databases. Broadly speaking, there are two kinds of relational databases: Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) (Figure 1-10). OLTP databases are used for online services on websites, point of sale (PoS), and other applications for which users expect immediate response and strong data integrity. However, OLTP databases typically do not perform well at large scans, which are common in analytical workloads. For example, when servicing a website, an orders page will commonly display all orders for a particular user, e.g., “My Orders” on any online website. However, in an analytical context you typically don’t care about orders for a particular user; you care about all sales of a given product and often all sales of a particular provider grouped by, for instance, state. OLAP databases perform these types of queries well because they are optimized for scans. Typically, OLAP databases perform poorly at the types of use cases for

which OLTP databases are good; for example, selects of a small number of rows, and they commonly give up strong integrity as well. You might be surprised that any database would sacrifice on integrity, but OLAP databases are typically loaded in batches, so if anything goes wrong, the entire batch is reloaded without any data loss.

Figure 1-10. OLTP and OLAP

There is another concept we must be aware of before proceeding. Rows of data can be stored in row or column format (see Figures 1-11 and 1-12). Typically, OLTP databases use row format, whereas OLAP databases use columnar. Row formats are great for full-row retrievals and updates. Columnar formats are great for large scans that select a subset of the columns. A typical OLTP query is to get the full row, whereas a typical OLAP query retrieves only part of the row.

Figure 1-11. Row and column storage, part 1

Figure 1-12. Row and column storage, part 2

Not to beat a dead horse, but if you imagine a “My Orders” page on an online store, you will see all relevant information about the order including the billing and shipping addresses along with product name, code, quantity, and price. However, for a query that calculates total sales for a given product by state, the only fields of interest are the product name, quantity, price, and billing state. Now that we have defined OLTP and OLAP along with row and column format, we can finally compare Kudu versus relational database systems. Today, Kudu is most often thought of as a columnar storage engine for the OLAP SQL query engines Hive, Impala, and SparkSQL. However, because rows can be quickly retrieved by primary key and continuously ingested all while the table is being scanned for analytical queries, Kudu has some properties of both OLTP and OLAP systems, putting it in a third category that we discuss later.

There are many, many kinds of relational databases in both the OLTP and OLAP spaces, ranging from single-process databases such as SQLite (OLTP) to shared-everything partitioned databases such as Oracle RAC (OLTP/OLAP) to shared-nothing partitioned databases such as MySQL Cluster (OLTP) and Vertica (OLAP). Kudu has much in common with relational databases; for example, Kudu tables have a unique primary key, unlike HBase and Cassandra. However, features that are common in relational databases, such as common types of transactional support, foreign keys, and nonprimary key indexes, are not supported in Kudu. These are all possible and on the roadmap, but not yet implemented.

Kudu plus Impala is most similar to Vertica. Vertica is a Postgres variation using storage based on a system called C-Store, from which Kudu also takes some heritage. However, there are important differences. First, because the Hadoop ecosystem is not vertically integrated like traditional databases, we can bring many query engines to the same storage systems. Second, because Kudu implemented a quorum-based storage system, it has stronger durability guarantees. Third, because Hadoop-based query engines schedule work where data resides locally, as opposed to using a shared storage system such as a SAN, queries benefit from what is known as data locality.

Data locality just means that the “work” can read data from a local disk, which typically has less contention than a shared storage device. Some databases like Vertica, which are based on a shared-nothing design, can also schedule work “close” to the data. But they can schedule only Vertica queries close to the data. You cannot extend them to add new engines and workloads on account of their vertically integrated nature.

Kudu today has some OLAP and OLTP characteristics minus cross-row Atomicity, Consistency, Isolation, Durability (ACID) transactions, putting it in a category known as Hybrid Transactional/Analytic Processing (HTAP). For example, Kudu was built to allow for fast record retrieval by primary key and continuous ingest along with constant analytics. OLAP databases don’t often perform well with this mix of use cases. Additionally, Kudu’s durability guarantees are closer to an OLTP database than OLAP. Long term, they can become even stronger with synchronous cross-datacenter replication similar to Google Spanner, an OLTP system. Kudu, with its quorum capability, has the ability to implement what is known as Fractured Mirrors, in which one or two nodes in a quorum use a row format, whereas the third node stores data in a column format. This would allow you to schedule OLTP-style queries on row-format nodes, whereas you can perform OLAP queries on the columnar nodes, mixing both workloads. Lastly, the underlying hardware is changing, which also, given sufficient support, can blur the lines between these two kinds of databases. For example, a big problem when using a columnar database for an OLTP workload is that OLTP workloads often want to retrieve a large subset of a row; in a columnar database that can translate to many disk seeks. However, SSD and persistent memory are mostly eliminating this problem.

Big Data—HDFS, HBase, Cassandra

With all this in mind, how does Kudu compare against other big data storage systems such as HDFS, HBase, and Cassandra? Let’s define at a high level what these other systems do well. HDFS is extremely good when a program is scanning large volumes of data. In short, it’s fantastic at “full-table scans,” which are extremely common in analytical workloads. HBase and Cassandra are great at random access, reading or modifying data at random. HDFS is poor at random reads, and although it cannot technically perform random writes, you can simulate them through a merge process, which is expensive. HBase and Cassandra, on the other hand, perform extremely poorly relative to HDFS at large scans. Kudu’s goal is to be within two times that of Parquet on HDFS for scans, and similarly close to HBase and Cassandra for random reads. The actual random read goal is one millisecond read/write on SSD.

We go into a little more detail on each system so as to describe why HDFS, HBase, Cassandra, and Kudu perform the way they do. HDFS is a purely distributed filesystem that was designed to perform fast, large scans. After all, the first use case for this design was to build an index of the web in batch. For that use case, and many others, you simply need to be able to scan the entire dataset in a performant manner.

HDFS partitions data and spreads it out over a large number of physical disks so that these large scans can utilize many drives in parallel.

HBase and Cassandra are similar to Kudu in that they store data in rows and columns and provide the ability to randomly access the data. However, when it comes to storing data on disk, they store it much differently than Kudu. There are many reasons Kudu is faster at scanning data than these systems, but one of the major reasons is that HBase and Cassandra store data in column families; as such, they are not truly columnar. The net result is twofold: first, data cannot be encoded column by column, so it misses the extreme compression that columnar encoding provides (as discussed later, similar data stored “close” to each other compresses better), and second, one column in the family cannot be scanned without physically reading the other columns.

Another reason Kudu is faster for large scans is that it doesn’t perform scan-time merges. Kudu doesn’t guarantee that scans return data in exact primary key order (fine for most analytic use cases), and thus we don’t need to perform any “merge” between different RowSets (chunks of rows). The “merge” in HBase happens for two reasons: to provide order, and to allow new versions of cells or tombstones to overwrite earlier versions. We don’t need the first because we don’t guarantee order, and we don’t need the second because we use an entirely different delta-tracking design. Rather than storing different versions of each cell, we store base data, and then separate blocks with deltas forward/backward from there. If there have been few recent updates, the forward (REDO) deltas get compacted into the base data and only backward (UNDO) deltas are stored. Then, for a current-time query, we know that we can disregard all backward deltas, meaning that we need to read only the base data.

Reading the base data is very efficient because of its columnar and dense nature. Having schemas means that we don’t need to encode column names or value lengths for each cell, and having the deltas segregated means that we don’t need to look at timestamps either, resulting in a scan that is nearly as efficient as Parquet. One particular reason for efficiency relative to HBase is that in avoiding all of these per-cell comparisons, we avoid a lot of CPU branch mispredictions. Each generation of CPUs over the past 10 years has had a deeper pipeline and thus a more expensive branch misprediction cost, and so we get a lot more “bang for the buck” per CPU cycle.

Conclusion

Kudu is neither appropriate for all situations nor will it completely replace venerable storage engines like HDFS or newer cloud storage like Amazon S3 or Microsoft Azure Data Lake Store. In addition, there are certainly use cases for HBase and Cassandra that Kudu cannot fill today. However, there is a strong wind in the market, and it’s pushing data systems toward even greater scale, toward analytics, toward

machine learning, and toward doing all of these things in an operational and real-time way—that is to say, production-scale, operational systems running machine learning and analytics to deliver a product or service to an end user. Being able to serve this blend of operational and analytical capabilities is the unique realm of Kudu. As we’ve demonstrated in this chapter, there are many ways to build such systems, but without Kudu, your architecture will likely be complex to develop and operate, and your datasets might even be split between different storage engines.

CHAPTER 2
About Kudu

Apache Kudu is often summarized in one single phrase: a Hadoop storage layer to enable fast analytics on fast data. Although that is a short, simple-to-understand statement, until now achieving those goals has not been easy.

We can achieve analytics with today’s big data technology. Namely, storing data in highly efficient columnar storage formats in particular, such as Parquet and ORC, allows compute engines to sequentially read data across the entire distributed filesystem, HDFS, at a very high rate. Analytical queries perform large aggregations over a subset of columns. This effectively translates to a projection of the columns the query is requesting, coupled with performing a mathematical operation on a large number of values in that column. Thus, columnar storage formats are terrific because a) to project a column simply means you limit the I/O to solely the pages on disk containing data for that column—which is doable because the format is already split across columns, and b) numbers in particular can use various encoding and packing mechanisms to stuff massive amounts of data representing many rows onto a single page on disk. This means that I/O can be extremely efficient, and compute operations on the values in a column can be performed quickly. In short, the HDFS filesystem, coupled with columnar file formats, yields highly performant I/O and compute capability, resulting in analytics queries being processed quickly.

On the other hand, we also can achieve fast data with today’s big data technology. If we look at HBase, Cassandra, and other NoSQL storage engines, they are built from the ground up to allow data to come in and be stored at extremely high rates of ingestion, whether using bulk loads or a whole set of put operations. They allow for effective INSERT (put()), UPDATE (put()), which will overwrite the existing record, or DELETE (delete()) operations for data given a particular unique key, and perform them at an extremely high rate. These engines typically have low response time for scans as well; however, that performance is limited to when scans are

based on the key itself. Scanning through the dataset using any other column in these engines will result in poor performance, given that only the key values are sorted and maintained. Lookups on specific keys, such as user or object profile information, are also extremely efficient in these storage engines. Table 2-1 compares different features between HBase, Parquet on HDFS, and Kudu.

Table 2-1. Storage layers comparison Engine Single Transactions Consistency Random Columnar Compression Key Data row operations encoding encoding access HBase Yes No Row-level Yes Yes Yes Yes No Parquet No No Partition No Yes Yes Yes Yes Kudu Yes No Configurablea Yes Yes Yes Yes Yes a https://kudu.apache.org/2017/09/18/kudu-consistency-pt1.html

What we find then is that we can achieve both “fast analytics” as well as “fast data” in the Hadoop ecosystem today, but the key is that we cannot achieve it by a single stor‐ age engine. You can perform “fast analytics” at the cost of “slow data,” whereas you can achieve “fast data” at the cost of “slow analytics.” Kudu’s design goals from the very beginning were to provide a new storage layer founded on the idea of achieving both fast analytics and handling fast data so that the time from when data is generated and stored in Hadoop to the time data is served in providing analytics is shortened and simplified in a single storage engine. Kudu has the properties of a columnar storage as found in typical storage file formats such as Parquet and ORC on HDFS, storing its data in a Parquet-like format, though no longer using HDFS as the underlying filesystem. It provides the user with a set of NoSQL-like APIs, such as put, get, delete, and scan, resembling a NoSQL engine such as HBase. Finally, to coincide with the theme of big data in terms of high availability and resil‐ iency to failures, Kudu employs several architectural measures to ensure high availa‐ bility, resiliency, durability, and scalability to fit right in with the rest of the ecosystem. Kudu High-Level Design As we observed in Chapter 1, Kudu has been designed to achieve several master objectives:

• Allow users to scan data as fast as possible, up to twice the speed of scanning raw file access in HDFS

22 | Chapter 2: About Kudu • Allow users to perform random reads and writes as fast as possible, with approxi‐ mately one millisecond response times • Do so in a highly available, fault-tolerant, durable fashion

To achieve those objectives, Kudu provides an architecture to perform extremely fast and scalable columnar storage and access to your data, with a truly distributed and fault-tolerant model. It is designed around today’s next-generation hardware becom‐ ing readily available, taking advantage of the properties of the fastest solid-state drives (SSDs), Streaming SIMD Extensions 4 (SSE4), and Advanced Vector Extensions (AVX) instruction sets. Kudu Roles Kudu developed as a result of learning from numerous software projects already existing in the big data space. Several concepts are derived from BigTable (open source derivative HBase), GFS (open source derivative HDFS), Cassandra, and more, somewhat taking the “best of breed” features to help Kudu achieve its goals of provid‐ ing fast analytics on fast data. At its core, Kudu is a storage engine for tables. It does not provide SQL; there is no SQL rewrite and no SQL access plans as you would see in relational database manage‐ ment systems (RDBMS’s). Rather it purely stores structured strongly typed tables and provides efficient methods in which you can access and load data in those tables. Those tables have names and are structured, meaning that they have a defined set of columns, and those columns have data types and column names to go with them. Kudu stores its own metadata information. Table names, column names, column types, and so on, as well as the data you put in your tables must be stored in the plat‐ form. All data, whether it is your own user data or data about your tables (metadata) is just data in the end and needs to be stored. At its very foundation, Kudu stores this data in what’s called a tablet. Kudu has two different roles to manage these types of data. Metadata is stored in tablets that are managed by a master server. User data that goes into your tables is stored in tablets that are managed by tablet servers. Because Kudu stores its own metadata, it doesn’t need access to the Hive Metastore. However, if you intend to use Impala on top of Kudu, Impala will need this Hive service. Both master and tablet servers are similar in that they manage tablets containing physical data that reside on the server on which they are running. A minimum of three servers typically exist for each role, on which data can be both replicated, and, by employing the Raft consensus algorithm, a given server is elected leader for a rep‐ lica of a tablet.

Kudu High-Level Design | 23 Figure 2-1 displays a concept for both master and tablet servers that is identical. Although the roles are similar in their concept of data storage, replication, and man‐ agement, they are different in their purpose and the type of data they store.

Figure 2-1. Kudu roles

Thus, all the Kudu servers are divided into just two types—the master servers and the tablet servers. Master Server The role of the master server is to direct the operations on the Kudu cluster. To define a table or get properties or metadata about a table, clients interact with a single mas‐ ter server. The master server is, in fact, a single-tablet table, storing metadata such as table and column names, column types, and data location, as well as information such as state (whether the table is being created, running, deleting, and more). Essen‐ tially, it manages the system “catalog,” which is what every RDBMS employs, as well. The amount of catalog data per table is very small. As such, to improve performance, it keeps a full write-through cache of the entire catalog in memory at all times. The master server is accountable to store the metadata information that will be used by the client application to identify the data location. Even if this role is critical to get access to the data, the master servers don’t have a lot of activities to perform. There‐ fore, you can install them on small hardware. Because the master server will use the configured replication factor to replicate the metadata storage, it is important to have the same number of master servers as the configured replication factor. By default, you should install three servers. This will also ensure the Kudu cluster access high availability.

24 | Chapter 2: About Kudu Even though for production environments multiple master servers are recom‐ mended, we will run most of our testing with a single server. Tablet Server The tablet server is what we can call a worker node. If we compare it to existing big data technologies, its role can be compared to a mix of an HDFS DataNode and an HBase region server. Its role is to perform all the data-related operations: storage, access, encoding, compression, compaction, and replication. As you can see, com‐ pared to the master server, the tablet server is really the service doing all the heavy work. This is where we want to scale. The tablet server is also accountable to replicate the data to the other tablet servers.

• Kudu can work with up to 300 servers. However, to achieve the best stability, at this time, it is recommended to work with fewer than 100 tablet servers. • A maximum of 2,000 tablets per tablet server, post-replication, is recommended. • A maximum of 60 tablets per tablet server per table, post-replication, is recom‐ mended. • A maximum 8 TB of data on disk per tablet server is recommended. The total of all the disks on the server can be more than 8 TB and can be shared with HDFS; however, it is recommended to not use more than 8 TB for Kudu data. • To get the best scan performance on big fact tables, it is recommended to keep a ratio of one tablet for one CPU core. Do not count the replicated tables. You should count only leader tablets. For small dimension tables, a few tablets are fine.

All those numbers are based on Kudu’s recommendations at the time of writing. Please refer to the Kudu Apache known issues for the latest numbers.

Storage One of the things that makes Kudu efficient and able to improve both the writes and the reads is the way it stores data. Although many existing big data tools will use spe‐ cific formats (like HFiles for HBase), Kudu takes the best of all the different applica‐ tions and formats. If you are already used to the Hadoop ecosystem, many of the storage optimizations listed in the subsections that follow will be familiar to you.

Columnar format. The first thing to understand is that Kudu will store all of the data in a columnar format. Looking at an example is the best way to understand how this format works and how it can benefit Kudu performance. Consider a regular file, like a CSV file, in which one line represents a single row, and for which the columns are represented, one after the other, using any kind of format

Kudu High-Level Design | 25 (strings, numbers, etc.) and separated by a given character (see Figure 2-2). Consider that we have 100 rows within our file, and for each row, we have 10 columns, the first one being called id, and another one being called score. Let’s consider that we want to calculate the average score for the entire dataset. To perform this operation, we will need to open the file, read the first record, extract the score value, read the second record, extract the score value, and so on until we reach the end of the file. At the end, to perform the required average, you will have read all the columns of all the records. So 100% of the file will need to be loaded from the disk to the system and then parsed.

Figure 2-2. Flat file CSV format

The concept behind a columnar storage is to regroup the data not by row as in a reg‐ ular file but, as its name indicates, by column. Therefore, all of the data stored in the first column of all the rows are first stored, then all the second column, and so on. The same way, all the score column content for all the rows is going to be stored in the same section. An application trying to read all the score columns for all the rows will just read the columnar index to figure where the related section is, and then will read only the required data. All the remaining columns and sections will be skipped. If we imagine that all columns are the exact same size, processing an average aggrega‐ tion on a columnar format file versus a regular one will save about 90% of the disk transfer. Figure 2-3 shows the benefit of this format.

Figure 2-3. Columnar format

26 | Chapter 2: About Kudu As it is easy to determine from Figure 2-3, scanning an entire column becomes very efficient. The downside is that retrieving an entire row will require accessing many blocks to reassemble the row, which obviously isn’t efficient. Thus, the first benefit of a columnar format is low I/O for queries that access a subset of columns. There are additional benefits related to I/O efficiency that we cover now at a high level. Later in the book they are covered in detail. Now that all values for a given column are stored together, we can exploit this in sev‐ eral ways to reduce the volume of data stored. Imagine a North America–based com‐ pany’s customer table in which one column is the country. Most of the rows will contain Canada, Mexico, or United States. Instead of spreading these values among values of other columns that include other unrelated strings and numbers, they are all grouped together. If we did nothing else, compression (LZ4, Snappy, or zlib) would have significantly more impact than on a row-based format. However, we can do better. When columnar formats were created and the previous column value locality was created, a new kind of compression was developed called encoding. These encodings have the same effect as compression, but with the excep‐ tion of plain encoding transform the data to a smaller representation before generic compression algorithms, which we just discussed, are applied. These encodings are typically data type specific, as shown in Table 2-2.

Table 2-2. Encoding types Encoding Applicable data types Description names Plain All Stores the data in its natural format, such as an int32 stored in 4 bytes, and does not reduce data size at all. Useful when CPU is your bottleneck or when the data is precompressed. Bitshuffle int8, int16, int32, int64, Rearranges the bytes before automatically LZ4 compressing. Bitshuffle encoding is a unixtime_micros, float, good choice for columns that have many repeated values, or values that change by and double small amounts when sorted by primary key. Run Length Boolean, int8, int16, Stores the count of repeated values and the value itself. As an example, this int32, int64, and encoding will allow you to save a lot of space for tables containing application unixtime_micros tickets, where 90% of them are in a closed state. Dictionary string and binary Builds a unique list of values and stores an index to the list. In our previous example, we had only three countries, and Canada will be encoded 1, Mexico will be encoded 2, and United States will be encoded 3. Then, for each value, instead of storing a long string, Kudu will store only a number from 1 to 3. Prefix string and binary Common prefixes are compressed in consecutive column values. Multiple good examples are a date field stored as a string where the beginning for the field will almost never change (prefix) whereas the end of the string will almost always change (suffix).

Kudu High-Level Design | 27 File layout and compactions. One of the biggest challenges of random updates in a storage tool is to expire nonrequired data and keep access to the most recent infor‐ mation—all of that, as fast as possible. To perform this, Kudu implements multiple mechanisms. Regarding the way Kudu stores data, let’s consider two main scenarios. It can be an insertion of new data, or it can be an update of existing data. At a high level, each write operation will follow these steps:

1. Writes are committed to a tablet’s Write Ahead Log (WAL) 2. Inserts are added to the MemRowSet 3. When the MemRowSet is full, it is flushed to disk and becomes the DiskRowSet

Each time Kudu receives new data, it stores that data in memory into what we call a MemRowSet. You can view the MemRowSet can be seen as a temporary write buffer. When the MemRowSet is full, it is flush on the disk. Consider Kudu receiving an insert for a table called users to put the value “secret” in the password column for the key “jmspaggi,” the last column being empty, and storing that into memory. At flush time, it creates what is called a DiskRowSet where it will separate all the columns into different sections, one per column, for that specific row. You can think of a Dis‐ kRowSet as a region of a bigger file used to store a specific set of data. When Kudu later receives another insert for the same table but this time for the row “mkovacev,” it first inspects all the existing DiskRowSets as well as the MemRowSet to make sure this row doesn’t belong to any of them (see Figure 2-4).

Figure 2-4. First insert in the table

Because none of them contains this row, the row is inserted in the main MemRowSet, which later will be flushed as a new DiskRowSet (Figure 2-5). Kudu needs to perform this validation of the existing DiskRowSets to guarantee uniqueness of the primary key. All data related to a specific primary key is stored, and can be stored, only in a

28 | Chapter 2: About Kudu single and same DiskRowSet. If you try to insert a primary key while it already exists, Kudu will return an error, the same as if you try to update a row that doesn’t exist.

Figure 2-5. Insert in existing table

Now, let’s consider an update for the row jmspaggi where the password field is set to easy (Figure 2-6). When Kudu receives the update command, it first determines whether this row is stored in the current MemRowSet or which DiskRowSet already contains the data for it. Using bloom filters to reduce I/Os, it will match the first set containing that row. Now that the correct destination has been identified, the updated column is stored in the dedicated DeltaMemStore for this specific set. Over time, this DeltaMemStore might also become bigger and will flush into a RedoDeltaFile within the DeltaRowSet.

Kudu High-Level Design | 29 Figure 2-6. Update of existing row

We now have most of the important files represented. The data blocks, called base data, one block per column in the table, are stored in a single file. The RedoDelta‐ Files, one per DeltaMemStore flush, are stored separately. When a client performs a GET operation against a specific row, Kudu has to read the data from the data files and then apply the required updates from reading the RedoDeltaFiles. The more RedoDeltaFiles there are, the slower the read operation will be. To avoid such impact, Kudu performs what we call major delta compactions. To reduce subsequent reads on those files and therefore improve read performances, this operation consists of apply‐ ing to the data files the updates stored in the RedoDeltaFiles. Beside major detail compactions, Kudu also performs two other kinds of compac‐ tions: delta minor compactions and merging compactions. As the Apache Kudu online documentation describes: A minor compaction is one that does not include the base data. In this type of compac‐ tion, the resulting file is itself a delta file. and A major REDO compaction is one that includes the base data along with any number of REDO delta files. Whereas having to read all RedoDeltaFiles to perform read operations has a cost, merging and rewriting all the data also has one. To ensure the best performance, Kudu will not merge all the RedoDeltaFiles into the existing data files. Indeed, depending on the size of the table, the data file can be big. If there are only a few rows to be updated, rewriting the entire data file to merge those rows will generate a lot of

30 | Chapter 2: About Kudu I/Os on the system, and will at the end, only provide a very small improvement on the operations. So while performing compactions, Kudu will generate three sets of data (shown in Figure 2-7):

Figure 2-7. Three sets of data are generated while performing compactions

Updated base data When a specific column requires a significant amount of updates, Kudu will apply the updates to this data by rewriting it, merged with the updates, into a new dataset. UNDO deltas While applying the delta modifications to the based data, Kudu keeps track of all the updates and stores them into what is called an undo delta file. This file is used by Kudu to be able to return a “point in time” value for a past date. By default, Kudu keeps this file for 15 minutes. Therefore, it is possible to query Kudu for what was the exact content of a column up to 15 minutes ago. After 15 minutes, on the next compaction iteration, data older than 15 minutes is expired and the file is removed. Unmerged REDO deltas As we have just seen, not all the delta modifications are applied. When modifica‐ tions are too small to provide benefits to a merge, they are kept on the side for a subsequent compaction. The unmerged redo deltas file contains all the updates not applied to the data files. Because there is now less data and files to read to rebuild a single row, performance of all subsequent read operations against a compacted Kudu is improved and disk access is reduced.

Kudu High-Level Design | 31 When Kudu needs to rewrite data files within a DiskRowSet, it makes sure that rows within the set are rewritten ordered by the key. This allows read requests to not have to read the entire file to find the data they are looking for. Indeed, when looking for a row not present in a file, Kudu can stop parsing the file as soon as the key read from the file is bigger than the requested key. Kudu Concepts and Mechanisms The paradigm of distributed systems spreads the load across multiple servers. It allows all the servers to participate in the read and write operations, and therefore improves the throughput and reduces the latency. In this section, we look at how Kudu spreads the data across the servers and how to configure your tables to help Kudu achieve the best possible distribution. Another very important Kudu concept is the primary key. Kudu regroups and indexes data based on the key. There isn’t yet any concept of secondary key or secon‐ dary index. Hotspotting (described in the next subsection) usually results from a bad key or partitioning design. Hotspotting Before going further, let’s describe the most common issue with distributed systems data access. It is called hotspotting. We see a hotspot when most read or write (or both) queries are reaching the same server. What we want to achieve is a perfect dis‐ tribution where all read and write operations are spread across most of the cluster. Let’s look at an example to better understand what hotspotting is. We will use this same example to describe how to correctly design your table. Consider a sensor table in which we store all metrics for a large set of sensors. Data is generated by the sen‐ sors, made available and pushed into the cluster, and queried for a specific sensor for its entire lifetime (trend detection, failure detection, etc.). Data is streamed into Kudu, and the information that we have to store is the date, the sensor ID, and its value. Because we want to be able to quickly query the data for a specific day, let’s assume that we configure the table to have the date as part of the primary key. (The primary key must be unique. Therefore, it cannot be only the date, unless you are storing only one value per date.) The table is then split into tablets (partitions), each of them storing a range of dates. When data is coming in the system, for the current day, it will reach the tablet server storing the tablet where the range of dates includes today. Because a single key is stored on a single server, no other server in the cluster will store today’s related data. Therefore, all the write operations will face a single and unique server. Although this server will probably be overwhelmed, all others will be idle. When the date moves to the end of the range, all operations will move to the next tablet, probably stored in another server, which will, in turn, become hotspotted.

32 | Chapter 2: About Kudu This is a very inefficient way to configure a table and design its access pattern. Figure 2-8 illustrates this.

Figure 2-8. Partition by range of time

The opposite issue—the writes are correctly spread across all the servers, but the reads are facing a single server—is very similar. When thinking about your table, you need to keep in mind three very important things:

• The read access pattern (throughput and latency) • The write access pattern (throughput and latency) • Storage overhead (compression ratio)

Partitioning The way data is retrieved from the cluster, and the way data is coming in, are very important and will drive the table partitioning design, because when developing a use case, the most important thing is the result. Read has been intentionally listed first. Indeed, even if you develop the best possible write partitioning, if your reads don’t reply to your use-case requirements (in terms of latency, access pattern, etc.), your project might fail to achieve its final goals. Storage overhead affects your table when data is spread across the servers. Indeed, as we have seen earlier in the columnar format example, storing similar data together improves the compression. Therefore, depending on your constraints, this might be something to consider. Kudu can perform two kinds of partitioning (which can be combined): range parti‐ tioning and hash partitioning.

Kudu Concepts and Mechanisms | 33 Range partitioning A range partition should be trivial. Refer back to Figure 2-8 to see an example of date range partitioning. We configured the table to have three ranges of dates, with the first tablet storing 2016 data, the second storing 2017 data, and the third storing 2018. As we saw earlier, such partitioning can cause writes to hotspot a server. However, performing a request for a single sensor ID will reach all the tablet servers in parallel and will perform efficiently. Range partitioning by the sensor ID will provide a nice distribution of the writes. Indeed, for a given day, all inserted data will reach all the servers, each of them con‐ taining a subset of the sensors. However, even if all sensors return the same amount of metrics, there might be more sensors within a range versus another one, which will unbalance the partitions and will again affect the performance. Figure 2-9 illustrates this issue.

Figure 2-9. Partition by range of ID

34 | Chapter 2: About Kudu Hash partitioning Using hash partitioning for the sensor ID will help with this challenge. Because all sensors will provide a different hash, the distribution will be even across all the servers and they will all grow at the same pace (Figure 2-10). However, scanning a partition for all the value of a sensor for its lifetime will hotspot the instance and pro‐ vide poor performance.

Figure 2-10. Partition by hash of ID

As you can see, hash and range partitions both come with some pros and cons, you need to think them through thoroughly. Combining the two is often the best approach, but, here again, you need to make sure to use the right one for the right dimension. In our given example, we want to spread the writes to multiple tablets and do the same for the reads. We figured that the hash partitioning will give the best result for the sensor ID, whereas the range partitioning will allow us good time-based query access. Combining the two allows us to get good write performances (because hash writes are spread across multiple partitions) and good read performances (because of the buckets, big scans can be run in parallel across multiple servers). What we call “bucket” here is an ensemble of hashes. As an example, if your sensor ID hashes have values between 00 and 99, one bucket can be all the hashes between 00 and 32, another all the hashes between 33 and 65, and the last hashes between 66 and 99. That way all the sensors will be spread across three buckets. If there is another partitioning dimension, more than one partition can store data for a specific bucket, depending on the other part of the key. Figure 2-11 illustrates how you should configure this table, using both hash and range partitioning.

Kudu Concepts and Mechanisms | 35 Figure 2-11. Partition by range of time and hash of ID

For more details, refer to Chapter 6.

36 | Chapter 2: About Kudu CHAPTER 3 Getting Up and Running

The easiest way to get started with any tool is through hands-on experience. The same applies to Kudu. In this chapter, we walk through the various options for instal‐ ling and configuring Kudu and then introduce you to the basics of an actual installa‐ tion using RPM packages. Chapter 4 walks you through the different Kudu roles. Installation The Apache Kudu website is the best place to check or find the latest information on installation options. Currently, there are five main ways to get started developing and using Kudu:

• The Kudu Quickstart Virtual Machine (VM) • Automated installation using Cloudera Manager on an existing cluster • Manual installation via packages • Building from source • Cloudera Quickstart VM

Apache Kudu Quickstart VM The Kudu Quickstart VM represents the most accessible and lowest cost way to get started with Kudu (Figure 3-1). The Quickstart VM has the advantage of not requir‐ ing a full cluster of machines. In case you break your installation, it also allows you to easily start over from scratch. Using the quickstart VM, we can familiarize ourselves with Kudu’s APIs and some of the tools and frameworks that integrate with Kudu, like Impala. The downside, of course, is that because it runs on a VM and not on a

37 cluster of dedicated machines, it’s really only relevant for development and demon‐ stration purposes.

Figure 3-1. Kudu Quickstart VM

Full instructions for using the Quickstart VM are available on the Kudu website. We recommend going to the website for up-to-date instructions. Installing the quickstart comes down to two steps:

1. Download and run Oracle VirtualBox. 2. Download and run the bootstrap script, which will download the Quickstart vir‐ tual image and import it into VirtualBox.

From there, you’ll have a single-node VM running Kudu and Impala. When using the VM with the examples in this book, note that the Quickstart doesn’t come with all the Hadoop goodies preinstalled. Therefore, if you want to test some of our end-to-end examples that use Spark, SparkStreaming, or Kafka, you’ll need to install Spark, Kafka, and Zookeeper manually (Zookeeper because Kafka depends on Zookeeper) or move to a different environment.

38 | Chapter 3: Getting Up and Running Using Cloudera Manager If you really want to demonstrate the power and scale of Kudu or deploy to produc‐ tion, you’ll need to deploy Kudu on a cluster of machines. This is most commonly done using Cloudera Manager and using Cloudera’s distribution of Apache Hadoop. Cloudera Manager automates the process of preinstall cluster validation, Kudu cluster installation, configuration, and monitoring. Rather than using the traditional Linux package manager or Red Hat Package Man‐ ager (RPM), most Cloudera users choose to install Kudu using a binary distribution format called parcels. Parcels are a simplified way for Cloudera to package and install the various components of its distribution. Starting from CDH 5.10, Kudu is already included in the parcel and can simply be added to the cluster by using the “Add Ser‐ vice” option. We recommend using Cloudera Manager and parcels as the easiest way to manage and install Kudu for a production deployment (Figure 3-2). Because the steps for installation might vary slightly with each Cloudera release, consult Cloudera’s docu‐ mentation.

Figure 3-2. Kudu in Cloudera Manager

Cloudera Manager does a good job of configuring Kudu according to best practices but it cannot plan your deployment for you. There are still many other considerations around hardware selection, capacity planning, selecting hosts and roles (master ver‐ sus tablet servers), and storage locations for Kudu’s tablet data and write-ahead logs (WALs). We discuss these in Chapter 4.

Installation | 39 Building from Source If you want to get started with Kudu development or desire the flexibility to have the latest and greatest version of Kudu’s upstream code, you can build and install Kudu directly from the source. Check the Apache Kudu website for detailed instructions. Building from source gives you the chance to get closer to Kudu source code. How‐ ever, it will require more steps to install, and the issues you might face to build the application might be more difficult to solve. One good side of building from source is that it allows you to easily choose which version of Kudu to test. One downside is integrating it with the other applications of the ecosystem. Packages For the purposes of this book, we’re going to cover installation using Kudu packages. Packages are available for most major Linux operating systems—like Red Hat, CentOS, SLES, Ubuntu, or Debian Linux. Even though the package-based installation is certainly a little more work than the automated Cloudera Manager installation, you’ll see it’s fairly simple. In addition, walking through these steps will bring a better understanding of the different components of Kudu. Last, because package installa‐ tion doesn’t require any other application to run (Cloudera Manager, Virtual Box, etc.) it requires fewer resources. Cloudera Quickstart VM If what you want to do is try Kudu within the ecosystem, but can’t afford to deploy it on a real cluster, a very simple and good alternative is to run it in the Cloudera Quickstart VM (Figure 3-3). Indeed, this VM comes with the entire CDH distribu‐ tion installed. In addition to Kudu, it also includes Hadoop Distributed File System (HDFS), Impala, Hive, Spark, and so on. You can choose which application you want to run and which one you want to stop. This is a great way to easily play with the integration of the different components. However, because it is a VM and it will run all the services on a single environment, it requires a lot of memory and CPU cycles. The performance will not reflect the performance of a real environment; it takes more time to start and requires more space compared to the Kudu VM. But it will allow you to contain all your tests inside a closed container.

40 | Chapter 3: Getting Up and Running Figure 3-3. Kudu in Cloudera Quickstart VM

This VM comes from different environments. It can run inside VirtualBox, VMWare, KVM, or as a Docker image. You can download the most recent version directly on the Cloudera website. After the VM is downloaded, the only thing you need to do is to load it into the environment and start it. Quick Install: Three Minutes or Less If your hands are on the keyboard ready to roll, this chapter is for you. The goal? Get Kudu installed and running in three minutes flat. The catch? You have a RHEL/ CentOS 7 system already running and you’re logged in as a user with sudo access. Here’s how to do that:

Quick Install: Three Minutes or Less | 41 # Set up the repo $ sudo cat <> kudu.repo [cloudera-kudu] # Packages for Cloudera's distribution for kudu, Version 5, on RedHat or # CentOS 7 x86_64 name=Cloudera's Distribution for kudu, Version 5 baseurl=http://archive.cloudera.com/kudu/redhat/7/x86_64/kudu/5/ gpgkey = http://archive.cloudera.com/kudu/redhat/7/x86_64/kudu/ \ RPM-GPG-KEY-cloudera gpgcheck = 1 EOD

# Copy the repo file to yum.repos.d sudo cp kudu.repo /etc/yum.repos.d/

# Install the packages sudo yum -y install kudu # Base Kudu files (all nodes) sudo yum -y install kudu-master # Kudu master server (master nodes only) sudo yum -y install kudu-tserver # Kudu tablet server (tablet nodes only) sudo yum -y install kudu-client0 # Kudu C++ client shared library sudo yum -y install kudu-client-devel # Kudu C++ client SDK

# Start up the services sudo systemctl start kudu-master sudo systemctl start kudu-tserver If you are running on a Debian-like distribution, keep in mind that Kudu packages are currently available only for Jessie (version 8) and are not yet available for the cur‐ rent Debian stable version (Stretch). You will also need to install the archive key by using the following steps: sudo wget http://archive.cloudera.com/kudu/debian/jessie/amd64/kudu/cloudera.list \ -O /etc/apt/sources.list.d/cloudera.list wget https://archive.cloudera.com/kudu/debian/jessie/amd64/kudu/archive.key \ -O archive.key sudo apt-key add archive.key sudo apt-get update

# Install the packages sudo apt-get install kudu # Base Kudu files (all nodes) sudo apt-get install kudu-master # Kudu master server (master nodes only) sudo apt-get install kudu-tserver # Kudu tablet server (tablet nodes only) sudo apt-get install libkuduclient0 # Kudu C++ client shared library sudo apt-get install libkuduclient-dev # Kudu C++ client SDK

# Start up the services sudo systemctl start kudu-master sudo systemctl start kudu-tserver For now, we do all the preceding steps on a single host. And there you have it, Kudu is up and running (Figure 3-4).

42 | Chapter 3: Getting Up and Running Figure 3-4. Kudu master web user interface

Now point your browser as shown here: Master server: http://:8051 Tablet server: http://:8050 For Kudu itself, that’s all there is to it. With these instructions, you can use Kudu as a standalone storage manager for external applications. For example, application servers using Kudu’s Java APIs can use Kudu as the persistence layer for time-series applications. However, because Kudu is part of—but doesn’t depend on—the Hadoop ecosystem, most practical use cases require more than just Kudu: for ingesting data, you might want , StreamSets, or Spark Streaming; for doing machine learning and data processing, you might want Apache Spark; and for interactive SQL, you’ll surely want . Practically speaking, Kudu’s tight Hadoop ecosystem integra‐ tions is one of its strengths; thus, you seldom use it on its own, and you’ll probably want to install more than just Kudu. This leads us to a common question, “What if I want only Kudu and not the rest of Hadoop?” In practice, you’ll probably want to have Hadoop, too. Even though Kudu itself has zero dependencies on any other Hadoop component, Kudu is almost always used with Impala, and that’s where things become sticky. Impala relies on Hive, and Hive does have a dependency on HDFS, meaning that to use Kudu with Impala, you also need Hive and HDFS. There is good news, however, because Kudu and HDFS can easily coexist in harmony and even share disks. You’ll want to get the configura‐ tions correct, which we discuss in Chapter 4, but for now just know that both storage

Quick Install: Three Minutes or Less | 43 layers can live together and it’s up to you as to whether you use just Kudu or HDFS, as well. Conclusion In this chapter, we learned the various ways to install Kudu and walked through an RPM-based installation. As discussed, Kudu’s power is enhanced when combined with other Hadoop ecosystem technologies like Spark and Impala. Therefore, to use Kudu’s full might, we’d recommend using a Hadoop distribution like Cloudera and a management tool to ease the installation and management, like Cloudera Manager. In future chapters, we delve deeper into how to properly plan and configure your Kudu installation, and then introduce you to the basics of Kudu data management. We use Kudu in conjunction with Apache Impala, a fast, distributed SQL query engine for the Hadoop ecosystem. Then we take a look at how to use Kudu’s Java APIs.

44 | Chapter 3: Getting Up and Running CHAPTER 4 Kudu Administration

The administrator’s dream is to work with a system that is resilient, fault tolerant, simple to scale, monitor, and adapt as requirements change. Even better is a system that is self-healing, shows warning signs early, and lets the administrator go to sleep knowing that the system will run in a predictable way. Kudu is designed and built from the ground up to be fault tolerant, scalable, and resil‐ ient and provides administrators with the means to see what’s happening in the sys‐ tem, both through available APIs and visually through a web user interface (UI) and a handy command-line interface (CLI). In this chapter, we get you started as an administrator so that you can hit the ground running. Installation options are covered in previous chapters, so we begin here by looking at planning for your Kudu deployment and then walk through some of the most common and useful administrator tasks. Planning for Kudu Let’s begin by taking time to understand how to appropriately plan for a Kudu deployment.

Kudu is fully open source and made available with documentation, downloads, overviews, and more provided at http:// kudu.apache.org/. However, Kudu might also be included in various Hadoop distributions. If going with a distribution, always refer to the documentation provided by the vendor as deployment strate‐ gies may vary.

Kudu can be installed as its very own cluster with no dependency on any other com‐ ponents; however, we expect that many installations will be on existing Hadoop clus‐

45 ters. Knowing how to plan, size, and place Kudu services in relation to all your other Hadoop components is also be covered. Master and Tablet Servers Before we jump into hardware planning and where we will place the Kudu services, as an administrator we need to understand the primary two components we need to think about in Kudu: the master and tablet servers. These servers manage tables, or rather, tablets, that make up the contents of a table. Tablets are then replicated, which is why they are also known as replicas. Typically, you start a cluster with three master servers. Consensus needs to be reached between these master servers so that they agree among themselves which server is the leader. This decision-making process must have a majority of servers “vote” to make the final decision (e.g., 2 of the 3, or 3 of the 5, and so on), and the decision they make is final. This procedure is both fault tolerant and performant, using what’s called the Raft consensus algorithm, which is a variation of the well-known Paxos consensus algorithm. Three or five master servers are good examples of the number of master servers to plan for, whereas seven master servers would be considered excessive. The number must be odd, where you can lose up to (n – 1) / 2 servers and still be able to make progress. In the event that more than (n – 1) / 2 servers fail, the master service will no longer be making progress and thus is no longer available to the cluster. Tables break down their data into partitions stored in this concept of “tablets,” which are then replicated where the default replication factor is 3. Hence, each tablet is often referred to as a replica that contains a portion of data that makes up a table. Master servers maintain metadata or catalog information about all the tables users create. A single system catalog table is created to handle metadata for various objects created in Kudu. This table is designed to have exactly one partition, and therefore, one tablet. This metadata is currently stored in a table that resides on the master servers, which has exactly one tablet stored on the master servers. Let’s say you decide to create table T, which we depict with the logical box illustrated in Figure 4-1. Data for this table is broken down into chunks called tablets, which we show as having three tablets: Tablet 1, Tablet 2, and Tablet 3. Table T also stores your inserted data.

46 | Chapter 4: Kudu Administration Figure 4-1. Table T logical representation

This representation of the table, broken down into tablets, is spread further across nodes by having each of these tablets replicated across several tablet servers. In this example of three tablet servers (Figure 4-2), we mark each tablet as 1, 2, 3, but because there are multiple replicas, one of these replicas is the leader (denoted by L) whereas the other two are followers (denoted by F).

Figure 4-2. Table T replicated

Raft consensus is used to elect a leader among the tablet replicas on the tablet servers. Master tablets, because they contain only metadata, really should not be as hot as your user tablets given that we could expect that you’re not performing Data Defini‐ tion Language (DDL) operations as fast as you are loading your data into tablets. Likewise, these tables should never grow as large as your user tables, so the require‐ ments for the master servers for storage, memory, and CPU will be a lot more modest than your tablet servers themselves. The system catalog table is defined as a three column table consisting of entry_type, entry_id, and metadata. The catalog manager loads this table into memory and

Planning for Kudu | 47 constructs three hashmaps to allow fast lookups of the catalog information. The cata‐ log information is compact, and since it contains only metadata and no real user data, it will remain small and not require large amounts of resources such as memory and CPU.

Master servers contain only metadata about your user tables. Therefore, storage, memory, and compute resource requirements are considerably smaller than tablet servers. Sizing details are dis‐ cussed later in the chapter.

To provide a visual aid, Figure 4-3 shows some of the high-level concepts that we have just covered. In this example, three master servers exist and each manages a sin‐ gle tablet for the system catalog table that master servers maintain (metadata). Of these tablets, one will be elected the leader, whereas the others are followers providing high availability for this data. Finally, we have a write-ahead log existing on each of the master servers. On the lower half of the diagram, we have tablet servers that manage the tables you create in Kudu. We have N tablet servers, and we depict tables with the dotted line. Table data is stored in tablets that are spread across the various tablet servers. Once again we see one tablet is selected to be the leader, denoted by L, while the followers, denoted by F, for a given tablet replica are found on other tablet servers. We also show that for each Kudu table, a separate write-ahead log (shown in Figure 4-3 as “WAL”) is created and exists on each tablet server. In fact, master servers and tablet servers are identical at this high-level view, showing the idea that the same concepts exist across both of these servers. How many tablet servers should you plan for? Of course, the answer is going to be it depends. Kudu stores data in columnar, Parquet-like format, using various encodings and compression strategies. Because it is Parquet-like, from a storage capacity per‐ spective, you can roughly examine how much capacity your data occupies in Parquet format files in your HDFS filesystem (or otherwise) and use that as a rough measur‐ ing stick for how your data would look stored in Kudu. Traditional Hadoop-related technologies were not built from the ground up to effec‐ tively use solid-state drives (SSDs) because SSDs were too expensive. Kudu being a modern platform, takes advantage of SSD performance, and can make use of NVM Express (NVMe), extremely fast PCIe-attached flash adapters. SSD drives are still not quite as dense as hard-disk drives (HDDs) today and this means that typically these hardware configurations would have less overall storage density per node than HDD configurations.

48 | Chapter 4: Kudu Administration Figure 4-3. High-level architecture

Planning for Kudu | 49 That being said, as it stands today, a reasonable default configuration would plan for a range of less than 10 TB of data on disk for a given tablet server (includes storage of all tablet replicas) and a maximum range of low thousands of replicas per tablet server. These numbers are on the cautious end of the spectrum, and as Kudu matures, more data stored on a tablet server will be perfectly acceptable. From a tablet server count perspective, the following formula might help your initial sizing: d = 120 TB : Size of dataset in parquet format k = 8 TB : Target max disk capacity per tablet server p = 25% : Percentage overhead to leave r = 3 : Tablet replication factor Now from a capacity view, we can do the following simple calculation, to determine the number of tablet servers we need for our dataset: t = (d / (k * (1 - p))) * r t = (120 / (8 * (1 - 0.25))) * 3 t = 20 tablet servers There are more considerations to take into account, such as planning for how many replicas would exist on a tablet server to ensure that we are within the low thousands range. Those calculations depend on partitioning schema strategy as well as the num‐ ber of tables expected to be defined in your environment. We leave it to you to review Chapter 6, which specifically goes into schema design, and then it’s a matter of simple math to ensure that you meet these best-practice ranges. Write-Ahead Log For every modification made to a table, and thereby a tablet and replicas, an entry is made in the tablet’s write-ahead log (WAL). There is an entire suite of parameters controlling different aspects of log segments that are written to the WAL. Segments have size and properties related to preallocated storage, perhaps asynchronously. They have a compression codec as well as minimum and maximum number of seg‐ ments that should be kept around so that other peers might be able to catch up if they’ve fallen behind. The WAL is an append-only log that at first glance would look like the disk just needs to be able to perform high sequential writes. Even though these are append-only logs, there are logs for each tablet on the master or tablet server. If multiple tables are being written to, from the disk’s point of view, it is very much a random write pattern. If we look at performance characteristics of HDDs versus SSDs from a random write workload perspective where we’re interested in a high amount of low-latency I/O operations (as opposed to high throughout sequential write operations), SSDs have an enormous advantage. Where HDDs will have I/O operations per second (IOPS) on the order of the low hundreds, SSDs will have IOPS in the thousands to tens of thousands of operations. In 2016, Samsung showcased a million IOPS SSD, which is

50 | Chapter 4: Kudu Administration certainly orders of magnitude faster than HDDs! Just looking at write IOPS alone, it is in the low hundreds of thousands of IOPS; needless to say, these are drastic differ‐ ences. Read/write latency is below 100 microseconds, whereas HDD is in the 10 milli‐ second range. You get the idea. Given the type of workload, it is best to plan on a fast SSD NVMe solution for the WAL. Capacity-wise, default log segment sizes are 8 MB with a minimum of 1 log segment kept and a maximum of 80. This is not a hard maximum, because there are some cases in which a leader tablet might continue to accept writes while the replicas on other peers are restarting or are down. Kudu might improve the disk usage of the WAL with various techniques in future releases, though we can still plan around these worst-case scenarios. If all tablets are writing at the same time, and staying within the current recommended range of the low thousands, say, 2,000 tablets per tablet server, the math is simple: 8 MB/segment * 80 max segments * 2000 tablets = 1,280,000 MB = ~1.3 TB Scalability and density improvements are continually being made, and the preceding calculation is truly in the absolute worst-case scenario. So provisioning your very fast storage for around this mark of 1.3 TB is more than reasonable. Typical 2U servers used as worker nodes in a big data environment will have 2× SSD in the back of the unit for the operating system (OS), and either 12 x 3.5″ LFF or 24 x 2.5″ SFF drive bays in the front. There are a few options for how to prepare storage for the WAL:

• Install dedicated SSDs in a drive bay on the front • Make one of the two SSDs in the back dedicated to the WAL (losing out on Redundant Array of Independent Disks [RAID] protections for the OS) • Create a physical or logical partition on the pair of rear SSDs, and dedicate a mount point to the Kudu WAL • Install an NVMe PCIe SSD interface–based flash drive

For large production deployments, we do not recommend setting the WAL to a dedi‐ cated HDD, or worse a shared HDD with other services, because it will affect both write performance and recovery times in failure scenarios. Performance of the various storage options vary incredibly. Table 4-1 summarizes them for your consideration.

Planning for Kudu | 51 Table 4-1. HDD versus SSD versus NVMe PCIe flash storage Storage medium IOPS Throughput (MB/sec) HDD 55-180 50-180 SSD 3,000-40,000 300-2000 (SAS max is 2,812 MB/sec) NVMe PCIe flash storage 150,000 up to 1 million+ Up to 6,400 (6.4 GB/sec)

Some of the ranges are extremely large as there are a number of factors at play such as block sizes, read workloads versus write workloads, and more. However, the message is clear that PCI attached storage can yield an overwhelming increase in both IOPS and throughput, which greatly benefits Kudu workloads. Data Servers and Storage When it comes to your data—what we call user data—this is where it becomes simple. Provide as many available HDDs (or better yet SSDs!) on your server as possible for storage. You can scale out your servers, at will, though note that data is not, as of yet, automatically distributed across new servers added to the cluster. It is highly likely that you will want to enable Kudu in an HDFS-backed big data clus‐ ter. You can configure Kudu storage directories on the very same HDFS-specific mount points. For example, if you have a disk that is partitioned and formatted with a filesystem, mounted on /disk1 you might likely have a directory called /disk1/dfs representing the DataNode of your distributed filesystem, HDFS, installation. Go ahead and create a directory called /disk1/tserver for your Kudu content. Some consider whether to dedicate a few disks on a server just for Kudu and then others for HDFS. Although this is possible and can even in fact yield a performance improvement, it is not advisable to go through the trouble of overhead in manage‐ ment, dealing with expansion and checking which disks are really being used effec‐ tively. HDFS is certainly aware of how much capacity is still available per disk, and in 2016, it even introduced the ability to rebalance data blocks within a DataNode (HDFS-1312: Re-balance disks within a DataNode) according to either round-robin or available space policies to ensure usage is even.

At the time of this writing, it is recommended to not store more than 8 TB of Kudu data per tablet server. This is after considering replication as well as compression in well-packed columnar storage formats. Density per node is sure to improve as the product contin‐ ues to gain adoption and maturity, though it is currently an impor‐ tant point to consider, especially with regard to worrying about sharing disks with other services such as HDFS.

52 | Chapter 4: Kudu Administration Aside from replica data (i.e., user data), the WAL tablet servers have metadata that they store, as well. The directories where this metadata is written are configurable, and it is important that you place them on the highest performing drive with high throughput and low latency, such as an SSD. In previous releases to v1.7, this meta‐ data is written to the first entry listing the set of data directories on a tablet server. From v1.7 and onward, the default directory is now the directory specified by the WAL directory. However, there is a new parameter, --fs_metadata_dir, that allows you to take control and specify where the metadata should be placed. Placing the metadata on the WAL or nondata disk is helpful because this data can grow over time, and your data drives can become skewed in how they are populated. Replication Strategies As you define your replication factor for a given table, tablet servers will strive to ensure that that replication factor is preserved for all the tablets within that table. Hence, if a tablet server goes down, the number of replicas might have dropped from three replicas to two, Kudu will look to heal these tablets quickly. The replication strategy primarily is use is named 3-4-3, which is intended to say that if a tablet server goes down, before evicting that failed replica, Kudu will first add a replacement replica and then decide it is time to evict the failed one. Another strategy, named 3-2-3, would evict that failed replica immediately and then proceed to add a new one. However, for systems that might go offline only periodi‐ cally and come back up, it causes a much longer delay in becoming part of the cluster again. In environments that might have more frequent server failures, this becomes impor‐ tant for overall stability. Otherwise you would have a situation in which the failed tab‐ let server has all its tablets evicted immediately, leading to a lot of work to bring it back into play. This actually allows for very fast recovery when a tablet server goes down only briefly and then returns. There are situations in which Kudu will perform the 3-2-3 mechanism, but only if the failure that one of the tablet servers experiences is known to be unrecoverable. Other‐ wise, a new tablet replica will be created on a new host, and if the failed tablet server happens to come back to life, no harm done, and the new copy being created is sim‐ ply canceled.

Planning for Kudu | 53 Deployment Considerations: New or Existing Clusters? There are three different general scenarios in which you might configure Kudu:

• A brand-new Kudu-only cluster • A brand-new Hadoop cluster that inclues Kudu • An existing Hadoop cluster to which we add Kudu

We cover considerations for each of these scenarios. New Kudu-Only Cluster When embarking on creating a brand-new cluster targeted for Kudu workloads, in an ideal world, servers would be ordered with all data drives being SSDs, plus an NVMe PCIe SSD interface–based flash drive for the WAL. Actually, this is not far-fetched for a Kudu-only cluster. Considering that the overall recommended storage density for Kudu today should be on the order of approxi‐ mately 10 TB, it is easy to accommodate a purchase of a typical 1U server, with two SSDs in the back for the OS and eight SSDs for data coming in around the 1.8 TB mark, which would yield 14.4 TB of raw capacity per server. Add to that a 1 TB NVMe PCIe attached flash drive, and you have a very capable cluster meeting the design goals of Kudu (Figure 4-4).

Figure 4-4. Kudu-only cluster server example New Hadoop Cluster with Kudu Although Kudu can very much be a standalone cluster, it is more common and expected for it to be part of a larger Hadoop ecosystem that would include HDFS, along with the various other processing services on top, such as Hive, Impala, HBase, YARN, Solr, and Spark.

In this case, Kudu integrates seamlessly into the rest of the ecosystem, and for write-heavy workloads in particular, we want to ensure that the WAL is placed on the fastest medium available. We propose two different approaches for planning in this kind of environment: one in which the WAL would be on NVMe PCIe–attached flash storage, and a second using one of the data drives dedicated to the WAL.

Starting with the three master servers in a typical Kudu deployment, this fits in nicely with the master servers used in a common Hadoop environment. Typically in Hadoop, we will have nodes with "master" type services, such as the following:

• Active/Standby NameNode for HDFS High Availability, where the JournalNodes are spread across three master servers leveraging dedicated spindles
• ZooKeeper instances spread across three master servers, provided with dedicated spindles
• YARN Active/Standby ResourceManager plus the YARN Job History Server
• One or more HiveServer2 instances (though these can be on edge nodes, as well)
• One or more HBase master servers
• Spark Job History Server
• Impala StateStore and CatalogStore
• Sentry Active/Standby nodes
• Multiple Oozie services

Depending on which distribution of Hadoop you are running and which services you have deployed, there might be other services spread across these three master servers.

Hence, it is a natural fit for Kudu to require a fast drive for the WAL even here, similar to the JournalNodes and ZooKeeper with their dedicated drives. We want Kudu to have its own fast drive, as well.

Again, we reuse the idea of a typical 1U server with eight drives out front, and suggest that all of these be SSD drives. We rely on an NVMe PCIe flash storage adapter for the Kudu master WAL, and accommodate dedicated drives (a pair of drives in RAID1) for services like the NameNode, JournalNode, and ZooKeeper (Figure 4-5).

Figure 4-5. Master server for Hadoop with Kudu

The second option we offer here is to consider using a typical 2U server that has two SSDs in the back for the OS and twelve 3.5″ drives in the front. With each pair of drives in RAID1 configuration, we end up with six block devices made available to the OS. A filesystem is then formatted and mounted on each of these block devices, where we lay out each filesystem to be dedicated to the following:

• NameNode fsimage files
• Hive Metastore database (typically MySQL or Postgres)
• HDFS JournalNode
• ZooKeeper logs
• Kudu master data
• Kudu master WAL

Only the last four drives (the two RAID1 pairs dedicated to Kudu) need to be SSDs, though the rest of the services would benefit from SSDs as well. In this example, we do not configure an NVMe PCIe–attached flash storage device, in an attempt to fully utilize the drive bays available in 2U units while still providing more than enough capability for Kudu (Figure 4-6).

Figure 4-6. Master server for Hadoop with Kudu, option 2
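To make the layout concrete, each master server would then point its startup flags at the dedicated mount points. The mount point names below are hypothetical placeholders, not paths taken from this example's servers; only the flag names (--fs_wal_dir and --fs_data_dirs, as used elsewhere in this chapter) are real:

# /etc/kudu/conf/master.gflagfile (illustrative, assuming hypothetical mount points)
--fs_wal_dir=/kudu-master-wal/master      # RAID1 pair dedicated to the Kudu master WAL
--fs_data_dirs=/kudu-master-data/master   # RAID1 pair dedicated to Kudu master data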

The next consideration is planning for the Kudu tablet servers amidst the Hadoop ecosystem, namely Kudu data on each of the HDFS DataNodes. These are the scale-out servers used for a variety of other services in the Hadoop ecosystem, and we refer to them as worker nodes as a more generic term.

Worker nodes are most often represented in the Hadoop space as 2U, industry-standard servers with either twelve 3.5″ drive bays in the front or twenty-four 2.5″ drive bays in the front for data. There are typically two 2.5″ drive bays in the rear of these units meant for the OS.

The ideal strategy is to have NVMe PCIe–attached flash storage in the 2U servers dedicated to the WAL for the Kudu tablet servers, which would maximize write performance to Kudu. Next, the data drives in these units would ideally be all SSDs to benefit Kudu. However, it is understandable and still expected that even for newly provisioned environments, HDDs will remain commonplace. They still provide much more density per server, which may be an important requirement, especially for data sitting in HDFS.

As soon as users learn that Kudu sits completely outside of HDFS and the rest of the Hadoop ecosystem, the common thinking is to start isolating drives for Kudu versus HDFS. Although this is possible and certainly does isolate workloads from a storage perspective, it can end up being too restrictive in the long run. There might be workloads today that require more HDFS capacity; however, in the future, perhaps more of those workloads will move to Kudu. Making changes in storage configurations after the fact could prove to be very costly and cumbersome. We suggest that you use the same disks assigned for HDFS data for Kudu data as well, differing only in the actual directory path given to each service.

For example, let's assume that /disk1 is an ext4 filesystem that sits on top of a single JBOD disk. For HDFS, normally, you would define a path under this directory such as:

/disk1/dfs : Directory for HDFS data

For Kudu, we simply would add a new directory such as the following:

/disk1/tserver : Directory for Kudu data

Hence, both Kudu and HDFS sit in their own directories on the same filesystem and device volume. HDFS and Kudu alike would simply know about the total capacity left on a given drive, and an HDFS rebalance might take place, for example, if Kudu is hotspotting much of its data on this one node.

The one situation for which we would recommend different disks is in the case of encryption at rest. You should configure HDFS with HDFS Transparent Encryption, while Kudu, at the moment, relies on full-disk encryption techniques at the device level in order to have data encrypted at rest.

If you select servers with all SSD drives, both HDFS and Kudu would benefit. If all the drives are HDDs, HDFS and Kudu would use them, though Kudu would likely be a little more affected in its ability to read and write. Still, this is expected and commonplace, and many workloads already run with this behavior, so it should not be considered a deterrent.

The 3.5″ HDD drives today can easily be in the range of 6 to 8 TB in capacity, and reserving one or a pair of these for the Kudu WAL would result in a lot of wasted capacity given to the WAL instead of to data. Thus, we certainly suggest NVMe PCIe–attached flash storage. This kind of approach would yield the hardware topology shown in Figure 4-7.

Figure 4-7. Tablet server for Hadoop with Kudu, option 1

Another consideration is a topology that does not use NVMe PCIe–attached storage. In this case, to ensure we don't waste as much space for the WAL, we take a 2U server with twenty-four 2.5″ drive bays. We dedicate just a single drive bay to the Kudu WAL, and ideally, we make this one drive an SSD (if the rest of them aren't already). This is depicted in Figure 4-8.

Figure 4-8. Tablet server for Hadoop with Kudu, option 2

With this strategy, if you ever remove Kudu from this cluster, the disk drive dedicated to the WAL can quickly be repurposed for HDFS, for example. So it gives some flexibility as well.

Add Kudu to Existing Hadoop Cluster
The previous section discussed the optimal configuration of Kudu in a brand-new Hadoop environment. Our goal, obviously, is to get as close as possible to those topologies. The quickest way to get to an optimal Kudu configuration is to order and install a new NVMe PCIe flash storage device. By doing so, you get closer to the configuration depicted earlier, shown in Figure 4-9.

Figure 4-9. Adding NVMe PCIe flash storage for WAL

The new NVMe PCIe flash storage device is used for the WAL exclusively. Meanwhile, the Kudu data directories on the Kudu tablet servers would share the same filesystem mount points assigned to HDFS. Hence, consider the following filesystem mount points:

/data1
/data2
...
/data12

The path to HDFS would likely be something like:

/data1/dfs
...
/data12/dfs

We would essentially just add Kudu directories as follows:

/data1/tserver
...
/data12/tserver

This allows for clean separation between Kudu and HDFS in terms of the directories managed by each service, while at the same time maximizing the utilization of the disk itself. In Figure 4-10, we show how disk space might be occupied by the various services. The OS, and even the services themselves, simply see how much space is left on the disk. Balancing and disk selection will automatically try to keep the disks approximately evenly utilized, so that at the end of the day the disks are naturally used up evenly.

Figure 4-10. Recommended—use the same drives for both HDFS and Kudu
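A minimal tablet server configuration sketch for this shared-disk layout might look as follows. This is illustrative only—the /data1 through /data12 mount points and the /kudu-wal NVMe mount point are assumptions carried over from the example above, and the dfs paths belong to HDFS, not Kudu:

# Tablet server flags for the shared-disk layout (illustrative)
--fs_wal_dir=/kudu-wal/tserver    # NVMe PCIe flash device dedicated to the WAL
--fs_data_dirs=/data1/tserver,/data2/tserver,/data3/tserver,/data4/tserver,/data5/tserver,/data6/tserver,/data7/tserver,/data8/tserver,/data9/tserver,/data10/tserver,/data11/tserver,/data12/tserver
# HDFS continues to use /data1/dfs ... /data12/dfs on the same mount points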

In general, we do not want to start disabling HDFS from using certain drives just so Kudu can have its own dedicated devices. If we did so, then the filesystem layout would look like this example:

/data1/dfs
...
/data6/dfs
/data7/tserver
...
/data12/tserver

This makes it very difficult to adjust requirements in the future, at which point we might end up having some disks that are extremely full for a given service, whereas the others might not be. Achieving a good balance is very difficult in this scenario. Figure 4-11 illustrates how that might end up looking.

Figure 4-11. Discouraged—segregate HDFS and Kudu drives

If adding an NVMe PCIe–attached flash storage device for the WAL is not possible, it is at this point that we can consider putting the WAL on one of the existing data drives. However, in this case, we want to make sure that we use that drive exclusively for the WAL in Kudu; hence, it is recommended to first remove that device from being used by HDFS. The transition would look something like Figure 4-12.

Remember that disks in Hadoop are used not only by HDFS, but are also used as scratch directories for services such as YARN and Impala. If this is the case in your environment, you want to remove the usage of this disk drive from all existing services, so that the disk is fresh and dedicated to the Kudu WAL on the server.

Figure 4-12. Transition worker node to also be tablet server

Of course, this retrofit is a little more painful when it is a DataNode and only 12 disks exist on the server. You lose capacity on every worker node, and typically, the WAL should not require the full capacity of the entire disk. In environments with 24 disks per worker node, it is likely much easier to give up a disk because they are typically smaller-capacity drives.

Often, people attempt to add Kudu to a subset of worker nodes in an existing environment. Although technically doable, remember that the caveats include a possibly lopsided configuration between the worker nodes (some will have 23 drives for HDFS/Kudu data plus 1 for the Kudu WAL, others will have 24 drives for HDFS), as well as encouraging more remote reads from Impala and Spark when processing data from Kudu. Hence, this is generally discouraged.

Remember that a lot of the recommendations listed here are meant for peak-performance production clusters. In development environments, certainly get your feet wet and try out the functionality of Kudu by doubling up disk responsibilities, such as putting the WAL and data directories on the same disks.

Web UI of Tablet and Master Servers
Being able to quickly see how your Kudu deployment is configured, what tables exist, and where tablets reside is crucial to being an effective administrator. Kudu ships with a web UI, which is the first place to assess these points.

Master Server UI and Tablet Server UI
The master server and tablet server web UIs have a common look and feel. Some of the commonalities include the capability to view the following:

Logs
  The last snippet of the log is shown in the UI, along with a clear path to where you can find the logs on the server.
Memory
  Segregated into detailed memory breakdowns as well as summary totals to give a good understanding of where memory is being used.
Metrics
  An API endpoint intended to provide many metrics in JSON format, easily consumable by JSON parsers, to get at the metrics you like.
Remote procedure calls (RPCs)
  Listing of actively running and sampled remote procedure calls in JSON format.
Threads
  A view into threads and the amount of CPU they consume, along with cumulative I/O-wait measurements. Threads are categorized into thread groups, which makes it easy to drill down.
Flags
  All of the flags defined at server startup time. This is a good way to validate whether your changes are taking effect, as well as to see the default settings at a glance.

Of course, master and tablet servers have different roles; hence some UI elements are specific to the type of server you're analyzing.
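For example, the metrics endpoint can be scraped directly over HTTP. The sketch below assumes the default web UI ports (8051 for the master, 8050 for a tablet server), an unsecured UI, and placeholder host names; adjust the host, port, and TLS settings for your environment:

# Pull JSON metrics from a master and a tablet server (assumed default ports)
curl -s http://master-server1:8051/metrics | head
curl -s http://tablet-server1:8050/metrics | head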

Master Server UI
The master server UI has the following unique properties:

Masters
  See the currently elected leader as well as the list of masters in your environment.
Tables
  A list of all the tables.
Tablet servers
  A list of all the registered tablet servers and their universally unique identifiers (UUIDs), along with RPC and HTTP address information and more.

Tablet Server UI
The tablet server UI has the following unique properties:

Dashboards
  Provides a view into currently running scans, transactions, and maintenance operations being performed on the tablet server.
Tablets
  A list of tablets (replicas) being managed by this server.

The Kudu Command-Line Interface
Visuals are important for administrators to get a quick and easy overview of their environment. However, administrators often find themselves interested in a common set of detailed information and metrics as they zero in on what is most important to them. Thus, it makes a lot of sense for administrators to become familiar with the command-line interface (CLI) provided by Kudu. We summarize here some of the key operations that you can perform using the CLI in order to get to know your cluster well, understand the state it is in, and make the right decisions when it comes to maintenance and other operations.

The CLI is accessed starting with the kudu executable, followed by a command that determines which group of operations you'd like to perform. Commands are broken down into operations on the cluster, the filesystem, local and remote replicas, metadata files, master and tablet servers, and tables and WALs, while along the way checking on the health and integrity of the data itself. Let's go through each group to see a few examples of what information you can obtain and monitor. We leave it to you to dig into the details of the commands in the Kudu documentation, given that they are bound to change and evolve over time.

Cluster
At the cluster level, the only real command at your disposal is to perform a health check:

kudu cluster ksck <comma-separated-master-server-addresses>

The comma-separated master server addresses might look as follows:

master-server1,master-server2

Note that we don’t specify the port number here. If you’re using the default port that Kudu ships with, you typically should not need to specify the port number for these types of commands. If you change the port the master servers operate on, you will need to specify the port numbers in these commands.
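As an illustration (the host names are placeholders), a check that spells out the port explicitly would look like the following; 7051 is the usual master RPC port, as seen elsewhere in this chapter:

kudu cluster ksck master-server1:7051,master-server2:7051,master-server3:7051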

The command returns the health state of all the tables and gathers information from all the tablet servers the masters know about. In particular, it provides information such as the following:

• Cluster health
• Data integrity
• Under-replicated tablets
• Unreachable tablet servers
• Tablets with no leader

You might want to perform checksums on your table data to ensure that the data written to disk is as expected. You can do so by using the following simple command:

kudu cluster ksck ip-172-31-48-12,ip-172-31-59-149 -checksum_scan -tables=python-example

Performing checksums on your tables is good practice, especially after performing maintenance work or if there was an unexpected system outage in your data center. It helps to validate that your tables and the data within them did not become corrupted. More options are available to limit scans to specific tablet servers, improve concurrency of the checksums, and more. Snapshot checksums, which are on by default, take a snapshot of the data as it exists when you run the command. This is particularly helpful for highly volatile tables and allows these tasks to be performed in parallel.

Filesystem
This tool provides Kudu filesystem checks, formatting, and more. Kudu's notion of the "filesystem" is really a virtual concept that sits on top of your operating system's filesystem (such as ext4, xfs, etc.) defined on the disks themselves. Exploring, formatting, and checking the filesystem can be useful for gaining a better understanding of Kudu itself.

The .gflagfile file contains the Kudu service startup options, which can sometimes be helpful for digging into how the Kudu processes were actually started. This is especially true if you set certain parameters but they do not seem to have any effect; in this file you can see whether your parameter was actually set as a startup flag.

check
Checking the filesystem for errors needs to be run as root. You must supply the WAL directory and can then provide a comma-separated list of data directories. An easy way to check what those directories are is to look at the .gflagfile in use for your server. You can find the .gflagfile by running the ps command:

ps -ef | grep kudu

# See output such as the following
kudu 10148 1 0 Aug15 ? 00:04:32 /usr/lib/kudu/sbin/kudu-tserver
--server_dump_info_path=/var/run/kudu/kudu-tserver-kudu.json
--flagfile=/etc/kudu/conf/tserver.gflagfile

Notice the flagfile option:

--flagfile=/etc/kudu/conf/tserver.gflagfile

Now grep for the fs parameters in that file:

grep "fs_" /etc/kudu/conf/tserver.gflagfile

# See output such as the following
--fs_wal_dir=/var/lib/kudu/tserver
--fs_data_dirs=/var/lib/kudu/tserver

What we see here is that the WAL and data directories are the same. Kudu comes with filesystem "check" tools, which provide a full block manager report that shows whether the Kudu filesystem is healthy. Things like missing blocks or orphaned blocks are warning signs that something might not be quite right with this filesystem. When we run the fs check on the WAL directory, it (in this case) automatically finds the data directory as well and performs the check for you:

sudo kudu fs check -fs_wal_dir=/var/lib/kudu/tserver/wals

The output will look something like this:

uuid: "e60fc0618b824f6a994748c053f9f4c2"
format_stamp: "Formatted at 2017-08-16 04:36:00 on ip-172-31-59-149.ec2.internal"
Block manager report
--------------------
1 data directories: /var/lib/kudu/tserver/data
Total live blocks: 0
Total live bytes: 0
Total live bytes (after alignment): 0
Total number of LBM containers: 0 (0 full)
Total missing blocks: 0
Total orphaned blocks: 0 (0 repaired)
Total orphaned block bytes: 0 (0 repaired)
Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)
Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)

Now let's run the check on both the WAL and the data directory:

sudo kudu fs check -fs_wal_dir=/var/lib/kudu/tserver -fs_data_dirs=/var/lib/kudu/tserver

In this simple example, we get the same output as earlier.

format
This prepares (formats) a brand-new Kudu filesystem. Remember that this formatting is for the Kudu filesystem, not the operating system's filesystem. Kudu's filesystem is a set of user files and directories sitting on top of the OS filesystem, which means that you already need to have a directory or mount point prepared, with a supporting OS filesystem such as ext3. In our case, the / mount point is already on a basic xfs-formatted filesystem. Let's see how our / filesystem is currently mounted:

$ mount | grep -E '^/dev'
/dev/xvda2 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

Here we see that the /dev/xvda2 device mounted on / has been formatted with the xfs filesystem. We can also check the /etc/fstab file for the set of defined mount points. For our purposes, we next prepare directories for our new Kudu WAL and data directories, which we will then format using the Kudu format option:

# Create the WAL directory and assign ownership to the kudu user
$ sudo mkdir /kudu-wal
$ sudo chown kudu:kudu /kudu-wal

# Create 8 directories for data
$ for i in `seq 1 8`; do sudo mkdir /kudu-data$i; sudo chown kudu:kudu /kudu-data$i; done

In this simple example, there's no real benefit to creating eight directories that are mounted on the same underlying disk. This is more relevant when each of these Kudu directories, including the WAL directory, is written to a mount point on its own disk. Let's format these directories now for Kudu use:

sudo kudu fs format -fs_wal_dir=/kudu-wal
-fs_data_dirs=/kudu-data1,/kudu-data2,/kudu-data3,/kudu-data4,/kudu-data5,
/kudu-data6,/kudu-data7,/kudu-data8

The output you'll see is something along these lines:

I0821 22:33:10.068599 3199 env_posix.cc:1455] Raising process file limit from 1024 to 4096
I0821 22:33:10.068696 3199 file_cache.cc:463] Constructed file cache lbm with capacity 1638
I0821 22:33:10.078719 3199 fs_manager.cc:377] Generated new instance metadata in path /kudu-data1/instance:
uuid: "a7bc320ed46b47719da6c3b0073c74cc"
format_stamp: "Formatted at 2017-08-22 02:33:10 on ip-172-31-48-12.ec2.internal"
I0821 22:33:10.083366 3199 fs_manager.cc:377] Generated new instance metadata in path /kudu-data2/instance:
uuid: "a7bc320ed46b47719da6c3b0073c74cc"
...
I0821 22:33:10.141870 3199 fs_manager.cc:377] Generated new instance metadata in path /kudu-wal/instance:
uuid: "a7bc320ed46b47719da6c3b0073c74cc"
format_stamp: "Formatted at 2017-08-22 02:33:10 on ip-172-31-48-12.ec2.internal"

Notice that this single filesystem is registered under a given uuid, in this case a7bc320ed46b47719da6c3b0073c74cc. You can also specify this uuid when creating the formatted data directory.

dump
The dump command enables you to dump various portions of the filesystem. We begin by dumping the uuid of the filesystem, which we get by specifying the WAL and data directories:

sudo kudu fs dump uuid -fs_wal_dir=/var/lib/kudu/tserver

This will dump a little information, where the relevant last line shows the uuid:

....
uuid: "3505a1efac6b4bdc93a5129bd7cf624e"
format_stamp: "Formatted at 2017-08-15 17:08:52 on ip-172-31-48-12.ec2.internal"
3505a1efac6b4bdc93a5129bd7cf624e

Next, we want to get more information about this filesystem. We can dump it at the block level and see a lot more information:

$ sudo kudu fs dump tree -fs_wal_dir=/var/lib/kudu/tserver

This gives us information such as the following:

1 data directories: /var/lib/kudu/tserver/data
Total live blocks: 6
Total live bytes: 8947
Total live bytes (after alignment): 32768
Total number of LBM containers: 5 (0 full)
...
uuid: "3505a1efac6b4bdc93a5129bd7cf624e"
format_stamp: "Formatted at 2017-08-15 17:08:52 on ip-172-31-48-12.ec2.internal"
File-System Root: /var/lib/kudu/tserver
|-instance
|-wals/
|----4322392e8d3b49538a959be9d37d6dc0/
|------wal-000000001
|------index.000000000
|----15346a340a0841798232f2a3a991c35d/
|------wal-000000001
|------index.000000000
|----7eea0cba0b854d28bf9c4c7377633373/
|------wal-000000001
|------index.000000000
|-tablet-meta/
|----4322392e8d3b49538a959be9d37d6dc0
|----15346a340a0841798232f2a3a991c35d
|----7eea0cba0b854d28bf9c4c7377633373
|-consensus-meta/
|----4322392e8d3b49538a959be9d37d6dc0
|----15346a340a0841798232f2a3a991c35d
|----7eea0cba0b854d28bf9c4c7377633373
|-data/
|----block_manager_instance
|----35b032f3488e4db390ee3fc3b90249c7.metadata
|----35b032f3488e4db390ee3fc3b90249c7.data
|----eb44ed9bbcce479c89908fbe939ccf49.metadata
|----eb44ed9bbcce479c89908fbe939ccf49.data
|----9e4d11c4b6ac4e1ea140364d171e5e7b.metadata
|----9e4d11c4b6ac4e1ea140364d171e5e7b.data
|----80a281a920fc4c10b79ea7d2b87b8ded.metadata
|----80a281a920fc4c10b79ea7d2b87b8ded.data
|----2f5bc79e814f4a6a98b12a79d5994e99.metadata
|----2f5bc79e814f4a6a98b12a79d5994e99.data

Columns are broken down into blocks, each of which has its own block ID. You can find the block ID by using the dump commands on local or remote replicas; the next section, on dumping tablet information, shows how to get the block ID. Once you have it, you can dump the contents in human-readable format with the following command:

$ sudo kudu fs dump cfile 0000000000000007 -fs_wal_dir=/var/lib/kudu/tserver
Header:

Footer:
data_type: INT64
encoding: BIT_SHUFFLE
num_values: 1
posidx_info { root_block { offset: 51 size: 19 } }
validx_info { root_block { offset: 70 size: 16 } }
compression: NO_COMPRESSION
metadata { key: "min_key" value: "\200\000\000\000\000\000\000\001" }
metadata { key: "max_key" value: "\200\000\000\000\000\000\000\001" }
is_type_nullable: false
incompatible_features: 0

1

As an administrator, this gives visibility into the actual block, showing various details such as data types, encoding strategies, compression settings, min/max keys, and more.

Tablet Replica
You can perform tablet replica operations either locally or remotely by using the local_replica or remote_replica operations, respectively. These operations take a slightly higher-level view than the blocks we saw at the end of the last section. The remote_replica command is useful because you don't have to log in to each and every machine just to run the commands. Let's begin by listing the replicas we have on our server:

sudo kudu local_replica list -fs_wal_dir=/var/lib/kudu/tserver
...

4322392e8d3b49538a959be9d37d6dc0
15346a340a0841798232f2a3a991c35d
7eea0cba0b854d28bf9c4c7377633373

We can inspect one of these local replicas by dumping information about it. Let's begin with the block IDs themselves:

sudo kudu local_replica dump block_ids 15346a340a0841798232f2a3a991c35d
-fs_wal_dir=/var/lib/kudu/tserver

Rowset 0
Column block for column ID 0 (key[int64 NOT NULL]): 0000000000000007
Column block for column ID 1 (ts_val[unixtime_micros NOT NULL]): 0000000000000008

We can further dump information about the metadata of the tablet replica:

$ sudo kudu local_replica dump meta 15346a340a0841798232f2a3a991c35d
-fs_wal_dir=/var/lib/kudu/tserver

Partition: HASH (key) PARTITION 1, RANGE (key) PARTITION UNBOUNDED
Table name: python-example
Table id: d8f1c3b488c1433294c3daf0af4038c7
Schema (version=0): Schema [ 0:key[int64 NOT NULL], 1:ts_val[unixtime_micros NOT NULL] ]
Superblock:
table_id: "d8f1c3b488c1433294c3daf0af4038c7"
tablet_id: "15346a340a0841798232f2a3a991c35d"
last_durable_mrs_id: 0
rowsets {
  id: 0
  last_durable_dms_id: -1
  columns { block { id: 7 } column_id: 0 }
  columns { block { id: 8 } column_id: 1 }
  bloom_block { id: 9 }
}
table_name: "python-example"
schema {
  columns {
    id: 0
    name: "key"
    type: INT64
    is_key: true

    is_nullable: false
    encoding: AUTO_ENCODING
    compression: DEFAULT_COMPRESSION
    cfile_block_size: 0
  }
  columns {
    id: 1
    name: "ts_val"
    type: UNIXTIME_MICROS
    is_key: false
    is_nullable: false
    encoding: AUTO_ENCODING
    compression: LZ4
    cfile_block_size: 0
  }
}
schema_version: 0
tablet_data_state: TABLET_DATA_READY
orphaned_blocks { id: 10 }
partition {
  hash_buckets: 1
  partition_key_start: "\000\000\000\001"
  partition_key_end: ""
}
partition_schema {
  hash_bucket_schemas {
    columns { id: 0 }
    num_buckets: 2
    seed: 0
  }
  range_schema {
    columns { id: 0 }
  }
}

In this block of replica metadata output, we have preamble content describing the table itself, such as its name and schema. Next, rowset information is provided on a per-column level, followed by a detailed description of the table schema, per column. The final portion of the dump shows us the state of this tablet and provides details about the partitioning schema that this replica uses.

Next, let's go a bit deeper, looking into the rowset itself:

$ sudo kudu local_replica dump rowset 15346a340a0841798232f2a3a991c35d
-fs_wal_dir=/var/lib/kudu/tserver

Dumping rowset 0

----------------------------------------
RowSet metadata: id: 0
last_durable_dms_id: -1
columns { block { id: 7 } column_id: 0 }
columns { block { id: 8 } column_id: 1 }
bloom_block { id: 9 }

Dumping column block 0000000000000007 for column id 0 (key[int64 NOT NULL]):
----------------------------------------
CFile Header:

Dumping column block 0000000000000008 for column id 1 (ts_val[unixtime_micros NOT NULL]):
----------------------------------------
CFile Header:

We get similar information as in the replica dump, beginning with how the rowset looks per column, and then we gain detailed insight into each column block that exists for the rowset. Finally, we dump out the WAL information for this replica, where we see a sequence number in the WAL, followed by schema information, compression details, a set of operations that occurred, and information about the minimum and maximum replicate indices:

$ sudo kudu local_replica dump wals 15346a340a0841798232f2a3a991c35d
-fs_wal_dir=/var/lib/kudu/tserver

Header:
tablet_id: "15346a340a0841798232f2a3a991c35d"
sequence_number: 1
schema {
  columns {
    id: 0
    name: "key"
    type: INT64
    is_key: true
    is_nullable: false
    encoding: AUTO_ENCODING
    compression: DEFAULT_COMPRESSION
    cfile_block_size: 0

  }
  columns {
    id: 1
    name: "ts_val"
    type: UNIXTIME_MICROS
    is_key: false
    is_nullable: false
    encoding: AUTO_ENCODING
    compression: LZ4
    cfile_block_size: 0
  }
}
schema_version: 0
compression_codec: LZ4
1.1@6155710131387969536 REPLICATE NO_OP
id { term: 1 index: 1 } timestamp: 6155710131387969536 op_type: NO_OP noop_request { }
COMMIT 1.1
op_type: NO_OP commited_op_id { term: 1 index: 1 }
1.2@6155710131454771200 REPLICATE WRITE_OP
Tablet: 15346a340a0841798232f2a3a991c35d
RequestId: client_id: "1ffdc96b901545468e37569a262d3bd1" seq_no: 1 first_incomplete_seq_no: 0 attempt_no: 0
Consistency: CLIENT_PROPAGATED
op 0: INSERT (int64 key=1, unixtime_micros ts_val=2017-08-16T04:48:38.803992Z)
op 1: MUTATE (int64 key=1) SET ts_val=2017-01-01T00:00:00.000000Z
COMMIT 1.2
op_type: WRITE_OP commited_op_id { term: 1 index: 2 }
result { ops { skip_on_replay: true mutated_stores { mrs_id: 0 } } ops { skip_on_replay: true mutated_stores { mrs_id: 0 } } }
2.3@6157789003386167296 REPLICATE NO_OP
id { term: 2 index: 3 } timestamp: 6157789003386167296 op_type: NO_OP noop_request { }
COMMIT 2.3
op_type: NO_OP commited_op_id { term: 2 index: 3 }
Footer:
num_entries: 6 min_replicate_index: 1 max_replicate_index: 3

We can also look at the remote_replica command, and what we'll do is copy one of those replicas over to this node. First, we take a look at the list of tablet servers we have so that we can move a replica from one of those servers to our local server:

$ kudu tserver list ip-172-31-48-12
              uuid               |           rpc-addresses
---------------------------------+------------------------------------
bb50e454938b4cce8a6d0df3a252cd44 | ip-172-31-54-170.ec2.internal:7050
060a83dd48e9422ea996a8712b21cae4 | ip-172-31-56-0.ec2.internal:7050

3505a1efac6b4bdc93a5129bd7cf624e | ip-172-31-48-12.ec2.internal:7050
e60fc0618b824f6a994748c053f9f4c2 | ip-172-31-59-149.ec2.internal:7050

Now let's do a remote replica listing (Figure 4-13). This is different from a local_replica listing because we see not only the list of tablet IDs, but also the state, table name, partition, estimated on-disk size, and schema:

$ sudo kudu remote_replica list ip-172-31-56-0.ec2.internal:7050
Tablet id: 7eea0cba0b854d28bf9c4c7377633373
State: RUNNING
Table name: python-example-repl2
Partition: HASH (key) PARTITION 3, RANGE (key) PARTITION UNBOUNDED
Estimated on disk size: 310B
Schema: Schema [ 0:key[int64 NOT NULL], 1:ts_val[unixtime_micros NOT NULL] ]

Tablet id: ca921612aeac4516886f36bab0415749
State: RUNNING
Table name: python-example-repl2
Partition: HASH (key) PARTITION 1, RANGE (key) PARTITION UNBOUNDED
Estimated on disk size: 341B
Schema: Schema [ 0:key[int64 NOT NULL], 1:ts_val[unixtime_micros NOT NULL] ]

Tablet id: 4e2dc9f0cbde4597a7dca2fd1a05f995
State: RUNNING
Table name: python-example-repl2
Partition: HASH (key) PARTITION 2, RANGE (key) PARTITION UNBOUNDED
Estimated on disk size: 0B
Schema: Schema [ 0:key[int64 NOT NULL], 1:ts_val[unixtime_micros NOT NULL] ]

Tablet id: 1807d376c9a044daac9b574e5243145b
State: RUNNING
Table name: python-example-repl2
Partition: HASH (key) PARTITION 0, RANGE (key) PARTITION UNBOUNDED
Estimated on disk size: 310B
Schema: Schema [ 0:key[int64 NOT NULL], 1:ts_val[unixtime_micros NOT NULL] ]

For the purpose of simplicity in the following figures, we make note of the last three alphanumeric characters of the tablet IDs listed in the previous code example so that it is easier to identify which tablet we are describing.

Figure 4-13. Remote replica listing

Copy a remote replica to a local server
To copy a replica from a remote server, you need to stop the local tablet server you are copying to. The copy command then connects to the remote, up-and-running server and makes a copy of the replica locally.

After you're logged in to the local server, stop the tablet server by using the following command:

sudo systemctl stop kudu-tserver

Suppose that we want to copy over tablet ID 1807d376c9a044daac9b574e5243145b from remote replica ip-172-31-56-0. We already have that replica locally. Let's see how this works:

$ su - kudu
$ kudu local_replica copy_from_remote 1807d376c9a044daac9b574e5243145b
ip-172-31-56-0.ec2.internal:7050 -fs_wal_dir=/var/lib/kudu/tserver

I0821 23:43:38.586378 3078 tablet_copy_client.cc:166] T 1807d376c9a044daac9b574e5243145b
P 3505a1efac6b4bdc93a5129bd7cf624e: Tablet Copy client: Beginning tablet copy session
from remote peer at address ip-172-31-56-0.ec2.internal:7050
Already present: Tablet already exists: 1807d376c9a044daac9b574e5243145b

You must run the copy_from_remote command of local_replica as the kudu user itself to have the appropriate permissions to perform the task. You can't do this as root or as a sudo user.

The error message is clear: if a replica of a given tablet exists on the remote server but is also already present on the local server, Kudu won't actually allow you to copy it—you already have it (Figure 4-14).

Figure 4-14. Failed copy with an alert that the tablet already exists

Let's now pick a replica that we do not have on our local server. In our case, PARTITION 1 is found only on the remote server, not locally. It is represented by ID ca921612aeac4516886f36bab0415749. Let's try to copy that:

$ kudu local_replica copy_from_remote ca921612aeac4516886f36bab0415749
ip-172-31-56-0.ec2.internal:7050 -fs_wal_dir=/var/lib/kudu/tserver
I0821 23:55:19.530689 3971 tablet_copy_client.cc:166] T ca921612aeac4516886f36bab0415749
P 3505a1efac6b4bdc93a5129bd7cf624e: Tablet Copy client: Beginning tablet copy session
from remote peer at address ip-172-31-56-0.ec2.internal:7050
I0821 23:55:19.542532 3971 tablet_copy_client.cc:422] T ca921612aeac4516886f36bab0415749
P 3505a1efac6b4bdc93a5129bd7cf624e: Tablet Copy client: Starting download of 3 data blocks...
I0821 23:55:19.562101 3971 tablet_copy_client.cc:385] T ca921612aeac4516886f36bab0415749
P 3505a1efac6b4bdc93a5129bd7cf624e: Tablet Copy client: Starting download of 1 WAL segments...
I0821 23:55:19.566256 3971 tablet_copy_client.cc:292] T ca921612aeac4516886f36bab0415749
P 3505a1efac6b4bdc93a5129bd7cf624e: Tablet Copy client: Tablet Copy complete.
Replacing tablet superblock.

We've now copied the data successfully. At this point, we have four replicas of this tablet, as it was already active on three replicas. However, although the tablet has been copied, it doesn't immediately take part in the Raft consensus. Consequently, it doesn't actually have any role yet; this local replica can be neither a follower nor a leader.

Let's take a look at our table again and the specific set of tablets we have. The value next to each T gives us the tablet ID, whereas the P entries that follow give us the replica UUIDs:

In our shell, we’ve exported the variable KMASTER with the list of Kudu master servers.

[ec2-user@ip-172-31-48-12 ~]$ kudu table list $KMASTER -list_tablets
python-example-repl2
T 1807d376c9a044daac9b574e5243145b
  P 060a83dd48e9422ea996a8712b21cae4 (ip-172-31-56-0.ec2.internal:7050)
  P(L) e60fc0618b824f6a994748c053f9f4c2 (ip-172-31-59-149.ec2.internal:7050)
  P bb50e454938b4cce8a6d0df3a252cd44 (ip-172-31-54-170.ec2.internal:7050)
T ca921612aeac4516886f36bab0415749
  P(L) e60fc0618b824f6a994748c053f9f4c2 (ip-172-31-59-149.ec2.internal:7050)
  P 060a83dd48e9422ea996a8712b21cae4 (ip-172-31-56-0.ec2.internal:7050)
  P bb50e454938b4cce8a6d0df3a252cd44 (ip-172-31-54-170.ec2.internal:7050)
T 4e2dc9f0cbde4597a7dca2fd1a05f995
  P(L) e60fc0618b824f6a994748c053f9f4c2 (ip-172-31-59-149.ec2.internal:7050)
  P 060a83dd48e9422ea996a8712b21cae4 (ip-172-31-56-0.ec2.internal:7050)
  P bb50e454938b4cce8a6d0df3a252cd44 (ip-172-31-54-170.ec2.internal:7050)
T 7eea0cba0b854d28bf9c4c7377633373
  P(L) e60fc0618b824f6a994748c053f9f4c2 (ip-172-31-59-149.ec2.internal:7050)
  P 060a83dd48e9422ea996a8712b21cae4 (ip-172-31-56-0.ec2.internal:7050)
  P bb50e454938b4cce8a6d0df3a252cd44 (ip-172-31-54-170.ec2.internal:7050)

Let's disable the replica of tablet ID ca921612aeac4516886f36bab0415749 with replica UUID 060a83dd48e9422ea996a8712b21cae4 so that we can enable it on the new system to which we just copied the replica:

$ kudu tablet change_config remove_replica $KMASTER ca921612aeac4516886f36bab0415749
060a83dd48e9422ea996a8712b21cae4

# Now add the replica on the 12 server
kudu tablet change_config change_replica_type $KMASTER ca921612aeac4516886f36bab0415749
060a83dd48e9422ea996a8712b21cae4 NON-VOTER

kudu tablet change_config add_replica $KMASTER ca921612aeac4516886f36bab0415749

Figure 4-15 depicts what we just performed. We identified the tablet replica we wanted to copy locally. After stopping the local tablet server first, we copied the tablet, disabled the old replica, activated the new one, and then restarted the tablet server.

Figure 4-15. Copying a remote replica locally

Deleting a replica
Using the local_replica delete option, you can manually delete a replica on a given tablet server. To do so, you first need to stop the tablet server and then carry out the delete operation:

# Stop the tablet server
sudo service kudu-tserver stop

# List the replicas on your tablet server
sudo kudu local_replica list --fs_wal_dir=/var/lib/kudu/tserver
4322392e8d3b49538a959be9d37d6dc0
15346a340a0841798232f2a3a991c35d
ca921612aeac4516886f36bab0415749

# Delete the one you prefer
sudo kudu local_replica delete 4322392e8d3b49538a959be9d37d6dc0
--fs_wal_dir=/var/lib/kudu/tserver
uuid: "3505a1efac6b4bdc93a5129bd7cf624e"
format_stamp: "Formatted at 2017-08-15 17:08:52 on ip-172-31-48-12.ec2.internal"
I0915 17:26:11.767314 2780 ts_tablet_manager.cc:1063] T 4322392e8d3b49538a959be9d37d6dc0
P 3505a1efac6b4bdc93a5129bd7cf624e: Deleting tablet data with delete state TABLET_DATA_TOMBSTONED
I0915 17:26:11.773490 2780 ts_tablet_manager.cc:1075] T 4322392e8d3b49538a959be9d37d6dc0
P 3505a1efac6b4bdc93a5129bd7cf624e:

Tablet deleted. Last logged OpId: 7.8
I0915 17:26:11.773535 2780 log.cc:965] T 4322392e8d3b49538a959be9d37d6dc0
P 3505a1efac6b4bdc93a5129bd7cf624e: Deleting WAL directory at
/var/lib/kudu/tserver/wals/4322392e8d3b49538a959be9d37d6dc0

Notice that the tablet data is actually set to the TABLET_DATA_TOMBSTONED state. We can see from the log message that the WAL has been deleted completely. Data from your replica is likewise cleaned up; however, consensus and tablet metadata information remains by default. This is to ensure that Raft vote durability requirements remain intact. To fully remove all the metadata as well (which is considered an unsafe operation because, if it is not done properly, it may yield harmful side effects), you can add the --clean_unsafe=true option to the delete operation, and all the replica metadata information will be removed.

Consensus Metadata
You can also inspect metadata for consensus information about replicas. We begin by printing replica UUIDs so that we're able to dig into the Raft configuration and dump all the peer UUIDs that exist for a tablet replica. Let's first list the set of replicas on our host:

# List replicas on this tablet server
sudo kudu local_replica list --fs_wal_dir=/var/lib/kudu/tserver
15346a340a0841798232f2a3a991c35d
ca921612aeac4516886f36bab0415749

Now we can see the peer UUIDs in Raft's configuration by picking one of the replicas:

# Peer UUIDs for Raft configuration
sudo kudu local_replica cmeta print_replica_uuids ca921612aeac4516886f36bab0415749
--fs_wal_dir=/var/lib/kudu/tserver
e60fc0618b824f6a994748c053f9f4c2 060a83dd48e9422ea996a8712b21cae4 bb50e454938b4cce8a6d0df3a252cd44

Hence, we can see the three UUIDs of peer replicas. This example is actually from a tombstoned replica. As a result, we can still see the Raft consensus information stored on this tablet server, and we would need to perform a delete with the --clean_unsafe option in order to fully clean up this information.

You can also rewrite a tablet's Raft configuration by using the rewrite_raft_config operation on the cmeta data. We leave it to you to check the documentation and proceed with caution; this type of action should be performed only by advanced users in situations when the system might be down.

Similarly, you can change Raft replication configurations on a remote tablet server. By passing in the remote tablet server, the tablet ID, and the set of peer UUIDs, you can overwrite the Raft consensus configuration. You can use this to recover from problematic scenarios such as the loss of a majority of replicas. But be aware that there is a risk of losing edits. See the kudu remote_replica unsafe_change_config operation for details.

Adding and Removing Tablet Servers
A common operation an administrator will want to perform properly is adding more nodes to a Kudu cluster. At the same time, there might be circumstances in which we need to remove nodes from the cluster. In this section, we cover the strategy to add and remove tablet servers as required.

Adding Tablet Servers
Adding new tablet servers typically just means installing the necessary binaries for tablet servers and making configuration changes to inform those tablet servers which master servers they need to register with. Here is an example of installing the binaries. Follow the Quick Install instructions from Chapter 3 to set up the Kudu repository. Then, install only the packages required for the tablet server:

sudo yum -y install kudu         # Base Kudu files (all nodes)
sudo yum -y install kudu-tserver # Kudu tablet server (tablet nodes only)

In our three-minute tutorial, we didn't bother to make any adjustments to configuration files, because everything was installed on the localhost. This time, we've installed only the tablet server, with no master server installed on this host. So let's set a few configuration parameters on this tablet server.

You can find the primary file that contains flags and options under /etc/kudu/conf/tserver.gflagfile. Open that file in your favorite editor and append the following line:

--tserver_master_addrs=ip-172-31-48-12.ec2.internal:7051

Now start the tablet server service itself:

sudo systemctl start kudu-tserver

Point your browser to the tablet server web UI logs to validate that the server registered appropriately, or just take a look at the logs in the terminal directly:

sudo vi /var/log/kudu/kudu-tserver.INFO

You should see something similar to the following:

sudo grep Register /var/log/kudu/kudu-tserver.INFO
I0816 00:41:36.482281 10669 heartbeater.cc:361] Registering TS with master...

At the top of the log file, you'll also see the parameters you specified as flags for the service.

Removing a Tablet Server
Removing a tablet server basically involves decommissioning it and then removing the packages associated with Kudu on that node. To decommission a tablet server gracefully, it is safest to take the following steps (an illustrative command sequence follows the list):

1. Copy all the replicas from the tablet server to another live tablet server and ensure each is part of the Raft consensus.
2. Remove all replicas from the given tablet server.
3. Stop the tablet server.
4. Remove the tablet server binaries.
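The sketch below strings together commands already shown in this chapter to illustrate those steps for a single tablet; the tablet ID, UUIDs, and host names are placeholders, a real decommission repeats the replica moves for every tablet hosted on the server, and the exact change_config arguments should be confirmed against kudu tablet change_config --help for your Kudu version:

# 1-2. For each tablet on the outgoing server, add a replica elsewhere and remove the old one
kudu tablet change_config add_replica $KMASTER <tablet-id> <new-tserver-uuid> VOTER
kudu tablet change_config remove_replica $KMASTER <tablet-id> <old-tserver-uuid>

# Verify cluster health before proceeding
kudu cluster ksck $KMASTER

# 3. Stop the tablet server on the decommissioned node
sudo systemctl stop kudu-tserver

# 4. Remove the tablet server binaries
sudo yum -y remove kudu-tserver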

To ensure that no new tablets are created on the tablet server while performing the decommissioning operation, simply make sure that no new DDL is taking place at the time of the removal. You can find the full list of options for configuring tablet servers in the Kudu documentation.

Security
In any serious deployment, regardless of the enterprise size, security can never be an afterthought. Security is typically thought about in the following three ways:

Authentication
  A guarantee that the person attempting access is who they say they are.
Authorization
  A mode of providing access to the given person, whether that be to a given technology or data.
Encryption
  Data flowing over the network and/or stored at rest is encrypted.

We go through some of these concepts using a simple analogy before we get into configuring Kudu with security.

A Simple Analogy
A simple example of authentication in everyday life is handing over your passport when passing through security at an airport. The border guard confirms that you are the passport holder by comparing the photo in the passport with you standing right in front of them. Passing this checkpoint means that you've been "authenticated"; in other words, they have validated that you are who you say you are.

In this airport analogy, as you progress to your gate for boarding, the agents at the gate validate that you indeed are allowed to board the plane. Your boarding pass allows you to board only the airplane for which you have purchased a ticket. This is a form of authorization because you are authorized, and authorized only, to board that one plane, which is cross-checked by the gate agents.

Now, suppose that you had a letter you were taking to your mom that you wanted to be kept a secret so that only your mom can read it when you reach her. If you were to scramble the contents of your letter while you were still at home, even if someone robbed you on the way to the airport, or someone managed to grab your letter while you were going through customs or boarding the plane, no one would be able to know or steal the contents of your message.

In the world of data management, "boarding the plane" would be the same as access to a database, a table within a database, or a file within a filesystem, for example. Meanwhile, the border guard authenticating you performs a job similar to that of the industry-standard big data strong authentication mechanism, Kerberos. Finally, the scrambled contents of your letter represent encryption in transit, and perhaps at rest as well, as you leave the letter at your mom's house.

We can further think through this analogy by realizing that authentication is something you, as a client, are responsible for, in the sense that you needed to bring your passport with you. Further, you as a client also need to make a judgment call that the border guard truly is a representative of the border patrol for the given country. We intuitively know this based on the uniform or badge they wear, and we know the agency that issued the border guard's badge from prior knowledge.

Hence, from a "passenger" or "client-side" perspective, there are two pieces of information we have access to on our end:

• A passport proving who we are
• Prior knowledge of the authority that issued the badge and uniform to the border officer

From an "airport/airline" or "server-side" perspective, we need to be configured to do the following:

• Agree to allow passport checks as the mode of validating who you are
• Have border guards who wear the right badge and uniform
• Have knowledge about which plane you are allowed to board

These concepts are close enough for what we need to understand when configuring security for Kudu. On the client side, the client needs to do the following:

• Be Kerberos enabled and have a Kerberos Ticket Granting Ticket (TGT) when accessing Kudu (i.e., the passport)
• Have knowledge of the Certificate Authority (CA) for validating Transport Layer Security (TLS) certificates configured on the Kudu end (i.e., knowledge of the authority for the guard's badge)

On the server side, the server needs to do the following:

• Be enabled for Kerberos authentication (i.e., agree to a passport being the mode of validation and authentication)
• Have TLS certificates that were signed by an appropriate CA (i.e., border guards wearing the right badge and uniform)
• Have an authorization database of who has access to which table, which would typically be done by a service such as Sentry (i.e., knowledge about the plane you're allowed to board, which is stored in a database by the airline)

With this primer under our belts, we can go through the capabilities Kudu has today with regard to security and where it is headed in the near future.

Kudu Security Features
Kudu's support of security features is continually increasing, and at the time of this writing, it supports the following:

• Encryption over the wire, using TLS
• Data-at-rest encryption
• Kerberos authentication
• Coarse-grained authorization
• Finer-grained authorization via Impala
• Log redaction
• Web UI security

Encryption over the wire
Encryption over the wire needs to be looked at from two different perspectives. The first is internal communication between tablet and master servers within the cluster. Kudu has a built-in mechanism to build and issue internal X.509 certificates to all the servers in the cluster. Hence, TLS is the primary mechanism used to encrypt traffic over the wire internal to the cluster.

These certificates serve a second purpose for intracluster communication: they offer strong authentication, so that each server within the cluster can trust that the service connecting to it, whether it be a tablet server or master server, is in fact who it says it is. Normally, Kerberos is used for authentication, but in this particular case, using the certificate strategy for intracluster communication puts less load on the Kerberos Key Distribution Center (KDC). This allows the cluster to scale really well, because each service in the cluster does not have to go to the KDC for authentication; rather, the servers rely on certificates to guarantee authentication for intracluster communication.

Encryption between Kudu clients and servers is also enabled by using this very same mechanism. Hence, enabling encryption for over-the-wire communication and authentication between servers is as simple as setting the flags:

--rpc_authentication=required
--rpc_encryption=required

Setting this set of flags simply means setting them as command-line parameters when starting the master and tablet servers, or as entries in the flagfile used during startup. Kudu services on all nodes must be started with this set of flags for them to properly take effect.

These flags are set when the master or tablet servers are started. Here is an example specifying them directly in the command-line options; you can also specify them in the gflagfile referenced by the --flagfile option:

kudu-master --rpc_authentication=required --rpc_encryption=required
--master_addresses=... --flagfile=gflagfile

Alternatively, if you're running Kudu in Cloudera, it is as simple as setting the enable_security flag in Cloudera Manager for the Kudu service, labeled as "Enable Secure Authentication and Encryption"; the preceding two parameters will be set automatically during startup.

We note here that --rpc_encryption has other valid options, such as disabled and optional, though their usage is not recommended. The optional setting has Kudu attempt to use encryption and, if that fails, allows unencrypted traffic only from trusted subnets. We leave it to you to look more into these options in the Kudu documentation; in general, we discourage this practice.

Data-at-rest encryption
Data-at-rest encryption is not provided by Kudu itself at this time; however, it is fully supported to store data on disks that have been encrypted with full-disk encryption tools such as dm-crypt. Hence, anything read from or written to these disks is encrypted on the fly, and this is transparent to the Kudu application itself.

Kerberos authentication
You enable Kerberos authentication by selecting the "Enable Secure Authentication and Encryption" flag in Cloudera Manager. A few things need to happen for Kerberos authentication to be enabled. A principal named kudu needs to be created in the Kerberos KDC; this name, kudu, is not customizable at the moment. Next, you need to create a keytab file for this principal, which is then referenced in the startup options as follows:

--keytab_file=<path-to-keytab-file>

Enabling Kerberos through Cloudera Manager creates the principal and keytab file and sets these automatically in the command-line start options of the Kudu services. With this enabled, the Kudu service only allows clients to connect who have been appropriately authenticated by Kerberos. However, note that any user who is authenticated by Kerberos can access the cluster at this point. Restricting who has access is what we describe in the next section.
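As a hedged illustration of what this looks like outside of Cloudera Manager, the flags below simply combine options already mentioned in this section; the keytab path, realm, and user are placeholders, and clients obtain a TGT with kinit before connecting:

# Server side: start the master with Kerberos and wire security (illustrative)
kudu-master --keytab_file=/etc/kudu/kudu.keytab \
  --rpc_authentication=required --rpc_encryption=required \
  --master_addresses=... --flagfile=gflagfile

# Client side: obtain a Kerberos ticket before using the CLI or client APIs
kinit etl_id@EXAMPLE.COM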

User authorization
With Kerberos enabled, we can be sure that the client authenticating to our service is who they say they are, similar to our earlier analogy in which the border guard has checked our passport and validated our identity. At the time of writing, Kudu on its own has two sets of users:

Superuser
  Administrative functionality to diagnose or repair issues using the kudu command-line tool.
User
  Access to and modification of all tables and data in a Kudu cluster.

When a distribution such as Cloudera manages the Kudu cluster, you can leave the Superuser parameter blank. This means that the service user that actually launched the Kudu master and tablet servers is allowed to perform the administrative functions. In this case, it would simply be the user kudu, especially given that this is the principal name used for Kerberos. If the cluster is managed by a distribution, you should leave this as is, unless there are specific reasons why a user needs to be able to use the command-line tools to fix the Kudu cluster in cases where you cannot use Cloudera Manager directly.

Next, the list of users who have access by default is everyone, denoted by an asterisk (*). This is extremely coarse-grained user access at the moment; however, the typical use case might be that Impala queries are coming from external reporting tools. Kudu client API access might be restricted by firewalls, or perhaps we assign a single functional ID for Extract, Transform, and Load (ETL) jobs that is allowed to create, load, and manipulate content within Kudu. Hence, we might want only two users given access to Kudu as a whole:

impala
etl_id

These parameters are set by the following command-line options, where we introduce the ID mko_adm as a superuser administrator alongside the kudu ID defined by the service principal that started the kudu daemon:

--user_acl=impala,etl_id
--superuser_acl=kudu,mko_adm

This type of authorization is certainly coarse at the time of writing; however, returning to the idea of Impala being the primary means of querying these tables, all the authorization controls in place for Impala would apply. Impala integrates with Sentry for authorization, so access to Impala tables, databases, and columns is controlled as it is today.

We highlight that it is important to remember that an Impala table on top of Kudu is just that—the two are decoupled. This means that although we are granted access to specific tables through Impala, if Kudu access through Kudu client API calls is not properly controlled by other means, users would be able to circumvent Impala altogether to manipulate and access data.

Log redaction All row data is redacted by default from any Kudu server logs to prevent any leakage of sensitive data. There are two more options available to us when it comes to log redaction:

• Redaction disabled in the web UI, but retained for server logs
• Redaction disabled altogether

You can set these by making the following adjustments: --redact=log : No redaction in web UI, but server logs still are redacting sensitive data --redact=none: No redaction performed anywhere (not recommended unless debugging!)

Web UI security Web UI security includes the ability to enable TLS for encrypting all traffic to and from clients connecting to the web UI. You need to prepare certificates in Privacy- Enhanced Mail (PEM) format, together with private keys, key passwords (password is emitted by running this specified command), and CA files, which you can set by using the following parameters: --webserver-certificate-file= --webserver-private-key-file= --webserver-private-key-password-cmd= Although TLS is enabled for encrypting traffic over the wire for the web UI, authori‐ zation for who is allowed access to this web UI is not controlled, (as of this writing) by Kerberos (which is typically done through SPNEGO). Until access is restricted, you might decide to fully disable the web server itself by using this command: --webserver-enabled=false It is important to realize that doing this will also disable REST API endpoints that yield metrics for Kudu, which will have an impact on monitoring systems that you might have set up to monitor Kudu. Basic Performance Tuning As with any storage system, there can be numerous in-depth performance tuning strategies to keep in mind. Kudu is still in its infancy, but there are a few areas of per‐ formance tuning that as an administrator you should understand. Thus far, a lot has been discussed about the type of underlying storage to make use of for the WALs and storage directories. In the chapters that follow, we take a closer look at schema design, which is absolutely critical to getting appropriate performance for your workloads. From an administrator’s perspective there are a few tunables to keep in mind that we list here, but we acknowledge that it is not an exhaustive list by any means. It is also very important to recognize when the performance bottleneck is actually on the

88 | Chapter 4: Kudu Administration server side of Kudu rather than the client. Sometimes, it is the clients that do not have enough resources (CPU or memory) or enough parallelism to drive the Kudu server to its limits. From a high-level perspective, there are three primary areas of focus when it comes to tuning for performance:

• Amount of memory assigned to Kudu
• Appropriate partitioning strategies used (as mentioned, this is discussed in future chapters)
• The number of maintenance manager threads

Kudu Memory Limits Kudu is built from the ground up in C++, using efficient and effective memory management techniques. Although it is meant to be lean, it is still worthwhile to give Kudu as much memory as you can allow in the cluster. In multitenant clusters (in this case, multitenant means various services such as HDFS, HBase, Impala, YARN, and Spark all deployed in your big data environment), you might need to plan carefully how memory is carved out for each of the services. Here's the parameter to adjust: --memory_limit_hard_bytes In production environments, you should consider starting this value between 24 GB and 32 GB. You can certainly go higher for these memory settings as well. Maintenance Manager Threads Maintenance manager threads are background threads that perform various tasks. They do work such as managing memory by flushing data from memory to disk (thereby switching from the row-oriented in-memory format of records to a column-oriented on-disk format), improving overall read performance, or freeing up disk space. Work is scheduled onto these threads whenever a thread becomes available.
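As a concrete sketch, the two knobs discussed in this section and the next might appear in a tablet server's flag file as follows; the values are illustrative only (roughly 32 GB of memory and four maintenance threads for a 12-disk server) and need to be sized against your own hardware:
--memory_limit_hard_bytes=34359738368
--maintenance_manager_num_threads=4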

Basic Performance Tuning | 89 Row-oriented data is very easy to write out because very little pro‐ cessing is needed to write out a row. Row-oriented means that the data in that row has nothing to do with data in another row. For this reason, it is going to be a very fast write-oriented format. Column-oriented data is more difficult to write fast because it needs more processing to get a row into this format. As a result, column-oriented datasets are faster on reads. When Kudu holds data in memory, waiting for it to be flushed to disk, it is stored in write-optimized row format. However, after it is flushed to disk by the maintenance manager thread, it is then converted into colum‐ nar storage, optimized for aggregate read queries.

The number of threads is typically set according to the number of data drives assigned to a given tablet server. If there are 12 drives on a tablet server, we should set approximately four threads, or one-third the number of spinning disks. The faster the disks are, however, the more threads you can push. Starting from one-third, it is safe to push this number up to around the number of disks minus one. This is a general recommendation; you will need to test it for the workload you are tuning. In other words, do not leave this at the default of one thread, but also do not set it to many tens of threads. Here's the parameter in question: --maintenance_manager_num_threads Monitoring Performance Kudu's web UI provides a JSON metrics page that can help you easily and quickly zero in on performance metrics that might indicate a bottleneck. JSON-format metrics hook nicely into various dashboarding utilities and tools to provide insight into what is going on in the system. Getting Ahead and Staying Out of Trouble As an administrator, it is often a great idea to try to get ahead of trouble with your cluster. We make a few recommendations in this section to help guide you along. Avoid Running Out of Disk Space Running out of disk space can cause undesired behavior in any software platform, so there are two parameters to keep in mind to stay ahead of the curve:

--fs_data_dirs_reserved_bytes --fs_wal_dir_reserved_bytes The default for these is 1% of disk space being reserved on the data and WAL directory filesystems for non-Kudu usage. Setting this closer to the 5% range (though the parameter itself is set as a number of bytes, not a percentage, so you need to perform your own calculations) is one place to start. Disk Failures Tolerance Kudu is a quickly evolving product, and it is now resilient to disk failures occurring on tablet servers. Kudu master servers are not yet resilient to disk failures. As a result, when a disk fails, the entire Kudu master server is not usable. Tablet servers are tolerant to disk failures; however, if a failure happens on the disk where the WALs or tablet metadata are stored, it will still incur a crash. We leave it to you to stay on top of Kudu developments when it comes to tolerance levels as the product continues to mature. Tablet data will actually be striped across multiple disks that are assigned to a tablet server. The default value is 3, and if the value is set to the number of data directories or greater, data will be striped across all of the available disks on the system. If fewer data directories are assigned per tablet, there is less chance of all tablets on a given tablet server being affected if a drive were to fail (given that data might not be striped onto the failed disk for all tablets existing on that tablet server). Following is the parameter that controls the number of data directories targeted for a given tablet replica: --fs_target_data_dirs_per_tablet Backup Core backup functionality is still coming to Kudu, but even when it arrives, it is still worth thinking about the various strategies that could be employed for backing up data. Using MapReduce, Spark, or Impala, you could read Kudu tables and write them out to HDFS, preserving the schema but writing in Parquet format. After it is in HDFS, data can be shipped to other clusters in disaster recovery zones, or put in the cloud with the distcp utility, or even backed up with Cloudera's Backup Disaster Recovery (BDR) tool (which maintains table metadata as well, among other benefits). Another way to think about it is to ingest data immediately into two separate datacenters, especially if data is streaming in. Of course, you have to account for any outages in any of the Kudu clusters you're writing to, or even the speeds at which the

tables are being populated because the two clusters might be out of synchronization as to how far they got in writing their data. Depending on the use case, this might be sufficient without much synchronization at all. Conclusion In this chapter, we covered some common administrator tasks when operating Kudu. We helped administrators plan their environments, discussed the web UI, and went through a slew of CLI commands with which administrators should become familiar. We provided introductory material on enabling security, covered common scenarios administrators need to perform such as adding and removing nodes, and wrapped up by talking about basic performance tuning and tips on avoiding pitfalls. Next, we focus on the needs of the developer; namely, how to get up and running quickly as a developer of a new Kudu application using your programming language and framework of choice.

CHAPTER 5 Common Developer Tasks for Kudu

At its very core, Apache Kudu is a highly resilient, distributed, fault-tolerant storage engine that manages structured data really well. Moving data into Kudu and getting it out is meant to be done easily and efficiently through simple-to-understand APIs. For the developer, you have several choices in how you could interact with the data you store in Kudu. Client-side APIs are provided for the following programming lan‐ guages:

• C++
• Java
• Python

Compute frameworks such as MapReduce and Spark are also available when interact‐ ing with Kudu. MapReduce, using the Java client, has a native Kudu input format, whereas Spark’s API provides a specialized Kudu Context together with deep integra‐ tion with Spark SQL. Providing SQL access to Kudu is a natural fit given that Kudu stores data in a struc‐ tured, strongly typed fashion. Thus, as of today, not only can you use Spark SQL to access and manipulate your data, but also Apache Impala. Impala is an open source, native analytic database for Hadoop and is shipped by multiple Hadoop distributions. It, too, provides a clean abstraction of tables that can exist in Kudu, Hadoop Dis‐ tributed File System (HDFS), HBase, or cloud-based object stores like Amazon Web Services Simple Storage Service (Amazon S3). In this chapter, we dive into the various client-side APIs, including Spark, and then round out the chapter discussing how Impala’s integration with Kudu can be used for many types of use cases. All code snippets described in this chapter are available in our GitHub repository.

93 Client API Client APIs have a general workflow and set of objects in common for which a client application needs to be designed. Regardless of the language being used, you obtain instances of these objects in order to define, manipulate, and interact with data stored in Kudu. Kudu Client The Kudu Client is created by supplying a list of Kudu master addresses. This object is used to perform the following operations:

• Check for table existence
• Issue Data Definition Language (DDL) operations such as create, delete, and alter table
• Obtain a reference to a Kudu Table object

Thus, you can consider it your primary entry point to interacting with Kudu. After the Table object reference is obtained, you can begin manipulating content within the table. Kudu Table A reference to the Kudu Table object allows us in general to perform the following operations:

• Insert/delete/update/upsert rows
• Scan rows

Tables will have a specific schema with strongly typed columns and partitioning strategies which also need to be specified by particular objects. Kudu DDL To define a new Kudu table, you need to define a schema. This Kudu Schema object includes column names, their types, nullability, and default values. Next, you need to define a partitioning method. Kudu supports multiple approaches to partitioning, including partitioning by range and/or by hash value. For this, we define a Kudu Partial Row object. Finally, we have a Kudu Table Creator object that uses the builder pattern to make it easier to supply all the bits needed to define your table.
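To make these objects concrete, here is a minimal sketch using the Python client that is covered later in this chapter. The table and column names are hypothetical, the type names are passed as strings (adjust to the constants your client version expects), and error handling is omitted:
import kudu
from kudu.client import Partitioning

# The Kudu Client is the entry point: connect by supplying the master address(es).
client = kudu.connect(host='kudu-master-1', port=7051)

# Kudu Schema: column names, types, nullability, and the primary key.
builder = kudu.schema_builder()
builder.add_column('id').type('int32').nullable(False)
builder.add_column('name').type('string')
builder.set_primary_keys(['id'])
schema = builder.build()

# Partitioning: hash the primary key column into a handful of buckets.
partitioning = Partitioning().add_hash_partitions(column_names=['id'], num_buckets=4)

# The table creator ties the schema and partitioning together.
client.create_table('example_table', schema, partitioning, 1)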

To summarize, the set of objects needed to actually create your table includes the following:
Kudu Schema
Defining your columns
Kudu Partial Row
Defining your partitioning columns
Kudu Table Creator
Builder pattern implementation for defining your table
Kudu Scanner Read Modes Scanning through your data is a common practice, and there are a number of important read modes to be familiar with when opening up a scan operation:

• READ_LATEST
• READ_AT_SNAPSHOT
• READ_YOUR_WRITES

READ_LATEST always returns the committed writes at the time the request was received. If we thought of this in Atomicity, Consistency, Isolation, Durability (ACID) terms, it would be the same as isolation mode, “Read Committed”; and this is the default mode. It is not a repeatable read, because data can continually be inserted and/or committed from the last time you performed your scan, so the data coming back could be different each time. READ_AT_SNAPSHOT allows for a timestamp to be provided, and will attempt to per‐ form the read at that time. If no timestamp is provided, it will take the current time as the “snapshot.” Hence, future reads of this dataset should return the exact same set of rows. This particular read mode might incur latency penalties because it needs to wait for in-flight transactions that are taking place prior to the given timestamp to com‐ plete, before this scan returns. In ACID terms again, this is closer to resembling “Repeatable Read” isolation mode. Meanwhile, if all writes are made consistent (by external mechanisms), this would result in isolation mode, “Strict Serializable.” The newest addition at the time of this writing is READ_YOUR_WRITES, where the read knows all the writes and reads made from this client’s session. It ensures both read- your-writes and read-your-reads session guarantees while minimizing latency caused by waiting for outstanding write transactions to complete. This mode once again will return different results potentially each time you run this scan because additional writes might have occurred in the meantime. This is an experimental feature for now, but watch for it becoming a stabilized feature in future releases.
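As a sketch of how a read mode is selected, the following uses the Python client described later in this chapter. It assumes the scanner object exposes a set_read_mode method that accepts 'snapshot' (check the documentation for your client version); the master host and table name are placeholders:
import kudu

client = kudu.connect(host='kudu-master-1', port=7051)
table = client.table('metrics')

# Open a snapshot scan so that re-reading this scanner returns the same rows.
scanner = table.scanner()
scanner.set_read_mode('snapshot')   # READ_AT_SNAPSHOT; the default is READ_LATEST
scanner.open()
rows = scanner.read_all_tuples()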

C++ API Kudu is written in C++, whereas most of the traditional Hadoop ecosystem components are Java Virtual Machine–based services. This allows for many optimizations, including not having to rely on garbage collection and having efficient memory and CPU utilization. The benefit of writing a client application in C++ is that you will be a first-class citizen, with access to the latest and greatest features of Kudu at all times. Your application can likewise be lean in terms of resource utilization and more. You can always look up the latest C++ API in the documentation. Let's run through a few examples (the code snippets are available in our GitHub repository). Even before we begin, we include a number of key API header files:
#include "kudu/client/callbacks.h"
#include "kudu/client/client.h"
#include "kudu/client/row_result.h"
#include "kudu/client/stubs.h"
#include "kudu/client/value.h"
#include "kudu/common/partial_row.h"
The key files we're interested in include client.h and partial_row.h as we define our tables. We begin by creating and connecting with a KuduClient object, using the builder pattern supplied by the API:
// Create and connect a client
shared_ptr<KuduClient> client;
KuduClientBuilder()
    .add_master_server_addr(masterHost)
    .default_admin_operation_timeout(MonoDelta::FromSeconds(10))
    .Build(&client);
We can see that we pass in the set of master servers in the already defined variable masterHost, which should be a comma-delimited list of master servers. If you deviate from the default port number for the master servers, you need to specify the port, as well, after each master hostname. We can use the KuduClient reference to check the existence of a table and delete the table if it exists. Here is an example:
shared_ptr<KuduTable> table;
Status s = client->OpenTable(kTableName, &table);
if (s.ok()) {
  client->DeleteTable(kTableName);
} else if (s.IsNotFound()) {
  // Decide what to do if the table is not found

} else {
  // Handle any other errors
}
Notice that in this case we are introduced to the Status object, which is a worthwhile class to look into so that you are familiar with the various return types and codes that are available to you as you check the responses from the Kudu server in your client code. Using the Builder pattern, we begin to define our schema:
KuduSchema schema;
KuduSchemaBuilder sb;
sb.AddColumn("id")->Type(KuduColumnSchema::INT32)->NotNull()->PrimaryKey();
sb.AddColumn("lastname")->Type(KuduColumnSchema::STRING)->NotNull();
sb.AddColumn("firstname")->Type(KuduColumnSchema::STRING)->NotNull();
sb.AddColumn("city")->Type(KuduColumnSchema::STRING)->NotNull()
    ->Default(KuduValue::CopyString("Toronto"));
The Builder pattern enables us to continually add new columns programmatically, specifying the column name, the type, nullability, and default values, as well as the primary key for the table. Now we pass the KuduSchema pointer into the Builder's Build function to actually populate our object:
sb.Build(&schema);
Next, we create an array, or vector in this case, of a set of KuduPartialRow objects. Each partial row is retrieved from the schema object we defined earlier. A partial row just represents a row with a schema from your table, and you can supply a value to one or more columns. In this context of defining a table, we're really picking the column that will be our partitioning key column. Along with it, we supply values for this partitioning key column that specify where the splits are defined for the partitioned table:
vector<const KuduPartialRow*> splits;
KuduPartialRow *row = schema.NewRow();
row->SetInt32("id", 1000);
splits.push_back(row);
row = schema.NewRow();
row->SetInt32("id", 2000);
splits.push_back(row);
row = schema.NewRow();
row->SetInt32("id", 3000);
splits.push_back(row);
As each partial row is added into the splits vector, we have our ranges defined for the partitioning scheme we're after in this example.

Next, we need a vector of column names that will represent our range-partitioned column. This list must match the column defined in the splits we just designated:
vector<string> columnNames;
columnNames.push_back("id");
Taking all the pieces we've built thus far, we can finally call into our KuduClient table creator and define our table:
KuduTableCreator *tableCreator = client->NewTableCreator();
s = tableCreator->table_name(kTableName)
    .schema(&schema)
    .set_range_partition_columns(columnNames)
    .split_rows(splits)
    .num_replicas(1)
    .Create();
We can see that this Builder pattern accepts the table name, the schema we defined (including the primary key), the set of range partition column names followed by the row splits, the number of replicas we want for this table, and finally the Create function to actually define the table in Kudu. Python API The Python API is really just an interface to the C++ client API. It is easiest to get started by simply installing from PyPI. If you want to install from source, you will be required to install Cython. More on this later. The full code example described in this section is available in our GitHub repository. Preparing the Python Development Environment In two simple steps, we're off to the races:

1. Install the C++ client libraries and headers
2. Install the kudu-python PyPI libraries

To install the C++ client libraries and headers, you just need the following pair of packages: sudo yum -y install kudu-client-devel kudu-client0 For more information, see “Quick Install: Three Minutes or Less” on page 41.

98 | Chapter 5: Common Developer Tasks for Kudu Next, we install pip, though we leave it to you to look up the latest pip documenta‐ tion. Install the kudu-python package using pip: sudo pip install kudu-python Our examples run with Python 3.5. We install Python 3.5 by compiling from source. Remember when doing so to install the following additional libraries before compil‐ ing Python 3.5 to make sure that pip3 is successfully installed, as well: yum -y install zlib-devel bzip2-devel sqlite sqlite-devel openssl-devel After installing Python 3, install the kudu-python packages: sudo pip3 install kudu-python Python Kudu Application Python on Kudu applications begin by importing the most common libraries: import kudu from kudu.client import Partitioning All Python Kudu applications need a client object defined by establishing a connec‐ tion with a Kudu server. The Kudu server simply needs a list of Kudu master servers and for you to specify the port number in the API, as shown here: client = kudu.connect(host=kuduMaster, port=7051) From there, we will define a table by using the Builder pattern derived from the Kudu schema_builder API call. The first thing we do with the builder is define our col‐ umns. We give a few examples of what can be defined in these columns in a moment. Notable elements include the capability to provide column names and types, default values, and nullability, but even more interesting is on a per-column level specifying the compression, encoding, and block size values: builder = kudu.schema_builder()

builder.add_column('lastname').type('string').default('doe').compression('snappy').encoding('plain').nullable(False)
builder.add_column('firstname').type('string').default('jane').compression('zlib').encoding('plain').nullable(False).block_size(20971520)
You can also define a new column by specifying parameters directly into the add_column method, as opposed to the dot notation used in the previous example:
builder.add_column('ts_val', type_=kudu.unixtime_micros, nullable=False,
                   compression='lz4')

After we define the columns, we define the primary key columns and actually call the build function to get back a kudu.schema object:
builder.set_primary_keys(['lastname', 'state_prov', 'key'])
schema = builder.build()
As we think about partitioning strategies, in this example we look at defining a hash-partitioned column together with a set of range-partitioned columns. Hash partitions need to know how many buckets to partition the data into. Meanwhile, range partitions actually need to have the ranges defined. By default, the lower bound of the range is inclusive, whereas the upper bound is exclusive:
partitioning = Partitioning().add_hash_partitions(
    column_names=['state_prov'], num_buckets=3, seed=13)

partitioning.set_range_partition_columns('lastname')
partitioning.add_range_partition(['A'], ['E'])
partitioning.add_range_partition(['E'], ['Z'], upper_bound_type='inclusive')
We are now all set to create our table, passing in the table name, schema, partitioning scheme, and the number of replicas for our tablets:
client.create_table(tableName, schema, partitioning, 1)
To insert rows into the table, we use the client object again to get a handle for our table and for the given session with which we'll be writing. There are certain properties available on our session, such as timeout values, flush strategies, and more:
table = client.table(tableName)
session = client.new_session()
session.set_flush_mode(kudu.FLUSH_MANUAL)
session.set_timeout_ms(3000)
As we prepare our insert values for the table, we can issue the apply method on the session for the operation. However, depending on the flush mode, nothing might actually be triggered on the Kudu server. Because we decided on a manual flush mode strategy in this case, we can keep applying numerous inserts and really have them all done in one shot:
op = table.new_insert({'lastname' : 'Smith', 'state_prov': 'ON',
                       'firstname' : 'Mike', 'key' : 1,
                       'ts_val' : datetime.utcnow()})
session.apply(op)
By specifying a set of key–value pairs representing the columns and values we are specifying for the insert, we have applied the row to our session, but it has yet to be flushed out. We can perform the preceding operation several times, and if a specific column is not specified in the list of key–value pairs, it will automatically be given the

default value or null. If null is not allowed and no default is specified, you must specify the column; otherwise, the insert will fail altogether. When we're finally ready to have our content flushed, we can call the flush method directly on the session:
session.flush()
This is where exceptions will potentially be thrown and errors can be handled as appropriate:
except kudu.KuduBadStatus as e:
    (errorResult, overflowed) = session.get_pending_errors()
    print("Insert row failed: {} (more pending errors? {})".format(
        errorResult, overflowed))
Similar to how the earlier insert operations were performed by calling new_insert, there are a number of other operations you can do, each time specifying the key followed by the values, each in key–value pairs. Here is the full set of operations:
Insert
Using the new_insert API call
Upsert
Using the new_upsert API call; inserts the record if it does not exist, otherwise updates the row with the new values provided
Update
Using the new_update API call
Delete
Using the new_delete API call
Hence, the set of rows you'd like to insert, upsert, update, or delete essentially requires having a list of the keys of the rows you're modifying. For delete operations, there is no need to specify values of other columns that are not part of the key. Meanwhile, rows that are being updated will update the entire row with the values supplied. For example, assume the row you're updating has the following:

Then, update the row as so: table.new_update ({'keycol':'id1', 'col1':5})

You update the entire row with the values you specified. Because no value was specified for col2 or col3, you end up with null values in those fields. The final state of the row would now look like this:
keycol  col1  col2  col3
id1     5     -     -
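For completeness, reading rows back with the Python client is done through a scanner obtained from the table handle used earlier. This sketch assumes the scanner calls shown in the official client examples (scanner(), add_predicate, open, and read_all_tuples); the predicate narrows the scan to a single key value:
scanner = table.scanner()
# A predicate on the key column restricts the scan to the row(s) of interest.
scanner.add_predicate(table['key'] == 1)
scanner.open()
for row in scanner.read_all_tuples():
    print(row)
A keyed read like this is also the natural first step when you want to preserve existing column values across an update, which is discussed next.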

If the intent is to preserve the values that were there prior to your update, you need to first perform a fetch of the row with that ID, and with the row returned, use it to per‐ form the update, by simply modifying the one column. Java Java is a first-class citizen in the Hadoop ecosystem, and it is no wonder that it is likely to be used when interacting with Kudu. In the example application we walk through, we first talk a little about setting up the necessary dependencies. There are two schools of thought around building Java JAR files in the sense of whether they should be uber Java ARchives (JARs) or non-uber JARs. The full code example is available in our GitHub repository. Uber JARs have the advantage of being completely self-contained, regardless of the environment you’re running in, and they are typically implemented using shading to ensure that libraries bundled in your application do not conflict with any libraries found in the environment in which you’re running. Non-uber JARs have the benefit of using the libraries that are already in your envi‐ ronment; are extremely lean because they contain only the code belonging to your application; and, as long as you make API references that belong in the distribution of Kudu you’re running, your application will continually be upgraded as Kudu is upgra‐ ded. In our examples, we demonstrate using non-uber JARs because they are easily shipped around to various clusters, quick and easy to compile, and are more easily integrated into your existing environment. See the example pom.xml file we provide for details, but we begin by setting up a few key dependencies of which you should be aware. The set of plug-ins we employ in our standard Java application for Kudu include the following:

• maven-compiler-plugin
• maven-shade-plugin

• maven-surefire-plugin

These plug-ins allow us to compile our code, shade libraries if we need to, and include unit testing for our code. Next, we just need to include access to our Kudu client. Currently, the Kudu client libraries are available in two ways: the libraries supplied by Apache and those supplied by Cloudera. If you are running Kudu in an environment with Cloudera's distribution of Hadoop, create dependencies specific to the version of CDH you are running:
<dependency>
  <groupId>org.apache.kudu</groupId>
  <artifactId>kudu-client</artifactId>
  <version>1.5.0-cdh5.13.1</version>
  <scope>provided</scope>
</dependency>
You can see the other dependencies for logging and unit testing in the full Project Object Model (POM) file we provide in the code repository for the book. Java Application We start our Java application by importing the following set of key classes:
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.*;
Next, with a comma-separated list of master server hosts, we get a client connection:
KuduClient client = new KuduClient.KuduClientBuilder(KUDU_MASTER).build();
The first thing we will do in our example is to define a set of columns using the ColumnSchemaBuilder API calls:
List<ColumnSchema> columns = new ArrayList<>(2);
columns.add(new ColumnSchema.ColumnSchemaBuilder("key", Type.INT32)
    .key(true)
    .build());
columns.add(new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING)
    .build());
So we have two columns, a key column and a value column. We can now create our schema object out of the columns defined:
Schema schema = new Schema(columns);
For our range-partitioned column, we define a list of the partition range keys we're interested in:

List<String> rangeKeys = new ArrayList<>();
rangeKeys.add("key");
At this point, we have the information we need to create our table; namely, the table name, schema, and range partition information. Let's create the table:
client.createTable(tableName, schema,
    new CreateTableOptions().setRangePartitionColumns(rangeKeys));
After we've created the table, we set ourselves up for inserts. Begin by opening the table from your client connection. Next, create a new session to work with, where the session accepts values such as timeout settings and flush strategies:
KuduTable table = client.openTable(tableName);
KuduSession session = client.newSession();
With our session in place, we can set up objects for inserts. Java has a newInsert() API call:
Insert insert = table.newInsert();
The general idea is that after we have an Insert object, we fetch a row from it, which of course at this stage will be an empty row. All it contains is a schema associated with that row, and an API that we can use to populate the row with the values we're interested in:
PartialRow row = insert.getRow();
row.addInt(0, i);
row.addString(1, "value " + i);
session.apply(insert);
Depending on your session object properties, the row might or might not be flushed to the Kudu servers. You can call the flush function on the session object to trigger a flush, or leave it to be done automatically according to the policy specified. When querying a table in Kudu using the Java API, we build up a specific scanner on the table. Using the builder pattern even for the scanner, we also want to give it a set of columns we would like to project (i.e., the columns typically specified in the select clause of a SQL query):
// Set the columns we'd like to project
List<String> projectColumns = new ArrayList<>(1);
projectColumns.add("value");
// Build the scanner on the table
KuduScanner scanner = client.newScannerBuilder(table)
    .setProjectedColumnNames(projectColumns)
    .build();
At this stage, we have a scanner; as long as it has more rows, we can loop through it and continually get back a row result iterator that we can scan through:

104 | Chapter 5: Common Developer Tasks for Kudu while (scanner.hasMoreRows()) { RowResultIterator results = scanner.nextRows(); while (results.hasNext()) { RowResult result = results.next(); System.out.println(result.getString(0)); } } You also can call this same Java API from Scala applications as usual, although in the context of big data, we usually turn to Spark to process the data at scale. Running your Java application would then look similar to the following: java -cp "kudu-java-client-1.0.jar:/opt/cloudera/parcels/CDH/jars/ \ kudu-client-1.5.0-cdh5.13.1.jar: \ /opt/cloudera/parcels/CDH/jars/kudu-client-tools-1.5.0-cdh5.13.1.jar" \ -DkuduMaster=mladen-secure-kudu-8,mladen-secure-kudu-9,mladen-secure-kudu-10 \ org.apache.gettingstartedkudu.examples.KuduJavaExample The key component of this command is that we’re setting the classpath to specify the .jar file compiled by our application, namely, kudu-java-client-1.0.jar in this case, followed by a colon-delimited list of the following JARs:

• kudu-client
• kudu-client-tools

In this example, we point to the Cloudera distribution of where the Kudu libraries are installed. But similarly for a standalone installation of Kudu, you would find those same libraries and ensure that they are in your classpath. The -D flag of specifying master servers is specific to our application example, then we follow that with the main class, which would then be followed by any arguments as required. Spark Kudu has Spark-specific libraries available for developers to use with integration directly with Spark’s Dataset and Dataframe frameworks. It can also hook into the Resilient Distributed Dataset (RDD) framework directly, if desired. The full code snippet is available in our code repository. You can write your Spark applications in Java or Scala for Kudu integration, though at the time of writing, PySpark is not available. When Spark tasks are scanning Kudu tables, you can schedule tasks to read from both leader as well as nonleader replicas. This is a positive performance gain given that scheduled tasks can be more easily scheduled to read data locally (as opposed to being able to read only from leader replicas, which was the case prior to Kudu v1.7).

Spark | 105 We use maven as the example to compile our Spark application, using the following three plug-ins in our pom.xml file to get us what we need:

• maven-compiler-plugin
• maven-shade-plugin
• maven-scala-plugin

The compiler and shade plug-ins are typically for any Java application, whereas we add the Scala plug-in specifically to compile our Scala code. Kudu can run on both Spark 1.6 as well as Spark 2, though in our examples, we focus on Spark 2. With Spark 2, it is also mandatory to use the libraries that are specifically using the 2.11 version of Scala, as opposed to the older 2.10. See the example pom.xml file for details, but the following dependencies are required to run SparkSQL in par‐ ticular with Kudu. kudu-client The same client we used for the Java API. kudu-spark2_2.11 The Spark2 bindings we have for Spark on Kudu. The rest of the dependencies are typical for Spark applications and include: scala-library Libraries for Scala code development spark-core_2.11 Spark core libraries that are mandatory spark-sql_2.11 Spark SQL-specific libraries spark-hive_2.11 Spark Hive libraries, for accessing and working with the Hive metastore The Kudu-specific imports that we require in our Spark application include the fol‐ lowing: import org.apache.kudu.spark.kudu._ import org.apache.kudu.client._ From there, we begin our Spark application as usual by setting up a Spark session: val spark = SparkSession.builder.appName("Spark on Kudu Getting Started") .enableHiveSupport().getOrCreate() Spark on Kudu is different than the other APIs that typically just jump into obtaining your client connection through some sort of Kudu client object. In Spark, we instead

106 | Chapter 5: Common Developer Tasks for Kudu set up a KuduContext object that will handle clients and sessions under the covers for us at scale across the cluster. The KuduContext takes as input a comma-delimited list of master servers with port numbers (port numbers not required if you’re using the default) and a reference to the Spark context: val kuduMasters = Seq(master1, master2, master3).mkString(",")

// Create an instance of a KuduContext val kuduContext = new KuduContext(kuduMasters, spark.sparkContext) Next, we store a set of Kudu options that are reusable across multiple operations as a Map[String, String] containing key–value pairs that can pass parameters down to different operations. We begin by specifying the Kudu table and masters that we’re working with: val kuduOptions: Map[String, String] = Map( "kudu.table" -> kuduTableName, "kudu.master" -> kuduMasters) The Kudu Context gives us the ability to easily check for the existence of a table and allows us to delete tables, as well: if (kuduContext.tableExists(kuduTableName)) { kuduContext.deleteTable(kuduTableName) } Creating a table begins by defining a schema. We define a schema just like we would for a regular Spark DataFrame. We specify a StructType followed by a set of fields of type StructField that takes as input the column name, type, and nullability flags: val kuduTableSchema = StructType( // column name type nullable StructField("name", StringType , false) :: StructField("age" , IntegerType, true ) :: StructField("city", StringType , true ) :: Nil) Next we need a list of column names that will make up the primary key. In our exam‐ ple, we have just a single column for the primary key called name: val kuduPrimaryKey = Seq("name") Partition information and number of replicas are then specified in a Kudu Table Options object: val kuduTableOptions = new CreateTableOptions() kuduTableOptions. setRangePartitionColumns(List("name").asJava). setNumReplicas(3)

Spark | 107 What’s important to note about the CreateTableOptions call is that the range parti‐ tion column list needs to be a Java-based list, not a Scala-based list. Hence, the requirement to include the .asJava call to our list of range-partitioned columns. Now we’re ready to actually create our table, making a call through the Kudu Context API again: kuduContext.createTable( // Table name, schema, primary key and options kuduTableName, kuduTableSchema, kuduPrimaryKey, kuduTableOptions) The Kudu Context allows more operations directly through it where we supply the Kudu table name together with the DataFrame we would like to write out or perform an operation with. Here are some sample API calls on the Kudu Context object:

• kuduContext.insertRows
• kuduContext.deleteRows
• kuduContext.upsertRows
• kuduContext.updateRows

For the insert, upsert, and update API calls, a DataFrame with the keys and values need to be specified together with the table you’re operating on. However, for delete, you need to have only a DataFrame of keys. Now, most Spark application and integration patterns do not require an extra object like the Kudu Context object in order to perform Spark applications. Instead the inte‐ gration is within the DataFrame objects you already have in your Spark applications. It is no different with Kudu where, by specifying the Kudu options we mentioned ear‐ lier (the Map[String, String] set of options), we can make calls such as the follow‐ ing: customersAppendDF.write.options(kuduOptions).mode("append").kudu In this very succinct, elegant way, we have specified a DataFrame, instructed it to write, using the Kudu options (that give us the master servers for the Kudu cluster and the Kudu table name), and instruct it to insert (append) rows into this Kudu table. We can also register our tables in Spark through Spark’s read operation: spark.read.options(kuduOptions).kudu.registerTempTable(kuduTableName) By doing this, we can now insert directly into this table using Spark SQL syntax: spark.sql(s"INSERT INTO TABLE $kuduTableName SELECT * FROM source_table") The simplest way to read in the Kudu table and to obtain a DataFrame object to work with from our Kudu table is the following:

108 | Chapter 5: Common Developer Tasks for Kudu spark.read.options(kuduOptions).kudu Finally, Kudu also supports predicate pushdown for predicates in the where clause, such as in the following statement: val customerNameAgeDF = spark. sql(s"""SELECT name, age FROM $kuduTableName WHERE age >= 30""") Spark uses data locality information from the Kudu master servers and will try to set work for specific data reads locally by Spark tasks that run on a given node. This improves overall read-in performance. Impala with Kudu Accessing Kudu through client APIs is certainly doable and recommended for your applications built on top of Kudu. Using frameworks such as Spark and Impala, on the other hand, not only allow you to process data in Kudu using a programmable, scalable framework like Spark, but can interact with Kudu using SQL syntax that takes advantage of the power of the massively parallel processing (MPP) capabilities of Apache Impala and SparkSQL. Kudu integrates well with SQL frameworks due to its strongly typed fields and mostly rigid column structure that easily map to SQL engines. It has a concept of a table; however, this table is not the same table as an Impala table. For example, Impala tables are defined on top of Kudu tables, meaning that the table name provided to Impala is not necessarily the table name of the Kudu table. For Impala, Kudu is another storage layer that it incorporates, similar to S3 or HDFS stor‐ age layers. Further, Impala has Serializer/Deserializer (SerDes) and record input for‐ mats that it can handle when the data is in S3 or HDFS. However, when it comes to Kudu, it interacts with Kudu as a separate storage layer that’s capable of doing fast lookups but in particular, fast scans of large amounts of structured data. This allows for aggregate or analytic queries to be driven through Impala quickly and easily. For example, an Impala table on top of Kudu might be defined as follows: CREATE EXTERNAL TABLE `impala_kudu_table` STORED AS KUDU TBLPROPERTIES( 'kudu.table_name' = 'my_kudu_table', 'kudu.master_addresses' = 'master1:7051,master2:7051,master3:7051'); From this definition, we can see that the Kudu table name is completely separate from the Impala table name. Next, we can also see that there’s no need for us to define columns and data types for the columns, as those definitions are retrieved directly from the underlying Kudu table itself. Finally, the master addresses representing the Kudu cluster that are storing the Kudu table are specified. Hence, in a given Impala schema, you might have tables that actually exist in different Kudu clusters. That is

Impala with Kudu | 109 not ideal, of course, because you lose data locality between the Impala daemons and the Kudu tablet servers. However, you have that kind of flexibility in this model. Impala also provides predicate pushdown in queries you perform, where predicates are evaluated in the storage layer, Kudu, instead of having records flow back to Impala to then be applied. This can result in fantastic performance improvements. You also can use Impala as a data movement utility, as you look to transfer data from one storage layer to the next. For example, having a table defined in HDFS can easily be used as the source table to read from while you insert the contents of your query into a Kudu table. Joining datasets between Kudu, HDFS, and other storage engines is in fact encour‐ aged, and execution plans will be built according to the storage layer being accessed for that portion of the query plan.

CHAPTER 6 Table and Schema Design

In this chapter, we cover schema design in Kudu with the goal of explaining the basic concepts and primitives to make your project successful. An ideal schema would result in read and write operations spreading evenly across the cluster and also result in the minimum amount of data being processed during query evaluation. It’s our belief that by understanding the basics described in this chapter, you will be closer to building an ideal schema and thus be on the pathway to success. The Kudu project itself has fantastic schema design documentation, so even though there is some overlap, we will also focus on topics of particular importance and pro‐ vide additional background. In any data storage system, schema design is extremely important and the cause of many headaches and showstoppers. Poor schema design in relational databases can cause issues ranging from intensive resource consumption to data corruption. HBase and Cassandra require extensive knowledge of how the data will be accessed prior to designing a schema, and a deficiency here is the most common cause of project blockers due to slow query performance—almost always due to intensive resource consumption. In Kudu, schema design is as important, but Kudu provides some fea‐ tures these other systems don’t provide to make a larger range of use cases possible. Schema Design Basics This section provides basics of Kudu schema design for readers who have not read the official schema design documentation:

• Tables require a primary key made up of at least one column.
• Only primary keys are indexed (as of Kudu 1.6).
• You cannot update the primary key.

• Only primary keys can be used for table partitioning, either by hash or range.
• Most tables will be partitioned by hash, including time-series use cases that will also have a range partition. Note other use cases can use range partitioning.
• Each column has a type, encoding, and compression. Encoding has decent defaults, and the default compression is none.

Table 6-1 shows the possible encodings for each type.

Table 6-1. Column types
Column type | Encoding | Default
Boolean | Plain, Run Length | Run Length
8-bit signed integer | Plain, BitShuffle, Run Length | BitShuffle
16-bit signed integer | Plain, BitShuffle, Run Length | BitShuffle
32-bit signed integer | Plain, BitShuffle, Run Length | BitShuffle
64-bit signed integer | Plain, BitShuffle, Run Length | BitShuffle
UNIXTIME_MICROS (64-bit microseconds since the Unix epoch) | Plain, BitShuffle, Run Length | BitShuffle
Single-precision (32-bit) IEEE-754 floating-point number | Plain, BitShuffle | BitShuffle
Double-precision (64-bit) IEEE-754 floating-point number | Plain, BitShuffle | BitShuffle
UTF-8 encoded string (up to 64 KB uncompressed) | Plain, Prefix, Dictionary | Dictionary
Binary (up to 64 KB uncompressed) | Plain, Prefix, Dictionary | Dictionary
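Encodings and compression are chosen per column when the table is defined. As a brief, hedged sketch using the Python client from Chapter 5 (the column names are hypothetical, type names are passed as strings, and anything not specified keeps the defaults from Table 6-1):
import kudu

builder = kudu.schema_builder()
# A UUID-style string key: values are unique, so the dictionary default buys little;
# fall back to plain encoding plus a generic compressor instead.
builder.add_column('id').type('string').nullable(False).encoding('plain').compression('lz4')
# These columns keep the defaults from Table 6-1 simply by not specifying anything.
builder.add_column('country').type('string')
builder.add_column('reading').type('int64')
builder.set_primary_keys(['id'])
schema = builder.build()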

• Engineers and data scientists want access to perform analytics across devices. Among other things, this allows the researchers to understand the impact of a defective part or to train a machine learning model. • Customer support wants access to the latest data for a particular device so that they can troubleshoot the device for customers in real time. • Management wants reports on aggregated data across devices.

• Customers want to access their data in the form of real-time dashboards and reports. Importantly, the customer could be shown slightly old data, but internet users are quickly being trained to expect all their data to be available in real time.

In the past, a highly complex system would be built to handle these use cases. We'd either build a Lambda architecture system or perform what is known as an online transaction processing (OLTP)/OLAP split. Both of these approaches split the data into two different systems, and because of this, there are many problems and complexities. Lambda Architecture Lambda architecture splits the data and workload into speed and batch layers. We use the speed layer for lookup queries and for analytics on the latest data, and we use the batch layer for analytics on the historical data. Two issues in particular arise from this:

• The many moving parts result in two code bases and extremely complex failure handling and operations. Few people, if any, like operating a Lambda architecture.
• As longtime implementors of systems on immutable stores such as the Hadoop Distributed File System (HDFS) and Kafka, we can tell you that restatement is normal even in use cases where data is never supposed to be updated.

OLTP/OLAP Split In this system the latest data is stored in an OLTP relational database and then batch copied to an OLAP system for analytics later. Although this is the traditional method of performing analytics, it has many problems, some of which are described here:

• Analytical use cases don't get access to the latest data. This has real-world impact when an analytical question needs to be asked on the latest data, such as, "Did our software update cause this problem we are seeing?" Furthermore, management, like others, is becoming trained to expect data to be available in real time.
• Relational databases have a real scalability limit that often results in per-device data having a limited lifespan. This conflicts with users' expectations that they'll be able to see "their data" forever.
• IT must manage two systems and a complex data copy process.

Kudu allows us to address both of these use cases with a single system. By using a primary key that includes both the timestamp and device identifier we can efficiently query per-device data for customers and customer support. In addition, because Kudu was built for analytics, we can easily perform full scans on large subsets of the data for engineers, data scientists, and management.
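As a sketch of what such a schema might look like with the Python client from Chapter 5, the device identifier is hashed to spread writes across tablets while the timestamp gives time-bounded scans a range dimension to prune on. The table name, column names, bucket count, and replica count below are hypothetical choices, not recommendations:
import kudu
from kudu.client import Partitioning

client = kudu.connect(host='kudu-master-1', port=7051)

builder = kudu.schema_builder()
builder.add_column('device_id').type('string').nullable(False)
builder.add_column('ts', type_=kudu.unixtime_micros, nullable=False)
builder.add_column('speed').type('double')
builder.add_column('engine_temp').type('double')
builder.set_primary_keys(['device_id', 'ts'])
schema = builder.build()

# Hash on the device identifier to spread load; range on the timestamp so that
# time-bounded analytical scans touch only the relevant partitions.
partitioning = Partitioning().add_hash_partitions(column_names=['device_id'], num_buckets=16)
partitioning.set_range_partition_columns('ts')

client.create_table('connected_car_readings', schema, partitioning, 3)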

Schema for Hybrid Transactional/Analytical Processing | 113 This use case greatly benefits from using solid-state drives (SSDs) for storage because the per-device data will typically retrieve a large subset of the rows and result in a sig‐ nificant amount of disk seeks. However, longer term there is a design that could pre‐ sumably reduce the benefit of SSD. As discussed earlier, fractured mirrors would result in one or two nodes in a quorum storing data as a row format, whereas the remainder would store data in a columnar format. OLAP-style workloads can target the columnar nodes, whereas OLTP workloads can target the row stores. Primary Key and Column Design The most important decision when deciding on your table’s schema is your primary keys. You can use only primary keys as part of the partitioning scheme, but you can‐ not update them. Other columns can be updated via an UPDATE or UPSERT option. (We discuss partitioning later.) A Kudu table must have one or more columns defined as part of the primary key. Those columns are non-nullable, immutable, cannot be floating point or Boolean, and must define a unique row. However, perhaps even more important, because Kudu lacks secondary indexes (as of version 1.6), scans that lack predicates for all primary keys will be full table scans. For analytical queries this is acceptable, but it is not acceptable for queries you expect to perform like short lookups.
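To make that concrete before we turn to the StreamSets example, the sketch below reuses the hypothetical connected_car_readings table from the previous section and supplies a predicate for each primary key column, which is what keeps the lookup from degenerating into a full table scan:
from datetime import datetime

table = client.table('connected_car_readings')

scanner = table.scanner()
# Predicates on the primary key columns; a scan with no primary key predicates
# would have to read the entire table.
scanner.add_predicate(table['device_id'] == 'device-0042')
scanner.add_predicate(table['ts'] == datetime(2018, 6, 1, 12, 0))
scanner.open()
rows = scanner.read_all_tuples()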

In the following example, the word “continuous” is distinct from “batch.”

Let’s look at an example. We wrote the first revision of the Kudu lookup processor for StreamSets. StreamSets is a continuous ingest tool that allows you to perform record- wise transformations during the ingest process. A lookup processor allows developers to add additional fields to a record in a StreamSets data flow. Because StreamSets is a continuous ingestion process, performing a large amount of work on each record causes extremely long delays on ingest. As such, when developing the processor it’s extremely important that the incoming record have values associated with each pri‐ mary key so that the lookup completes by utilizing the primary key index and thus is fast enough to be performed on single records in a continuous manner. Given this, in StreamSets if an incoming record does not have all the required primary key values, it’s sent to the error path. Figure 6-1 shows the StreamSet lookup processor in action.

114 | Chapter 6: Table and Schema Design Figure 6-1. StreamSets Kudu lookup processor Other Column Schema Considerations For all columns, a big consideration is encoding and compression. These two trans‐ formations both reduce the size of data, but by different orders of magnitude and with different trade-offs. By this point in the book you are aware that Kudu is a col‐ umnar data store. Columnar data stores have a much wider array of encodings avail‐ able to them because values of the same type are stored in a sequence. Compression applies a generic algorithm to data to reduce the data size. Both mechanisms are used in columnar data stores, and some encodings include a compression algorithm. In this section, we discuss a few encoding types in detail to help you understand how encoding differs from compression. We then discuss practice use cases for each encoding type. Kudu currently supports plain, run-length, BitShuffle, dictionary, and prefix encoding: Plain encoding Data is encoded in its “natural” format. For example a 32-bit integer is encoded as a fixed-width 32 bits. Run-length encoding Consecutive repeated values (runs) are stored as the value and the count. This type of encoding is effective when the column contains many repeated values when sorted by the primary key. BitShuffle encoding BitShuffle combines a transformation with the generic compression algorithm LZ4. Values are rearranged to store the most significant bit of every value, fol‐ lowed by the second most significant bit, and so on. Afterwards, the result is LZ4 compressed. This type of encoding is effective when the column contains repeated values or values that change slightly when sorted by primary key. Dictionary encoding Data is encoded by building a dictionary of unique values. Data is encoded by storing the index into the dictionary. This type of encoding is effective when a column contains a small number of unique values. If the column stores a unique value like a UUID, Kudu will fall back to plain.

Primary Key and Column Design | 115 Prefix encoding Common prefixes are compressed in consecutive values. This type of encoding is effective for values that have a common prefix when stored by the primary key. For each column, you need to either choose an encoding or take the default for the type. The default encodes are quite good; however, for strings you might consider changing the default. The default for string assumes strings of low cardinality. If you have unique strings with no common prefix—UUIDs, for example—you might con‐ sider plain encoding and LZ4 compression. If the strings have a common prefix, con‐ sider Prefix encoding and LZ4 compression. Note that if the column in question is the first column in the primary key, prefix encoding is likely the best bet because rows are sorted by primary key within tablets. For example, we created three tables with a single string primary key and populated it with UUIDs using Python’s uuid.uuid4() function, which doesn’t have a common prefix. In the table, the first column was plain encoded LZ4 compressed, the second dictionary encoded LZ4 compressed, and the third prefix-encoded LZ4 compressed. The first two ended up being 102 MB, and the third 70 MB in size. The first two are the same because UUIDs are unique and thus dictionary encoding had no impact other than consuming CPU in its attempt at dictionary encoding. Because the table has a single column and rows are sorted by primary key within tablets, the prefix- encoded table performs the best. As a second test, we then created the same three tables but added an integer, as the first column, to the primary key. The integer was plain encoded and uncompressed so it didn’t vary across test tables. Plain encoding and dictionary encoding resulted in 180 MB of data, whereas prefix encoding resulted in 178 MB or about 1% better. To test CPU overhead of the encoding, we modified src/kudu/cfile/encoding-test.cc in Kudu to encode millions of random strings, and it appears there is about a 5% CPU encoding overhead for prefix encoding. It’s difficult to say which would be more impactful to performance of a given use case, but given the trends of computer archi‐ tecture, growing I/O bandwidth, and relatively limited growth in CPU performance, we would not prefix encode UUIDs unless it’s the first column in a primary key or the UUID is a version with a common prefix. In a moment, we help you visualize two encodings: run length and BitShuffle. Note, however, that these are not the implementation of these encodings Kudu uses, but vastly simplified versions for demonstration purposes only. Let’s discuss run-length encoding to understand how this differs from compression. First, let’s assume that we have a list of 100 Boolean values. The first 50 are true, the following 49 are false, and the last value is true. If we use a 0 to represent false and 1 to represent true, we could encode the values in as one bit per value in binary as 111… 000…01, which consumes 100 bits of storage. Alternatively, we could run-length encode and store a value and a count. For example, let’s assume that we store runs in

eight bits (one byte). The first bit represents the truth value, true or false, and the remaining seven bits are the number of consecutive true or false values. In seven bits, the maximum range we can store is 0 to 2⁷ − 1, or 0 to 127. Using this encoding to store the previous sequence of Booleans, we'd use one byte for the first run of trues, one byte for the run of falses, and another byte for the final true. In total, we'd consume 24 bits, or 3 bytes. Compare that against the original 100 bits consumed by just storing true or false as one bit each. Figure 6-2 shows a visual representation of run-length encoding.

Figure 6-2. Run-length encoding

Now let's discuss BitShuffle. An interesting side note is that BitShuffle was published in 2015 and was developed with the goal of compressing data from the Canadian Hydrogen Intensity Mapping Experiment. It's great to see science benefiting the broader public. BitShuffle basically rearranges the bits in a given set of data so that it compresses better. The observation is that the numbers in a given column are often highly correlated; for example, a monotonically increasing number. Given this, we can rearrange the bits so that the similar parts are together and the differences are at the end. For example, suppose that we have three unsigned one-byte numbers with the values 1, 2, and 5. In binary representation, these would be 00000001, 00000010, and 00000101, stored in a binary stream as 000000010000001000000101. Now let's transpose the bits so that the most significant bit of each value comes first, the second most significant bit second, and so forth; we'd have 000000000000000001010101. This would obviously compress much better, and indeed, in this tiny example, BitShuffle improves the LZ4 compression by 50%. Of course, on read we need to reverse the transposition of the bits. Figure 6-3 shows a visual representation of BitShuffle encoding.

Figure 6-3. BitShuffle encoding

Which is better? As always, it depends. However, run length was commonly used for integers; thus many are likely wondering why Kudu's default for integers is BitShuffle, not run length. We believe the reason is that BitShuffle is a better general-purpose encoding. In the example that follows, we encoded two sets of numbers: one is a monotonically increasing set from 1 to 32,768, and the other is a set of positive random numbers less than 1,000,000. Both types are common in databases. For example, a primary key is often a monotonically increasing integer, which real run-length encoding algorithms can handle efficiently. Relatively random positive numbers are common in foreign keys for dimension tables. As you can see, for our monotonically increasing data, run length outperforms BitShuffle by a factor of three. However, they both perform quite well, compressing 32 KB of data to less than 1 KB. In the random small positive use case, the story is much different. BitShuffle gives us about 1.6× compression, whereas run length actually increases the size of the data—we verified that the increase in size is not due to LZ4. The source code for this exercise is in the book's code repository under chapter6, in the file BitShuffleRunLengthComparison.java. Table 6-2 summarizes the test's output.

==== Monotonically Increasing
Plain Size: 32768
Plain LZ4 Compressed Size: 32875
Plain Zlib Compressed Size: 16513
Plain Snappy Compressed Size: 32791
BitShuffled LZ4 Compressed Size: 993
RunLength Encoded LZ4 Compressed Size: 297
==== Random Small Positive
Plain Size: 32768
Plain LZ4 Compressed Size: 32898
Plain Zlib Compressed Size: 25887
Plain Snappy Compressed Size: 32794
BitShuffled LZ4 Compressed Size: 20662
RunLength Encoded LZ4 Compressed Size: 41172

Table 6-2. Summary of compression and encoding data

Data type                  Encoding    Compression  Size   Percent of raw
Monotonically Increasing   Plain       None         32768  100%
Monotonically Increasing   Plain       LZ4          32875  100%
Monotonically Increasing   Plain       Zlib         16513  50%
Monotonically Increasing   Plain       Snappy       32791  100%
Monotonically Increasing   BitShuffle  LZ4          993    3%
Monotonically Increasing   Run Length  LZ4          297    1%
Random Small Positive      Plain       None         32768  100%
Random Small Positive      Plain       LZ4          32898  100%
Random Small Positive      Plain       Zlib         25887  79%
Random Small Positive      Plain       Snappy       32794  100%
Random Small Positive      BitShuffle  LZ4          20662  63%
Random Small Positive      Run Length  LZ4          41172  126%

Hopefully, you now see how encodings differ from compression. Compression can be applied to any kind of data in a generic fashion, and depending on the dataset, it generally has less impact than encodings. In Kudu, except in the case of BitShuffle, which itself uses LZ4, compression is applied after the data has been encoded. Note that we do not recommend using another layer of compression on top of BitShuffle; however, at the time of this writing, Kudu will let you define a BitShuffle-encoded column that is also compressed. The compression formats supported by Kudu are LZ4, Snappy, and zlib. Unless there are specific reasons to do otherwise, we stick with LZ4. It achieves compression similar to Snappy but is faster and thus consumes less CPU. Zlib consumes the most CPU, and you should avoid it. Generally speaking, BitShuffle is a good default encoding mechanism. Figure 6-4 shows Kudu allowing us to enable compression on top of BitShuffle.
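To make these encoding and compression choices concrete, here is a sketch of how they can be declared per column when creating a Kudu table through Impala. It assumes the newer Impala DDL for Kudu (Impala 2.8 or later) with ENCODING and COMPRESSION column attributes; the table and column names are purely illustrative.

CREATE TABLE web_events (
  event_id STRING ENCODING PLAIN_ENCODING COMPRESSION LZ4,  -- unique UUIDs: plain + LZ4
  url STRING ENCODING PREFIX_ENCODING COMPRESSION LZ4,      -- shared prefixes: prefix + LZ4
  country STRING ENCODING DICT_ENCODING COMPRESSION LZ4,    -- low cardinality: dictionary
  bytes_sent BIGINT ENCODING BIT_SHUFFLE,                   -- BitShuffle already includes LZ4
  PRIMARY KEY (event_id)
)
PARTITION BY HASH (event_id) PARTITIONS 4
STORED AS KUDU;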

Figure 6-4. Kudu BitShuffle Snappy compressed column

Partitioning Basics

Data partitioning is the mechanism used to partition data into nonoverlapping independent sets. This is done for performance, data availability, and load-balancing reasons. There are many such mechanisms; for example, round-robin partitioning

separates data into n partitions by simply rotating which partition is to receive data on each assignment. For example, with three partitions (0, 1, 2) and five elements (a, b, c, e, f), data will be assigned as follows: (0 = a, e; 1 = b, f; 2 = c). The strong disadvantage of this partitioning scheme is that it's not possible to answer the question, "Which partition is c assigned to?" without consulting each and every partition. That is, it's not possible to "exclude" partitions from your search for c.

Range Partitioning

Range partitioning is fairly straightforward. For example, assume that we are trying to partition the integers 0 through 9 into two partitions. We could assign 0, inclusive, to 5, exclusive, represented in interval notation as [0,5), to partition 0. Partition 1 would then store [5,10). This scheme works on noninteger data, such as bytes, as well.

Hash Partitioning

Hash partitioning involves choosing a column to hash on and a number of buckets across which to distribute the data. Kudu then uses a hash function to decide in which bucket a given row should reside. Let's assume that we have a table with a single integer primary key that is hash partitioned and we've chosen four hash buckets (b0, b1, b2, b3). Now, let's assume that we have four rows with primary keys of (1, 12, 54, 99). Further, let's assume that we have a function that, given an integer, simply returns the integer. In math, this is known as the identity function, which is a trivial hash function but a valid one. Let's see how these rows distribute:

Key 1:  identity(1)  % 4 = 1 => Bucket b1
Key 12: identity(12) % 4 = 0 => Bucket b0
Key 54: identity(54) % 4 = 2 => Bucket b2
Key 99: identity(99) % 4 = 3 => Bucket b3

Kudu actually uses a hash function known as Murmur2, which is known for being fast and high quality with few collisions. Readers with HBase experience know about the issue of hotspotting, in which new primary keys are created toward the end of the key range. For example, in HBase, if your key is a timestamp and you store it as a string such as 20170910…, all new inserts will be sent to a single HBase Region that exists on a single server. To work around this, users must prefix or "salt" the keys with something like a hash bucket. The same thing could happen in Kudu, but Kudu makes the workaround for this issue vastly easier. You can combine partition schemes to work around this in Kudu. You simply need to add another key to your primary key and hash partition on that column. For example, in a time-series use case, you can define both the timestamp and some other

column as primary keys, range partition on the timestamp, and hash partition on the other column. No manual salting required!

Schema Alteration

Kudu supports fast schema alteration, meaning the alterations it supports won't cause the table to become locked for minutes or hours. This is because the data is not rewritten at modification time, but later, during compaction. Currently, Kudu (as of version 1.6) supports the following modifications (a few of these are sketched in Impala SQL after the list):

• Rename the table
• Rename primary key columns
• Rename, add, or drop nonprimary key columns
• Add and drop range partitions
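As a rough sketch of what these alterations look like when issued from Impala, assuming a hypothetical Kudu table t that is range partitioned on a string date column, the operations map to simple ALTER TABLE statements:

-- Rename the table as seen by Impala
ALTER TABLE t RENAME TO t_renamed;

-- Rename a nonprimary key column (the type must be restated and must not change)
ALTER TABLE t_renamed CHANGE note note_text STRING;

-- Add and drop a nonprimary key column
ALTER TABLE t_renamed ADD COLUMNS (source_system STRING);
ALTER TABLE t_renamed DROP COLUMN source_system;

-- Add and drop range partitions
ALTER TABLE t_renamed ADD RANGE PARTITION '2018-07-01' <= VALUES < '2018-08-01';
ALTER TABLE t_renamed DROP RANGE PARTITION '2017-07-01' <= VALUES < '2017-08-01';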

If you'd like more detail about how schema changes are implemented, the design document is highly readable.

There are also a few things you should know when addressing Kudu from Impala. First, know that Hive/Impala have the concept of internal and external tables. For internal tables, Hive/Impala manages the storage, and for external tables it does not. This means that if you create a Hive/Impala internal table, the corresponding Kudu table is created for you, and when you drop the table in Hive/Impala, the corresponding Kudu table is dropped as well. You can move Hive/Impala Kudu tables between internal and external by modifying the table property EXTERNAL, which is a string Boolean (TRUE/FALSE). Changing the name of a Hive/Impala Kudu table does not change the underlying Kudu table, in case other clients are accessing it; you can change the underlying Kudu table name via the kudu.table_name table property. More details are available on the Kudu Impala integration page.
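For example, both the internal/external handoff and the underlying Kudu table name are controlled through table properties; a sketch against a hypothetical table my_table:

-- Make the Impala table external, so dropping it in Impala leaves the Kudu table in place
ALTER TABLE my_table SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');

-- Rename (or repoint) the underlying Kudu table without changing the Impala-visible name
ALTER TABLE my_table SET TBLPROPERTIES ('kudu.table_name' = 'my_kudu_table_v2');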

Best Practices and Tips

There are several best practices and tips/tricks for designing schemas in Kudu that we cover next. It's important to note that because Kudu is in its infancy, these best practices will change and grow as the community learns how best to utilize the technology in production.

Partitioning

Most tables will have at least one column that is hash partitioned and possibly another column that is range partitioned. It's rare for tables to have only range partitions. The reason is that the hash partition avoids hotspotting for many tables that are range partitioned. The canonical example of this is a time-series dataset for which new records are generated toward one end of the range.

Large Objects

Binary/string columns are limited to 64 KB uncompressed, meaning that the size check occurs before Kudu performs any compression. Although it's possible to increase this limit with an unsafe configuration flag, the limit is in place today because Kudu has not been tested with objects larger than 64 KB. As such, if you're storing larger binary/string objects, you can compress them before storing them in Kudu, set the column encoding to plain, and leave column compression off. We've found that 64 KB of JSON, XML, or text compresses significantly with Gzip, such that this limit is rarely hit. If the objects are much larger than 64 KB, store the object itself in HBase or HDFS and then store a foreign key in Kudu.

decimal

decimal was added in version 1.7. It is a numeric data type with fixed scale and precision. You can use it in primary keys (whereas types like float and double cannot be), and it is available for use in C++, Java, Python, Impala, and Spark. This is a particularly useful data type for financial or other calculations for which the rounding behavior of float and double is impractical or even incorrect. Keep in mind that Kudu clients at version 1.6 or earlier cannot use tables that contain decimal columns at all. Also, creating tables with decimals prevents downgrading to earlier releases of Kudu.

If you are using a release that predates decimal, float and double are often used instead, but keep in mind that they are not exact types and as such are not suitable for financial transactions. We store these values as strings and then cast them to decimal in Impala or SparkSQL (we sketch this pattern a bit later in this section). There is a performance hit because predicate evaluation for these columns will not be pushed down to Kudu, but it's working well for us at dozens of customers.

Unique Strings

If a table has a single string primary key, prefix encoding performs best. Otherwise, for this type, we use plain encoding and LZ4 compression.

Compression

BitShuffle columns are automatically LZ4 compressed. We turn on LZ4 compression for all other columns. LZ4 is generally faster than Snappy according to Percona's exhaustive tests.
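Returning to the decimal workaround described above: here is a sketch of the store-as-string, cast-on-read pattern in Impala, with hypothetical table and column names. On Kudu 1.7 or later (with an Impala version that supports DECIMAL on Kudu), you would simply declare the column as DECIMAL instead.

-- Pre-1.7 pattern: store the exact value as a string, cast when querying
SELECT order_id,
       CAST(amount_str AS DECIMAL(18,4)) AS amount
FROM orders
WHERE order_id = 42;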

Object Names

Table names need to be unique across Kudu. If you create tables via Impala, Impala will prefix the table name with "impala::database.table". Because the uniqueness of the database.table combination is enforced by Impala, you won't run into trouble. Make table and column names lowercase. This avoids confusion when users query a table in Impala (not case sensitive) but write to it via the API (case sensitive).

Number of Columns

Remember that in Kudu you should keep the number of columns under 300. If you have source systems with more than this, split the source table into multiple Kudu tables of up to 300 columns each while pulling the data into Kudu. Keep the primary key in each of them so that you can merge them back together with a view.

Binary Types

Keep in mind that Impala does not have a binary type. If you need to store a binary value in a Kudu table, first ask yourself, "What am I going to use this for, and does it really need to be in Kudu?" If you end up needing it, create the column as a string type in Impala, Base64-encode the binary data, and then insert it. Be careful of the 64 KB limit. In the near term, it might not make sense to have the binary in Kudu.

Network Packet Example

NetFlow is a data format that reflects the IP statistics of all network interfaces interacting with a network router or switch. NetFlow records can be generated and collected in near real time for the purposes of cybersecurity, network quality of service, and capacity planning. For network and cybersecurity analysts interested in this data, being able to have fast, up-to-the-second insights can mean faster threat detection and higher-quality network service. The following example shows how combining hash and range partitioning can reduce the amount of data read for any query with a time-range predicate, thus improving read performance, and also improve write performance by spreading writes across servers:

CREATE TABLE netflow (
  id string,
  packet_timestamp string,
  srcaddr string,
  dstas string,
  dstaddr_s string,
  dstport int32,
  dstaddr string,
  srcaddr_s string,
  tcp_flags string,
  dPkts string,
  tos string,
  engineid string,
  enginetype string,
  srcas string,
  packetid string,
  nexthop_s string,
  samplingmode string,
  dst_mask string,
  snmponput string,
  length string,
  flowseq string,
  samplingint string,
  readerId string,
  snmpinput string,
  src_mask string,
  version string,
  nexthop string,
  uptime string,
  dOctets string,
  sender string,
  proto string,
  srcport int32)
DISTRIBUTE BY HASH (id) INTO 4 buckets,
RANGE (packet_timestamp) SPLIT ROWS (
  ('2015-05-01'),
  ('2015-05-02'),
  ('2015-05-03'),
  ('2015-05-05')
);

The netflow table is hash partitioned by the id field, which is a unique key, and this should result in the rows being uniformly distributed among buckets and thus cluster nodes. Hash partitioning gives us high throughput for writes because (provided there are enough buckets) all nodes will contain a hash partition. Hash partitioning also provides read parallelism when scanning across many id values because all nodes that contain a hash partition will participate in the scan.

The table has also been range partitioned by time so that queries scanning only a specific time slice can exclude tablets not containing relevant data. This should increase cluster parallelism for large scans (across days) while limiting overhead for small scans (a single day). Range partitioning also ensures that partition growth is not unbounded and that queries don't slow down as the volume of data stored in the table grows—because we would be querying only certain portions of the data, and data is distributed across nodes by hash and range partitions.
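For instance, a query bounded to a single day only needs to read the tablets whose range partition covers that day, while still fanning out across the four hash buckets; a sketch:

-- Touches only the range partition covering 2015-05-02
SELECT dstaddr, count(*) AS flows
FROM netflow
WHERE packet_timestamp >= '2015-05-02'
  AND packet_timestamp < '2015-05-03'
GROUP BY dstaddr
ORDER BY flows DESC
LIMIT 10;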

The table creation statement above creates 16 tablets: it first creates four hash buckets partitioned by the id field, and then four range-partitioned tablets for each hash bucket. When writing data to Kudu, a given insert will first be hash partitioned by the id field and then range partitioned by the packet_timestamp field. The result is that writes will spread out across four tablets (and thus servers). Meanwhile, read operations, if bounded to a single day, will query only the tablets containing data for the given day. This is important because, without much effort, we are able to scale out writes and also bound the amount of data read on time-series reads.

Although we didn't discuss it here, it's also interesting to remember that in addition to the analytical and ingest benefits of this partitioning scheme, similar to the IoT use case described earlier, we can also efficiently look up individual events, because the packet id is in the primary key.

Conclusion

As with all databases, schema design can make or break a high-performance use case. Kudu, as a distributed database, has some unique properties that you need to take into account when you design your schema. Furthermore, as a state-of-the-art database, it offers new tuning options such as new column encodings and compression algorithms. After reading this chapter you should have a better understanding of the most important aspects of Kudu schema design.


CHAPTER 7 Kudu Use Cases

Real-Time Internet of Things Analytics You wake up in the morning, brush your teeth, grab some coffee, walk to the garage, get in the car, drive to work. Although this is your usual routine, today something feels off. You try to ignore it, but it keeps nagging you. It feels like you’ve forgotten something. Your focus wanders from the radio DJ, to your plans for the weekend, to all the things you need to get done at work. But the nagging suspicion keeps coming back. What did I forget? The nagging suspicion is interrupted by an alert on your phone. It’s the garage door app on your smartphone! It’s obvious now: you forgot to close the garage. It’s your Home Alone moment. Even though you didn’t forget your kid at home while on an overseas vacation, you did forget to close the garage door. Fortunately, it’s the internet age and your garage door is connected to a smartphone app. So, you pull over and, from your smartphone, you close your garage with the push of a button. All while 20 miles from home. Just a few years ago, such situations meant driving all the way home, calling a neigh‐ bor, or spending the day with the nagging thought that someone is currently plunder‐ ing your wife’s new road bike and your son’s golf clubs. But today, the crisis has been averted. You have data, analytics, and the Hadoop Ecosystem to thank. Now, let’s change gears and put ourselves in the shoes of the manufacturer of the garage door. At your manufacturing plant in the Midwest, you’ve been building the world’s best garage doors for decades. You can identify and hire the brightest minds in mechanical

engineering to build actuators, motors, gears, and safety equipment—all things a world-class manufacturer needs. You had not anticipated that you'd be building data products, but here we are. Driven by changing consumer expectations and increased competition, data and analytics are mission-critical capabilities driving differentiation, efficiency, and sales. Your data infrastructure must support the following:

A new consumer-facing smartphone app
Prevents the aforementioned Home Alone moments.

Improved customer support
The ability to remotely see the current state of a customer's device and to remotely diagnose problems.

Improving quality
Analysis across millions of garage doors deployed in the field. How are they used in practice? How can we identify problems with parts or usage?

Consumer marketing
By understanding a specific customer's usage over time, how can we better market new, relevant products to them?

As a manufacturer, this is your first foray into building a data product, and you want a win. Having a simple, scalable, easy-to-maintain data infrastructure is paramount to the success of the project.

There are 100 million garage doors in the United States, and you own about 70% of the market. The garage doors relay information every 30 seconds. That's about 73 trillion rows per year. So, you have a scale problem and want to design this solution using cost-effective, scalable big data technology. The data includes a device identifier, report timestamp, status of the device, and various other information like fault codes and internal temperature, all of which you can see in Table 7-1.

Table 7-1. Column types

Device ID  Report time  Status  Fault code  Motor status  Travel mod  Heat sensor
583134     1492924860   0       0           11            0           22
583134     1492924890   0       0           11            0           22
583134     1492924920   1       0           11            0           23

Just a few years ago, you would have designed this system with two different storage layers. Rather than requiring a Lambda architecture, Apache Kudu allows the separate "speed" and analytical storage layers to be collapsed into a single datastore (Figure 7-1).

Figure 7-1. Diagram of Kudu-based Internet of Things architecture

As a prudent engineering-focused manufacturer, you know that one system will be easier than two. Kudu will make your data infrastructure faster to develop, easier and lower cost to maintain, and more adaptable for future use cases. When designing our schema and partitioning strategy, we are going to consider the read and write workload, as discussed in Chapter 6. The workload of this application includes the following characteristics:

• High volume of reads and writes.
• All queries contain a time predicate; for example, show the most common fault code in the past three months, or scan all devices for the past 20 minutes to determine which garage doors have been left open.
• Many queries contain a short scan with a device identifier—customer service representatives and customers will commonly target the device identifier in their queries.

Because our workloads use the time and device as a predicate in most queries, our primary keys will be the report time and device ID. Because we have high throughput writes, our partitioning strategy includes a hash partition on the device identifier field. Assuming enough partitions, this ensures that our writes are spread among many tablet servers. For our read workload, we want to avoid the overhead of short scans contacting too many tablet servers. For example, when customer service representatives or a user wants to know the current status of the device, a query might scan the last hour of data for a specific device ID, or about 120 rows. By range partitioning on the report time, rows with sequential timestamps are grouped together, resulting in performant, real-time lookups for specific users with short time ranges. Range partitioning also allows us to “age-off” older partitions and solves the issue of unbounded partitions.
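A sketch of what such a table might look like when created through Impala, using the newer PARTITION BY syntax for Kudu. The column names, epoch-second timestamps, bucket count, and yearly range splits are illustrative assumptions rather than a prescribed design; new range partitions can later be added (and old ones dropped to age off data) with ALTER TABLE ... ADD/DROP RANGE PARTITION.

CREATE TABLE garage_door_telemetry (
  report_time BIGINT,   -- epoch seconds, e.g., 1492924860
  device_id BIGINT,
  status TINYINT,
  fault_code INT,
  motor_status INT,
  travel_mode INT,
  heat_sensor INT,
  PRIMARY KEY (report_time, device_id)
)
PARTITION BY HASH (device_id) PARTITIONS 16,
RANGE (report_time) (
  PARTITION 1483228800 <= VALUES < 1514764800,  -- calendar year 2017
  PARTITION 1514764800 <= VALUES < 1546300800   -- calendar year 2018
)
STORED AS KUDU;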

Looking back on the goals of our data infrastructure, we have a single data repository that can support the broad needs of our user-facing customer service, quality, and consumer marketing applications.

Predictive Modeling

Let's go on a journey and pretend that we are going to start an online retailer from scratch. As an online retailer in today's market, we must provide relevant recommendations to our shoppers. We won't go into detail on how to build a recommender; many virtual and real trees have been felled in this endeavor. However, it's important to note that if you are building such a recommender today, you are very likely to be using the Hadoop platform to help do so, as shown in Table 7-2.

Table 7-2. Column types

User ID  Item ID    Score
123      342352342  0.23434343
123      123128933  0.3439292
124      957472924  0.62329292

The output of such a model is going to have three columns: User ID, Item ID, and Score. The User ID represents the user, the Item ID represents the item in the store, and the Score represents some measure of the likelihood of the given user liking the recommendation.

Then, after you have a score for every User ID and Item ID pair, you can begin making recommendations. That means being able to retrieve the highest-scored items for a given user quickly. To give you some idea of scale, let's calculate the size of the dataset for Amazon.com. It has at present 480 million items and 304 million active users. This means that we'd have 145,920,000,000,000,000, or nearly 146 quadrillion, rows. If through columnar compression we reduced each row to 4 bytes, that would be 583 petabytes of data. Now, this view is simplified; there are many techniques to reduce the size of this data, but it is true to say that recommender systems are classic big data systems.

Now that we understand the basics of the problem, how should we design it? First, we begin with a batch system outside of the online retrieval side, which is always required. In this system, we need a batch storage layer to store transaction and page-view logs and a batch execution layer to calculate the model. We use the Hadoop Distributed File System (HDFS) or Amazon Web Services Simple Storage Service (Amazon S3) for storage and then Spark for batch execution. After we train the model, we need to publish it somewhere so that it can be accessed in real time. Traditionally, we'd write the model to HDFS/S3 and then export it to a relational database. This is a

well-known approach, but it's complex to test and operate. Failing database export jobs are well-known happy-hour wreckers. This system is shown in Figure 7-2.

Figure 7-2. Diagram of recommender architecture 1.0

The first improvement to this model would be to eliminate the complex orchestration of exporting to another system. Because Kudu supports fast retrievals by primary key, we can eliminate the relational database and export to Kudu. This eliminates moving parts from our pipeline, which tends to lead to stability, as shown in Figure 7-3.

Figure 7-3. Diagram of recommender architecture 2.0
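A minimal sketch of what the published model might look like as a Kudu table queried through Impala; the table name, types, and bucket count are illustrative assumptions:

CREATE TABLE recommendations (
  user_id BIGINT,
  item_id BIGINT,
  score DOUBLE,
  PRIMARY KEY (user_id, item_id)
)
PARTITION BY HASH (user_id) PARTITIONS 32
STORED AS KUDU;

-- Serve a recommendation request: the highest-scored items for one user
SELECT item_id, score
FROM recommendations
WHERE user_id = 123
ORDER BY score DESC
LIMIT 10;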

Although the first modification improved the data operations team’s lives, the second improvement is going to focus on improving the business results and also manage‐ ment’s view of how innovative our team is. At this point in time, we are still adding new data to the system only when we recalculate the scores, typically once each day or once per week. However, years ago, researchers became interested in possibly updat‐ ing these batch-calculated results in a streaming or real-time context. As such, it’s now possible to update the model in real time.

To implement this approach, we also need a system for storing streaming data; Kafka fits the bill. A happy side effect of this new architecture is that we can store all transaction and page-view history in Kudu and provide this service to the website as well. This eliminates the need for the website team to maintain its own historical store, as shown in Figure 7-4.

Figure 7-4. Diagram of recommender architecture 3.0
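Because Kudu supports updates in place, the streaming job can publish refreshed scores as simple upserts rather than bulk reloads. A sketch in Impala SQL against the hypothetical recommendations table above (the same operation is available through the Kudu client APIs):

-- Refresh a single (user, item) score as new events arrive
UPSERT INTO recommendations (user_id, item_id, score)
VALUES (123, 342352342, 0.31);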

By using Kudu, we've been able to reduce complexity and improve our business by reducing data latency.

Mixed Platforms Solution

At this point in your reading, you probably understand the benefits of a Kudu-based design: fast, easy, open, structured, and so on. In this section, we give a high-level description of a use case in which we can use Kudu in conjunction with another storage engine.

One of the big benefits of Kudu is that because tables are structured and have a primary key, data is always ordered. This is not the case for a storage system like Parquet files on HDFS. When data arrives fast and unordered, and analytics need to be performed on it, Kudu is by far the best-suited platform. However, when data is already ordered, and the retention policy makes the dataset grow to many terabytes, offloading Kudu data into another format might be helpful.

This use case applies to many different verticals; for example, when a large amount of data is coming from many different sensors or pieces of equipment and must be aggregated, ordered, and stored as fast as possible, to be available right away for analytics. You can imagine a utility company whose grid equipment is sending

sensor information, but you can also imagine a telecommunications company whose antennas are sending data, or an oil company for which all the distribution points are sending real-time consumption information, and so forth. Different verticals, same concept.

Because data is coming from different sensors on different sites, it might arrive at the platform unsorted. In a legacy design, we would store all of the data in a staging area, wait for all of the sites to have sent everything, and then sort the entire dataset before inserting it into tables and partitions. But because Kudu stores data ordered by the key, there is no need to stage the data anywhere. We can insert all of the information into the system as soon as it arrives. Not only that, but we can retrieve it without any delay.

Even though the table and key design will be very similar to what we have seen in the previous sections, the difference here resides in the application design. Indeed, because Kudu cannot yet scale as far as HDFS can today, storing years of historical data requires offloading "old" Kudu data into another format. The idea behind the detailed design that we'll look at in a moment is to benefit from multiple tools' strengths. Here, we are going to use Kudu for its incredibly fast real-time analytic support, and HDFS for its almost endless scalability. Figure 7-5 shows how data flows between the two platforms.

Figure 7-5. Multiplatform ingest

Most analytic requests are run against recent data. It is not common to run frequent requests against data that is many months or years old; frequent requests usually target the last day's, last week's, last few weeks', or at most last month's data. Because Kudu is the fastest option for analytics, and because most of the requests are run over the last month's data, we are going to keep only the last month's data in Kudu. Beyond 31 days, once every 24 hours, the oldest day's data will be extracted from Kudu and exported to HDFS into an Impala Parquet table partition. So, the next challenge is the way we query the data. Because some of it is stored in Kudu and some is stored in HDFS, we need to find a way to provide a

unified view to the query engine, where recent data is retrieved from one side and "cold" data from the other. We should be able to access both engines simultaneously when a query covering both recent and old data is run. Here, we will take advantage of the view feature of Impala. Because our data is partitioned by date, and because the two platforms store very specific ranges of dates, all of the tables we create will be partitioned by this criterion. So we will create one table on top of the Kudu data, partitioned by date, and one table on top of the HDFS data, again partitioned by date. Now, to federate the two datasets, we just need to create a view on top of them! All requests to the dataset need to be performed only against that view. When only recent data is queried, only Kudu will be involved. However, when analytics are run over a longer period, the two platforms will participate, transparently to the user. Figure 7-6 shows how the tables use the different storage platforms and how the view federates them for the end user.

Figure 7-6. Multiplatform tables and view
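A sketch of such a federating view, assuming a Kudu table holding the most recent 31 days and a date-partitioned Parquet table on HDFS holding the older history; the table and column names are illustrative:

CREATE VIEW sensor_readings AS
  SELECT sensor_id, reading_time, reading_value
  FROM sensor_readings_kudu        -- last 31 days, stored in Kudu
UNION ALL
  SELECT sensor_id, reading_time, reading_value
  FROM sensor_readings_parquet;    -- older history, Parquet on HDFS

Provided both underlying tables are partitioned by date, a query against the view with a recent time predicate prunes away the HDFS partitions and is effectively served by Kudu alone.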

Such a design has multiple benefits:

• It allows very fast and efficient access to the recent (and most heavily used) data.
• The implementation is very simple. It requires only an ingestion pipeline—from Kafka to Kudu—and an offload Extract, Transform, and Load (ETL) job—from Kudu to HDFS. We can do the offload ETL with a very simple SQL statement, as sketched after this list.
• It allows very flexible scalability. As Kudu's scalability increases, more data can be kept in the Kudu cluster, and when the retention policy changes, it's very simple to adjust on the HDFS side.
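A sketch of that daily offload statement, reusing the hypothetical names above and assuming the Parquet table is partitioned by a reading_date column and that reading_time is stored as a string timestamp, as in the NetFlow example:

-- Move the day that just aged past 31 days from Kudu to the Parquet history table
INSERT INTO sensor_readings_parquet PARTITION (reading_date = '2018-06-01')
  SELECT sensor_id, reading_time, reading_value
  FROM sensor_readings_kudu
  WHERE reading_time >= '2018-06-01' AND reading_time < '2018-06-02';

-- Then drop that day's range partition from the Kudu table (bounds must match the existing partition)
ALTER TABLE sensor_readings_kudu
  DROP RANGE PARTITION '2018-06-01' <= VALUES < '2018-06-02';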

Index | 139 About the Authors Jean-Marc Spaggiari, an early adopter of Kudu, works as a principal solutions archi‐ tect for Cloudera to support Hadoop, Kudu, HBase, and other tools through technical support and consulting work. His deep knowledge of HBase and HDFS allows him to better understand Kudu and its applications. Jean-Marc’s primary role is to support HBase users over their HBase cluster deploy‐ ments, upgrades, configuration, and optimization, as well as to support them regard‐ ing HBase-related application development. He is also a very active HBase community member, testing every release from performance and stability stand‐ points. However, with Kudu being geared to quickly penetrate the market, he will also begin recommending, building demo applications, and deploying proof of concepts around it. Prior to Cloudera, Jean-Marc worked as a project manager and as a solutions archi‐ tect for CGI and insurance companies. He has almost 20 years of Java development experience. In addition to regularly attending Strata+Hadoop World and HBaseCon, he has spoken at various Hadoop User Group meetings and many conferences in North America, usually focusing on HBase-related presentations and demonstrations. Jean-Marc is also the author of Architecting HBase Applications (O’Reilly, 2016). Mladen Kovacevic comes from a development background in RDBMS technology and sees Kudu as a game changer in the Hadoop ecosystem. He has presented Kudu at several local meetups, presented on the state of Spark on Kudu during its beta while providing feedback early enough to ensure Spark with Kudu is a first-class citi‐ zen at its launch. He is a contributor to Apache Kudu and Kite SDK projects and works as a solutions architect at Cloudera. Mladen’s experience includes years of RDBMS engine development, systems optimization, performance, and architecture, including optimizing Hadoop on the Power 8 platform while developing IBM’s Big SQL technology. Brock Noland followed Kudu months before the first line of code was written, by fol‐ lowing Todd Lipcon’s paper reading habits. Brock is chief architect of phData, a pure- play Hadoop Managed Service Provider. Prior to founding phData, Brock spent four years at Cloudera as a trainer, solution architect, engineer, sales engineer, and engi‐ neering manager. Brock is a cofounder of Apache Sentry and Apache Project Com‐ mittee Member on , Parquet, Crunch, Flume, and Incubator. Brock was a mentor to Kudu in the incubator and currently mentors Apache Impala (incubating). In addition, he is a member of the Apache Software Foundation. Brock is a frequent public speaker, having appeared at dozens of conferences includ‐ ing HBaseCon, numerous Hadoop User Groups, and other conferences. Ryan Bosshart is the CEO of phData and a former principal systems engineer at Cloudera. Ryan has spent the last 10 years building and designing distributed sys‐ tems. At Cloudera, Ryan led the field storage specialization team where he focused on Apache HDFS, HBase, and Kudu. He has worked with many early users of Kudu to build their relational, time-series, IoT, or real-time architectures. He has seen first- hand Kudu’s ability to improve performance and simplify architectures. Ryan is a co- chair of the Twin Cities Spark and Hadoop User Group and the author of the training video “Getting Started with Kudu” (O’Reilly). Colophon The animal on the cover of Getting Started with Kudu is the greater kudu (Tragela‐ phus strepsiceros). 
One of the largest species of antelope, the greater kudu is found in woodlands, hills, and mountains across eastern and southern Africa. The frame of the greater kudu is narrow and long-legged. Its coloring is dark gray or brown with white vertical stripes running along its sides and a white chevron extend‐ ing between the eyes. The bull has thicker fur than the cow, including a short mane that grows along the front of its neck, and has spiraling horns that extend upward to a single point after coiling two and a half times. While the males are solitary, female kudus typically travel in groups of 6 to 10, though larger groups have been reported. Mating occurs at the end of the rainy sea‐ son, and calving occurs eight months later. Mothers hide their young in tall grass and other foliage during their first month of life, but calves mature quickly and are inde‐ pendent at the age of six months. In the past decade, the global population of greater kudu has thinned slightly due to hunting, disease, and habitat loss, though some areas previously too dry for kudus to inhabit are now viable thanks to wells and irrigation systems. The unique horns of the kudu are prized by poachers and are frequently used to create Shofars (Jewish rit‐ ual horns that are blown during Rosh Hashanah). Many of the animals on O’Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com. The cover image is from Meyers Kleines Lexicon. The cover fonts are URW Type‐ writer and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.