Architecting Hbase Applications a GUIDEBOOK for SUCCESSFUL DEVELOPMENT and DESIGN
Total Page:16
File Type:pdf, Size:1020Kb
Architecting HBase Applications A GUIDEBOOK FOR SUCCESSFUL DEVELOPMENT AND DESIGN Jean-Marc Spaggiari & Kevin O'Dell www.allitebooks.com www.allitebooks.com Architecting HBase Applications A Guidebook for Successful Development and Design Jean-Marc Spaggiari and Kevin O’Dell Beijing Boston Farnham Sebastopol Tokyo www.allitebooks.com Architecting HBase Applications by Jean-Marc Spaggiari and Kevin O’Dell Copyright © 2016 Jean-Marc Spaggiari and Kevin O’Dell. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/ institutional sales department: 800-998-9938 or [email protected]. Editor: Marie Beaugureau Indexer: WordCo Indexing Services, Inc. Production Editor: Nicholas Adams Interior Designer: David Futato Copyeditor: Jasmine Kwityn Cover Designer: Karen Montgomery Proofreader: Amanda Kersey Illustrator: Rebecca Demarest August 2016: First Edition Revision History for the First Edition 2016-07-14: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781491915813 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Architecting HBase Applications, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-491-91581-3 [LSI] www.allitebooks.com To my father. I wish you could have seen it... —Jean-Marc Spaggiari To my mother, who I think about every day; my father, who has always been there for me; and my beautiful wife Melanie and daughter Scotland, for putting up with all my com‐ plaining and the extra long hours. —Kevin O’Dell www.allitebooks.com www.allitebooks.com Table of Contents Foreword. xi Preface. xiii Part I. Introduction to HBase 1. What Is HBase?. 3 Column-Oriented Versus Row-Oriented 5 Implementation and Use Cases 5 2. HBase Principles. 7 Table Format 7 Table Layout 8 Table Storage 9 Internal Table Operations 15 Compaction 15 Splits (Auto-Sharding) 17 Balancing 19 Dependencies 19 HBase Roles 20 Master Server 21 RegionServer 21 Thrift Server 22 REST Server 22 3. HBase Ecosystem. 25 Monitoring Tools 25 v www.allitebooks.com Cloudera Manager 26 Apache Ambari 28 Hannibal 32 SQL 33 Apache Phoenix 33 Apache Trafodion 33 Splice Machine 34 Honorable Mentions (Kylin, Themis, Tephra, Hive, and Impala) 34 Frameworks 35 OpenTSDB 35 Kite 36 HappyBase 37 AsyncHBase 37 4. HBase Sizing and Tuning Overview. 39 Hardware 40 Storage 40 Networking 41 OS Tuning 42 Hadoop Tuning 43 HBase Tuning 44 Different Workload Tuning 46 5. Environment Setup. 49 System Requirements 50 Operating System 50 Virtual Machine 50 Resources 52 Java 53 HBase Standalone Installation 53 HBase in a VM 56 Local Versus VM 57 Local Mode 57 Virtual Linux Environment 58 QuickStart VM (or Equivalent) 58 Troubleshooting 59 IP/Name Configuration 59 Access to the /tmp Folder 59 Environment Variables 59 Available Memory 60 First Steps 61 Basic Operations 61 vi | Table of Contents www.allitebooks.com Import Code Examples 62 Testing the Examples 66 Pseudodistributed and Fully Distributed 68 Part II. Use Cases 6. Use Case: HBase as a System of Record. 73 Ingest/Pre-Processing 74 Processing/Serving 75 User Experience 79 7. Implementation of an Underlying Storage Engine. 83 Table Design 83 Table Schema 84 Table Parameters 85 Implementation 87 Data conversion 88 Generate Test Data 88 Create Avro Schema 89 Implement MapReduce Transformation 89 HFile Validation 94 Bulk Loading 95 Data Validation 96 Table Size 97 File Content 98 Data Indexing 100 Data Retrieval 104 Going Further 105 8. Use Case: Near Real-Time Event Processing. 107 Ingest/Pre-Processing 110 Near Real-Time Event Processing 111 Processing/Serving 112 9. Implementation of Near Real-Time Event Processing. 115 Application Flow 117 Kafka 117 Flume 118 HBase 118 Lily 120 Solr 120 Table of Contents | vii www.allitebooks.com Implementation 121 Data Generation 121 Kafka 122 Flume 123 Serializer 130 HBase 134 Lily 136 Solr 138 Testing 139 Going Further 140 10. Use Case: HBase as a Master Data Management Tool. 141 Ingest 142 Processing 143 11. Implementation of HBase as a Master Data Management Tool. 147 MapReduce Versus Spark 147 Get Spark Interacting with HBase 148 Run Spark over an HBase Table 148 Calling HBase from Spark 148 Implementing Spark with HBase 149 Spark and HBase: Puts 150 Spark on HBase: Bulk Load 154 Spark Over HBase 156 Going Further 160 12. Use Case: Document Store. 161 Serving 163 Ingest 164 Clean Up 166 13. Implementation of Document Store. 167 MOBs 167 Storage 169 Usage 170 Too Big 170 Consistency 172 Going Further 173 viii | Table of Contents www.allitebooks.com Part III. Troubleshooting 14. Too Many Regions. 177 Consequences 177 Causes 178 Misconfiguration 178 Misoperation 179 Solution 179 Before 0.98 179 Starting with 0.98 185 Prevention 187 Regions Size 187 Key and Table Design 189 15. Too Many Column Families. 191 Consequences 192 Memory 192 Compactions 193 Split 193 Causes, Solution, and Prevention 193 Delete a Column Family 194 Merge a Column Family 194 Separate a Column Family into a New Table 196 16. Hotspotting. 197 Consequences 197 Causes 198 Monotonically Incrementing Keys 198 Poorly Distributed Keys 198 Small Reference Tables 199 Applications Issues 200 Meta Region Hotspotting 200 Prevention and Solution 200 17. Timeouts and Garbage Collection. 203 Consequences 203 Causes 206 Storage Failure 206 Power-Saving Features 206 Network Failure 207 Solutions 207 Prevention 207 Table of Contents | ix Reduce Heap Size 208 Off-Heap BlockCache 208 Using the G1GC Algorithm 209 Configure Swappiness to 0 or 1 210 Disable Environment-Friendly Features 211 Hardware Duplication 211 18. HBCK and Inconsistencies. 213 HBase Filesystem Layout 213 Reading META 214 Reading HBase on HDFS 215 General HBCK Overview 217 Using HBCK 218 Index. 223 x | Table of Contents Foreword Throughout my 25 years in the software industry, I have experienced many disruptive changes: the Internet, the World Wide Web, mainframes, the client–server model, and more. I once worked on a team that was implementing software to make oil refineries safer. Our 40-person team shared a single DEC VAX—a machine no more powerful than the cell phone I use today. I still remember a day in the early 90s when we were scheduled to receive new machines. The machines being replaced were located on the third floor, and were large, heavy, washing machine–sized behemoths. A bunch of us waited inside the “machine room” to see how they would heave the new machines up all those flights of stairs. We imagined a giant crane, the street outside being blocked off…in short, a big operation! But what actually happened was quite different. A man entered the room carrying a small box under his arm. He placed it on top of one of the old “washing machines,” switched some cables around, did some tests, and left. That was it? Wow. Things change! That is the joy of being part of the tech industry: if we are willing to learn new things and move with it, we will never be bored, and will never cease to be amazed. What seemed impossible just a few years ago is suddenly commonplace. Big data is such a change. Big data is everywhere. The revolution started by Google with the Google File System and BigTable has now reached almost every tech com‐ pany, bank, government, and startup. Directly or indirectly, for better or for worse, these systems touch the lives of almost every human being on the planet. Apache HBase, like BigTable before it, has a unique place in this ecosystem: it pro‐ vides an updatable view into essentially unlimited datasets stored in an immutable, distributed filesystem. As such, it bridges the gap between pure file storage and OLTP/OLAP databases. xi HBase is everywhere: Facebook, Apple, Salesforce.com, Adobe, Yahoo!, Bloomberg, Huawei, The Gap, and many other companies use it. Google adopted the HBase API for its public cloud BigTable offering, a testament to the popularity of HBase. Despite its ubiquity, HBase is not plug and play. Distributed systems are hard. Terms such as partition tolerance, consistency, and availability inevitably creep into every dis‐ cussion, soon followed by even more esoteric terms such as hotspotting and salting. Scaling to hundreds or thousands of machines requires painful trade-offs, and these trade-offs make it harder to use these systems optimally. HBase is no exception. In my years in the HBase and Hadoop communities, I have experienced these chal‐ lenges firsthand. Use cases must be designed and architected carefully in order to play to the strengths of HBase. This book, written by two insiders who have been on the ground supporting custom‐ ers, is a much needed guide that details how to architect applications that will work well with HBase and that will scale to thousands of machines. If you are building or are planning to build new applications that need highly scalable and reliable storage, this book is for you.