PRAISE FOR UNDERSTANDING BIG DATA

“This book is useful to anyone who works with data and wants to learn more about scaling. Cory helps you understand what causes to slow down as data volumes grow over time. He then reviews a number of strategies that you have at your disposal to manage the growth, including software and tuning, hardware upgrades, read- replication, and ultimately horizontal partitioning of data.” —Dan Lynn, cofounder of FullContact

“Understanding Big Data Scalability presents the fundamentals of scaling databases from a single node to large clusters. It provides a practical explanation of what Big Data systems are, and the fundamental issues to consider when optimizing for performance and scalability. Cory draws on his many years of database experience to explain the issues involved in working with data sets that can no longer be handled with single monolithic relational databases.

“When transitioning from a traditional relational database deployment, it is tempting to ignore traditional database discipline regarding data modeling and data integrity. Much of this has been motivated by the proliferation of schema-less NoSQL databases. In spite of this trend, Cory shows why it is still important to carefully structure your data to maintain data integrity and allow sharding in such a way as to avoid costly distributed scan/shuffle operations. He discusses a practical approach to this called relational sharding. This is a commonsense method that avoids the pitfalls of black-box sharding. Cory’s approach is particularly relevant now that relational data models are making a comeback via SQL interfaces to popular NoSQL databases and Hadoop distributions.

“Understanding Big Data Scalability addresses practical problems in Big Data processing systems using real-life examples. This book should be especially useful to database practitioners new to the process of scaling a database beyond a traditional single-node deployment.” —Brian O’Krafka, software architect “Software is like magic, or so many people think. In practice, many fundamental principles of computation are constrained by immutable principles derived from the laws of physics (think transistors). One such limitation is easily understood by analyzing the trade-offs between horizontal and vertical scaling. Cory does a great job justifying the inescapable need for distributed systems, while also acknowledging their fundamental pitfalls. Good read for those who are trying to understand databases from first principles, rather than by analogy.” —Ivan Bercovich, senior director of engineering, FindTheBest.com

“My first database project was in 1982. Since then, Moore’s law has expressed itself in just about every corner of technology—except database and database management. I’d argue alongside Cory that a significant part of the database ‘innovations’ we have experienced since then have been mostly an illusion. Can this be true? Really? If there is just a tinge of doubt in your mind, open these pages—Cory wrote this book for you.” —Matthew Rockwell, systems design and architecture consultant

“I work with customers across the spectrum—from small start-ups to large enterprises— and they all have one thing in common: concerns and questions about how to handle database growth as their user base expands. When they have complex database scaling issues, I send them to Cory Isaacson and this book illustrates why. Not only can Cory explain the ‘whys’ behind their issues in clear language with relevant examples and analogies, but he also has the in-depth knowledge to handle the ‘hows’ in addressing those problems.” —Brian Adler, principal cloud architect, RightScale, Inc. UNDERSTANDING BIG DATA SCALABILITY UNDERSTANDING BIG DATA SCALABILITY

BIG DATA SCALABILITY SERIES PART I

Cory Isaacson

Upper Saddle River, NJ • Boston • Indianapolis • San Francisco New York • Toronto • Montreal • London • Munich • Paris • Madrid Capetown • Sydney • Tokyo • Singapore • Mexico City Many of the designations used by manufacturers and sellers to Executive Editor distinguish their products are claimed as trademarks. Where those Bernard Goodwin designations appear in this book, and the publisher was aware of a Development Editor trademark claim, the designations have been printed with initial capital Chris Zahn letters or in all capitals. Managing Editor The author and publisher have taken care in the preparation of this John Fuller book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is Full-Service assumed for incidental or consequential damages in connection with or Production Manager arising out of the use of the information or programs contained herein. Julie B. Nahil For information about buying this title in bulk quantities, or for special Project Editor sales opportunities (which may include electronic versions; custom Anna Popick cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate Copy Editor sales department at [email protected] or (800) 382-3419. Anna Popick For government sales inquiries, please contact Indexer Jack Lewis [email protected]. For questions about sales outside the United States, please contact Proofreader [email protected]. Anna Popick Visit us on the Web: informit.com/ph Editorial Assistant Michelle Housley Copyright © 2015 Pearson Education, Inc. Compositor All rights reserved. This publication is protected by copyright, and Anna Popick permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission to use material from this work, please submit a written request to Pearson Education, Inc., Permissions Department, One Lake Street, Upper Saddle River, New Jersey 07458, or you may fax your request to (201) 236-3290. ISBN-13: 978-0-13-359870-4 ISBN-10: 0-13-359870-5 Second digital release, August 2014 CONTENTS

Preface ...... ix About the Author ...... xii Chapter 1: Introduction ...... 1 What You Will Learn ...... 1 The Challenge of Big Data ...... 2 Today’s Big Data Explosion ...... 3 Managing and Capitalizing on the Current Data Boom ...... 3 Your Role as a Data Architect ...... 5 The Acceleration of Big Data Innovation ...... 5 Background for This Book ...... 6 Why the Focus on Database Sharding? ...... 8 Summary ...... 9 Chapter 2: Why Databases Slow Down ...... 10 The Database Slowdown Curve ...... 10 A Hard-Won Lesson ...... 11 The Root Cause ...... 12 The Lesson Learned ...... 13 The Enemies of Database Performance ...... 14 Enemy #1: Table Scans ...... 14 Enemy #2: Slow Writes ...... 16 Enemy #3: Concurrency Contention ...... 17 Enemy #4: Missing Indexes ...... 20 How to Identify Database Slowdown Issues ...... 21 Summary ...... 23 Chapter 3: What Is Big Data? ...... 24 What Is Big Data Anyhow? ...... 24 A Formal Definition for Big Data ...... 25 A Practical Big Data Definition ...... 27 Sources of Big Data ...... 28 The Advent of the Search Engine ...... 29 The Rise of Social Networks ...... 29 Introducing the Social Network Application ...... 30 The Adoption of the Smartphone and Tablets ...... 31 Traditional Big Data Sources ...... 31 Future Big Data Generators ...... 31 Summary ...... 32 Chapter 4: Big Data in the Real World ...... 33 Some Real-World Examples of Big Data ...... 33 FullContact ...... 34 vi Contents The Big Data Challenge ...... 35 Adopting a Big Data Streaming Approach ...... 35 The Big Data Repository ...... 36 Positioned for Rapid Growth ...... 36 Social Point ...... 36 Scaling from the Start ...... 37 The Current Cluster ...... 37 The Database Cluster Payoff ...... 38 Summary ...... 38 Chapter 5: Scaling Your Application ...... 39 The Goals of a Scalable Application Platform ...... 39 The Excitement of a High-Growth Success ...... 41 Application Scalability Fundamentals ...... 42 Detecting Performance Bottlenecks ...... 42 Scalability and Reliability ...... 44 Scaling Up ...... 45 Scaling Out ...... 46 A Typical Online Application Architecture ...... 46 The Load Balancer Tier ...... 47 The Application Server Tier ...... 49 The Database Tier ...... 50 Analytics Application Architectures ...... 50 Scaling an Analytics Application ...... 53 How to Scale a Traditional Online Application ...... 53 Load Balancer Tier ...... 53 Application Server Tier ...... 54 Database Tier ...... 54 Summary ...... 55 Chapter 6: When to Scale Your Database ...... 56 The Last Mile of Application Scalability ...... 56 How Do You Know When to Scale Your Database? ...... 57 Options for Increasing Database Performance ...... 58 Performance Optimization on Your Monolithic Database ...... 59 Vertical Scaling ...... 59 Read Scaling ...... 62 Implementing a Full Big Data Scalability Environment ...... 63 Indications of the Need for Scale ...... 65 Slow Read Queries ...... 66 Slow Database Writes ...... 67 Summary ...... 68 Chapter 7: All Data Is Relational ...... 69 Relational Data Overview ...... 69 The Meaning of Data ...... 70 Relationships Matter ...... 73 Why Data Modelling Is Critical to Success ...... 74 Summary ...... 76

Contents vii Chapter 8: It’s All About Sharding ...... 77 Sharding: The Ultimate Answer to Database Slowdown ...... 77 The Laws of Databases ...... 78 Sharding Defined ...... 80 Black-Box Sharding ...... 83 Relational Sharding ...... 86 Summary ...... 88 Chapter 9: Scaling Big Data: The Endgame ...... 89 The Game of Big Data Scalability ...... 89 The Ideal Big Data Infrastructure ...... 90 Scaling Big Data Theory ...... 90 The Good, the Bad ...... 91 . . . And the Ugly ...... 94 The Big Data Endgame ...... 95 Data Locality ...... 98 Summary ...... 99 Index ...... 101

viii Contents PREFACE

The Big Data Scalability series is a comprehensive, four-part volume containing information on many facets of database performance and scalability. This book, Understanding Big Data Scalability, is the first book in the series.

There is no doubt that Big Data and scalability are some of the hottest and most important topics in today’s fast-growing applications. Not only is data constantly growing, the rate at which we accumulate data from a variety of sources is accelerating at a dramatic pace.

Understanding Big Data Scalability provides the foundation and important basic concepts for learning how scaling a Big Data infrastructure operates, and covers the most important governing factors for implementing a scalable and dynamic Big Data cluster. Later books in the series will delve into more technical detail, including the various data architectures available, optimization techniques, innovative ways to model and process data, and the types of database management system (DBMS) engines that can be used.

This book (and series) is targeted at application architects, database architects, and those with a need to scale their database tier. It is oriented toward high-volume applications.

While the subject of Big Data analytics is covered to some degree in the context of how it fits with typical online and business applications, this is not a book that concentrates exclusively on Hadoop or MapReduce batch-processing analytics. There are many books on this subject, and rather than writing another possibly redundant text, I decided to focus on the common applications we deal with every day. No doubt, these applications will require analytics capabilities, including MapReduce batch-processing, so where I discuss this approach I do so in the context of how it fits into a typical application infrastructure. In my experience, our “common” applications today are experiencing unprecedented data growth, often from multiple sources. They are driving a need for real-time processing at rates that would have been inconceivable just a short time ago. Therefore, I hope to provide you with a different perspective—one that has not been thoroughly covered by other authors.

Preface ix Understanding Big Data Scalability includes the following chapters:

• Chapter 1, “Introduction”: This initial chapter goes into far more detail on the purpose of the book, the origins of Big Data, and what is to be covered. • Chapter 2, “Why Databases Slow Down”: Simply put, today’s typical monolithic databases frequently experience performance issues. By exploring the causes of this performance degradation, you will understand the basic principles involved that drive the need for Big Data scalability—as well as the foundation for knowing how to properly implement a scalable data architecture. • Chapter 3, “What Is Big Data?”: It is important to establish a good working definition for Big Data and how it affects your own applications. • Chapter 4, “Big Data in the Real World”: In this chapter I cover two actual case studies of common applications that had a need to scale. You will see how these solutions were implemented—information that you can use to formulate your own Big Data scalability strategy. • Chapter 5, “Scaling Your Application”: Any scalability strategy includes scaling all of your application components. That is covered here, including the various tiers of a typical application and what is involved in scaling each. • Chapter 6, “When to Scale Your Database”: Scaling your data tier is an important decision with large consequences in any application environment. It is important to know when you need to scale your database, and what factors drive scalability; these items are covered in this chapter. • Chapter 7, “All Data Is Relational”: A fundamental concept that affects all application data is data relationships. This is true regardless of the type of DBMS engine you are using, and in fact it can be even more important with non- relational databases. Thus I provide a brief overview of the importance of data relationships and how they pervade virtually everything to do with working with data. • Chapter 8, “It’s All About Sharding”: In this chapter you will learn why virtually every database scalability implementation relies on some form of database sharding. You will also learn the various approaches to sharding, and the strengths and weaknesses of each. • Chapter 9, “Scaling Big Data: The Endgame”: In this final chapter of Part I of the series, I cover the most important objectives of a scalable Big Data implementation. These objectives will serve as the most important guidelines for ensuring your architecture and implementation are both successful and high performance.

x Preface There is a companion Web site for the entire series, with supplementary blog posts, forum discussions, and other relevant information related to the Big Data Scalability Series: www.bigdatascalability.com.

I hope you find this book informative and, most of all, useful, generating practical results in your own Big Data implementations. I am always interested in hearing feedback from readers, as well; you can reach me through the Big Data Scalability Series Web site noted above.

Preface xi ABOUT THE AUTHOR

Cory Isaacson is CEO/CTO of CodeFutures Corporation, maker of dbShards, a leading database scalability suite providing a true shared-nothing architecture for relational databases. Cory has authored numerous articles in a variety of publications, including SOA Magazine and Database Trends and Applications. He is a regular columnist for JAXenter magazine, and the author of the book Software Pipelines and SOA (Addison-Wesley, 2009). Cory has more than twenty-five years’ experience with advanced software architectures, and has worked with many of the world’s brightest innovators in the field of high-performance computing. Cory has spoken at hundreds of public events and seminars, and has helped numerous organizations address the real-world challenges of database performance and scalability—effectively applying Big Data technologies to a variety of fields, including social networking, mobile applications, gaming, and high-volume transaction systems. In his prior position as president of Rogue Wave Software, he actively led the company back to a position of profitable growth, culminating in a successful acquisition by a leading private equity firm. Cory can be reached through the Big Data Scalability Series Web site: www.bigdatascalability.com.

xii About the Author INTRODUCTION1

Big Data Scalability is a comprehensive, four-part volume containing information on many facets of database performance and scalability. I have found through years of experience that in order to master any aspect of software, databases, or computer science, the most important thing is a solid foundation and understanding of the fundamental concepts. Once you have this knowledge, you can apply it to solve real-world problems with your own database tier.

This series is primarily focused on how to apply Big Data scalability to common application problems with the wide selection of DBMS engines available today. As such, it is less focused on pure analytics Big Data problems with engines like Apache Hadoop (there are many other books dedicated to that), and more directed toward taking advantage of Big Data scalability with popular DBMS engines used in enterprise and online applications.

WHAT YOU WILL LEARN This part of the series will cover the following critical topics for building a high- performance, scalable, and reliable database tier:

• The primary causes of database performance issues • What Big Data is

Chapter 1: Introduction 1 • Options for scaling your application • When and how to scale your database • The fundamentals of data distribution (database sharding)

THE CHALLENGE OF BIG DATA Have you ever experienced a database performance problem? It’s a silly question, I know. Everyone who has worked with databases for any amount of time has seen performance issues. In fact, it’s so common that database performance is almost an oxymoron. The very nature of database technology is to get slower over time, sometimes dramatically so. Why? Because databases only tend to get bigger through increased usage, and of course with more data the database management system (DBMS) has to do more work to keep up with application demands. Add to the mix an explosive high-transaction volume and you have the perfect recipe for something we all dread: slow database performance. In fact, sometimes it’s downright horrible database performance.

What about the memorable experience of a “database down” on a live production system? This can easily be your worst day on the job. Sometimes, when you get that call, you wish you could just forget it and not even go to work. Not only are big, growing databases hard to keep fast, they are very painful to repair when something bad happens. I’ve seen applications that were down for days due to extended database recovery times, and I’m sure you have heard of some notorious all-too- public examples of this as well. So an all-important part of a Big Data success is, in fact, planning for failure—ensuring you are prepared, with a robust, high- availability (HA) runtime environment and a solid disaster recovery (DR) plan for a worst-case scenario (they do happen).

What can you do about it? That’s what this book is all about: what you can do to make your database tier fast, scalable, and reliable, delivering the performance that your application needs. The focus of this portion of the book is not on any particular DBMS or database technology, but rather the answers to these questions—questions that apply to any database and any DBMS technology:

• What are the primary causes of database slow-down? • How do you keep your database tier reliable and operational (especially in demanding, 24X7 environments)? • Are there short-term, quick fixes you can implement to ease the pain?

2 Chapter 1: Introduction • What are the best ways to scale your database to accommodate extreme growth in transaction volume and data sizes? • How do you pick the best database technology (often technologies) for your application needs?

The answers to these questions and more are covered throughout the text. The aim is to arm you with the important concepts and facts common to all DBMS types, so that you can make the best decisions possible in managing your own data explosion, and come out on top.

Please note that in subsequent parts of the series I will provide a tour and overview of many common and leading-edge DBMS engines, both conceptually by category as well as specific solutions you can use in your Big Data arsenal. With this information, combined with the fundamentals covered in Part I, my hope is to assist you in preparing for your own Big Data adventure—commensurate with much success and reward.

TODAY’S BIG DATA EXPLOSION Next let’s look at the drivers and scope of Big Data.

MANAGING AND CAPITALIZING ON THE CURRENT DATA BOOM We are living in the midst of a data explosion, a true boom in databases and database technology, the likes of which the world has never seen. Since you are reading this book, I assume you have an interest in database performance and scalability, the same as I do. Figure 1.1 illustrates the Big Data explosion by the current data boom, and how critical it is for us to be able to extract meaning from all of this data.

Figure 1.1 The Big Data explosion

What do I mean by a data boom? Given that information is often the most valuable commodity in today’s tech-centric world, this means that we as data professionals

Chapter 1: Introduction 3 hold the keys to the kingdom. Not only is data getting more massive each and every year, the rate at which data is being generated is accelerating at a super-linear rate. It’s tough enough to figure out how to capture and manage all that data, but more importantly we need to identify the right set of architectural decisions, tools, and capabilities to allow our organizations to capitalize on that data. In other words, giving meaning to raw data is just as important as the collection, reliability, and management of Big Data environments.

The causes of the explosion of database data are not hard to find:

• The advent of the Internet and the World Wide Web has generated exponential growth in the global user community—users with ever-expanding access to computing power and bandwidth. • The interaction of these users with Internet applications has resulted in unprecedented levels of data and transaction volumes. • The shift to online advertising supported by the likes of Google, Yahoo, and others is a key driver in the data boom we are seeing today. • The overall expansion of the worldwide economy has spurred massive data growth for traditional commerce (e.g., increased airline travel, international purchases, online products, etc.). • The core social networks (e.g., Facebook, Twitter, LinkedIn, and now Google+), by their very nature, have generated massive new ways for people to communicate and interact, resulting in correspondingly large data sets and transaction volume. • Many specialized social networks have also arisen—everything from match- making sites to special interest groups, and even “buy-sell” applications that have generated their own micro-economies. • An entirely new breed of social network applications has been spawned, leveraging the inter-connection of social network users in fascinating ways, driving exponential growth in application volume, again with huge transaction volumes and data sizes (sometimes virtually overnight success stories). • Web- and advertising-analytics applications abound, crawling and analyzing virtually every aspect of the user interaction described above, again resulting in massive data sets with intense database access needs. • An entirely new breed of chatter trend analytics applications have emerged, analyzing things like Twitter tweets, Facebook chats, and so on, requiring massive levels of data storage and access.

4 Chapter 1: Introduction • Last, the world has gone mobile. In fact, in burgeoning economies and established countries alike, smart phones and tablets are by far the most readily available, high-growth, and commonly used communication vehicle for much of the world’s population, generating a nearly incomprensible stream of data, transactions, application interaction, and messaging volume (with no end in sight).

Any one of the factors above would, in itself, have created unprecedented growth in databases, but taken collectively the impact is truly awe-inspiring and overwhelming. I never cease to be amazed by the number of extremely bright people I meet that have an ever-widening set of ideas—ideas that are building new businesses that are further fueling the growth of databases to support these fast- growing industries and sectors.

YOUR ROLE AS A DATA ARCHITECT For those of us who are data architects, this is the most exciting time ever. Make no mistake about it: You are the key to the success of this new environment. For without the talent and dedication of the individual data architect to find solutions to address today’s data explosion, none of these ventures would succeed.

If you are a database architect or database administrator (DBA)—one who is driven to deliver the fastest and most scalable database platform possible for your organization—this book is addressed to you. In the end, it is your ability and intelligence that will solve these problems for real; all the technology in the world can at best assist you to do the job.

THE ACCELERATION OF BIG DATA INNOVATION The phenomenon of the Big Data explosion has not only yielded huge growth in data, but also has spurred ingenious innovation to conquer the Big Data challenge.

The sheer number of new database platforms and DBMS engines introduced in the past few years is mind-boggling. Daunting as it may seem to stay current with this rate of technological advancement, the innovation itself is nothing but good for the professional data architect. This investment in Big Data solutions has yielded an incredibly wide array of technology options for you to consider.

While the sheer rate of innovation has created an environment that is often nothing short of exhilarating (at least for a database fanatic like me), it can at times be intimidating as well.

Chapter 1: Introduction 5 Therefore, knowing how to evaluate these technologies so you can pick the right options for your application is critical. In fact, it’s a vital skill you (and the companies or organizations you support) cannot afford to be without. A portion of this series is devoted to a review of the various types of data platforms available, explaining typical use cases, available solutions, and how and where to apply them. But more importantly, the underlying concepts—applicable to all DBMS engines— are covered, enabling you to make informed evaluations and decisions for your specific application requirements.

I firmly believe (and have found through hard-won experience) that the more you understand the fundamentals of Big Data platforms and technologies, the better you will be at implementing truly successful database environments, taming whatever Big Data explosion you are in the midst of.

BACKGROUND FOR THIS BOOK My favorite subject is databases—pretty much everything about them, from database performance to management and of course scalability most of all. You might say I’m a bit of fanatic about it (okay, maybe I’m an extreme fanatic), but I wouldn’t do it if I didn’t love the field so much. I have been working intensively with all types of database technologies throughout my career. In one of my first companies, we built a successful independent database and application development firm, beginning with Sybase and expanding into other DBMS engines after that. More recently I ran Rogue Wave Software, a C++ tools vendor with a large concentration of Wall Street customers. While working with these customers, which included some of the top architects in the world, I saw a clear trend that spurred my interest even further: No matter how well an application could be scaled, the database tier was always the bottleneck. In fact, I learned that the database tier forms the last mile of application scalability, being at the very foundation of all functionality. The message was clear: Find a solution to that problem, and the potential was huge.

I have experienced the very best (and the very, very worst) that database management systems have to offer, in the trenches and firsthand. I’ve had the fortune and privilege of serving as the architect and manager for many mission- critical applications, from the advent of client-server technology to the Big Data systems of today. In addition to having lived through just about every database nightmare you can imagine, I’ve also seen the results when the database tier is performing outrageously well. I’ve seen how hard great software teams work to conquer these problems; there are few careers that require as much intelligence,

6 Chapter 1: Introduction persistence, and long, demanding hours. This has given me a profound respect for sharp technologists that conquer Big Data problems, the engines that are the very backbone of every application, not to mention the foundation of much of today’s economic growth. In fact, it’s obvious that the database explosion of the information age is upon us, and it’s likely to continue for many years to come as new and clever business opportunities arise to capitalize on this first-ever phenomenon encountered in world history.

Through this extensive and often intense experience, I’ve gained many insights over the years. My intent in writing this book is to share this experience with a wide audience, with the hope that you can avoid many of the common pitfalls I have seen repeated over and over. The book is written from a practical point of view, with what I believe are the salient points required to make the database tier of an application really cook. Many of the techniques can be applied easily, without the need to purchase or implement new technology, while others are intended as guides for taking full advantage of the latest and greatest database capabilities emerging from the fantastic technical advances made available in the past few years.

As CEO and CTO of CodeFutures (the makers of dbShards and AgilData technology), I have had the pleasure of working with the sharpest technical teams in my career—not just within our own engineering team, but also the customers that demanded the absolute best in database performance and reliability. We have worked with some of the world’s largest social networking applications, with phenomenal data growth and transaction volumes. Frankly, in the early days of our technology, we lived through a few genuinely hair-raising experiences, but with hard work in partnership with our customers, we always were able to succeed, even against the toughest challenges. More important, the wide array of applications where we have applied database sharding resulted in a broad range of experience, with large data sets and transaction volumes that were unimaginable just a few years ago. Many of these applications experienced extreme growth in transaction volume, an extraordinary thing to behold (to put it mildly). For example, watching a social game grow from nothing to over 1 million daily users in a few short months is a sight to see, and when you are part of the team responsible for ensuring the database platform handles the load, continuously running 24X7, it’s an experience that really grabs your attention (and can result in some very late nights!).

As a disclaimer right up front, many of the examples and experiences are derived from our work with our own products. dbShards, and our newest offering, AgilData, are most definitely commercial products, and of course part of my purpose in writing this book is to gain more attention for what we deliver. Having said that, I have worked hard to keep the book balanced, concentrating on the important facts

Chapter 1: Introduction 7 that drive database scalability irrespective of a particular DBMS engine or technology. In fact, the book discloses a lot of the “tricks” and designs we use to deliver such scalable performance, with the hope that you will find them beneficial and useful in your own database environment. And don’t worry, we have new capabilities coming out with our advanced AgilData technology that will blow the world away. Who knows, they may provide material for future books in this series.

WHY THE FOCUS ON DATABASE SHARDING? You will notice as you read this book that much of the focus is on database sharding, a technique and architecture for horizontal shared-nothing partitioning of database data across independent nodes or servers. From extensive experience and research, we have found that, regardless of what DBMS you use, database sharding is the only effective means for scaling a database. This applies to all types of databases, including the traditional relational database management system (RDBMS), NoSQL platforms, Big Data engines, Cloud databases, or databases of the so-called NewSQL paradigm. If you look closely at any of these offerings, particularly if they advertise a scalable platform, you will see that some sort of horizontal partitioning of data across nodes is used. Thus, database sharding is the answer, used in one way or another for virtually every leading database platform. Figure 1.2 shows the concept of database sharding, breaking a large monolithic database into multiple, smaller, sharded database instances across multiple servers.

Figure 1.2 Database sharding

Understanding the underlying principles of how database sharding works is the answer to controlling your own destiny regarding the ultimate performance of your

8 Chapter 1: Introduction Big Data implementation. The basic concepts behind sharding and how it works are not difficult, and once you grasp them you can implement amazing levels of scalability. Ultimately this book is about making you successful with your database tier, and the more you know about this concept, the more power you will have over defining, predicting, optimizing, and controlling the performance of your own databases.

NOTE Some vendors have downplayed the importance of database sharding, or have outright stated that sharding is an outmoded or useless idea. Nothing could be further from the truth, and once you understand the full meaning and applicability of the term, you will see that these very same vendors are using database sharding in their own technologies. Much more on the topic will be covered throughout the series.

SUMMARY I hope you find the ensuing pages interesting, informative, and most of all useful. If you gain even a single new idea that results in improved database performance for your application, I will have achieved what I set out to do with this work.

Chapter 1: Introduction 9 INDEX

Numbers Application server tier 24X7 availability. See also high-availability scaling traditional online applications, (HA) 54 author’s background with databases and, 7 in three-tier architecture, 49 in ideal infrastructure for Big Data, 90 Application servers redundancy and, 45 implementing horizontal scaling in Big Data infrastructure, 63–64 A round robin routing and, 48 Address book management, 34. See also sticky routing and, 48–49 FullContact Applications/apps AgilData, 7 mobile device, 31 Agile nature, of data structures, 90 social networking, 30 Analytics, challenges of managing Big Data, understanding analytics architecture, 26–27 50–53 Analytics applications understanding application architecture, architecture of, 50–53 46–50 scaling, 53 Architecture Apache Cassandra, 36 understanding analytics architecture, Apache server logs, 50–51 50–53 Apache Storm framework, 35–36 understanding application architecture, AppDynamics, system monitoring tools, 65 46–50 Application logic, separating from user interface, 49 B Application scalability Big Data streaming. See streaming detecting performance bottlenecks, 42–44 Big Data, what it is excitement at success of high-growth contemporary sources, 28–31 application, 41–42 formal definition, 25–27 fundamentals of, 42 future generators of, 31–32 goals of scalable application platforms, overview of, 24 39–41 practical definition, 27–28 reliability and, 44–45 summary, 32 scaling analytics applications, 53 traditional sources, 31 scaling out (horizontal scaling), 45 Bigtable parallel framework, 29 scaling traditional online applications, Black-box sharding 53–55 overview of, 83–86 scaling up (vertical scaling), 45 pros/cons of distributed data, 93 summary, 55 Black-box testing, 84 understanding analytics architecture and, Bottlenecks 50–53 database tier as, 6 understanding application architecture detecting, 42–44 and, 46–50 scalable database clusters and, 13

Index 101 B-trees federated search engine used by slow writes as cause of database FullContact for, 35 slowdown, 16–17 search engines and, 29 use in DBMS engines, 79 Cross-shard, database clusters, 13 Curation (management), challenges of C managing Big Data, 26 Cache/caching high CPU usage as indicator of D performance issue, 22 Data, relational. See relational data memory uses of DBMSs, 60 Data architects Cassandra (Apache), 36 role in managing Big Data, 5 Cloud use of data modeling by, 75 cost management and, 40 Data de-normalization, 97–98 pros/cons of, 41 Data explosion, 3–4 Social Point hosting games in, 36 Data locality, modeling database tier and, uses of database sharding and, 8 98–99 Clusters. See database clusters Data modeling Column databases, 80 critical to success, 74–76 Concurrency data locality and, 98–99 pros/cons of distributed data, 92–93 game application and, 97–98 read scaling and, 62–63 strategies for the scaling endgame, 96–98 Concurrency contention Data sets, limits on size of, 25–26 causes of database slowdown, 17–20 Data store (repository), in FullContact, 36 connection limits and, 19–20 Data structures, agile nature of, 90 overview of, 17 Database administrator (DBA), role in in single-treaded NoSQL engines, 19 managing Big Data, 5 in traditional RDMBSs, 18 Database clusters Connection limits, concurrency contention dynamic clusters in ideal infrastructure for and, 19–20 Big Data, 90 Consistent hashing high-availability (HA) and, 55 in black-box sharding, 84 implementing horizontal scaling and, 64 database sharding and, 82 performance bottlenecks and, 13 Contact Enrichment technique, in rebalancing, 94–95 FullContact, 34–35 required for working with Big Data, 27 Contact management, 34. See also scaling up database tier, 57 FullContact Database management system (DBMS) Contention. See concurrency contention author’s background with, 6–7 CPUs black-box sharding and, 86 detecting performance bottlenecks, 42–44 challenge of Big Data, 2–3 distributed operations and, 93 choices in, 69–70 high usage as indicator of performance CPU upgrade and, 60 issue, 22 data modeling and, 75 system metrics in detecting issues, 65–66 lesson relative to slowdown, 11–13 upgrading combined with vertical scaling, memory upgrade and, 60–61 59–60 performance degradation cycle Crawling/Web crawling (slowdown), 79 causes of data explosion, 4 relational. See relational database challenges of managing Big Data, 26 management system (RDBMS)

102 Index Database scalability laws of, 78–80 database tier and, 56–57 lesson relative to slowdown, 11–13 determining when to scale, 57–58 missing index causing slowdown, 20–21 horizontal scaling, 63–64 options for improving performance, 58–59 indicators of need, 65–66 scaling. See database scalability options for improving database slow reads indicating need for database performance, 58–59 scaling, 66–67 overview of, 56 slow writes causing slowdown, 16–17 read scaling, 62–63 slow writes indicating need for database slow reads indicating need, 66–67 scaling, 67–68 slow writes indicating need, 67–68 slowdown (degradation) curve, 10–11 upgrades combined with vertical scaling, time requirement for updating statistics, 59–62 21 vertical scaling, 59 traditional extraction of data from, 31 Database sharding DBA (database administrator), role in black-box sharding, 83–86 managing Big Data, 5 definition of, 80–83 DBMS. See database management system laws of databases and, 78–80 (DBMS) limitations of manual approach to, 80–81 DbShards, 7 overview of, 77 Degradation cycle. See slowdown partitioning and, 54–55 Denial-of-service (DoS) attacks, 52 reasons for focus on, 8–9 Developers, use of data modeling by, 75 relational sharding, 86–88 Disaster recovery (DR), 2 as response to database slowdown, 77–78 Disk I/O in Social Point, 37–38 detecting performance bottlenecks, 43–44 strategies for the scaling endgame, 96–98 indicators of performance issue, 22 summary, 88 slow reads indicating need for database virtual sharding, 38 scaling, 66–67 Database tier slow writes indicating need for database as bottleneck, 6 scaling, 67–68 in ideal infrastructure for Big Data, 90 system metrics in detecting issues, 65–66 modeling, 98–99 upgrading disk drive combined with scaling traditional online applications, 54– vertical scaling, 61–62 55 Disk reads. See reads in three-tier architecture, 50 Disk writes. See writes Databases Distributed data author’s background with, 6–8 database sharding and, 82 challenge of Big Data, 2–3 horizontal scaling and, 46 clustering. See database clusters ideals in scaling endgame, 96 common causes of slowdown, 14 indicator of when needed, 17 concurrency contention causing pros/cons of distributed operations, 91–94 slowdown, 17–20 Distributed joins, 93–94 future options for tapping into storage DoS (denial-of-service) attacks, 52 data, 31–32 DR (disaster recovery), 2 identifying issues that cause slowdown, Dragon City game, 36–37 21–22 Dynamic clusters, in ideal infrastructure for indicators of need for scaling, 65–66 Big Data, 90

Index 103 E implementing horizontal scaling, 64 Elastic scaling, 40 meeting the challenge of Big Data, 2 Embarrassingly parallel applications, 53 Horizontal scaling (scaling out) ETL (extract/transform/load), 31 in database sharding, 78–80 exceeding limits of single server and, 42 F in FullContact, 36 Facebook implementing, 63–64 one-to-many and many-to-many read scaling, 62–63 relationships in, 72 scaling applications, 45 social networks as source of Big Data, 29– in Social Point, 37 30 “Hype cycle,” new technology and, 33–34 Social Point as game provider on, 36 Failure points I parallel computing and, 45 Indexes/indexing scalable databases and, 55 as de-normalized structure, 98 Flexibility, of ideal infrastructure for Big memory use and, 60 Data, 90 missing index as cause of database Fragmented data, 98 slowdown, 20–21 FullContact read performance and, 77 data store (repository) in, 36 slow writes as cause of database facing challenge of Big Data, 35 slowdown, 16–17 horizontal scaling in, 36 updates causing slow writes, 67–68 overview of, 34–35 Infrastructure, ideal for Big Data, 90 streaming approach in, 35–36 Inherent scalability, of database tier, 53 In-memory DBMS engines, 80 G InnoDB engine, 11 Games Innovation, meeting the challenge of Big relational sharding and, 97–98 Data, 5–6 relationships in, 73–74 Instagram, 31 Social Point as game provider, 36 Internet as source of Big Data, 30 causes of data explosion, 4 Global economy, causes of data explosion, 4 search engines and, 29 Google, 29 Internet of Things (IoT), 32 Google+, 30 IoT (Internet of Things), 32 Graphical user interface (GUI), interacting with back-end DBMS, 12 J Java, data relationships in, 71 H Javascript Object Notation (JSON) HA. See high-availability (HA) Social Point using JSON objects, 37–38 Hadoop engines, 51–52 storing JSON structures, 75 Hash algorithms/hashing consistent hashing and, 82, 84 K rebalancing clusters, 95 Key range approach, to clustering, 95 shard key and, 82 Key-value stores, NoSQL/NewSQL, 84 Healthcare monitoring, 32 High-availability (HA). See also 24X7 L Linear scalability availability near-linear scalability in ideal database clusters and, 55 infrastructure, 90 in ideal infrastructure for Big Data, 90 shared-nothing architecture and, 83

104 Index Load, issues database cannot handle, 27–28 B-tree use for indexing, 79 Load balancer tier choices in DBMS engines, 69 round robin routing by, 48 concurrency contention in single-treaded scaling traditional online applications, 53 NoSQL engines, 19 sticky routing by, 48–49 data modeling and, 75 in three-tier architecture, 47 key-value stores, 84 Load testing, slowdown (degradation) curve MySQL as NoSQL data store, 37 example, 11 performance issues in, 14 Locks shared contention model, 20 concurrency contention. See concurrency contention O lesson relative to database slowdown, 11 Object-oriented languages, data relationships Logs, use of Apache server log in analytics, and, 71 50–51 One-to-many relationships, 72 Online advertising, causes of data explosion, M 4 Many-to-many relationships, 72 Online applications, scaling traditional, 53– MapReduce 55 choices in DBMS engines, 69 Operating systems (OSes) definition of, 52 identifying performance issues, 21–22 use of Apache server log in analytics, 51– memory swapping, 43 52 Oracle, 69 Meaning, data and, 70–72 Memory P detecting performance bottlenecks, 43 Parallel computing. See also database high usage as indicator of performance clusters issue, 22 addition of failure points, 45 system metrics in detecting issues, 65–66 embarrassingly parallel applications, 53 upgrading combined with vertical scaling, horizontal scaling and, 46 60 required for working with Big Data, 27 Memory leaks, 43 Partitioning Memory swapping, 43, 60–61 database sharding and, 54–55, 80 Metrics, analytics applications, 50 implementing horizontal scaling in Big Microsoft SQL Server, 69 Data infrastructure, 63–64 Mobile devices, as source of Big Data, 5, 31 Performance MySQL challenge of Big Data, 2–3 choices in DBMS engines, 69 degradation cycle (slowdown), 79 slowdown (degradation) curve example, detecting bottlenecks, 42–44 11 identifying issues, 21–22 Social Point using, 37 impact of single suboptimal query on, 13 indexes for read performance, 77 N in NoSQL/NewSQL, 14 Near-linear scalability, in ideal infrastructure options for improving, 58–59 for Big Data, 90 slowdown (degradation) curve, 10–11 Nesting, black-box sharding and, 85 table scans causing slowdown, 14–16 New Relic, system monitoring tools, 65 Player games NewSQL. See NoSQL/NewSQL black-box sharding and, 85 nmon utility, checking OS statistics with, 22 relational sharding and, 87 NoSQL/NewSQL Postgre SQL, 69

Index 105 Primary Key, in black-box sharding, 84 data modeling and, 75 Programming languages, 49 difficulty in using with Big Data, 25–27 relational nature of data and, 69 Q relational sharding and, 86–88 Queries Relational semantics, (one-to-many, many- impact of single suboptimal query on to-many, and scale-out), 35 performance, 13 Relational sharding memory uses of DBMSs, 60 game application, 97–98 missing index as cause of database overview of, 86–88 slowdown, 20–21 pros/cons of distributed data, 93 slow reads indicating need for database Relationships, importance in understanding scaling, 66–67 data, 73–74 table scans causing performance Reliability slowdown, 14–16 of ideal infrastructure for Big Data, 90 Query analyzer, 22 scaling applications and, 44–45 R Repository (data store), in FullContact, 36 RAID (Redundant Array of Independent Round robin routing, by load balancers, 48 Disks), 61–62 Rows, table scans causing performance RAM slowdown, 14–16 detecting performance bottlenecks, 43 S high memory usage as indicator of SANs (storage area networks), 61–62 performance issue, 22 Scaling memory upgrade and, 60–61 database sharding and, 8–9 Read scaling, 62–63 database tier as last mile in, 56–57 Reads detecting performance bottlenecks, 42–44 excessive disk I/O as indicator of determining when to scale databases, 57– performance issue, 22 58 indexes for read performance, 77 excitement from high-growth success, 41– slow reads indicating need for database 42 scaling, 66–67 fundamentals of, 42 slowdown (degradation) curve, 10–11 goals of scalable application platforms, upgrading disk drive combined with 39–41 vertical scaling, 61–62 horizontal scaling, 63–64 Rebalancing indicators of need for, 65–66 clusters, 94–95 options for improving database ideals in scaling endgame, 96 performance, 58–59 Redundancy, 24X7 availability and, 45 overview of, 39 Redundant Array of Independent Disks read scaling, 62–63 (RAID), 61–62 reliability and, 44–45 Relational data required for working with Big Data, 27 data modeling, 75–76 scaling analytics applications, 53 importance of relationships, 73–74 scaling out (horizontal scaling), 45 meaning of data, 70–72 scaling traditional online applications, 53– overview of, 69–70 55 summary, 75–76 scaling up (vertical scaling), 45 Relational database management system slow reads indicating need for, 66–67 (RDBMS) slow writes indicating need for, 67–68 concurrency contention in, 18

106 Index Scaling, continued partitioning, 8–9 in Social Point with shared-nothing sharding and, 37 sharding, 37 Sharing, challenges of managing Big Data, summary, 55, 68 26 understanding analytics architecture and, Shuffle join, 93–94 50–53 Single shard server operation understanding application architecture ideals in scaling endgame, 96 and, 46–50 pros/cons of distributed data, 91–92 upgrades combined with vertical scaling, Six Degrees of Separation, 72 59–62 Slowdown vertical scaling (scaling up), 59 challenge of Big Data and, 2 Scaling, endgame in common causes, 14 data locality and, 98–99 concurrency contention causing, 17–20 ideal circumstances, 96 database sharding as response to, 77–78 ideal infrastructure for Big Data, 90 identifying issues, 21–22 overview of, 89–90 indicating need for database scaling, 65– pros/cons of distributed operations, 91–94 66 rebalancing and, 94–95 lesson in, 11–13 sharding and data modeling strategies for, missing index as cause, 20–21 96–98 overview of, 10 summary, 99–100 performance degradation curve, 10–11 Scaling down, 40 performance degradation cycle, 79 Scatter-gather approach, black-box sharding slow reads, 66–67 and, 85 slow writes, 16–17, 67–68 Schema-less database design, data modeling summary, 23 and, 75–76 table scans causing, 14–16 Scientific data, traditional sources of Big Smart meters, 32 Data, 31 Smart phones, as source of Big Data, 31 Search engines Snapchat, 31 federated search engine used by Social games, 30 FullContact, 35 Social networks, as source of Big Data, 4, as source of Big Data, 29 29–30 Searches, challenges of managing Big Data, Social Point 26 clustering in, 37–38 SELECT queries, 21 overview of, 36–37 Server logs, use of Apache server log in scaling with shared-nothing sharding, 37 analytics, 50–51 Software Pipelines and SOA (Isaacson), 45 Service tier, application server tier and, 49 Solid-state drives (SSDs) Session state, 49 comparing with spinning disks, 43 Shard key. See also database sharding upgrading to, 22, 61–62 black-box sharding and, 86 Sources of Big Data hash algorithms and, 82 future generators, 31–32 rebalancing clusters and, 95 mobile devices, 31 routing messages based on, 35 overview of, 28 selecting sharding strategy, 96–97 search engines, 29 Shared contention model, 20 social networks, 29–30 Shared-nothing architecture traditional sources (system monitoring and linear scalability and, 83 scientific data), 31

Index 107 SQL SELECT query, 15–16 U SSDs. See solid-state drives (SSDs) UI (user interface), 49 State Upgrades, combining with vertical scaling, application server tier and, 49 59–62 database tier and, 50 Use cases, real world example of Big Data stateless nature of load balancers, 47 FullContact, 34–36 Sticky routing, by load balancers, 48–49 overview of, 33–34 Storage, challenges of managing Big Data, Social Point, 36–38 26 summary, 38 Storage area networks (SANs), 61–62 User interface (UI), 49 Storm framework (Apache), 35–36 Streaming V in FullContact, 35–36 Vertical scaling (scaling up) future options for tapping into storage application scalability, 45 data, 32 automating to meet growth, 40 System metrics combining upgrades with, 59–62 analytics applications, 50 database scalability, 59 indicating need for database scaling, 65– exceeding limits of single server and, 42 66 Virtual sharding, in Social Point, 38 System monitoring, traditional sources of Big Visualization, challenges of managing Big Data, 31 Data, 26 T W Table scans Web crawling. See crawling/Web crawling causes of database slowdown, 14–16 Web-analytics, 4 lesson relative to database slowdown, 11– World Wide Web, 4, 29 13 Writes upgrading disk drive combined with detecting performance bottlenecks, 43 vertical scaling, 61–62 distributed data and, 92 Tablet devices, 31 excessive disk I/O as indicator of Three-tier architecture performance issue, 22 application server tier, 49 performance degradation curve, 10–11 database tier, 50 scaling up (vertical scaling) and, 45 load balancer tier, 47–49 slow writes as cause of database overview of, 46–47 slowdown, 16–17 Transfer, managing Big Data, 26 slow writes indicating need for database Twitter scaling, 67–68 Apache Storm framework created by, 35 upgrading disk drive combined with causes of data explosion, 4, 29–30 vertical scaling, 61–62

108 Index