Making Sense of NoSQL
A GUIDE FOR MANAGERS AND THE REST OF US
DAN MCCREARY
ANN KELLY
MANNING SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: [email protected]
©2014 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editor: Elizabeth Lexleigh
Copyeditor: Benjamin Berg
Proofreader: Katie Tennant
Typesetter: Dottie Marsico
Cover designer: Leslie Haimes
ISBN 9781617291074 Printed in the United States of America 1 2 3 4 5 6 7 8 9 10 – MAL – 18 17 16 15 14 13
To technology innovators and early adopters… those who shake up the status quo
We dedicate this book to people who understand the limitations of our current way of solving technology problems. They understand that by removing limitations, we can solve problems faster and at a lower cost and, at the same time, become more agile. Without these people, the NoSQL movement wouldn’t have gained the critical mass it needed to get off the ground.
Innovators and early adopters are the people within organizations who shake up the status quo by testing and evaluating new architectures. They initiate pilot projects and share their successes and failures with their peers. They use early versions of software and help shake out the bugs. They build new versions of NoSQL distributions from source and explore areas where new NoSQL solutions can be applied. They’re the people who give solution architects more options for solving business problems. We hope this book will help you to make the right choices.
brief contents
PART 1 INTRODUCTION
1 ■ NoSQL: It's about making intelligent choices
2 ■ NoSQL concepts

PART 2 DATABASE PATTERNS
3 ■ Foundational data architecture patterns
4 ■ NoSQL data architecture patterns
5 ■ Native XML databases

PART 3 NOSQL SOLUTIONS
6 ■ Using NoSQL to manage big data
7 ■ Finding information with NoSQL search
8 ■ Building high-availability solutions with NoSQL
9 ■ Increasing agility with NoSQL

PART 4 ADVANCED TOPICS
10 ■ NoSQL and functional programming
11 ■ Security: protecting data in your NoSQL systems
12 ■ Selecting the right NoSQL solution
contents
foreword
preface
acknowledgments
about this book
PART 1 INTRODUCTION

1 NoSQL: It's about making intelligent choices
1.1 What is NoSQL?
1.2 NoSQL business drivers
    Volume ■ Velocity ■ Variability ■ Agility
1.3 NoSQL case studies
    Case study: LiveJournal's Memcache ■ Case study: Google's MapReduce—use commodity hardware to create search indexes ■ Case study: Google's Bigtable—a table with a billion rows and a million columns ■ Case study: Amazon's Dynamo—accept an order 24 hours a day, 7 days a week ■ Case study: MarkLogic ■ Applying your knowledge
1.4 Summary

2 NoSQL concepts
2.1 Keeping components simple to promote reuse
2.2 Using application tiers to simplify design
2.3 Speeding performance by strategic use of RAM, SSD, and disk
2.4 Using consistent hashing to keep your cache current
2.5 Comparing ACID and BASE—two methods of reliable database transactions
    RDBMS transaction control using ACID ■ Non-RDBMS transaction control using BASE
2.6 Achieving horizontal scalability with database sharding
2.7 Understanding trade-offs with Brewer's CAP theorem
2.8 Apply your knowledge
2.9 Summary
2.10 Further reading
PART 2 DATABASE PATTERNS

3 Foundational data architecture patterns
3.1 What is a data architecture pattern?
3.2 Understanding the row-store design pattern used in RDBMSs
    How row stores work ■ Row stores evolve ■ Analyzing the strengths and weaknesses of the row-store pattern
3.3 Example: Using joins in a sales order
3.4 Reviewing RDBMS implementation features
    RDBMS transactions ■ Fixed data definition language and typed columns ■ Using RDBMS views for security and access control ■ RDBMS replication and synchronization
3.5 Analyzing historical data with OLAP, data warehouse, and business intelligence systems
    How data flows from operational systems to analytical systems ■ Getting familiar with OLAP concepts ■ Ad hoc reporting using aggregates
3.6 Incorporating high availability and read-mostly systems
3.7 Using hash trees in revision control systems and database synchronization
3.8 Apply your knowledge
3.9 Summary
3.10 Further reading

4 NoSQL data architecture patterns
4.1 Key-value stores
    What is a key-value store? ■ Benefits of using a key-value store ■ Using a key-value store ■ Use case: storing web pages in a key-value store ■ Use case: Amazon simple storage service (S3)
4.2 Graph stores
    Overview of a graph store ■ Linking external data with the RDF standard ■ Use cases for graph stores
4.3 Column family (Bigtable) stores
    Column family basics ■ Understanding column family keys ■ Benefits of column family systems ■ Case study: storing analytical information in Bigtable ■ Case study: Google Maps stores geographic information in Bigtable ■ Case study: using a column family to store user preferences
4.4 Document stores
    Document store basics ■ Document collections ■ Application collections ■ Document store APIs ■ Document store implementations ■ Case study: ad server with MongoDB ■ Case study: CouchDB, a large-scale object database
4.5 Variations of NoSQL architectural patterns
    Customization for RAM or SSD stores ■ Distributed stores ■ Grouping items
4.6 Summary
4.7 Further reading

5 Native XML databases
5.1 What is a native XML database?
5.2 Building applications with a native XML database
    Loading data can be as simple as drag-and-drop ■ Using collections to group your XML documents ■ Applying simple queries to transform complex data with XPath ■ Transforming your data with XQuery ■ Updating documents with XQuery updates ■ XQuery full-text search standards
5.3 Using XML standards within native XML databases
5.4 Designing and validating your data with XML Schema and Schematron
    XML Schema ■ Using Schematron to check document rules
5.5 Extending XQuery with custom modules
5.6 Case study: using NoSQL at the Office of the Historian at the Department of State
5.7 Case study: managing financial derivatives with MarkLogic
    Why financial derivatives are difficult to store in RDBMSs ■ An investment bank switches from 20 RDBMSs to one native XML system ■ Business benefits of moving to a native XML document store ■ Project results
5.8 Summary
5.9 Further reading
PART 3 NOSQL SOLUTIONS

6 Using NoSQL to manage big data
6.1 What is a big data NoSQL solution?
6.2 Getting linear scaling in your data center
6.3 Understanding linear scalability and expressivity
6.4 Understanding the types of big data problems
6.5 Analyzing big data with a shared-nothing architecture
6.6 Choosing distribution models: master-slave versus peer-to-peer
6.7 Using MapReduce to transform your data over distributed systems
    MapReduce and distributed filesystems ■ How MapReduce allows efficient transformation of big data problems
6.8 Four ways that NoSQL systems handle big data problems
    Moving queries to the data, not data to the queries ■ Using hash rings to evenly distribute data on a cluster ■ Using replication to scale reads ■ Letting the database distribute queries evenly to data nodes
6.9 Case study: event log processing with Apache Flume
    Challenges of event log data analysis ■ How Apache Flume works to gather distributed event data ■ Further thoughts
6.10 Case study: computer-aided discovery of health care fraud
    What is health care fraud detection? ■ Using graphs and custom shared-memory hardware to detect health care fraud
6.11 Summary
6.12 Further reading

7 Finding information with NoSQL search
7.1 What is NoSQL search?
7.2 Types of search
    Comparing Boolean, full-text keyword, and structured search models ■ Examining the most common types of search
7.3 Strategies and methods that make NoSQL search effective
7.4 Using document structure to improve search quality
7.5 Measuring search quality
7.6 In-node indexes versus remote search services
7.7 Case study: using MapReduce to create reverse indexes
7.8 Case study: searching technical documentation
    What is technical document search? ■ Retaining document structure in a NoSQL document store
7.9 Case study: searching domain-specific languages—findability and reuse
7.10 Apply your knowledge
7.11 Summary
7.12 Further reading

8 Building high-availability solutions with NoSQL
8.1 What is a high-availability NoSQL database?
8.2 Measuring availability of NoSQL databases
    Case study: the Amazon S3 SLA ■ Predicting system availability ■ Apply your knowledge
8.3 NoSQL strategies for high availability
    Using a load balancer to direct traffic to the least busy node ■ Using high-availability distributed filesystems with NoSQL databases ■ Case study: using HDFS as a high-availability filesystem to store master data ■ Using a managed NoSQL service ■ Case study: using Amazon DynamoDB for a high-availability data store
8.4 Case study: using Apache Cassandra as a high-availability column family store
    Configuring data to node mappings with Cassandra
8.5 Case study: using Couchbase as a high-availability document store
8.6 Summary
8.7 Further reading

9 Increasing agility with NoSQL
9.1 What is software agility?
    Apply your knowledge: local or cloud-based deployment?
9.2 Measuring agility
9.3 Using document stores to avoid object-relational mapping
9.4 Case study: using XRX to manage complex forms
    What are complex business forms? ■ Using XRX to replace client JavaScript and object-relational mapping ■ Understanding the impact of XRX on agility
9.5 Summary
9.6 Further reading
PART 4 ADVANCED TOPICS

10 NoSQL and functional programming
10.1 What is functional programming?
    Imperative programming is managing program state ■ Functional programming is parallel transformation without side effects ■ Comparing imperative and functional programming at scale ■ Using referential transparency to avoid recalculating transforms
10.2 Case study: using NetKernel to optimize web page content assembly
    Assembling nested content and tracking component dependencies ■ Using NetKernel to optimize component regeneration
10.3 Examples of functional programming languages
10.4 Making the transition from imperative to functional programming
    Using functions as a parameter of a function ■ Using recursion to process unstructured document data ■ Moving from mutable to immutable variables ■ Removing loops and conditionals ■ The new cognitive style: from capturing state to isolated transforms ■ Quality, validation, and consistent unit testing ■ Concurrency in functional programming
10.5 Case study: building NoSQL systems with Erlang
10.6 Apply your knowledge
10.7 Summary
10.8 Further reading

11 Security: protecting data in your NoSQL systems
11.1 A security model for NoSQL databases
    Using services to mitigate the need for in-database security ■ Using data warehouses and OLAP to mitigate the need for in-database security ■ Summary of application versus database-layer security benefits
11.2 Gathering your security requirements
    Authentication ■ Authorization ■ Audit and logging ■ Encryption and digital signatures ■ Protecting public websites from denial of service and injection attacks
11.3 Case study: access controls on key-value store—Amazon S3
    Identity and Access Management (IAM) ■ Access-control lists (ACL) ■ Bucket policies
11.4 Case study: using key visibility with Apache Accumulo
11.5 Case study: using MarkLogic's RBAC model in secure publishing
    Using the MarkLogic RBAC security model to protect documents ■ Using MarkLogic in secure publishing ■ Benefits of the MarkLogic security model
11.6 Summary
11.7 Further reading

12 Selecting the right NoSQL solution
12.1 What is architecture trade-off analysis?
12.2 Team dynamics of database architecture selection
    Selecting the right team ■ Accounting for experience bias ■ Using outside consultants
12.3 Steps in architectural trade-off analysis
12.4 Analysis through decomposition: quality trees
    Sample quality attributes ■ Evaluating hybrid and cloud architectures
12.5 Communicating the results to stakeholders
    Using quality trees as navigational maps ■ Apply your knowledge ■ Using quality trees to communicate project risks
12.6 Finding the right proof-of-architecture pilot project
12.7 Summary
12.8 Further reading

index
foreword
Where does one start to explain a topic that's defined by what it isn't, rather than what it is? Believe me, as someone who's been trying to educate people in this field for the past three years, it's a frustrating dilemma, and one shared by lots of technical experts, consultants, and vendors. Even though few think the name NoSQL is optimal, almost everyone seems to agree that it defines a category of products and technologies better than any other term. My best advice is to let go of whatever hang-ups you might have about the semantics, and just choose to learn about something new. And trust me please…the stuff you're about to learn is worth your time.

Some brief personal context up front: as a publisher in the world of information management, I had heard the term NoSQL, but had little idea of its significance until three years ago, when I ran into Dan McCreary in the corridor of a conference in Toronto. He told me a bit about his current project and was obviously excited about the people and technologies he was working with. He convinced me in no time that this NoSQL thing was going to be huge, and that someone in my position should learn as much as I could about it. It was excellent advice, and we've had a wonderful partnership since then, running a conference together, doing webinars, and writing white papers. Dan was spot on…this NoSQL stuff is exciting, and the people in the community are quite brilliant.

Like most people who work in arcane fields, I often find myself trying to explain complex things in simple terms for the benefit of those who don't share the same passion or context that I have. And even when you understand the value of the perfect elevator pitch, or desperately want to explain what you do to your mother, the right explanation can be elusive. Sometimes it's even more difficult to explain new things to
people who have more knowledge, rather than less. Specifically in terms of NoSQL, that's the huge community of relational DBMS devotees who've existed happily and efficiently for the past 30 years, needing nothing but one toolkit.

That's where Making Sense of NoSQL comes in. If you're in an enterprise computing role and trying to understand the value of NoSQL, then you're going to appreciate this book, because it speaks directly to you. Sure, you startup guys will get something out of it, but for enterprise IT folks, the barriers are pretty daunting—not the least of which will be the many years of technical bias accumulated against you from the people in your immediate vicinity, wondering why the heck you'd want to put your data into anything but a nice, orderly table.

The authors understand this, and have focused a lot of their analysis on the technical and architectural trade-offs that you'll be facing. I also love that they've undertaken so much effort to offer case studies throughout the book. Stories are key to persuasion, and these examples drawn from real applications provide a storyline to the subject that will be invaluable as you try to introduce these new technologies into your organization.

Dan McCreary and Ann Kelly have provided the first comprehensive explanation of what NoSQL technologies are, and why you might want to use them in a corporate context. While this is not meant to be a technical book, I can tell you that behind the scenes they've been diligent about consulting with the product architects and developers to ensure that the nuances and features of different products are represented accurately.

Making Sense of NoSQL is a handbook of easily digestible, practical advice for technical managers, architects, and developers. It's a guide for anyone who needs to understand the full range of their data management options in the increasingly complex and demanding world of big, fast data. The title of chapter 1 is "NoSQL: It's about making intelligent choices," and based on your selection of this book, I can confirm that you've made one already.
TONY SHAW
FOUNDER AND CEO
DATAVERSITY
preface
Sometimes we're presented with facts that force us to reassess what we think we know. After spending most of our working life performing data modeling tasks with a focus on storing data in rows, we learned that the modeling process might not be necessary. While this information didn't mean our current knowledge was invalid, it forced us to take a hard look at how we solved business technology problems. Armed with new knowledge, techniques, and problem-solving styles, we broadened the repertoire of our solution space.

In 2006, while working on a project that involved the exchange of real estate transactions, we spent many months designing XML schemas and forms to store the complex hierarchies of data. On the advice of a friend (Kurt Cagle), we found that storing the data into a native XML database saved our project months of object modeling, relational database design, and object-relational mapping. The result was a radically simple architecture that could be maintained by nonprogrammers.

The realization that enterprise data can be stored in structures other than RDBMSs is a major turning point for people who enter the NoSQL space. Initially, this information may be viewed with skepticism, fear, and even self-doubt. We may question our own skills as well as the educational institutions that trained us and the organizations that reinforce the notion that RDBMS and objects are the only way to solve problems. Yet if we're going to be fair to our clients, customers, and users, we must take a holistic approach to find the best fit for each business problem and evaluate other database architectures.

In 2010, frustrated with the lack of exposure NoSQL databases were getting at large enterprise data conferences, we approached Tony Shaw from DATAVERSITY
about starting a new conference. The conference would be a venue for anyone interested in learning about NoSQL technologies and exposing individuals and organizations to the NoSQL databases available to them. The first NoSQL Now! conference was successfully held in San Jose, California, in August of 2011, with approximately 500 interested and curious attendees.

One finding of the conference was that there was no single source of material that covered NoSQL architectures or introduced a process to objectively match a business problem with the right database. People wanted more than a collection of "Hello World!" examples from open source projects. They were looking for a guide that helped them match a business problem to an architecture first, and then a process that allowed them to consider open source as well as commercial database systems.

Finding a publisher that would use our existing DocBook content was the first step. Luckily, we found that Manning Publications understands the value of standards.
acknowledgments
We'd like to thank everyone at Manning Publications who helped us take our raw ideas and transform them into a book: Michael Stephens, who brought us on board; Elizabeth Lexleigh, our development editor, who patiently read version after version of each chapter; Nick Chase, who made all the technology work like it's supposed to; the marketing and production teams, and everyone who worked behind the scenes—we acknowledge your efforts, guidance, and words of encouragement.

To the many people who reviewed case studies and provided us with examples of real-world NoSQL usage—we appreciate your time and expertise: George Bina, Ben Brumfield, Dipti Borkar, Kurt Cagle, Richard Carlsson, Amy Friedman, Randolph Kahle, Shannon Kempe, Amir Halfon, John Hudzina, Martin Logan, Michaline Todd, Eric Merritt, Pete Palmer, Amar Shan, Christine Schwartz, Tony Shaw, Joe Wicentowski, Melinda Wilken, and Frank Weige.

To the reviewers who contributed valuable insights and feedback during the development of our manuscript—our book is better for your input: Aldrich Wright, Brandon Wilhite, Craig Smith, Gabriela Jack, Ian Stirk, Ignacio Lopez Vellon, Jason Kolter, Jeff Lehn, John Guthrie, Kamesh Sampah, Michael Piscatello, Mikkel Eide Eriksen, Philipp K. Janert, Ray Lugo, Jr., Rodrigo Abreu, and Roland Civet.

We'd like to say a special thanks to our friend Alex Bleasdale for providing us with working code to support the role-based, access-control case study in our chapter on NoSQL security and secure document publishing. Special thanks also to Tony Shaw for contributing the foreword, and to Leo Polovets for his technical proofread of the final manuscript shortly before it went to production.
about this book
In writing this book, we had two goals: first, to describe NoSQL databases, and second, to show how NoSQL systems can be used as standalone solutions or to augment current SQL systems to solve business problems. We invite anyone who has an interest in learning about NoSQL to use this book as a guide. You'll find that the information, examples, and case studies are targeted toward technical managers, solution architects, and data architects who have an interest in learning about NoSQL.

This material will help you objectively evaluate SQL and NoSQL database systems to see which business problems they solve. If you're looking for a programming guide for a particular product, you've come to the wrong place. In this book you'll find information about the motivations behind NoSQL, as well as related terminology and concepts. There might be sections and chapters of this book that cover topics you already understand; feel free to skim or skip over them and focus on the unknown.

Finally, we feel strongly about and focus on standards. The standards associated with SQL systems allow applications to be ported between databases using a common language. Unfortunately, NoSQL systems can't yet make this claim. In time, NoSQL application vendors will pressure NoSQL database vendors to adopt a set of standards to make them as portable as SQL.
Roadmap

This book is divided into four parts. Part 1 sets the stage by defining NoSQL and reviewing the basic concepts behind the NoSQL movement.

In chapter 1, "NoSQL: It's about making intelligent choices," we define the term NoSQL, talk about the key events that triggered the NoSQL movement, and present a
high-level view of the business benefits of NoSQL systems. Readers already familiar with the NoSQL movement and the business benefits might choose to skim this chapter.

In chapter 2, "NoSQL concepts," we introduce the core concepts associated with the NoSQL movement. Although you can skim this chapter on a first read-through, it's important for understanding material in later chapters. We encourage you to use this chapter as a reference guide as you encounter these concepts throughout the book.
In part 2, "Database patterns," we do an in-depth review of SQL and NoSQL database architecture patterns. We look at the different database structures and how we access them, and present use cases to show the types of situations where each architectural pattern is best used.

Chapter 3 covers "Foundational data architecture patterns." It begins with a review of the drivers behind RDBMSs and how the requirements of ERP systems shaped the features we have in current RDBMS and BI/DW systems. We briefly discuss other database systems such as object databases and revision control systems. You can skim this chapter if you're already familiar with these systems.

In chapter 4, "NoSQL data architecture patterns," we introduce the database patterns associated with NoSQL. We look at key-value stores, graph stores, column family (Bigtable) systems, and document databases. The chapter provides definitions, examples, and case studies to facilitate understanding.

Chapter 5 covers "Native XML databases," which are most often found in government and publishing applications, as they are known to lower costs and support the use of standards. We present two case studies from the financial and government publishing areas.
In part 3, we look at how NoSQL systems can be applied to the problems of big data, search, high availability, and agile web development.

In chapter 6, "Using NoSQL to manage big data," you'll see how NoSQL systems can be configured to efficiently process large volumes of data running on commodity hardware. We include a discussion on distributed computing and horizontal scalability, and present a case study where commodity hardware fails to scale for analyzing large graphs.

In chapter 7, "Finding information with NoSQL search," you'll learn how to improve search quality by implementing a document model and preserving the document's content. We discuss how MapReduce transforms are used to create scalable reverse indexes, which result in fast search. We review the search systems used on documents and databases and show how structured search solutions are used to create accurate search result rankings.

Chapter 8 covers "Building high-availability solutions with NoSQL." We show how the replicated and distributed nature of NoSQL systems can be used to result in systems that have increased availability. You'll see how many low-cost CPUs can provide higher uptime once data synchronization technologies are used. Our case study shows
how full peer-to-peer architectures can provide higher availability than other distribution models.

In chapter 9, we talk about "Increasing agility with NoSQL." By eliminating the object-relational mapping layer, NoSQL software development is simpler and can quickly adapt to changing business requirements. You'll see how these NoSQL systems allow the experienced developer, as well as nonprogramming staff, to become part of the software development lifecycle process.
In part 4, we cover the "Advanced topics" of functional programming and security, and then review a formalized process for selecting the right NoSQL system.

In chapter 10, we cover the topic of "NoSQL and functional programming" and the need for distributed transformation architectures such as MapReduce. We look at how functional programming has influenced the ability of NoSQL solutions to use large numbers of low-cost processors and why several NoSQL databases use actor-based systems such as Erlang. We also show how functional programming and resource-oriented programming can be combined to create scalable performance on distributed systems with a case study of the NetKernel system.

Chapter 11 covers the topic of "Security: protecting data in your NoSQL systems." We review the history and key security considerations that are common to NoSQL solutions. We provide examples of how a key-value store, a column family store, and a document store can implement a robust security model.

In chapter 12, "Selecting the right NoSQL solution," we walk through a formal process that organizations can use to select the right database for their business problem. We close with some final thoughts and information about how these technologies will impact business system selection.
Code conventions and downloads

Source code in listings or in text is in a fixed-width font like this to separate it from ordinary text. You can download the source code for the listings from the Manning website, www.manning.com/MakingSenseofNoSQL.
Author Online

The purchase of Making Sense of NoSQL includes free access to a private web forum run by Manning Publications, where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum and subscribe to it, point your web browser to www.manning.com/MakingSenseofNoSQL. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.

Manning's commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of
the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions lest their interest stray!

The Author Online forum and the archives of previous discussions will be accessible from the publisher's website as long as the book is in print.
About the authors
DAN MCCREARY is a data architecture consultant with a strong interest in standards. He has worked for organizations such as Bell Labs (integrated circuit design), the supercomputing industry (porting UNIX), and Steve Jobs' NeXT Computer (software evangelism), as well as founding his own consulting firm. Dan started working with US federal data standards in 2002 and was active in the adoption of the National Information Exchange Model (NIEM). Dan started doing NoSQL development in 2006 when he was exposed to native XML databases for storing form data. He has served as an invited expert on the World Wide Web XForms standard group and is a cofounder of the NoSQL Now! Conference.
ANN KELLY is a software consultant with Kelly McCreary & Associates. After spending much of her career working in the insurance industry developing software and managing projects, she became a NoSQL convert in 2011. Since then, she has worked with her customers to create NoSQL solutions that allow them to solve their business problems quickly and efficiently while providing them with the training to manage their own applications.
Part 1
Introduction
In part 1 we introduce you to the topic of NoSQL. We define the term NoSQL, talk about why the NoSQL movement got started, look at the core topics, and review the business benefits of including NoSQL solutions in your organization.

In chapter 1 we begin by defining NoSQL and talk about the business drivers and motivations behind the NoSQL movement. Chapter 2 expands on the foundation in chapter 1 and provides a review of the core concepts and important definitions associated with NoSQL.

If you're already familiar with the NoSQL movement, you may want to skim chapter 1. Chapter 2 contains core concepts and definitions associated with NoSQL. We encourage everyone to read chapter 2 to gain an understanding of these concepts, as they'll be referenced often and applied throughout the book.
1 NoSQL: It's about making intelligent choices
This chapter covers
What’s NoSQL? NoSQL business drivers NoSQL case studies
The complexity for minimum component costs has increased at a rate of roughly a factor of two per year...Certainly over the short term this rate can be expected to continue, if not to increase.
—Gordon Moore, 1965

…Then you better start swimmin'…Or you'll sink like a stone…For the times they are a-changin'.
—Bob Dylan
In writing this book we have two goals: first, to describe NoSQL databases, and second, to show how NoSQL systems can be used as standalone solutions or to augment current SQL systems to solve business problems. Though we invite anyone who has an interest in NoSQL to use this as a guide, the information, examples,
and case studies are targeted toward technical managers, solution architects, and data architects who are interested in learning about NoSQL.

This material will help you objectively evaluate SQL and NoSQL database systems to see which business problems they solve. If you're looking for a programming guide for a particular product, you've come to the wrong place. Here you'll find information about the motivations behind NoSQL, as well as related terminology and concepts. There may be sections and chapters of this book that cover topics you already understand; feel free to skim or skip over them and focus on the unknown.

Finally, we feel strongly about and focus on standards. The standards associated with SQL systems allow applications to be ported between databases using a common language. Unfortunately, NoSQL systems can't yet make this claim. In time, NoSQL application vendors will pressure NoSQL database vendors to adopt a set of standards to make them as portable as SQL.

In this chapter, we'll begin by giving a definition of NoSQL. We'll talk about the business drivers and motivations that make NoSQL so intriguing to and popular with organizations today. Finally, we'll look at five case studies where organizations have successfully implemented NoSQL to solve a particular business problem.

1.1 What is NoSQL?

One of the challenges with NoSQL is defining it. The term NoSQL is problematic since it doesn't really describe the core themes in the NoSQL movement. The term originated from a group in the Bay Area who met regularly to talk about common concerns and issues surrounding scalable open source databases, and it stuck. Descriptive or not, it seems to be everywhere: in trade press, product descriptions, and conferences. We'll use the term NoSQL in this book as a way of differentiating a system from a traditional relational database management system (RDBMS).

For our purpose, we define NoSQL in the following way: NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability, and agility.

Seems like a broad definition, right? It doesn't exclude SQL or RDBMS systems, right? That's not a mistake. What's important is that we identify the core themes behind NoSQL, what it is, and most importantly what it isn't. So what is NoSQL?
■ It's more than rows in tables—NoSQL systems store and retrieve data from many formats: key-value stores, graph databases, column-family (Bigtable) stores, document stores, and even rows in tables.
■ It's free of joins—NoSQL systems allow you to extract your data using simple interfaces without joins.
■ It's schema-free—NoSQL systems allow you to drag-and-drop your data into a folder and then query it without creating an entity-relational model.
■ It works on many processors—NoSQL systems allow you to store your database on multiple processors and maintain high-speed performance.
■ It uses shared-nothing commodity computers—Most (but not all) NoSQL systems leverage low-cost commodity processors that have separate RAM and disk.
■ It supports linear scalability—When you add more processors, you get a consistent increase in performance.
■ It's innovative—NoSQL offers options to a single way of storing, retrieving, and manipulating data. NoSQL supporters (also known as NoSQLers) have an inclusive attitude about NoSQL and recognize SQL solutions as viable options. To the NoSQL community, NoSQL means "Not only SQL."

Equally important is what NoSQL is not:
■ It's not about the SQL language—The definition of NoSQL isn't an application that uses a language other than SQL. SQL as well as other query languages are used with NoSQL databases.
■ It's not only open source—Although many NoSQL systems have an open source model, commercial products use NoSQL concepts as well as open source initiatives. You can still have an innovative approach to problem solving with a commercial product.
■ It's not only big data—Many, but not all, NoSQL applications are driven by the inability of a current application to efficiently scale when big data is an issue. Though volume and velocity are important, NoSQL also focuses on variability and agility.
■ It's not about cloud computing—Many NoSQL systems reside in the cloud to take advantage of its ability to rapidly scale when the situation dictates. NoSQL systems can run in the cloud as well as in your corporate data center.
■ It's not about a clever use of RAM and SSD—Many NoSQL systems focus on the efficient use of RAM or solid state disks to increase performance. Though this is important, NoSQL systems can run on standard hardware.
■ It's not an elite group of products—NoSQL isn't an exclusive club with a few products. There are no membership dues or tests required to join. To be considered a NoSQLer, you only need to convince others that you have innovative solutions to their business problems.

NoSQL applications use a variety of data store types (databases). From the simple key-value store that associates a unique key with a value, to graph stores used to associate relationships, to document stores used for variable data, each NoSQL type of data store has unique attributes and uses as identified in table 1.1.
Table 1.1 Types of NoSQL data stores—the four main categories of NoSQL systems, and sample products for each data store type

Key-value store—A simple data storage system that uses a key to access a value
    Typical usage: image stores; key-based filesystems; object cache; systems designed to scale
    Examples: Berkeley DB, Memcache, Redis, Riak, DynamoDB

Column family store—A sparse matrix system that uses a row and a column as keys
    Typical usage: web crawler results; big data problems that can relax consistency rules
    Examples: Apache HBase, Apache Cassandra, Hypertable, Apache Accumulo

Graph store—For relationship-intensive problems
    Typical usage: social networks; fraud detection; relationship-heavy data
    Examples: Neo4j, AllegroGraph, Bigdata (RDF data store), InfiniteGraph (Objectivity)

Document store—Storing hierarchical data structures directly in the database
    Typical usage: high-variability data; document search; integration hubs; web content management; publishing
    Examples: MongoDB (10Gen), CouchDB, Couchbase, MarkLogic, eXist-db, Berkeley DB XML
NoSQL systems have unique characteristics and capabilities that can be used alone or in conjunction with your existing systems. Many organizations considering NoSQL systems do so to overcome common issues such as volume, velocity, variability, and agility, the business drivers behind the NoSQL movement.

1.2 NoSQL business drivers

The scientist-philosopher Thomas Kuhn coined the term paradigm shift to identify a recurring process he observed in science, where innovative ideas came in bursts and impacted the world in nonlinear ways. We'll use Kuhn's concept of the paradigm shift as a way to think about and explain the NoSQL movement and the changes in thought patterns, architectures, and methods emerging today.

Many organizations supporting single-CPU relational systems have come to a crossroads: the needs of their organizations are changing. Businesses have found value in rapidly capturing and analyzing large amounts of variable data, and making immediate changes in their businesses based on the information they receive.

Figure 1.1 shows how the demands of volume, velocity, variability, and agility play a key role in the emergence of NoSQL solutions. As each of these drivers applies pressure to the single-processor relational model, its foundation becomes less stable and in time no longer meets the organization's needs.
Figure 1.1 [Diagram: the four business drivers—volume, velocity, variability, and agility—pressing on a cracked single-node RDBMS.] In this figure, we see how the business drivers volume, velocity, variability, and agility apply pressure to the single CPU system, resulting in the cracks. Volume and velocity refer to the ability to handle large datasets that arrive quickly. Variability refers to how diverse data types don't fit into structured tables, and agility refers to how quickly an organization responds to business change.
1.2.1 Volume

Without a doubt, the key factor pushing organizations to look at alternatives to their current RDBMSs is a need to query big data using clusters of commodity processors. Until around 2005, performance concerns were resolved by purchasing faster processors. In time, the ability to increase processing speed was no longer an option. As chip density increased, heat could no longer dissipate fast enough without chip overheating. This phenomenon, known as the power wall, forced systems designers to shift their focus from increasing speed on a single chip to using more processors working together. The need to scale out (also known as horizontal scaling), rather than scale up (faster processors), moved organizations from serial to parallel processing where data problems are split into separate paths and sent to separate processors to divide and conquer the work.
1.2.2 Velocity

Though big data problems are a consideration for many organizations moving away from RDBMSs, the ability of a single processor system to rapidly read and write data is also key. Many single-processor RDBMSs are unable to keep up with the demands of real-time inserts and online queries to the database made by public-facing websites. RDBMSs frequently index many columns of every new row, a process which decreases system performance. When single-processor RDBMSs are used as a back end to a web store front, the random bursts in web traffic slow down response for everyone, and tuning these systems can be costly when both high read and write throughput is desired.
1.2.3 Variability

Companies that want to capture and report on exception data struggle when attempting to use rigid database schema structures imposed by RDBMSs. For example, if a business unit wants to capture a few custom fields for a particular customer, all customer rows within the database need to store this information even though it doesn't apply. Adding new columns to an RDBMS requires the system be shut down and ALTER TABLE commands to be run. When a database is large, this process can impact system availability, costing time and money.
1.2.4 Agility

The most complex part of building applications using RDBMSs is the process of putting data into and getting data out of the database. If your data has nested and repeated subgroups of data structures, you need to include an object-relational mapping layer. The responsibility of this layer is to generate the correct combination of INSERT, UPDATE, DELETE, and SELECT SQL statements to move object data to and from the RDBMS persistence layer. This process isn't simple and is associated with the largest barrier to rapid change when developing new or modifying existing applications.

Generally, object-relational mapping requires experienced software developers who are familiar with object-relational frameworks such as Java Hibernate (or NHibernate for .Net systems). Even with experienced staff, small change requests can cause slowdowns in development and testing schedules.

You can see how velocity, volume, variability, and agility are the high-level drivers most frequently associated with the NoSQL movement. Now that you're familiar with these drivers, you can look at your organization to see how NoSQL solutions might impact these drivers in a positive way to help your business meet the changing demands of today's competitive marketplace.

1.3 NoSQL case studies

Our economy is changing. Companies that want to remain competitive need to find new ways to attract and retain their customers. To do this, the technology and people who create it must support these efforts quickly and in a cost-effective way. New thoughts about how to implement solutions are moving away from traditional methods toward processes, procedures, and technologies that at times seem bleeding-edge.

The following case studies demonstrate how business problems have successfully been solved faster, cheaper, and more effectively by thinking outside the box. Table 1.2 summarizes five case studies where NoSQL solutions were used to solve particular business problems. It presents the problems, the business drivers, and the ultimate findings. As you view subsequent sections, you'll begin to see a common theme emerge: some business problems require new thinking and technology to provide the best solution.
Table 1.2 The key case studies associated with the NoSQL movement—the name of the case study/standard, the business drivers, and the results (findings) of the selected solutions

LiveJournal's Memcache
    Driver: Need to increase performance of database queries.
    Finding: By using hashing and caching, data in RAM can be shared. This cuts down the number of read requests sent to the database, increasing performance.

Google's MapReduce
    Driver: Need to index billions of web pages for search using low-cost hardware.
    Finding: By using parallel processing, indexing billions of web pages can be done quickly with a large number of commodity processors.

Google's Bigtable
    Driver: Need to flexibly store tabular data in a distributed system.
    Finding: By using a sparse matrix approach, users can think of all data as being stored in a single table with billions of rows and millions of columns without the need for up-front data modeling.

Amazon's Dynamo
    Driver: Need to accept a web order 24 hours a day, 7 days a week.
    Finding: A key-value store with a simple interface can be replicated even when there are large volumes of data to be processed.

MarkLogic
    Driver: Need to query large collections of XML documents stored on commodity hardware using standard query languages.
    Finding: By distributing queries to commodity servers that contain indexes of XML documents, each server can be responsible for processing data in its own local disk and returning the results to a query server.
1.3.1 Case study: LiveJournal's Memcache

Engineers working on the blogging system LiveJournal started to look at how their systems were using their most precious resource: the RAM in each web server. LiveJournal had a problem. Their website was so popular that the number of visitors using the site continued to increase on a daily basis. The only way they could keep up with demand was to continue to add more web servers, each with its own separate RAM.

To improve performance, the LiveJournal engineers found ways to keep the results of the most frequently used database queries in RAM, avoiding the expensive cost of rerunning the same SQL queries on their database. But each web server had its own copy of the query in RAM; there was no way for any web server to know that the server next to it in the rack already had a copy of the query sitting in RAM.

So the engineers at LiveJournal created a simple way to create a distinct "signature" of every SQL query. This signature or hash was a short string that represented a SQL SELECT statement. By sending a small message between web servers, any web server could ask the other servers if they had a copy of the SQL result already executed. If one did, it would return the results of the query and avoid an expensive round trip to the already overwhelmed SQL database. They called their new system Memcache because it managed RAM memory cache.

Many other software engineers had come across this problem in the past. The concept of large pools of shared-memory servers wasn't new. What was different this time was that the engineers for LiveJournal went one step further. They not only made this system work (and work well), they shared their software using an open source license, and they also standardized the communications protocol between the web front ends (called the memcached protocol). Now anyone who wanted to keep their database from getting overwhelmed with repetitive queries could use their front end tools.
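To make the pattern concrete, here is a minimal sketch of this query-signature caching in Python. It's an illustration of the idea under our own assumptions, not LiveJournal's code: the module-level cache dictionary stands in for a shared memcached cluster, and run_sql stands in for a real database driver call.

import hashlib

cache = {}  # stand-in for a memcached cluster shared by all web servers

def query_signature(sql):
    """Create a short, distinct signature (hash) for a SQL SELECT statement."""
    return hashlib.md5(sql.encode("utf-8")).hexdigest()

def cached_query(sql, run_sql):
    """Return a cached result when this exact query has already been executed."""
    key = query_signature(sql)
    if key in cache:
        return cache[key]    # skip the round trip to the overloaded database
    result = run_sql(sql)    # expensive call to the SQL database
    cache[key] = result
    return result

In a real deployment the dictionary lookup becomes a network request using the memcached protocol, but the contract is the same: hash the query text, check the shared cache, and only fall through to the database on a miss.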
1.3.2 Case study: Google's MapReduce—use commodity hardware to create search indexes

One of the most influential case studies in the NoSQL movement is the Google MapReduce system, described in a paper Google published in 2004. In this paper, Google shared their process for transforming large volumes of web data content into search indexes using low-cost commodity CPUs.

Though sharing of this information was significant, the concepts of map and reduce weren't new. Map and reduce functions are simply names for two stages of a data transformation, as described in figure 1.2. The initial stages of the transformation are called the map operation. They're responsible for data extraction, transformation, and filtering of data. The results of the map operation are then sent to a second layer: the reduce function. The reduce function is where the results are sorted, combined, and summarized to produce the final result.

The core concepts behind the map and reduce functions are based on solid computer science work that dates back to the 1950s when programmers at MIT implemented these functions in the influential LISP system. LISP was different than other programming languages because it emphasized functions that transformed isolated lists of data. This focus is now the basis for many modern functional programming languages that have desirable properties on distributed systems.

Google extended the map and reduce functions to reliably execute on billions of web pages on hundreds or thousands of low-cost commodity CPUs. Google made map and reduce work reliably on large volumes of data and did it at a low cost. It was Google's use of MapReduce that encouraged others to take another look at the power of functional programming and the ability of functional programming systems to scale over thousands of low-cost CPUs. Software packages such as Hadoop have closely modeled these functions.
Figure 1.2 [Diagram: input data flows through parallel map operations into a shuffle/sort layer and then into a reduce operation that emits the final result. The map layer extracts the data from the input and transforms the results into key-value pairs, which are sent to the shuffle/sort layer; the shuffle/sort layer returns the key-value pairs sorted by the keys; the reduce layer collects the sorted results and performs counts and totals before it returns the final result.] The map and reduce functions are ways of partitioning large datasets into smaller chunks that can be transformed on isolated and independent transformation systems. The key is isolating each function so that it can be scaled onto many servers.
The use of MapReduce inspired engineers from Yahoo! and other organizations to create open source versions of Google’s MapReduce. It fostered a growing awareness of the limitations of traditional procedural programming and encouraged others to use functional programming systems.
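To see the two stages in miniature, here is a word-count sketch in Python. It's our illustration of the map and reduce roles, not Google's implementation: the map function emits key-value pairs, the sorted() call plays the part of the shuffle/sort layer, and the reduce function totals each sorted group.

from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Extract and transform: emit a (word, 1) key-value pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Combine and summarize: total the counts for each distinct key."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["nosql is not only sql", "sql is not nosql"]
print(dict(reduce_phase(map_phase(docs))))
# {'is': 2, 'nosql': 2, 'not': 2, 'only': 1, 'sql': 2}

Because each map call touches only its own input and each reduce call touches only its own key group, the two phases can be spread across thousands of machines with little coordination.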
1.3.3 Case study: Google's Bigtable—a table with a billion rows and a million columns

Google also influenced many software developers when they announced their Bigtable system white paper titled A Distributed Storage System for Structured Data. The motivation behind Bigtable was the need to store results from the web crawlers that extract HTML pages, images, sounds, videos, and other media from the internet. The resulting dataset was so large that it couldn't fit into a single relational database, so Google built their own storage system. Their fundamental goal was to build a system that would easily scale as their data increased without forcing them to purchase expensive hardware. The solution was neither a full relational database nor a filesystem, but what they called a "distributed storage system" that worked with structured data.

By all accounts, the Bigtable project was extremely successful. It gave Google developers a single tabular view of the data by creating one large table that stored all the data they needed. In addition, they created a system that allowed the hardware to be located in any data center, anywhere in the world, and created an environment where developers didn't need to worry about the physical location of the data they manipulated.
1.3.4 Case study: Amazon's Dynamo—accept an order 24 hours a day, 7 days a week

Google's work focused on ways to make distributed batch processing and reporting easier, but wasn't intended to support the need for highly scalable web storefronts that ran 24/7. This development came from Amazon. Amazon published another significant NoSQL paper: Amazon's 2007 Dynamo: A Highly Available Key-Value Store. The business motivation behind Dynamo was Amazon's need to create a highly reliable web storefront that supported transactions from around the world 24 hours a day, 7 days a week, without interruption.

Traditional brick-and-mortar retailers that operate in a few locations have the luxury of having their cash registers and point-of-sale equipment operating only during business hours. When not open for business, they run daily reports, and perform backups and software upgrades. The Amazon model is different. Not only are their customers from all corners of the world, but they shop at all hours of the day, every day. Any downtime in the purchasing cycle could result in the loss of millions of dollars. Amazon's systems need to be iron-clad reliable and scalable without a loss in service.

In its initial offerings, Amazon used a relational database to support its shopping cart and checkout system. They had unlimited licenses for RDBMS software and a consulting budget that allowed them to attract the best and brightest consultants for
their projects. In spite of all that power and money, they eventually realized that a relational model wouldn't meet their future business needs.

Many in the NoSQL community cite Amazon's Dynamo paper as a significant turning point in the movement. At a time when relational models were still used, it challenged the status quo and current best practices. Amazon found that because key-value stores had a simple interface, it was easier to replicate the data and more reliable. In the end, Amazon used a key-value store to build a turnkey system that was reliable, extensible, and able to support their 24/7 business model, making them one of the most successful online retailers in the world.
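The simplicity Amazon exploited is easy to show in code: the entire client contract of a key-value store is two operations, put and get. The Python sketch below is our own illustration of why such a narrow interface is straightforward to replicate, not Dynamo's actual API.

class KeyValueStore:
    """The whole interface: store a value by key, fetch a value by key."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

class ReplicatedStore:
    """Write every put to all replicas; serve a get from any replica that has it."""
    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, value):
        for replica in self.replicas:   # one simple operation to propagate
            replica.put(key, value)

    def get(self, key):
        for replica in self.replicas:
            value = replica.get(key)
            if value is not None:
                return value
        return None

cart = ReplicatedStore([KeyValueStore() for _ in range(3)])
cart.put("customer-42:cart", ["book-123", "book-456"])

With only two operations to propagate, there are no joins, constraints, or multi-table transactions to keep consistent across nodes, which is one reason replication is so much easier here than it is for a relational schema.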
1.3.5 Case study: MarkLogic

In 2001 a group of engineers in the San Francisco Bay Area with experience in document search formed a company that focused on managing large collections of XML documents. Because XML documents contained markup, they named the company MarkLogic.

MarkLogic defined two types of nodes in a cluster: query and document nodes. Query nodes receive query requests and coordinate all activities associated with executing a query. Document nodes contain XML documents and are responsible for executing queries on the documents in the local filesystem. Query requests are sent to a query node, which distributes queries to each remote server that contains indexed XML documents. All document matches are returned to the query node. When all document nodes have responded, the query result is then returned. A simplified sketch of this query flow appears at the end of this section.

The MarkLogic architecture, moving queries to documents rather than moving documents to the query server, allowed them to achieve linear scalability with petabytes of documents.

MarkLogic found a demand for their products in US federal government systems that stored terabytes of intelligence information and large publishing entities that wanted to store and search their XML documents. Since 2001, MarkLogic has matured into a general-purpose highly scalable document store with support for ACID transactions and fine-grained, role-based access control. Initially, the primary language of MarkLogic developers was XQuery paired with REST; newer versions support Java as well as other language interfaces.

MarkLogic is a commercial product that requires a software license for any datasets over 40 GB. NoSQL is associated with commercial as well as open source products that provide innovative solutions to business problems.
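The query-node/document-node split is a scatter-gather design. The sketch below is a simplified Python rendering of our own (not MarkLogic code): the query node fans a query out to every document node, each node matches only against its local documents, and the query node merges the results.

class DocumentNode:
    """Holds a local set of documents and runs queries against its own disk."""
    def __init__(self, documents):
        self.documents = documents

    def execute(self, query):
        return [doc for doc in self.documents if query in doc]

class QueryNode:
    """Coordinates a query: scatter to all document nodes, gather the matches."""
    def __init__(self, document_nodes):
        self.document_nodes = document_nodes

    def query(self, query):
        results = []
        for node in self.document_nodes:   # the query moves to the documents
            results.extend(node.execute(query))
        return results

cluster = QueryNode([
    DocumentNode(["<doc>alpha</doc>", "<doc>beta</doc>"]),
    DocumentNode(["<doc>alpha beta</doc>"]),
])
print(cluster.query("alpha"))   # ['<doc>alpha</doc>', '<doc>alpha beta</doc>']

Because each document node searches only its local index, adding nodes adds capacity without adding coordination cost, which is what allows this design to scale linearly.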
1.3.6 Applying your knowledge
To demonstrate how the concepts in this book can be applied, we introduce you to Sally Solutions. Sally is a solution architect at a large organization that has many business units. Business units that have information management issues are assigned a solution architect to help them select the best solution to their information challenge.
Sally works on projects that need custom applications developed, and she's knowledgeable about SQL and NoSQL technologies. Her job is to find the best fit for the business problem.
Now let's see how Sally applies her knowledge in two examples. In the first example, a group that needed to track equipment warranties for hardware purchases came to Sally for advice. Since the hardware information was already in an RDBMS and the team had experience with SQL, Sally recommended they extend the RDBMS to include warranty information and create reports using joins. In this case, it was clear that SQL was appropriate.
In the second example, a group in charge of storing digital image information within a relational database approached Sally because the database's performance was negatively impacting their web application's page rendering. In this case, Sally recommended moving all images to a key-value store, which referenced each image with a URL. A key-value store is optimized for read-intensive applications and works well with content distribution networks. After removing the image management load from the RDBMS, the web application, as well as other applications, saw an improvement in performance.
Note that Sally doesn't see her job as a black-and-white, RDBMS-versus-NoSQL selection process. Sometimes the best solution involves a hybrid approach.
1.4 Summary
This chapter began with an introduction to the concept of NoSQL and reviewed the core business drivers behind the NoSQL movement. We then showed how the power wall forced system designers to use highly parallel processing designs, which required a new type of thinking for managing data. You also saw that traditional systems that use object middle tiers and RDBMS databases require complex object-relational mapping systems to manipulate the data. These layers often get in the way of an organization's ability to react quickly to changes (agility).
When we venture into any new technology, it's critical to understand that each area has its own patterns of problem solving, and these patterns vary dramatically from technology to technology. Making the transition from SQL to NoSQL is no different. NoSQL is a new paradigm that requires a new set of pattern recognition skills, new ways of thinking, and new ways of solving problems. It requires a new cognitive style.
Opting to use NoSQL technologies can help organizations gain a competitive edge in their market, making them more agile and better equipped to adapt to changing business conditions. NoSQL approaches that leverage large numbers of commodity processors save companies time and money and increase service reliability.
As you've seen in the case studies, these changes impacted more than early technology adopters: engineers around the world realize there are alternatives to the RDBMS-as-our-only-option mantra. New companies focused on new thinking, technologies, and architectures have emerged not as a lark, but as a necessity to solving real
business problems that don't fit into a relational mold. As organizations continue to change and move into global economies, this trend will continue to expand.
As we move into the next chapter, we'll begin looking at the core concepts and technologies associated with NoSQL. We'll talk about simplicity of design and see how it's fundamental to creating NoSQL systems that are modular, scalable, and ultimately lower in cost to you and your organization.
NoSQL concepts
This chapter covers
■ NoSQL concepts
■ ACID and BASE for reliable database transactions
■ How to minimize downtime with database sharding
■ Brewer's CAP theorem
Less is more. —Ludwig Mies van der Rohe
In this chapter, we'll cover the core concepts associated with NoSQL systems. After reading it, you'll be able to recognize and define NoSQL concepts and terms, you'll understand NoSQL vendor products and features, and you'll be able to decide whether those features are appropriate for your NoSQL system. We'll start with a discussion of how using simple components in the application development process removes complexity and promotes reuse, saving you time and money in your system's design and maintenance.
2.1 Keeping components simple to promote reuse
If you've worked with relational databases, you know how complex they can be. Generally, they begin as simple systems that, when requested, return a selected row
from a single flat file. Over time, they need to do more and evolve into systems that manage multiple tables, perform join operations, do query optimization, replicate transactions, run stored procedures, set triggers, enforce security, and perform indexing. NoSQL systems take a different approach to solving these complex problems, creating simple applications that distribute the required features across the network.
Keeping your architectural components simple allows you to reuse them between applications, aids developers in understanding and testing them, and makes it easier to port your application to new architectures. From the NoSQL perspective, simple is good. When you create an application, it's not necessary to include all functions in a single piece of software. Application functions can be distributed across many NoSQL (and SQL) databases that consist of simple tools with simple interfaces and well-defined roles. NoSQL products that follow this rule do a few things, and they do them well.
To illustrate, we'll look at how systems can be built using well-defined functions and focus on how easy it is to build these new functions. If you're familiar with UNIX operating systems, you might be familiar with the concept of UNIX pipes: a set of processes chained together so that the output of one process becomes the input to the next. Like UNIX pipes, NoSQL systems are often created by integrating a large number of modular functions that work together. An example of using UNIX pipes to count the total number of figures in a book is illustrated in figure 2.1.
What's striking about this example is that by typing about 40 characters you've created a useful function. This task would be much more difficult on systems that don't support UNIX-style pipes. In reality, only a query on a native XML database might be shorter than this command, but it wouldn't be as general-purpose.
Many NoSQL systems are created using a similar philosophy of modular components that work together. Instead of having a single large database layer, they often
Figure 2.1 A small UNIX pipeline that counts the figures in a book: all the chapter files are concatenated (cat ch*.xml), a grep stage extracts only the lines that contain a figure element, and the matching lines are counted.
DocBook is frequently customized for different types of publishing. Each organization that's publishing a document will select a subset of DocBook elements and then add its own elements to meet the needs of its specific application. For example, a math textbook might include XML markup for equations (MathML), a chemistry textbook might include markup for chemical symbols (ChemML), and an economics textbook might add charts in XML format. These new XML vocabularies can be placed in different namespaces and added to DocBook XML without disrupting the publishing processes.
7.8.2 Retaining document structure in a NoSQL document store
There are several ways to perform search on large collections of DocBook files. The most straightforward is to strip out all the markup and send each document to Apache Lucene to create a reverse index. Each word would then be associated with a single document ID. The problem with this approach is that all the information about a word's location within the document is lost. If a word occurs in a book or chapter title, it can't be ranked higher than if it occurs in a bibliographic note. Ideally, you want to retain the entire document structure and store the XML file in a native XML database. Then any match within a title can be given a higher rank than a match within the body of the text.
The first step in creating a search function is to load all the XML documents into a collection structure. This structure logically groups similar documents and makes it easy to navigate them, similar to a file browser. After the documents have been loaded, you can run a script to find all unique elements in the document collection. This is known as an element inventory.
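To make the element-inventory step concrete, here's a minimal sketch of such a script. This is our illustration rather than the book's code, and the collection folder name is hypothetical:

import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# Count how often each element name appears across the collection.
inventory = Counter()
for doc in Path("docbook-collection").rglob("*.xml"):
    for element in ET.parse(doc).iter():
        inventory[element.tag] += 1

# The sorted inventory is the starting point for deciding which
# elements deserve a range index, a full-text index, or a boost value.
for tag, count in inventory.most_common():
    print(tag, count)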
The element inventory is then used as a basis for deciding which elements might contain information that you want to index for quick searches, and which index types you'll use. Elements that contain dates might use a range index, and elements such as titles might be assigned higher boost values than ordinary text.
Figure: sample boost values for ranking (for example, paragraph text 1.0 and bibliographic reference 0.5).
We should note that the boost values are also stored with the search result indexes so that they can be used to create precise search rankings.
This means that if you change the boost values, the documents must be re-indexed. Although this example is somewhat simplified, it shows that accurate markup of book elements is critical to the search ranking process.
Once you've determined the elements and boost values, you'll create a configuration file that identifies the fields you're interested in indexing. From there you can run a process that takes each document and creates a reverse full-text index using the elements and boost values from your configuration file. Apache Lucene is an example of a framework that creates and maintains these types of indexes. All the keywords found in an element can then be associated with that element through a node identifier. By storing the element node as well as the document, you know exactly which element of the document the keyword was found in.
After indexing, you're ready to create search functions that can work with both range and full-text indexes. The most common way to integrate text searches is with an XQuery full-text library that returns the ranked results of a keyword query. The query is similar to a WHERE clause in SQL, but it also returns a score used to order the search results. Your XQuery can return any type of node within DocBook, such as a book, article, chapter, section, figure, or bibliographic entry. The final step is to return a fragment of HTML for each hit in the search. At the top of the page, you'll see the hits with the highest score. Most search tools return a block of text that shows the keyword highlighted within the surrounding text. This is known as a keyword-in-context (KWIC) function.
7.9 Case study: searching domain-specific languages—findability and reuse
Although we frequently think of search quality as a characteristic associated with large numbers of text documents, there are also benefits to finding items such as software subroutines or specific types of programs created with domain-specific languages (DSLs). This case study shows how a search tool saved an organization time and money by allowing employees to find and reuse financial chart objects.
A large financial institution had thousands of charts used to create graphical financial dashboards. Most charts were generated from an XML specification file that described the features of each chart, such as the chart type (line chart, bar chart, scatterplot), title, axes, scaling, and labels. One of the challenges the dashboard authors faced was how to lower the cost of creating a new chart by using an existing chart as a starting template.
All charts were stored on a standard filesystem. Each organization that requested charts had a folder that contained its charts. Because of this structure, there was no way to find charts sorted by their characteristics. Experienced chart authors knew where to look in the filesystem for an example of a template, but new chart authors often spent hours digging through old charts to find a template that matched the new requirement.
One day a new staff member spent most of his day re-creating a chart when a similar chart already existed but couldn't be found. In a staff meeting, a manager asked if there was some way the charts could be loaded into a database and searched.
Storing the charts in a relational database would've been a months-long task. There were hundreds of chart properties and multiple chart variations. Even the process of adding keywords to each chart and listing them in a Word document would've been time consuming. This is an excellent example of how high-variability data is best stored in a NoSQL system.
Instead of loading the charts into an RDBMS, the charts were loaded into an open source native XML document store (eXist-db), and a series of path expressions were created to search for various chart types. For example, all charts that had time across the horizontal x-axis could be found using an XPath expression on the x-axis descriptor. After finding specific charts with queries, chart keywords could be added to the charts using XQuery update statements.
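As a rough illustration of the path-expression idea (the team used XPath and XQuery against eXist-db; this sketch uses Python's standard library instead, and the chart vocabulary and folder name are hypothetical):

import xml.etree.ElementTree as ET
from pathlib import Path

def has_time_x_axis(chart_file):
    """Return True if the chart's x-axis descriptor is a time axis."""
    root = ET.parse(chart_file).getroot()
    axis = root.find(".//x-axis")            # hypothetical element name
    return axis is not None and axis.get("type") == "datetime"

# Collect every chart specification whose horizontal axis is time-based.
time_charts = [f for f in Path("charts").rglob("*.xml") if has_time_x_axis(f)]
print(len(time_charts), "time-series charts found")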
You might find it ironic that the XML-based charting system was the preferred solution of a department with hundreds of person-years of RDBMS experience. But the cost estimates for developing a full RDBMS solution seriously outweighed the benefits. Since the data was in XML format, there was no need for data modeling; they simply loaded and queried the information.
A search form was then added to find all charts with specific properties. The chart titles, descriptions, and developer-note elements were indexed using the Apache Lucene full-text indexing tools. The search form allowed users to restrict searches by various chart properties, organization, and dates. After entering search criteria, the user performed a search, and preview icons of the charts were returned directly in the search results page.
As a result of creating the chart search service, the time to find a chart in the chart library dropped from hours to seconds. A close match to the new target chart was usually returned within the first 10 results on the search screen. The company achieved additional benefits from being able to perform queries over all the prior charts. Quality and consistency reports were created to show which charts were consistent with the bank's approved style guide. New charts could also be validated against quality and consistency guidelines before they were used by a business unit.
An unexpected result of the new system was that other groups within the organization began to use the financial dashboard system. Instead of building custom charts with low-level C programs, statistical packages, or Microsoft Excel, there was increased use of the XML chart standard, because non-experts could quickly find a chart similar to their needs. Users also knew that if they created a high-quality chart and added it to the database, there was a greater chance that others could reuse their work.
This case study shows that as software systems increase in complexity, finding the right chunk of code becomes increasingly important. Software reuse starts with findability. The phrase "you can't reuse what you can't find" is a good summary of this approach.
7.10 Apply your knowledge
Sally works in an information technology department with many people involved in the software development lifecycle (SDLC). SDLC documents include requirements, use cases, test plans, business rules, business terminology, report definitions, bug reports, and documentation, as well as the actual source code being developed. Although Sally's department built high-quality search solutions for their business units, the proverb "The shoemaker's children go barefoot" seemed to apply to their own group. SDLC documents were stored in many formats, such as MS Word, wikis, spreadsheets, and source code repositories, in many locations. There were always multiple versions, and it wasn't clear which versions were approved by a business unit or who should approve documents. The source code repositories the department used had strong keyword search, yet there was no way users could perform faceted queries such as "show all new features in the 3.0 release of an internal product approved by Sue Johnson after June 1."
Sally realized that putting SDLC documents in a single NoSQL database with integrated search features could help alleviate these problems. All SDLC artifacts, from requirements to source code to bug reports, could be treated as documents and searched with the tools provided by the NoSQL database vendor. Documents that had structure could also be queried using faceted search interfaces. Since almost all documents had timestamps, the database could create timeline views that allowed users to see when code was checked in and by which developers, and relate these events to bugs and problem reports.
The department also started to add more metadata to the searchable database, including information about database elements and their definitions, lists of tables, columns, business rules, and process flows. This became a flexible metadata registry for an officially reviewed and approved "single version of the truth."
Using a NoSQL database as an integrated document store and metadata registry allowed the team to quickly increase the productivity of the department. In time, new web forms and easy-to-modify wiki-like structures were created to make it easier for developers to add and update SDLC data.
7.11 Summary
In the chapter on big data, you saw that the amount of data generated by the web and internal systems continues to grow exponentially. As organizations continue to put this information to use, the ability to locate the right information at the right time is of growing concern. In this chapter, we've focused on showing you how to find the right item in your big data collection. We've talked about the types of searches your NoSQL database can perform and the ways in which NoSQL systems make searching fast.
You've seen how retaining a document's structure in a document store can increase the quality of search results. This process is enabled by associating a keyword not with a whole document, but with the element that contains the keyword within a document.
Although we focused on topics you'll need to fairly evaluate the search components of a NoSQL system, we also demonstrated that highly scalable processes such as MapReduce can be used to create the reverse indexes that enable fast search. Finally, our case studies showed how search solutions can be created using open source native XML databases and Apache Lucene frameworks.
Both the previous chapter on big data and this chapter on search emphasize the need for multiple processors working together to solve problems. Most NoSQL systems are a great fit for these tasks. NoSQL databases integrate the complex concepts of information retrieval to increase the findability of items in your database. In our next chapter, we'll focus on high availability: how to keep all these systems running reliably.
7.12 Further reading
AnyChart. "Building a Large Chart Ecosystem with AnyChart and Native XML Databases." http://mng.bz/Pknr.
DocBook. http://docbook.org/.
"Faceted search." Wikipedia. http://mng.bz/YgQq.
Feldman, Susan, and Chris Sherman. "The High Cost of Not Finding Information." 2001. http://mng.bz/IX01.
Manning, Christopher, et al. Introduction to Information Retrieval. Cambridge University Press, 2008.
McCreary, Dan. "Entity Extraction and the Semantic Web." http://mng.bz/20A7.
Morville, Peter. Ambient Findability. O'Reilly Media, 2005.
NLP. "XML retrieval." http://mng.bz/1Q9i.
Building high-availability solutions with NoSQL
This chapter covers
■ What is high availability?
■ Measuring availability
■ NoSQL strategies for high availability
Anything that can go wrong will go wrong. —Murphy’s law
Have you ever been using a computer application when it suddenly stops responding? Intermittent database failures can be merely an annoyance in some situations, but in others, database availability can mean the success or failure of a business. NoSQL systems have a reputation for being able to scale out and handle big data problems. These same features can also be used to increase the availability of database servers.
There are several reasons databases fail: human error, network failure, hardware failure, and unanticipated load, to name a few. In this chapter, we won't dwell on human error or network failure. We'll focus on how NoSQL architectures use parallelism and replication to handle hardware failure and scaling issues.
You'll see how NoSQL databases can be configured to handle lots of data and keep data services running without downtime. We'll begin by defining high-availability database systems and then look at ways to measure and predict system availability. We'll review the techniques NoSQL systems use to stay available even when subcomponents fail. Finally, we'll look at three real-world NoSQL products associated with high-availability service.
8.1 What is a high-availability NoSQL database?
High-availability NoSQL databases are systems designed to run without interruption of service. Many web-based businesses require data services that are available without interruption. For example, databases that support online purchasing need to be available 24 hours a day, 7 days a week, 365 days a year. Some requirements take this a step further, specifying that the database service must be "always on." This means you can't take the database down for scheduled maintenance or to perform software upgrades.
Why must they be always on? Companies demanding an always-on environment can document a measurable loss of income for every minute their service isn't available. Let's say your database supports a global e-commerce site; being down for even a few minutes could wipe out a customer's shopping cart. Or what if your system stops responding during prime-time shopping hours in Germany? Interruptions like these can drive shoppers to a competitor's site and lower customer confidence.
From a software development perspective, always-on databases are a new requirement. Before the web, databases were designed to support "bankers' hours," such as 9 a.m. to 5 p.m., Monday through Friday, in a single time zone. During off hours, these systems might be scheduled for downtime to perform backups, run software updates, run reports, or export daily transactions to data warehouse systems. But bankers' hours are no longer appropriate for web-based businesses with customers around the world.
A web storefront is a good example of a situation that needs a high-availability database supporting both reads and writes. Read-mostly systems optimized for big data analysis are relatively simple to configure for high availability using data replication. Our focus here is on high availability for large-volume read/write applications that run on distributed systems.
Always-on database systems aren't new. Companies like Tandem Computers, with its NonStop systems, have provided commercial high-availability databases for ATM networks, telephone switches, and stock exchanges since the 1970s. These systems use symmetrical, peer-to-peer, shared-nothing processors that send messages between processors about overall system health. They use redundant storage and high-speed failover software to provide continuous database services. The biggest drawbacks to these systems are that they're proprietary, difficult to set up and configure, and expensive on a cost-per-transaction basis.
Distributed NoSQL systems can lower the per-transaction cost of systems that need to be both scalable and always on. Although most NoSQL systems use nonstandard
query languages, their design and the ability to be deployed on low-cost cloud com- puting platforms make it possible for startups, with minimal cash, to provide always-on databases for their worldwide customers. Understanding the concept of system availability is critical when you’re gathering high-level system requirements. Since NoSQL systems use distributed computing, they can be configured to achieve high availability of a specific service at a minimum cost. To understand the concepts in this chapter, we’ll draw on the CAP theorem from chapter 2. We said that when communication channels between partitions are broken, system designers need to choose the level of availability they’ll provide to their cus- tomers. Organizations often place a higher priority on availability than on consistency. The phase “A trumps C” implies that keeping orders flowing through a system is more important than consistent reporting during a temporary network failure to a replica server. Recall that these decisions are only relevant when there are network failures. During normal operations, the CAP theorem doesn’t come into play. Now that you have a good understanding of what high-availability NoSQL systems are and why they’re good choices, let’s find out how to measure availability. 8.2 Measuring availability of NoSQL databases System availability can be measured in different ways and with different levels of preci- sion. If you’re writing availability requirements or comparing the SLAs of multiple sys- tems, you may need to be specific about these measurements. We’ll start with some broad measures of overall system availability and then dig deeper into more subtle measurements of system availability. The most common notation for describing overall system availability is to state availability in terms of “nines,” which is a count of how many times the number 9 appears in the designed availability. So three nines means that a system is predicted to be up 99.9% of the time, and five nines means the system should be up 99.999% of the time. Table 8.1 shows some sample calculations of downtime per year based on typical availability targets. Stating your uptime requirements isn’t an exact science. Some businesses can asso- ciate a revenue-lost-per-minute to total data service unavailability. There are also gray areas where a system is so slow that a few customers abandon their shopping carts and
Table 8.1 Sample availability targets and annual downtime
Availability % Annual downtime
99% (two nines) 3.65 days
99.9% (three nines) 8.76 hours
99.99% (four nines) 52.56 minutes
99.999% (five nines) 5.26 minutes
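The downtime figures in this table follow directly from the arithmetic. Here's a minimal sketch (our illustration, not from the book) that reproduces them:

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def annual_downtime_minutes(availability_pct):
    """Expected minutes of downtime per year for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(pct, "->", round(annual_downtime_minutes(pct), 2), "minutes/year")
# 99.999 -> 5.26 minutes/year, matching the five-nines row above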
Stating your uptime requirements isn't an exact science. Some businesses can associate a revenue-lost-per-minute cost with total data service unavailability. There are also gray areas where a system is so slow that a few customers abandon their shopping carts and move on to another site. There are other factors, not as easily measured, that can lead to losses as well, such as a poor reputation and a lack of customer confidence.
Measuring overall system availability is more than generating a single number. To fairly evaluate NoSQL systems, you need an understanding of the subtleties of availability measurements. If a business unit indicates it can't afford to be down more than eight hours in a calendar year, then you want to build an infrastructure that provides roughly three-nines availability (about 8.76 hours of downtime per year). Most landline telephone switches are designed for five-nines availability, or no more than about five minutes of downtime per year. Today five nines is considered the gold standard for data services, with few situations warranting greater availability.
Although counting nines is a common way to express system availability, it's usually not detailed enough to understand business impact. An outage of 30 seconds may seem to users like nothing more than a slow day on the web. Some systems may show a partial outage but have other functions that can step in to take their place, making the system only appear to work slowly. The end result is that no simple metric can capture overall system availability. In practice, most systems look at the percentage of service requests that go beyond a specific threshold.
As a result, the term service level agreement (SLA) is used to describe the detailed availability targets for any data service. An SLA is a written agreement between a service provider and a customer who uses the service. The SLA doesn't concern itself with how the data service is provided. It defines what services will be provided, as well as the availability and response-time goals for the service. Some items to consider when creating an SLA are
■ What are the general service availability goals of the service, in terms of percentage uptime over a one-year period?
■ What are the typical average response times for the service under normal operations? Typically these are specified in milliseconds between service request and response.
■ What is the peak volume of requests that the service is designed to handle? This is typically specified in requests per second.
■ Are there any cyclical variations in request volumes? For example, do you expect to see peaks at specific times of day, days of the week or month, or times of the year like holiday shopping or sporting events?
■ How will the system be monitored and service availability reported?
■ What is the shape of the service-call response distribution curve? Keeping track of the average response time alone may not be useful; organizations often focus on the slowest 5% of service calls.
■ What procedures should be followed during a service interruption?
NoSQL system configuration may depend on some of the exceptions to these general rules. The key focus should not be a single availability metric.
8.2.1 Case study: Amazon's S3 SLA
Now let's look at Amazon's SLA for their S3 key-value store service. Amazon's S3 is widely regarded as the most reliable cloud-based key-value store service available. S3 consistently performs well, even when the number of reads or writes on a bucket spikes. The system is rumored to be the largest in existence, containing more than 1 trillion stored objects as of the summer of 2012. That's about 150 objects for every person on the planet. Amazon discusses several availability numbers on their website:
■ Annual durability design—This is the designed probability that a single key-value item will be lost over a one-year period. Amazon claims their design durability is 99.999999999%, or 11 nines. This number is based on the probability that your data object, which is typically stored on three hard drives, has all three drives fail before the data can be backed up. It means that if you store 10,000 items each year in S3 and continue to do so for 10 million years, there's about a 50% probability you'll lose one file. Not something that you should lose much sleep over. Note that a design target is different from a service guarantee.
■ Annual availability design—This is a worst-case measure of how much time, over a one-year period, you'll be unable to write new data or read your data back. Amazon claims a worst-case availability of 99.99%, or four nines, for S3. In other words, in the worst case, Amazon thinks your key-value data store may not work for about 53 minutes per year. In reality, most users get much better results.
■ Monthly SLA commitment—In the S3 SLA, Amazon will give you a 10% service credit if your system is not up 99.9% of the time in any given month. If your data is unavailable for 1% of the time in a month, you'll get a 25% service credit. In practice, we haven't heard of any Amazon customer getting SLA credits.
It's also useful to read the wording of the Amazon SLA carefully. For example, it defines an error rate as the number of S3 requests that return an internal status error code. There's nothing in the SLA about slow response times.
In practice, most users will get S3 availability that far exceeds the minimum numbers in the SLA. One independent testing service found essentially 100% availability for S3, even under high loads over extended periods of time.
8.2.2 Predicting system availability
If you're building a NoSQL database, you need to be able to predict how reliable your database will be, and you need tools to analyze the response times of database services.
Availability prediction methods calculate the overall availability of a system by looking at the predicted availability of each of its dependent (single-point-of-failure) subcomponents. If each subsystem is expressed as a simple availability prediction such as 99.9%, then multiplying the numbers together gives you an overall availability prediction. For example, if you have three single points of failure—99.9% for the network, 99% for the master node, and 99.9% for power—then the total system availability is the product of these three numbers: 99.9% x 99% x 99.9%, or about 98.8%.
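A minimal sketch of that calculation, under the usual assumption that component failures are independent:

from math import prod

def system_availability(*component_pcts):
    """Overall availability of a chain of single points of failure."""
    return 100 * prod(p / 100 for p in component_pcts)

# Network, master node, and power from the example above:
print(round(system_availability(99.9, 99.0, 99.9), 1))  # 98.8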
If there are single points of failure, such as a master or name node, NoSQL systems have the ability to gracefully switch over to a backup node without a major service interruption. If a system can quickly recover from a failing component, it's said to have the property of automatic failover: the general ability of a service to detect a failure and switch to a redundant component. Failback is the process of restoring a system component to its normal operation; generally, this process requires some data synchronization. If your systems are configured with a single failover, you must factor in the probability that the failover process doesn't work, in combination with the odds that the failover system fails before failback.
There are other metrics you can use besides the failure metric. If your system has a client request timeout of 30 seconds, you'll want to measure the total percentage of client requests that fail. In such cases, a better metric might be a factor called client yield, which is the probability of any request returning within a specified time interval. Other metrics, such as a harvest metric, apply when you want to include partial API results. Some services, such as federated search engines, may return partial results. For example, if you search 10 separate remote systems and one of the sites is down for your call window of 30 seconds, you'd have a 90% harvest for that specific call. Harvest is the amount of data available divided by the total number of data sources.
Finding the best NoSQL service for your application may require comparing the architectures of two different systems. The actual architecture may be hidden from you behind a web service interface. In these cases, it might make the most sense to set up a small pilot project to test the services under a simulated load. When you set up a pilot project that includes stress testing, a key measurement will be a frequency distribution chart of read and write response times. These distributions can give you hints about whether a database service will scale. A key point of this analysis is that instead of focusing on average or mean response times, you should look at how long the slowest 5% of your service calls take to return. In general, a service with consistent response times will have higher availability than one that sometimes has a high percentage of slow responses. Let's take a look at an example of this.
8.2.3 Apply your knowledge
Sally is evaluating two NoSQL options for a business unit that's concerned about web page response times. Web pages are rendered with data from a key-value store. Sally has narrowed the field to two key-value store options; we'll call them Service A and Service B. Sally uses JMeter, a popular performance testing tool, to create a chart of the read-service response distributions, as shown in figure 8.1.
When Sally looks at the data, she sees that Service A has faster mean response times, but at the 95th percentile its response times are longer than Service B's. Service B may have slower average response times, but they're still within the web page load-time goals.
Figure 8.1 Frequency distribution chart showing mean vs. 95th-percentile response times for two NoSQL key-value data stores under load. Service A has a faster mean response time but much longer response times at the 95th percentile. Service B has a slower mean response time but shorter 95th-percentile responses.
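If you capture raw response-time samples from a tool like JMeter, this comparison is easy to reproduce. A minimal sketch, with invented sample data standing in for real measurements:

from statistics import mean, quantiles

def p95(samples):
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20)[18]

service_a = [10] * 95 + [500] * 5   # usually fast, with a slow tail
service_b = [40] * 100              # consistently moderate

for name, samples in (("A", service_a), ("B", service_b)):
    print(name, "mean:", mean(samples), "p95:", round(p95(samples), 1))
# Service A wins on the mean (34.5 vs. 40 ms) but loses badly at the
# 95th percentile, which is what Sally's business unit cares about.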
After discussing the results with the business unit, the team selects Service B, since they feel it'll have more consistent response times under load.
Now that we've looked at how to predict and measure system availability, let's look at the strategies NoSQL clusters use to increase availability.
8.3 NoSQL strategies for high availability
In this section, we'll review several strategies NoSQL systems use to create high-availability services backed by clusters of two or more systems. This discussion will include the concepts of load balancing, clusters, replication, load and stress testing, and monitoring. As we look at each of these components, you'll see how NoSQL systems can be configured to provide maximum availability of the data service. One of the first questions you might ask is, "What if the NoSQL database crashes?" To get around this problem, a replica of the database can be created.
8.3.1 Using a load balancer to direct traffic to the least-busy node
Websites that aim for high availability use a front-end service called a load balancer, diagrammed in figure 8.2. In this figure, service requests enter on the left and are sent to a pool of resources called the load balancer pool. The service requests are sent to a master load balancer and then forwarded to one of the application servers. Ideally, each application server reports some type of load indicator that tells the load balancer how busy it is; the least-busy application server then receives the request. Application servers are responsible for servicing the request and returning the results. Each application server may request data services from one or many NoSQL servers. The results of these query requests are returned and the service request is fulfilled.
Figure 8.2 A load balancer is ideal when you have a large number of processors that can each fulfill a service request. To gain performance advantages, all service requests arrive at a load balancer service that distributes the request to the least-busy processor. A heartbeat signal from each application server provides a list of which application servers are working. An application server may request data from one or more NoSQL databases.
8.3.2 Using high-availability distributed filesystems with NoSQL databases
Most NoSQL systems are designed to work on a high-availability filesystem such as the Hadoop Distributed File System (HDFS). If you're using a NoSQL system such as Cassandra, you'll see that it has its own HDFS-compatible filesystem. Building a NoSQL system around a specific filesystem has advantages and disadvantages.
Advantages of using a distributed filesystem with a NoSQL database:
■ Reuse of reliable components—Reusing prebuilt and pretested system components makes sense with respect to time and money. Your NoSQL system doesn't need to duplicate the functions of a distributed filesystem. Additionally, your organization may already have an infrastructure and trained staff who know how to set up and configure these systems.
■ Customizable per-folder availability—Most distributed filesystems can be configured on a folder-by-folder basis for high availability. This gets around using a local filesystem with single points of failure to store input or output datasets. These systems can be configured to store your data in multiple locations; the default is generally three. This means that a client request would fail only if all three systems crashed at the same time. The odds of this occurring are low enough that three copies are sufficient for most service levels.
■ Rack and site awareness—Distributed filesystem software is designed to factor in how computer clusters are organized in your data center. When you set up your filesystem, you indicate which nodes are placed in which racks, with the assumption that nodes within a rack have higher bandwidth than nodes in different racks. Racks can also be placed in different data centers, and filesystems can
immediately duplicate data on racks in remote data centers if one data center goes down.
Disadvantages of using a distributed filesystem with a NoSQL database:
■ Lower portability—Some distributed filesystems work best on UNIX or Linux servers. Porting these filesystems to other operating systems, such as Windows, may not be possible. If you need to run on Windows, you may need to add an additional virtual machine layer, which may impose a performance penalty.
■ Design and setup time—When you're setting up a well-designed distributed filesystem, it may take some time to figure out the right folder structure. All the files within a folder share the same properties, such as the replication factor. If you use creation dates as folder names, you may be able to lower replication for files that are over a specific age, such as two years.
■ Administrative learning curve—Someone on your staff will need to learn how to set up and manage a distributed filesystem. These systems need to be monitored, and sensitive data must be backed up.
8.3.3 Case study: using HDFS as a high-availability filesystem to store master data
In chapter 6 on managing big data, we introduced the Hadoop Distributed File System (HDFS). As you recall, HDFS can reliably store large files, typically from a gigabyte to a terabyte in size. HDFS can also tune replication on a file-by-file basis. By default, most files in HDFS have a replication factor of 3, meaning that the data blocks that make up those files are located on three distinct nodes. A simple HDFS shell command (hdfs dfs -setrep) can change the replication factor for any HDFS file or folder. There are two reasons you might raise the replication factor of a file in HDFS:
■ To lower the chance that the data will become unavailable—For example, if you have a data service that depends on this data with a five-nines guarantee, you might want to increase the replication factor from 3 to 4 or 5.
■ To increase read-access times—If files have a large number of concurrent reads, you can increase the replication factor to allow more nodes to respond to those read requests.
The primary reasons to lower replication are that you're running out of disk space or you no longer require the service levels that high replication counts provide. If you're concerned about running out of disk space, the replication factor can be lowered again once the cost of data unavailability is lower and the read requirements aren't as demanding. It's not unusual for replication counts to go down as data gets older. For example, data that's over a year old might have a replication factor of only 2, and data that's over five years old may have a replication factor of 1.
One of the nice features of HDFS is its rack awareness. This is the ability to logically group HDFS nodes together in structures that reflect the way processors are stored in physical racks and connected by an in-rack network. Nodes that are physically stored in the same rack usually have higher-bandwidth connectivity between them, and using this in-rack network keeps data off of other shared networks. This is shown in figure 8.3.
Figure 8.3 HDFS is designed with rack awareness so that you can instruct data blocks to be spread onto multiple racks, which could be located in multiple data centers. In this example, every block is stored on three separate servers (a replication factor of 3) and HDFS spreads the blocks over two racks. If either rack becomes unavailable, there will always be a replica of both block 1 and block 2.
One of the advantages of rack awareness is that you can increase your availability by carefully distributing HDFS blocks across different racks. We've also discussed how NoSQL systems move the queries to the data, not the data to the query. Since the data is stored on three nodes, which node should run the query? The answer is usually the least-busy node. How does the query distribution system know which nodes hold the data? This is where there must be a tight coupling between the filesystem and the query distribution system: the information about which node holds the data must be communicated to the client program.
The primary disadvantage of using an external filesystem is that your database may not be as portable to operating systems that don't support that filesystem. For example, HDFS is usually run on UNIX or Linux operating systems. If you want to deploy HBase—which is designed to run on HDFS—you may have more hoops to jump through to get HDFS to run on a Windows system. Using a virtual machine is one way to do this, but there can be a performance penalty with virtual machines.
We should note that you can get the same replication factor by using an off-the-shelf storage area network (SAN). What you won't get with this configuration is an easy way to keep the query and the data on the same server. Using a SAN for high-availability databases can work for small datasets, but larger datasets will result in excessive network traffic. In the long run, the Hadoop architecture of shared-nothing processors, each working on its own copy of a large dataset, is the most scalable solution.
Setting up a Hadoop cluster can be a great way to make sure your NoSQL database has both high availability and scale-out performance. Early versions of Hadoop (usually referred to as version 1.x) had a single point of failure in the NameNode service. For high availability, Hadoop used a secondary failover node that was automatically
replicated with the data and could be swapped in if the master NameNode failed. Since 2010, there have been specialized releases of Hadoop that remove the NameNode as a single point of failure. Although the NameNode was a weak link in setting up early Hadoop clusters, it was usually not the primary cause of service failures. Facebook did a study of its service failures and found that only 10% were related to NameNode failures. Most were the result of human error or systematic bugs affecting all Hadoop nodes.
8.3.4 Using a managed NoSQL service
Organizations find that even with an advanced NoSQL database, it takes a huge amount of engineering to create and maintain predictable, high-availability data services that scale. Unless you have a large IT budget and specialized staff, you'll find it more cost effective to let companies experienced in database setup and configuration handle the job, and let your own staff focus on application development. Today the cost of using cloud-based NoSQL services is a fraction of what internal IT departments charge to set up and configure similar systems. Let's take a look at how an Amazon DynamoDB key-value store can be configured to give you high availability.
8.3.5 Case study: using Amazon DynamoDB for a high-availability data store
The original Amazon Dynamo paper, introduced in chapter 1, was one of the most influential papers in the NoSQL movement. It detailed how Amazon rejected RDBMS designs and used its own custom distributed computing system to support the requirements of horizontal scalability and high availability for its web shopping cart.
Originally, Amazon didn't make the Dynamo software open source. Yet despite the lack of source code, the Dynamo paper heavily influenced other NoSQL systems such as Cassandra, Redis, and Riak. In February 2012, Amazon made DynamoDB available as a database service for other developers to use. This case study reviews the Amazon DynamoDB service and how it can be used as a fully managed, highly available, and scalable database service.
Let's start by looking at DynamoDB's high-level features. DynamoDB's key innovation is its ability to quickly and precisely tune throughput. The service can reliably handle a large volume of read and write transactions, which can be tuned on a minute-by-minute basis by modifying values on a web page. Figure 8.4 shows an example of this user interface. DynamoDB handles how many servers are used and how the loads are balanced between them. Amazon provides an API so you can change the provisioned throughput programmatically based on the results of your load-monitoring system. No operator intervention is required, and your monthly Amazon bill will be automatically adjusted as these parameters change.
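As a rough sketch of what that programmatic control looks like with the AWS SDK for Python (boto3); the table name and capacity values here are hypothetical:

import boto3

client = boto3.client("dynamodb")

# Raise read capacity ahead of an expected traffic peak; the change is
# applied by Amazon with no operator intervention on your servers.
client.update_table(
    TableName="shopping-cart",
    ProvisionedThroughput={
        "ReadCapacityUnits": 200,
        "WriteCapacityUnits": 100,
    },
)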
Figure 8.4 Amazon DynamoDB table throughput provisioning. By changing the number of read capacity units or write capacity units, you can tune each table in your database to use the correct number of servers based on your capacity needs. Amazon also provides tools to help you calculate the number of units your application will need.
Amazon implements the DynamoDB service using complex algorithms to evenly distribute reliable read and write transactions over tens of thousands of servers. DynamoDB was also one of the first systems to be deployed exclusively on solid-state disks (SSDs). Using SSDs allows DynamoDB to offer a predictable service level: one of the goals of DynamoDB is to provide consistent single-digit-millisecond read times. The exclusive use of SSDs means that DynamoDB never has to take disk-access latency into consideration. The net result is that your users get consistent responses to all GET and PUT operations, and web pages built with data from DynamoDB will seem faster than those backed by most disk-based databases.
The DynamoDB API gives you fine-grained control of read consistency. A developer can choose between an immediate read of a value from the local node (called an eventually consistent read) and a slower but guaranteed-consistent read of an item. The guaranteed-consistent read takes a bit longer to make sure that the node you're reading from has the latest copy of your item. If your application knows that the values haven't changed, the immediate read will be faster.
This fine-grained control of how you read data is an excellent example of how you can use your knowledge of consistency (covered in chapter 2) to fine-tune your application. It's important to note that you can always force your reads to be consistent, but it would be challenging to configure SQL-backed data services to integrate this feature: SQL has no option to "make consistent before select" when working with distributed systems.
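A minimal boto3 sketch of the choice between the two read modes; the table and key names are hypothetical:

import boto3

client = boto3.client("dynamodb")

# ConsistentRead=True forces a strongly consistent read; omit it (or set
# it to False) for the faster, eventually consistent default.
item = client.get_item(
    TableName="shopping-cart",
    Key={"cart_id": {"S": "cart-123"}},
    ConsistentRead=True,
)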
DynamoDB is ideal for organizations that have elastic demand. The notion of paying only for what you use is a primary way to save on hosting expenses, yet when you do need to scale, DynamoDB has the headroom for quick growth. For scalable transforms, DynamoDB also integrates with Elastic MapReduce, which allows you to move large amounts of data into and out of DynamoDB when you need efficient and scalable extract, transform, and load (ETL) processes.
Now that we've shown you how NoSQL systems use various strategies to create high-availability database services, let's look at two NoSQL products with strong reputations for high availability.
8.4 Case study: using Apache Cassandra as a high-availability column family store
This case study looks at Apache Cassandra, a NoSQL column-family store with a strong reputation for scalability and high availability, even under intense write loads. Cassandra was an early adopter of a pure peer-to-peer distribution model. All nodes in a Cassandra cluster have identical functionality, and clients can write to any node in the cluster at any time. Because Cassandra doesn't have a single master node, there's no single point of failure and you don't have to set up and test a failover node.
Apache Cassandra is an interesting combination of NoSQL technologies. It's sometimes described as a Bigtable data model with a Dynamo-inspired implementation. In addition to its robust peer-to-peer model, Cassandra has a strong focus on making it easy to set up and configure both write and read consistency levels. Table 8.2 shows the write consistency-level settings you can use once you've configured your replication level.
Table 8.2 The codes used to specify consistency levels for writes to Cassandra tables. Each table in Cassandra is configured to meet the consistency levels you need when the table is created. You can change the consistency level at any time, and Cassandra will automatically reconfigure to the new settings. Similar codes exist for read levels.
Level Write guarantee
ZERO (weak consistency) No write confirmation is done. No consistency guarantees. If the server crashes, the write may never actually happen.
ANY (weak consistency) A write confirmation from any single node, including a special “hinted hand-off” node, will be sufficient for the write to be acknowledged.
ONE (weak consistency) A write confirmation from any single node will be sufficient for the write to be acknowledged.
TWO Ensure that the write has been written to at least two replicas before responding to the client.
THREE Ensure that the write has been written to at least three replicas before responding to the client.
QUORUM (strong consistency) Ensure that the write has been written to N/2 + 1 replicas, where N is the replication factor.
LOCAL_QUORUM Ensure that the write has been written to a quorum of replicas (N/2 + 1) within the local data center.
EACH_QUORUM Ensure that the write has been written to a quorum of replicas (N/2 + 1) in each data center.
ALL (strong consistency) All replicas must confirm that the data was written to disk.
Next, consider what to do if one of the nodes is unavailable during a read transaction. How can you specify the number of nodes to check before you return a value? Checking only one node will return a value quickly, but it may be out of date. Checking multiple nodes may take a few milliseconds longer, but will guarantee that you get the latest version in the cluster. The answer is to allow the client reader to specify a consistency-level code similar to the write codes discussed here. Cassandra clients can select from ONE, TWO, THREE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, and ALL when doing reads. You can even use the EACH_QUORUM code to check multiple data centers around the world before the client returns a value.
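Here's a minimal sketch of a client selecting a read consistency level with the DataStax Python driver, assuming a running cluster and a hypothetical keyspace and table:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("mykeyspace")

# QUORUM: wait for a majority of replicas so the read reflects the
# latest write; ONE would be faster but might return stale data.
query = SimpleStatement(
    "SELECT * FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(query, ["user-123"]).one()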
As you'll see next, Cassandra uses specific configuration terms that you should understand before you set up and configure your cluster.
8.4.1 Configuring data-to-node mappings with Cassandra
In our discussion of consistent hashing, we introduced the concept of using a hash to evenly distribute data around a cluster. Cassandra uses this same concept to distribute its data. Before we dive into how Cassandra does this, let's look at some key terms and definitions found in the Cassandra system.
ROWKEY
A rowkey is a row identifier that's hashed and used to place the data item on one or more nodes. The rowkey is the only structure used to place data onto a node; no column values are used for placement. Designing your rowkey structure is a critical step in making sure similar items are clustered together for fast access.
PARTITIONER
A partitioner is the strategy that determines how to assign a row to a node based on its key. The default setting is to select a random node. Cassandra uses an MD5 hash of the key to generate a consistent hash. This has the effect of randomly distributing rows evenly over all the nodes. The other option is to use the actual bytes in a rowkey (not a hash of the key) to place the row on a specific node.
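To illustrate the idea behind the random partitioner, here's a toy sketch: hash each rowkey with MD5, then walk clockwise around the ring to find the nodes that own the row. This is a simplification for illustration only; real Cassandra token assignment and replica placement involve more machinery.

import bisect
import hashlib

def token(key):
    # Map a key to a point on the hash ring using MD5.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

# Four hypothetical nodes, each owning a token on the ring.
ring = sorted((token("node-%d" % i), "node-%d" % i) for i in range(4))
tokens = [t for t, _ in ring]

def replicas(rowkey, replication_factor=3):
    # Start at the first node token past the rowkey's token, then
    # continue clockwise until we have enough distinct replicas.
    start = bisect.bisect_right(tokens, token(rowkey))
    return [ring[(start + i) % len(ring)][1]
            for i in range(replication_factor)]

print(replicas("user:1001"))   # e.g., ['node-2', 'node-0', 'node-3']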
KEYSPACE
A keyspace is the data structure that determines how a key is replicated over multiple nodes. A replication factor of 3 is typical for any data that needs a high degree of availability.
Figure 8.5 A sample of how replication works on a Cassandra keyspace using the SimpleStrategy configuration. Items A1, B1, C1, and D1 are written to a four-node cluster with a replication factor of 3. Each item is stored on three distinct nodes. After writing to an initial node, Cassandra "walks around the ring" in a clockwise direction until it stores the item on two additional nodes.
An example of a Cassandra keyspace is usually drawn as a ring, as illustrated in figure 8.5. Cassandra allows you to fine-tune replication based on the properties of a keyspace. When you add any row to a Cassandra system, you must associate a keyspace with that row. Each keyspace allows you to set and change the replication factor of that row. An example of a keyspace declaration is shown in figure 8.6.
Using the default random-placement strategy will allow you to evenly distribute the rows over all the nodes in the cluster, eliminating bottlenecks. Although you can instead use specific bits in your key to place rows, this practice is strongly discouraged, as it leads to hotspots in your cluster and can be administratively complex. The main disadvantage of this approach is that if you change your partitioning algorithm, you have to save and restore your entire dataset.
When you have multiple racks or multiple data centers, the algorithm might need to be modified to ensure data is written to multiple racks or even to multiple data centers before the write acknowledgment is returned. If you change the placement-strategy value from SimpleStrategy to NetworkTopologyStrategy, Cassandra will walk around the ring until it finds a node that's on a different rack or in a different data center.
Because Cassandra has a full peer-to-peer deployment model, it's seen as a good fit for organizations that want both scalability and availability in a column family system. In our next case study, you'll see how Couchbase 2.0 uses a peer-to-peer model with a JSON document store.
CREATE KEYSPACE myKeySpace
WITH placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
AND strategy_options = {replication_factor:3};

Figure 8.6 A sample of how replication is configured within Cassandra. For a single-location site, use SimpleStrategy; if you have multiple sites, use NetworkTopologyStrategy. We set a replication factor of 3 so that each item will be stored on three separate nodes in the cluster. Replication is a property of a keyspace: it indicates the type of network you're on and the replication factor for each key.
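For a cluster that spans multiple racks or data centers, the keyspace declaration switches to NetworkTopologyStrategy. Here's a hedged sketch using the Python driver; the data center names and replica counts are hypothetical, and the replication syntax shown is the newer CQL form, which differs from the older placement_strategy form in figure 8.6.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Keep three replicas in us-west and two in us-east, so quorum-style
# writes can be confirmed in more than one data center.
session.execute("""
    CREATE KEYSPACE myMultiSiteKeySpace
    WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'us-west': 3,
      'us-east': 2
    }
""")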
8.5 Case study: using Couchbase as a high-availability document store
Couchbase 2.0 is a JSON document store that uses many of the same replication patterns found in other NoSQL databases.
Couchbase vs. CouchDB
Couchbase technology shouldn't be confused with Apache CouchDB. Though both are open source technologies, they're separate open source projects that have significantly different capabilities, and they support different application developer needs and use cases. Under the covers, Couchbase has more in common with the original Memcached project than the original CouchDB project. Couchbase and CouchDB share the same way of generating views of JSON documents, but their implementations are different.
Like Cassandra, Couchbase uses a peer-to-peer distribution model where all nodes provide the same services, eliminating the possibility of a single point of failure. Unlike Cassandra, Couchbase is a document store rather than a column family store, which allows you to query based on a document's content. Additionally, Couchbase uses a keyspace concept that associates key ranges with individual nodes. Figure 8.7 shows the components in a multi-data-center Couchbase deployment.

Figure 8.7 High-availability documents in Couchbase. Couchbase buckets are logical collections of documents that are configured for high availability. Couchbase clients use cluster maps to find the physical node where the active version of a document is stored. The cluster map directs the client to a file on the active node (1). If Data server 1 is unavailable, the cluster map will make a replica of doc1 on Data server 2 the active version (2). If the entire US West data center goes down, the client will use cross data center replication (XDCR) and make a copy of doc1 on Data server 3 in the US East region active (3).

Couchbase stores document collections in containers called buckets, which are configured and administered much like folders in a filesystem. There are two types of buckets: a memcached bucket (which is volatile and resides in RAM) and a Couchbase bucket, which is backed by disk and configured with replication. A Couchbase JSON document is written to one or more disks. For our discussion of high-availability systems, we'll focus on Couchbase buckets.
Internally, Couchbase uses a concept called a vBucket (virtual bucket) that's associated with one or more portions of a hash-partitioned keyspace. Couchbase keyspaces are similar to those found in Cassandra, but Couchbase keyspace management is done transparently when items are stored. Note that a vBucket isn't a single range in a keyspace; it may contain many noncontiguous ranges in keyspaces. Thankfully, users don't need to worry about managing keyspaces or how vBuckets work. Couchbase clients simply work with buckets and let Couchbase worry about which node will be used to find the data in a bucket. Separating buckets from vBuckets is one of the primary ways that Couchbase achieves horizontal scalability.
Using information in the cluster map, Couchbase stores data on a primary node as well as a replica node. If any node in a Couchbase cluster fails, the node will be marked with a failover status and the cluster maps will all be updated. All data requests to the node will automatically be redirected to replica nodes.
After a node fails and replicas have been promoted, users will typically initiate a rebalance operation to add new nodes to the cluster and restore its full capacity. Rebalancing effectively changes the mapping of vBuckets to nodes. During a rebalance operation, vBuckets are evenly redistributed between nodes in the cluster to minimize data movement. Once a vBucket has been re-created on the new node, it's automatically disabled on the original node and enabled on the new node. These functions all happen without any interruption of service.
Couchbase has features that allow a Couchbase cluster to run without interruption even if an entire data center fails. For systems that span multiple data centers, Couchbase uses cross data center replication (XDCR), which allows data to be automatically replicated between remote data centers and still be active in both data centers. If one data center becomes unavailable, the other data center can pick up the load to provide continuous service.
One of the greatest strengths of Couchbase is its built-in, high-precision monitoring tools. Figure 8.8 shows a sample of these monitoring tools. These fine-grained tools allow you to quickly locate bottlenecks in Couchbase and rebalance memory and server resources based on your loads. They eliminate the need to purchase third-party memory monitoring tools or configure external monitoring frameworks. Although it takes some training to understand how to use these monitoring tools, they're the first line of defense in keeping your Couchbase clusters healthy.
Couchbase also has features that allow software to be upgraded without an interruption in service. This process involves replicating data to a new node that runs the new version of the software and then cutting over to that new node. These features allow you to provide a 24/365 service level to your users without downtime.
Figure 8.8 Couchbase comes with a set of customizable web-based operations monitoring reports to allow you to see the impact of loads on Couchbase resources. The figure shows a minute-by-minute, operations-per-second display on the default bucket. You can select from any one of 20 views to see additional information.
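Before we leave Couchbase, here's a rough sketch of what working with a bucket looks like from a client, including the durability options that matter for high availability. The bucket name, key, and replica counts are hypothetical, and the API shown follows the 2.x Python SDK; newer SDK generations differ.

from couchbase.bucket import Bucket

# Connect to a named bucket; the client fetches the cluster map itself
# and routes each key to the right node and vBucket transparently.
bucket = Bucket("couchbase://127.0.0.1/orders")

# Ask the cluster to persist the write to disk on one node and copy it
# to two replicas before returning, trading latency for durability.
bucket.upsert(
    "order::1001",
    {"customer": "Acme", "total": 99.95},
    persist_to=1,
    replicate_to=2)

doc = bucket.get("order::1001").value

If a node fails after a write like this, a replica already holds the document and can be promoted when the cluster map is updated.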
8.6 Summary
In this chapter, you learned how NoSQL systems can be configured to create highly available data services for key-value stores, column family stores, and document stores. Not only can NoSQL data services be highly available, they can also be tuned to meet precise service levels at reasonable costs. We've looked at how NoSQL databases leverage distributed filesystems like Hadoop with fine-grained control over file replication. Finally, we've reviewed some examples of how turnkey data services have been created to take advantage of NoSQL architectures.
Organizations have found that high-availability NoSQL systems that run on multiple processors can be more cost-effective than RDBMSs, even if the RDBMSs are configured for high availability. The principal reason has to do with the use of simple
distributed data stores like key-value stores. Key-value stores don't use joins; they leverage consistent hashing; and they have strong scale-out properties. Simplicity of design frequently promotes high availability.
With all the architectural advantages of NoSQL for creating cost-effective, high-availability databases, there are drawbacks as well. The principal drawback is that NoSQL systems are relatively new and may contain bugs that become apparent only in rare circumstances or unusual configurations. The NoSQL community is full of stories of high-visibility web startups that experienced unexpected downtime after deploying new versions of NoSQL software without adequate staff training and without enough time and budget for load and stress testing.
Load and stress testing take time and resources. To be successful, your project may need people with the right training and experience using the same tools and configuration you have. With NoSQL still newer than traditional RDBMSs, the training budgets for your staff need to be adjusted accordingly.
In our next chapter, you'll see how using NoSQL systems will help you be agile with respect to developing software applications to solve your business problems.
8.7 Further reading
"About Data Partitioning in Cassandra." DataStax. http://mng.bz/TI33.
"Amazon DynamoDB." Amazon Web Services. http://aws.amazon.com/dynamodb.
"Amazon DynamoDB: Provisioned Throughput." Amazon Web Services. http://mng.bz/492J.
"Amazon S3 Service Level Agreement." Amazon Web Services. http://aws.amazon.com/s3-sla/.
"Amazon S3—The First Trillion Objects." Amazon Web Services Blog. http://mng.bz/r1Ae.
Apache Cassandra. http://cassandra.apache.org.
Apache JMeter. http://jmeter.apache.org/.
Brodkin, Jon. "Amazon bests Microsoft, all other contenders in cloud storage test." Ars Technica. December 2011. http://mng.bz/ItNZ.
"Data Protection." Amazon Web Services. http://mng.bz/15yb.
DeCandia, Giuseppe, et al. "Dynamo: Amazon's Highly Available Key-Value Store." Amazon.com. 2007. http://mng.bz/YY5A.
Hale, Coda. "You Can't Sacrifice Partition Tolerance." October 2010. http://mng.bz/9i3I.
"High Availability." Neo4j. http://mng.bz/9661.
"High-availability cluster." Wikipedia. http://mng.bz/SHs5.
"In Search of Five 9s: Calculating Availability of Complex Systems." edgeblog. October 2007. http://mng.bz/3P2e.
Luciani, Jake. "Cassandra File System Design." DataStax. February 2012. http://mng.bz/TfuN.
Ryan, Andrew. "Hadoop Distributed Filesystem reliability and durability at Facebook." Lanyrd. June 2012. http://mng.bz/UAX9.
"Tandem Computers." Wikipedia. http://mng.bz/yljh.
Vogels, Werner. "Amazon DynamoDB—A Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications." January 2012. http://mng.bz/HpN9.
Increasing agility with NoSQL
This chapter covers
Measuring agility
How NoSQL increases agility
Using document stores to avoid object-relational mapping
Change is no longer just incremental. Radical “nonlinear change” which brings about a different order is becoming more frequent. —Cheshire Henbury, “The Problem”
Can your organization quickly adapt to changing business conditions? Can your computer systems rapidly respond to increased workloads? Can your developers quickly add features to your applications to take advantage of new business oppor- tunities? Can nonprogrammers maintain business rules without needing help from software developers? Have you ever wanted to build a web application that works with complex data, but you didn’t have the budget for teams of database modelers, SQL developers, database administrators, and Java developers?
If you answered yes to any of these questions, you should consider evaluating a NoSQL solution. We've found that NoSQL solutions can reduce the time it takes to build, scale, and modify applications. Whereas scalability is the primary reason companies move away from RDBMSs, agility is the reason NoSQL solutions "stick." Once you've experienced the simplicity and flexibility of NoSQL, the old ways seem like a chore.
As we move through this chapter, we'll talk about agility. You'll learn about the challenges one encounters when attempting to objectively measure it. We'll quickly review the problems encountered when trying to store documents in a relational database and the problems associated with object-relational mapping. We'll close out the chapter with a case study that uses a NoSQL solution to manage complex business forms.
9.1 What is software agility?
Let's begin by defining software agility and talk about why businesses use NoSQL technologies to quickly build new applications and respond to changes in business requirements.
We define software agility as the ability to quickly adapt software systems to changing business requirements. Agility is strongly coupled with both operational robustness and developer productivity. Agility is more than rapidly creating new applications; it means being able to respond to changing business rules without rewriting code. To expand, agility is the ability to rapidly
Build new applications
Scale applications to quickly match new levels of demand
Change existing applications without rewriting code
Allow nonprogrammers to create and maintain business logic
From the developer productivity perspective, agility includes all stages of the software development lifecycle (SDLC), from creating requirements, documenting use cases, and creating test data to maintaining business rules in existing applications. As you may know, some of these activities are handled by staff who aren't traditionally thought of as developers or programmers. From a NoSQL perspective, agility can help to increase the productivity of programmers and nonprogrammers alike.
Traditionally, we think of "programmers" as staff who have a four-year degree in computer science or software engineering.
Agility vs. agile development Our discussion of agility shouldn’t be confused with agile development, which is a set of guidelines for managing the software development process. Our focus is the impact of database architecture on agility.
They understand the details of how computers work and are knowledgeable about issues related to memory allocation, pointers, and multiple languages like Java, .Net, Perl, Python, and PHP.
Nonprogrammers are people who have exposure to their data and may have some experience with SQL or writing spreadsheet macros. Nonprogrammers focus on getting work done for the business; they generally don't write code. Typical nonprogrammer roles might include business analyst, rules analyst, data quality analyst, or quality assurance.
There's a large body of anecdotal evidence that NoSQL solutions have a positive impact on agility, but there are few scientific studies to support the claim. One study, funded by 10gen, the company behind MongoDB, found that more than 40% of the organizations using MongoDB had a greater than 50% improvement in developer productivity. These results are shown in figure 9.1.
When you ask people why they think NoSQL solutions increase agility, many reasons are cited. Some say NoSQL allows programmers to stay focused on their data and build data-centric solutions; others say the lack of object-relational mapping opens up opportunities for nonprogrammers to participate in the development process and shortens development timelines, resulting in greater agility. By removing object-relational mapping, someone with a bit of background in SQL, HTML, or XML can build and maintain their own web applications with some training. After this training, most people are equipped to perform all of the key operations, such as create, read, update, delete, and search (CRUDS), on their records. Programmers also benefit from the absence of object-relational mapping, as they can move their focus from mapping issues to creating automated tools for others to use. But the impact of all these time- and money-saving NoSQL trends puts more pressure on an enterprise solution architect to determine whether NoSQL solutions are right for the team.
If you've spent time working with multilayered software architectures, you're likely familiar with the challenges of keeping these layers in sync. User interfaces, middle-tier objects, and databases must all be updated together to maintain consistency. If any layer is out of sync, the system fails. It takes a great deal of time to keep the layers in sync, and the time to sync and retest each of the layers slows down a team and hampers agility.
Figure 9.1 Results of a 10gen survey of their users show that more than 40% of development teams using MongoDB had a greater than 50% increase in productivity. Survey question: "By what percentage has MongoDB increased the productivity of your development team?" Responses: greater than 75%: 8%; 50%–74%: 33%; 25%–49%: 34%; 10%–24%: 20%; less than 10%: 5%. This study included 61 organizations and the data was validated in May of 2012. (Source: TechValidate. TVID: F1D-0F5-7B8)
NoSQL architectures promote agility because there are fewer layers of software, and changes in one layer don't cause problems with other layers. This means your team can add new features without the need to sync all the layers.
The term NoSQL schemaless datasets usually refers to datasets that don't require predefined tables, columns (with types), and primary-foreign key relationships. Datasets that don't require these predefined structures are more adaptable to change. When you first begin to design your system, you may not know what data elements you'll need. NoSQL systems allow you to use new elements and associate data types, indexes, and rules with new elements when you need them, not before you get the data. As new data elements are loaded into some NoSQL databases, indexes are automatically created to identify this data. If you add a new element for PersonBirthDate anywhere in a JSON or XML input file, it'll be added to an index of other PersonBirthDate elements in your database. Note that a range index on dates for fast sorting may still need to be configured.
To take this a step further, let's look at how specific NoSQL data services can be more agile than an entire RDBMS. NoSQL systems frequently deliver data services for specific portions of a large website or application. They may use dozens of CPUs working together to deliver these services in configurations that are designed to duplicate data for faster guaranteed response times and reliability. The NoSQL data service CPUs are often dedicated to these services and no other functions. As the requirements for performance and reliability change, more CPUs can be automatically added to share the load, improve response time, and lower the probability of downtime.
This architecture of using dedicated NoSQL servers to create highly tunable data services is in sharp contrast to traditional RDBMSs, which typically have hundreds or thousands of tables all stored on a single CPU. Trying to create precise data service levels for one service can be difficult if not impossible when some data services are negatively impacted by large query loads on other, unrelated database tables. NoSQL data architectures, when combined with distributed processing, allow organizations to be more agile and resilient to the changing needs of businesses.
Our focus in this chapter is the impact of NoSQL database architecture on overall software agility. Before we wrap up our discussion of defining agility as it relates to NoSQL architecture, we'll make the schemaless point concrete with a short sketch and then look at how deployment strategies also impact agility.
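Here is a minimal sketch of that schemaless flexibility using MongoDB's Python driver; the database, collection, and field names are hypothetical. Notice that nothing resembling an ALTER TABLE statement precedes the new PersonBirthDate field.

from pymongo import MongoClient

people = MongoClient().hr.people   # hypothetical database and collection

# The first document never declared a birth date field...
people.insert_one({"name": "Ann", "role": "analyst"})

# ...and a later document can simply include one. No schema migration,
# no downtime, no coordination with a database administrator.
people.insert_one({"name": "Dan", "role": "architect",
                   "PersonBirthDate": "1964-05-01"})

# A range index for fast date sorting is the one step you add yourself.
people.create_index("PersonBirthDate")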
9.1.1 Apply your knowledge: local or cloud-based deployment?
Sally is working on a project that has a tight budget and a short timeline. The organization she works for prefers to use database servers in their own data center, but in the right situation they allow cloud-based deployments. Since the project is a new service, the business unit is unable to accurately predict either the demand for the service or the throughput requirements. Sally wants to consider the impact of a cloud-based deployment on the project's scalability and agility.
She asks a friend in operations how long it typically takes for the internal information technology department to order and configure a new database server. She gets an email message with a link to the spreadsheet shown in figure 9.2.
Figure 9.2 Average time required to provision a new database server in a typical large organization. Because NoSQL servers can be deployed as a managed service, a month-long provisioning process can drop to minutes or even seconds when changing the number of nodes in a cluster.
This figure shows a typical estimate of the steps Sally's information technology department uses to provision a new database server. As you can see from the Total Business Days calculation, it'll take 19 business days, or about a month, to complete the project. This stands in sharp contrast to a cloud-based NoSQL deployment that can add capacity in minutes or even seconds. The company does have some virtual machine-based options, but there are no clear guarantees of average response times for the virtual machine options.
Sally opts to use a cloud-based deployment for her NoSQL database for the first year of the project. After the first year, the business unit will reevaluate the costs and compare them with internal costs. This allows the team to quickly move forward with their scale-out testing without incurring the up-front capital expenditures associated with ordering and configuring up to a dozen database servers.
Our goal in this chapter is not to compare local versus cloud-based deployment methods. It's to understand how NoSQL architecture impacts a project's development speed. But the choice to use a local or cloud-based deployment should be a consideration in any project.
In chapter 1 we talked about how volume, velocity, variability, and agility were the business drivers associated with the NoSQL movement. Now that you're familiar with these drivers, you can look at your organization to see how NoSQL solutions might positively impact them to help your business meet the changing demands of today's competitive marketplace.
9.2 Measuring agility
Understanding the overall agility of a project and team is the first step in determining the agility associated with one or more database architectures. We'll now look at developer agility to see how it can be objectively measured.
Measuring pure agility in the NoSQL selection process is difficult, since it's intertwined with developer training and tools. A person who's an expert with Java and SQL might create a new web application faster than someone who's a novice with a NoSQL system. The key is to remove the tools and staff-dependent components from the measurement process.
Figure 9.3 The factors that make it challenging to measure the impact of database architecture on overall software agility. Database architecture is only a single component of the entire SDLC ecosystem. Developer agility is strongly influenced by an individual's background, training, and motivation. The tools layer includes items such as the integrated development environment (IDE), app generators, and developer tools. The interface layer includes items such as the command-line interface (CLI) as well as interface protocols.
An application development architecture's overall software agility can be precisely measured. You can track the total number of hours it takes to complete a project using both an RDBMS and a NoSQL solution. But measuring the relationship between the database architecture and agility is more complex, as seen in figure 9.3.
Figure 9.3 shows how the database architecture is a small part of an overall software ecosystem. The diagram identifies all the components of your software architecture and the tools that support it. The architecture has a deep connection with the complexity of the software you use. Simple software can be created and maintained by a smaller team with fewer specialized skills. Simplicity also requires less training and allows team members to assist each other during development.
To determine the relationship between the database architecture and agility, you need a way to subtract the nondatabase architecture components that aren't relevant. One way to do this is to develop a normalization process that tries to separate the unimportant processes from agility measurements. This process is shown in figure 9.4.
This model is driven by selecting key use cases from your requirements and analyzing the amount of effort required to achieve your business goals. Although this sounds complicated, once you've done it a few times, the process seems straightforward.
Let's use the following example. Your team has a new project that involves importing XML data and creating RESTful web services that return only portions of this data using a search service. Your team meets and talks about the requirements, and the development staff creates a high-level outline of the steps and effort required.
Figure 9.4 Factors such as development tools, training, architectures, and use cases all impact developer agility. In order to do a fair comparison of the impact of NoSQL architecture on agility, you need to normalize the non-architecture components. Once you balance these factors, you can compare how different NoSQL architectures drive the agility of a project.
You've narrowed down the options to a native XML database using XQuery and an RDBMS using a Java middle tier. For the sake of simplicity, effort is categorized using a scale of 1 to 5, where 1 is the least effort and 5 is the most effort. A sample of this analysis is shown in table 9.1.
Table 9.1 High-level effort analysis to build a RESTful search service from an XML dataset. The steps to build the service are counted, and a rough effort level (1-5) is used to measure the difficulty of each step.

NoSQL document store
1. Drag and drop XML file into database collection (1)
2. Write XQuery (2)
3. Publish API document (1)
Total effort: 4 units

SQL/Java method
1. Inventory all XML elements (2)
2. Design data model (5)
3. Write create table statements (5)
4. Execute create table (1)
5. Convert XML into SQL insert statements (4)
6. Run load-data scripts (1)
7. Write SQL scripts to query data (3)
8. Create Java JDBC program to query data and Java REST programs to convert SQL results into XML (5)
9. Compile Java program and install on middle-tier server (2)
10. Publish API document (1)
Total effort: 29 units
Performing this type of analysis can show you how suitable an architecture is for a particular use case. Large projects may have many use cases, and you'll likely get conflicting results. The key is to involve a diverse group of people to create a fair and objective estimate of the total effort that's decoupled from background and training issues.
The amount of time you spend looking at the effort involved in each use case is up to you and your team. Informal "thought experiments" work well if the team has people with adequate experience in each alternative database and a high level of trust
exists in the group's ability to fairly compare effort. If there are disputes about relative effort, you may need to write sample code for some use cases. Although this takes time and slows the selection process, it's sometimes necessary to fairly evaluate alternatives and validate objective effort comparisons.
Many organizations that compare RDBMSs and document stores find a significant reduction in effort in use cases that focus on data import and export. In our next section, we'll focus on a detailed analysis of the translations RDBMSs need to convert data to and from natural business objects.
9.3 Using document stores to avoid object-relational mapping
You may be familiar with the saying "A clever person solves a problem. A wise person avoids it." Organizations that adopt document stores to avoid an object-relational layer are wise indeed. The conversion of object hierarchies to and from rigid tabular structures can be one of the most vexing problems in building applications. Avoiding object-relational mapping is a primary reason developer productivity increases when using NoSQL systems.
Early computer systems used in business focused on the management of financial and accounting data. This tabular data was represented in a flat file of consistent rows and columns. For example, financial systems stored general ledger data in a series of columns and rows that represented debits and credits. These systems were easy to model, the data was extremely consistent, and the variability in how customers stored financial information was minimal.
Later, other departments began to see how storing and analyzing data could help them manage inventory and make better business decisions. For many departments, the types of data captured and stored needed to change. A simple tabular structure could no longer meet the organization's needs. Engineers did what they do best: they attempted to work within their existing structure to accommodate the new requirements. After all, they had a sizeable investment in people and systems, so they wanted to use the same RDBMS used in accounting.
As things evolved, business components came to be represented in a middle tier using object models. Object models were more flexible than the original punch card models, as they naturally represented the way business entities were stored. Objects could contain other objects, which could in turn contain additional objects. To save the state of an object, many SQL statements would need to be generated to save and reassemble objects. In the late 1990s these objects also needed to be viewed and edited using a web browser, sometimes requiring additional translations. The most common design was a four-step translation process, as shown in figure 9.5.
If you think this four-step translation process is complex, it can be. Let's see how you might compare the four-translation pain to putting away your clothing.
Figure 9.5 The four-translation web-object-RDBMS model. This model is used when objects are used as a middle layer between a web page and a relational database. The first translation (T1) is the conversion from HTML web pages to middle-tier objects. The second translation (T2) is a conversion from middle-tier objects to relational database statements such as SQL. RDBMSs return only tables, so the third translation (T3) is the transformation of tables back into objects. The fourth translation (T4) is converting objects into HTML for display in web pages.
Imagine using this process on a daily basis as you return home from a long day at the office. You'd take off your clothes and start by removing all of the thread, dismantling your clothes bit by bit, and then putting them away in uniform bolts of cloth. When you get up in the morning to go to work, you'd retrieve your needle and thread and re-sew all your clothes back together again. If you're thinking, "This seems like a lot of unnecessary work," that's the point of the example. Today, NoSQL document stores allow you to avoid the complexities that were caused by the original requirement to store flat tabular data. They allow development teams to avoid a lot of unnecessary work.
There have been multiple efforts to mitigate the complexities of four-step translation. Tools like Hibernate and Ruby on Rails are examples of frameworks that try to manage the complexity of object-relational mapping. These were the only options available until developers realized that using a NoSQL solution to store the document structure directly in the database, without converting it to another format or shredding it into tables and rows, is a better solution.
This lack of translation makes NoSQL systems simpler to use and, in turn, allows subject matter experts (SMEs) and other nonprogramming staff to participate directly in the application development process. By encouraging SMEs to have direct involvement in building applications, course corrections can be made early in the software development process, saving the time and money associated with rework.
NoSQL technologies show how moving from storing data in tables to storing data in documents opens up possibilities for new ways of using and presenting data. As you move your systems out of the back room to the World Wide Web, you'll see how NoSQL solutions can make implementation less painful. The short sketch below shows the contrast in miniature. After that, we'll look at combining a no-translation architecture with web standards to create a development platform that's easy to use and portable across multiple NoSQL platforms.
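Here's that contrast in miniature: a hypothetical order saved as a document in a single step versus shredded into relational tables and reassembled with a join. The document-store call is shown as a comment for a generic client, and the relational half runs against SQLite purely for illustration.

import sqlite3

# A natural business object: an order containing repeating line items.
order = {"id": "o-1001", "customer": "Acme",
         "lines": [{"sku": "A1", "qty": 2}, {"sku": "B9", "qty": 1}]}

# Document store: save and load the object as-is, so translations T2
# and T3 in figure 9.5 disappear. With MongoDB, for example:
#     db.orders.insert_one(order)

# RDBMS: shred the object into rows across two tables...
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer TEXT)")
con.execute("CREATE TABLE order_lines (order_id TEXT, sku TEXT, qty INTEGER)")
con.execute("INSERT INTO orders VALUES (?, ?)",
            (order["id"], order["customer"]))
con.executemany("INSERT INTO order_lines VALUES (?, ?, ?)",
                [(order["id"], l["sku"], l["qty"]) for l in order["lines"]])

# ...and re-sew it back together with a join every time you read it.
rows = con.execute("""SELECT o.id, o.customer, l.sku, l.qty
                      FROM orders o
                      JOIN order_lines l ON l.order_id = o.id""")
for row in rows:
    print(row)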
9.4 Case study: using XRX to manage complex forms
This case study will look at how zero translation can be used to store complex form data. We'll look at the characteristics of complex forms, how XForms works, and how the XRX web application architecture uses a document store to allow nonprogrammers to build and maintain these form applications.
XRX stands for the three standards that are used to build web form-based applications. The first X stands for XForms; the R stands for REST; and the final X stands for XQuery, the W3C server-side functional programming language we introduced you to in chapter 5.
9.4.1 What are complex business forms?
If you've built HTML form applications, you know that building simple forms like a login page or registration form is straightforward. Generally, simple forms have a few fields, a selection list, and a Save button, all of which can be built using a handful of HTML tags.
This case study looks at a complex class of forms that need more than simple HTML elements. These complex forms are similar to those you'll find in a large company or perhaps a shopping cart form on a retail website. If you can store your form data in a single row within an RDBMS table, then you might not have a complex business application as defined here; using simple HTML and a SQL INSERT statement may be the best solution for your problem. But our experience is that there's a large class of business forms that go beyond what HTML forms can do. Complex business forms have complex data and complex user interfaces. They share some of the following characteristics:
Repeating elements—Conditionally adding two or more items to a form. For example, when you enter a person, they may have multiple phone numbers, interests, or skills.
Conditional views—Conditionally enabling a region of a form based on how the user fills out various fields. For example, the question "Are you pregnant?" should be disabled if a patient's gender is male.
Cascading selection—Changing one selection list based on the value of another list. For example, if you select a country code, a list of state or province codes will appear for that country.
Field-by-field validation—Business rules check the values of each field and give the user immediate feedback.
Context help—Fields have help and hint text to guide users through the selection process.
Role-based contextualization—Each role in an organization might see a slightly different version of the form. For example, only users with a role of Publisher might see an Approve for Publishing button.
Type-specific controls—If you know an element is a specific data type, XForms can automatically change the user interface control. Boolean true/false values appear as check boxes, and date fields have calendars that allow you to pick the date.
Autocomplete—As users enter characters in a field, you want to be able to suggest the remainder of the text in the field. This is also known as autosuggest.
Although these aren't the only features that XForms supports, they give you some idea of the complexity involved in creating forms. Now let's see how these features can be added to forms without the need for complex JavaScript.
9.4.2 Using XRX to replace client JavaScript and object-relational mapping
All of the features listed in the prior section could be implemented by writing JavaScript within the web browser. Your custom JavaScript code would send your objects to an object-relational mapping layer for storage within an RDBMS, which is how many enterprise forms are created today. If you and your team are experienced JavaScript developers and know your way around object-relational frameworks, this is one option for you to consider. But NoSQL document stores provide a much simpler option. With XRX, you can develop these complex forms without using JavaScript or an object-relational layer. First let's review some terminology and then see how XRX can make this process simpler.
Figure 9.6 shows the three main components of a form: model, binding, and view. When a user fills out a form, the data that's stored when they click Save is called the model. The components that are on the screen and visible to the user, including items in a selection list, are referred to as the view components of a form. The process of associating the model elements with the view is called binding. We call this a model-view forms architecture.
To use a form in a web browser, you must first move the default or existing form data from the database to the model. After the model data is in place, the form takes over and manages the movement of data between the model and the views using the bindings. As a user enters information into a field, the model data is updated.
Figure 9.6 The main components in a form. The model holds the business data to be saved to the server. The view displays the individual fields, including selection list options. The binding associates the user interface fields with model elements.
When the user selects Save, the updated model data is stored on the server.

Figure 9.7 The breakdown of code used in a typical forms application when an RDBMS, middle-tier objects, and JavaScript are used. The wedges include object-relational mapping, updating view elements from the model, binding model elements to view elements, AJAX calls and libraries, the middle tier, and logic to validate field-level data and display rules. Objects are moved to and from a model within the client. XRX attempts to automate all but the critical business logic components of forms processing (top-left wedge).

Figure 9.7 shows a typical breakdown of the code needed to implement business forms using a standard model and view architecture. From the figure, you can see that close to 90% of this code is similar for most business forms and involves moving data between layers. This code does all the work of getting data in and out of the database and moving the data around the form as users fill in the fields. These steps include transferring the data to a middle tier, then moving the data into the model of the browser, and finally, when the user clicks Save, reversing the path.
XRX attempts to automate all the generic code that moves the data. In turn, the software developer focuses on selecting and validating field-level controls as the user enters the data, and on the logic to conditionally display parts of the form. This architecture has a positive impact on agility, and it empowers nonprogrammers to build robust applications without ever learning JavaScript or object-relational mapping systems. Figure 9.8 shows how this works within the XRX web application architecture.
With XForms, the entire structure that you're editing is stored as an XML document within the model of the browser. On the server side, document stores, such as a native XML database, are used to store these structures. The beauty of this architecture is that there's no conversion or need for object-relational mapping; everything fits together like a hand in a glove.

Figure 9.8 How XML files from a native XML database are loaded into the model of an XForms application. This is an example of a zero-translation architecture. Once the XML data is loaded into the XForms model, views will allow the user to edit the model content.

Most native XML databases come with a built-in REST interface. This is where the R in XRX comes in. XForms allows you to add a single XML element called a submission, which tells the form which REST URL to use to load and save the model document.
XForms is an example of a declarative domain-specific language customized for data entry forms. Like other declarative systems, XForms allows you to specify what you want the forms to do while avoiding the details of how the forms perform the work. XForms is a collection of approximately 25 elements that are used to implement complex business forms. A sample of XForms markup is shown in figure 9.9, and a minimal sketch of similar markup appears at the end of this section. Because XForms is a W3C standard, it reuses many of the same standards found in XQuery, such as XPath and XML Schema datatypes. XForms works well with native XML databases.
There are several ways to implement an XForms solution. You can use one of several open source XForms frameworks (Orbeon, BetterForm, XSLTForms, OpenOffice 3) or a commercial XForms framework (IBM Workplace, EMC XForms). These tools work with web browsers to interpret the XForms elements and convert them into the appropriate behavior. Some XForms tools from Orbeon, OpenOffice, and IBM also have form-builder tools that use XForms to lay out and connect the form elements to your model.

Figure 9.9 This figure shows how data from a native XML database is loaded into the model of XForms. Once the XML data is loaded into the model, XForms views will allow the user to edit the model content in a web form. When the user selects Save, the model is saved without change directly into the database. No object-relational mapping is needed.

After you've created your form, you need to be able to save your form data to the database. This is where using a native XML database shines. XForms stores the data you've entered in the form in a single XML document. Most save operations into a native XML database can be done with a single line of code, even if your form is complex.
The real strength of XRX emerges when you need to modify your application. Adding a new field can be done by adding a new element to the model and a few lines of code to the view elements. That's it! No need to change the database, no recompilation of middle-tier objects, and no need to write additional JavaScript code within the browser. XRX keeps things simple.
9.4.3 Understanding the impact of XRX on agility
Today, we live in a world where many information technology departments aren't focused on making it easy for business users to build and maintain their own applications. Yet this is exactly what XRX does. Form developers are no longer constrained by the development schedules of overworked IT staff who don't have time to learn the complex business rules of each department. With some training, many users can maintain and update their own forms applications.
XRX is a great example of how simplicity and standards drive agility. The fewer the components, the easier it is to build new forms and change existing ones. Because standards are reused, your team doesn't have to learn a new JavaScript library or a new data format. XRX and NoSQL can have a transformative impact on the way work is divided in a project. It means that information technology staff can focus on other tasks rather than updating business rules in Java and JavaScript every time there's a change request.
The lessons you've learned with XRX can be applied to other areas as well. If you have a JSON document store like MongoDB, Couchbase, or CouchDB, you can build JSON services to populate XForms models. There are even versions of XForms that load and save JSON documents. The important architectural element is the elimination of the object-relational layer and the ability to avoid JavaScript programs on the client. If you do this, your SMEs can take a more active role in your projects and increase organizational agility.
9.5 Summary
The primary business driver behind the NoSQL movement was the need for graceful horizontal scalability. This forced a break from the past and opened new doors to innovative ways of storing data. It allowed innovators to build mature systems that facilitate agile software development. The phrase "We came for the scalability—we stayed for the agility" is an excellent summary of this process.
In this chapter, we stressed that comparing the agility of two database architecture alternatives is difficult, because overall agility is buried in the software developer's stack. Despite the challenges, we feel it's worthwhile to create use-case-driven thought experiments to help guide objective evaluations.
Almost all NoSQL systems demonstrate excellent scale-out architecture for operational agility. Simple architectures that drive complexity out of the process enable agility in many areas. Our evidence shows that both key-value stores and document stores can provide huge gains in developer agility if applied to the right use cases. Native XML systems that work with web standards also work well with other agility-enhancing systems and empower nonprogrammers.
This is the last of our four chapters on building NoSQL solutions for specific types of business problems. We've focused on big data, search, high availability, and agility. Our next chapter takes a different track. It'll challenge you to think about how you approach problem solving, and introduce new thinking styles that lend themselves to parallel processing. We'll then wrap up with a discussion on security before we look at the formal methods of system selection.
9.6 Further reading
Henbury, Cheshire. "The Problem." http://mng.bz/9I9e.
"Hibernate (Java)." Wikipedia. http://mng.bz/tEr6.
Hodges, Nick. "Developer Productivity." November 2012. http://mng.bz/6c3q.
Hugos, Michael. Business Agility. Wiley, 2009. http://mng.bz/6z3I.
——— "IT Agility Means Simple Things Done Well, Not Complex Things Done Fast." CIO, International Data Group, February 2008. http://mng.bz/dlzN.
"JSON Processing Using XQuery." Progress Software. http://mng.bz/aMkn.
Robie, Jonathan. "JSONiq: XQuery for JSON, JSON for XQuery." 2012 NoSQL Now! Conference. http://mng.bz/uBSe.
"Ruby on Rails." Wikipedia. http://mng.bz/7q73.
TechValidate. 10gen, MongoDB, productivity chart. March 2012. http://mng.bz/gH0M.
Part 4 Advanced topics
In part 4 we look at two advanced topics associated with NoSQL: functional programming and system security. In our capstone chapter we combine all of the information we've covered so far and walk you through the process of selecting the right SQL system, NoSQL system, or combination of systems for your project.
At this point the relationship between NoSQL and horizontally scalable systems should be clear. Chapter 10 covers how functional programming, REST, and actor-based frameworks can be used to make your software more scalable.
In chapter 11 we compare the fine-grained access control of RDBMSs to document-related access control systems. We also look at how you might control access to NoSQL data using views, groups, and role-based access control (RBAC) mechanisms to get the level of security your organization requires.
In our final chapter we take all of the concepts we've explored so far and show you how to match the right database to your business problem. We cover the architectural trade-off process, from gathering high-level requirements to communicating results using quality trees. This chapter provides you with a roadmap for selecting the right NoSQL system for your next project.
NoSQL and functional programming
This chapter covers
Functional programming basics
Examples of functional programming
Moving from imperative to functional programming
The world is concurrent. Things in the world don’t share data. Things communicate with messages. Things fail. —Joe Armstrong, cocreator of Erlang
In this chapter, we’ll look at functional programming, the benefits of using a func- tional programming language, and how functional programming forces you to think differently when creating and writing systems. The transition to functional programming requires a paradigm shift away from software designed to control state and toward software that has a focus on indepen- dent data transformation. Most popular programming languages used today, such as C, C++, Java, Ruby, and Python, were written with the needs of a single node as a target platform in mind. Although the compilers and libraries for these languages
do support multiple threads on multicore CPUs, the languages and their libraries were created before NoSQL and horizontal scalability on multiple-node clusters became a business requirement. In this chapter, we'll look at how organizations are using languages that focus on isolated data transformations to make working with distributed systems easier.
In order to meet the needs of modern distributed systems, you must ask yourself how well a programming language will allow you to write applications that can scale to serve millions of users connected on the web. It's no longer sufficient to design a system that will scale to 2, 4, or 8 core processors. You need to ask if your architecture will scale to 100, 1,000, or even 10,000 processors.
As we've discussed throughout this book, most NoSQL solutions have been specifically designed to work on many computers. It's the hallmark of horizontal scalability to keep all processors in your cluster working together and adapting to changes in the cluster automatically. Adding these features after the fact is usually not possible. It must be part of the initial design, in the lowest levels of your application stack. The inability of SQL joins to scale out is an excellent example of how retrofits don't work.
Some software architects feel that to make the shift to true horizontal scale-out, you need to make a paradigm shift at the language and runtime library level. This is the shift from the traditional world of object and procedural programming to functional programming. Today most NoSQL systems are embracing the concepts of functional programming, even if they're using some traditional languages to implement the lower-level algorithms. In this chapter, we'll look into the benefits of the functional programming paradigm and show how it's a significant departure from the way things are taught in most colleges and universities today.
10.1 What is functional programming?
To understand what functional programming is and how it's different from other programming methods, let's look at how software paradigms are classified. A high-level taxonomy of software paradigms is shown in figure 10.1.
We'll first look at how most popular languages today are based on managing program state and memory values. We'll contrast this with functional programming, which has a focus on data transformation. We'll also look at how functional programming systems are closely matched with the requirements of distributed systems.
After reading this chapter, you'll be able to visualize functional programming as isolated transformations of data flowing through a series of pipes. If you can keep this model in mind, you'll understand how these transforms can be distributed over multiple processors and promote horizontal scalability. You'll also see how side effects prevent systems from achieving these goals.
Figure 10.1 A high-level taxonomy of software paradigms. In the English language, an imperative sentence is a sentence that expresses a command. “Change that variable now!” is an example of an imperative sentence. In computer science, an imperative programming paradigm contains sequences of commands that focus on updating memory. Procedural paradigms wrap groups of imperative statements in procedures and functions. Declarative paradigms focus on what should be done, but not how. Functional paradigms are considered a subtype of declarative programming because they focus on what data should be transformed, but not how the transforms will occur.
10.1.1 Imperative programming is managing program state
The programming paradigm of most computer systems created in the past 40 years centers around state management, or what are called imperative programming systems. Procedural and object languages are both examples of imperative programming systems. Figure 10.2 is an illustration of this state-management-focused system.
Figure 10.2 Imperative programs divide physical memory (RAM) into two functional regions. One region holds the data block and the other the program block. The state of the program is managed by a program counter that steps through the program block, reading instructions that read and write variables in computer memory. The programs must carefully coordinate the reading and writing of data and ensure that the data is valid and consistent.
Using program counters and memory to manage state was the goal of John von Neumann and others in the 1940s when they developed the first computer architecture. It specified that both data and programs were stored in the same type of core memory, and a program counter stepped through a series of instructions and updated memory values. High-level programming languages divided memory into two regions: data and program memory. After a computer program was compiled, a loader loaded the program code into program memory, and the data region of memory was allocated to store programming variables.

This architecture worked well for single-processor systems as long as it was clear which programs were updating specific memory locations. But as programs became more complex, there was a need to control which programs could update various regions of memory. In the late 1950s, a new trend emerged: protecting data by allowing only specific access methods to update regions of data memory. The data and the methods, when used together, formed new programming constructs called objects. An example of object state is shown in figure 10.3.

Objects are ideal for simulating real-world objects on a single processor. You model the real world in a series of programming objects that represent the state of the things you’re simulating. For example, a bank account might have an account ID, an account holder name, and a current balance. A single bank location might be simulated as an object that contains all the accounts at that bank, and an entire financial institution might be simulated by many bank objects.

But the initial object model, which used methods to guard object state, didn’t have any inherent functions to manage concurrency. Objects themselves didn’t take into account that there might be hundreds of concurrent threads all trying to update the state of the objects. When it came to “undoing” a series of updates that failed halfway through a transaction, the simple object model became complex to manage. Keeping track of the state of many objects in an imperative world can become complex.
Figure 10.3 Object architecture uses encapsulation to protect the state of any object with accessor functions or methods. In this example, there are two internal states, a and b. In order to get the state of a, you must use the get-a method. To set the state of a, you must use the set-a method. These methods are the gatekeepers that guard all access to the internal state variables of an object.
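To make this concrete, here’s a minimal sketch of figure 10.3’s object in Python (the class and method names are our own illustration, not from any particular system):

    class StateHolder:
        """Encapsulates two internal state variables behind accessor methods."""

        def __init__(self):
            self._a = 1   # internal state; by convention, not accessed directly
            self._b = 7

        def get_a(self):
            return self._a          # the only sanctioned way to read a

        def set_a(self, value):
            # The setter is the single gatekeeper for writes to a, so
            # validation or locking logic could be added here later.
            self._a = value

Note that nothing in this sketch coordinates concurrent callers: if hundreds of threads call set_a at once, the object model itself offers no protection, which is exactly the weakness described above.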
Let’s take a look at a single operation on an object. In this example, the objective is to increment a single variable. The code might read x = x + 1, meaning “x is assigned a new value of itself plus one.” What if one computer or part of a network crashes right around this line of code? Did that code get executed? Does the current state represent the correct value, or do you need to rerun that line of code?

Keeping track of the state of an ever-changing world is easy on a single system, but becomes more complex when you move to distributed computing. Functional programming offers a different way of approaching computation by recasting computation as a series of independent data transformations. After you make this transition, you don’t need to worry about the details of state management. You move into a new paradigm where all variables are immutable and you can restart transforms without worrying about what external variables you’ve changed in the process.
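As a small illustration of the difference (our own sketch, in Python): the imperative style mutates x in place, while the functional style returns a new value and leaves its input untouched, so the transform can be rerun safely after a crash:

    # Imperative style: the variable's state changes in place.
    x = 1
    x = x + 1          # rerunning this line changes the answer again

    # Functional style: an immutable transform with no side effects.
    def increment(value):
        return value + 1   # the same input always yields the same output

    y = increment(1)       # safe to rerun; increment(1) is always 2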
10.1.2 Functional programming is parallel transformation without side effects

In contrast with keeping track of the state of objects, functional programming focuses on the controlled transformation of data from one form to another with a function. Functions take input data and generate output data, in the same way that you create mathematical functions in algebra. Figure 10.4 is an example of how functional programs work in the same way mathematical functions work.

It’s important to note that functional programming imposes a new constraint. You can’t update the state of the external world at any time during your transform. Additionally, you can’t read from the state of the external world while you’re executing a transform. The output of your transform can depend only on the inputs to the function. If you adhere to these rules, you’ll be granted great flexibility! You can execute your transform on clusters that have thousands of processors waiting to run your code. If any processor fails, the process can be restarted using the same input data without issue. You’ve entered the world of side-effect-free concurrent programming, and functional programming has opened this door for you.

The concepts of serial processing are rooted in imperative systems. As an example, let’s look at how a for loop or iteration is processed in a serial versus a parallel processing environment.
Figure 10.4 Functional programming works similarly to mathematical functions. Instead of methods that modify the state of objects, functional programs transform only input data without side effects.
www.it-ebooks.info 214 CHAPTER 10 NoSQL and functional programming
Using imperative serial processing (a for loop in JavaScript):

    for (i = 0; i < 5; i++) {
        n = n + 1;
    }

Using functional parallel processing (a for loop in XQuery):

    let $seq := ('a','b','c'…)
    for $item in $seq
    return my-transform($item)
Figure 10.5 Iterations or for loops in imperative languages calculate one iteration of a loop and allow the next iteration to use the results of a prior loop. The left panel shows an example using JavaScript with a mutable variable that’s incremented. With some functional programming languages, iteration can be distributed on independent threads of execution. The result of one loop can’t be used in other loops. An example of an XQuery for loop is shown in the right panel.
Whereas imperative languages perform transformations of data elements serially, with each loop starting only after the prior loop completes, functional programming can process each loop simultaneously and distribute the processing on multiple threads. An example of this is shown in figure 10.5.

As you can see, imperative programming can’t process this in parallel, because the state of the variables must be fully calculated before the next loop begins. Some functional programming languages such as XQuery keep each loop as a separate and fully independent thread of execution. But in some situations parallel execution isn’t desirable, so there are now proposals to add a sequential option to XQuery functions.

To understand the difference between an imperative and a functional program, it helps to have a good mental model. The model of a pipe without any holes, as shown in figure 10.6, is a good representation. The shift of focus from updating mutable variables to using only immutable variables within independent transforms is the heart of the paradigm shift that underpins many NoSQL systems. This shift is required so that you can achieve reliable, high-performance horizontal scaling in multiprocessor data centers.
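Here’s a hedged sketch of the parallel for loop idea using Python’s standard concurrent.futures module; my_transform is a stand-in for any side-effect-free function of your own:

    from concurrent.futures import ProcessPoolExecutor

    def my_transform(item):
        # A pure function: the output depends only on the input item.
        return item.upper()

    if __name__ == '__main__':
        seq = ['a', 'b', 'c', 'd']
        with ProcessPoolExecutor() as pool:
            # Each item can be transformed on a separate worker process;
            # no iteration can read the intermediate results of another.
            results = list(pool.map(my_transform, seq))
        print(results)   # ['A', 'B', 'C', 'D']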
Figure 10.6 The functional programming paradigm relies on creating a distinct output for each data input in an isolated transformation process. You can think of this as a data transformation pipe. When the transformation of input to output is done without modification of external memory, it’s called a zero-side-effect pipeline. This means you can rerun the transform many times from any point without worrying about the impact on external systems. Additionally, if you prevent reads from external memory during the transformation, you have the added benefit of knowing the same input must generate the exact same output. Then you can hash the input and check a cache to see if the transform has already been done.
The deep roots of functional programming

A person who’s grown up in the imperative world might be thinking, These people must be crazy! How could people ever write software without changing variables? But the concepts behind functional programming aren’t new. They go back to the foundations of computer science in the 1940s. The focus on transformation is a core theme of the lambda calculus, first introduced by Alonzo Church. It’s also the basis of many LISP-based programming languages popular in the 1950s. These languages were used in many artificial intelligence systems because they’re ideal for managing symbolic logic.

Despite their popularity in AI research, LISP languages and their kin (Scheme, Clojure, and other functional languages) aren’t typically found in business applications. Fortran dominated scientific programming because it could be vectorized with specialized compilers. COBOL became popular for accounting systems because it tried to represent business rules in a more natural language structure. Procedural languages such as Pascal became popular in the 1970s, followed by object-oriented systems such as C++ and Java in the 1980s and 1990s. As horizontal scalability becomes more relevant, it’s important to review the consequences of the programming language choices you’ve made. Functional programming has many advantages as processor counts increase.
One way to visualize functional programming in a scale-out system is to look at a series of inputs being transformed by independent processors. A consequence of using independent threads of execution is that you can’t pass the result of an intermediate calculation from one loop to another. Figure 10.7 illustrates this process. This constraint is critical when you design transformations. If you need intermediate results from a transformation, you must exit the function and return the output. If the intermediate result needs additional transformation, then a second function should be called.

Another item to consider is that the order of execution and the time that transforms take to complete may vary based on the size of the data elements. For example, if your inputs are documents and you want to count the number of words in each document, counting the words in a 100-page document will take roughly 10 times longer than counting the words in a 10-page document (see figure 10.8).
Figure 10.7 Functional programming means that the intermediate results of the transformation of item A can’t be used in the transformation of item B, and the intermediate results of B can’t be used in calculating C. Only end results of each of these transforms can be used together to create aggregate values.
Figure 10.8 Functional programming means that you can’t guarantee the order in which items will be transformed or what items will finish first.
This variation in transformation times and the location of the input data add an additional burden on the scheduling system. In distributed systems, input data is replicated on multiple nodes in the cluster. To be efficient, you want the longest-running jobs to start first. To maximize your resources, the tools that place tasks on different nodes in a cluster must be able to gather processing information from multiple data sources. Schedulers that can determine how long a transform will take to run on different nodes and how busy each node is will be most efficient. This information is generally not provided by imperative systems. Even mature systems like HDFS and MapReduce continue to refine their ability to efficiently transform large datasets.
10.1.3 Comparing imperative and functional programming at scale

Now let’s compare the capability of imperative and functional systems to support processing large amounts of shared data being accessed by many concurrent CPUs. A comparison of imperative versus functional pipelines is shown in figure 10.9.

You can see that when you prevent writes during a transformation, you get the benefit of no side effects. This means that you can restart a failed transformation and be certain that if it didn’t finish, the external state of the system wasn’t already updated. With imperative systems, you can’t make this guarantee. Any external changes may need to be undone if there’s a failure during a transformation. Keeping track of which operations have been done can add complexity that will slow large systems down.
Figure 10.9 Imperative programming (left panel) and functional programming (right panel) use different rules when transforming data. To gain the benefits of referential transparency, the output of a transform must be completely determined by the inputs to the transform. No other memory should be read or written during the transformation process. Instead of a pipe with holes on the left, you can visualize your transformation pipes as having solid steel sides that don’t transfer any information.
The no-side-effect guarantee is critical to your ability to create reproducible transforms that are easy to debug and easy to optimize. Not allowing external side-effect writes during a transform keeps transforms running fast.

The second scalability benefit happens when you prohibit external reads during a transform. This rule allows you to know with certainty that the outputs are completely driven by the inputs of a transformation. If the time to create a hash of the input is small relative to the time it takes to run the transform, you can check a cache to see if the transform has already been run. One of the central ideas of the lambda calculus is that the result of a transform of any data can be used in place of rerunning the transform on that data. This ability to substitute cached results, instead of having to rerun a long-running transform, is one way that functional programming systems can be more efficient than imperative systems.

A transform that can be rerun many times without corrupting data is called an idempotent transform or an idempotent transaction. Idempotent transforms will change the state of the world in consistent ways the first time they’re run, but rerunning the transform many times won’t corrupt your data. For example, if you have a filter that inserts missing required elements into an XML file, that filter should check to make sure the elements don’t already exist before adding them.

Idempotent transforms can also be used in transaction processing. Because rerunning an idempotent transform doesn’t further change external state, there’s no need to create an undo process when a transform must be retried. Additionally, you can use transaction identifiers to guarantee idempotency. If you’re running a transaction that increments a bank account, you might record a transaction ID in the bank account’s transaction history. You can then create a rule that runs the update only if that transaction ID hasn’t been applied already. This guarantees that a transaction won’t be run more than once.

Idempotent transactions allow you to use referential transparency. An expression is said to be referentially transparent if it can be replaced with its value without changing the behavior of the program. Any functional programming statement has this property if the output of the transform can be substituted for the call to the transform itself. Referential transparency allows both the programmer and the compiler to look for ways to optimize repeated calls to the same functions on the same data. This optimization technique is only possible when you move to a functional programming paradigm. In the next section, we’ll take a detailed look at how referential transparency allows you to cache the results of functional programs.
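A minimal sketch of the transaction-ID guard just described, in Python (the account structure and function name are illustrative assumptions, not a real database API):

    def apply_deposit(account, txn_id, amount):
        """Idempotent update: replaying the same transaction ID is a no-op."""
        if txn_id in account['history']:
            return account                    # already applied; nothing changes
        account['balance'] += amount
        account['history'].add(txn_id)        # record the ID so reruns are safe
        return account

    account = {'balance': 100, 'history': set()}
    apply_deposit(account, 'txn-42', 25)
    apply_deposit(account, 'txn-42', 25)      # a replayed message has no effect
    print(account['balance'])                 # 125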
10.1.4 Using referential transparency to avoid recalculating transforms

Now that you know how functional programs promote idempotent transactions, let’s look at how these results can be used to speed up your system. You can use these techniques in many systems, from web applications to NoSQL databases to the results of MapReduce transforms.
Figure 10.10 Your local browser will check to see if the URL of an image is in a local cache before it goes out to the web to get the image. Once the image is fetched from a remote host (usually a slow process), it can be cached for future use. Getting an image from a local cache is much more efficient.
Referential transparency allows functional programming and distributed systems to be more efficient because they can be clever about avoiding the recalculation of the results of transforms. Referential transparency lets you use a cached version of an item to stand in for something that takes a long time to calculate or a long time to read from a disk filesystem.

Let’s take a simple example, as shown in figure 10.10, of how an image is displayed within a web browser. By default, browsers treat all images referenced by the same URL as the same “static” image, since the content of the image doesn’t change even if you request it repeatedly. If the same image is rendered on multiple web pages, the image will be retrieved only once from the web server. But to do this, all pages must reference the exact same URL. Subsequent references will use a copy of the image stored in a local client cache. If the image changes, the only way to get the updated image is to remove the original item from your local cache and refresh your screen, or to reference a new URL.

By default your browser considers many items static, such as CSS files, JavaScript programs, and some data files, if they’re marked accordingly by properties in the HTTP headers. If your data files do change, you can instruct your server to add information to a file to indicate that the cached copy is no longer valid and that a new copy must be fetched from the server.

This same concept of caching documents also applies to the queries and transforms used to generate reports or web pages. If you have a large document that depends on many transforms of smaller documents, then you can reassemble the large document from the cached copies of the smaller documents that don’t change, and re-execute only the transforms of the smaller documents that do change. The process of using transforms of documents from a cache is shown in figure 10.11.

This transform optimization technique fits perfectly into functional programming systems and serves as the basis for innovative, high-performance website tools. As we move into our case study, you’ll see how NetKernel, a performance optimization tool, is used to optimize web page assembly.
Figure 10.11 You can use referential transparency to increase the performance of any transform by checking to see if the output of a transform is already in a cache. If it’s in the cache, you can use the cached value instead of rerunning an expensive transformation process. If the item isn’t in the cache, you can store it in a RAM cache such as memcache to reuse the output.
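The hash-check-store loop in figure 10.11 takes only a few lines to sketch in Python; here a plain dict stands in for a shared RAM cache such as memcache, and expensive_transform is a placeholder for your real long-running transform:

    import hashlib

    cache = {}                                 # stand-in for a shared RAM cache

    def expensive_transform(data):
        return data[::-1]                      # placeholder for real work

    def cached_transform(data):
        key = hashlib.sha256(data.encode()).hexdigest()   # hash the input
        if key in cache:                       # cache hit: reuse prior output
            return cache[key]
        result = expensive_transform(data)     # cache miss: run the transform
        cache[key] = result                    # store for future use
        return result

This substitution is only safe because the transform is referentially transparent: the same input is guaranteed to produce the same output.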
10.2 Case study: using NetKernel to optimize web page content assembly

This case study looks at how functional programming and REST services can be combined to optimize the creation of dynamic nested structures. As an example, we’ll use the creation of a home page for a news-oriented website. We’ll look at how an approach called resource-oriented computing (ROC) is implemented by a commercial framework called NetKernel.
10.2.1 Assembling nested content and tracking component dependencies

Let’s assume you’re running a website that displays news stories in the center of a web page. The web page is dynamically assembled using page headers, footers, navigation menus, and advertising banners around the news content. A sample layout is shown in the left panel of figure 10.12. You use a dependency tree, as shown in the right panel of figure 10.12, to determine when each web page component should be regenerated.

Though we think of public URLs as identifying web pages and images, you can use this same philosophy with each subcomponent in the dependency tree by treating each component as a separate resource and assigning it an internal identifier called a uniform resource identifier (URI). In the web page model, each region of the web page is a resource that can be assembled by combining other static and dynamic resources. Each distinct resource has its own URI.

In the previous section, you saw how functional programming uses referential transparency to allow cached results to be used in place of the output of a function. We’ll apply this same concept to web page construction. You can use URIs to check whether a function call has already been executed on the same input data, and you can track dependencies to see when functions call other functions.
Figure 10.12 Web pages for a typical news site are generated using a tree structure. All components of the web page can be represented in a dependency tree. As low-level content such as news or ads change, only the dependent parts of the web page need to be regenerated. If a low-level component changes (heavy-line links), then all ancestor nodes must be regenerated. For example, if some text in a news article changes, that change will cause the center section, the central content region, and the entire page to be regenerated. Other components such as the page borders can be reused from a cache layer without regeneration.
If you do this, you can avoid calling the same functions on the same input data and reuse information fragments that are expensive to generate.

NetKernel knows what functions are called with what input data, and uses a URI to identify whether the function has already generated results for the same input data. NetKernel also tracks URI dependencies, re-executing functions only when input data changes. This process is known as the golden thread pattern. To illustrate, think of hanging your clean clothes out on a clothesline. If the clothesline breaks, the clothes fall to the ground and must be washed again. Similarly, if an input item changes at a low level of a dependency tree, all the items that depend on its content must be regenerated. NetKernel automatically regenerates content for internal resources in its cache and can poll external resource timestamps to see if they’ve changed. To determine whether any resources have changed, NetKernel uses an XRL file (an XML file used to track resource dependencies) and a combination of polling and expiration timestamps.
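To see the golden thread idea in miniature, here’s a sketch (ours, not NetKernel’s actual API) in which each resource URI records the resources built from it, and invalidating a low-level resource evicts every ancestor that depends on it:

    # Map each resource URI to the URIs that are built from it.
    dependents = {
        'news/article-17': ['page/center'],
        'page/center':     ['page/home'],
    }
    valid_cache = {'news/article-17', 'page/center', 'page/home'}

    def invalidate(uri):
        """Evict a resource and, recursively, everything that depends on it."""
        valid_cache.discard(uri)
        for parent in dependents.get(uri, []):
            invalidate(parent)

    invalidate('news/article-17')   # the article text changed
    print(valid_cache)              # set(): the whole ancestor chain was evicted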
10.2.2 Using NetKernel to optimize component regeneration

The NetKernel system takes a systematic approach to tracking what’s in your cache and what should be regenerated. Instead of using hashed values for keys, NetKernel constructs URIs that are associated with a dependency tree and uses models that calculate the effort required to regenerate content. NetKernel performs smart cache-content optimization and creates cache-eviction strategies by looking at the total amount of work it takes to generate a resource. NetKernel uses an ROC approach to determine the total effort required to generate a resource. Though ROC is a term specific to NetKernel, it
signifies a different class of computing concepts that challenge you to reflect on how computation is done.

ROC combines the UNIX concept of small modular transforms on data moving through pipes with the distributed computing concepts of REST, URIs, and caching. These ideas are all based on referential transparency and dependency tracking to keep the right data in your cache. ROC requires you to associate a URI with every component that generates data: queries, functions, services, and code. By combining these URIs, you can create unique signatures that can be used to determine whether a resource is already present in your cache. A sample of the NetKernel stack is shown in figure 10.13.

As you can see from figure 10.13, the NetKernel software is layered between two layers of resources and services. Resources can be thought of as static documents, and services as dynamic queries. This means that moving toward a service-oriented intermediate layer between your database and your application is critical to the optimization process. You can still use NetKernel and traditional consistent hashing without a service layer, but you won’t get the same level of clever caching that the ROC approach gives you.

By using ROC, NetKernel takes REST concepts to a level beyond caching images or documents on your web server. Your cache is no longer subject to a simple timestamped eviction policy, and the most valuable items remain in cache. NetKernel can be configured to use complex algorithms to calculate the total effort it takes to put an item in cache, and it only evicts items that take little effort to regenerate or are unlikely to be needed. To get the most benefit, a service-oriented REST approach should be used, and those services need to be powered by functions that return data with referential transparency. You can only begin this journey if your system is truly free from side effects, and this implies you may need to take a close look at the languages your systems use.
NetKernel sits between your static resources and dynamic services layers to prevent unneeded calls to your NoSQL database. It separates logical resources expressed as URIs from the physical calls to a NoSQL database.
Figure 10.13 NetKernel works similarly to a memcache system that separates the application from the database. Unlike memcache, it’s tightly coupled with the layer that’s built around logical URIs and tracks dependencies of expensive calculations of objects that can be cached.
In summary, if you associate URIs with the output of referentially transparent functions and services, frameworks such as NetKernel can bring the following benefits:
■ Faster web page response time for users
■ Reduced wasteful re-execution of the same functions on the same data
■ More efficient use of front-end cache RAM
■ Decreased demand on your database, network, and disk resources
■ Consistent development architecture

Note that the concepts used in this front-end case study can also be applied to back-end analytics. Any time you see functions being re-executed on the same datasets, there are opportunities for optimization using these techniques. In our next section, we’ll look at functional programming languages and see how their properties allow you to address specific types of performance and scalability issues.

10.3 Examples of functional programming languages

Now that you have a feeling for how functional programs work and how they’re different from imperative programs, let’s look at some real-world examples of functional programming languages.

The LISP programming language is considered the pioneer of functional programming. LISP was designed around the concept of no-side-effect functions that work on lists, and it made frequent use of recursion over lists. The Clojure language is a modern LISP dialect that has many benefits of functional programming, with a focus on the development of multithreaded systems.

Developers who work with content management and single-source publishing systems may use transformation languages such as XSLT and XQuery (introduced in chapter 5). It’s no surprise that document stores also benefit from functional languages that leverage recursive processing. Document hierarchies that contain other hierarchies are ideal candidates for recursive transformation. Document structures can be easily traversed and transformed using recursive functions. The XQuery language is a perfect fit with document stores because it supports recursion and functional programming, yet uses database indexes for fast retrieval of elements.

There has been strong interest among developers working on high-availability systems in a functional programming language called Erlang. Erlang has become one of the most popular functional languages for writing NoSQL databases. Erlang was originally developed by Ericsson, the Swedish telecommunications firm, to support distributed, highly available phone switches. Erlang supports features that allow the runtime libraries to be upgraded without service interruption. NoSQL databases that focus on high availability, such as CouchDB, Couchbase, Riak, and Amazon’s SimpleDB service, are all written in Erlang.

The Mathematica language and the R language for statistical analysis also use functional programming constructs. These ideas allow them to be extended to run on a large number of processors. Even SQL, which doesn’t allow for mutable values, has
some properties of functional languages. Both the SQL-like HIVE language and the PIG system that are used with Hadoop include functional concepts.

Several multiparadigm languages have also been created to help bridge the gap between imperative and functional systems. The programming language Scala was created to add functional programming features to the Java platform. For software developers using Microsoft tools, Microsoft created the F# (F sharp) language to meet the needs of functional programmers. These languages are designed to allow developers to use multiple paradigms, imperative and functional, within the same project. They have the advantage that they can use libraries written in both imperative and functional languages.

The number of languages that integrate functional programming constructs in distributed systems is large and growing. Perhaps this is driven by the need to write MapReduce jobs in languages in which people feel comfortable. MapReduce jobs are being written in more than a dozen languages today, and that list continues to grow. Almost any language can be used to write MapReduce jobs as long as those programs don’t have side effects. This requires more discipline and training when using imperative languages that allow side effects, but it’s possible.

This shows that functional programming isn’t a single attribute of a particular language. Functional programming is a collection of properties that make it easier to solve specific types of performance, scalability, and reliability problems within a programming language. This means that you can add features to an old procedural or object-oriented language to make it behave more like a pure functional language.

10.4 Making the transition from imperative to functional programming

We’ve spent a lot of time defining functional programming and describing how it’s different from imperative programming. Now that you have a clear definition of functional programming, let’s look at some of the things that will change for you and your development team.
10.4.1 Using functions as a parameter of a function

Many of us are comfortable passing parameters of different data types to functions. A function might have input parameters that are strings, integers, floats, Booleans, or a sequence of items. Functional programming adds another type of parameter: the function. In functional programming, you can pass a function as a parameter to another function, which turns out to be incredibly useful. For example, if you have a compressed zip file, you might want to uncompress it and pass in a filter function to extract only specific data files. Instead of extracting everything and then making a second pass on the output, the filter intercepts files before they’re uncompressed and stored.
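A short Python sketch of this idea (the file name and helper are our own illustration): the caller passes in the filter function that decides which members of a zip archive are extracted:

    import zipfile

    def extract_matching(zip_path, keep):
        # 'keep' is itself a function, passed in as a parameter.
        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():
                if keep(name):                 # filter before extraction
                    archive.extract(name)

    # The caller chooses the policy by handing over a function.
    extract_matching('data.zip', lambda name: name.endswith('.csv'))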
10.4.2 Using recursion to process unstructured document data

If you’re familiar with LISP, you know that recursion is a popular construct in functional programming. Recursion is the process of creating functions that call themselves. In our experience, people either love or hate recursion; there’s seldom a middle ground. If you’re not comfortable with recursion, it can be difficult to create new recursive programs. Yet once you create them, they seem to almost acquire a magical property: they can be the smallest programs that produce the biggest results.

Functional programs don’t manage state, but they do use a call stack to remember where they are when moving through lists. Functional programs typically analyze the first element of a list, check whether there are additional items, and, if there are, call themselves on the remaining items. This process can be used with lists as well as with tree structures like XML and JSON files. If you have unstructured documents that consist of elements whose order you can’t predict, then recursive processing may be a good way to digest the content. For example, when you write a paragraph, you can’t predict the order in which bold or italic text will appear in it. Languages such as XQuery and JSONiq support recursion for this reason.
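Here’s a hedged Python sketch of recursive processing over a JSON-like nested document; the structure and the word-count task are our own illustration of the pattern:

    def count_words(node):
        """Recursively count words in a nested document structure."""
        if isinstance(node, str):
            return len(node.split())                           # leaf: plain text
        if isinstance(node, dict):
            return sum(count_words(v) for v in node.values())  # element: children
        if isinstance(node, list):
            return sum(count_words(item) for item in node)     # sequence: items
        return 0

    doc = {'p': ['some text with ', {'b': 'bold'}, ' and ', {'i': 'italic'}, ' runs']}
    print(count_words(doc))                                    # 7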
10.4.3 Moving from mutable to immutable variables

As we’ve mentioned, functional programming variables are set once and not changed within a given context. This means that you don’t need to store a variable’s state, because you can rerun the transform without the side effect of the variable being incremented again. The downside is that it may take some time for your staff to rid themselves of old habits, and you may need to rewrite your current code to work in a functional environment. Every time you see the same variable on both the left and right side of an assignment operator, you’ll have to modify your code.

When you port your imperative code, you’ll need to refactor algorithms and remove all mutable variables used within for loops. This means that instead of using counters that increment or decrement variables in for loops, you must use a “for each item” type of function. Another way to convert your code is to introduce new variable names when you’re referencing a variable multiple times in the same block of code. By doing this, you can build up calculations that nest references on the right side of assignment statements.
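In Python, the refactoring might look like this (a sketch of the style change, not a prescription): the imperative version reassigns n on every pass, while the functional rewrite never reassigns anything:

    items = [1, 2, 3, 4]

    # Imperative: n appears on both sides of the assignment operator.
    n = 0
    for i in items:
        n = n + i

    # Functional rewrite: a "for each item" expression with no reassignment.
    total = sum(x for x in items)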
10.4.4 Removing loops and conditionals

One of the first things imperative programmers do when they enter the world of functional programs is try to bring their programming constructs with them. This includes the use of loops, conditionals, and calls to object methods. These techniques don’t transfer to the functional programming world. If you’re new to functional programming and you see complex loops with layers of nested if/then/else statements in your functions, this tells you it’s time to refactor your code.
The focus of a functional programmer is to rethink the loops and conditionals in terms of small, isolated transforms of data. The results are services that know what functions to call based on what data is presented to the inputs.
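One common refactoring replaces a chain of if/then/else tests with a lookup table of small transform functions, so the input data itself selects which function runs. A Python sketch (the format names are made up):

    def render_html(record):
        return '<p>' + record['body'] + '</p>'

    def render_text(record):
        return record['body']

    # The dispatch table replaces a nest of if/elif/else statements.
    transforms = {'html': render_html, 'text': render_text}

    def render(record):
        return transforms[record['format']](record)

    print(render({'format': 'html', 'body': 'hello'}))   # <p>hello</p>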
10.4.5 The new cognitive style: from capturing state to isolated transforms

Imperative programming has a consistent problem-solving (or cognitive) style that requires a developer to look at the world around them and capture the state of the world. Once the initial state of the world has been precisely captured in memory or object state, developers write carefully orchestrated methods to update the state of interacting objects.

The cognitive style used in functional programming is radically different. Instead of capturing state, the functional programmer views the world as a series of transformations of data from its initial raw form into other forms that do useful work. Transforms might convert raw data into forms used in indexes, convert raw data into HTML for display in a web page, or generate aggregate values (counts, sums, averages, and others) used in a data warehouse.

Object-oriented and functional programming do share one goal: structuring libraries so that code is reused and the software is easier to use and maintain. With object orientation, the primary way you reuse code is through an inheritance hierarchy: you move common data and methods up to superclasses, where they can be reused by other object classes. With functional programming, the goal is to create reusable transformation functions that build hierarchies where each component in the hierarchy is regenerated when a dependent element’s data changes.

The key consequence of the style you choose is scalability. When you choose the functional programming approach, your transforms will scale to run on large clusters with hundreds or thousands of nodes. If you choose the imperative programming route, you must carefully maintain the state of a complex network of objects when many threads are accessing these objects at the same time. Inevitably you’ll run into the same scalability problems you saw with graph stores in chapter 6 on big data: large object networks might not fit within RAM, locking systems will consume more and more CPU cycles, and caching won’t be used effectively. You’ll spend most of your CPU cycles moving data around and managing locks.
10.4.6 Quality, validation, and consistent unit testing

Regardless of whether imperative or functional programming is used, there’s one observation that won’t change: the most reliable programs are those that have been adequately tested. Too often, we see good test-driven imperative programmers leap into the world of functional programming and become so disoriented that they forget everything they learned about good test-driven development.

Functional programs seem to be inherently more reliable. They don’t have to deal with which objects need to deallocate memory and when the memory can be released.
Their focus is on messaging rather than memory locks. Once a transform is complete, the only artifact is the output document. All intermediate values are easily discarded, and the lack of side effects makes testing a function an atomic process.

Functional programming languages may also have type checking for input and output parameters, and validation functions that execute on incoming items. This makes it easier to do compile-time checking and produces accurate runtime checks that can quickly help developers isolate problems. Yet all these safeguards won’t replace a robust and consistent unit-testing process. To avoid sending corrupt, inconsistent, or missing data to your functions, a comprehensive and complete testing plan should be part of every project.
10.4.7 Concurrency in functional programming

Functional programming systems are popular when multiple processes need to reliably share information, either locally or over networks. In the imperative world, sharing information between processes involves multiple processes reading and writing shared memory, and setting other memory locations, called locks, to determine who has exclusive rights to modify memory. The complexities of who can read and write shared memory values are referred to as concurrency problems. Let’s take a look at some of the problems that can occur when you try to share memory:
■ Programs may fail to lock a resource properly before they use it.
■ Programs may lock a resource and neglect to unlock it, preventing other threads from using the resource.
■ Programs may lock a resource for a long time and prevent others from using it for extended periods.
■ Deadlocks may occur, where two or more threads are blocked forever, each waiting for the other to release a lock.

These problems aren’t new, nor are they exclusive to traditional systems. Traditional as well as NoSQL systems face challenges locking resources on distributed systems that run over unreliable networks. Our next case study looks at an alternative approach to managing concurrency in distributed systems.

10.5 Case study: building NoSQL systems with Erlang

Erlang is a functional programming language optimized for highly available distributed systems. As we mentioned before, Erlang has been used to build several popular NoSQL systems, including CouchDB, Couchbase, Riak, and Amazon’s SimpleDB. But Erlang is used for writing more than distributed databases that need high availability. The popular distributed messaging system RabbitMQ is also written in Erlang. It’s no coincidence that these systems have excellent reputations for high availability and scalability. In this case study, we’ll look at why Erlang has been so popular and how these NoSQL systems have benefited from Erlang’s focus on concurrency and its message-passing architecture.
We’ve already discussed how difficult it is to maintain consistent memory state on multithreaded and parallel systems. Whenever you have multiple threads executing on a system, you need to consider the consequences of two threads trying to update a shared resource at the same time. There are several ways that computer systems share memory-resident variables. The most common way is to create stringent rules requiring all shared memory to be controlled by locking and unlocking functions. Any thread that wants to access global values must set a lock, make a change, and then unset the lock. Locks are difficult to reset if there are errors. Locking in distributed systems has been called one of the most difficult problems in all of computer science.

Erlang solves this problem by avoiding locking altogether. Erlang uses a different pattern, called the actor model, illustrated in figure 10.14. The actor model is similar to the way that people work together to solve problems. When people work together on tasks, our brains don’t share neurons or access shared memory. We work together by talking, chatting, or sending email, which are all forms of message passing. Erlang actors work the same way. When you program in Erlang, you don’t worry about setting locks on shared memory. You write actors that communicate with the rest of the world through message passing. Each actor has a queue of messages that it reads to perform work. When it needs to communicate with other actors, it sends them messages. Actors can also create new actors.

By using this actor model, Erlang programs work well on a single processor, and they can also scale their tasks over many processing nodes by sending messages to processors on remote nodes. This single messaging model provides many benefits, including high availability and the ability to recover gracefully from both network and hardware errors. Erlang also provides a large library of modules, called OTP, that makes distributed computing problems much easier.
Figure 10.14 Erlang uses an actor model, where each process has agents that can only read messages, write messages, and create new processes. When you use the Erlang actor model, your software can run on a single processor or thousands of servers without any change to your code.
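Erlang’s actors don’t map one-to-one onto Python, but the mailbox idea can be sketched with the standard queue and threading modules (purely illustrative; real Erlang actors also get supervision and transparent distribution):

    import queue
    import threading

    def actor(mailbox):
        # An actor shares no memory; it only reads its own message queue.
        while True:
            msg = mailbox.get()
            if msg is None:              # conventional shutdown signal
                break
            print('processed', msg)

    mailbox = queue.Queue()
    worker = threading.Thread(target=actor, args=(mailbox,))
    worker.start()
    mailbox.put('hello')                 # communicate only by message passing
    mailbox.put(None)
    worker.join()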
What is OTP?

OTP is a large collection of open source function modules used by Erlang applications. OTP originally stood for Open Telecom Platform, which tells you that Erlang was designed for running high-availability telephone switches that needed to run without interruption. Today OTP is used for many applications outside the telephone industry, so the letters OTP are used without reference to telecommunications.
Together, Erlang and the OTP modules provide the following features to application developers:
■ Isolation—An error in one part of the system will have minimal impact on other parts of the system. You won’t have errors like Java NullPointerExceptions (NPEs) that crash your JVM.
■ Redundancy and automatic failover (supervision)—If one component fails, another component can step in to replace its role in the system.
■ Failure detection—The system can quickly detect failures and take action when errors are detected. This includes advanced alerting and notification tools.
■ Fault identification—Erlang has tools to identify where faults occur, and integrated tools to look for the root causes of a fault.
■ Live software updates—Erlang has methods for updating software without shutting down a system. This is like the Java OSGi framework, which enables remote installation, startup, stopping, updating, and uninstalling of modules and functions without a reboot. This is a key feature missing from many NoSQL systems that need to run nonstop.
■ Redundant storage—Although not part of the standard OTP modules, Erlang can be configured to use additional modules to store data in multiple locations in case hard drives fail.

Because it’s based around the actor model and messaging, Erlang has inherent scale-out properties that come for free. As a result, these features don’t need to be added to your code. You get high availability and scalability just by using the Erlang infrastructure. Figure 10.15 shows how Erlang components fit together.

Erlang puts the actor model at the core of its own virtual machine. In order to have consistent and scalable properties, all Erlang libraries need to be built on this infrastructure.
Figure 10.15 The Erlang application runs on a series of services such as the Mnesia database, System Architecture Support Libraries (SASL) components, monitoring agents, and web servers. These services make calls to standardized OTP libraries that call the Erlang runtime system. Programs written in other languages don’t have the same support that Erlang applications have.
The downside is that because Erlang depends on the actor model, it becomes difficult to integrate imperative systems like Java libraries and still benefit from Erlang’s high availability and integrated scale-out features. Imperative functions that perform consistent transforms can be used, but object frameworks that need to manage external state will need careful wrapping. Because Erlang’s syntax is based on Prolog, it may seem unusual to people familiar with C and Java, so it takes some getting used to. Erlang is a proven way to get high availability and scale-out in your distributed applications. If your team can overcome the steep learning curve, there can be great benefits down the road.

10.6 Apply your knowledge

Sally is working on a business analytics dashboard project that assembles web pages composed of many small subviews. Most subviews have tables and charts that are generated from the previous week’s sales data. Each Sunday morning, the data is refreshed in the data warehouse. Ninety-five percent of the users will see the same tables and charts for the previous week’s sales, but some receive a customized view for their projects.

The database currently being used is an RDBMS server that’s overloaded and slow during peak daytime hours. During this time, reports can take more than 10 minutes to generate. Susan, who is Sally’s boss, is concerned about the performance issues. Susan tells Sally that one of the key goals of the project is to help people make better decisions through interactive monitoring and discovery. Susan lets Sally know in no uncertain terms that she feels users won’t take the time to run reports that take more than 5 minutes to produce a result. Sally gets two different proposals from different contractors. The architectures of both systems are shown in figure 10.16.
Figure 10.16 Two business intelligence dashboard architectures. The left panel shows that each table and chart will need to generate multiple SQL statements, slowing the database down. The right panel shows that all views that use the same data can simply create new transforms directly from the cache, which lowers load on the database and increases performance.
Proposal A uses Java programs that call a SQL database and generate the appropriate HTML and bitmapped images for the tables and charts of each dashboard every time a widget is viewed. In this scenario, two views of the same data, a bar chart in one view and an HTML table in another view, will rerun the exact same SQL code on the data warehouse. There’s no caching layer.

Proposal B has a functional programming REST layer that first generates an XML response from the SQL SELECT for each user interface widget and then caches this data. It then transforms the data in the cache into multiple views such as tables and charts. The system looks at the last-modified dates in the database to know whether any of the data in the cache should be regenerated. Proposal B also has tools that prepopulate the cache with frequent reports after the data in the warehouse changes. The system is 25% more expensive, but the vendor claims that their solution will be less expensive to operate due to lower demand on the database server.

The vendor behind proposal B claims the average dashboard widget generates a view in under 50 milliseconds if the data is in cache. The vendor also claims that if Sally uses SVG vector charts, rather than the larger bitmapped images, then the cached SVG charts will each occupy less than 30 KB in cache, and less than 3 KB if they’re compressed.

Sally looks at both proposals and selects proposal B, despite its higher initial cost. She also makes sure the application servers are upgraded from 16 GB to 32 GB of RAM to provide more memory for the caches. According to her calculations, this should be enough to store around 10 million compressed SVG charts in the RAM cache. Sally also runs a script on Sunday night that prepopulates the cache with the most common reports, so that when users come in on Monday morning, the most frequent reports are already available from the cache. There’s almost no load on the database server after the reports are in the cache. When the project rolls out, the average page load times, even with 10 charts per page, are well under 3 seconds. Susan is happy and gives Sally a bonus at the end of the year.

You should notice that in this example the additional REST caching layer isn’t dependent on using a NoSQL database. Since most NoSQL databases provide REST interfaces that return cache-friendly results, they provide additional ways your applications can use a cache to lower the number of calls to your database.

10.7 Summary

In this chapter, you’ve learned about functional programming and how it’s different from imperative programming. You learned how functional programming is the preferred method for distributing isolated transformations of data over distributed systems, and how systems are more scalable and reliable when functional programming is used.

Understanding the power of functional programming will help you in several ways. It’ll help you understand that state management systems are difficult to scale and that
to really benefit from horizontal scale-out, your team is going to have to make some paradigm shifts. Second, you should start to see that systems designed with both concurrency and high availability in mind tend to be easier to scale.

This doesn’t mean you need to write all your business applications in Erlang. Some companies are doing this, but they tend to be the people writing the NoSQL databases and high-availability messaging systems, not typical business applications. Algorithms such as MapReduce and languages such as HIVE and PIG share some of the same low-level concepts that you see in functional languages. You should be able to use these languages and still get many of the benefits of horizontal scalability and high availability that functional languages offer.

In our next chapter, we’ll leave the abstract world of cognitive styles and the theories of computational energy minimization and move on to a concrete subject: security. You’ll see how NoSQL systems can keep your data from being viewed or modified by unauthorized users.

10.8 Further reading

“Deadlock.” Wikipedia. http://mng.bz/64J7.
“Declarative programming.” Wikipedia. http://mng.bz/kCe3.
“Functional programming.” Wikipedia. http://mng.bz/T586.
“Idempotence.” Wikipedia. http://mng.bz/eN5G.
“Lambda calculus and programming languages.” Wikipedia. http://mng.bz/15BH.
MSDN. “Functional Programming vs. Imperative Programming.” http://mng.bz/8VtY.
“Multi-paradigm programming language.” Wikipedia. http://mng.bz/3HH2.
Piccolboni, Antonio. “Looking for a map reduce language.” Piccolblog, April 2011. http://mng.bz/q7wD.
“Referential transparency (computer science).” Wikipedia. http://mng.bz/85rr.
“Semaphore (programming).” Wikipedia. http://mng.bz/5IEx.
W3C. Example of forced sequential execution of a function. http://mng.bz/aPsR.
W3C. “XQuery Scripting Extension 1.0.” http://mng.bz/27rU.
Security: protecting data in your NoSQL systems
This chapter covers
■ NoSQL database security model
■ Security architecture
■ Dimensions of security
■ Application versus database-layer security trade-off analysis
Security is always excessive until it’s not enough. —Robbie Sinclair
If you’re using a NoSQL database to power a single application, strong security at the database level probably isn’t necessary. But as the NoSQL database becomes popular and is used by multiple projects, you’ll cross departmental trust boundaries and should consider adding database-level security. Organizations must comply with governmental regulations that require systems and applications to keep detailed audit records anytime someone reads or changes data. For example, US health care records, governed by the Health Insurance Portability and Accountability Act (HIPAA) and Health Information Technology for Economic and
Clinical Health Act (HITECH Act) regulations, require audits of anyone who has accessed personally identifiable patient data.

Many organizations need fine-grained control over what fields can be viewed by different classes of users. You might store employee salary information in your database, but want to restrict access to that information to the individual employee and a specific role in HR. Relational database vendors have spent decades building security rules into their databases to grant individuals and groups of users access to their tabular data at the column and row level. As we go through this chapter, you’ll see how NoSQL systems can provide enterprise-class security at scale.

NoSQL systems are, by and large, a new generation of databases that focus on scale-out issues first and use the application layer to implement security features. In this chapter, we’ll talk about the dimensions of database security you need in a project. We’ll also look at tools to help you determine whether security features should be included in the database.

Generally, RDBMSs don’t provide REST services, as they’re part of a multitier architecture with multiple security gates. NoSQL databases do provide REST interfaces and don’t have the same level of protection, so it’s important to carefully consider security features for these databases.

11.1 A security model for NoSQL databases

When you begin a database selection process, you start by sitting down with your business users to define the overall security requirements for the system. Using a concentric ring model, as shown in figure 11.1, we’ll start with some terminology to help you understand how to build a basic security model to protect your data. This model is ideal for getting started with a single application and a single data collection. It’s a simplified model that categorizes users based on their access type and role within an organization. Your job as a database architect is to select a NoSQL system that supports the security requirements of the organization. As you’ll see, the number of applications that you run within your database, your data classification, reporting tools, and the number of roles within your organization will dictate what security features your NoSQL database should have.

Figure 11.1 One of the best ways to visualize a database security system is to think of a series of concentric rings that act as walls around your data. The outermost ring consists of users who access your public website. Your company’s internal employees might consist of everyone on your company intranet who has already been validated by your company local area network. Within that group, there might be a subset of users to whom you’ve granted special access; for example, a login and password to a database account. Within your database you might have structures that grant specific users special privileges. A special class of users, database administrators, is granted all rights within the system.
Figure 11.2 The path from the internet to your data: Internet → Firewall → App server → Database, with reporting tools connecting directly to the database. If your database sits behind an application server, the firewall and application server can protect the database from unauthorized access. Reporting tools, however, run directly on the database, so the database may need its own security layer. If you have many applications, including reporting tools, you should consider some database-level security controls.
If your concentric ring model stays simple, most NoSQL databases will meet your needs, and security can be handled at the application level. But large organizations with complex security requirements, with hundreds of overlapping circles for dozens of roles and multiple versions of these maps, will find that only a few NoSQL systems satisfy these requirements. As we discussed in chapter 3, most RDBMSs have mature security systems with fine-grained permission control at the column and row level associated with the database. In addition, data warehouse OLAP tools allow you to add rules that protect individual cell-based reports. Figure 11.2 shows how reporting tools typically need direct access to the entire database. There are ways you can protect subsections of your data, and most reporting tools can be customized to access specific parts of your NoSQL database. Unfortunately, not all organizations have the ability to customize reporting tools or to limit the subsets of data that reporting tools can access. To function well, tools such as MapReduce also need to be aware of your security policy. As the use of the database grows within an organization, the need to access data crosses organizational trust boundaries. Eventually, the need for in-database security will transition from "not required" to "nice to have" and finally to "must have." Figure 11.3 is an illustration of this process. Next, we'll look at two methods that organizations can use to mitigate the need for in-database security models.
Figure 11.3 As the number of projects that use a database grows, the need for in-database security increases along the enterprise rollout timeline: from "not required" for a single project, through "nice to have" as role-based access control and multiple projects arrive, to "must have" for multidivision reporting, regulated data, and enterprise-wide use. The tipping point occurs when an organization needs integrated real-time reports for operational data in multiple collections.
11.1.1 Using services to mitigate the need for in-database security
One of the most time-consuming and expensive transitions organizations make is converting standalone applications running on siloed databases with their own security models to run in a centralized enterprise-wide database with a different security model. But if an organization splits its application into a series of reusable data services, it can avoid or delay this costly endeavor. By separating the data that each service provides from other database components, the service can continue to run on a standalone database.
Recall that in section 2.2 we talked about the concept of application layers. We compared the way functionality is distributed in an RDBMS to how a NoSQL system could address those same concerns by adding a layer of services. You can use this same approach to create service-based applications that run on separate lightweight NoSQL databases with independent in-database security models.
To continue this service-driven strategy, you might need to provide more than simple request-response services that take inputs and return outputs. This service-driven strategy works well for search or lookup services, but what if you have data that must be merged or joined with other large datasets? To meet these requirements, you must provide dumps of data as well as incremental updates to users for new and changing data. In some cases, these services can be used directly within ad hoc reporting tools.
How long will the service-oriented strategy work? It starts to fail when the data volume and synchronization complexity become too costly.
11.1.2 Using data warehouses and OLAP to mitigate the need for in-database security
The need to secure reporting tools is one of the primary reasons enterprises require security within the database, rather than at the application level. Let's look at why this is sometimes not a relevant requirement.
Let's say the data in your standalone NoSQL database is needed to generate ad hoc reports using a centralized data warehouse. The key to keeping NoSQL systems independent is to have a process that replicates the NoSQL database information into your data warehouse. As you may recall from chapter 3, we reviewed the process of how data can be extracted from operational systems and stored in fact and dimension tables within a data warehouse.
This moves the burden of providing security away from standalone performance-driven NoSQL services to the OLAP tools. OLAP tools have many options for protecting data, even at the cell level. Policies can be set up so that reports will only be generated if there's a minimum number of responses, so that an individual can't be identified or their private data viewed. For example, a report that shows the average math test score for third graders by race will only display if there are more than 10 students in a particular category.
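To make the minimum-response idea concrete, here's a minimal sketch of such a suppression rule in Python (the threshold and field names are invented for illustration, not taken from any particular OLAP tool):

MINIMUM_GROUP_SIZE = 10   # illustrative policy threshold

def averages_by_group(rows, group_key, value_key):
    # Average the value per group, suppressing any group so small
    # that an individual could be identified
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: sum(values) / len(values)
            for group, values in groups.items()
            if len(values) > MINIMUM_GROUP_SIZE}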
The process of moving data from NoSQL systems into an OLAP cube is similar to the process of moving from an RDBMS; the difference comes in the tools used. Instead of running overnight ETL jobs, your NoSQL database might use MapReduce processes to extract nightly data feeds on new and updated data. Document stores can run reports using XQuery or another query language. Graph stores can use SPARQL or graph query reporting tools that extract new operational data and load it into a central staging area that's then loaded into OLAP cube structures. Though these architectural changes might not be available to all organizations, they show that specialized data stores built for specific performance and scale-out needs can still be integrated into an overall enterprise architecture that satisfies both security and ad hoc reporting requirements.
Now that we've looked at ways to keep security at the application level, we'll summarize the benefits of each approach.
11.1.3 Summary of application versus database-layer security benefits
Each organization that builds a database can choose to put security at either the application or the database level. But like everything else, there are benefits and trade-offs that should be considered. As you review your organization's requirements, you'll be able to determine which method and benefits are the best fit.
Benefits of application-level security:
■ Faster database performance—Your database doesn't have to slow down to check whether a user has permission on a data collection or an item.
■ Lower disk usage—Your database doesn't have to store access-control lists or visibility rules within the database. In most cases, the disk space used by access-control lists is negligible, but some databases store access rules within each key, and for these systems the space used for storing security information must be taken into account.
■ Additional control using restricted APIs—Your database might not be configured to support multiple types of ad hoc reports that consume your CPU resources. Although NoSQL systems leverage many CPUs, you still might want to limit the reports that users can execute. By restricting access to reporting tools for some roles, these users can only run reports that you provide within an application.
Benefits of database-level security:
■ Consistency of security policy—You don't have to put individualized security policies within each application and limit the ability of ad hoc reporting tools.
■ Ability to perform ad hoc reporting—Often users don't know exactly what types of information they need. They create initial reports that show them only enough information to know they need to dig deeper. Putting security within the database allows users to perform their own ad hoc reporting and doesn't require your application to limit the number of reports that users can run.
■ Centralized audit—Organizations that run in heavily regulated industries such as health care need centralized audit. For these organizations, database-level security might be the only option.
Now that you know how a NoSQL system can fit into your enterprise, let's look at how you can qualify a NoSQL database by looking at its ability to handle authentication, authorization, audit, and encryption requirements. Taking a structured approach to comparing NoSQL databases against these components will increase your organization's confidence that a NoSQL database can satisfy security concerns.

11.2 Gathering your security requirements
Selecting the right NoSQL system will depend on how complex your security requirements are and how mature the security model is within your NoSQL database. Before embarking on a NoSQL pilot project, it's a good idea to spend some time understanding your organization's security requirements. We encourage our customers to group security requirements into four areas, as outlined in figure 11.4.
Figure 11.4 The four questions of a secure database:
■ Authentication—Are users and requests from the people they claim to be?
■ Authorization—Do users have read and/or write access to the appropriate data?
■ Audit—Can you track who read or updated data and when they did it?
■ Encryption—Can you convert data to a form that can't be used by unauthorized viewers?
You want to make sure that only the right people have access to the appropriate data in your database. You also want to track their access and transmit data securely in and out of the database.
The remainder of this chapter will focus on a review of authentication, authorization, audit, and encryption processes, followed by three case studies that apply a security policy to a NoSQL database. Let's begin by looking at the authentication process to see how it can be structured within your security requirements.
11.2.1 Authentication
Authenticating users is the first step in protecting your data. Authentication is the process of validating the identity of a specific individual or a service request. Figure 11.5 shows a typical authentication process. As you'll see, there are many ways to verify the identity of users, which is why many organizations opt to use an external service for the verification process.
The good news is that many modern databases are used for web-only access, which allows them to use web standards and protocols outside of the database to verify a user. With this model, only validated users will ever connect with the database, and the user's ID can then be placed directly in an HTTP header. From there the database can look up the groups and roles for each user from an internal or external source.
Figure 11.5 The typical steps to validate a web-based query before it executes within your database. The initial step checks for an identifier in the HTTP header. If the header is present, the authentication is done as shown on the left side of the figure; if the user ID isn't in the HTTP header, the user is asked to log in, and their ID and password are verified against a company-wide directory. The authorization phase, shown on the right side of the figure, looks up the list of roles associated with the user. If any role has rights to the data, the query executes and the results are returned to the user.
In large companies, a database may need to communicate with a centralized company service that validates a user's credentials. Generally, this service is designed to be used by all company databases and is called a single sign-on, or SSO, system. If the company doesn't provide an SSO interface, your database will have to validate users using a directory access API. The most common version is called the Lightweight Directory Access Protocol (LDAP).
There are six types of authentication: basic access, digest access, public key, multifactor, Kerberos, and Simple Authentication and Security Layer (SASL), as we'll see next.
BASIC ACCESS AUTHENTICATION
This is the simplest level of authenticating a database user. The login and password are put into the HTTP header in clear, plain text. Over a public network, this type of authentication should always be used with Secure Sockets Layer (SSL) or Transport Layer Security (TLS); on its own it can be adequate for an internal test system. It doesn't require web browser cookies or additional handshakes.
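As a sketch of what this looks like on the wire, the following Python fragment builds the header (the user name and password are made up):

import base64

def basic_auth_header(user, password):
    # RFC 7617: the credentials are "user:password", Base64-encoded
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

print(basic_auth_header("dan", "s3cret"))
# {'Authorization': 'Basic ZGFuOnMzY3JldA=='}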
DIGEST ACCESS AUTHENTICATION
Digest access authentication is more complicated and requires a few additional handshakes between the client and the database. It can be used over an unencrypted non-SSL/TLS connection in low-stakes situations. Because digest authentication uses a standard MD5 hash function, it's not considered a highly secure authentication method unless other steps have been implemented. Using digest access authentication over SSL/TLS is a good way to increase password security.
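A minimal sketch of the digest computation from RFC 2617, omitting the optional qop and cnonce fields for brevity (all values below are invented):

import hashlib

def md5_hex(s):
    return hashlib.md5(s.encode("utf-8")).hexdigest()

def digest_response(user, realm, password, method, uri, nonce):
    # response = MD5(HA1:nonce:HA2), so the password is never sent in the clear
    ha1 = md5_hex(f"{user}:{realm}:{password}")
    ha2 = md5_hex(f"{method}:{uri}")
    return md5_hex(f"{ha1}:{nonce}:{ha2}")

# The server sends a fresh nonce in its challenge; the client answers with
# this response value instead of the password itself
print(digest_response("dan", "db", "s3cret", "GET", "/docs/1", "abc123"))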
PUBLIC KEY AUTHENTICATION
Public key authentication uses what's known as asymmetric cryptography, where a user has a pair of mutually dependent keys. What's encrypted with one of these keys can only be decrypted with the other. Typically, a user makes one of these keys public but keeps the other key completely private (never giving it out to anyone). For authentication, the user encrypts a small piece of data with their private key, and the receiver verifies it using the user's public key. This is the same type of authentication used with the secure shell (SSH) command. The drawback of this method is that if a hacker breaks into your local computer and takes your private key, they could gain access to the database. Your database is only as secure as the private keys.
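A minimal challenge-response sketch using the third-party Python cryptography package (the challenge text is invented; a real system would use a fresh random nonce per login):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()   # shared with the database in advance

challenge = b"random-nonce-from-the-server"
signature = private_key.sign(challenge, padding.PKCS1v15(), hashes.SHA256())

# The database verifies the signature with the public key; verify() raises
# InvalidSignature if the challenge wasn't signed by the matching private key
public_key.verify(signature, challenge, padding.PKCS1v15(), hashes.SHA256())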
MULTIFACTOR AUTHENTICATION
Multifactor authentication relies on two or more forms of authentication. For example, one factor might be something you have, such as a smart card, as well as something you know, like a PIN. To gain access you must have both forms. One of the most common methods is a secure hardware token that displays a new six-digit number every 30 seconds. The sequence of codes is synced to the database using accurate clocks that are resynchronized each time they're used. The user types in their password and the number from the token to gain access to the database. If either the password or the token code is incorrect, access is denied.
As an additional security measure, you can restrict database access to a range of IP addresses. The problem with this method is that IP address assignments can change frequently for remote users, and IP addresses can be spoofed using sophisticated software. These types of filters are usually placed within firewalls that sit in front of your database. Most cloud hosting services allow these rules to be updated via a web page.
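The six-digit token scheme described above typically follows the time-based one-time password (TOTP) algorithm of RFC 6238; here's a minimal sketch, assuming a shared secret provisioned to both the token and the server:

import hashlib, hmac, struct, time

def totp(secret, step=30, digits=6):
    # HMAC the number of 30-second intervals since the Unix epoch (RFC 6238)
    counter = int(time.time() // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                              # dynamic truncation
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# Token and server derive the same code as long as their clocks agree
print(totp(b"shared-secret-key"))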
KERBEROS PROTOCOL AUTHENTICATION If you need to communicate in a secure way with other computers over an insecure network, the Kerberos system should be considered. Kerberos uses cryptography and trusted third-party services to authenticate a user’s request. Once a trust network has been set up, your database must forward the information to a server to validate the user’s credentials. This allows a central authority to control the access policy for each session.
SIMPLE AUTHENTICATION AND SECURITY LAYER
Simple Authentication and Security Layer (SASL) is a standardized framework for authentication in communication protocols. SASL defines a series of challenges and responses that can be used by any NoSQL database to authenticate incoming network requests. SASL decouples the intent of validating a network request from the underlying mechanism for validating the request. Many NoSQL systems simply define a SASL layer to indicate that at this layer a valid request has entered the database.
11.2.2 Authorization
Once you've verified the identity of your users (or their agents), you're ready to grant them access to some or all of the database. The authorization process is shown in figure 11.5. Unlike authentication, which occurs once per session or query request, authorization is a more complex process, since it involves applying a complex, enterprise-wide, access-control policy to many data items. If not implemented carefully, authorization can negatively impact the performance of large queries. A second requirement to consider is the issue of security granularity and its impact on performance, as illustrated in figure 11.6.
As you leave the authentication step and move toward the authorization phase of a query, you'll usually have an identifier that indicates which user is making the request. You can use this identifier to look up information about each user; for example, what department, groups, and projects they're associated with and which roles they've been assigned. Inside NoSQL databases, you can think of this information as an individual's smart-card badge and your database as a series of rooms in a building with security checkpoints at each door.
Most user interfaces use folder icons that contain other folders. Here, each folder corresponds to a directory (or collection) and document within the database. In other systems, this same folder concept uses buckets to describe a collection of documents. Figure 11.7 is an example of this folder/collection concept. You use the information about a user and their groups to determine whether they have access to a folder and what actions they can perform. This implies that if you want to read a file, you'll need read access to the directories that contain the file as well as all the ancestor folders up to the root folder of the database. As you can see, the deeper you go into a directory, the more checks you need to perform, so the checks need to be fast if you have many folders within folders.
In most large database systems, the authorization process will first look up additional information about each user. The most typical information might be what organization you work for in the company, what projects or groups you're associated with, and what roles you've been assigned. You can use the user identifier directly against your database, but keeping track of each user and their permissions gets complicated quickly, even for a small number of users and records. Let's start with the simplest approach, the UNIX permission model.
Figure 11.6 Before you create a NoSQL application, you must consider the granularity of security your application needs. A coarse grain allows access control at the entire database or collection level; finer-grained controls allow you to control access to a collection, an individual document, or an element within a document (database → collection → document → element). Coarse-grained access control has little performance impact; fine-grained access control may have a large performance impact on your system.
Figure 11.7 A document store is like a collection of folders, each with its own lock (database root collection → department collection → application collection → document → element). To get to a specific element within a document, you need access to all the containers that hold the element, including the document and the ancestor folders.
THE UNIX PERMISSION MODEL
First let's look at a simple model for protecting resources: the one used in UNIX, POSIX, Hadoop, and the eXist native XML database. We use the term resource in this context to mean either a directory or a file. In the UNIX model, when you create any resource, you associate a user and group as the owner of that resource. You then associate three permission bits with the owner, three bits with the group, and three bits with everyone outside the group. This model is shown in figure 11.8.
One positive aspect of the UNIX filesystem permission model is that it's efficient, since you only need to calculate the impact of nine bits on your operations. But the problem with this model is that it doesn't scale, because each resource (a folder or file) is typically owned by one group, and granting folder rights to multiple groups isn't permitted. This prevents organizations from applying detailed access-control policies at the database level.
Figure 11.8 UNIX, POSIX, Hadoop, and eXist-db all share a simple approach to security that uses only nine bits per resource: read (R), write (W), and execute (X) permissions for the owner, the owner's group, and everyone else. For example, the bits 110 110 100 give the owner and group read and write permission, while for anyone outside the group read=true, write=false, and execute=false. Checks are simple and fast, and won't degrade performance even when large queries are done on many collections of many documents.
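A minimal sketch of the nine-bit check itself (the IDs below are invented; a real filesystem reads the mode and owner from the resource's metadata):

import stat

def can_read(mode, uid, gid, file_uid, file_gid):
    # Pick the owner, group, or "other" read bit, in that order
    if uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if gid == file_gid:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)

# 0o664 = 110 110 100: owner and group may read and write, others read only
print(can_read(0o664, uid=1001, gid=50, file_uid=42, file_gid=50))  # True (group)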
Next, we'll look at an alternative model, role-based access control, which is scalable for large organizations with many departments, groups, projects, and roles.
USING ROLES TO CALCULATE ACCESS CONTROL
An alternative authorization model, one that associates permissions with roles rather than with individual users, has turned out to be scalable and is the predominant choice for large organizations: role-based access control (RBAC), shown in a simplified form in figure 11.9.
Using a role-based access-control model requires each organization to define a set of roles associated with each set of data. Typically, applications are used to control collections of data, so each application may have a set of roles that can be configured. Once roles have been identified, each user is assigned one or more roles. The application will then look up all the roles for each user and apply them against a set of permissions at the application level to determine whether a user has access. It's clear that most large organizations can't manage a detailed access-control policy using a simple nine-bit structure like UNIX's. One of the most difficult questions for an application architect is whether access can be controlled at the application level, instead of the database level.
Note that some applications need to support more than simple read and write access control. For example, content management systems can restrict who can update, delete, search, copy, or include various document components when new documents are generated. These fine-grained actions on data collections are generally controlled within the application level.
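A minimal sketch of an application-level RBAC check (all role, user, and resource names here are invented for illustration):

# role -> set of (capability, resource) permissions
ROLE_PERMISSIONS = {
    "author":    {("read", "drafts"), ("write", "drafts")},
    "editor":    {("read", "drafts"), ("write", "drafts"), ("read", "finals")},
    "publisher": {("read", "finals"), ("write", "finals")},
}

# user -> set of roles; users are never bound to resources directly
USER_ROLES = {"ann": {"editor"}, "dan": {"author", "publisher"}}

def has_access(user, capability, resource):
    return any((capability, resource) in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(has_access("ann", "write", "drafts"))  # True, via the editor role
print(has_access("ann", "write", "finals"))  # False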
11.2.3 Audit and logging Knowing who accessed or updated what records and when they took these actions is the job of the auditing component of your database. Good auditing systems allow for a detailed reconstruction and examination of a sequence of events if there are security breaches or system failures. A key component of auditing is to make sure that the
Figure 11.9 A simplified role-based access control (RBAC) model: User → Role → Permission (read, write) → Resource (collection, document). Each user has one or more roles in the database, roles are associated with one or more permissions, and resources are associated with a permission for each role. The roles are bound to each resource through a permission code such as read, write, update, or delete. The RBAC model makes a security policy more maintainable, because users aren't tied to particular resources.
right level of detail is logged before you need it. It's always a problem to say "Yes, we should've been logging these events" after the events have occurred.
Auditing can be thought of as event logging and analysis. Most programming languages have functions that add events to a log file, so almost all custom applications can be configured to add data to event logs at the application layer. There are some exceptions to this rule; for example, when you're using third-party applications whose source code can't be modified. In these situations, database triggers can be used to add logging information.
Most mature databases come with an extensive set of auditing reports that show detailed activity of security-related transactions, such as
■ Web page requests—What web pages were requested, by what users (or IP addresses), and when they were accessed. Additional data, such as the response time, can also be added to the log files. This logging is typically done by different web servers and merged into central access logs.
■ Last logins—The most recent logins to the database, sorted with the latest at the top of the report.
■ Last updates—The most recent updates to the database. These reports can have options for sorting by date or by collections modified.
■ Failed login attempts—The prior login attempts to the database that failed.
■ Password reset requests—A list of the most recent password reset requests.
■ Import/upload activity—A list of the most recent database imports or bulk loads.
■ Delete activity—A list of the most recent records removed from the database.
■ Search—A list of the most recent searches performed on the database. These reports can also include the most frequent queries over given periods of time.
■ Backup activity—When data was backed up or restored from backup.
In addition to these standard audit reports, there may be specialized reports that depend on the security model you implement. For example, if you're using role-based access control, you might want a detailed accounting of which user was assigned a role and when.
Applications might also require that special audit information be added to log files, such as actions that have a high impact on the organization. This information can be added at the application level, and if you have control of all applications, this method is appropriate.
There's some additional logging that should be done at the database layer. In RDBMSs, triggers can be written to log data when an insert, update, or delete operation occurs. In NoSQL databases that use collections, triggers can be added to collections as well. Trigger-based logging is ideal when there are many applications that can change your data.
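A minimal application-layer sketch of structured audit logging (the field names and file location are invented; a real deployment would also ship these records to a central log store):

import json, logging, time

audit = logging.getLogger("audit")
logging.basicConfig(filename="audit.log", level=logging.INFO,
                    format="%(message)s")

def log_event(user, action, collection, doc_id):
    # One structured record per security-relevant event, written before
    # the operation is acknowledged so that failures are also captured
    audit.info(json.dumps({
        "ts": time.time(), "user": user, "action": action,
        "collection": collection, "doc": doc_id,
    }))

log_event("ann", "update", "salaries", "emp-1001")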
11.2.4 Encryption and digital signatures The final concern of NoSQL security is how a database encrypts and digitally signs documents to verify they haven’t been modified. These processes can be done at the
application level as well as the database level; NoSQL databases aren't required to have built-in libraries to perform these functions. Yet for some applications, knowing the database hasn't been modified by unauthorized users can be critical for regulatory compliance.
Encryption processes use tools similar to the hash functions we described in chapter 2. The difference is that private and public keys are used in combination with certificates to encrypt and decrypt documents. Your job as an architect is to determine whether encryption should be done at the application or database layer. By adding these functions in the database layer, you have centralized control over the methods that store and access the data. Adding the functions at the application layer requires each application to control the quality of the cryptographic library.
The key issue with any encryption tool isn't the algorithm itself, but the process of ensuring the cryptographic library hasn't been tampered with by an unauthorized party. Some NoSQL databases, especially those used in high-stakes security projects, are required to have their cryptographic algorithms certified by an independent auditor. The US National Institute of Standards and Technology (NIST) has published Federal Information Processing Standard (FIPS) Publication 140-2, which specifies multiple levels of certification for cryptographic libraries. If your database holds data that must be securely encrypted before it's transmitted, then these standards might be required for your database.
XML SIGNATURES
The World Wide Web Consortium has defined a standard method of digitally signing any part of an XML file using an XML Signature. XML Signatures allow you to verify the authenticity of any XML document, any part of an XML document, or any resource represented by a URI, with a cryptographic hash function. An XML Signature allows you to specify exactly what part of a large XML document should be signed and how that digital signature is included in the XML document.
One of the most challenging problems in digitally signing an entire XML document is that there are multiple ways a document might be represented in a single string. In order for digital signatures to be verified, both the sender and receiver must agree on a consistent way to represent XML (a process known as canonicalization). For example, you might agree that elements shouldn't be on separate lines or have indented spaces, attributes should be in alphabetical order, and characters should be represented using UTF-8 encoding. Digital signatures can also avoid these problems by signing only the data within XML elements, not the element and attribute names that contain the values.
As an example, US federal law requires that all transmissions of a doctor's prescriptions of controlled substances be digitally signed before they're transmitted between computer systems. But the entire prescription doesn't need to be verified. Only the critical elements, such as the doctor prescribing the drug, the drug name, and the drug quantity, need to be digitally signed and verified. An example of this is shown in listing 11.1, which shows how the rules for extracting part of the text within a document are specified before it's signed. In the example, the prescriber (Drug
Enforcement Agency ID Number), drug description, quantity, and date are digitally signed, but other parts of the document aren’t included in the final string to be signed.
Listing 11.1 Adding a digital signature to a document
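The signature itself follows the W3C XML Signature structure; a minimal skeleton is sketched below (the prescription fields are invented for illustration, while the Signature element names and algorithm URIs come from the xmldsig standard):

<Prescription>
  <Prescriber dea="AB1234567"/>          <!-- illustrative fields -->
  <Drug name="..." quantity="30"/>
  <Date>2013-06-01</Date>
  <Signature xmlns="http://www.w3.org/2000/09/xmldsig#">
    <SignedInfo>
      <CanonicalizationMethod
          Algorithm="http://www.w3.org/TR/2001/REC-xml-c14n-20010315"/>
      <SignatureMethod
          Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/>
      <!-- The Reference and Transforms select what is actually signed -->
      <Reference URI="">
        <Transforms>
          <Transform Algorithm=
              "http://www.w3.org/2000/09/xmldsig#enveloped-signature"/>
        </Transforms>
        <DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
        <DigestValue>...</DigestValue>
      </Reference>
    </SignedInfo>
    <SignatureValue>...</SignatureValue>
  </Signature>
</Prescription>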
11.2.5 Protecting public websites from denial of service and injection attacks
Databases with public interfaces are vulnerable to two special types of threats: denial of service (DoS) and injection attacks. A DoS attack occurs when a malicious party attempts to shut down your servers by repeatedly sending queries to your website with the intention of overwhelming it and preventing access by valid users. The best way to prevent DoS attacks is by looking for repeated rapid requests from the same IP address.
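A minimal sketch of such a check is a sliding-window counter per source IP (the window and threshold values below are invented):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10    # illustrative thresholds
MAX_REQUESTS = 100

recent = defaultdict(deque)   # source IP -> timestamps of recent requests

def allow(ip):
    now = time.time()
    q = recent[ip]
    while q and now - q[0] > WINDOW_SECONDS:   # drop requests outside the window
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False                           # throttle this source
    q.append(now)
    return True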
An injection attack occurs when a malicious user adds code to an input field in a web form to directly control your database. For example, a SQL injection attack might run a SQL query within a search field string to get a listing of all users of the system. Protecting NoSQL systems is no different: all input fields that have public access, such as search forms, should be filtered to remove invalid query strings before processing. Each public interface must have filters that remove any invalid queries to your system.
Preventing these types of attacks isn't usually the job of your database; this kind of task is the responsibility of your front-end application or firewall. But the libraries and code samples for each database should have examples that show you how to prevent these types of attacks.
This concludes our discussion on security requirements. Now we'll look at three case studies: security in practice for one key-value store, one column family store, and one document store.

11.3 Case study: access controls on a key-value store—Amazon S3
Amazon Simple Storage Service (S3) is a web-based service that lets you store your data in the cloud. From time to time our customers ask, "Aren't you worried about security in the cloud? How can you make sure your data is secure?"
Authentication mechanisms are important to make sure your data is secure from unwanted access. Industries such as health insurance and government, and regulations like HIPAA, require you to keep your customers' data private and secure, or face repercussions. In S3, data such as images, files, or documents (known as objects) is securely stored in buckets, and only bucket/object owners are allowed access.
In order to access an object, you must use the Amazon API with the appropriate call and credentials to retrieve it. To access an object, you first build a signature string with the date, GET request, bucket name, and object name (see the following listing).
Listing 11.2 XQuery code for creating a string to sign using your AWS credentials
let $nl := "&#10;"  (: the newline character :)
let $date := aws-utils:http-date()
let $string-to-sign := concat('GET', $nl, $nl, $nl, $nl,
    'x-amz-date:', $date, $nl,
    '/', $bucket, '/', $object)

Once the string to sign is built, it's signed with your S3 secret key using an HMAC, as shown in the following listing.
Listing 11.3 XQuery for signing a string with AWS secret key and hmac()
let $signature := crypto:hmac($string-to-sign, $s3-secret-key, "SHA-1", "base64")
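For comparison, the same HMAC-SHA1 signing step using only Python's standard library (the function name is ours, not part of any AWS SDK):

import base64, hashlib, hmac

def sign(string_to_sign, s3_secret_key):
    # HMAC-SHA1 of the string to sign, keyed by the secret key, Base64-encoded
    mac = hmac.new(s3_secret_key.encode("utf-8"),
                   string_to_sign.encode("utf-8"), hashlib.sha1)
    return base64.b64encode(mac.digest()).decode("ascii")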
Finally, the headers are constructed and the call to retrieve the object is made. As you can see in listing 11.4, the signature is combined in the Authorization header with your S3 access key.
Listing 11.4 XQuery for REST HTTP GET with the AWS security credentials
let $request :=
  (: sketch using the EXPath HTTP Client vocabulary; the HTTP module and
     element names vary by XQuery engine, but the Authorization header
     format is "AWS <access-key>:<signature>" :)
  <http:request method="get"
      href="http://{$bucket}.s3.amazonaws.com/{$object}">
    <http:header name="x-amz-date" value="{$date}"/>
    <http:header name="Authorization"
        value="AWS {$s3-access-key}:{$signature}"/>
  </http:request>
return http:send-request($request)
Amazon S3 controls access to buckets and objects through three mechanisms:
■ Identity and Access Management (IAM)
■ Access-control lists (ACLs)
■ Bucket policies

11.3.1 Identity and Access Management (IAM)
IAM systems allow you to have multiple users within an AWS account, assign credentials to each user, and manage their permissions. Generally, IAM systems are found in organizations where there's a desire to grant multiple employees access to a single AWS account. To do this, permissions are managed using a set of IAM policies that are attached to specific users. For example, you can allow a user dan to have permission to add and delete images from your web-site-images bucket.
11.3.2 Access-control lists (ACL) Access-control lists can be used to grant access to either buckets or individual objects. Like IAM systems, they only grant permissions and are unable to deny or restrict at an account level. In other words, you can only grant other AWS accounts access to your Amazon S3 resources. Each access-control list can have up to 100 grants, which can be either individual account holders or one of Amazon’s predefined groups:
■ Authenticated Users group—Consists of all AWS accounts
■ All Users group—Consists of anyone; requests can be signed or unsigned
When using ACLs, a grantee can be an AWS account or one of the predefined Amazon S3 groups. But the grantee can't be an IAM user.
11.3.3 Bucket policies
Bucket policies are the most flexible method of security control, because they can grant as well as deny access to some or all objects within a specified bucket, at both the account and user levels.
Though you can use bucket policies in conjunction with IAM policies, bucket policies can be used on their own to achieve the same result. For example, figure 11.10 demonstrates how two users (Ann and Dan) have been granted authority to put objects into a bucket called bucket_kma.
Perhaps you're wondering when to use a bucket policy versus an ACL. The answer is that it depends on what you're trying to accomplish. Access-control lists provide a coarse-grained approach to granting access to your buckets and objects; bucket policies provide a finer-grained approach. There are times when using both bucket policies and ACLs makes sense, such as when
■ You want to grant a wide variety of permissions to objects but you only have a bucket policy.
■ Your bucket policy is greater than 20 KB in size. The maximum size for a bucket policy is 20 KB, so if you have a large number of objects and users, you can grant additional permissions using an ACL.
There are a few things to keep in mind when combining bucket policies and ACLs:
■ If you use ACLs with bucket policies, S3 will use both to determine whether the account has permission to access an object. If an account has access to an object through an ACL, it'll be able to access the requested bucket/object.
Figure 11.10 You can use a bucket policy to grant users access to your AWS S3 objects without using IAM policies. On the left, the IAM policy allows the PutObject action on the resource aws:s3:::bucket_kma/* in an AWS account, and then gives the users Ann and Dan permission to access that account/action. On the right, the equivalent bucket policy is attached to bucket_kma and, like the IAM policy, gives Ann and Dan permission to PutObject on the bucket.
■ When using a bucket policy, a deny statement overrides a grant statement and will restrict an account's ability to access a bucket/object, essentially making the bucket policy a superset of the permissions you grant in an ACL.
As you can see, S3 offers multiple mechanisms for keeping your data safe in the cloud. With that in mind, is it possible to grant fine-grained security on your tabular data without impacting performance? Let's look at how Apache Accumulo uses roles to control user access to the confidential information stored in your database.

11.4 Case study: using key visibility with Apache Accumulo
The Apache Accumulo project is similar to other column family architectures, but with a twist. Accumulo has an innovative way of granting fine-grained security to tabular data that's flexible and doesn't have a negative performance impact for large datasets.
Accumulo's method of protecting fine-grained data in the database layer is to place role, permission, or access-list information directly in the key of a key-value store. Since organizations can have many different models of access control, this could make the size of keys unwieldy and take considerable disk space. Instead of putting multiple security models into multiple fields of a key, Accumulo adds a single general field, called Visibility, that's evaluated for each query and returns true or false each time the key is accessed. This is illustrated in figure 11.11.
The format of the Visibility field isn't restricted to a single user, group, or role. Because you can place any Boolean expression in this field, you can use a variety of different access-control mechanisms.
Every time you access Accumulo, your query context has a set of Boolean authorization tokens that are associated with your session. For example, your username, role, and project might be set as authorization tokens. The visibility of each key-value pair is calculated by evaluating the Boolean AND (&) and OR (|) combinations of authorization strings that must return true for a user to view the value of a key-value pair. For example, if you add the expression (admin|system)&audit, you'd need either the admin OR the system authorization, AND the audit authorization, to be able to read this record.
Although putting the actual logic of evaluation within the key itself is unusual, it allows many different security models to be implemented within the same database.
Key structure: Row ID, Column (Family, Qualifier, Visibility), Timestamp → Value
Figure 11.11 Apache Accumulo adds a Visibility field to the column portion of each key. The Visibility field contains a Boolean expression composed of authorization tokens that each return true or false. When a user tries to access a key, the Visibility expression is evaluated, and if it returns true, the value is returned. By restricting the Visibility expression to be only Boolean values, there’s a minimal performance penalty even for large datasets.
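A minimal sketch of how such a visibility expression can be evaluated against a user's authorization tokens (this illustrates the idea, not Accumulo's actual implementation; real Accumulo also requires parentheses whenever & and | are mixed):

import re

def visible(expression, authorizations):
    # Tokens are names, &, |, and parentheses; & binds tighter than |
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            value = parse_and() or value
        return value

    def parse_and():
        nonlocal pos
        value = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            value = parse_term() and value
        return value

    def parse_term():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            value = parse_or()
            pos += 1            # skip the closing ")"
            return value
        token = tokens[pos]
        pos += 1
        return token in authorizations   # a bare token is true if held

    return parse_or()

print(visible("(admin|system)&audit", {"system", "audit"}))  # True
print(visible("(admin|system)&audit", {"system"}))           # False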
As long as the logic is restricted to evaluating simple Boolean expressions, there's a minimal impact on performance.
In our last case study, we'll look at how to use MarkLogic's role-based access-control security model for secure publishing.

11.5 Case study: using MarkLogic's RBAC model in secure publishing
In this case study, we'll look at how role-based access control (RBAC) can protect highly sensitive documents within a large organization and still allow for fine-grained status reporting. Our example will allow a distributed team of authors, editors, and reviewers to create and manage confidential documents, yet prevent unauthorized users from accessing the text of the confidential documents.
Let's assume you're a book publisher and you have a contract to create a new book on a hot new NoSQL database that's being launched in a few months. The problem is that the company developing the database wants assurances that only a small list of people will be able to access the book's content during its development. Your contract specifically states that no employees other than the few listed in the contract can have access to the text of the documents. Your payments are contingent on the contents of the book staying private. The contract only allows high-level reports of book metrics to be viewed outside the small authoring team.
Your publishing system has four roles defined: Author, Editor, Publisher, and Reviewer. Authors and Editors can change content, but only users with the Publisher role can make a document available to reviewers. Reviewers have collections configured so that they can add comments in a comment log, but they can't change the main document content.
11.5.1 Using the MarkLogic RBAC security model to protect documents
MarkLogic has built-in, database-layer support for role-based access control, as described in "Using roles to calculate access control" in section 11.2.2. The MarkLogic security model uses many of these same RBAC concepts and implements some enhanced functionality. MarkLogic applies its security policy at the collection and document levels and allows users to create functions that have elevated permissions. This feature enables element-level control of selected documents without compromising performance. This case study will first review the MarkLogic security model and then show how it can be applied in a secure publishing example. Finally, we'll review the business benefits of this model. A logical diagram of how the MarkLogic security model works is shown in figure 11.12. Here are some of the interesting points of the MarkLogic RBAC security model:
■ Role hierarchy—Roles are configured in a hierarchy, so a lower-level role will automatically inherit the permissions of any parent roles.
Figure 11.12 diagram: User → Role → Permission → Document/Collection. Users and roles both have default permissions for documents and collections. Multiple roles can be associated with special privileges on functions, queries, and URIs: amplified permissions (AMPs), execute privileges, and URI privileges. Roles exist in a hierarchy, and lower roles inherit permissions from upper roles. Each permission record, stored with a document or collection, associates a single capability (read, write, update, or execute) with a single role. Each document and collection is associated with a URI and permissions.
Figure 11.12 The MarkLogic security model is based on the role-based access control (RBAC) model with extensions to allow elevated permissions for executing specific functions and queries. Documents and collections each have a set of permissions that consist of role-capability pairs.
■ Default permissions—Users and roles can each be configured to provide default permissions for both documents and collections.
■ Elevated security functions—Functions can run at an elevated security level. The elevated security only applies within the context of the function; when the function finishes, the security level is lowered again.
■ Compartments—An additional layer of security beyond RBAC is available with an optional license. Compartmentalized security allows complex Boolean AND/OR business rules to be associated with a security policy.
11.5.2 Using MarkLogic in secure publishing
To enforce the contract rules, create a new role for the project called secret-nosql-book using the web-based role administration tools, and associate the new role with the collection that contains all of the book's documents, including text, images, and reviewer feedback. Then configure that collection so that the secret-nosql-book role has read and update access to it, and remove all read access from the collection permissions for people outside this group. Make sure that all new documents and subcollections created within this collection use the correct default permission settings. Finally, add the role of secret-nosql-book to only the users assigned to the project.
The project also needs to provide a book progress report that an external project manager can run on demand. This report counts the total number of chapters, sections, words, and figures in the book to estimate chapter-by-chapter percentage completion status. To implement this, give the report elevated rights to access the content using functions that use amplified permission (AMP) settings. External project
managers don’t need access to the content of the book, since the functions that use the amplified permissions will only gather metrics and not return any text within the book. Note that in this example, application-level security wouldn’t work. If you used application-level security, anyone who has access to the reporting tools would be able to run queries on your confidential documents.
11.5.3 Benefits of the MarkLogic security model
The key benefit of the RBAC model combined with elevated security functions is that access control can be driven from a consistent central control point and can't be circumvented by reporting tools. Element-level reports can still be executed on secure collections for specialized tasks. This implementation allows flexibility with minimal performance impact—something that's critical for large document collections.
MarkLogic has many customers in US federal systems and enterprise publishing. These industries have stringent requirements for database security and auditing. As a result, MarkLogic has one of the most robust, yet flexible, security models of any NoSQL database.
The MarkLogic security model may seem complex at first. But once you understand how roles drive security policy, you'll find you can keep documents secure and still allow reporting tools full access to the database layer. Experienced MarkLogic developers feel that the security model should be designed at an early stage of a project to ensure that the correct access controls are granted to the right users. Careful definition of roles within each project is required to ensure that security policies are correctly enforced. Once the semantics of the roles have been clearly defined, implementing the policy is a straightforward process.
In addition to the RBAC security model supported by MarkLogic, there are also specialized versions of MarkLogic that allow the creation of collections of highly sensitive containers. These containers have additional security features that allow for the storage and auditing of classified documents.
MarkLogic also integrates auditing reports with its security model. Auditors can view reports every time elevated security functions are executed by a user or roles are changed. A detailed history of every role change can be generated for each project. These reports show how security policy has been enforced, which users had access to collection content, and when.
The RBAC security model isn't the only feature that MarkLogic has implemented to meet the security demands of its customers. Other security-related features include tamper resistance of cryptography and network libraries, extensive auditing tools and reports, and third-party auditing of security libraries. Each of these features becomes more important as your NoSQL database is used by a larger community of security-conscious users.

11.6 Summary
In this chapter, you've learned that, for simple applications, NoSQL databases have minimal security requirements. As the complexity of your applications increases, your
security requirements grow until you reach the need for enterprise-grade security within your NoSQL database.
You've also learned that by using a service-oriented architecture, you can minimize the need for in-database security. Service-driven NoSQL databases have lower in-database security requirements and provide specialized data services that can be reused at the application level.
Early implementations of NoSQL databases focused on new architectures that had strong scale-out performance. Security wasn't a primary concern. In the case studies, we've shown that there are now several NoSQL databases that have flexible security models for key-value stores, column family stores, and document stores. We're confident that other NoSQL databases will include additional security features as they mature.
So far, we've covered many concepts. We've given you visuals, examples, and case studies to enhance learning. In our last chapter, we'll pull it all together to see how these concepts can be applied in a database selection process.

11.7 Further reading
Apache Accumulo. http://accumulo.apache.org/.
———"Apache Accumulo User Manual: Security." http://mng.bz/o4s7.
———"Apache Accumulo Visibility, Authorizations, and Permissions Example." http://mng.bz/vP4N.
AWS. "Amazon Simple Storage Service (S3) Documentation." http://aws.amazon.com/documentation/s3/.
"Discretionary access control." Wikipedia. http://mng.bz/YB0o.
GitHub. ml-rbac-example. https://github.com/ableasdale/ml-rbac-example.
"Health Information Technology for Economic and Clinical Health Act." Wikipedia. http://mng.bz/R8f6.
"Lightweight Directory Access Protocol." Wikipedia. http://mng.bz/3l24.
MarkLogic. "Common Criteria Evaluation and Validation Scheme Validation Report." July 2010. http://mng.bz/Y73g.
———"Security Entities." Administrator's Guide. http://docs.marklogic.com/guide/admin/security.
———"MarkLogic Server Earns Common Criteria Security Certification." August 2010. http://mng.bz/ngJI.
National Council for Prescription Drug Programs Forum. http://www.ncpdp.org/standards.aspx.
"Network File System." NFSv4. Wikipedia. http://mng.bz/11p9.
NIAP CCEVS. http://www.niap-ccevs.org/st/vid10306/.
"Role-based access control." Wikipedia. http://mng.bz/idZ7.
"The Health Insurance Portability and Accountability Act." US Department of Health & Human Services. http://www.hhs.gov/ocr/privacy/.
W3C. "XML Signature Syntax and Processing." June 2008. http://www.w3.org/TR/xmldsig-core.
Selecting the right NoSQL solution
This chapter covers
■ Team dynamics in database architecture selection
■ The architectural trade-off process
■ Analysis through decomposition
■ Communicating results
■ Quality trees
■ The Goldilocks pilot
Marry your architecture in haste, and you can repent in leisure. —Barry Boehm (from Evaluating Software Architectures: Methods and Case Studies, by Clements et al.)
If you’ve ever shopped for a car, you know it’s a struggle to decide which car is right for you. You want a car that’s not too expensive, has great acceleration, can seat four people (plus camping gear), and gets great gas mileage. You realize that no one car has it all and each car has things you like and don’t like. It’s your job to
figure out which features you really want and how to weigh each feature to help you make the final decision. To find the best car for you, it's important to first understand which features are the most important to you. Once you know that, you can prioritize your requirements, check the car's specifications, and objectively balance trade-offs.
Selecting the right database architecture for your business problem is similar to purchasing a car. You must first understand the requirements and then rank how important each requirement is to the project. Next, you'll look at the available database options and objectively measure how each of your requirements will perform in each database option. Once you've scored how each database performs, you can tally the results and weigh the most important criteria accordingly. Seems simple, right?
Unfortunately, things aren't as straightforward as they seem; there are usually complications. First, all team stakeholders might not agree on project priorities or their relative importance. Next, the team assigned to scoring each NoSQL database might not have hands-on experience with a particular database; they might only be familiar with RDBMSs. To complicate matters, there are often multiple ways to recombine components to build a solution. The ability to move functions between the application layer and the database layer makes comparing alternatives even more challenging.
Despite the challenges, the fate of many projects and companies can depend on the right architectural decision. If you pick a solution that's a good fit for the problem, your project can be easier to implement and your company can gain a competitive advantage in the market. You need an objective process to make the right decision.
In this chapter, we'll use a structured process called architecture trade-off analysis to find the right database architecture for your project. We'll walk through the process of collecting the right information and creating an objective architecture-ranking system. After reading this chapter, you'll understand the basic steps used to objectively analyze the benefits of various database architectures and how to build quality trees and present your results to project stakeholders.

12.1 What is architecture trade-off analysis?
Architecture trade-off analysis is the process of objectively selecting a database architecture that's the best fit for a business problem. A high-level description of this process is shown in figure 12.1.
The qualities of software applications are driven by many factors. Clear requirements, trained developers, good user interfaces, and detailed testing (both functional and stress testing) will continue to be critical to successful software projects. Unfortunately, none of that will matter if your underlying database architecture is the wrong architecture. You can add more staff to the testing and development teams, but changing an architecture once the project is underway can be costly and result in significant delays.
For many organizations, selecting the right database architecture can result in millions of dollars in cost savings. For other organizations, selecting the wrong database
[Figure 12.1 process flow: gather all business requirements → find architecturally significant requirements → select top four architectures → score effort for each requirement for each architecture → total effort for each architecture]
Figure 12.1 The database architecture selection process. This diagram shows the process flow of selecting the right database for your business problem. Start by gathering business requirements and isolating the architecturally significant items. Then test the amount of effort required for each of the top alternative architectures. From this you can derive an objective ranking for the total effort of each architecture.
architecture could mean they'll no longer be in business in a few years. Today, business stakes are high, and as the number of new data sources increases, the stakes will increase proportionately.

Some people think of architecture trade-off analysis as an insurance policy. Senior architects on a selection team who have strong exposure to many NoSQL databases may have an intuitive sense of what the right database architecture might be for a new application. But doing your analysis homework will not only create assurance that the team is right; it'll help everyone understand why the fit is good.

An architecture trade-off analysis can be completed in a few weeks and should be done at the beginning of your project. The artifacts created by the process will be reused in later phases of the project. The overall costs are low, and the benefits of a good architectural match are high.

Selecting the right architecture should be done before you start looking at various products, vendors, and hosting models. Each product, be it open source or commercially licensed, adds an additional layer of complexity as it introduces the variables of price, long-term viability of the vendor, and hosting costs. The top database architectures will be around for a long time, and this work won't need to be redone in the short term. We think that selecting an architecture before introducing products and vendors is the best approach.

There are many benefits to doing a formal architecture trade-off analysis. Some of the most commonly cited benefits are
■ Better understanding of requirements and priorities
■ Better documentation on requirements and use cases
■ Better understanding of project goals, trade-offs, and risks
■ Better communication among project stakeholders and shared understanding of the concerns of other stakeholders
■ Higher credibility of team decisions

These benefits go beyond the decision of what product and what hosting model an organization will use. The documentation produced during this process can be used
throughout the project lifecycle. Not only will your team have a detailed list of critical success factors, but these items will be logically categorized and prioritized.

Many government agencies are required by law to get multiple bids on software systems that exceed a specific cost threshold. The processes outlined in this chapter can be used to create a formal request for proposal (RFP), which can be sent to potential vendors and integrators. Vendors that respond to RFPs in writing are legally obligated to state whether they satisfy the requirements. This gives the buyer control over many aspects of a purchase that they wouldn't normally have.

Let's now review the composition and structure of a NoSQL database architecture selection team.

12.2 Team dynamics of database architecture selection

Your goal in an architecture selection project is to select the database architecture that best fits your business problem. To do this, you should use a process that's objective and fair and that has a high degree of credibility. Ideally, the process will also build consensus with all stakeholders, so that when the project is complete, everyone will support the decision and work hard to make the project successful. To achieve this, it's important to take multiple perspectives into account and weigh the requirements appropriately; see figure 12.2.
Figure 12.2 The architecture selection team should take into account many different perspectives. The business unit wants a system that’s easy to use and to extend. Developers want a system that’s easy to build and maintain. Operations wants a database that can be easily monitored and scaled by adding new nodes to a cluster. Marketing staff want to have a system that gives them a long-term competitive advantage over other companies.
12.2.1 Selecting the right team

One of the most important things to consider before you begin your analysis is who will be involved in making the decision. It's important to have the right people and to keep the size of the team to a minimum. A team of more than five people is unwieldy, and scheduling meetings across too many calendars is a nightmare. The key is that the team should fairly represent the concerns of each stakeholder and weigh those concerns appropriately. If you have a formal voting process to rank feature priorities, it's good to have an odd number of people on the team or to designate one person as the tie-breaker. Here's a short list of the key questions to ask about the team makeup:
■ Will the team fairly represent all the diverse groups of stakeholders?
■ Are the team members familiar with the architecture trade-off process?
■ Does each member have adequate background, time, and interest to do a good job?
■ Are team members committed to an objective analysis process? Will they put the goals of the project ahead of their own personal goals?
■ Do the team members have the skills to communicate the results to each of the stakeholders?

If the architecture selection process is new to some team members, you'll want to take some time initially to get everyone up to speed. If done well, this can be a positive shared learning experience for new members of the selection team. These steps are also the initial phase of agreeing not only on the critical success factors of the project but also on the relative priority of each feature of the system. Building consensus in the early phases of a project is key to getting buy-in from the diverse community of people that will fund and support ongoing project operations.

Getting your architecture selection team in alignment, and even creating enthusiastic support for your decision, involves more than technical decisions. Experience has shown that the early success of a project is part organizational psychology, part communication, and part architectural analysis. Securing executive support, a senior project manager, and an open-minded head of operations are all factors that will contribute to the ultimate success of the pilot project.

One of the worst possible outcomes in a selection process is selecting a database that one set of stakeholders likes but another group hates. A well-run selection process and good communication can help everyone realize there's no single solution that meets everyone's needs. Compromises must be made, risks identified, and plans put in place to mitigate those risks. Project managers need to determine whether team members are truly committed or are using passive-aggressive behavior to undermine the project. Adopting new technologies is difficult even when all team members are committed to the decision. If there's division, the process will only be more difficult.

Keeping your database selection objective is one of the most difficult parts of any selection process. As you'll see, there are some things, such as experience
bias and using outside consultants, that impact the process. But if you keep these in mind, you can make adjustments and still make the selection process neutral.
12.2.2 Accounting for experience bias

When you perform an architecture trade-off analysis, you must bear in mind that each person involved has their own set of interests and biases. If members of your project team have experience with a particular technology, they'll naturally attempt to map the current problem to a solution they're familiar with. New problems will be matched against the existing patterns in their brains without conscious thought. This doesn't mean the individual is putting self-interest above the goals of the project; it's human nature.

If you have people who are good at what they do and have been doing it for a long time, they'll attempt to apply the skills and experience they've used in previous projects. People with these attributes may be most comfortable with existing technologies and have a difficult time objectively matching the current business problems to an unfamiliar technology. To make a credible recommendation, all team members must commit to putting their personal skills and perspectives in the context of the needs of the organization.

This doesn't imply that existing staff or current technologies shouldn't be considered. It means that the people on your architecture trade-off team must create evaluations weighted in ways that put the needs of the organization before their personal skills and experience.
12.2.3 Using outside consultants

Something each architecture selection team should consider is whether the team should include external consultants, and if so, what role they should play. Consultants who specialize in database architecture selection may be familiar with the strengths and weaknesses of multiple database architectures, but there's a good chance they won't be familiar with your industry, your organization, or your internal systems. The cost-effectiveness of these consultants is driven by how quickly they can understand your requirements.

External consultants can come up to speed quickly if you have well-written, detailed requirements. Having well-written system requirements and a good glossary of business terms that explains internal terminology, usage, and acronyms can lower database selection costs and increase the objectivity of the selection. This brings an additional level of assurance for your stakeholders. High-quality written requirements not only allow new team members to come up to speed; they can also be used downstream when building the application.

In the end, you need someone on the team who knows how each of these architectures works. If your internal staff lacks this experience, an outside consultant should be considered. With your team in place, you're ready to start looking at the trade-off analysis process.
12.3 Steps in architectural trade-off analysis

Now that you've assembled an architectural selection team that will be objective and represent the perspectives of various stakeholders, you're ready to begin the formal architectural trade-off process. Here are the typical steps used in this process:
1 Introduce the process—To begin, it's important to provide each team member with a clear explanation of the architecture trade-off analysis process and why the group is using it. From there, the group should agree on the makeup of the team, the decision-making process, and the expected outcomes. The team should know that the method has been around for a long time and has a proven track record of positive results when done correctly.

2 Gather requirements—Next, gather as many requirements as is practical and put them in a central structure that can be searched and reported on. Requirements are a classic example of semistructured data, since they contain both structured fields and narrative text. Organizations that don't use a requirements database usually put their requirements into MS Word or Excel spreadsheets, which makes them difficult to manage.

3 Select architecturally significant requirements—After requirements are gathered, review them and choose the subset that will drive the architecture. Filtering out the essential requirements that drive an architecture is somewhat complex and should be done by team members who have experience with the process. Sometimes a small requirement can force a big change in architecture. The exact number of architecturally significant requirements depends on the project, but a good guideline is at least 10 and not more than 20.

4 Select NoSQL architectures—Select the NoSQL architectures you want to consider. The most likely candidates would include a standard RDBMS, an OLAP system, a key-value store, a column family store, a graph database, and a document database. At this point, it's important not to dive into specific products or implementations, but rather to understand the architectural fit with the current problem. In many cases, you'll find that some obvious architectures aren't appropriate and can be eliminated. For example, you can eliminate an OLAP implementation if you need to perform transactional updates. You can also eliminate a key-value store if you need to query on the values of a key-value pair. This doesn't mean you can't include these architectures in hybrid systems; it means they can't solve the problem on their own.

5 Create use cases for key requirements—Use cases are narrative documents that describe how people or agents will interact with your system. They're written by subject matter experts or business analysts who understand the business context. Use cases should provide enough detail that an effort analysis can be made. They can be simple statements or multipage documents, depending on the size of the project and the detail necessary. Many use cases are structured around the lifecycle of your data. You might have one use case for adding new records, one for listing records, one for searching, and one for exporting data, for example.
6 Estimate effort level for each use case for each architecture—For each use case, determine a rough estimate of the level of effort required and apply a scoring system, such as 1 for difficult and 5 for easy. As you determine the effort, place the appropriate number for each use case into a spreadsheet, as shown in figure 12.3.

7 Use weighted scores to rank each architecture—In this stage, you'll combine the effort level with some type of weight to create a single score for each architecture. Items that are critical to the success of the project and that are easy to implement will get the highest scores. Items that are lower in priority and harder to implement will get lower scores. By adding up the weighted scores, as shown in figure 12.4, you'll generate a composite score that can be used to compare each architecture; a minimal code sketch of this calculation appears after figure 12.4.

In the first pass at weighting, effort estimates may be rough. You can start with a simple scale of High, Medium, and Low. As you become more comfortable with the results, you can move to a finer 1–5 scale, using a higher number for lower effort. How each use case is weighted against the others should also help your group build consensus on the relative importance of each feature. Use cases can also be used to understand risk factors of the project. Features marked as critical will need special attention from project managers doing project risk assessment.

8 Document results—Each step in the architecture trade-off process creates a set of documents that can be combined into a report and distributed to your stakeholders. The report will contain starting points for the information you need to communicate to your stakeholders. These documents can be shared in many forms, such as a report-driven website, MS Word documents, spreadsheets and slide presentations, colorful wall-sized posters, and quality trees, which we'll review later in this chapter.
Figure 12.3 A sample architecture trade-off scorecard for a specific project with categorized use cases. For simplicity, all of the use cases have the same weighted value.
9 Communicate results—Once you have your documentation in order, it's time to present the results. This needs to be done in a way that builds credibility for your process team. Sending a 100-page report as an email attachment isn't going to generate excitement. But if you present your case in an interactive format, you can create a sense of urgency using vocabulary the audience understands.

The communication job isn't done until you've ensured that the stakeholders understand the key points you want to communicate. Creating an online quiz will usually turn people off. You want to create two-way dialogs that verify the information has been transmitted and understood so the stakeholders can move on to the next stage of the project. An example of a weighted scorecard is shown in figure 12.4.

We want to make it clear that selecting an architecture before selecting a specific NoSQL product is the preferred method. We've been involved in situations where a team is trying to decide between two products that use different architectures. The team must then combine both architectural decisions and vendor-specific issues in one session, which prevents the team from understanding the real underlying architectural fit issues.

It might be interesting to compare the steps in the trade-off analysis process with a similar process developed by the Software Engineering Institute (SEI) at Carnegie Mellon.
Figure 12.4 A weighted scorecard for a software selection process for a statute management system. The grid shows the ease of performing a task for four alternative architectures. A score of 0–4 is used, with a 4 indicating the lowest effort. The most critical features have the highest weight in the total score.
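To make the weighted-score arithmetic in step 7 concrete, here's a minimal sketch in Python. The use cases, weights, and effort scores below are made up for illustration; only the weight-times-effort summing rule comes from the process described above.

# Sketch of the weighted-score arithmetic behind a scorecard like
# figure 12.4. Effort scores use a 0-4 scale (higher = lower effort);
# weights say how critical each use case is. All names and numbers
# below are illustrative, not taken from a real project.

use_cases = {
    # use case:         (weight, {architecture: effort score})
    "Load documents":   (3, {"RDBMS": 2, "Document store": 4, "Key-value store": 3}),
    "Full-text search": (3, {"RDBMS": 1, "Document store": 4, "Key-value store": 0}),
    "Ad hoc reporting": (2, {"RDBMS": 4, "Document store": 3, "Key-value store": 1}),
    "Scale out reads":  (1, {"RDBMS": 1, "Document store": 3, "Key-value store": 4}),
}

def composite_scores(use_cases):
    """Sum weight * effort for every architecture across all use cases."""
    totals = {}
    for weight, scores in use_cases.values():
        for architecture, effort in scores.items():
            totals[architecture] = totals.get(architecture, 0) + weight * effort
    return totals

for architecture, total in sorted(composite_scores(use_cases).items(),
                                  key=lambda item: -item[1]):
    print(architecture, total)

One side benefit of keeping the scores in a structure like this: changing a single weight or effort score and rerunning the totals is a quick way to test how sensitive the final ranking is to any one judgment call.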
[Figure 12.5 diagram: business drivers, quality attributes, and user stories flow into the analysis, as do the architecture plan, architectural approaches, and architectural decisions; the analysis distills trade-offs, impacts, sensitivity points, risks, non-risks, and risk themes.]
Figure 12.5 High-level information flow for the SEI ATAM process. Although this process isn't specifically targeted at NoSQL database selection, it shares many of the same processes that we recommend in this chapter. Business requirements (driven by marketing) and architectural alternatives (driven by your knowledge of the NoSQL options available) have separate paths into the analysis process. The outcome of this analysis is a set of documents that shows the strengths and weaknesses of one architecture.
A high-level description of the Architecture Trade-off Analysis Method (ATAM) process developed by Carnegie Mellon is shown in figure 12.5. Originally, the ATAM process was designed to analyze the strengths and weaknesses of a single architecture and highlight project risk. It does this by identifying the high-level requirements of a system and determining how easily the critical requirements fit into a given architecture. We'll slightly modify the ATAM process to compare different NoSQL architectures.

12.4 Analysis through decomposition: quality trees

The heart of an objective evaluation is to break a large application into smaller discrete processes and determine the level of effort each architecture demands to meet the system requirements. This is a standard divide-and-conquer decomposition analysis. To use decomposition, you continue to break large pieces of functionality into smaller components until you understand, and can communicate, the relative strengths and weaknesses of each architecture for the critical areas of your application.

Although there are several ways to use decomposition, the best practice is to break the overall functionality of an application into small stories that describe how a user will interact with the system. We call these the system's scenarios or use cases. Each process is documented in a story that describes how actors will interact with the system. This can be as simple as "loading data" or as complex as "upgrading software with new releases." For each process, you'll compare how easy or difficult it is to perform the task on each of the systems under consideration. Using a simple category scale (easy, medium, hard) or a 1–5 ranking is a good way to get started.
One of the keys to objective comparison is to group requirements into logical hierarchies. Each upper node within a hierarchy will have a label. Your goal is to create a hierarchy that has no more than 10 top-tier branches with meaningful labels.

Once you've selected an architecture, you're ready to build a quality tree. Quality trees are hierarchies of quality attributes that determine how well your specific situation fits with the system you're selecting. These attributes are sometimes called the -ilities, as shorthand for properties such as scalability, searchability, availability, agility, maintainability, portability, and supportability that we've discussed.
12.4.1 Sample quality attributes

Now that you have an idea of what quality trees are and how they're used to classify system requirements, let's take a look at some sample qualities that you might consider in your NoSQL database selection.
■ Scalability—Will you get an incremental performance gain when you add additional nodes to a cluster? Some people call this linear or horizontal scalability. How difficult is it to add or remove nodes? What's the operational overhead when you add or remove nodes from a cluster? Note that read scalability and write scalability can have different answers to this question.

We try to avoid using the word performance in our architectural quality trees, as it means different things to different people. Performance can mean the number of transactions per second in an RDBMS, the number of reads or writes per second in a key-value store, or the time to display a total count in an OLAP system. Performance is usually tied to a specific test with a specific benchmark. You're free to use it as a top-level category as long as you're absolutely clear that everyone on the team will use the same definition. When in doubt, use the most specific word you can find.

One good example of how difficult it is to measure performance is the degree to which data will be kept in the caching layer of a NoSQL database. A benchmark that selects distinct data may run slowly on one system, yet it may not represent the real-world query distribution of your application. We covered many issues related to horizontal scale-out in chapter 6 on big data, and the issues of referential transparency and caching in chapter 10.

■ Availability—Will the system keep running if one or more components fail? Some groups call this attribute reliability, but we prefer availability. We covered high availability in chapter 8, including the subtleties of measuring performance and the pros and cons of different distribution models. In chapter 2 and in other case studies, we also discussed how, when network errors occur, many NoSQL systems allow you to make intelligent choices about how services should deal with partition tolerance.

■ Supportability—Can your data center staff monitor the performance of the system and proactively fix problems before they impact a customer? Are there high-quality monitoring tools that show the status of resources like RAM cache, input and output queries, and faults on each of the nodes in a cluster? Are there data feeds that provide information conforming to monitoring standards such as SNMP? Note that supportability is related to both scalability and availability: a system that automatically rebalances data in a cluster when a node fails will get a higher score for scalability and availability as well as for supportability.
■ Portability—Can you write code for one platform and port it to another platform without significant change? This is one area where many NoSQL systems don't perform well. Many NoSQL databases have interfaces and query languages that are unique to their product, and porting applications can be difficult. Here SQL databases have an advantage over NoSQL systems that require database developers to use proprietary interfaces.

One method you can use to gain portability is to build database-independent wrapper libraries around your database-access functions; a sketch of such a wrapper appears after this list. Porting to a new database could then be done by implementing a new version of your wrapper library for each database. As we discussed in chapter 4, building wrappers can be easy when the APIs are simple, such as GET, PUT, and DELETE. Adding a wrapper layer for increased portability becomes more expensive when there are many complex API calls to the database.

■ Sustainability—Will the developers and vendors continue to support this product, or this version of the product, in the future? Is the data model shared by other NoSQL systems? How viable are the organizations supporting the software? If the software is open source, can you hire developers to fix bugs or create extensions?

The way document fragments are addressed using path expressions is a universal data-access pattern common to all document stores; path expressions will be around for a long time. In contrast, there are many new key-value stores that store structures within values and have their own, system-specific way of accessing this data. It's difficult to predict whether these will become future standards, so there's higher risk in these areas.

Note that the issues of portability and sustainability are tied together. If your application uses a standard query language, it'll be more portable, so the concerns about sustainability are lower. If your NoSQL solution locks you into a query language that's unique to that solution, you'll be tied to that single product.

■ Security—Can you provide the appropriate access controls so that only authorized users can view the appropriate elements? What level of granularity does the database support? The key question we considered is whether security should be placed within the database layer itself or whether there are methods for keeping security at the application layer.

■ Searchability—Can you quickly find the items you're looking for in your database? If you have large amounts of full text, can you search by using a keyword or phrase? Can you rank search results according to how closely a record matches your keyword search? This topic was covered in chapter 7.
■ Agility—How quickly can the software be modified to meet changing business requirements? We prefer the term agility over modifiability, since agility is associated more closely with a business objective, but we've seen both terms in use. This topic was covered in chapter 9.
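As one illustration of the wrapper approach mentioned under portability, the sketch below defines a thin, database-independent key-value interface in Python. The class and method names are our own invention, not a standard API; a real adapter would wrap a specific vendor driver behind the same three calls.

# Sketch of a database-independent wrapper layer for simple key-value
# access. Application code depends only on KeyValueStore; porting to a
# new database means writing one new adapter class.

from abc import ABC, abstractmethod
from typing import Optional

class KeyValueStore(ABC):
    """The interface the application codes against."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

class InMemoryStore(KeyValueStore):
    """Trivial adapter, useful for tests; a production adapter would
    delegate these calls to a vendor client library instead."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

The design choice here is simply to keep the application's dependency surface down to GET, PUT, and DELETE, so swapping databases touches one adapter class rather than every call site.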
Beware of vendor-funded performance benchmarks and claims of high availability

Several years ago, one of us was working on a database selection project when a team member produced a report that claimed vendor A had a tenfold performance advantage over the other vendors. It took our team several days of research to find that the benchmark was created by an "independent" company that was paid by vendor A. This so-called independent company selected favorable samples and tuned only the funding vendor's configuration; no other databases were tuned for the benchmark.

Most external database performance benchmarks target generic read, write, or search metrics. They won't be tuned to the needs of a specific project and as such aren't useful for making valid comparisons. We avoid using external benchmarks when comparing performance.

If you're going to do performance bake-offs, you must make sure that each database is configured appropriately and that experts from each vendor are given ample time to tune both the hardware and the software to your workload. Make sure the performance benchmark precisely matches the data types you'll use, and use your predicted peak loads for reads, writes, updates, and deletes. Also make sure that your benchmark includes repeated read queries so that the impact of caching can be accounted for (a minimal benchmark loop is sketched after this sidebar).

Vendor claims of high availability should also be backed up by visits to other customers' sites that can document service levels. If something sounds too good to be true, it probably is. Any vendor with astounding claims should back them up with astounding evidence.
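If you do run an in-house bake-off, the measurement loop itself can be small. The sketch below is a minimal example, not a full benchmark suite: the 80/20 read/write mix, key count, and payload are assumptions, and store can be any object exposing get and put (such as the wrapper sketched earlier). The one rule taken from the sidebar is that the workload re-reads a set of hot keys so the effect of caching is reflected in the numbers.

# Minimal micro-benchmark sketch. It repeatedly re-reads a small set
# of hot keys so the database's cache influences the result, and it
# mixes in writes at a fixed ratio. Mix, key count, and payload size
# are assumptions for illustration only.

import random
import time

def run_benchmark(store, operations=10_000, hot_keys=100, read_ratio=0.8):
    rng = random.Random(42)                      # fixed seed: repeatable runs
    keys = [f"key-{i}" for i in range(hot_keys)]
    for key in keys:                             # preload so reads can hit
        store.put(key, b"payload")

    start = time.perf_counter()
    for _ in range(operations):
        key = rng.choice(keys)
        if rng.random() < read_ratio:            # repeated reads on hot keys
            store.get(key)
        else:
            store.put(key, b"payload")
    elapsed = time.perf_counter() - start
    return operations / elapsed                  # operations per second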
12.4.2 Evaluating hybrid and cloud architectures

It's usually the case that no single database product will meet all the needs of your application. Even full NoSQL solutions that leverage other tools, such as search and indexing libraries, may have gaps in functionality. For items like image management and read-mostly BLOB storage, you might want to blend a key-value store with another database. This frequently requires the architectural assessment team to carefully partition the business problem into areas of concern that need to be grouped together.

One of the major themes of this book has been the role of simple, modular database components that can snap together like Lego blocks. This modular architecture makes it easier to customize service levels for specific parts of your application. But it also makes it difficult to do an apples-to-apples comparison.
Running a single database cluster might be operationally more efficient than running one cluster for data, one cluster for search, and one cluster for key-value storage. This is why in many of our evaluations we include a separate column for a hybrid solution that uses different components for different aspects of an application. Supporting multiple clusters can allow the team to use specialized databases tuned for different tasks, but each cluster needs its own operational budget extending out over multiple years.

A best practice is to use cloud-based services whenever possible. These services can reduce your operational overhead costs and, in some cases, make them more predictable. Predicting the costs of cloud-based NoSQL services can still be complicated, since the service charges are based on many factors, such as disk space used, the number of bytes of input and output, and the number of transactions. There are times when cloud computing shouldn't be used—say, if your business problem requires data to be constantly moved between a private data center and the cloud.

Storing all static resources on a public cloud works well when your users are based on the internet. This means that the I/O load of users hitting your local site will be much lower. Since I/O is sometimes a key limitation for a website, removing static data from the site can be a great way to lower overall cost.

As you complete your analysis, you begin thinking about the best way to communicate your findings and recommendations to your team and stakeholders and move forward with the project.

12.5 Communicating the results to stakeholders

To move your NoSQL pilot project forward, you need to present a compelling argument to your stakeholders. Frequently, those who make this important decision are also the least technical people in the organization. To receive serious consideration, you must communicate your results and findings clearly, concisely, and in a way everyone understands. Drawing detailed architecture diagrams will only be valuable if your intended audience understands the meaning of all the boxes and lines in the diagram.

Many projects do select the best match of a business problem to the right NoSQL architecture, yet fail to communicate the details of why the decision was made to the right people, and so the endeavor fails. The architecture trade-off analysis process should be integrated with the project communication plan. To create a successful outcome, you need to trade your analyst hat for a marketing hat.
12.5.1 Using quality trees as navigational maps

A goal of this book is to create a vocabulary and a set of tools that help everyone, including nontechnical staff, understand the key issues of database selection. One of the most difficult parts of presenting results to your stakeholders is creating a shared navigational map that helps everyone orient themselves to the results of the evaluation. Navigational maps help stakeholders determine whether their concerns have been addressed and how well the solution meets those concerns.
For example, if a group of stakeholders has expressed concerns about using roles to grant users access to data, they should be able to guess that roles are listed in your map in the area labeled Security. From within that part of the map, they will see how role-based access is listed under the authorization area.

Although there are multiple ways to display these topic maps, one of the most popular is a tree-like diagram that places the highest category of concern (the -ilities) on the left side of the tree and then presents increasing levels of detail as you move to the right. We call these structures quality trees. An example of a quality tree is shown in figure 12.6.

This example uses a three-level structure to break down the concerns of the project. Each of the seven qualities can be further broken down into subfeatures. A typical single-page quality tree will usually stop at the third level of detail because of the limitations of reading small fonts on a single page. Software tools can allow interactive sites to drill further down to a requirements database and view use cases for each requirement.
[Figure 12.6 quality tree, rendered as an indented list. Each leaf shows two letter codes: how important the feature is to this project, and how well the solution meets it.

Solution support
  Searchability
    Full-text search: full-text search on document data (H, H)
    XML search: fast range searching (H, H)
    Custom scoring: score rank via XQuery (M, H)
  XML importability
    Drag-and-drop: easy to add new XML data (C, H)
    Bulk import: batch import tools (M, M)
  Transformability
    XQuery: easy to transform XML data (H, H); transform to HTML or PDF (VH, H)
    XSLT: works with web standards (VH, H)
  Affordability
    No license fees: use open source software (H, H)
    Standards-based: no complex languages to learn (H, H); staff training (H, M)
  Sustainability
    Standards: use established standards (VH, H); use declarative languages (VH, H)
    Ease of change: customizable by nonprogrammers (H, H)
  Interoperability
    Web services: use W3C standards (VH, H); mashups with REST interfaces (VH, H)
    No DB translations: works with W3C Forms (H, M)
  Security
    Fine-grain control: prevents unauthorized access (H, H); collection-based access (H, M)
    Role-based: centralized security policy (H, M)]
Figure 12.6 Sample quality tree—how various qualities (-ilities) are grouped together and then scored. High-level categories on the left are broken down into lower-level categories. A short text describing the quality is followed by letter codes to show how important a feature is (C=Critical, VH=Very High, H=High, M=Medium, L=Low) and how well the proposed solution meets the requirement.
Quality trees are one method of using graphical presentations to quickly explain the process and findings of your project to stakeholders. Sometimes the right image or metaphor will make all the difference. Let's see how finding the right metaphor can ease the adoption of new technologies.
12.5.2 Apply your knowledge

Sally is working on a team that has looked at two options to scale out their database. The company provides data services for approximately 100 large corporate customers, each of whom runs reports on only their own data. Only a small number of internal reports span multiple customers' files. The team is charged with coming up with an architecture that can scale out and allow the customer base to continue to grow without impacting database performance.

The team agrees on two options for distributing the customer data over a cluster of servers. Option A keeps each customer's data together on a single server. This architecture allows for scale-out, and new database servers can be added as the customer base grows. When new servers are added, the first letter of the customer's name is used to divide the data between the servers. Option B distributes customer records evenly across multiple servers using hashing. Any single customer will have their data on every node in the cluster, replicated three times for high availability.

Sally's team is concerned that some stakeholders might not understand the trade-offs and believe that storing all of a customer's data on a single node with replication might be the better option. But if they implement option A, half of the nodes in the cluster will be idle and only in use when a failure occurs. The team wants to show everyone how evenly distributing the queries over the cluster will return faster results, even if queries are running on every node in the cluster. Yet, after a long discussion, the team can't convince their stakeholders of the advantages of their plan. One of the team members has to leave early: he has to catch a plane and is worried about the long lines at the airport. Suddenly, Sally has an idea.

At the group's next meeting, Sally presents the drawings in figure 12.7 and uses her airport metaphor to describe how most of the nodes in the cluster would remain idle even while a customer waits many hours for a long report to run. By evenly distributing the data over the cluster, the load is evenly spread and a report runs on every node until it finishes. No long lines!

Sally's metaphor works, and everyone on the project understands the consequences of uneven distribution of data in a cluster and how costs can be lower and performance improved if the right architecture is used. Though you can create a detailed report of why one scale-out architecture works better than another using bulleted lists, graphics, models, or charts, a metaphor everyone understands gets to the heart of the matter quickly and drives the decision-making process.
[Figure 12.7 panels: a CPU utilization chart for customers A, B, and C beside an airline check-in line. Callouts: one customer is running a report that takes a long time to run on a single system; with a NoSQL solution, the report would be evenly distributed on all servers and run in half the time. This is like the situation when many people are forced to go to the longest line, even when other agents are not busy.]
Figure 12.7 Rather than only creating a report that describes how one option provides better distribution of load over a cluster, use graphs and metaphors to describe the key differences. While many people have seen CPU utilization charts, waiting in line at the airport is a metaphor they’ve most likely experienced.
As with buying a car, there will seldom be a perfect match for all your needs. Every year the number of specialized databases seems to grow. Not many free, standards-compliant open source software systems scale well and store all data types automatically without the need for programmers and operations staff. If they did, we'd already be using them. In the real world, we need to make hard choices.
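Sally's two options can also be demonstrated in a few lines of code. The sketch below is illustrative only: the node count, customer names, and choice of hash are made up, but it shows why option A's alphabetic ranges skew load toward a few nodes while option B's hashing spreads records across the whole cluster.

# Contrast of Sally's option A (alphabetic ranges) and option B
# (hash distribution). Node count, names, and hash are illustrative.

import hashlib
from collections import Counter

NODES = 4

def option_a(customer: str) -> int:
    """Route by first letter: roughly A-G, H-N, O-U, V-Z."""
    return min((ord(customer[0].upper()) - ord("A")) // 7, NODES - 1)

def option_b(customer: str) -> int:
    """Route by hashing the whole name; spelling no longer matters."""
    digest = hashlib.md5(customer.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODES

customers = ["Acme", "Apex", "Ajax", "Alto", "Baker", "Smith", "Stone", "Zed"]
print("Option A:", Counter(option_a(c) for c in customers))  # piles up on node 0
print("Option B:", Counter(option_b(c) for c in customers))  # spreads out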
12.5.3 Using quality trees to communicate project risks

As we mentioned, the architectural trade-off process creates documentation that can be used in subsequent phases of the project. A good example of this is how you can use a quality tree to identify and communicate potential project risks.

A quality tree contains a hierarchy of your requirements grouped into logical branches. Each branch has a high-level label that allows you to drill down into more detailed requirements. Each leaf of the tree has a written requirement and two scores: the first is how important the feature is to the overall project; the second is how well the architecture or product will implement the feature.

Risk analysis is the process of identifying and analyzing the gaps between how important a feature is and how well an architecture implements it. The greater the gap, the greater the project risk. If a database doesn't implement a feature that's a low priority for your project, it may not be a risk. But if a feature is critical to project success and the architecture doesn't support it, you have a gap in your architecture. Gaps equal project risk, and communicating potential risks and their impact to stakeholders is necessary and important, as shown in figure 12.8 (a code sketch of this roll-up rule follows the figure).
[Figure 12.8 risk tree, rendered as a list. Branches: Availability; Scalability; Maintainability; Portability (Alternate DBs: code can be ported to other databases (M, L)); Affordability; Interoperability; Sustainability; Security (Authentication; Authorization: only some roles can update some records (H, L); Audit; Encryption). Callouts: a solid border indicates no problems, meaning the architecture and product meet the requirements of the project; a dashed line indicates architectural risk; the highest warning rolls up to each parent node. Portability has been ranked of medium importance to the project, but the architecture has low portability to other databases. Authorization has been ranked of high importance, but the implementation being considered has been evaluated as of low compliance, which helps rank project risks.]
Figure 12.8 An example of using a quality tree to communicate risk-gap analysis. You can use color, patterns, and symbols to show how gaps in lower-level features will contribute to overall project risk. The way we do this is by having each branch in our tree inherit the highest level of risks from a subelement.
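The roll-up rule described in the figure 12.8 caption is easy to mechanize. The following sketch assumes the importance/fit letter codes from figure 12.6 and a small made-up tree; each leaf is scored by the gap between importance and fit, and every branch inherits the worst gap found beneath it.

# Sketch of the risk roll-up rule from figure 12.8: score each leaf by
# the gap between importance and fit, and let every branch inherit the
# largest gap below it. Scales and the sample tree are illustrative.

LEVELS = {"L": 1, "M": 2, "H": 3, "VH": 4, "C": 5}

def gap(importance: str, fit: str) -> int:
    """Positive when a feature matters more than the fit delivers."""
    return max(LEVELS[importance] - LEVELS[fit], 0)

def rollup(node) -> int:
    """Leaves are (importance, fit) tuples; branches are dicts."""
    if isinstance(node, tuple):
        return gap(*node)
    return max(rollup(child) for child in node.values())

tree = {
    "Portability": {"Alternate DBs": ("M", "L")},   # gap 1: moderate risk
    "Security": {
        "Authorization": ("H", "L"),                # gap 2: this rolls up
        "Encryption": ("H", "H"),                   # gap 0: no risk
    },
}

for branch, subtree in tree.items():
    print(branch, "worst gap:", rollup(subtree))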
12.6 Finding the right proof-of-architecture pilot project

After you've finished your architecture trade-off analysis and have chosen an architecture, you're ready for the next stage: a proof-of-architecture (POA) pilot project, sometimes called a proof-of-concept project. Selecting the right POA project isn't always easy, and selecting the wrong project can prevent new technologies from gaining acceptance in an organization.

The most critical factor in selecting a NoSQL pilot project is to identify a project with the right properties. In the same way Goldilocks waited to find the item that was "just right," you want to select a pilot project that's a good fit for the NoSQL technology you're recommending. A good POA project looks at the following:
1 Project duration—Pilot projects should be of medium duration. If the project can be completed in less than a week, it might be dismissed as trivial. If the project runs many months, it becomes a target for those opposed to change and new technology. To be effective, the project should be visible long enough to achieve strategic victories without being vulnerable to constantly changing budgets and shifts in strategic direction.
Though this may sound overly cautious, introducing new technology into an organization can be a battle between old and new ideas. If you assume that enemies of change are everywhere, you can adjust your plans to ensure victory.

2 Technology transfer—Using an external consultant to build your pilot project without interaction with your staff prevents internal staff from understanding how the systems work and from getting exposure to the new technology. It also prevents your team from finding allies and building bridges of understanding. The best approach is to create teams that pair experienced NoSQL developers with the internal staff who will support and extend the databases.

3 The right types of data—Each NoSQL project has different types of data. It may be binaries, flat files, or document-oriented data. You must make sure the pilot project's data fits the NoSQL solution you're evaluating.

4 Meaningful, quality data—Some projects are chosen as pilots because they have high-variability data. This is the right reason to move away from a SQL database, but the data still needs to have strong semantics and data quality. NoSQL isn't a magic bullet that will make bad data good. There are many ways that schema-less systems can ingest data, run statistical analysis on it, and integrate machine learning to analyze it.

5 Clear requirements and success criteria—We've already covered the importance of understandable, written requirements in NoSQL projects. A NoSQL pilot project should also have clearly defined, written success criteria associated with measurable outcomes that the team agrees to.

Finding the right pilot project to introduce a new NoSQL database is critical for successful adoption. A NoSQL pilot project that fails, even if the reason for failure isn't associated with architectural fit, will dampen an organization's willingness to adopt new database technologies in the future. Projects that lack strong executive support and leadership can be difficult, if not impossible, to move forward when problems arise that aren't related to the new architecture. Our experience is that getting high-quality data into a new system is sometimes the biggest obstacle to getting users to adopt it. This is often not the fault of the architecture, but rather a symptom of the organization's inability to validate, clean, and manage metadata.

Before you proceed from the architecture selection phase to the pilot phase, stop and think carefully. Sometimes a team is highly motivated to get off older architectures and will take on impossible tasks to prove that the new architectures are better. Sometimes the urge to start coding immediately overrides common sense. Our advice is not to accept the first project that's suggested, but to find the right fit before you proceed. Sometimes waiting is the fastest way to success.
12.7 Summary

In this chapter, we've looked at how a formal architecture trade-off process can be used to select the right database architecture for a specific business project. When architectures were few in number, the process was simple and could be done informally by a group of in-house experts; there was no need for a detailed explanation of their decisions. But as the number of NoSQL database options increases, the selection process becomes more complex. You need an objective ranking system that helps you narrow down the options and then compare the trade-offs.

After reading the case studies, we believe with certainty that the NoSQL movement has triggered, and will continue to trigger, dramatic cost reductions in building applications. But the number of new options makes the process of objectively selecting the right database architecture more difficult. We hope that this book helps guide teams through this important but sometimes complex process and helps save both time and money, increasing your ability to adapt to changing business conditions.

When Charles Darwin visited the Galapagos Islands, he collected different types of birds from many of the islands. After returning to England, he discovered that a single finch had evolved into roughly 15 different species. He noted that the size and shape of the birds' beaks had changed to allow the birds to feed on seeds, cacti, or insects. Each island had different conditions, and in time the birds evolved to fit their requirements.

  Seeing this gradation and diversity of structure in one small, intimately related group of birds, one might really fancy that from an original paucity of birds in this archipelago, one species had been taken and modified for different ends. (Charles Darwin, Voyage of the Beagle)

NoSQL databases are like Darwin's finches. New species of NoSQL databases that match the conditions of different types of data continue to evolve. Companies that try to use a single database to process different types of data will in time go the way of the dinosaur. Your task is to match the right types of data to the right NoSQL solutions. If you do this well, you can build organizations that are healthy, insightful, and agile, and that can take advantage of a changing business climate.

For all the goodness that diversity brings, standards are still a must. Standards allow you to reuse tools and training and to leverage prebuilt and preexisting solutions. Metcalfe's Law, by which the value of a network grows roughly with the square of its number of users, applies to NoSQL standards as well as to network protocols. The diversity-standardization dilemma won't go away; it'll continue to play a role in databases for decades to come.

When we write reports for organizations considering NoSQL pilot projects, we imagine Charles Darwin sitting on one side and Robert Metcalfe sitting on the other—two insightful individuals using the underlying patterns in our world to help organizations make the right decision. These decisions are critical; the future of many jobs depends on making them well. We hope this book will guide you to an enlightening and profitable future.
For up-to-date analysis of the current portfolio of SQL and NoSQL architectures, please refer to http://manning.com/mccreary/ or go to http://danmccreary.com/nosql. Good luck!

12.8 Further reading

Clements, Paul, et al. Evaluating Software Architectures: Methods and Case Studies. Addison-Wesley, 2001.
"Darwin's finches." Wikipedia. http://mng.bz/2Zpa.
SEI. "Software Architecture: Architecture Tradeoff Analysis Method." http://mng.bz/54je.
www.it-ebooks.info index
A Amazon DynamoDB resources for 190 high-availability rowkey in 185 access control, in strategies 182–184 Apache Flume RDBMSs 48–49 overview 11–12 event log processing case ACID principles, transactions resources for 190 study using 25–27 Amazon EC2 133 challenges of 147–148 ACL (access-control lists) Amazon S3 gathering distributed 247 durability of 190 event data 148–149 ACM (access-control and key-value stores 71–72 overview 149–150 mechanisms) 247 security in website 153 actor model 227 access-control lists 247 Apache HBase 6 adapting software. See agility bucket policies 248–249 Apache Hibernate 200 aggregates Identity and Access Apache Lucene 154–155, 159 ad hoc reporting using Management 247 Apache Solr 154–155, 164 55–57 overview 246–247 API (application programming aggregate tables 53 SLA case study 176 interface) 68 defined 54, 132 Amazon SimpleDB 222 app-info.xml file 105 agile development 193 AMP (amplified application collections 88 agility permission) 251 application layers 235 architecture trade-off ampersand ( & ) 249 application tiers 17–21 analysis 266 analytical systems 52–53 application-level security local vs. cloud-based AND operations 249 236–237 deployment 195–196 annual availability design 176 Architecture Trade-off Analy- measuring 196–199 annual durability design 176 sis Method process. See resources for 206 ANY consistency level 184 ATAM using document stores Apache Accumulo assertions 74 199–200 data stores 6 associative array 65 using XRX to manage com- key visibility in 249–250 asymmetric cryptography 239 plex forms resources 253 ATAM (Architecture Trade-off agility and 205 Apache Ant 101 Analysis Method) complex business forms, Apache Cassandra process 262, 274 defined 201–202 Amazon DynamoDB building quality attribute without JavaScript and 182 trees 202–205 case study 184–185 evaluating hybrid and ALL consistency level 185 data stores 6 cloud All Users group 247 keyspace in 185–186 architectures 266– AllegroGraph 6 partitioner in 185 267
275
www.it-ebooks.info 276 INDEX
ATAM (Architecture Trade-off B Bigtable 11 Analysis Method) process, binding, for XRX 202 building quality attribute backing up BIT type 47 trees (continued) logging activity 243 BLOB type 47, 100 example attributes 264–266 using database triggers 102 books, queries on 108 overview 263–264 BASE principles 27–28 Boolean search 155–156 communicating with stake- basic access authentication 238 boost values holders basic availability, BASE defined 162 example of 269–270 principles 27 overview 167–168 overview 267 BEGIN TRANSACTION bottlenecks 186 project risks 270 statements 25, 46 brackets 99 using quality trees as navi- benchmarks 266 Brewer, Eric 30 gational maps Berkeley DB buckets 71, 187, 240, 248–249 267–269 data stores 6 bulk image processing 129 defined 255–257 history of 65 business agility 8 POA pilot project 271–272 BetterForm 204 business drivers resources 274 BI (business intelligence) agility 8 steps of 260–263 applications 52 variability 7 teams big data velocity 7 experience bias 259 defined 128–131 volume 7 selecting 258–259 distribution models 137–138 business intelligence applica- using outside event log processing case tions. See BI applications consultants 259 study atomicity, ACID principles 25 challenges of 147–148 C auditing 242–243 gathering distributed event Authenticated Users group 247 data 148–149 authentication C++ health care fraud discovery and NoSQL 209 basic access case study authentication 238 history of functional detecting fraud using programming 215 digest authentication 238 graphs 151–152 Kerberos protocol cache overview 150 cached items, defined 93 authentication 239 issues with 135–136 multifactor in NetKernel 221 linear scaling keeping current with consis- authentication 239 – – and expressivity 133 134 tent hashing 22–24 overview 237 238 overview 132–133 public key call stack 224 NoSQL solution for CAP theorem authentication 239 distributing queries to data SASL high availability and 174 239 nodes 146 authorization overview 30–32 moving queries to overview 240–241 case studies data 143–144 UNIX permission Amazon Dynamo 11–12 using hash rings to distrib- model 241–242 comparison 8–13 ute data 144 using roles 242 Google Bigtable 11 using replication to scale autocomplete in complex busi- Google MapReduce 10–11 reads 145–146 ness forms 202 LiveJournal Memcache 9 resources for 153 automatic failover MarkLogic 12 shared-nothing defined 177 categories 54 architecture 136–137 in Erlang 228 CDNs (content distribution availability using MapReduce networks) 132 and distributed channels, defined 149 and BASE systems 27 – architecture trade-off filesystems 140 142 CHAR type 47 analysis 264 efficient transformation checksum. See hashes – of column family stores 84 142 143 ChemML 167 – predicting 176–177 overview 139 140 Church, Alonzo 215 targets for 174 Bigdata (RDF data store) 6 CLI (command-line See also high availability BIGINT type 47 interface) 197
www.it-ebooks.info INDEX 277 client yield 177 Common Warehouse Meta- vs. Couchbase 187 Clojure 215, 222 model. See CWM data stores 6 cloud-based deployment communication with stakehold- Erlang and 222 195–196 ers CPU processor 21 cluster of unreliable commodity example of 269–270 cross data center replication. hardware. See CouchDB overview 267 See XDCR clusters 20 project risks 270 CSV (comma-separated value) COBOL 215 using quality trees as naviga- files cognitive style for functional tional maps 267–269 vs. JSON 98 programming 225 compartmentalized RDBMS tables and 98 cold cache 92 security 251 vs. XML 98 collections complex business forms CWM (Common Warehouse collection keys 94 201–202 Metamodel) 56 hierarchies, native XML data- complex content 89 Cyberduck 102 bases concentric ring model 233–234 customizing update behav- concepts of NoSQL D ior with database application tiers 17–21 triggers 102 CAP theorem 30–32 Data Architect 136 defined 102 consistent hashing to keep data architecture patterns grouping documents by cache current 22–24 column family stores access horizontal scalability with benefits of 83–84 – permissions 103 105 sharding 28–30 Google Maps case storing documents of dif- simplicity of study 85–86 ferent types 103 components 15–17 key structure in 82–83 in document stores strategic use of RAM, SSD, overview 81–82 application collections 88 and disk 21–22 storing analytical overview 88 transactions information 85 triggers 164 ACID 25–27 storing user collisions, hash 23 BASE 27–28 preferences 86 column family store overview 24–28 defined 38, 63 Apache Cassandra case study concurrency 226 document stores – keyspace 185 186 conditional statements APIs 89 – overview 184 185 224–225 application collections 88 partitioner 185 conditional views for complex collections in 88 rowkey 185 business forms 201 CouchDB example 91 benefits of consistency MongoDB example 90–91 easy to add new data 84 ACID principles 25 overview 86–87 higher availability 84 CAP theorem 30 graph stores higher scalability 84 levels, in Apache link analysis using 76–77 vs. column store 81 Cassandra 184–185 linking external data with data store types 6 consistent hashing 22–24 RDF standard 74–75 defined 63 consultants, outside 259 overview 72–74 Google Maps case study processing public – content distribution networks. 85 86 datasets 79–81 – See CDNs key structure in 82 83 context help in complex busi- rules for 78–79 overview 81–82 ness forms 201 high-availability systems storing analytical Couchbase 6, 91 57–58 information 85 case study 187–188 key-value stores storing user preferences 86 vs. CouchDB 187 Amazon S3 71–72 column store 39, 81 data stores 6 benefits of 65–68 comma separated value files. Erlang and 222 defined 64–65 See CSV files CouchDB (cluster of unreliable storing web pages in 70–71 command-line interface. commodity hardware) 6, using 68–70 See CLI 91 OLAP COMMIT TRANSACTION case study using 91 ad hoc reporting using statements 46 aggregates 55–57
data architecture patterns (continued)
    OLAP (continued)
        concepts of 54–55
        data flow from operational systems to analytical systems 52–53
        vs. OLTP 51–52
    RDBMS features
        access control 48–49
        data definition language 47
        replication 49–51
        synchronization 49–51
        transactions 45–47
        typed columns 47
    read-mostly systems 57–58
    revision control systems 58–60
    row-store pattern
        evolving with organization 41–42
        example using joins 43–45
        overview 39–41
        pros and cons 42–43
    variations of NoSQL architectural patterns
        customizing RAM stores 92
        customizing SSD stores 92
        distributed stores 92–93
        grouping items 93–94
data definition language. See DDL
data stores 81
data warehouse applications. See DW applications
database triggers, customizing update behavior with 102
database-level security 236–237
DATE type 47
DATETIME type 47
DDL (data definition language) 40, 47
DEBUG level 147
DECIMAL type 47
declarative language 131
declarative programming 210, 231
delete command 68
delete operation 109
delete trigger type 102
denial of service attacks 245–246
dictionary 64
digest authentication 238
digital signatures 244–245
dimension tables 54
directory services 57
dirty read 43
discretionary access control. See DACL
distinct keys 69
distributed filesystems 179–180
distributed revision control systems. See DRCs
distributed storage system 11
distributed stores 92–93
distributing data, using hash rings 144
distribution models, for big data 137–138
DNA sequences 158
DNS (Domain Name System) 57
DocBook
    resources for 171
    searching in 162
    searching technical documents 166–168
document nodes 12
document store
    agility and 199–200
    APIs 89
    application collections 88
    collections in 88
    Couchbase case study 187–188
    CouchDB example 91
    data store types 6
    defined 63
    MongoDB example 90–91
    overview 86–87
documentation of architecture trade-off analysis 261
documents
    protecting using MarkLogic RBAC model 250–251
    queries on 108
    searching technical
        overview 166–167
        retaining document structure in document store 167–168
    structure, and searching 161–162
    updating using XQuery 109–110
Domain Name System. See DNS
domain-specific languages. See DSL
DOUBLE type 47
downtime
    and Amazon Dynamo 11
    high availability 174
DRAM, resources 33
DRCs (distributed revision control systems) 59
DSL (domain-specific languages) 168–169
Dublin Core properties 157
durability, ACID principles 26
DW (data warehouse) applications 52, 235–236
DynamoDB 6, 92
    data stores 6
    overview 11–12
    resources for 95

E

EACH_QUORUM consistency level 185
effort level 261
Elastic Compute Cloud. See Amazon EC2
ElasticSearch 154, 164
elevated security 251
EMC XForms 204
encryption 243–245
END TRANSACTION statements 25, 46
enterprise resource planning. See ERP
entity extraction
    defined 77, 160
    resources for 171
ENUM type 47
Ericsson 222
Erlang 91
    case study 226–229
    overview 222
ERP (enterprise resource planning) 38
ERROR level 147
ETags 103
ETL (extract, transform, and load) 53, 139, 183
Evaluating Software Architectures 254, 274
event log data
    big data uses 130
    case study
        challenges of 147–148
        gathering distributed event data 148–149
eventual consistency, BASE principles 27
eventually consistent reads 183
eXist
    -db 6
    native XML database 123, 241
EXPath
    defined 115
    resources for 95, 123
    standards 112
experience bias on teams 259
expressivity, and linear scaling 133–134
extending XQuery 115
Extensible Stylesheet Language Transformations. See XSLT
extract, transform, and load. See ETL

F

F# 223
faceted search
    defined 157
    resources for 171
fact tables 54
failback 177
failed login attempts 243
failure detection 228
failure of databases 172
FATAL level 147
fault identification 228
Federal Information Processing Standard. See FIPS
federated search
    defined 146
    resources 153
financial derivatives case study
    business benefits 121–122
    project results 122
    RDBMS system is disadvantage 119
    switching from multiple RDBMSs to one native XML system 119–121
FIPS (Federal Information Processing Standard) 244
five nines 174, 190
fixed data definitions 43
FLOAT type 47
FLWOR (for, let, where, order, and return) statement 107
F-measure metric 162
for loops 213
for, let, where, order, and return statement. See FLWOR statement
Foreign Relations of the United States documents. See FRUS
formats for NoSQL systems 4
forms, complex business 201–202
four nines 174
four-translation web-object-RDBMS model 199
FRUS (Foreign Relations of the United States) documents 116
full-text searches
    defined 157
    overview 155–156
    resources 123
    using XQuery 110
functional programming
    defined 210
    Erlang case study 226–229
    history of 215
    languages 222–223
    parallel transformation and 213–216
    referential transparency 217–218
    resources for 231
    transitioning to
        cognitive style 225
        concurrency 226
        removing loops and conditionals 224–225
        unit testing 225–226
        using functions as parameters 223
        using immutable variables 224
        using recursion 224
    using NetKernel to optimize web content
        assembling content 219–220
        optimizing component regeneration 220–222
    vs. imperative programming
        overview 211–213
        resources for 231
        scalability 216–217
functions, using as parameters 223
FunctX library 115
fuzzy search logic 156, 158

G

game data 130
Geographic Information System. See GIS
geographic search 157
get command 68
GIS (Geographic Information System) 83
Git 58–59
golden thread pattern 220
Google
    Bigtable 11
    influence on NoSQL 155
    MapReduce 10–11
Google Analytics 85
Google Earth 83
Google Maps case study 85–86
governmental regulations 232
graph store
    data store types 6
    defined 63
    link analysis using 76–77
    linking external data with RDF standard 74–75
    overview 72–74
    processing public datasets 79–81
    rules for 78–79
grouping documents 103–105

H

Hadoop 10
    resources for 153
    reverse indexes and 165
    UNIX permission model 241
Hadoop Distributed File System. See HDFS
hard disk drives. See HDDs
harvest metric 177
hash rings 144
hash trees
    in revision control systems 58–60
    resources for 61
hashes
    collisions, defined 23
    defined 23
    for SQL query 9
HBase 140
    as Bigtable store 85
    on Windows 181
HDDs (hard disk drives) 21–22
HDFS (Hadoop Distributed File System) 85, 130, 139
    high-availability strategies 180–182
    locations of data stored 141
    resources 153
health care fraud discovery case study
    detecting fraud using graphs 151–152
    overview 150
Health Information Privacy Accountability Act. See HIPAA
Health Information Technology for Economic and Clinical Health Act. See HITECH Act
high availability
    Apache Cassandra case study
        keyspace 185–186
        overview 184–185
        partitioner 185
        rowkey 185
    CAP theorem 30
    Couchbase case study 187–188
    defined 173–174
    measuring
        Amazon S3 SLA case study 176
        overview 174–175
        predicting system availability 176–177
    resources for 190
    strategies
        using Amazon DynamoDB 182–184
        using distributed filesystems 179–180
        using HDFS 180–182
        using load balancers 178
        using managed NoSQL services 182
    systems for 57–58
    See also availability
HIPAA (Health Information Privacy Accountability Act) 232, 246, 253
HITECH Act (Health Information Technology for Economic and Clinical Health Act) 232, 253
HIVE language 222
horizontal scaling
    and programming languages 209–210
    architecture trade-off analysis 264
    defined 7
    with sharding 28–30
HTML (Hypertext Markup Language) 100
Hypertable 6

I

IAM (Identity and Access Management) 247
IBM Workplace 204
IDE (integrated development environment) 197
idempotent transforms 217, 231
Identity and Access Management (IAM). See IAM
-ilities 264
image processing 129
immutable data streams 147
immutable variables 224
imperative programming
    transitioning to functional programming
        cognitive style 225
        concurrency 226
        removing loops and conditionals 224–225
        unit testing 225–226
        using functions as parameters 223
        using immutable variables 224
        using recursion 224
    vs. functional programming
        overview 211–213
        resources for 231
        scalability 216–217
indexes
    creating reverse indexes using MapReduce 164–165
    in-node vs. remote 163–164
    n-gram 158
    range index, defined 158
    reverse indexes, defined 159
InfiniteGraph 6
INFO level 147
injection attacks 245–246
in-node indexes
    reverse indexes and 164
    vs. remote search services 163–164
insert operation 109
INSERT statements 40
insert trigger type 102
INT type 47
INTEGER type 47
integrated development environment. See IDE
isolation
    ACID principles 26
    in Erlang 228
    resources for 61

J

Java
    and NoSQL 209
    history of functional programming 215
Java Hibernate 8
JMeter 177, 190
joins
    absence of 4
    in RDBMSs 42
    using with row-store pattern 43–45
JSON (JavaScript Object Notation) 89
    search for documents 156
    uses for 98
    vs. CSV 98
    vs. XML 98
    XQuery and 206
JSONiq 109, 123, 206

K

Katz, Damien 91
Kerberos protocol authentication 239
key word in context. See KWIC
key-data stores 65
keys, in column family stores 82–83
keyspace management
    Apache Cassandra 185–186
    defined 144
key-value store
    Amazon S3 71–72
    benefits of
        lower operational costs 67–68
key-value store (continued)
    benefits of (continued)
        portability 67–68
        precision service levels 66
        precision service monitoring and notification 67
        reliability 67
        scalability 67
    data store types 6
    defined 63–65
    storing web pages in 70–71
    using 68–70
keywords
    density of 159
    finding documents closest to 157
    in context 160
Kuhn, Thomas 6
KWIC (key word in context) 160

L

lambda calculus 215, 231
languages
    for MapReduce 231
    functional programming 222–223
last logins 243
last updates 243
LDAP (Lightweight Directory Access Protocol) 238, 253
leading wildcards 160
linear scaling
    and expressivity 133–134
    architecture trade-off analysis 264
    defined 5
    overview 132–133
link metadata 75
linked open data. See LOD
LinkedIn 76
LISP 215, 222
    system 10
live software updates in Erlang 228
LiveJournal, Memcache 9
load balancers
    high-availability strategies 178
    pool of 178
loader, defined 212
loading data, into native XML databases 101–102
local deployment, agility and 195–196
LOCAL_QUORUM consistency level 184
locking
    in ACID principles 26
    using with native XML databases 103
LOD (linked open data)
    defined 79
    resources for 95
log data
    big data uses 130
    error levels for 153
    using database triggers 102
Log4j system 147
Log4jAppender 149
logging 242–243
LONGBLOB type 47
loops
    functional programming vs. imperative programming 213
    transitioning to functional programming 224–225

M

maintainability 264
managed NoSQL services 182
map function 10, 65, 81
MapReduce
    and security policy 234
    creating reverse indexes using 164–165
    languages for 231
    overview 10–11
MarkLogic 6
    case study using 119
    data stores 6
    overview 12
    RBAC model
        advantages of 252
        overview 250
        protecting documents 250–251
        secure publishing using 251–252
    resources 123
    security resources 253
mashups, defined 72
master-slave distribution model 137–138
Mathematica 222
MathML 167
MD5 algorithm
    digest authentication 238
    for hashes 23
    resources 33
MDX language 54, 56
measuring availability
    agility 196–199
    Amazon S3 SLA case study 176
    defined 54
    overview 174–175
    predicting system availability 176–177
Medicare fraud 151
MEDIUMBLOB type 47
MEDIUMINT type 47
Memcache 92
    data stores 6
    overview 9
memory cache 22
Mercurial 59
Merkle trees 59
message passing 227
message store 50
metadata, for XML files 101
mirrored databases 49
missing updates problem 103
misspelled words, in search 161
mixed content documents 96
Mnesia database 228
mobile phone data 130
model
    for security
        overview 233–234
        using data warehouses 235–236
        using services 235
    for XRX 202
modifiability 266
Mondrian 56
MongoDB
    10gen 90, 194
    agility in 206
    case study using 90–91
    data stores 6
    productivity study 194
    resources for 95
monthly SLA commitment 176
multifactor authentication 239
Murphy's Law, and ACID principles 27
mutable variables 224
MVCC 91

N

NameNode service 181
Namespace-based Validation Dispatching Language. See NVDL
namespaces
    defined 98
    disadvantages of 99
    purpose of 98
National Institute of Standards and Technology. See NIST
native XML databases
    collection hierarchies
        customizing update behavior with database triggers 102
        defined 102
        grouping documents by access permissions 103–105
        storing documents of different types 103
    defined 97–100
    financial derivatives case study
        business benefits 121–122
        project results 122
        RDBMS system is disadvantage 119
        switching from multiple RDBMSs to one native XML system 119–121
    flexibility of 103
    loading data using drag-and-drop 101–102
    Office of the Historian at Department of State case study 115–119
    validating data structure
        using Schematron to check documents 113–115
        XML Schema 112–113
    XML standards 110–112
    XPath query expressions 105–106
    XQuery
        advantages of 108
        extending with custom modules 115
        flexibility of 106–107
        full-text searches 110
        functional programming language 107
        overview 106
        updating documents 109–110
        web standards and 107–108
natural language processing. See NLP
Neo4j
    data stores 6
    high availability 190
    resources for 95
Netflix 133
NetKernel
    assembling content 219–220
    optimizing component regeneration 220–222
network search 157
NetworkTopologyStrategy value 186
n-gram search 158
NHibernate 8
NIST (National Institute of Standards and Technology) 244
NLP (natural language processing) 77
nodes
    defined 21
    in graph store 72
node objects 59
NonStop system 173, 191
NoSQL
    business drivers
        agility 8
        variability 7
        velocity 7
        volume 7
    case studies
        Amazon Dynamo 11–12
        comparison 8–13
        Google Bigtable 11
        Google MapReduce 10–11
        LiveJournal Memcache 9
        MarkLogic 12
    concepts
        application tiers 17–21
        CAP theorem 30–32
        consistent hashing to keep cache current 22–24
        horizontal scalability with sharding 28–30
        simplicity of components 15–17
        strategic use of RAM, SSD, and disk 21–22
        transactions using ACID 25–27
        transactions using BASE 27–28
    defined 4–6
    pros and cons 20
    query language vs. RDBMS 18
    search types 156–158
NUMERIC type 47
NVDL (Namespace-based Validation Dispatching Language) 112

O

object-relational mapping 97, 194–195
objects, defined 212
ODS (operational data store) 120
Office of the Historian at Department of State case study 115–119
OLAP (online analytical processing)
    ad hoc reporting using aggregates 55–57
    and in-database security 235–236
    concepts of 54–55
    data flow from operational systems to analytical systems 52–53
    vs. OLTP 51–52
OLTP (online transaction processing) 51–52
ON COLUMNS statement 54
ON ROWS statement 54
ONE consistency level 184
open source software 265
OpenOffice 3 204
operational costs 67–68, 122
operational data store. See ODS
operational source systems 52
operational systems 52–53
OR operations 249
Orbeon 204
OTP module collection 227–228
outside consultants 259
over-the-counter derivative contracts 119
owner, UNIX permission model 241
oXygen XML editor 101, 113

P

processors
    multiple, and joins 45
    support for 5
productivity study 194
Prolog 229
proof-of-architecture pilot project. See POA pilot project

R

RDBMS (relational database management system)
    access control 48–49
    Boolean search in 156
    data definition language 47
    defined 4
replace operation 109
replication
    defined 30
    in HDFS 180
    in RDBMSs 49–51
    resources for 61
    using to scale reads 145–146
    vs. sharding 50
request for proposal. See RFP
Resource Description Framework. See RDF
resource keys 94
resource, UNIX permission model 241
resource-oriented computing. See ROC
REST (Representational State Transfer)
    and Amazon S3 71
    defined 69
    in XRX 201
reusing data formats 97
reverse indexes
    creating using MapReduce 164–165
    defined 159
    in-node indexes and 164
revision control systems 58–60
RFP (request for proposal) 257
Riak
    Amazon DynamoDB and 182
    data stores 6
    Erlang 222
RIPEMD-160 algorithm 23
risk management 122, 270
ROC (resource-oriented computing) 219
role hierarchy 250
role-based access control. See RBAC
role-based contextualization 201
roles, authorization using 242
rollback operations 46
rowkey 185
rows 39
row-store pattern
    evolving with organization 41–42
    example using joins 43–45
    overview 39–41
    pros and cons 42–43
Ruby 209
Ruby on Rails Web Application Framework 200, 206
rules, for graph stores 78–79

S

SAN (storage area network) 136, 181
SASL (Standard Authentication and Security Layer) 228, 239
scaling
    architecture trade-off analysis 264
    availability 133
    functional programming and 216–217
    imperative programming and 216–217
    independent transformations 132
    of column family stores 84
    of key-value stores 67
    reads 132
    totals 132
    writes 132
Scheme 215
schemaless, defined 112
Schematron
    checking documents using 113–115
    standards 111
SDLC (software development lifecycle) 170, 193
searchability 265
searching
    Boolean search 155–156
    boosting rankings 162
    document structure and 161–162
    domain-specific languages 168–169
    effectivity of 158–161
    entity extraction 160
    full-text keyword search 155–156
    indexes
        creating reverse indexes using MapReduce 164–165
        in-node vs. remote 163–164
    keyword density 159
    KWIC 160
    measuring quality 162–163
    misspelled words 161
    NoSQL search types 156–158
    overview 155
    proximity search 160
    ranking 159
    resources for 171
    stemming 160
    structured search 155–156
    synonym expansion 160
    technical documents
        overview 166–167
        retaining document structure in document store 167–168
    using XQuery 110
    wildcards in 160
secure shell. See SSH
Secure Sockets Layer. See SSL
security
    Apache Accumulo key visibility 249–250
    application-level vs. database-level 236–237
    architecture trade-off analysis 265
    auditing 242–243
    authentication
        basic access authentication 238
        digest authentication 238
        Kerberos protocol authentication 239
        multifactor authentication 239
        overview 237–238
        public key authentication 239
        SASL 239
    authorization
        overview 240–241
        UNIX permission model 241–242
        using roles 242
    denial of service attacks 245–246
    digital signatures 244–245
    encryption 243–245
    in Amazon S3
        Access-control lists 247
        bucket policies 248–249
        Identity and Access Management 247
        overview 246–247
    injection attacks 245–246
    logging 242–243
    MarkLogic RBAC model
        advantages of 252
        overview 250
        protecting documents 250–251
        secure publishing using 251–252
security (continued)
    model for
        overview 233–234
        using data warehouses 235–236
        using services 235
    resources 253
    XML signatures 244–245
SEI (Software Engineering Institute) 262
Semantic Web Stack 78
semi-structured data
    defined 155
    search for 156
semi-structured search 157
separation of concerns 18
serial processing 213
service level agreement. See SLA
service levels 66
service monitoring 67
services, security model 235
SET type 47
SPARQL
    query language 79
    resources for 95
sparse matrix 83
split partition 32
SQL databases. See RDBMS
SSDs (solid state drives)
    Amazon DynamoDB and 183
    customizing stores 92
    resources 33
    strategic use of 21–22
SSH (secure shell) 239
SSL (Secure Sockets Layer) 238
SSO (single sign-on) 238
stakeholders, communicating with
    example of 269–270
    overview 267
    project risks 270
    using quality trees as navigational maps 267–269
Standard Authentication and Security Layer. See SASL

T

Tandem Computers 173, 191
teams for architecture trade-off analysis
    experience bias 259
    selecting 258–259
    using outside consultants 259
technical documents, searching
    overview 166–167
    retaining document structure in document store 167–168
TEI (Text Encoding Initiative) 116, 123
TEXT type 47
THREE consistency level 184
three nines 174
TIMESTAMP type 47
TINYBLOB type 47
TINYINT type 47
U

unit testing 225–226
UNIX permission model 103, 241–242
UNIX pipes 16
unstructured data 156
update trigger type 102
upper level ontology 83
Urika appliance 151
URIs (uniform resource identifiers) 219
    defined 74
    vs. URLs 75
URLs (uniform resource locators)
    defined 70
    vs. URIs 75
UTF-8 encoding 244

V

validating
    complex business forms 201
    XML standards 110–112
    using Schematron to check documents 113–115
    XML Schema 112–113
VARCHAR type 47
variability business driver 7
vBucket 188
vector search 157–158
velocity business driver 7
version control 58
views
    for XRX 202
    in RDBMSs 48
virtual bucket. See vBucket
Visibility field, Apache Accumulo 249
volume business driver 7

W

W3C (World Wide Web Consortium) 68, 106, 244
WARNING level 147
weak consistency 184
web pages
    big data uses 130
    storing in key-value stores 70–71
WebDAV protocol 102
wildcards 160
World Wide Web Consortium. See W3C

X

xar files 88
XDCR (cross data center replication) 187–188
XForms
    defined 201
    overview 204
    standards 111
XML (Extensible Markup Language)
    as column type in RDBMS 100
    disadvantage of 100
    searching documents 156
    uses for 98
    validating
        using Schematron to check documents 113–115
        XML Schema 112–113
    vs. CSV 98
    vs. HTML 100
    vs. JSON 98
    See also native XML databases
XML Schema 112–113
    resources 123
    standards 111
XML signatures 244–245
XMLA (XML for Analysis) 56
XML-RPC interface 102
XMLSH
    defined 17
    resources 33
XPath
    overview 105–106
    standards 111
XProc 107
    defined 17
    resources 33
    standards 111
XQuery 68
    advantages of 108
    extending with custom modules 115
    flexibility of 106–107
    full-text searches 110
    functional programming language 107
    in XRX 201
    JSON and 206
    overview 106
    resources for 123
    SQL and 107
    standards 111
    updating documents 109–110
    web standards and 107–108
XRL files 220
XRX web application architecture
    agility and 205
    complex business forms 201–202
    replacing JavaScript 202–205
    standards represented by 201
xs:
    as prefix 69
    positiveInteger data type 113
XSL-FO standards 112
XSLT (Extensible Stylesheet Language Transformations) 222
XSLTForms 204
XSPARQL 109, 123

Y

Yahoo! 11, 155
YarcData 151, 153

Z

ZERO consistency level 184
zero-translation architecture 203