
Database Reliability Engineering
Designing and Operating Resilient Database Systems

Laine Campbell and Charity Majors

Database Reliability Engineering, by Laine Campbell and Charity Majors. Copyright © 2018 Laine Campbell and Charity Majors. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editors: Courtney Allen and Virginia Wilson
Production Editor: Melanie Yarbrough
Copyeditor: Bob Russell, Octal Publishing, Inc.
Proofreader: Matthew Burgoyne
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

November 2017: First Edition

Revision History for the First Edition 2017-10-26: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491925942 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Database Reliability Engineering, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-92594-2
[LSI]

Table of Contents

Foreword
Preface

1. Introducing Database Reliability Engineering
    Guiding Principles of the DBRE
    Protect the Data
    Self-Service for Scale
    Elimination of Toil
    Databases Are Not Special Snowflakes
    Eliminate the Barriers Between Software and Operations
    Operations Core Overview
    Hierarchy of Needs
    Survival and Safety
    Love and Belonging
    Esteem
    Self-actualization
    Wrapping Up

2. Service-Level Management
    Why Do I Need Service-Level Objectives?
    Service-Level Indicators
    Latency
    Availability
    Throughput
    Durability
    Cost or Efficiency
    Defining Service Objectives
    Latency Indicators
    Availability Indicators
    Throughput Indicators
    Monitoring and Reporting on SLOs
    Monitoring Availability
    Monitoring Latency
    Monitoring Throughput
    Monitoring Cost and Efficiency
    Wrapping Up

3. Risk Management
    Risk Considerations
    Unknown Factors and Complexity
    Availability of Resources
    Human Factors
    Group Factors
    What Do We Do?
    What Not to Do
    A Working Process: Bootstrapping
    Service Risk
    Architectural Inventory
    Prioritization
    Control and Decision Making
    Ongoing Iterations
    Wrapping Up

4. Operational Visibility
    The New Rules of Operational Visibility
    Treat OpViz Systems Like BI Systems
    Distributed Ephemeral Environments Trending to the Norm
    Store at High Resolutions for Key Metrics
    Keep Your Architecture Simple
    An OpViz Framework
    Data In
    Telemetry/Metrics
    Events
    Logs
    Data Out
    Bootstrapping Your Monitoring
    Is the Data Safe?
    Is the Service Up?
    Are the Consumers in Pain?
    Instrumenting the Application
    Distributed Tracing
    Events and Logs
    Instrumenting the Server or Instance
    Events and Logs
    Instrumenting the Datastore
    Datastore Connection Layer
    Utilization
    Saturation
    Errors
    Internal Database Visibility
    Throughput and Latency Metrics
    Commits, Redo, and Journaling
    Replication State
    Memory Structures
    Locking and Concurrency
    Database Objects
    Database Queries
    Database Asserts and Events
    Wrapping Up

5. Infrastructure Engineering
    Hosts
    Physical Servers
    Operating System and Kernel
    Storage Area Networks
    Benefits of Physical Servers
    Cons of Physical Servers
    Virtualization
    Hypervisor
    Concurrency
    Storage
    Use Cases
    Containers
    Database as a Service
    Challenges of DBaaS
    The DBRE and the DBaaS
    Wrapping Up

6. Infrastructure Management
    Version Control
    Configuration Definition
    Building from Configuration
    Maintaining Configuration
    Enforcement of Configuration Definitions
    Infrastructure Definition and Orchestration
    Monolithic Infrastructure Definitions
    Separating Vertically
    Separated Tiers (Horizontal Definitions)
    Testing and Compliance
    Service Catalog
    Bringing It All Together
    Development Environments
    Wrapping Up

7. Backup and Recovery
    Core Concepts
    Physical versus Logical
    Online versus Offline
    Full, Incremental, and Differential
    Considerations for Recovery
    Recovery Scenarios
    Planned Recovery Scenarios
    Unplanned Scenarios
    Scenario Scope
    Scenario Impact
    Anatomy of a Recovery Strategy
    Building Block 1: Detection
    Building Block 2: Tiered Storage
    Building Block 3: A Varied Toolbox
    Building Block 4: Testing
    A Recovery Strategy Defined
    Online, Fast Storage with Full and Incremental Backups
    Online, Slow Storage with Full and Incremental Backups
    Offline Storage
    Object Storage
    Wrapping Up

8. Release Management
    Education and Collaboration
    Become a Funnel
    Foster Conversations
    Domain-Specific Knowledge
    Collaboration
    Integration
    Prerequisites
    Testing
    Test-Friendly Development Practices
    Post-Commit Testing
    Full Dataset Testing
    Downstream Tests
    Operational Tests
    Deployment
    Migrations and Versioning
    Impact Analysis
    Migration Patterns
    Manual or Automated
    Wrapping Up

9. Security
    The Purpose of Security
    Protecting Data from Theft
    Protecting from Purposeful Damage
    Protecting from Accidental Damage
    Protecting Data from Exposure
    Compliance and Auditing Standards
    Security as a Function
    Education and Collaboration
    Self-Service
    Integration and Testing
    Operational Visibility
    Vulnerabilities and Exploits
    STRIDE
    DREAD
    Basic Precautions
    Denial of Service
    SQL Injection
    Network and Authentication Protocols
    Encryption of Data
    Financial Data
    Personal Health Data
    Private Individual Data
    Military or Government Data
    Confidential/Sensitive Business Data
    Data in Transit
    Data in the Database
    Data in the Filesystem
    Wrapping Up

10. Data Storage, Indexing, and Replication
    Data Storage
    Database Storage
    Sorted-String Tables and Log-Structured Merge Trees
    Indexing
    Logs and Databases
    Data Replication
    Single-Leader
    Multi-Leader Replication
    Wrapping Up

11. Datastore Field Guide
    Conceptual Attributes of a Datastore
    The Data Model
    Transactions
    BASE
    Internal Attributes of a Datastore
    Storage
    The Ubiquitous CAP Theorem Section
    Consistency Latency Trade-offs
    Availability
    Wrapping Up

12. A Data Architecture Sampler
    Architectural Components
    Frontend Datastores
    Data Access Layer
    Database Proxies
    Event and Message Systems
    Caches and Memory Stores
    Data Pipelines
    Lambda and Kappa
    Event Sourcing
    CQRS
    Wrapping Up

13. Making the Case for DBRE
    A Culture of Database Reliability
    Breaking Down Barriers
    Data-Driven Decision Making
    Data Integrity and Recoverability
    Wrapping Up

Index


Foreword

Collectively, we are witnessing a time of unprecedented change and disruption in the database industry. Technology adoption life cycles have accelerated to the point where all of our heads are spinning—with both challenge and opportunity. Architectures are evolving so quickly that the tasks we became accustomed to performing are no longer required, and the related skills we invested in so heavily are barely relevant. Emerging innovations and pressures in security, Infrastructure as Code, and cloud capabilities (such as Infrastructure and Database as a Service) have allowed us—and required us, actually—to rethink how we build. By necessity, we have moved away from our traditional, administrative workloads to a process emphasizing architecture, automation, continuous integration and delivery, and systems instrumentation skills, above all. Meanwhile, the value and importance of the data we’ve been protecting and caring for all along has increased by an order of magnitude or more, and we see no chance of a future in which it doesn’t continue to increase in value. We find ourselves in the fortunate position of being able to make a meaningful, important difference in the world.

Without a doubt, many of us who once considered ourselves outstanding database administrators are at risk of being overwhelmed or even left behind. Simultaneously, newcomers into our field are thirsty for an organizing paradigm. The answer to both circumstances is the same: a commitment to the delight of learning, to self-improvement, to the optimism, enthusiasm, and confidence it takes to take on a task and see it through to conclusion, despite the inevitable travails and pitfalls.

This book is a remarkable achievement. It is an introduction to a new way of thinking about database infrastructure engineering and operations, a guidebook, and a playbook taking all of what we used to do and reimagining it into a new way forward: Database Reliability Engineering.

— Paul Vallée, President and CEO, Pythian


Preface

In this book, we hope to show you a framework for the next iteration of the database professional: the database reliability engineer, or DBRE.

Consider any preconceived notions of what the profession of database administration looks like. Any software or systems engineer who has interacted with these mysterious creatures probably has a lot of these preconceived notions. Traditionally, database administrators (DBAs) understood database (DB) internals thoroughly; they ate, lived, and breathed the optimizer, the query engine, and the tuning and crafting of highly performant, specialized systems. When they needed to pick up other skill sets to make their databases run better, they did. They learned how to distribute load across processing units (CPUs) or disk spindles, how to configure their DB to use CPU affinity, and how to evaluate storage subsystems. When the DBA ran into visibility problems, they learned how to build graphs for the things they identified as key metrics. When they ran into architectural limitations, they learned about caching tiers. When they ran into the limits of individual nodes, they learned (and helped drive the development of) new patterns like sharding. Throughout this, they were mastering new operational techniques, such as cache invalidation, data rebalancing, and rolling DB changes.

But for a long, long time, DBAs were in the business of crafting silos and snowflakes. Their tools were different, their hardware was different, and their languages were different. DBAs were writing SQL, systems engineers were writing Perl, software engineers were writing C++, web developers were writing PHP, and network engineers were crafting their own perfect appliances. Only half of the teams were using version control in any kind of way, and they certainly didn’t talk or step on each other’s turf. How could they? It was like entering a foreign land.

The days in which this model can prove itself to be effective and sustainable are numbered. This book is about reliability engineering as seen through a pair of database engineering glasses. We do not plan on covering everything possible in this book. Instead, we are describing what we do see as important, through the lens of your experience.

This framework can then be applied to multiple datastores, architectures, and organizations.

Why We Wrote This Book

This book has been an evolving dream of ours for about five years. Laine came to the DBA role without any formal technical training. She was neither a software engineer nor a sysadmin; rather, she chose to develop a technical career after leaving music and theater. With this kind of background, the ideas of structure, harmony, counterpoint, and orchestration found in databases called to her.

Since that time, she’s hired, taught, mentored, and worked with probably a hundred different DBAs. Us database folks are a varied bunch. Some came from software backgrounds, others from systems. Some even came from data analyst and business backgrounds. The thing that consistently shone through from the best, however, was a passion and a sense of ownership for the safety and availability of the company’s data. We fulfilled our roles as stewards of data with a fierceness that bordered on unhealthy. But we also functioned as a lynchpin between the software engineers and the systems engineers. Some might say we were the original DevOps, with a foot in each world.

Charity’s background is firmly in operations and startup culture. She has a gloriously sloppy history of bootstrapping infrastructures fast, making quick decisions that can make or break a startup, taking risks, and making difficult choices based on severely limited resources. Mostly successfully, give or take. She is an accidental DBA who loves data. She has always worked on ops teams for which there were no specialized DBAs, so the software engineering and operations engineering teams ended up sharing that work.

Doing this for so long and with varied pasts, we’ve recognized and embraced the trends of the past decade. The life of the DBA has often been one of toil and invisibility. Now we have the tools and the collective buy-in to transform the role to that of first-class citizen and to focus on the highest areas of value that the DBA can bring. With this book, we wanted to help the next generation of engineers have truly happy, productive careers and to continue the impact previous generations had.

Who This Book Is For

This book is written for anyone with an interest in the design, building, and operations of reliable data stores. Perhaps you are a software engineer, looking to broaden your knowledge of databases. You might also be a systems engineer looking to do the same. If you’re a database professional looking to develop your skill set, you will also find value here.

If you are newer to the industry, this should also be able to give you a solid understanding. This book, after all, is a framework.

We assume that you already have a baseline of technical proficiency in Linux/Unix systems administration as well as web and/or cloud architectures. We also assume that you are an engineer on one of two paths. On one path, you have existing depth in another discipline, such as systems administration or software engineering, and are interested in developing your technical breadth to include the database engineering discipline. On the other path, you are early- to mid-career and looking to build your technical depth as a specialist in database engineering.

If you are in management, you can use this book to understand the needs of the datastores that will be underpinning your services. We believe firmly that management needs to understand operational and database principles to increase the likelihood of success for their teams.

You might also be someone without a traditional technical background. Perhaps you are an “accidental DBA” who was a business analyst and learned to run databases by jumping into the deep end of the pool. There are many database professionals who have come to the database world via Excel rather than a development or systems job.

How This Book Is Organized

As we go into this book, we present the information in two sections. The first section is the operations core curriculum. This is a foundation of operations that anyone—database engineer, software engineer, even product owner—should know. After this, we dig into data: modeling, storing, replicating, accessing, and much more. This is also where we discuss architectural choices and data pipelines. It should be thrilling!

There is a reason there is an ops-heavy approach to this narrative: you can’t be a good “DBRE” without being a good “RE.” Which you can’t be without being a plain old good “E.” The modern DBRE specializes in understanding data-specific domain problems on top of those engineering fundamentals. But the point of this is that any engineer can run data services. We now speak the same languages. We use the same repos and the same code review processes.

Caring for databases is an extension of operations engineering—a creamy frosting of special knowledge and awareness atop the cupcake of running systems at scale—just as being an exceptional network engineer also means knowing how to be an engineer first, and then knowing extra things about how to handle traffic, what to be scared of, what the current best practices are, how to evaluate network topology, and so on.

Here is a breakdown of what you can expect in each chapter:

Chapter 1 is an introduction to the concept of database reliability engineering. We start with guiding principles, move on to the operations-centric core, and finally give a framework for building a vision for DBRE based on Maslow’s hierarchy of needs.

In Chapter 2, we start with service-level requirements. These are as important as feature requirements for a product. In this chapter, we discuss what service-level requirements are and how to define them, which is not as easy as it sounds. We then discuss how to measure and work with these requirements over time.

In Chapter 3, we discuss risk assessment and management. After a foundational discussion on risk, we discuss a practical process for incorporating risk assessment into systems and database engineering. Pitfalls and complexities are also brought to attention.

In Chapter 4, we discuss operational visibility. This is where we discuss metrics and events, and how to build a plan for what to start measuring and what to iterate on over time. We dig into the components of monitoring systems and the clients that use them.

We then dive into infrastructure engineering and infrastructure management in Chapters 5 and 6. This is the section where we discuss the principles of building hosts for datastores. We will dive into virtualization and containerization, automation, and orchestration in an attempt to help you understand all the moving parts required to build these systems that store and access data.

Chapter 7 is backup and recovery. This is, perhaps, the most critical thing for the DBRE to master. Losing data is simply game over. Starting from service-level requirements, we evaluate appropriate backup and restore methods, how to scale them, and how to test this critical and oft-overlooked aspect of operations.

Chapter 8 is a discussion on release management. How do we test, build, and deploy changes to data stores? What about changes to data access code and SQL? Deployment, integration, and delivery are the meat of this section.

Chapter 9 is on security. Data security is critical to a company’s survival. Strategies on how to plan for and manage security in ever-evolving data infrastructures are in this chapter.

Chapter 10 is on data storage, indexing, and replication. We will discuss how relational data is stored, and then compare this to sorted-string tables and log-structured merge trees. After reviewing indexing variants, we will explore data replication topologies.

Chapter 11 is our datastore field guide. Here we will discuss a myriad of properties to look for in datastores you will be evaluating or operating. This includes conceptual attributes of great importance to application developers and architects, as well as the internal attributes focused on the physical implementation of the datastores.

In Chapter 12, we look at some of the more common architectural patterns used for distributed databases and the pipelines they are involved with. We start with a look at the architectural components that typically surround the databases in these architectures, along with their benefits, complexities, and general usage. We then explore data pipelines, or at least a few examples.

Finally, in Chapter 13 we discuss how to build a culture of database reliability engineering in your organization. We explore the different ways in which you can transform the role of the DBRE from that of administrator to that of engineer in today’s organizations.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others. For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/database-reliability-engineering.

To comment or ask technical questions about this book, send email to [email protected].

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia


CHAPTER 1 Introducing Database Reliability Engineering

Our goal with this book is to provide the guidance and framework for you, the reader, to grow on the path to being a truly excellent database reliability engineer (DBRE). When naming the book we chose to use the words reliability engineer, rather than administrator.

Ben Treynor, VP of Engineering at Google, says the following about reliability engineering:

    fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.

Today’s database professionals must be engineers, not administrators. We build things. We create things. As engineers practicing devops, we are all in this together, and nothing is someone else’s problem. As engineers, we apply repeatable processes, established knowledge, and expert judgment to design, build, and operate production data stores and the data structures within. As database reliability engineers, we must take the operational principles and the depth of database expertise that we possess one step further.

If you look at the non-storage components of today’s infrastructures, you will see systems that are easily built, run, and destroyed via programmatic and often automatic means. The lifetimes of these components can be measured in days, and sometimes even hours or minutes. When one goes away, there are any number of others to step in and keep the quality of service at expected levels.

Our next goal is that you gain a framework of principles and practices for the design, building, and operating of data stores within the paradigms of reliability engineering and devops cultures.

You can take this knowledge and apply it to any database technology or environment that you are asked to work in at any stage in your organization’s growth.

Guiding Principles of the DBRE

As we sat down to write this book, one of the first questions we asked ourselves was what the principles underlying this new iteration of the database profession were. If we were redefining the way people approached data store design and management, we needed to define the foundations for the behaviors we were espousing.

Protect the Data

Traditionally, protecting data has always been a foundational principle of the database professional, and it still is. The generally accepted approach has been attempted via:

• A strict separation of duties between the software engineer and the database engineer
• Rigorous backup and recovery processes, regularly tested
• Well-regulated security procedures, regularly audited
• Expensive database software with strong durability guarantees
• Underlying expensive storage with redundancy of all components
• Extensive controls on changes and administrative tasks

In teams with collaborative cultures, the strict separation of duties can become not only burdensome, but also restrictive of innovation and velocity. In Chapter 8, Release Management, we will discuss ways to create safety nets and reduce the need for separation of duties. Additionally, these environments focus more on testing, automation, and impact mitigation than on extensive change controls.

More often than ever, architects and engineers are choosing open source datastores that cannot guarantee durability the way that something like Oracle might have in the past. Sometimes, that relaxed durability gives needed performance benefits to a team looking to scale quickly. Choosing the right datastore, and understanding the impacts of those choices, is something we look at in Chapter 11. Recognizing that there are multiple tools based on the data you are managing, and choosing effectively, is rapidly becoming the norm.

Underlying storage has also undergone significant change. In a world where systems are often virtualized, network and ephemeral storage is finding a place in production datastores. We will discuss this further in Chapter 5.

Production Datastores on Ephemeral Storage

In 2013, Pinterest moved their MySQL database instances to run on ephemeral storage in Amazon Web Services (AWS). Ephemeral storage effectively means that if the compute instance fails or is shut down, anything stored on disk is lost. Pinterest chose the ephemeral storage option because of consistent throughput and low latency.

Doing this required substantial investment in automated and rock-solid backup and recovery, as well as application engineering to tolerate the disappearance of a cluster while rebuilding nodes. Ephemeral storage did not allow snapshots, which meant that the restore approach was full database copies over the network rather than attaching a snapshot in preparation for rolling forward of the transaction logs.

This shows that you can maintain data safety in ephemeral environments with the right processes and the right tools!

The new approach to data protection might look more like this:

• Responsibility for the data shared by cross-functional teams.
• Standardized and automated backup and recovery processes blessed by DBRE.
• Standardized security policies and procedures blessed by DBRE and Security teams.
• All policies enforced via automated provisioning and deployment.
• Data requirements dictate the datastore, with evaluation of durability needs becoming part of the decision-making process.
• Reliance on automated processes, redundancy, and well-practiced procedures rather than expensive, complicated hardware.
• Changes incorporated into deployment and infrastructure automation, with a focus on testing, fallback, and impact mitigation.

Self-Service for Scale

A talented DBRE is a rarer commodity than a site reliability engineer (SRE) by far. Most companies cannot afford to hire and retain more than one or two. So, we must create the most value possible, which comes from creating self-service platforms for teams to use. By setting standards and providing tools, teams are able to deploy new services and make appropriate changes at the required pace without serializing on an overworked database engineer.

Examples of these kinds of self-service methods include:

• Ensuring the appropriate metrics are being collected from data stores by providing the correct plug-ins.
• Building backup and recovery utilities that can be deployed for new data stores.
• Defining reference architectures and configurations for data stores that are approved for operations, and can be deployed by teams.
• Working with Security to define standards for data store deployments.
• Building safe deployment methods and test scripts for database changesets to be applied.

In other words, the effective DBRE functions by empowering others and guiding them, not functioning as a gatekeeper.

Elimination of Toil

The Google SRE teams often use the phrase “elimination of toil,” which is discussed in Chapter 5 of the Google SRE book. In the book, “toil” is defined as:

    Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

Effective use of automation and standardization is necessary to ensure that DBREs are not overburdened by toil. Throughout this book, we will be bringing up examples of DBRE-specific toil and approaches to mitigating it.

That being said, the word “toil” is still vague, with lots of preconceptions that vary from person to person. When we discuss toil in this book, we are specifically talking about manual work that is repetitive, non-creative, and non-challenging.

Manual Database Changes

In many customer environments, database engineers are asked to review and apply DB changes, which can include modifications to tables or indexes; the addition, modification, or removal of data; or any number of other tasks. Everyone feels reassured that the DBA is applying these changes and monitoring the impact of the changes in real time.

At one customer site, the rate of change was quite high, and those changes were often impactful. We ended up spending about 20 hours a week applying rolling changes throughout the environment. Needless to say, the poor DBA who spent half of their week running these repetitive tasks became jaded and ended up quitting.

Faced with a lack of resources, management finally allowed the DB team to build a rolling schema change automation utility that software engineers could use once the changeset had been reviewed and approved by one of the database engineers. Soon, everyone trusted the tool and monitoring to introduce change, paving the way for the DBRE team to focus more time on integrating these processes with the deployment stack.

Databases Are Not Special Snowflakes

Our systems are no more or less important than any other components serving the needs of the business. We must strive for standardization, automation, and resilience. Critical to this is the idea that the components of database clusters are not sacred. We should be able to lose any component and efficiently replace it without worry. Fragile data stores in glass rooms are a thing of the past.

The metaphor of pets versus cattle is often used to show the difference between a special snowflake and a commodity service component. Original attribution goes to Bill Baker, Microsoft Distinguished Engineer. A pet server is one that you feed, care for, and nurture back to health when it is sick. It also has a name. At Travelocity in 2000, our servers were Simpsons characters, and our two SGI servers running Oracle were named Patty and Selma. I spent so many hours with those gals on late nights. They were high maintenance!

Cattle servers have numbers, not names. You don’t spend time customizing servers, much less logging on to individual hosts. When they show signs of sickness, you cull them from the herd. You should, of course, keep those culled cattle around for forensics, if you are seeing unusual amounts of sickness. But, we’ll refrain from mangling this metaphor any further.

Data stores are some of the last holdouts of “pethood.” After all, they hold “The Data,” and simply cannot be treated as replaceable cattle with short lifespans and complete standardization. What about the special replication rules for our reporting replica? What about the different config for the primary’s redundant standby?

Eliminate the Barriers Between Software and Operations

Your infrastructure, configurations, data models, and scripts are all part of software. Study and participate in the lifecycle as any engineer would. Code, test, integrate, build, test, and deploy. Did we mention test?

This might be the hardest paradigm shift for someone coming from an operations and scripting background. There can be an organizational impedance mismatch in the way software engineers navigate an organization and the systems and services built to meet that organization’s needs. Software engineering organizations have very defined approaches to developing, testing, and deploying features and applications.

In a traditional environment, the underlying process of designing, building, testing, and pushing infrastructure and related services to production was divided among software engineering (SWE), systems engineering (SE), and DBA functions. The paradigm shifts discussed previously are pushing for removal of this impedance mismatch, which means DBREs and systems engineers find themselves needing to use similar methodologies to do their jobs.

Software Engineers Must Learn Operations!

Too often, operations folks are told to “learn to code or go home.” While I do agree with this, the reverse must be true as well. Software engineers who are not being pushed and led to learn operations and infrastructure principles and practices will create fragile, non-performant, and potentially insecure code. The impedance mismatch only goes away if all teams are brought to the same table!

DBREs might also find themselves embedded directly in a software engineering team, working in the same code base, examining how code is interacting with the data stores, and modifying code for performance, functionality, and reliability. The removal of this organizational impedance creates an improvement in reliability, performance, and velocity an order of magnitude greater than traditional models, and DBREs must adapt to these new processes, cultures, and tooling.

Operations Core Overview

One of the core competencies of the DBRE is operations. These are the building blocks for designing, testing, building, and operating any system with scale and reliability requirements that are not trivial. This means that if you want to be a database engineer, you need to know these things.

Operations at a macro level is not a role. Operations is the combined sum of all of the skills, knowledge, and values that your company has built up around the practice of shipping and maintaining quality systems and software. It’s your implicit values as well as your explicit values, habits, tribal knowledge, and reward systems. Everybody, from tech support to product people to the CEO, participates in your operational outcomes.

Too often, this is not done well. So many companies have an abysmal ops culture that burns out whoever gets close to it. This can give the discipline a bad reputation, which many folks think of when they think of operations jobs, whether in systems, database, or network. Despite this, your ops culture is an emergent property of how your org executes on its technical mission. So if you go and tell us that your company doesn’t do any ops, we just won’t buy it.

Perhaps you are a software engineer or a proponent of infrastructure and platforms as a service. Perhaps you are dubious that operations is a necessity for the intrepid database engineer. The idea that serverless computing models will liberate software engineers from needing to think or care about operational impact is flat-out wrong. It is actually the exact opposite. It’s a brave new world where you have no embedded operations teams—where the people doing operations engineering for you are Google SREs and AWS systems engineers and PagerDuty and DataDog and so on. This is a world where application engineers need to be much better at operations, architecture, and performance than they currently are.

Hierarchy of Needs

Some of you will be coming at this book with experience in enterprises and some in startups. As we approach and consider systems, it is worth thinking about what you would do on day one of taking on the responsibility of operating a database system. Do you have backups? Do they work? Are you sure? Is there a replica you can fail over to? Do you know how to do that? Is it on the same power strip, router, hardware, or availability zone as the primary? Will you know if the backups start failing somehow? How?

In other words, we need to talk about a hierarchy of database needs.

For humans, Maslow’s hierarchy of needs is a pyramid of desire that must be satisfied for us to flourish: physiological survival, safety, love and belonging, esteem, and self-actualization. At the base of the pyramid are the most fundamental needs, like survival. Each level roughly proceeds to the next—survival before safety, safety before love and belonging, and so forth. Once the first four levels are satisfied, we reach self-actualization, which is where we can safely explore and play and create and reach the fullest expression of our unique potential. So that’s what it means for humans. Let’s apply this as a metaphor for what databases need.

Survival and Safety

Your database’s most essential needs are backups, replication, and failover. Do you have a database? Is it alive? Can you ping it? Is your application responding? Does it get backed up? Will restores work? How will you know if this stops being true?

Is your data safe? Are there multiple live copies of your data? Do you know how to do a failover? Are your copies distributed across multiple physical availability zones or multiple power strips and racks? Are your backups consistent? Can you restore to a point in time? Will you know if your data gets corrupted? How? Plan on exploring this much more in the backup and recovery section.
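
To make “will you know if the backups start failing?” concrete, here is a minimal sketch, in Python, of the kind of scheduled check a team might run. The backup directory, the freshness threshold, and the exit codes are assumptions for illustration, not a prescription for any particular tool.

    import datetime
    import os
    import sys

    # Hypothetical location and threshold; substitute whatever your
    # environment actually uses.
    BACKUP_DIR = "/var/backups/mysql"              # where full backups land
    MAX_BACKUP_AGE = datetime.timedelta(hours=26)  # daily backups, plus slack

    def newest_backup_age(backup_dir):
        """Return the age of the newest file in backup_dir, or None if there is none."""
        if not os.path.isdir(backup_dir):
            return None
        paths = [os.path.join(backup_dir, name) for name in os.listdir(backup_dir)]
        files = [p for p in paths if os.path.isfile(p)]
        if not files:
            return None
        newest = max(os.path.getmtime(p) for p in files)
        return datetime.datetime.now() - datetime.datetime.fromtimestamp(newest)

    def main():
        age = newest_backup_age(BACKUP_DIR)
        if age is None:
            print("CRITICAL: no backups found at all")
            sys.exit(2)
        if age > MAX_BACKUP_AGE:
            print(f"CRITICAL: newest backup is {age} old")
            sys.exit(2)
        print(f"OK: newest backup is {age} old")
        sys.exit(0)

    if __name__ == "__main__":
        main()

A check like this only proves that a recent backup file exists; proving that it restores correctly requires actually restoring it, which is the subject of the backup and recovery chapter.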

This is also the time when you start preparing for scale. Scaling prematurely is a fool’s errand, but you should consider sharding, growth, and scale now as you determine IDs for key data objects, storage systems, and architecture.

Scaling Patterns

We will discuss scale quite frequently. Scalability is the capability of a system or service to handle increasing amounts of work. This might be actual ability, because everything has been deployed to support the growth, or it might be potential ability, in that the building blocks are in place to handle the addition of components and resources required to scale. There is a general consensus that there are four pathways to scale.

• Scale vertically, via resource allocation (also known as scaling up)
• Scale horizontally, by duplicating the system or service (also known as scaling out)
• Separate workloads into smaller sets of functionality so that each can scale independently (also known as functional partitioning)
• Split a specific workload into partitions that are identical, other than the specific set of data being worked on (also known as sharding; a minimal sketch of this follows the sidebar)

The specific aspects of these patterns will be reviewed in Chapter 5, Infrastructure Engineering.
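
To make the last pattern concrete, here is a minimal sketch of how an application might route a request to a shard. The shard list and the choice of hashing the customer ID are assumptions for the example, not a recommendation for any particular system.

    import hashlib

    # Assumed topology for the example: four identical partitions of the same
    # schema, each holding a disjoint subset of customers.
    SHARDS = [
        "customers-shard-0.db.internal",
        "customers-shard-1.db.internal",
        "customers-shard-2.db.internal",
        "customers-shard-3.db.internal",
    ]

    def shard_for(customer_id):
        """Map a customer ID to the shard that owns its data.

        A stable hash keeps a given customer on the same shard across requests;
        changing the shard count means rebalancing existing data.
        """
        digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("customer-42"))  # always routes to the same host

The caveat in the docstring is the hard part in practice: modulo routing is simple, but adding shards forces data to move, which is why consistent hashing or directory-based lookups are common in real systems.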

Love and Belonging

Love and belonging is about making your data a first-class citizen of your software engineering processes. It’s about breaking down silos between your databases and the rest of your systems. This is both technical and cultural, which is why you could also just call this the “devops needs.” At a high level, it means that managing your databases should look and feel (as much as possible) like managing the rest of your systems. It also means that you culturally encourage fluidity and cross-functionality.

The love and belonging phase is where you slowly stop logging in and performing cowboy commands as root. It is here where you begin to use the same code review and deployment practices. Database infrastructure and provisioning should be part of the same process as all other architectural components. Working with data should feel consistent with all other parts of the application, which should encourage anyone to feel they can engage with and support the database environment.

Resist the urge to instill fear in your developers. It’s quite easy to do and quite tempting because it feels better to feel like you have control. It’s not—and you don’t. It’s much better for everyone if you invest that energy into building guard rails so that it’s harder for anyone to accidentally destroy things.

Educate and empower everyone to own their own changes. Don’t even talk about preventing failure, as that is impossible. In other words, create resilient systems and encourage everyone to work with the datastore as much as possible.

Guardrails at Etsy

Etsy introduced a tool called Schemanator to safely apply database changes, otherwise known as changesets, to their production environments. Multiple guardrails were put in place to empower software engineers to apply these changes. These guardrails included the following (a simplified sketch of one such check appears after this sidebar):

• Changeset heuristic reviews to validate that standards had been followed in schema definitions.
• Changeset testing to validate that the scripts run successfully.
• Preflight checks to show the engineer the current cluster status.
• Rolling changes to run impactful changes on “out of service” databases.
• Breaking workflows into subtasks to allow for canceling out when problems occur that cannot be predicted.

You can read more about this at Etsy’s blog.
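
As an illustration of the preflight idea, here is a minimal, hypothetical sketch of the kind of check such a tool might run before applying a changeset. The thresholds and status fields are invented for the example; Schemanator’s actual behavior is described in the Etsy post.

    # Hypothetical thresholds and status fields, invented for the example.
    MAX_REPLICATION_LAG_SECONDS = 5
    MAX_ACTIVE_CONNECTIONS = 500

    def preflight_ok(cluster_status):
        """Return (ok, problems) given a snapshot of cluster health."""
        problems = []
        if cluster_status["replication_lag_seconds"] > MAX_REPLICATION_LAG_SECONDS:
            problems.append("replicas are lagging; a rolling change would drift further")
        if cluster_status["active_connections"] > MAX_ACTIVE_CONNECTIONS:
            problems.append("connection count is high; a schema change may pile up locks")
        if not cluster_status["recent_backup"]:
            problems.append("no recent backup; refuse to alter schema without one")
        return (not problems, problems)

    ok, problems = preflight_ok(
        {"replication_lag_seconds": 2, "active_connections": 120, "recent_backup": True}
    )
    print("proceed" if ok else "blocked: " + "; ".join(problems))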

Esteem

Esteem is the second-highest level of the pyramid, just below self-actualization. For humans, this means respect and mastery. For databases, this means things like observability, debuggability, introspection, and instrumentation. It’s about being able to understand your storage systems themselves, but also being able to correlate events across the stack.

Again, there are two aspects to this stage: one of them is about how your production services evolve through this phase, and the other is about your humans.

Your services should tell you if they’re up or down or experiencing elevated error rates. You should never have to look at a graph to find this out. As your services mature, the pace of change slows down a bit as your trajectory becomes more predictable. You’re running in production, so you’re learning more every day about your storage system’s weaknesses, behaviors, and failure conditions. This can be compared to the teenage years for data infrastructure. What you need more than anything is visibility into what is going on. The more complex your product is, the more moving pieces there are and the more engineering cycles you need to allocate to developing the tools you need to figure out what’s happening.

You also need knobs. You need the ability to selectively degrade quality of service instead of going completely down, for example (a minimal sketch of such a knob follows this list):

• Flags with which you can set the site into read-only mode
• Disabling certain features
• Queueing writes to be applied later
• The ability to blacklist bad actors or certain endpoints
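
Here is a minimal sketch of what the first knob might look like in application code. The flag store and flag name are assumptions for illustration; in practice the value would come from a runtime configuration service so that it can be flipped without a deploy.

    # Hypothetical flag store; in practice the values would come from a runtime
    # configuration service so they can be flipped without a deploy.
    FLAGS = {"site_read_only": False}

    def save_to_primary(payload):
        # Stand-in for the real write path.
        return "saved " + repr(payload)

    def handle_write(payload):
        """Degrade to read-only instead of failing outright."""
        if FLAGS["site_read_only"]:
            # Reject writes politely while reads keep working.
            return {"status": 503, "body": "Site is temporarily read-only."}
        return {"status": 200, "body": save_to_primary(payload)}

    print(handle_write({"comment": "hello"}))
    FLAGS["site_read_only"] = True   # flipped by an operator during an incident
    print(handle_write({"comment": "hello again"}))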

Your humans have similar but not completely overlapping needs. A common pattern here is that teams will overreact once they get into production. They don’t have enough visibility, so they compensate by monitoring everything and paging themselves too often. It is easy to go from zero graphs to literally hundreds of thousands of graphs—99% of which are completely meaningless. This is not better. It can actually be worse. If it generates so much noise that your humans can’t find the signal and are reduced to tailing log files and guessing again, it’s as bad or worse than not having the graphs. This is where you can start to burn out your humans by interrupting them, waking them up, and training them not to care or act on alerts they do receive.

In the early stages, if you’re expecting everyone to be on call, you need to make that sustainable. When you’re bootstrapping, you have shared on call and you’re pushing people outside of their comfort zones, so give them a little help. Write minimally effective documentation and procedures.

Self-actualization

Just like every person’s best possible self is unique, every organization’s self-actualized storage layer is unique. The platonic ideal of a storage system for Facebook doesn’t look like the perfect system for Pinterest or GitHub, let alone a tiny startup. But just like there are patterns for healthy, self-actualized humans (they don’t throw tantrums in the grocery store, they eat well and exercise), there are patterns for what we can think of as healthy, self-actualized storage systems.

In this context, self-actualization means that your data infrastructure helps you get where you’re trying to go and that your database workflows are not obstacles to progress. Rather, they empower your developers to get work done and help save them from making unnecessary mistakes. Common operational pains and boring failures should remediate themselves and keep the system in a healthy state without needing humans to help. It means you have a scaling story that works for your needs, whether that means 10x’ing every few months or just being rock solid, stable, and dumb for three years before you need to worry about capacity. Frankly, you have a mature data infrastructure when you can spend most of your time thinking about other things. Fun things. Like building new products or anticipating future problems instead of reacting to current ones.

It’s okay to float back and forth between levels over time. The levels are mostly there as a framework to help you think about relative priorities; for example, making sure you have working backups is way more important than writing a script to dynamically re-shard and add more capacity. Or, if you’re still at the point where you have one copy of your data online, or you don’t know how to fail over when your primary dies, you should probably stop whatever you’re doing and figure that out first.

Wrapping Up

The DBRE role is a paradigm shift from an existing, well-known role. More than anything, the framework gives us a new way to approach the functions of managing datastores in a continually changing world. In the upcoming section, we will begin exploring these functions in detail, prioritizing operational functions due to their importance in day-to-day database engineering. With that being said, let’s move bravely forward, intrepid engineer!


CHAPTER 2 Service-Level Management

One of the first steps required to successfully design, build, and deploy a service is to understand the expectations of that service. In this chapter, we define what service-level management is and discuss its components. We then discuss how to define the expectations of a service and how to monitor and report on them to ensure we are meeting those expectations. Throughout the chapter, we also build a robust set of service-level requirements to illustrate this process.

Why Do I Need Service-Level Objectives?

Services that we design and build must have a set of requirements about their runtime characteristics. This is often referred to as a Service-Level Agreement (SLA). An SLA is more than just an enumerated list of requirements, however; SLAs include remedies, impacts, and much more that is beyond the scope of this book. So, we will focus on the term Service-Level Objective (SLO). SLOs are commitments by the architects and operators that guide the design and operation of the system so that it can meet those commitments.

Service-level management is difficult! Condensing it to a chapter is reductive, and it is important to understand the nuances. Let’s take a few examples to illustrate why this problem is difficult:

• Maybe you say, “I’ll just report on the percentage of requests that are successfully served by my API.” Okay...as reported by whom? By the API? That’s obviously a problem, because what if your load balancers are down? Or, what if it returned a 200 error from the database because your service discovery system knew that particular database was unavailable?
• Or, what if you say, “Okay, we’ll use a third-party end-to-end checker and calculate how many of those read and write the correct data?” That’s a great thing to do—end-to-end checks are the best high-level reliability alerts. But is it exercising every backend?

• Do you factor less-important services into your SLO? Your customers would probably rather have a 99.95% availability rate from your API and a 97% availability from your batch processing product than 99.8% from both your API and batch processing.
• How much control do you have over clients? If your API has a 98% availability, but your mobile clients automatically retry and have a 99.99% reliable response rate within three tries, they might never notice. Which is the accurate number?
• Maybe you say, “I’ll just count error percentages,” but what about the errors that are caused by users sending invalid or malformed requests? You can’t actually do anything about that.
• Maybe you return a correct result 99.999% of the time, but 15% of the time the latency is more than 5 seconds. Is that acceptable? Depending on the client behavior, that might actually mean your website is unresponsive for some people. You might technically have five 9’s by your own accounting, and yet your users can be incredibly and justifiably unhappy with you.
• What if the site is 99.999% up for 98% of your users but only 30% to 70% available for the other 2% of your users? How do you calculate that?
• What if one shard or one backend is down or slow? What if you experience 2% data loss due to a bug in an upgrade? What if you experience an entire day of data loss but only for certain tables? What if your users never noticed that data was lost due to the nature of it, but you reported a 2% data loss, alarming everyone and encouraging them to migrate off your stack? What if that 2% data loss actually included rewriting pointers to assets so that even though the data wasn’t “lost,” it “could not be found”?
• What if some users are experiencing 95% availability because they have bad WiFi, old cable internet, or bad router tables between their client and your server? Can they hold you responsible?
• What if it’s from entire countries? Well, then it probably is something that they can blame you for (e.g., packet overruns for DNS UDP packets by some providers—you can fix this).
• What if your number is 99.97%, but every error causes your entire website to fail to load? What if your number is 99.92%, but each page has 1,500 components and users almost never notice when a tiny widget has failed to load? Which experience is better?
• Is it better to count actual error rate or by time slices? By the number of minutes (or seconds) when errors or timeouts exceeded a threshold? (A small sketch contrasting these two approaches follows this list.)
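
The last question deserves a concrete example, because the two most common approaches can give very different answers for the same incident. The numbers below are invented purely to show the mechanics.

    # Invented example: a 30-minute window containing a 5-minute incident.
    total_requests = 10_000
    failed_requests = 830          # all of them during the incident

    request_availability = 100 * (1 - failed_requests / total_requests)

    total_minutes = 30
    bad_minutes = 5                # minutes in which the error rate crossed a threshold
    time_slice_availability = 100 * (1 - bad_minutes / total_minutes)

    print(f"request-based: {request_availability:.2f}%")    # 91.70%
    print(f"time-slice:    {time_slice_availability:.2f}%")  # 83.33%

Neither number is wrong; they answer different questions, which is why the choice belongs in the SLO definition rather than being left to whoever builds the dashboard.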

Five 9’s?

Many people use a number of 9’s as shorthand to describe availability. For instance, a system is designed to have “five 9’s” of availability. This means that it is built to be available 99.999% of the time, whereas “three 9’s” would be 99.9%.
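
Translating a number of 9’s into an allowed-downtime budget is a quick calculation worth internalizing; the sketch below simply applies the percentages to a year.

    SECONDS_PER_YEAR = 365 * 24 * 3600

    for label, target in [("three 9's", 0.999), ("four 9's", 0.9999), ("five 9's", 0.99999)]:
        downtime_minutes = SECONDS_PER_YEAR * (1 - target) / 60
        print(f"{label} ({target:.3%}): about {downtime_minutes:.1f} minutes of downtime per year")

Three 9’s allows roughly 8.8 hours of downtime a year, while five 9’s allows only about 5 minutes, which is why each additional 9 gets dramatically harder and more expensive.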

This is why the practice of designing, curating, and adapting SLO and availability metrics over time is less of a computation problem and more of a social science problem. How do you calculate an availability percentage that accurately reflects the experience of your users, builds trust, and creates incentives in the right direction?

From the perspective of your team, whatever availability metrics you all agree are important to deliver become numbers to be gamed to some extent, even if only subconsciously. Those are the numbers you pay attention to when you’re determining whether your reliability is getting better or worse, or whether you need to switch resources from feature development to reliability, or vice versa.

From the perspective of your customers, the most important thing about the metric is that it reflects their experience as much as possible. If you have the ability to calculate metrics per customer or slice and dice the data along arbitrary dimensions—even high-cardinality ones like UUID—this is incredibly powerful. Facebook’s Scuba does this, and so does Honeycomb.io.

Service-Level Indicators | 15 Latency versus Response Time Vast wars of and blood have been spilled on the topic of latency versus response time. There are some factions that consider latency to be the time it takes to get to the service, whereas response time is the time it takes to service the request. In this book, we use “latency” to refer to the total round-trip time of a request, from ini‐ tiation to payload delivery. Availability This is generally expressed as a percentage of overall time the system is expected to be available. Availability is defined as the ability to return an expected response to the requesting client. Note that time is not considered here, which is why most SLOs include both response time and availability. After a certain latency point, the system can be considered unavailable even if the request is still completing. Availability is often denoted in percentages, such as 99.9% over a certain window. All samples within that window will be aggregated. Throughput Another common SLI is throughput, or the rate of successful requests in a specific period of time, typically measured on a per-second basis. Throughput actually becomes quite useful as a companion to latency. When a team is preparing for launch and measuring latency, it must do it at the top throughput goals; otherwise, its tests will be useless. Latency tends to be steady until a certain tipping point, and we must know that tipping point in reference to throughput goals. Durability Durability is specific to storage systems and datastores. It indicates the successful per‐ sistence of a write operation to storage so that it can be retrieved at another time. This can be expressed in a time window, such as: in the event of a system failure, no more than the past two seconds of data can be lost. Cost or Efficiency “Cost or Efficiency” is often overlooked, or not considered in service-level discus‐ sions. Instead, you will find it relegated to budget and often not tracked effectively. Still, the overall cost of a service is a critical component to most businesses. Ideally, this should be expressed in cost per action, such as a page view, a subscription, or a purchase. An organization should expect to have the following actions as part of the operations of their services:

• New service: SLOs defined. In more traditional models, this might be called an operating-level agreement. • New SLOs: Set up appropriate monitoring to evaluate actual versus target metrics. • Existing service: Regular reviews of SLOs should be scheduled to validate that current service criticality is taken into account in the defined SLOs. • SLO fulfillment: Regular reports to indicate the historical and current status of SLO achievement or violation. • Service issues: A portfolio of issues that have affected service levels, and their current status in terms of workarounds and fixes. Defining Service Objectives SLOs should be built from the same set of requirements toward which product features are built. We call this customer-centric design because we should be defining requirements based on the needs of our customers. We generally want no more than three indicators; more than three rarely add significant value, and an excessive number of indicators often means you are including symptoms of primary indicators. Latency Indicators A latency SLO can be expressed as a range based on a certain indicator. For instance, we might say that request latency must be less than 100 ms (which is actually a range between 100 ms and 0 s when we make assumptions explicit). Latency is absolutely critical to the user experience.

Why Is Latency Critical? Slow or inconsistently performing services can lose more customers than a system that is down. In fact, speed matters enough that Google found that introducing a delay of 100 to 400 ms caused a reduction in searches by 0.2% to 0.6% over 4 to 6 weeks. You can find more details at Speed Matters. Here are some other startling metrics:

• Amazon: for each 100 ms of added latency, it loses 1% of sales • Google: if it increases page load time by 500 ms, it results in 25% fewer searches • Facebook: pages that are 500 ms slower cause a 3% dropoff in traffic • A one-second delay in page response decreases customer satisfaction by 16%

We can express a latency SLO like this: request latency must be less than 100 ms. If we leave the lower bound at 0, we might drive certain dysfunctions. A performance engineer might spend a week of time getting the response time down to 10 ms, but the mobile devices using the application will rarely have networks that can deliver the results fast enough to utilize this optimization. In other words, your performance engineer just wasted a week of work. We can iterate on the SLO like this: request latency must be between 25 ms and 100 ms. Let's next consider how we collect this data. If we are reviewing logs, we might take one minute's worth of requests and average them. There is a problem with this, however. Most distributed, networked systems create latency distributions with small percentages of outliers that can be fairly significant. This will skew an average and also hide the complete workload characteristics from the engineers monitoring it. In other words, aggregating response times is a lossy process. In fact, thinking about latency must be done by thinking of latency distributions. Latency almost never follows normal, Gaussian, or Poisson distributions, so averages, medians, and standard deviations are useless at best and lies at worst. You can find more details in "Everything you know about latency is wrong." To better understand, take a look at Figures 2-1 and 2-2, provided by Circonus, a high-scale monitoring product. In the blog post, these graphs are used to show spike erosion, which is the phenomenon we're discussing. In Figure 2-1, we have averages graphed with a larger time window for each average to accommodate a month's worth of data.

Figure 2-1. Latency averages with larger window of time for each average

In Figure 2-2, we are averaging on much shorter time windows because we are displaying only four hours.

Figure 2-2. Latency averages with a smaller window of time for each average

Even though this is the exact same dataset, the averages in Figure 2-1 indicate a peak of around 9.3, while Figure 2-2 shows 14!
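To make spike erosion concrete, here is a minimal sketch in Python. The data, the window sizes, and the spike are entirely synthetic; the point is only that the same series, averaged over different window widths, reports very different peaks:

```python
# Hypothetical illustration of spike erosion: identical data, different window
# sizes, very different apparent peaks. All numbers here are synthetic.
import random

random.seed(42)

# One day of per-second latencies (ms): ~20 ms baseline plus a 30-second spike.
latencies = [random.gauss(20, 3) for _ in range(86_400)]
for i in range(42_000, 42_030):
    latencies[i] = random.uniform(250, 400)

def peak_of_window_averages(values, window_seconds):
    """Average each window, then return the highest windowed average."""
    return max(
        sum(values[i:i + window_seconds]) / window_seconds
        for i in range(0, len(values), window_seconds)
    )

print(peak_of_window_averages(latencies, 60))     # roughly 170 ms with 1-minute windows
print(peak_of_window_averages(latencies, 3_600))  # roughly 22 ms with 1-hour windows
```

The 30-second burst dominates a one-minute window but almost vanishes inside an hour-long one, which is exactly the erosion the Circonus graphs illustrate.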

Be Careful Storing Averages! Remember to store the actual values rather than averages! If you have a monitoring application that is averaging values every minute and not retaining the full history of values, you will eventually find a time when you want to average at five minutes using the one-minute averages. This will absolutely give you incorrect data because the original averaging was lossy!

If you think of the minute of data as a full dataset rather than an average, you will want to be able to visualize the impact of outliers (in fact, you might be more interested in the outliers). You can do this in multiple ways. First, you can visualize the minimum and maximum values over the average. You can also remove outliers by averaging a certain percentage of values in that minute, such as the fastest 99.9%, 99%, and 95%. If you overlay those three averages with the 100% average, as demonstrated in Figure 2-3, and a minimum/maximum, you get a very good feel for the outlier impacts.

Figure 2-3. Latency average (100% sample size) overlaid with minimum and maximum
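If your tooling does not produce these overlays for you, the calculation itself is simple. Here is a minimal sketch; the function name and output structure are ours, not from any particular monitoring product:

```python
def window_summary(samples_ms):
    """Summarize one minute of raw latency samples without hiding the outliers."""
    ordered = sorted(samples_ms)

    def trimmed_average(keep_fraction):
        # Average only the fastest keep_fraction of samples (e.g., 0.99 -> fastest 99%).
        keep = max(1, int(len(ordered) * keep_fraction))
        return sum(ordered[:keep]) / keep

    return {
        "min": ordered[0],
        "max": ordered[-1],
        "avg_95": trimmed_average(0.95),
        "avg_99": trimmed_average(0.99),
        "avg_99_9": trimmed_average(0.999),
        "avg_100": trimmed_average(1.0),
    }

# Example: one slow outlier barely moves the trimmed averages but shows up in "max".
print(window_summary([12, 14, 15, 13, 16, 14, 13, 950]))
```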

Now, with that segue, let's think of our SLO regarding latency. If we are averaging every minute, no matter what our SLO is, we cannot prove we hit it because we are measuring averages only! Why don't we make our objective more relevant to real-life workloads? We can iterate on this SLO like this: latency over one minute must be between 25 and 100 ms for 99% of requests. Why do we choose something like 99% instead of 100%? Latency distributions tend to be very multimodal. There are normal times, and there are edge cases, which are due to any number of possibilities in a complex distributed system, such as Java Virtual Machine (JVM) garbage collection, database flushing, cache invalidations, and more. So, we expect a certain percentage of outliers, and our goal in setting an SLO is to recognize the percentage of outliers we are willing to tolerate. Now, let's consider the workload. Are we talking about a simple response such as you might see from an API? Or are we measuring a page render, which is an aggregation of many calls that happen over a period of time? If we are measuring a page render, we might want to specify initial response as one indicator and final rendering as a second, because there can be quite a period of time between them. Availability Indicators As mentioned earlier, availability is the amount of time, generally expressed as a percentage, that a service is able to respond to requests with system-defined responses.

For instance, we might say that a system should be available 99.9% of the time. This can be expressed like this: service must be available 99.9% of the time. This gives us about 526 minutes of downtime per year to work with. Almost 9 hours! A king's feast of downtime. You might ask why we don't just say 100%? If you are a product owner, or a salesperson, you probably will. It is generally accepted that the differences from 99%, to 99.9%, to 99.99% are each an order of magnitude more complex to manage, more expensive, and more distracting for engineers. Additionally, if this is an application that relies on delivery over the internet or large geographical distances, you can expect that the transport mediums will create their own disruptions, which would not allow you to utilize more than 99% to 99.9% of the uptime your system has. That being said, there is a big difference between 526 one-minute outages in a year and one 526-minute outage. The shorter the downtime, the greater the chance that most users will not even notice the disruption. In contrast, an eight-hour outage for some services generates news articles, thousands of tweets, and erodes trust from users. It makes sense to consider two data points around your service. The first is Mean Time Between Failures (MTBF). Traditionally, avoidance of failure has been the priority, which means increasing MTBF. The second data point is Mean Time To Recover (MTTR). This is how long it takes to resume service after a failure has occurred. Shorter is better!
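The arithmetic behind these budgets is worth keeping at hand; a minimal sketch, using nothing beyond the availability target itself:

```python
def downtime_budget_minutes(availability, period_minutes):
    """Minutes of downtime allowed for a given availability target and period."""
    return (1.0 - availability) * period_minutes

MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600
MINUTES_PER_WEEK = 7 * 24 * 60     # 10,080

print(downtime_budget_minutes(0.999, MINUTES_PER_YEAR))   # ~525.6 minutes per year
print(downtime_budget_minutes(0.999, MINUTES_PER_WEEK))   # ~10.08 minutes per week
print(downtime_budget_minutes(0.9999, MINUTES_PER_YEAR))  # ~52.56 minutes per year
```

The 10.08-minute weekly figure is the one we keep returning to in the SLO iterations that follow.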

Resiliency versus robustness in availability There has been much discussion over the past decade about building resilient systems that have three specific traits:

• Low MTTR due to automated remediation to well-monitored failure scenarios. • Low impact during failures due to distributed and redundant environments. • The ability to treat failure as a normal scenario in the system, ensuring that automated and manual remediation is well documented, solidly engineered, practiced, and integrated into normal day-to-day operations.

Note that there is not a focus on eliminating failures. Systems without failures, although robust, become brittle and fragile. When failures occur, it is more likely that the teams responding will be unprepared, and this could dramatically increase the impact of the incident. Additionally, reliable but fragile systems can lead users to expect greater reliability than the SLO indicates and for which the service has been engineered. This means that even if an SLO has not been violated, customers might be quite upset when an outage does occur. Armed with this knowledge, as you evaluate an SLO for availability, you should ask yourself some key questions:

• Are there workarounds during downtime? Can you function in a degraded mode, such as read only? Can you use caches to provide data even if it is stale? • Is there a different tolerance for downtime if it is limited to a small percentage of users? • What experience does a user have during downtimes of increasing lengths? — One failed request — 30 seconds — One minute — Five minutes — An hour or more

After considering this, you might want to reevaluate the naive availability SLO by doing the following:

• Defining the time interval • Defining a maximum incident duration • Defining a percentage of the user population affected before calling availability down

With that in mind, you can express the SLO as follows:

• 99.9% availability averaged over one week • No single incident greater than 10.08 minutes • Call for downtime if more than 5% of users are affected

Designing for downtime allowed This new iteration allows us to engineer processes such as upgrades, database locks, and restarts that can fit within the parameters provided. We can do rolling upgrades that affect fewer than one percent of users. We can lock tables to build indexes if it takes less than 10 minutes and no downtime has occurred that week. By designing for the downtime allowed, rather than trying to achieve zero downtime, we can be more efficient with our design and allow for some risks in the name of innovation and velocity. It is worth noting that even in today's world, in which 99.9% uptime is considered ubiquitous, there are times when services truly can tolerate planned and managed downtime safely. Being willing to take four hours of downtime that is communicated, mitigated with read-only options, and broken up into smaller percentages of users

can eliminate hours of carefully orchestrated migrations that introduce risk of data corruption, privacy issues, and more. After considering this, you might want to reevaluate the availability SLO by adding planned downtime options to guide the operations team in their maintenance efforts. Sample availability SLO, iteration 2:

• 99.9% availability averaged over one week • No single incident greater than 10.08 minutes • Downtime is called if more than 5% of users are affected • One annual four-hour downtime allowed, if: — Communicated to users at least two weeks ahead of time — Affects no more than 10% of users at a time

Throughput Indicators Throughput, as a service-level indicator, should list a peak value that the service must be able to support while maintaining the latency and availability SLOs provided in conjunction with it. You might say, "Laine and Charity, why do we have it? Shouldn't latency and availability be enough?" To which one of us would respond, "Intrepid Ops Scout, excellent question!" She would then puff thoughtfully on a pipe... There might be times when there is a bottleneck that puts an upper boundary on throughput without necessarily tipping over performance or availability. Perhaps there is locking in your system that constrains you to 50 queries per second (qps). Those might be incredibly snappy and tight responses, but if you have 1,000 people waiting to run this query, you have a problem. Because there are times when you cannot measure end-to-end latency, throughput indicators can often be an extra layer of validation that a system is living up to the needs of the business. Throughput can suffer from the same visibility issues as latency when it comes to averages and less granular data, so please do keep this in mind while monitoring.

Cost/efficiency indicators As you consider effective indicators for the cost of a system, the biggest variable is what you will use to reference cost against. This is really a business decision, but you should choose the action in the service that drives value. If you are a content provider such as an online magazine, pages delivered is the critical metric. If you are a Software as a Service (SaaS) provider, subscriptions to your service make sense. For retailers, transaction counts will be appropriate.

Considerations Why do you, as a database engineer, need to know this? You are managing one component of the service, so why must you concern yourself with the overall requirements? In the dark days of dysfunction, you might have been given a target for your datastore and graded based on your ability to maintain that. But, as part of a larger team, you have great opportunities to affect the service's velocity and availability. By knowing the holistic SLO, you can prioritize your own focus. If you have a latency SLO of 200 ms, you can assume that this 200 ms is being shared by the following:

• DNS resolution • Load balancers • Redirection to an HTTP server • Application code • Applications querying the database • TCP/IP transport times across oceans and worlds • Retrieval from storage, both solid-state drives (SSDs) and spinning disks

So, if your datastores are doing well and contributing minimally, you know to focus elsewhere. On the other hand, if you see that the SLO is at risk and you see low-hanging fruit, you can devote some time in your sprint to plucking that ripe, delicious performance fruit. While assembling an SLO for your new and exciting service, there are some additional things to consider: Don't go overboard We're metrics hoarders, and we understand the urge. But please try to keep your list simple and concise enough that the SLO status can be reviewed on a single page. Stay user-centric Think about what your users would find critical and build from there. Remember that most application services focus on latency, throughput, and availability, whereas storage services add data durability to this list. Defining SLOs is an iterative process If you have an SLO review process, you can modify and add on to this over time. While you are in early stages, you might not need to be as aggressive with SLOs. This will allow your engineers to focus on features and improvements. Use your SLOs to determine how you want to design your services, processes, and infrastructure.

Monitoring and Reporting on SLOs Now that you have well-defined SLOs, it is critical to monitor how you are doing in real life in comparison to your ideal objectives. We have not gone into operational visibility in this book yet, but there are crucial things to discuss before moving on to the next topic. Our top goal in monitoring for service-level management is to preemptively identify and remediate any potential impacts that could cause us to miss our SLOs. In other words, we don't want to ever have to rely on monitoring to tell us that we are currently in violation. Think of it like canoeing. We don't want to know rapids are present after we are in them. We want to know what is happening that could indicate that rapids are downstream while we are still in calm waters. We then want to be able to take appropriate action to ensure that we stay within the SLOs to which we have committed ourselves and our systems. When monitoring, we will always rely on automated collecting and analysis of metrics. This analysis will then be fed into automated decision-making software for remediation, for alerting of human operators (aka, you), or for ticket creation for later work. Additionally, you will want to visualize this data for real-time analysis by humans, and potentially you will want to create a dashboard for a high-level view of current state. We'll want to consider all three of these scenarios when we discuss the various indicators we will be monitoring. In other words, suppose that you have 10.08 minutes of downtime allowed for the week, and by Tuesday, you've had three minutes of downtime over three days due to "Stop the World" Cassandra garbage collection events and one minute from a load balancer failover. You've used up 40% of the SLO already, and you still have four days left to go. Now is the time to tune that garbage collection! By having an alert after a certain threshold (e.g., 30% of the budget) create a ticket in the ticketing system, the database reliability engineer (DBRE) can jump right on this issue. Monitoring Availability Let's use the availability SLO that we defined in the previous section. How do we monitor for this? We will need to monitor system availability as well as user-level errors to get an appropriate picture. As a reminder, our current sample availability SLO is as follows:

• 99.9% availability averaged over one week • No single incident greater than 10.08 minutes • Downtime is called if more than 5% of users are affected • One annual four-hour downtime allowed, if: — Communicated to users at least two weeks ahead of time

— Affects no more than 10% of users at a time

Traditionally, Ops staff would tend to focus on fairly low-level monitoring to inform them whether a system was available. For instance, they might measure to see whether a host was up, whether it was reachable, and whether the expected services that were hosted by that system were running and connectable. In a distributed system, this rapidly proves to be unsustainable and not a good predictor of service availability. If we have 1,000 JVMs, 20 database instances, and 50 web servers in place, how can we learn if any one of these components is affecting the service and to what degree that impact exists? With this in mind, the first thing we want to focus on is error rates from user requests. This is also known as Real User Monitoring (RUM). For instance, when a user submits an HTTP call from her browser, does she receive a well-formed response from the service? If your service is popular, this can potentially be a lot of data. Consider a major global news event that is generating in excess of 70,000 hits per second on a web service. Any modern CPU can calculate error rates for this amount of data fairly efficiently. This data is logged from the application (such as Apache HTTP Server) to a logging daemon (such as a Linux syslog). At this point, the way in which a system would get the data from these logs into appropriate tools for monitoring and analysis varies wildly. We're going to gloss over this for now and assume that we've stored the success/error rates of the service to a production datastore without any aggregation or averaging at the base level. We discussed this in the previous section, but it is worth repeating that storing averages alone loses valuable data. With our data stored, it is relatively trivial to evaluate whether one percent or more of our calls failed, and if so, mark that second as downtime. This regular tally of downtime can be summed and compared to our budget of 604.8 seconds for the week and reported in a dashboard that is displayed in browsers, on monitors in a network operations center or office, or in any other number of places to help all stakeholders see how the team is performing. Ideally, we want to be able to use this data to predict whether the current downtime amounts will lead to violation of the budget by the end of the week. The largest challenge in most environments is workload shifts due to product development. In a system for which releases are happening weekly, and sometimes daily, any previous datasets become relatively useless. This is particularly true of the older datasets compared to the ones in the recent past. This is called a decaying function. Exploring predictive analytics is beyond the scope of this book. But, there are numerous approaches that you can take here to predict whether you will violate your SLO in the current week or potentially in future weeks. It is worthwhile to take the downtime values from the previous N weeks (N could be larger in stable environments, or as small as one in continuous-deployment models) and see how many SLO violations occurred during those weeks in which downtime at the same point in the week was equal to or less than it is now.

For instance, your script might take the current downtime, which could be 10 seconds for the week, and the current time in the week in seconds. That could be defined as a downtime of 10 seconds at a time of 369,126 seconds into the week. You would then evaluate the previous 13 weeks, and for each week in which downtime was 10 seconds or less at the same point in the week (between 1 and 369,126 seconds), evaluate whether an SLO violation occurred that week. You would then give each such week a weight based on the nearness of the previous period. For instance, over 13 weeks, the most recent week is assigned 13 points, the one before it 12, and so on. Adding up the weights for the weeks in which SLO violations occurred, you might issue a high-priority ticket to your Ops team and notify them in chat if the combined values are 13 or above. This is just one example of a way to ensure that you have some level of data-driven monitoring in place if you do not have a crack team of data scientists with the time and inclination to review service-level data. The goals here are to get proactive eyes on a potential problem before it is an emergency, which means fewer pages to humans and fewer availability impacts. In addition to real user monitoring, it is useful to create a second dataset of artificial tests. This is called synthetic monitoring. Just because these are artificial does not mean that they are not identical in activity to a real user. For instance, a transactional email company might trigger email requests from QA accounts just as any other customer would do. The case for synthetic monitoring is to provide coverage that is consistent and thorough. Users might come from different regions and be active at different times. This can cause blind spots if we are not monitoring all possible regions and code paths into our service. With synthetic monitoring, we are able to identify areas where availability or latency is proving to be unstable or degraded, and prepare or mitigate appropriately. Examples of such preparation/mitigation include adding extra capacity, performance tuning queries, or even moving traffic away from unstable regions. With synthetic monitoring and RUM, you can identify when availability is affected and even predict when you might have an SLO violation. But this does not help us when it comes to larger impacts due to system failures or capacity limitations. One of the key reasons to implement robust monitoring is to capture enough data to predict failures and overloads before they occur.
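To make the downtime tally and the weighting scheme concrete, here is a minimal sketch. The data structures, thresholds, and example history are ours; the weights simply follow the 13-week scheme just described:

```python
WEEKLY_BUDGET_SECONDS = 604.8      # 0.1% of a 604,800-second week
ERROR_RATE_THRESHOLD = 0.01        # mark a second as "down" at >= 1% failed calls

def downtime_seconds(per_second_counts):
    """per_second_counts: iterable of (successes, errors) tuples, one per second."""
    down = 0
    for successes, errors in per_second_counts:
        total = successes + errors
        if total and errors / total >= ERROR_RATE_THRESHOLD:
            down += 1
    return down

def weighted_violation_score(history, current_downtime_s):
    """history: the previous 13 weeks, oldest first, each entry a tuple of
    (downtime at the same point in that week, did that week end in a violation)."""
    score = 0
    for weight, (past_downtime_s, violated) in enumerate(history, start=1):
        # Only count weeks that looked at least this healthy at the same point.
        if past_downtime_s <= current_downtime_s and violated:
            score += weight            # the most recent week carries weight 13
    return score

# Hypothetical history and current state: 10 seconds of downtime so far this week.
past_13_weeks = [
    (4, False), (22, True), (8, False), (3, False), (15, True), (6, False),
    (9, True), (2, False), (11, True), (5, False), (7, False), (10, True), (1, False),
]
if weighted_violation_score(past_13_weeks, current_downtime_s=10) >= 13:
    print("High-priority ticket: history suggests an SLO violation this week")
```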

Monitoring Latency Monitoring latency is quite similar to monitoring for errors in requests. Although availability is Boolean, latency is a value of time that we must measure to validate whether it fits within the constraints given in our SLO.

Latency SLO Ninety-nine percent request latency over one minute must be between 25 and 100 ms.

As with our error telemetry, we assume that our HTTP request logs have gone through syslog and into a time-series datastore. With that in place, we can take an interval of time, order all data points by latency, and eliminate the top one percent of requests. In this case, we are averaging values in each one-second time window. If any of the remaining 99% of calls are longer than 100 ms, we have a violation that counts toward downtime. This kind of data can also lend itself to predictive analysis via any multitude of tools or scripts. By measuring previous latencies during similar time or traffic patterns, you can look for anomalies that indicate a climbing response time that could lead to an SLO violation. Monitoring Throughput Throughput is easy to assess with the data we've assembled and reviewed for availability and latency SLOs. If you are storing every record, you will be able to measure transactions per second quite easily. If you are exceeding the minimum transaction count in your SLO, you are good. If you are not generating enough traffic to exceed your SLO, you will need to rely on periodic load tests to ensure that your system can support the demands of the SLO. Load testing is covered in more detail later. Monitoring Cost and Efficiency Cost and efficiency can be a challenging SLO to monitor because there are some costs that are not as quantifiable as others. You must consider the overall cost for a window of time. If you are working in a cloud environment, for which resources are billed like utilities, you can fairly easily quantify costs for resources such as storage, processing, memory, and interinstance bandwidth. If you are using your own bare metal, you will need to get hardware costs for all machines dedicated to the services, estimating when shared resources are in play. Often, the period of time for the cost is not very granular, however, so it can prove challenging to understand costs for specific periods of time, such as by hour, if you are getting monthly reports from the provider.
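One pragmatic way to get started, sketched below with entirely made-up unit prices, resource counts, and action volumes, is to multiply a small unit-cost matrix against what is currently deployed and divide by the business action you chose earlier in the chapter:

```python
# Hypothetical unit costs and deployed counts; real figures would come from your
# provider's billing data or an internal cost matrix.
DAILY_UNIT_COST = {"app_instance": 3.50, "db_instance": 22.00, "storage_gb": 0.004}
DEPLOYED = {"app_instance": 40, "db_instance": 6, "storage_gb": 8_000}

def estimated_daily_spend(unit_costs, deployed):
    return sum(unit_costs[name] * count for name, count in deployed.items())

def cost_per_action(daily_spend, actions_per_day):
    """Cost per page view, signup, order, or whatever action drives value."""
    return daily_spend / actions_per_day

spend = estimated_daily_spend(DAILY_UNIT_COST, DEPLOYED)
print("Estimated daily spend: $%.2f" % spend)
print("Cost per order: $%.4f" % cost_per_action(spend, actions_per_day=500_000))
```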

For fixed-cost items such as instances and storage, you can keep an uploaded cost matrix from your provider or from your own internal databases. This data can be referenced as resources are deployed and decommissioned, creating an estimated spend. For usage costs such as bandwidth, IOPS, and similar items, you can reference other gathered metrics on a scheduled interval to also estimate costs. You also need to consider the costs of staff who are maintaining the service. This can include operations, database and network engineers, anyone on-call, and coordinating project managers. These can be shared resources, and again, you will find yourself estimating the percentage of time devoted to the specific service you are monitoring. If you are in an environment in which time tracking is in use, you can potentially reference that data to build a somewhat real-time human spend metric. Otherwise, you will need to estimate on a regular basis, taking into account factors such as terminations, new hires, and team changes. This is manual work, and some of it we cannot automate easily, but nonetheless it provides incredibly valuable data in terms of the cost of operating services. Comparing this to the value being generated by services gives reliability engineers a target for efficiency improvements. Wrapping Up Service-level management is the cornerstone of infrastructure design and operations. We cannot emphasize enough that all actions must be a result of planning to avoid violations of our SLOs. The SLOs create the rules of the game that we are playing. We use the SLOs to decide what risks we can take, what architectural choices to make, and how to design the processes needed to support those architectures. Having completed this chapter, you should now understand the core concepts of service-level management, including SLAs, SLOs, and SLIs. You should know the common indicators that are used, including availability, latency, durability, and efficiency. You should also understand the approaches to monitoring these indicators effectively to catch problems before your SLOs are violated. This should give you a good foundation to effectively communicate what is expected of the services you manage and to contribute to meeting those goals. In Chapter 3, we cover risk management. This is where we begin to evaluate what might affect the service levels we've committed to meeting. Using these service-level requirements and recognizing the potential risks, we can effectively design services and processes to ensure that we fulfill the promises we've made to the business.


CHAPTER 3 Risk Management

Operations is a set of promises and the work it takes to fulfill them. In Chapter 2, we discussed how to create, monitor, and report on those promises. Risk management is what we do to identify, assess, and prioritize the uncertainties that could cause us to violate them. It is also the application of resources (technology, tools, people, and processes) to monitor, mitigate, and reduce the probability of these uncertainties coming to pass. This is not a perfect science! The goal of this is not to eliminate all risks. That is a quixotic goal that will waste resources. The goal is to bake the assessment and mitigation of risk into all of our processes and to iteratively reduce the impact of risks through mitigation and prevention techniques. This process should be continually performed with inputs from postmortems of incidents, the introduction of new architectural components, and the increased or decreased impact of risks as an organization evolves. The cycle of this process can be broken down into seven categories:

• Identify possible hazards/threats that create operational risk to the service • Conduct assessment of each risk, looking at likelihood and impacts • Categorize the likelihood and outcome of the risks • Identify controls for mitigating consequences or reducing likelihood of the risk • Prioritize which risks to tackle first • Implement controls and monitor effectiveness • Repeat process

By repeating this process, you are exercising Kaizen, or continuous improvement. And nowhere is this more important than in risk assessment, where you must evolve a strategy incrementally.

Risk Considerations There are multiple variables that can affect the quality of our risk assessment processes. These can be broken down into the following categories:

• Unknown factors and complexity • Availability of resources • Human factors • Group factors

Each of these needs to be taken into consideration to help define a realistic process for your team, and so we'd like to briefly touch on them in this section. Unknown Factors and Complexity Compounding the challenge of a risk assessment process is the sheer amount of complexity involved in today's systems. The more complex and convoluted the domain, the greater the difficulty people have in transferring their knowledge to situations they have not experienced. The tendency to oversimplify concepts so that they can be dealt with easily is called reductive bias. What works in initial learning no longer works in advanced knowledge acquisition. There are numerous risks that are unknown, and many that are out of our control. Here are some examples:

• Impacts from other customers in hosted environments such as Amazon or Google • Impacts from vendors integrated into the infrastructure • Software engineers pushing code • Marketing efforts creating workload spikes • Upstream and downstream services • Patches, repository changes, and other incremental software changes

To assess risk in such environments, experience with problem-solving in these domains helps the assessment process. The operations team must utilize its collective experience and continually grow its knowledge to build richer models for planning. The teams must also acknowledge that they will not be able to consider all possibilities and that they must plan for unknown possibilities by creating resilient systems. Availability of Resources If any of you have worked in a resource-starved department or a scrappy startup, you know that trying to acquire resources for ongoing, proactive processes like this can

be...well, challenging (aka Sisyphean). So, you might find yourself with 4 hours or perhaps only 30 minutes a month to visit risk-management processes. Therefore, you must create value. The cost of your time and the resources you apply to mitigation must be less than the cost of inaction. In other words, be relentless in prioritizing against the most probable and highest-impact risks with the time available. Create resilient systems and learn from the incidents that occur. Human Factors There are numerous potential issues when humans begin doing things. We're brilliant, but we have a lot of fine print in our owner's manuals. Here are some things that can damage these processes: Inaction syndrome Many Ops folks will find themselves working under a manager or surrounded by peers who are risk averse. Characterized by inertia, these people choose inaction because they consider the risk of change greater than the risk of inaction. It is important to do the math rather than falling back on inaction in the face of the unknown. Ignoring familiar hazards Experienced engineers will often ignore common risks, focusing more on exotic and rare events. For instance, someone who is quite used to dealing with disks filling up might focus more on datacenter-wide events and not plan adequately for disk controls. Fear Fear can be considered a positive as well as a negative stressor depending on the individual. Some individuals thrive within high-pressure, high-stakes environments and will bring great value to your planning, mitigation, and production work. It is not uncommon for those who have fear reactions to ignore worst-case scenarios because of their fear. This could lead to lack of preparation on key, high-risk components and systems. It is important to recognize these reactions in your team. Overoptimism Another human tendency in response to risk assessment is that of overoptimism. We often believe the best of ourselves and the others in our teams. This can lead us to consider things in ideal situations (nonfatigued, no other incidents distracting us, junior staff being available). This applies not only to people, but to events. Have you ever thought, "There's no way three disks can fail in the same day," only to experience the pain of a bad batch of disks causing exactly that? We must also consider physical factors, such as fatigue, in creating risk, and as a hindrance in manual remediation (aka firefighting). Any time we consider human effort and the risks inherent to that effort, such as manual changes and forensics, we must

assume that the Ops staff digging into the meaty problem has been woken up after a long day of work. Perhaps this won't be the case, but we must consider it. Additionally, while designing controls for mitigating or eliminating risk, we must consider that the person doing manual remediation could be just as fatigued and perhaps even fighting multiple fires at the same time.

Pager Fatigue Pager fatigue occurs when unnecessary or excessive paging creates fatigue and overwhelm. You should consider this when deciding how much alerting (manual response and remediation) is built into the monitoring processes. It is often caused by false positives (alerts for issues that aren't issues, often due to poorly tuned thresholds), or by using alerts instead of warnings for trends that might become dangerous in the near future. Group Factors Just as individuals have their blind spots, groups have their own dynamics that can skew a process of risk management. Here are some factors to keep in mind: Group polarization Also known as risky shift, group polarization occurs because groups tend to make decisions that are more extreme than the views their individual members hold, often shifting in the direction contrary to those initial views. For instance, if individuals feel cautious, the group will tend toward being much more risk tolerant after consensus has been reached; similarly, risk tolerance will shift toward risk avoidance. Individuals often will not want to appear the most conservative in a group environment. This can cause a team to be more risk tolerant than is necessarily appropriate. Risk transfer Groups will also tend toward greater risk tolerance when they have other groups to which they can assign risk. For instance, if I am planning for the Ops team, I might take greater risks if I know I have a database team to fall back on. Building a sense of ownership and working in cross-functional teams that cannot shift risk to others will help with this. Decision transfer Decision transfer can occur when teams overestimate risk so that they can transfer responsibility for specific decisions to others. For example, if high-risk changes require CTO approval, and thus CTO responsibility, people will tend to measure risks higher so as to push decision making up the chain. This can also be mitigated through more autonomous teams that rely on the expertise and experience of individuals and teams rather than hierarchical approval processes.

What Do We Do? We face a reality that a risk-management process can easily become overly burdensome. Even with significant resources, teams will still not be able to capture all potential risks that can affect availability, performance, stability, and security. It makes sense for us to create a process that iterates and improves over time. Striving for resilience in handling risks, rather than elimination of all risks, also allows for intelligent risk taking in the name of innovation and improvements. We would argue that the goal of eliminating all risks is actually a poor one. Systems without stressors do not tend to strengthen and improve over time. They end up being brittle against unknown stressors that have not been planned for. Systems that experience stressors regularly, and thus have been designed for resiliency, will tend to handle unknown risks more gracefully. It is arguable that teams should spend their downtime budgets, a concept popularized by Google, on opportunities that provide great benefit while incurring a manageable amount of risk. At Google, if a team has a budget of 30 minutes of downtime for a quarter and has not used that time, it is willing to take greater risks for the sake of new feature releases, improvements, and enhancements. This is an excellent use of the full budget for innovation rather than a completely risk-averse approach. So how does this translate into a real-world approach to risk assessment as a process? Let's begin with what not to do! What Not to Do OK, so that's a lot to keep in mind! Here are some last-minute tips to consider as we actually dig into the process of risk management:

• Don’t allow subjective biases to damage your process • Don’t allow anecdotes and word of mouth to be the primary source of risk assess‐ ment • Don’t focus only on previous incidents and issues; look ahead • Don’t stagnate; keep reviewing previous work • Don’t ignore human factors • Don’t ignore evolution in architecture and workflow • Don’t assume that your environment is the same as previous environments • Don’t create brittle controls or ignore worst-case contingencies

We’re sure you’ll add to this list over time, but this is a good list to keep in mind to avoid the pitfalls awaiting you as you consider your systems.

A Working Process: Bootstrapping

Figure 3-1. Initial bootstrapping of a risk management process

Whether for a new service or for inheritance of an existing one, our process begins with an initial bootstrapping. In a bootstrapping (see Figure 3-1), the goal is to recognize

the major risks that would endanger the Service-Level Objective (SLO) for the service or that are most likely to occur. Additionally, we must take into account worst-case scenarios that could endanger the service's long-term viability. Remember, brave database engineer, that trying to build a comprehensive risk portfolio is not the goal here. The goal is a starting list of risks to mitigate, eliminate, or plan around operationally, one that drives the most value from the resources available. Service Risk Evaluation Armed with a list of the services and microservices that you are supporting, you should sit down with the product owners and evaluate risk tolerance for each. The questions you want to answer include the following:

• What are the availability and latency SLOs defined for this service? • What does downtime or unacceptable latency look like for this service when: — All customers are affected? — A subset of customers is affected? — The service is in degraded mode (read-only, some functions turned off, etc.)? — Performance of the service is degraded? • What is the cost of downtime for this service? — Revenue lost? — Customer retention? — Is this a free or paid service? — Are there competitors the customer can easily go to? — Are there downtime impacts that can undermine the entire company? — Data loss? — Privacy breach? — Downtime during an event/holiday? — Extended downtime?

Let’s look at an example. UberPony is a pony-on-demand company comprising six services:

1. New customer signup 2. Pony-on-demand order/fulfillment 3. Pony handler signup 4. Pony handler

5. Pony handler payments 6. Internal analytics

Let's look at two services, New Customer Signup and Order/Fulfillment:

UberPony customer signup
Availability SLO: 99.90%
Latency SLO: 1 second
New customers per day: 5,000
SLO allowed errors: 5
Infrastructure cost per day: $13,698
Infrastructure cost per dollar of revenue: $0.003
Customer lifetime value: $1,000
Lifetime value per day: $5,000,000
Peak customers per minute: 100
Customer dropout after error: 60%
Peak value loss per minute: $60,000

UberPony ordering and fulfillment
Availability SLO: 99.90%
Latency SLO: 1 second
Current orders per day: 500,000
SLO allowed errors: 500
Infrastructure cost per day: $30,000
Infrastructure cost per dollar of revenue: $0.006
Revenue per order: $10
Daily revenue: $5,000,000
Peak orders per minute: 1,000
Order dropout after error: 25%
Customer loss after error: 1%
Peak revenue loss per minute: $2,500
Customer value loss per minute: $10,000
Total loss per minute: $12,500

So, it appears that our customer signup service can cost us up to 4.8 times as much revenue per minute as the ordering and fulfillment service. Seventy-five percent of customers will retry an order, but only 40% of customers will come back if they can't sign up. Apparently they are happy to go to UberDonkey instead. Notice that we tried to consider variables such as customer loss after an order error and how many customers or orders were retried after an error. This can be difficult without good analytics, but guesses can suffice if you do not have data available. It is better than nothing!
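The loss-per-minute figures in these tables come from straightforward arithmetic; the sketch below reproduces them. The decomposition is our reading of the numbers above, not a formula given in the text:

```python
def signup_loss_per_minute(peak_signups_per_min, dropout_after_error, lifetime_value):
    # Customers who hit an error and never come back, valued at lifetime value.
    return peak_signups_per_min * dropout_after_error * lifetime_value

def order_loss_per_minute(peak_orders_per_min, order_dropout, revenue_per_order,
                          customer_loss, customer_lifetime_value):
    lost_order_revenue = peak_orders_per_min * order_dropout * revenue_per_order
    lost_customer_value = peak_orders_per_min * customer_loss * customer_lifetime_value
    return lost_order_revenue + lost_customer_value

print(signup_loss_per_minute(100, 0.60, 1_000))               # $60,000
print(order_loss_per_minute(1_000, 0.25, 10, 0.01, 1_000))    # $2,500 + $10,000 = $12,500
print(signup_loss_per_minute(100, 0.60, 1_000)
      / order_loss_per_minute(1_000, 0.25, 10, 0.01, 1_000))  # 4.8x
```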

This data will change and evolve, so be sure to keep it up to date as you go through the process. For instance, if UberDonkey becomes more competitive and UberPony loses 5% of its customers after an order error, suddenly our loss per minute of downtime for the order/fulfillment service becomes $52,500. This has increased in priority significantly. For now, though, it makes much more sense for us to focus on the customer signup service as a priority. Architectural Inventory Now that we've defined our scope, we take inventory of the systems and environments for which we are responsible:

• Datacenters • Architectural components/tiers (e.g., MySQL, Nginx load balancers, J2EE application instances, network, firewall, Hadoop/HDFS, Content Delivery Network [CDN]) • Roles within those components (e.g., writer/primary, replica) • Interactions/communication pathways between services (queries from app to MySQL, load balancer to app, app posts to Redis) • Jobs (Extract, Transform, and Load [ETL] processes, CDN loading, cache refresh, configuration management, orchestration, backup and recovery, log aggregation)

Here is a simplistic inventory for our top priority service:

UberPony customer signup
Component                     Datacenter 1 count   Datacenter 2 count
Front end load balancers      2                    2
Web servers                   20                   20
Java load balancers           2                    2
Java servers                  10                   10
Database proxies              2                    2
Cloudfront CDN                Service              Service
Redis cache servers           4                    4
MySQL cluster write servers   1                    0
MySQL cluster read servers    2                    2
MySQL replication             Service              Service
CDN refresh                   Job                  Job
Redis cache refresh           Job                  Job
MySQL backups                 Job                  n/a
ETL process                   Job                  n/a
RedShift                      Service              n/a

Our next step is to assess the risks in this architecture that could affect the service. Prioritization How do we identify and prioritize the risks that could potentially cause us to violate our SLO targets? The field of risk management defines risk in terms of the likelihood of a hazard leading to an adverse outcome multiplied by the consequence of that outcome. For instance, this grid shows an assessment spectrum:

Likelihood/Impact   Severe         Major          Moderate     Minor        Negligible
Almost Certain      Unacceptable   Unacceptable   High         Moderate     Acceptable
Likely              Unacceptable   High           High         Moderate     Acceptable
Possible            Unacceptable   High           Moderate     Moderate     Acceptable
Unlikely            High           Moderate       Moderate     Acceptable   Acceptable
Rare                High           Moderate       Acceptable   Acceptable   Acceptable

With the goal of removing ambiguity, it is important to quantify what the values for likelihood and outcome are. Outcomes will vary based on your specific domain problem. In terms of the issue of ambiguity in likelihood/probability, we would suggest reviewing "Describing probability: The limitations of natural language". Let's break up likelihood as follows:

Scale            Range
Almost certain   >50%
Likely           26–50%
Possible         11–25%
Unlikely         5–10%
Rare             <5%

We will consider this the probability that we will violate our SLOs during a specific period, such as a week. In terms of impact, we will consider our SLO when defining impact categories, as well as other issues that could destroy a business, including data corruption, privacy exposures, and security incidents. Most of those will go into the severe or major categorizations. Again, these are only examples.
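Encoding the grid makes classification mechanical rather than a matter of debate. Here is a minimal sketch; the structure is ours, but the values mirror the grid above:

```python
RISK_MATRIX = {
    "almost certain": {"severe": "unacceptable", "major": "unacceptable",
                       "moderate": "high", "minor": "moderate", "negligible": "acceptable"},
    "likely":         {"severe": "unacceptable", "major": "high",
                       "moderate": "high", "minor": "moderate", "negligible": "acceptable"},
    "possible":       {"severe": "unacceptable", "major": "high",
                       "moderate": "moderate", "minor": "moderate", "negligible": "acceptable"},
    "unlikely":       {"severe": "high", "major": "moderate",
                       "moderate": "moderate", "minor": "acceptable", "negligible": "acceptable"},
    "rare":           {"severe": "high", "major": "moderate",
                       "moderate": "acceptable", "minor": "acceptable", "negligible": "acceptable"},
}

def classify_risk(likelihood, impact):
    """Look up the risk level for a likelihood/impact pair from the grid."""
    return RISK_MATRIX[likelihood.lower()][impact.lower()]

print(classify_risk("Likely", "Major"))      # high
print(classify_risk("Unlikely", "Severe"))   # high
```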

Severe impact (immediate SLO violation) A severe impact's potential characteristics are as follows:

• The entire service will be unavailable or degraded past 100 ms for 10 minutes or more for 5% or more of users. (In a 7-day week, there are 10,080 minutes, so a 99.9% SLO allows only 10.08 minutes of downtime.) • Imminent or current exposure of customer data to other customers.

• Letting a nonauthorized person access production systems and/or data. • Data corruption of transactional data.

Any of the preceding would trigger a severe classification.

Major (imminent SLO violation) A major impact's potential characteristics are as follows:

• The entire service will be unavailable or degraded past 100 ms for 3 to 5 minutes for 5% or more of users (up to 50% of availability budget). • System capacity degraded to 100% of needed capacity instead of 200% target.

Any of the above would trigger a major classification.

Moderate (could contribute to SLO violation with other incidents in the same period) A moderate impact’s potential characteristics are as follows:

• The entire service will be unavailable or degraded past 100 ms for 1 to 3 minutes for 5% or more of users (up to 33% of availability budget). • System capacity degraded to 125% of needed capacity instead of 200% target.

Any of the preceding would trigger a moderate classification.

Minor A minor impact’s potential characteristics are as follows:

• The entire service will be unavailable or degraded past 100ms for up to one minute for 5% or more of users (up to 10% of availability budget). • System capacity degraded to 150% of needed capacity instead of 200% target.

Any of the preceding would trigger a minor classification. As a reminder, we will not try to capture every potential risk. You will add more to this portfolio day to day as part of the ongoing incident management and risk-management processes. What we are doing is called framing, which means that we are creating a limited scope to bound our work in a pragmatic way. In this case, we are framing based on the most likely and most impactful scenarios. For instance, we know that component failures and instance failures are common events in public cloud environments, such as the one UberPony uses as a host. In other words, there is a low mean time between failures (MTBF). We will classify these failures as "Likely" for the web and Java instance groups because we have a moderate

amount of them in play at any time (20 or 10, respectively). That being said, failure of one web instance means 5% of customers are affected. Failure of a Java instance means 10% of customers are affected. That is a violation of SLO, and because it might take three to five minutes to launch a new copy of this, we would have a major impact. With a likely probability and a major impact, risk is high. After we put in automated remediation (take the instance out of service and launch a new one in its place), we test, and this process takes five seconds on average. This puts the new impact to minor, and thus the risk moves to moderate. If we consider failure at the service or instance level for our inventory, we might come up with something like this:

UberPony customer signup service
Component                 Likelihood   Impact   Risk
Frontend load balancers   Possible     Severe   Unacceptable
Web servers               Likely       Major    High
Java load balancers       Possible     Major    High
Java servers              Likely       Major    High
Database proxies          Possible     Major    High
Cloudfront CDN            Rare         Major    Moderate
Redis cache servers       Possible     Major    Moderate
MySQL write servers       Unlikely     Severe   High
MySQL read servers        Possible     Major    High
MySQL replication         Possible     Major    High
CDN refresh               Unlikely     Minor    Acceptable
Redis cache refresh       Unlikely     Minor    Acceptable
MySQL backups             Unlikely     Major    Acceptable
ETL process               Unlikely     Minor    Acceptable
RedShift data warehouse   Rare         Minor    Acceptable

Based on framing, we want to dive deeper into anything with an unacceptable or high risk first, and then we can dig deeper into the moderate cases, and so on. In the database section after operations core, we will do risk assessment for databases in greater detail. Our goal here is to help you understand the process. The other caveat is that we do need to consider full datacenter-level risks. Although these are rare, they go in the same category as privacy violation, data loss, and other risks that require consideration due to the potentially business-ending impacts. Control and Decision Making Now that we have a prioritized list of risks to evaluate, let's look at the techniques for deciding on controls to mitigate and potentially eliminate those risks. We started this in the previous section with our web and Java servers by putting in automated

replacements to reduce the mean time to recover (MTTR) of the failure. Remember, our goal is to focus on rapid recovery and reduction of MTTR over elimination of failures. Resiliency over brittle robustness!

Why MTTR Over MTBF? When you create a system that rarely breaks, you create a system that is inherently fragile. Will your team be ready to do repairs when the system does fail? Will it even know what to do? Systems that have frequent failures that are controlled and mitigated such that their impact is negligible have teams that know what to do when things go sideways. Processes are well documented and honed, and automated remediation becomes actually useful rather than hiding in the dark corners of your system.

For each potential risk, there are three approaches the team can choose from:

• Avoidance (find a way to eliminate the risk) • Reduction (find a way to lessen the impact of the risk when it occurs) • Acceptance (label the risk tolerable and plan for it to happen accordingly)

Technically, in risk management circles, there is a fourth approach, called risk sharing, in which you use outsourcing, insurance, and other risk transfer approaches. None of those apply to risk in IT, however, so we will not analyze it here. For each component, we will look at the types of failure, the impacts of those failures, and a few controls to automate recovery, improve recovery times, and reduce frequency. Associated with these controls will be a cost and effort. By comparing this cost with the costs of downtime, we can make decisions on the proper choice in mitigation.

Identification In our UberPony risk evaluation, we identified multiple tiers of our MySQL storage service as high risk. This is very typical of the database tier. So, let’s look at what we can do to reduce this risk. We’ve identified four key failure points in this service:

• Write instance failure • Read instance failure • Replication failure • Backup failures

Each of these is a common failure point for a datastore.

Evaluation For write failures, the UberPony ops team sits down and evaluates its options for automating MySQL writer failure recovery. If a write instance failure occurs, our customer signup service cannot create or change any data. This means no new customers and no ability for customers, or UberPony, to change customer data. We've identified that if the signup service is down at peak, we can potentially lose $60,000 of lifetime customer value per minute. So, it is pretty critical that we figure this one out! This means that risk acceptance is not an option.

Mitigation and controls There are some risk elimination items already in place. We have a RAID 10 disk system that provides disk redundancy so that disk failure does not create database failure. There are similar redundancies across the environment. Another elimination approach that is brought up is replacing the base MySQL database engine with Galera, an architecture that allows us to write to any node in the MySQL cluster. This would require a significant architectural change, and no one on the team has much experience with this engine. After consideration, the risks introduced by a new system seem to outweigh the gains from the approach. If the application is designed correctly, customers could still log in to the service and view their data from the read instances. This is risk mitigation. Speaking to the software engineering team, we find out that this is on its roadmap. But, new customers still can't sign up in degraded mode, so we do not get much value for the cost of this feature (other features not being developed is a high cost in a competitive marketplace). Ultimately, the team decides to do an automated remediation—in this case, an automated failover to another master. They choose automated over manual because the 10 minutes of downtime allowed in the SLO simply doesn't allow for the time it takes to get a human online and ready to take action. That being said, with managing writes, we have a potential for data loss, so the process must be rock solid.

Implementation The team decides on MySQL MHA as the technology it will use to do automated failover. MySQL MHA, or MySQL High Availability, is software to manage the failover and the ensuing replication topology changes required in such a failover. The team creates a plan for vigorous testing before implementing such a critical process. Such tests are phased, starting with a test environment with no traffic, followed by a test environment with simulated traffic, and finally on production, closely monitored. These tests are done numerous times to ensure they aren't looking at an anomaly. The tests include the following:

• Cleanly shutting down the master database in a test environment • Killing the MySQL process in a test environment • Killing the server instance running MySQL in a test environment • Simulating a network partition

After each test, the team does the following:

• Record the amount of time the failover took • Record the latency of simulated and production traffic to evaluate impacts to performance • Validate that tables are not corrupt • Validate the data was not lost • Look at error logs from the clients to see what the impact was during this time
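One simple way to capture the first measurement in that list, the client-visible write gap during a failover test, is a heartbeat writer. The sketch below is assumption-heavy: the connection details, the heartbeat table, and the choice of the mysql-connector-python driver are all ours, not part of the MHA tooling.

```python
# Attempts a heartbeat write every 100 ms during a failover test and reports the
# longest stretch of failed writes, approximating client-visible write downtime.
import time
import mysql.connector  # any MySQL driver with connect/execute/commit would do

CONN_ARGS = {"host": "writer-vip", "user": "dbre", "password": "secret",
             "database": "heartbeat"}  # hypothetical endpoint, schema, and table

def measure_write_gap(test_duration_s=600):
    start = time.time()
    last_success = start
    longest_gap = 0.0
    while time.time() - start < test_duration_s:
        try:
            conn = mysql.connector.connect(connection_timeout=1, **CONN_ARGS)
            cur = conn.cursor()
            cur.execute("INSERT INTO heartbeat (ts) VALUES (NOW(6))")
            conn.commit()
            conn.close()
            last_success = time.time()
        except mysql.connector.Error:
            longest_gap = max(longest_gap, time.time() - last_success)
        time.sleep(0.1)
    return longest_gap

if __name__ == "__main__":
    print("Longest write outage: %.1f seconds" % measure_write_gap())
```

Run against the test environment during each scenario above, the reported gap maps directly onto the failover-impact target the team is trying to hit.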

When the team is satisfied that the system works and will meet its SLOs, it considers how to incorporate this process into other daily processes to ensure that it is well exercised, documented, and free of bugs. The team initially chooses to incorporate it into its deployment process, using the failover process to do rolling changes to database objects so as to not affect MySQL single-threaded replication processes. By the time the team finishes, failover is down to a 30-second or less impact.
As the software engineering team practices the failover processes, it also realizes that there could be a potential for data loss during that 30-second interval. So, the team enacts double writes for its applications, sending all of its inserts, updates, and deletes to an event broker in case data must be recovered. This is a further mitigation of the impacts of the write-master failover.
These controls are for the initial bootstrapping. It is important to remember that this doesn't need to be perfect. This is an initial set of controls only, focused on the highest priorities and the greatest value. With the bootstrapping process complete, you've gone a long way toward covering the most common cases of risk. From this point on, it is a process of iterating.

Ongoing Iterations
With a bootstrapped process in place, our priorities for risk elimination and mitigation are pushed into the architectural pipeline for design, build, and ongoing operations. We have mentioned previously that risk management is a continual process; thus, there is no need to be comprehensive in the beginning because continued processes will add to the risk portfolio, providing deeper coverage, as depicted in Figure 3-2. So, what are those processes?

• Service delivery reviews
• Incident management
• Architectural pipeline

Figure 3-2. The ongoing life cycle and inputs to the risk-management process

Service delivery reviews are periodic revisits of the service's evolution with an eye toward shifts in risk tolerance, revenue, cost of impacts, and user base. As these values shift significantly, previous risk acceptances, mitigations, and eliminations must be revisited to ensure that they are still acceptable.
Incident management processes will also create inputs for risk prioritization. As postmortems reveal new vulnerabilities, we must analyze those vulnerabilities and add them into the priority list.
Finally, as the architectural pipeline is built, we must bring new components through the risk-management process to identify any risks that might have been missed in the design phase.

Wrapping Up
By now you've learned the importance of incorporating risk management into the daily processes of IT. You've gone over some of the considerations and factors that can affect the process and broken out a realistic bootstrapping process in addition to the day-to-day processes that can keep an incremental risk-management process developing over time.
Even armed with an understanding of our service-level commitments and the potential risks to those commitments, we still are missing a vital component: operational visibility. Situational awareness as well as historical knowledge of our systems' performance and characteristics over time will be needed to preempt issues and make decisions about how to continually improve the systems we manage.


CHAPTER 4 Operational Visibility

Visibility (often referred to as monitoring) is the cornerstone of the craft of database reliability engineering. Operational visibility means that we have awareness of the working characteristics of a database service due to the regular measuring and collection of data points about the various components. Why is this important? Why do we need operational visibility? Here are just some of the reasons:

Break/fix and alerting
We need to know when things break, or are about to break, so that we can fix them to avoid violating our Service-Level Objectives (SLOs).

Performance and behavior analysis
It's important to understand the latency distribution in our applications, including outliers, and we need to know the trends over time. This data is critical to understanding the impact of new features and optimizations.

Capacity planning
Being able to correlate user behavior and application efficiency to real resources (CPU, network, storage, throughput, memory) is critical to ensuring that you never encounter a lack of capacity at a critical business moment.

Debugging and postmortems
Moving fast means things do break. Good operational visibility gives you the ability to rapidly identify failure points and optimization points to mitigate future risk. Human error is never a root cause, but systems can always be improved upon and made to be more resilient.

Did We Mean to Say Human Error Is Never a Root Cause?
When analyzing an incident or problem, it can be tempting to use human error as a root cause. If we dig in deeper, though, what appears to be human error is caused by an underlying failure of process or environment. How can that be? Here are some possibilities:

• A fragile, poorly instrumented, or overly complex environment can cause humans to make mistakes.
• A process that doesn't take into account human needs, such as sleep, context, or skill can also cause humans to make mistakes.
• A process of hiring and training operators may be broken, allowing the wrong operators into the environment.

Furthermore, “root cause” itself is a problematic statement, as there is rarely a single issue that leads to errors and incidents. Complex systems lead to complex failures, and adding humans into the mix complicates things further. Instead of thinking in terms of root cause, we suggest you consider a list of contributing factors, prioritized by risk and impact.

Business analysis
Understanding how your business functionality is being utilized can be a leading indicator of issues, but it is also critical for everyone to see how people are using your features and how much value versus cost is being driven.

Correlation and causation
By having events in the infrastructure and application register themselves in your operational visibility stack, you can rapidly correlate changes in workload, behavior, and availability. Examples of these events are application deployments, infrastructure changes, and configuration changes.

Pretty much every facet of your organization requires true operational visibility—OpViz. Our goal in this chapter is to help you to understand observability in the architectures with which you will be working. Although there is no one set of tools we espouse, there are principles, a general taxonomy, and usage patterns to learn. We present this via numerous case studies and example approaches. First, let's consider the evolution of OpViz from traditional approaches to those utilized today.

Traditional Monitoring
Traditional monitoring systems typically have the following characteristics:

• Hosts are generally servers rather than virtual instances or containers, and they have a long lifetime, measured in months, sometimes years.
• Network and hardware addresses are stable.
• There is a focus on systems rather than services.
• Utilization and static thresholds (aka symptoms) are monitored more than customer-facing metrics (service-level indicators).
• Different systems for different silos (network and database as examples).
• Low granularity (one- or five-minute intervals).
• Focus on collection (polling) and presentation, not analysis.
• High administration overhead and often more fragile than the services they monitor.

Let's summarize this into a "traditional monitoring" mindset. In traditional environments, database administrators focused on questions such as, "Is my database up?" and "Is my utilization sustainable?" They were not considering how the behavior of the database was affecting latency to users by looking at percentiles of the latency metrics to understand distribution and potential outliers. They often wanted to, but they didn't have tools to make this happen.

Operational visibility is a big deal! We need some rules on how we design, build, and utilize this critical process.

The New Rules of Operational Visibility
Modern operational visibility assumes that data stores are distributed, often massively. It recognizes that collection, and even presentation of data, are not as crucial as the analysis. It always asks—and hopefully elicits rapid answering of—two questions: "How is this impacting my SLOs?" and "How is this broken, and why?" In other words, rather than treating your OpViz stack as a set of utilities to be relegated to the Ops team, you must design, build, and maintain it as a business intelligence (BI) platform. This means that you must treat it the same way you would a data warehouse or analytics platform. The rules of the game have changed to reflect this.

Treat OpViz Systems Like BI Systems
When designing a BI system, you begin by thinking about the kinds of questions your users will be asking and building out from there. Consider your users' needs for data latency ("How quickly is data available?"), data resolution ("How deep down can the user drill?"), and data availability. In other words, you are defining SLOs for your OpViz service. (Refer to Chapter 2.)
The hallmark of a mature OpViz platform is that it can provide not only the state of the infrastructure running the application, but also the behavior of the applications running on that infrastructure. Ultimately, this should also be able to show anyone how the business is doing and how that is being affected by the infrastructure and applications on which the business is relying. With that in mind, the OpViz platform must support operations and database engineers, software engineers, business analysts, and executives.

Distributed Ephemeral Environments Trending to the Norm
We've already discussed the fact that our database instance life cycles are trending down with the adoption of virtual infrastructures. Even though they are still much longer lived than other infrastructure components, we still must be able to gather metrics for services consisting of short-lived components that are aggregated rather than individual database hosts. Figure 4-1 demonstrates a fairly stable master/replica setup for a relational datastore in which numerous activities can occur in one day. By the end of the day, we can see a completely new setup, as illustrated in Figure 4-2.

Figure 4-1. Typical master/replica setup

Figure 4-2. End of day 1

This kind of dynamic infrastructure requires us to store metrics based on roles rather than hostnames or IPs. So, instead of storing a set of metrics as DB01, we would add metrics to the "master" role, allowing us to see all master behavior even after switching to a new master. Service discovery systems do a great job of maintaining abstraction above the dynamic portions of infrastructure to facilitate this.
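As a minimal illustration of role-based metric naming, the sketch below assumes a StatsD-style daemon listening on UDP port 8125; the metric paths and role names are hypothetical:

import socket

def emit_counter(metric_path, value=1, host="127.0.0.1", port=8125):
    """Send a StatsD-style counter over UDP (fire and forget)."""
    payload = f"{metric_path}:{value}|c".encode()
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(payload, (host, port))

# Keyed by role, not by hostname, so the series survives a master failover.
emit_counter("mysql.master.writes")        # rather than "db01.writes"
emit_counter("mysql.replica.reads", 5)     # rather than "db02.reads"

When DB02 is promoted to master, it simply starts emitting under the master role, and any dashboards and alerts keyed to the role continue to work unchanged.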

Store at High Resolutions for Key Metrics
As reviewed in Chapter 2, high resolution is critical for understanding busy application workloads. At a minimum, anything related to your SLOs should be kept at one-second or lower sampling rates to ensure that you understand what is going on in the system. A good rule of thumb is to consider whether the metric has enough variability to affect your SLOs in the span of 1 to 10 seconds and to base granularity on that.
For instance, if you are monitoring a constrained resource, such as CPU, you would want to collect this data at a one-second or smaller sample given that CPU queues can build up and die down quite quickly. With latency SLOs in the milliseconds, this data must be good enough to see if CPU saturation is the reason your application latency is being affected. Queues are another area that can be missed without very frequent sampling.
Conversely, for infrequently changing items such as disk space or service availability, you can measure these in the one-minute or higher sampling rates without losing data. High sample rates consume a lot of resources, and you should be judicious in using them. Similarly, you should probably keep fewer than five different sampling rates to maintain simplicity and structure in your OpViz platform.
For an example of the impacts of a sampling rate that is too long, let's consider the graph in Figure 4-3.

Figure 4-3. Real workload showing spikes

Figure 4-3 shows two spikes followed by a long ascension. Now, if we are sampling this metric at one-minute intervals, the graph would look like Figure 4-4.

Notice now that we don't see even a single spike, and the second graph looks much more benign. In fact, our alerting threshold is not even exceeded until minute three. Assuming a one-minute schedule for storage and for alert rules checking, we wouldn't even send an alert to an operator until 7.5 minutes after the rule was violated!

Figure 4-4. Workload visualized via one-minute sampling

Keep Your Architecture Simple
It is not unusual for a growing platform to have 10,000 or more metrics being checked at various levels of granularity for any number of instances/servers that are going in and out of service in the infrastructure at any time. Your goal is to be able to rapidly answer the aforementioned questions, which means that you must continually push to improve the signal-to-noise ratio. This means being ruthless in the amount of data you allow into your system, particularly at the human points, such as presentation and alerting.
"Monitor everything" has been the watch-cry for quite a while and was a reaction to environments in which monitoring was sparse and ad hoc. The reality, though, is that distributed systems and multiservice applications create too many metrics. Early stage organizations have neither the money nor the time to manage this amount of monitoring data. Larger organizations should have the knowledge to focus on what is critical to their systems.

Focusing Your Metrics
Focus initially on metrics directly related to your SLOs. This can also be called the critical path. Based on Chapter 2, what are the metrics that should be a priority for getting into your OpViz platform? Let's take a look:

Latency
Client calls to your service. How long do they take?

Availability
How many of your calls result in errors?

Call Rates
How often are calls sent to your service?

Utilization
Looking at a service, you should know how critical resources are being utilized to ensure quality of service and capacity.

Of course, as the database reliability engineer (DBRE), you will want to immediately begin breaking out these metrics into the datastore subsystems. That only makes sense, and it's a natural evolution that will be discussed later in this chapter.

Simplicity and signal amplification also include standardization. This means standardizing templates, resolutions, retentions, and any other knobs and features presented to engineers. By doing so, you ensure that your system is easy to understand and thus user friendly for answering questions and identifying issues.
Remembering these four rules will help keep you on track with designing and building incredibly valuable and useful monitoring systems. If you find yourself violating them, ask yourself why. Unless you have a really good answer, consider going back to the foundation.

An OpViz Framework
We could write an entire book on this stuff. As you begin to gather and prepare the appropriate data to go into the OpViz platform for you to do your job, you should be able to recognize a good platform and advocate for a better one. This is our goal for this section.
Let's think of our OpViz platform as a great big distributed I/O device. Data is sent in, routed, structured, and eventually comes out the other side in useful ways that help you to understand your systems better, to identify behaviors caused by broken or soon to be broken components, and to meet your SLOs. Let's take a closer look at the process:

• Data is generated by agents on clients and then sent to a central collector for that datatype (i.e., metrics to Sensu or CollectD, events to Graphite, logs to Logstash or Splunk); a minimal sketch of this push model follows the list:
  — Occasionally a centralized monitoring system (like Sensu or Nagios) will execute checks via a pull method in addition to the aforementioned push method.
  — A benefit of distributed checking—the app generates checks and forwards them—is that there is a significantly smaller amount of configuration management required than in tightly coupled systems like Nagios for which you must configure the agent and the monitoring server together.
• The collectors store data (in systems like Graphite, InfluxDB, or ElasticSearch) or forward to event routers/processors (such as Riemann).
• Event routers use heuristics to send data to the right places.
• Data output includes long-term storage, graphs, alerts, tickets, and external calls.
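To make the push model concrete, here is a minimal sketch of an agent pushing a single sample using the Carbon plaintext protocol; it assumes a Graphite/Carbon collector listening on port 2003, and the hostname and metric path are hypothetical:

import socket
import time

def push_metric(path, value, host="graphite.example.com", port=2003):
    """Push one sample in the Carbon plaintext format: <path> <value> <timestamp>."""
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode())

# An agent on a database host might push a connection count every few seconds.
push_metric("dbre.mysql.master.threads_connected", 42)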

These outputs are where we get the true value of our OpViz stack.

Data In
To create outputs, we require good inputs. Wherever possible, use data already generated by your environments rather than artificial probes. When you simulate a user by sending a request into the system, it is called blackbox monitoring. Blackbox monitoring is sending "canary" users, or watching the inputs and outputs from the internet edge. Blackbox monitoring can be effective if you have low traffic periods or items that just don't run frequently enough for you to monitor. But, if you are generating sufficient data, getting real metrics, aka whitebox monitoring, is infinitely more appealing. Whitebox testing involves knowing a lot about your application, and at its most specific, includes instrumenting the internals of the application. Great tools for this include AppDynamics, New Relic, and Honeycomb. With tools such as these you can trace the flow of a single user through the application all the way to the database.

Blackbox Testing and Queueing Theory
With the importance of latency, even with blackbox testing we can use queueing theory and information on traffic volume and the latency of those calls to determine if the system is saturated. You can learn more about queueing theory from the VividCortex "The Essential Guide to Queueing Theory" and the University of New Mexico's page.
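As a small worked illustration with made-up numbers: Little's Law says the average number of requests in flight, L, equals the arrival rate λ multiplied by the average time each request spends in the system, W. If blackbox probes show roughly 200 requests per second arriving with an average latency of 50 ms, then L = 200 × 0.05 = 10 requests are in the system on average. If the connection pool serving that traffic has only 10 connections, it is effectively saturated, and any further increase in traffic or latency will show up as queueing.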

One benefit of this approach is that anything creating data becomes an agent. A centralized, monolithic solution that is generating checks and probes will have challenges scaling as you grow. But with whitebox testing, you've distributed this job across your entire architecture. This kind of architecture also allows new services and components to easily register and deregister with your collection layer, which is a good thing based on our OpViz rules. That being said, there are still times when having a monitoring system that can perform remote executions as a pull can be valuable, such as checking to see whether a service is up, monitoring to see whether replication is running, or checking to see whether an important configuration setting is enabled on your database hosts.

Sorting Signal from Noise
More and more, we rely on larger datasets to manage distributed systems. It has come to the point at which we are creating big data systems to manage the data we collect on our applications and the infrastructures supporting them. As discussed earlier in this chapter, the underuse of data science and advanced math is one of the greatest shortcomings of today's observability stacks. To effectively identify the signal in all of this noise, we must rely on machines to sort it out for us.
This area is still very theoretical, and most attempts at good anomaly detection have proven unusable with multimodal workloads and continuous changes to workloads due to rapid feature development and user population changes. A good anomaly detection system helps identify activity that does not fit the norm, thus directing people immediately to problems, which can reduce mean time to recover (MTTR) by getting a higher ratio of signal to noise. Following are some systems worth evaluating:

• Riemann
• Anomaly.io
• VividCortex
• Dynatrace
• Circonus
• Prometheus

OK, so we are looking to send all of this valuable data to our OpViz platform. What kind of data are we talking about anyway?

Telemetry/Metrics
Ah metrics! So diverse. So ubiquitous. A metric is the measurement of a property that an application or a component of your infrastructure possesses. Metrics are observed periodically, creating a time series that contains the property or properties, the timestamp, and the value. Some properties that might apply include the host, service, or datacenter. The true value of this data comes in observing it over time through visualizations such as graphs. Metrics are typically stored in four different ways:

Counters
These are cumulative metrics that represent how many times a specific occurrence of something has occurred.

Gauges
These are metrics that change in any direction, and indicate a current value, such as temperature, jobs in queue, or active locks.

Histograms
A number of events broken up into configured buckets to show distribution.

Summaries
These are similar to histograms but focused on providing counts over sliding windows of time.

Metrics often have mathematical functions applied to them to assist humans in deriving value from their visualizations. These functions create more value, but it is important to remember that they are derived data and that the underlying source data is just as critical. If you're tracking means per minute but do not have the underlying data, you won't be able to create means on larger windows, such as hours or days. The following are some of the functions:

• Count
• Sum
• Average
• Percentiles
• Rates of change
• Distributions
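As a minimal sketch of the four storage types, using the prometheus_client Python library; the metric names, buckets, and query helper are hypothetical:

from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: cumulative, only increases (e.g., total queries executed).
QUERIES = Counter("db_queries_total", "Total queries executed")
# Gauge: goes up and down (e.g., connections currently open).
CONNECTIONS = Gauge("db_connections_active", "Connections currently open")
# Histogram: observations bucketed to show a distribution.
LATENCY = Histogram("db_query_latency_seconds", "Query latency",
                    buckets=(0.001, 0.01, 0.1, 1.0))
# Summary: tracks the count and sum of observations.
ROWS = Summary("db_rows_returned", "Rows returned per query")

def run_query(conn, sql):
    QUERIES.inc()
    with LATENCY.time():
        with conn.cursor() as cur:
            cur.execute(sql)
            rows = cur.fetchall()
    ROWS.observe(len(rows))
    return rows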

Visualizing Distributions
Visualizing a distribution is very valuable for looking at the kind of data that is often generated in web architectures. This data is rarely distributed normally and often has long tails. It can be challenging to see this in normal graphs. But, with the ability to generate distribution maps over buckets of time, you enable new styles of visualization, such as histograms over time and flame graphs, that can really help a human operator visualize the workloads that are occurring in your systems.1

Metrics are the source for identifying symptoms of underlying issues, and thus are crucial to early identification and rapid resolution of any number of issues that might affect your SLOs.

Events
An event is a discrete action that has occurred in the environment. A change to a config is an event. A code deployment is an event. A database master failover is an event. Each of these can be signals that are used to correlate symptoms to causes.

Logs
A log is created for an event, so you can consider log events to be a subset of events. Operating systems, databases, and applications all create logs during certain events. Unlike metrics, logs can provide additional data and context to something that has occurred. For instance, a database query log can tell you when a query was executed, important metrics about that query, and even the database user who executed it.

Data Out
So, data is flowing into our systems, which is nice and all but doesn't help us answer our questions or meet our SLOs, now does it? What should we be looking to create in this OpViz framework? Let's examine this more closely:

Alerts
An alert is an interrupt to a human that instructs them to drop what they're doing and investigate a rules violation that caused the alert to be sent. This is an expensive operation and should be utilized only when SLOs are in imminent danger of violation.

1 Latency Heat Maps

Tickets/tasks
Tickets and tasks are generated when work must be done, but there is not an imminent disaster. The output of monitoring should be tickets/tasks that go in engineer queues for work.

Notifications
Sometimes you just want to record that an event has occurred to help create context for folks, such as when code deploy events are registered. Notifications will often go to a chat room, a wiki, or a collaboration tool to make it visible without interrupting workflow.

Automation
There are times when data, particularly utilization data, advises of the need for more or less capacity. Autoscaling groups can be called to modify resource pools in such cases. This is but one example of automation as an output of monitoring.

Graphs are one of the most common outputs of OpViz. These are collected into dashboards that suit the needs of a particular user community and are a key tool on which humans can perform analysis.

Bootstrapping Your Monitoring
If you are like most rational people, you might be beginning to feel overwhelmed by all of these things that should be happening. That is normal! This is a good time to remind you that everything we build here is part of an iterative process. Start small, let things evolve, and add in more as you need it.
Nowhere is this more true than in a startup environment. As a new startup, you begin with zero. Zero metrics, zero alerting, zero visibility—just a bunch of engineers cranking out overly optimistic code. Many startups somehow end up with an instance somewhere in a public cloud that was a prototype or testbed and then it somehow turned into their master production database. Head? Meet desk!
Maybe you were just hired as the first Ops/database engineer at a young startup and you're taking stock of what the software engineers have built around monitoring or visibility, and it's effectively…zero. Sound familiar? If you have any experience with startups, it should. It's nothing to be ashamed of! This is how startup sausage gets made. A startup that began by building out an elaborate operational visibility ecosystem in advance of their actual needs would be a stupid startup. Startups succeed by focusing hard on their core product, iterating rapidly, aggressively seeking out customers, responding to customer feedback and production realities, and making difficult decisions about where to spend

their precious engineering resources. Startups succeed by instrumenting elaborate performance visibility systems as soon as they need them, not before. Startups fail all the time, but usually not because the engineers failed to anticipate and measure every conceivable storage metric in advance. What we need to begin with is a Minimum Viable Monitoring Set.

Enumerating Moving Parts of the Database
You can think of your data as a stream from clients to databases. At the highest of levels, the database exists to take in data, to hold data, and to serve data back:

• Data in client memory
• Data across the wire between client and datastore
• Data in your database's memory structures
• Data in your OS and disk memory structures
• Data on your disks
• Data in backups and archival

Everything we seek to measure about our databases can boil down to:

• How long does it take to get data out, and why does it take that long?
• How long does it take to put data in, and why does it take that long?
• Is the data safely stored, and how is it stored?
• Is the data available in redundant locations in case primary retrieval fails?

Of course, this is quite simplistic, but it is a good top-level structure to think about while we dig into the following sections.

There are an infinite number of metrics that you can monitor between the database, system, storage, and various application layers. In the physiological needs state, you should be able to determine if your database is up or down. As you work toward fulfilling the "esteem" state, you begin by monitoring other symptoms that you have identified that correlate with real problems, such as connection counts or lock percentages. One common progression looks like this:

• Monitor if your databases are up or down (pull checks); a minimal sketch of such a check follows this list.
• Monitor overall latency/error metrics and end-to-end health checks (push checks).
• Instrument the application layer to measure latency/errors for every database call (push checks).

• Gather as many metrics as possible about the system, storage, database, and app layers, regardless of whether you think they will be useful. Most operating systems, services, and databases will have plug-ins that are fairly comprehensive.
• Create specific checks for known problems. For example, checks based on losing x percent of database nodes or a global lock percent that is too high (do this iteratively as well as proactively; see Chapter 3).
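As a minimal sketch of the first step in the progression, a pull-style aliveness check might look like the following, assuming the PyMySQL driver and hypothetical host and credential values:

import pymysql

def database_is_up(host, user, password, timeout=3):
    """Return True if we can connect and run a trivial query within the timeout."""
    try:
        conn = pymysql.connect(host=host, user=user, password=password,
                               connect_timeout=timeout)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                return cur.fetchone() is not None
        finally:
            conn.close()
    except pymysql.MySQLError:
        return False

# A monitoring system would call this on a schedule and alert on repeated failures.
print(database_is_up("db01.example.com", "monitor", "secret"))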

Sometimes you can take a shortcut to the "esteem" level by plugging in third-party monitoring services like VividCortex, Circonus, Honeycomb, or New Relic. But it's kind of a hack if you're storing these database metrics in a system separate from the rest of your monitoring. Storing in disparate systems makes it more challenging to correlate symptoms across multiple monitoring platforms. We're not saying this is bad or that you shouldn't do it; elegant hacks can take you a really long way! But the "self-actualization" phase generally includes consolidating all monitoring feeds into a single source of truth.
Okay. Now that you've safeguarded against your company going out of business when you lose a disk or an engineer makes a typo, you can begin asking yourself questions about the health of your service. As a startup, the key questions to ask yourself are: "Is my data safe?" "Is my service up?" and "Are my customers experiencing pain?" This is your minimum viable product monitoring set.

Is the Data Safe?
For any mission-critical data that you truly care about, you should avoid running with less than three live copies. That's one primary and two-plus secondaries for leader-follower data stores like MySQL or MongoDB, or a replication factor of three for distributed data stores like Cassandra or Hadoop. Because you never, ever want to find yourself in a situation in which you have a single copy of any data you care about, ever. This means that you need to be able to lose one instance while still maintaining redundancy, which is why three is the minimum number of copies, not two. Even when you are penny-pinching and worrying about your run rate every month as a baby startup, mission-critical data is not the appropriate place to cut those costs. (We discuss availability architecture in Chapter 5, Infrastructure Engineering.)
But not all data is equally precious! If you can afford to lose some data, or if you could reconstruct the data from immutable logs if necessary, running with n + 1 copies (where n is the required number of nodes for normal activity) is perfectly okay. This is a judgment call—only you can know how critical and how irreplaceable each dataset is for your company, and how tight your financial resources are.
You also need backups, and you need to regularly validate that the backups are restorable and that the backup process is completing successfully. If you aren't monitoring that your backups are good, you cannot assume that your data is safe.

Sample Data Safety Monitors
Some examples of safety checks to include in your monitoring are:

• Three data nodes up
• Replication threads running
• Replication on at least one node <1 second behind (a minimal sketch of this check follows the list)
• Most recent backup success
• Most recent automated replica rebuild from backups is a success
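A minimal sketch of the replication-lag check, assuming a MySQL replica, the PyMySQL driver, and hypothetical connection details; Seconds_Behind_Master comes from SHOW SLAVE STATUS and is NULL when replication is broken:

import pymysql

def replica_lag_seconds(conn):
    """Return Seconds_Behind_Master, or None if replication is stopped or broken."""
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    if not status:
        return None
    return status.get("Seconds_Behind_Master")

conn = pymysql.connect(host="replica01.example.com", user="monitor", password="secret")
lag = replica_lag_seconds(conn)
if lag is None or lag >= 1:
    print("ALERT: replica lagging or replication stopped:", lag)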

Is the Service Up?
End-to-end checks are the most powerful tool in your arsenal because they most closely reflect your customer experience. You should have a top-level health check that exercises not just the aliveness of the web tier or application tier, but all the database connections in the critical path. If your data is partitioned across multiple hosts, the check should fetch an object on each of the partitions, and it should automatically detect the full list of partitions or shards so that you do not need to manually add new checks any time you add more capacity.
However—and this is important—you should have a simpler aliveness check for your load balancers to use that does not exercise all of your database connections. Otherwise, you can easily end up health-checking yourself to death.
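A minimal sketch of such an end-to-end check; the shard discovery helper, hosts, credentials, and the users table are all hypothetical, but the key points stand: the shard list is discovered rather than hard-coded, and a real object is fetched from each partition.

import pymysql

def discover_shards():
    """Hypothetical: ask the service discovery layer for the current shard hosts."""
    return ["shard01.example.com", "shard02.example.com"]

def healthcheck():
    failures = []
    for host in discover_shards():
        try:
            conn = pymysql.connect(host=host, user="health", password="secret",
                                   connect_timeout=2)
            with conn.cursor() as cur:
                # Fetch a real object from each partition, not just "is the port open".
                cur.execute("SELECT id FROM users ORDER BY id LIMIT 1")
                cur.fetchone()
            conn.close()
        except pymysql.MySQLError as exc:
            failures.append((host, str(exc)))
    return failures  # an empty list means healthy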

Excessive Health Checking
Charity once worked on a system for which a haproxy health check endpoint did a simple SELECT LIMIT 1 from a mysql table. One day, they doubled the capacity of some stateless services, thus doubling the number of proxy servers running these health checks. Adding capacity to other systems accidentally took the entire site down by overloading the database servers with health checks. More than 95% of all database queries were those stupid health checks. Don't do that!

Speaking of lessons learned the hard way, you should always have some off-premises monitoring—if nothing else, an offsite health check for your monitoring service itself. It doesn't matter how amazing and robust your on-premises monitoring ecosystem is if your datacenter or cloud region goes down and takes your entire monitoring apparatus with it. Setting up an external check for each major product or service, as well as a health check on the monitoring service itself, is a good best practice.

Sample Database Availability Monitors
Here are some examples of ways to measure whether your system is available or close to unavailability:

• Health check at the application level that queries all frontend datastores
• Query run against each partition in each datastore member, for each datastore
• Imminent capacity issues
  — Disk capacity
  — Database connections
• Error log scraping
  — DB restarts (faster than your monitor!)
  — Corruption

Are the Consumers in Pain?
Okay, you are monitoring that your service is alive. The patient has a heartbeat. Good job!
But what if your latency subtly doubles or triples, or what if 10% of your requests are erroring in a way that cleverly avoids triggering your health check? What if your database is not writable but can be read from, or the replicas are lagging, which is causing your majority write concern to hang? What if your RAID array has lost a volume and is running in degraded mode, you have an index building, or you are experiencing hot spotting of updates to a single row?
Well, this is why systems engineering, and databases in particular, are so much fun. There are infinite ways your systems can fail, and you can probably only guess about five percent of them in advance. Yay!
This is why you should gradually develop a library of comprehensive high-level metrics about the health of the service—health checks, error rates, latency. Anything that materially affects and disrupts your customer experience. And then? Go work on something else for a while and see what breaks. We are almost entirely serious. As discussed in Chapter 3, there is only so much to be gained by sitting around trying to guess how your service is going to break. You just don't have the data yet. You might as well go build more things, wait for things to break, and then pay a lot of attention when things actually begin failing.

Now that we've provided a bootstrapping method and an evolution method, let's break down what you should be measuring, with a focus on what you as the DBRE need.

Instrumenting the Application
Your application is the first place to begin. Although we can measure most things at the datastore layer, the first leading indicators of problems should be changes in user and application behavior. Between application instrumentation by your engineers and application performance management (APM) solutions such as New Relic and AppDynamics, you can get a tremendous amount of data for everyone in the organization:

• You should already be measuring and logging all requests and responses to pages or API endpoints.
• You should also be doing this for all external services, which includes databases, search indexes, and caches.
• Any jobs or independent workflows should be similarly monitored.
• Any independent, reusable code such as a method or function that interacts with databases, caches, and other datastores should be similarly instrumented.
• Monitor how many database calls are executed by each endpoint, page, or function/method.

Tracking the data access code (such as SQL calls) called by each operation allows for rapid cross-referencing to more detailed query logs within the database. This can prove challenging with object-relational mapping systems (ORMs), for which SQL is dynamically generated.

SQL Comments
When doing SQL tuning, a big challenge is mapping SQL running in the database to the specific place in the codebase from which it is being called. In many database engines, you can add comments to your SQL statements, and these comments will show up in the database query logs. This is a great place to insert the codebase location.

Distributed Tracing
Tracing performance at all stages from the application to the datastore is critical for optimizing long-tail latency issues that can be difficult to capture. Systems like New Relic or Zipkin (open source) allow for distributed traces from application calls to external services, such as your databases. A full transaction trace from the application

to datastore should ideally give timing for all external service calls, not just the database query.
Tracing with full visibility through to the database can become a powerful arsenal in educating your software engineer (SWE) teams and creating autonomy and self-reliance. Rather than needing you to tell them where to focus, they are able to get the information themselves. As Aaron Morton at The Last Pickle says in his talk, "Replacing Cassandra's Tracing with Zipkin":

Knowing in advance which tools create such positive cultural shifts is basically impossible to foretell, but I've seen it with Git and its practice of pull requests and stable master branches, and I've seen it with Grafana, Kibana, and Zipkin.

You can read more about this on The Last Pickle's blog. There are many components of an end-to-end call that can occur and be of interest to the DBRE. These include, but are not limited to, the following:

• Establishing a connection to a database or a database proxy
• Queuing for a connection in a database connection pool
• Logging a metric or event to a queuing or message service
• Creating a user ID from a centralized UUID service
• Selecting a shard based on a variable (such as user ID)
• Searching, invalidating, or caching at a cache tier
• Compressing or encrypting data at the application layer
• Querying a search layer

Traditional SQL Analysis
Laine here. In my consulting days, I can't tell you the number of times I'd come into a shop that had no monitoring that mapped application performance monitoring to database monitoring. I'd invariably have to do TCP or log-based SQL gathering to create a view from the database. Then, I'd go back to the SWEs with my prioritized list of SQL to optimize, and they'd have no idea where to go to fix that code. Searching code bases could take a week or more of precious time.
As a DBRE, you have an amazing opportunity to work side by side with SWEs to ensure that every class, method, function, and job has direct mappings to the SQL that is being called. When SWEs and DBREs use the same tools, DBREs can teach at key inflection points, and soon you'll find SWEs doing your job for you!
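One lightweight way to build that mapping, in the spirit of the SQL Comments note earlier, is to tag each query with its code location before sending it to the database. A minimal sketch, with hypothetical function names, table, and query text:

def tag_query(sql, module, function):
    """Append a comment identifying the calling code so it shows up in query logs."""
    return f"{sql} /* app:{module} fn:{function} */"

def get_open_invoices(conn, customer_id):
    sql = tag_query(
        "SELECT id, total FROM invoices WHERE customer_id = %s AND status = 'open'",
        module="billing.reports",
        function="get_open_invoices",
    )
    with conn.cursor() as cur:
        cur.execute(sql, (customer_id,))
        return cur.fetchall()

A DBRE reading the slow query log can then jump straight to billing.reports.get_open_invoices rather than grepping the codebase.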

If a transaction has a performance "budget" and the latency requirements are known, the staff responsible for every component are incentivized to work as a team to identify the most expensive aspects and make the appropriate investments and compromises to get there.

Events and Logs
It goes without saying that all application logs should be collected and stored. This includes stack traces. Additionally, there are numerous events that will occur that are incredibly useful to register with OpViz, such as the following:

• Code deployments
• Deployment time
• Deployment errors

Application monitoring is a crucial first step, providing realistic looks at behavior from the user's perspective, and is directly related to latency SLOs. These are the symptoms providing clues into faults and degradations within the environment. Now, let's look at the supporting data that can help with diagnostics and provisioning: host data.

Instrumenting the Server or Instance
Next is the individual host, real or virtual, on which the database instance resides. It is here that we can get all of the data regarding the virtual and physical resources devoted to running our databases. Even though this data is not specifically application/service related, it is valuable to use when you've seen symptoms such as latency or errors in the application tier.
When using this data to identify causes for application anomalies, the goal is to find resources that are over or underutilized, saturated, or throwing errors (USE, as Brendan Gregg defined in his methodology). This data is also crucial for capacity planning for growth and performance optimization. Recognizing a bottleneck or constraint allows you to prioritize your optimization efforts to maximize value.

Distributed Systems Aggregation
Keep in mind that individual host data is not especially useful, other than for indicating that a host is unhealthy and should be culled from the herd. Rather, think about your utilization, saturation, and errors from an aggregate perspective for the pool of hosts performing the same function. In other words, if you have 20 Cassandra hosts, you are mostly interested in the overall utilization of the pool, the amount of waiting (saturation) that is going on, and any errors or faults that are occurring. If errors are isolated to one host, it is time to remove that one from the ring and replace it with a new host.

On a Linux system, a good starting place for resource monitoring includes the following:

• CPU
• Memory
• Network interfaces
• Storage I/O
• Storage capacity
• Storage controllers
• Network controllers
• CPU interconnect
• Memory interconnect
• Storage interconnect

Understanding Your Operating System
We cannot overemphasize just how valuable it is to dig deeply into the operating characteristics of your operating system. Although many database specialists leave this to system administrators, there is simply too tight a relationship between database service levels and the operating system not to dive in.
A perfect example of this is how Linux fills all of your memory with page cache, which makes the "free memory" gauge virtually useless for monitoring your memory usage. Page scans per second becomes a much more useful metric in this case, which is not obvious without a deeper understanding of how Linux memory management works.
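As a minimal sketch of what "page scans per second" can look like in practice, assuming a Linux host; /proc/vmstat field names vary somewhat by kernel version, so this simply sums any counter whose name starts with pgscan:

import time

def read_pgscan():
    """Sum all pgscan* counters from /proc/vmstat (cumulative page scans)."""
    total = 0
    with open("/proc/vmstat") as f:
        for line in f:
            name, _, value = line.partition(" ")
            if name.startswith("pgscan"):
                total += int(value)
    return total

# Sample twice and report the per-second scan rate.
before = read_pgscan()
time.sleep(5)
after = read_pgscan()
print("page scans/sec:", (after - before) / 5)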

In addition to hardware resource monitoring, operating system software has a few items to track:

• Kernel mutex
• User mutex
• Task capacity
• File descriptors

If this is new to you, we suggest going to Brendan Gregg's USE page for Linux because it is incredibly detailed in regard to how to monitor this data. It's obvious that a significant amount of time and effort went into the data he presents.

Events and Logs
In addition to metrics, you should be sending all logs to an appropriate event processing system such as rsyslog or Logstash. This includes kernel, cron, authentication, mail, and general message logs, as well as process- or application-specific logs to ingest, such as those from MySQL or nginx.
Your configuration management and provisioning processes should also be registering critical events to your OpViz stack. Here is a decent starting point:

• A host being brought into or out of service
• Configuration changes
• Host restarts
• Service restarts
• Host crashes
• Service crashes

Cloud and Virtualized Systems
There are a few extra items to consider in these environments.
Cost! You are spending money on demand in these environments, rather than the up-front spend that you might be used to in datacenter environments. Being cost effective and efficient is crucial.
When monitoring CPU, monitor "steal time." This is time that the virtual CPU is waiting on real CPU, which is being used elsewhere. High steal times (10% or more over sustained periods) are indicators that there is a noisy neighbor in your environment! If steal time is the same across all of your hosts, this probably means that you are the culprit, and you might need to add more capacity and/or rebalance.

If steal time is high on only one or a few hosts, this means that some other tenant is stealing your time! It's best to kill that host and launch a new one. The new one will hopefully be deployed somewhere else and will perform much better.

If you can get the preceding into your OpViz stack, you will be in great shape for understanding what's going on at the host and operating-system levels of the stack. Now, let's look at the databases themselves.

Instrumenting the Datastore
What do we monitor and track in our databases, and why? Some of this will depend on the kind of datastore. We focus here on areas that are generic enough to be universal, but specific enough to help you track to your own databases. We can break this down into four areas:

• Datastore connection layer
• Internal database visibility
• Database objects
• Database calls/queries

Each of these will get its own section, beginning with the datastore connection layer.

Datastore Connection Layer
We have discussed the importance of tracking the time it takes to connect to the backend datastore as part of the overall transaction. A tracing system should also be able to break out time talking to a proxy and time from the proxy to the backend as well. You can capture this via tcpdump and Tshark/Wireshark for ad hoc sampling if something like Zipkin is not available. You can automate this for occasional sampling or run it ad hoc.
If you are seeing latency and/or errors between the application and the database connection, you will require additional metrics to help identify causes. Taking the aforementioned USE method we recommended, let's see what other metrics can assist us.

Utilization
Databases can support only a finite number of connections. The maximum number of connections is constrained in multiple locations. Database configuration parameters will direct the database to accept only a certain number of connections, setting an artificial top boundary to minimize overwhelming the host. Tracking this maximum

as well as the actual number of connections is crucial because it might be set arbitrarily low by a default configuration.
Connections also open resources at the operating system level. For instance, PostgreSQL uses one Unix process per connection. MySQL, Cassandra, and MongoDB use a thread per connection. All of them use memory and file descriptors. So, there are multiple places we want to look at to understand connection behaviors:

• Connection upper bound and connection count
• Connection states (working, sleeping, aborted, and others)
• Kernel-level open file utilization
• Kernel-level max processes utilization
• Memory utilization
• Thread pool metrics, such as MySQL table cache or MongoDB thread pool utilization
• Network throughput utilization
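A minimal sketch of the first item for MySQL, comparing the configured ceiling with the current connection count; it assumes PyMySQL and hypothetical connection details, and reads the standard Threads_connected status variable and max_connections system variable:

import pymysql

def connection_utilization(conn):
    """Return (current, maximum, percent utilized) for MySQL connections."""
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
        current = int(cur.fetchone()[1])
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
        maximum = int(cur.fetchone()[1])
    return current, maximum, 100.0 * current / maximum

conn = pymysql.connect(host="db01.example.com", user="monitor", password="secret")
print("connections: %d of %d (%.1f%%)" % connection_utilization(conn))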

This should inform you as to whether you have a capacity or utilization bottleneck somewhere in the connection layer. If you are seeing 100% utilization and saturation is also high, this is a good indicator. But, low utilization combined with saturation is also an indicator of a bottleneck somewhere. High, but not full, utilization of resources often has a significant impact on latency as well.

Saturation
Saturation is often most useful when paired with utilization. If you are seeing a lot of waits for resources that are also showing 100% utilization, you are seeing a pretty clear capacity issue. However, if you are seeing waits/saturation without full utilization, there might be a bottleneck elsewhere that is causing the stack up. Saturation can be measured at these inflection points:

• TCP connection backlog
• Database-specific connection queuing, such as MySQL back_log
• Connection timeout errors
• Waiting on threads in the connection pools
• Memory swapping
• Database processes that are locked

Queue length and wait timeouts are crucial for understanding saturation. Any time you find connections or processes waiting, you have an indicator of a potential bottleneck.

Errors
With utilization and saturation, you can determine whether capacity constraints and bottlenecks are affecting the latency of your database connection layer. This is great information for deciding whether you need to increase resources, remove artificial configuration constraints, or make some architectural changes. Errors should also be monitored and used to help eliminate or identify faults and/or configuration problems. Errors can be captured as follows:

• Database logs will provide error codes when database-level failures occur. Sometimes you have configurations with various degrees of verbosity. Make sure you have logging verbose enough to identify connection errors, but do be careful about overhead, particularly if your logs are sharing storage and I/O resources with your database.
• Application and proxy logs will also provide rich sources of errors.
• Host errors discussed in the previous section should also be utilized here.

Errors will include network errors, connection timeouts, authentication errors, connection terminations, and much more. These can point to issues as varied as corrupt tables, reliance on DNS, deadlocks, auth changes, and so on.
By utilizing application latency/error metrics, tracing, and appropriate telemetry on utilization, saturation, and specific error states, you should have the information you need to identify degraded and broken states at the database connection layer. Next, we will look at what to measure inside of the connections.

Troubleshooting Connection Speeds, PostgreSQL
Instagram is one of the companies that chose PostgreSQL to be their relational database. It chose to use a connection pooler, PgBouncer, to increase the number of application connections that could connect to its databases. This is a proven scaling mechanism for increasing the number of connections to a datastore, and considering that PostgreSQL must spawn a new Unix process for every connection, new connections are slow and expensive.
Using the psycopg2 Python driver, the company was working with the default of autocommit=False. This means that even for read-only queries, explicit BEGINs and COMMITs were being issued. By changing autocommit to True, the company reduced its query latency, which also reduced queuing for connections in the pool. This would initially show up as increased latency in the application as the pool was 100% utilized, causing queues to increase.
By looking at the connection layer metrics and monitoring PgBouncer pools, it was clear that the waiting pool was increasing due to saturation, and that the active pool was fully utilized most of the time. With no other

metrics showing significant utilization/saturation and with errors clear, it was time to look at what was going on inside of the connection that was slowing down queries. We will look into that in the next section.
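As an illustration of the change described in this sidebar, here is a minimal sketch of enabling autocommit with psycopg2 (the connection details and the ponies table are hypothetical). With autocommit enabled, read-only queries no longer hold an open transaction, so pooled connections return to the active pool sooner:

import psycopg2

conn = psycopg2.connect(host="pgbouncer.example.com", port=6432,
                        dbname="app", user="app", password="secret")
conn.autocommit = True  # no implicit BEGIN; each statement commits on its own

with conn.cursor() as cur:
    cur.execute("SELECT id, name FROM ponies WHERE id = %s", (42,))
    row = cur.fetchone()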

Internal Database Visibility
When we look inside of the database, we can see that there is a substantial increase in the number of moving parts, the number of metrics, and the overall complexity. In other words, this is where things start to get real! Again, let's keep in mind USE. Our goal is to understand bottlenecks that might be affecting latency, constraining requests, or causing errors.
It is important to be able to look at this from an individual host perspective and in aggregate by role. Some databases, like MySQL, PostgreSQL, ElasticSearch, and MongoDB, have master and replica roles. Cassandra and Riak have no specific roles, but they are often distributed by region or zone. That too is important to aggregate by.

Throughput and Latency Metrics
How many and what kind of operations are occurring in the datastores? This data is a very good high-level view of database activity. As SWEs put in new features, these workloads will shift and provide good indicators of how the workload is shifting. Some examples of metrics to collect to understand these shifting workloads include the following:

• Reads
• Writes
  — Inserts
  — Updates
  — Deletes
• Other operations
  — Commits
  — Rollbacks
  — DDL statements
  — Other administrative tasks

When we discuss latency here, we are talking in the aggregate only, meaning averages. We will discuss granular and more informative query monitoring further on in this section. Thus, you are getting no outliers in this kind of data, only very basic workload information.

Commits, Redo, and Journaling
Although the specific implementations will depend on the datastore, there are almost always a set of I/O operations involved in flushing data to disk. In MySQL's InnoDB storage engine and in PostgreSQL, writes are changed in the buffer pool (memory) and operations are recorded in a redo log (or write-ahead log in PostgreSQL). Background processes will then flush this to disk while maintaining checkpoints for recovery. In Cassandra, data is stored in a memtable (memory), while a commit log is appended to. Memtables are flushed periodically to an SSTable. SSTables are periodically compacted, as well. Following are some metrics you might monitor:

• Dirty buffers (MySQL)
• Checkpoint age (MySQL)
• Pending and completed compaction tasks (Cassandra)
• Tracked dirty bytes (MongoDB)
• (Un)Modified pages evicted (MongoDB)
• log_checkpoints configuration (PostgreSQL)
• pg_stat_bgwriter view (PostgreSQL)

All checkpointing, flushing, and compaction are operations that have significant performance impacts on activity in the database. Sometimes, the impact is increased I/O, and sometimes it can be a full stop of all write operations while a major operation occurs. Gathering metrics here allows you to tune specific configurables to minimize the effects that will occur during such operations. So, in this case, when we see latency increasing and see metrics related to flushing showing excessive background activity, we will be pointed toward tuning operations related to these processes.

Replication State
Replication is the copying of data across multiple nodes so that the data on one node is identical to another. It is a cornerstone of availability and read scaling as well as a part of disaster recovery and data safety. There are three replication states that can occur, however, that are not healthy and can lead to big problems if they are not monitored and caught. We discuss replication in detail in Chapter 10.
Replication latency is the first of the fault states. Sometimes, the application of changes to other nodes can slow down. This can be the result of network saturation, single-threaded applies that cannot keep up, or any number of other reasons. Occasionally, replication will never catch up during peak activity, causing the data to be hours old on the replicas. This is dangerous because stale data can be served, and if you are using this replica as a failover, you can lose data.

Most database systems have easily tracked replication latency metrics; you can see the difference between the timestamp on the master and the timestamp on the replica. In systems like Cassandra, with eventually consistent models, you are looking for backlogs of operations used to synchronize replicas after unavailability. For instance, in Cassandra, this is hinted handoffs.
Broken replication is the second of the fault states. In this case, the processes required to maintain data replication simply break due to any number of errors. Resolution requires rapid response facilitated by appropriate monitoring, followed by repair of the cause of the errors, and replication allowed to resume and catch up. In this case, you can monitor the state of replication threads.
The last error state is the most insidious: replication drift. In this case, data has lost synchronization, causing replication to be useless and potentially dangerous. Identifying replication drift for large datasets can be challenging and depends on the workloads and kind of data that you are storing.
For instance, if your data is relatively immutable and insert/read operations are the norm, you can run checksums on data ranges across replicas and then compare checksums to see if they are identical. You can do this in a rolling method behind replication, allowing for an easy safety check at the cost of extra CPU utilization on the database hosts. If you are doing a lot of mutations, however, this proves more challenging because you must either repeatedly run checksums on data that has already been reviewed or just do occasional samples.
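A minimal sketch of the range-checksum idea (tools like pt-table-checksum do this far more carefully in production). It assumes PyMySQL connections to a master and a replica and a hypothetical immutable events table keyed by id:

import hashlib
import pymysql

def range_checksum(conn, lo, hi):
    """Checksum the rows with lo <= id < hi in a deterministic order."""
    digest = hashlib.md5()
    with conn.cursor() as cur:
        cur.execute("SELECT id, payload FROM events WHERE id >= %s AND id < %s ORDER BY id",
                    (lo, hi))
        for row in cur.fetchall():
            digest.update(repr(row).encode())
    return digest.hexdigest()

master = pymysql.connect(host="db-master.example.com", user="monitor", password="secret", db="app")
replica = pymysql.connect(host="db-replica01.example.com", user="monitor", password="secret", db="app")

for lo in range(0, 1_000_000, 100_000):   # walk the table in chunks behind replication
    if range_checksum(master, lo, lo + 100_000) != range_checksum(replica, lo, lo + 100_000):
        print("replication drift detected in id range", lo, "-", lo + 100_000)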

76 | Chapter 4: Operational Visibility Concurrency Often these structures have their own serialization methods, such as mutexes, that can become bottlenecks. Understanding saturation of these components can help with optimization as well. Some systems, like Cassandra, use Java Virtual Machines (JVMs) for managing mem‐ ory, exposing whole new areas to monitor. Garbage collection and usage of the vari‐ ous object heap spaces are also critical in such environments. Locking and Concurrency Relational databases in particular utilize locks to maintain concurrent access between sessions. Locking allows mutations and reads to occur while guaranteeing that noth‐ ing might be changed by other processes. Even though this is incredibly useful, it can lead to latency issues as processes stack up waiting for their turn. In some cases, you can have processes timing-out due to deadlocks, for which there is simply no resolu‐ tion for the locks that have been put in place but to roll back. The details of locking implementations are reviewed in Chapter 11. Monitoring locks includes monitoring the amount of time spent waiting on locks in the datastore. This can be considered a saturation metric, and longer queues can indi‐ cate application and concurrency issues or underlying issues that affect latency, with sessions holding locks taking longer to complete. Monitoring rollbacks and deadlocks is also crucial because it is another indicator that applications are not releasing locks cleanly, causing waiting sessions to timeout and roll back. Rollbacks can be part of a normal, well-behaved transaction, but they often are a leading indicator that some underlying action is affecting transactions. As discussed in the memory structures section earlier, there are also numerous points in the database that function as synchronization primitives designed to safely manage concurrency. These are generally either mutexes or semaphores. A mutex (Mutually Exclusive Lock) is a locking mechanism used to synchronize access to a resource such as a cache entry. Only one task can acquire the mutex. This means that there is own‐ ership associated with mutexes, and only the owner can release the lock (mutex). This protects from corruption. A semaphore restricts the number of simultaneous users of a shared resource up to a maximum number. Threads can request access to the resource (decrementing the semaphore) and can signal that they have finished using the resource (incrementing the semaphore). Examples of using mutexes/semaphores to monitor MySQL’s InnoDB storage engine are listed in Table 4-1.

Table 4-1. InnoDB semaphore activity metrics

Mutex Os Waits (Delta): The number of InnoDB semaphore/mutex waits yielded to the OS.
Mutex Rounds (Delta): The number of InnoDB semaphore/mutex spin rounds for the internal sync array.
Mutex Spin Waits (Delta): The number of InnoDB semaphore/mutex spin waits for the internal sync array.
Os Reservation Count (Delta): The number of times an InnoDB semaphore/mutex wait was added to the internal sync array.
Os Signal Count (Delta): The number of times an InnoDB thread was signaled using the internal sync array.
Rw Excl Os Waits (Delta): The number of exclusive (write) semaphore waits yielded to the OS by InnoDB.
Rw Excl Rounds (Delta): The number of exclusive (write) semaphore spin rounds within the InnoDB sync array.
Rw Excl Spins (Delta): The number of exclusive (write) semaphore spin waits within the InnoDB sync array.
Rw Shared Os Waits (Delta): The number of shared (read) semaphore waits yielded to the OS by InnoDB.
RW Shared Rounds (Delta): The number of shared (read) semaphore spin rounds within the InnoDB sync array.
RW Shared Spins (Delta): The number of shared (read) semaphore spin waits within the InnoDB sync array.
Spins Per Wait Mutex (Delta): The ratio of InnoDB semaphore/mutex spin rounds to mutex spin waits for the internal sync array.
Spins Per Wait RW Excl (Delta): The ratio of InnoDB exclusive (write) semaphore/mutex spin rounds to spin waits within the internal sync array.
Spins Per Wait RW Shared (Delta): The ratio of InnoDB shared (read) semaphore/mutex spin rounds to spin waits within the internal sync array.

Increasing values in these can indicate that your datastores are reaching concurrency limits on specific areas in the code base. You can resolve this via tuning configurables and/or by scaling out to maintain sustainable concurrency on a datastore to satisfy traffic requirements. Locking and concurrency can truly kill even the most performant of queries once you start experiencing a tipping point in scale. By tracking and monitoring these metrics during load tests and in production environments, you can understand the limits of your database software and identify how your own applications must be optimized to scale up to large numbers of concurrent users. Database Objects It is crucial to understand what your database looks like and how it is stored. At the simplest level, this is an understanding of how much storage each database object and its associated keys/indexes takes. Just like filesystem storage, understanding the rate of growth and the time to reaching the upper boundary is as crucial, if not more, than the current storage usage. In addition to understanding the storage and growth, monitoring the distribution of critical data is helpful. For instance, understanding the high and low bounds, means and cardinality of data is helpful to understanding index and scan performance. This

78 | Chapter 4: Operational Visibility is particularly important for integer datatypes and low cardinality character-based datatypes. Having this data at your SWE fingertips allows you and them to recognize optimizations on datatypes and indexing. If you have sharded your dataset using key ranges or lists, understanding the distribu‐ tion across shards can help ensure you are maximizing output on each node. These sharding methodologies allow for hot spots because they are not even distributions using a hash or modulus approach. Recognizing this will advise you and your team on needs to rebalance or reapproach your sharding models. Database Queries Depending on the database system you are working with, the actual data access and manipulation activity can prove to be highly instrumented or not at all. Trying to drink at the firehose of data that results in logging queries in a busy system can cause critical latency and availability issues to your system and users. Still, there is no more valuable data than this. Some solutions, such as Vivid Cortex and Circonus, have focused on TCP and wire protocols for getting the data they need, which dramatically reduces performance impact of query logging. Other methods include sampling on a less loaded replica, only turning logging on for fixed periods of time or only logging statements that execute slowly. Regardless of all this, you want to store as much as possible about the performance and utilization of your database activity. This will include the consumption of CPU and IO, number of rows read or written, detailed execution times and wait times, and execution counts. Understanding optimizer paths, indexes used, and statistics around joining, sorting, and aggregating is also critical for optimization. Database Asserts and Events Database and client logs are a rich source of information, particularly for asserts and errors. These logs can give you crucial data that can’t be monitored any other way, such as the following:

• Connection attempts and failures • Corruption warnings and errors • Database restarts • Configuration changes • Deadlocks • Core dumps and stack traces
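As a minimal sketch of pulling a few of the items above out of a database error log (the log path and the matched message strings are assumptions based on MySQL-style logs; adjust them for your datastore and log format):
wtf@host:~$ grep -c "Deadlock found" /var/log/mysql/error.log        # deadlocks reported in the log
wtf@host:~$ grep -c "ready for connections" /var/log/mysql/error.log # startup messages, a proxy for restarts
wtf@host:~$ grep -i "corrupt" /var/log/mysql/error.log               # corruption warnings and errors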

You can aggregate some of this data and push it to your metrics systems. You should treat other entries as events to be tracked and used for correlations.
Wrapping Up
Well, after all of that, I think we all need a break! You've come out of this chapter with a solid understanding of the importance of operational visibility, how to start an OpViz program, and how to build and evolve an OpViz architecture. You can never have enough information about the systems you are building and running. You will also quickly find that the systems built to observe your services become a large part of your operational responsibilities! They deserve just as much attention as every other component of the infrastructure.

CHAPTER 5 Infrastructure Engineering

Let’s begin the process of actualizing our database clusters for consumption by appli‐ cations and analysts. Previously, we’ve discussed a lot of the preparatory work: service-level expectations, risk analysis, and, of course, operational visibility. In the next two chapters, we discuss the techniques and patterns for designing and building the environments. In this chapter, we discuss the various hosts a datastore can run on, including server‐ less or Database as a Service options. We discuss the various storage options available to those datastores. Hosts As discussed previously, your datastores do not exist in a vacuum. They must always run as processes on some host. Traditionally, database hosts were physical servers. Over the past decade, our options have grown, including virtual hosts, containers, and even abstracted services. Let’s look at each in turn, discuss the pros and cons of using them for databases, and examine some specific implementation details. Physical Servers In this context, a physical server is a host that has an operating system and is 100% dedicated to running services directly from that operating system. In immature envi‐ ronments, a physical server might run many services on it while traffic is low and resources are abundant. One of the first steps a database reliability engineer (DBRE) will take is to separate datastores to their own servers. The workloads required by a datastore are generally quite intensive on CPU, RAM, and Storage/IO. Some applica‐ tions will be more CPU-bound or IO-bound, but you don’t want them competing

with other applications for resources. Tuning these workloads is also quite specific and thus requires isolation to accomplish properly. When you run a database on a dedicated physical host, your database will be interacting with and consuming a large number of components that we will discuss briefly in just a moment. For the purposes of this discussion, we will assume that you are running a Linux or Unix system. Even though much of this can be applied to Windows, the differences are significant enough that we will not be using them as examples. We've tried to break out a significant chunk of the best practices here to show the depth of operating system (OS) and hardware knowledge that a DBRE can utilize to solve problems and dramatically increase the availability and performance of databases. As we go upward in abstractions to virtual hosts and containers, we will add in relevant information where appropriate.
Operating System and Kernel
As the DBRE, you should collaborate directly with software reliability engineers (SREs) to define appropriate kernel configurations for your database hosts. These should become gold standards that automatically deploy along with database binaries and other configurations. Most database management systems (DBMSs) will come with vendor-specific requirements and recommendations that should be reviewed and applied where possible. Interestingly, there can be very different approaches to this depending on the type of database you are using. So, we will go through a few high-level categories and discuss them.

User resource limits There are a number of resources that databases utilize at much higher numbers than typical servers. These include file descriptors, semaphores, and user processes.
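For example, on Linux you might raise these limits for the database's service account in /etc/security/limits.conf or a drop-in file; the user name and the values below are purely illustrative, and vendor recommendations should take precedence:
wtf@host:~$ cat /etc/security/limits.d/mysql.conf
mysql  soft  nofile  65535
mysql  hard  nofile  65535
mysql  soft  nproc   32768
mysql  hard  nproc   32768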

I/O scheduler
The I/O scheduler is the method that operating systems use to decide the order in which block-level I/O operations are submitted to storage volumes. By default, these schedulers often assume they are working with spinning disks that have high seek latency. Thus, the default is usually an elevator algorithm that attempts to order requests by location and to minimize seek times. In Linux, you might see the following options available:
wtf@host:~$ cat /sys/block/sda/queue/scheduler
[noop] anticipatory deadline cfq
The noop scheduler is the proper choice when the target block device is an array of SSDs with a controller that performs I/O optimization. Every I/O request is treated equally because seek times on solid-state drives (SSDs) are relatively stable.

The deadline scheduler seeks to minimize I/O latency by imposing deadlines to prevent starvation and by prioritizing reads over writes. The deadline scheduler has been shown to be more performant in highly concurrent, multithreaded environments, such as database workloads.
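As a sketch, you can switch the scheduler for a device at runtime; the device name here is illustrative, and the change does not persist across reboots unless you also set it via a kernel boot parameter or a udev rule:
wtf@host:~$ echo deadline | sudo tee /sys/block/sda/queue/scheduler
wtf@host:~$ cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq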

Memory allocation and fragmentation No one can deny that databases are some of the most memory-hungry applications a server can run. Understanding how memory is allocated and managed is critical to utilizing it as effectively as possible. Database binaries are compiled using a variety of memory allocation libraries. Here are some examples:

• MySQL’s InnoDB, as of version 5.5, uses a custom library that wraps glibc’s mal loc. GitHub claims to have saved 30% latency by switching to tcmalloc, whereas Facebook uses jemalloc. • PostgreSQL also uses its own custom allocation library, with malloc behind it. Unlike most other datastores, PostgreSQL allocates memory in very large chunks, called memory contexts. • As of version 2.1, Apache Cassandra off-heap allocation supports jemalloc over native. • MongoDB’s version 3.2 implementation uses malloc by default but can be config‐ ured with tcmalloc or jemalloc. • As of version 2.4, Redis uses jemalloc. jemalloc and tcmalloc have both proven to provide significant improvements in concurrency for most database workloads, performing significantly better than glibc’s native malloc while also reducing fragmentation. Memory is allocated in pages that are 4 KB in size by default. Thus, 1 GB of memory is the equivalent of 262,144 pages. CPUs use a page table that contains a list of these pages, with each page referenced through a page table entry. A translation lookaside buffer (TLB) is a memory cache that stores recent translations of virtual memory to physical addresses for faster retrieval. A TLB miss is where the virtual-to-physical page translation is not in a TLB. A TLB miss is slower than a hit, and can require a page walk, requiring several loads. With large amounts of memory, and thus pages, TLB misses can create thrashing. Transparent Huge Pages (THP) is a Linux memory management system that reduces the overhead of TLB lookups on machines with large amounts of memory by using larger memory pages, thus reducing the number of entries required. THPs are blocks of memory that come in 2 MB and 1 GB sizes. Tables used for 2 MB pages are suitable for gigabytes of memory, whereas 1 GB pages are best for terabytes. However, defrag‐ mentation of these large page sizes can cause significant CPU thrashing, which has

Hosts | 83 been seen on Hadoop, Cassandra, Oracle, and MySQL workloads, among others. To mitigate this, you might need to deal with disabling defragmentation and losing up to 10% of your memory because of it. Linux is not particularly optimized for database loads requiring low latency and high concurrency. The kernel is not predictable when it goes into reclaim mode, and one of the best recommendations we can give is to simply ensure that you never fully use your physical memory by reserving it to avoid stalls and significant latency impacts. You can reserve this memory by not allocating it in configuration.
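If you do decide to disable THP defragmentation (or THP entirely) for a latency-sensitive datastore, many Linux distributions let you do so at runtime; a sketch follows, with the caveat that paths can vary by kernel and the setting must also be applied at boot to persist:
wtf@host:~$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
wtf@host:~$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
wtf@host:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]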

Swapping In Linux/UNIX systems, swapping is the process of storing and retrieving data that no longer fits in memory and needs to be saved to disk to alleviate pressure for memory resources. This is a very slow operation by orders of magnitude compared to memory access, and thus should be considered a last resort. It is generally accepted that databases should avoid swapping because it can immedi‐ ately increase latency beyond acceptable levels. That being said, if swap is disabled, the operating system “Out of Memory Killer” (OOM Killer) will shut down the data‐ base process. There are two schools of thought here. The first, and more traditional, approach is that keeping a database up and slow is better than the database not being available at all. The second approach, and the one that aligns more closely with the DBRE philos‐ ophy, is that latency impacts are as bad as performance impacts; thus, swapping in from disk should not be tolerated at all. Database configurations will typically have a realistic memory usage high watermark as well as a theoretical one. Fixed memory structures like buffer pools and caches will consume fixed amounts, making them predictable. At the connection layer, however, things can get messier. There is a theoretical limit based on the maximum number of connections and the maximum size of all per-connection memory structures, such as sort buffers and thread stacks. Using connection pools and some sane assumptions, you should be able to predict a reasonably safe memory threshold by which you can avoid swapping. This process does allow for anomalies such as misconfigurations, runaway processes, and other pear-shaped events to push you past your memory bounds, which would shut down your server in an OOM event. This is a good thing, though! If you have effective visibility, capacity, and failover strategies in place, you have effectively shut out a potential Service-Level Objective (SLO) violation on latency.

Disabling Swap
You should do this only if you have rock-solid failover processes. Otherwise, you will absolutely find yourself affecting the availability of your applications.

Should you choose to allow swapping in your environment, you can reduce the oper‐ ating system’s chance of swapping out your database memory for file cache, which is generally not helpful. You can also adjust the OOM scores for your database pro‐ cesses to reduce the chance that your kernel memory profiler will kill your database process for memory needs elsewhere.
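A minimal sketch of both adjustments on Linux; the swappiness value, the process name, and the OOM score adjustment are illustrative assumptions:
wtf@host:~$ sudo sysctl -w vm.swappiness=1                          # prefer dropping file cache over swapping
wtf@host:~$ echo -600 | sudo tee /proc/$(pidof mysqld)/oom_score_adj # make the database a less likely OOM target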

Non-Uniform memory access Early implementations of multiple processors used an architecture called Symmetric Multiprocessing (SMP) to provide equal access to memory for each CPU via a shared bus between the CPUs and the memory banks. Modern multiprocessor systems uti‐ lize Non-Uniform Memory Access or NUMA, which provides a local bank of memory to each processor. Access to memory banked to other processors is still done over a shared bus. Thus, some memory access has much lower latency (local) versus others (remote). In a Linux system, a processor and its cores are considered a node. The operating sys‐ tem attaches memory banks to their local nodes and calculates cost between nodes based on distance. A process and its threads will be given a preferred node to utilize for memory. Schedulers can change this temporarily, but affinity will always go to the preferred node. On top of this, after memory is allocated, it will not be moved to another node. What this means in an environment in which large memory structures are present, such as database buffer pools, is that memory will be allocated heavily to the prefer‐ red node. This imbalance will cause the preferred node to be filled, with no available memory on it. This means that even if you are utilizing less memory than physically available on the server, you will still see swapping.

Solving NUMA and MySQL at Twitter Jeremy Cole has produced two impressive posts about this issue and how Twitter resolved it.1 Initially, the following approach was utilized: forcing interleaved alloca‐ tion by using numactl --interleave=all.

1 “A brief update on NUMA and MySQL” and “The MySQL ’swap insanity’ problem and the effects of the NUMA architecture”, both available on his blog.

Hosts | 85 By interleaving, allocation could effectively spread itself across all nodes. This was not 100% effective, however, as an OS buffer cache can be quite full if the MySQL process was restarted after running for awhile in production. By adding on two more solution points, the process proved repeatable and reliable:

• Flushing Linux’s buffer caches just before mysqld startup by using sysctl -q -w vm.drop_caches=3. This helps to ensure allocation fairness, even if the daemon is restarted while significant amounts of data are in the operating system buffer cache. • Forcing the OS to allocate InnoDB’s buffer pool immediately upon startup, using MAP_POPULATE where supported (Linux 2.6.23+), and falling back to memset otherwise. This forces the NUMA node allocation decisions to be made immedi‐ ately, while the buffer cache is still clean from the aforementioned flush.

This is a perfect example of the kind of value that a DBRE can provide to a very large population of SWEs and SREs. In this case, excessive swapping led to a deep dive into OS memory management. Using a deep knowledge of MySQL’s memory management in conjunction with this led to immediate fixes that eventually were rolled into main MySQL forks.

At this point, for most DBMSs, you set interleaved allocation for NUMA in the kernel. Stories abound about the same issues for PostgreSQL, Redis, Cassandra, MongoDB, MySQL, and Elasticsearch.
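A quick sketch of inspecting the layout and starting a database process with interleaved allocation (the binary path and options are illustrative):
wtf@host:~$ numactl --hardware          # shows nodes, their CPUs, and per-node free memory
wtf@host:~$ numactl --interleave=all /usr/sbin/mysqld --defaults-file=/etc/mysql/my.cnf
Some engines have since added native support for this; for example, recent MySQL versions expose an innodb_numa_interleave option.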

Network The assumption of this book is that all datastores are distributed. Network traffic is critical for the performance and availability of your databases, period. You can break up network traffic into the following categories:

• Internode communications • Application traffic • Administrative traffic • Backup and recovery traffic

Internode communications include data replication, consensus and gossip protocols, and cluster management. This is the data that lets the cluster know its own state and keeps data replicated in the amounts defined. Application traffic is the traffic that comes from application servers or proxies. This is what maintains application state and allows for the creation, mutation, and deletion of data by applications. Administrative traffic is the communication between management systems, operators, and the clusters.

This includes starting and stopping services, deploying binaries, and making database and configuration changes. When things go badly elsewhere, this is the lifeline to systems that allows for manual and automated recovery. Backup and recovery traffic is just what it says: the traffic created when archiving and copying data, moving data between systems, or recovering from backups. Isolation of traffic is one of the first steps to proper networking for your databases. You can do this via physical network interface cards (NICs), or by partitioning one NIC. Modern server NICs generally come in 1 Gbps and 10 Gbps sizes, and you can bond them in a pair to allow for redundancy and load balancing. Although this redundancy will increase mean time between failures (MTBF), it is an example of creating robustness rather than resiliency. Databases need a lean transport layer to manage their workloads. Frequent and fast connections, short round trips, and latency-sensitive queries all require a specific tuning effort. This can be broken down into three areas:

• Tuning for large numbers of connections by expanding the range of available TCP/IP ports. • Reducing the time it takes to recycle sockets so that large numbers of connections do not sit in TIME_WAIT, rendering those ports unusable. • Maintaining a large TCP backlog so that saturation will not cause connections to be refused.
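A sketch of kernel settings that touch the three areas above; the values are illustrative assumptions and should be load-tested rather than copied blindly:
wtf@host:~$ sudo sysctl -w net.ipv4.ip_local_port_range="1024 65000"  # more ephemeral ports for connections
wtf@host:~$ sudo sysctl -w net.ipv4.tcp_tw_reuse=1                    # reuse sockets sitting in TIME_WAIT
wtf@host:~$ sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096          # larger backlog under saturation
wtf@host:~$ sudo sysctl -w net.core.somaxconn=4096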

TCP/IP will become your best friend in troubleshooting latency and availability prob‐ lems. We strongly suggest that you deep-dive into this. A good book to do so is Inter‐ networking with TCP/IP, Vol 1 by Douglas E. Comer (Pearson). It is updated as of 2014 and is an excellent reference.

Storage Storage for databases is a huge topic. You need to consider the individual disks, the grouped configuration of disks, controllers providing access to disks, volume man‐ agement software, and the filesystems on top of this. It is quite possible to geek out on any single section of this for days, so we will try to stay focused on the big picture here. In Figure 5-1, you can see the ways in which data is propagated to storage. When you read from a file, you go from user buffer, to page cache, to disk controller, to disk platter for retrieval, and then step back through that for delivery to the user.

Figure 5-1. The Linux storage stack

This elaborate cascade of buffers, schedulers, queues, and caches is used to mitigate the fact that disks are very slow compared to memory, with the difference being 100 nanoseconds versus 10 milliseconds. For database storage, you will have five major demands, or objectives:

• capacity • throughput (or IO per second, aka IOPS) • latency • availability • durability

Storage capacity Capacity is the amount of space available for the data and logs in your database. Stor‐ age can be via large disks, multiple disks striped together (RAID 0), or multiple disks functioning as individual mount points (known as JBOD, or just a bunch of disks). Each of these solutions has a different failure pattern. Single large disks are single points of failure, unless mirrored (RAID 1). RAID 0 will have its MTBF reduced by a

88 | Chapter 5: Infrastructure Engineering factor of N, where N is the number of disks striped together. JBOD will have more frequent failures, but unlike RAID 0, the other N–1 disks will be available. Some data‐ bases can take advantage of this and stay functional, if degraded, while the replace‐ ment disk is acquired and installed. Understanding a database’s total storage needs is only one piece of this picture though. If you have a 10 TB storage need, you can create a 10 TB stripe set, or mount ten 1 TB disks in JBOD and distribute data files across them. However, you now have a 10 TB database to back up at once, and if you have a failure, you will need to recover 10 TB, which takes a long time. During that time, you will have reduced capacity and availability for your overall system. Meanwhile, you must consider if your database software, the operating system, and the hardware can manage the con‐ current workloads to read and write to this monolithic datastore. Breaking this sys‐ tem up into smaller databases will improve resiliency, capacity, and performance for applications, for backup/restores, and for copying the dataset.

Storage throughput IOPS is the standard measure of input and output operations per second on a storage device. This includes reads and writes. When considering the needs, you must con‐ sider IOPS for the peak of a database’s workload rather than the average. When plan‐ ning a new system, you will need to estimate the number of IOPS needed per transaction, and the peak transactions in order to plan appropriately. This will obvi‐ ously vary tremendously depending on the application. An application that does con‐ stant inserts and single row reads can be expected to do four or five IOPs per transaction. Complex multiquery transactions can do 20 to 30 easily. Database workloads tend to be mixed read/write, and random, rather than sequential. There are some exceptions, such as append-only write schemas (such as Cassandra’s SSTables), which will write sequentially. For hard disk drives (HDDs), random IOPs numbers are primarily dependent upon the storage device’s random seek time. For SSDs, random IOPs numbers are instead constrained by internal controller and memory interface speeds. This explains the significant improvement in IOPS seen by SSDs. Sequential IOPS indicate the maximum sustained bandwidth that the disk can produce. Often sequential IOPS are reported as megabytes per second (MBps), and are indicative of how a bulk load or sequential writes might perform. When looking at SSDs, remember to consider the bus, as well. Consider installing a PCIe bus flash solution, such as FusionIO 6 GBps throughput with microsecond latency. As of this writing, 10 TB will cost you around $45,000, however. Traditionally, IOPS was the constraining factor over storage capacity. This is particu‐ larly true for writes, which you could not optimize away via caching like reads. Adding disks via striping (RAID 0) or JBOD would add more IOPS capacity just like storage. RAID 0 will give you uniform latency and eliminate hot spots that might

show up in JBOD, but at the expense of a reduced MTBF based on the number of disks in the stripe set.
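As a rough, illustrative sizing exercise (every number here is an assumption): a workload peaking at 2,000 transactions per second at roughly 5 IOPS per transaction needs about 10,000 IOPS, and you would still want headroom for flushing, backups, and replica rebuilds:
wtf@host:~$ echo $(( 2000 * 5 * 130 / 100 ))   # peak TPS x IOPS per transaction, plus ~30% headroom
13000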

Storage latency Latency is the end-to-end client time of an I/O operation; in other words, the time elapsed between sending an I/O to storage and receiving an acknowledgement that the I/O read or write is complete. As with most resources, there are queues for pend‐ ing requests, which can back up during saturation. Some level of queuing is not a bad thing, and, in fact, many controllers are designed to optimize queue depth. If your workload is not delivering enough I/O requests to fully use the performance avail‐ able, your storage might not deliver the expected throughput. Transactional database applications are sensitive to increased I/O latency and are good candidates for SSDs. You can maintain high IOPS while keeping latency down by maintaining a low queue length and a high number of IOPS available to the vol‐ ume. Consistently driving more IOPS to a volume than it has available can cause increased I/O latency. Throughput-intensive applications like large MapReduce queries are less sensitive to increased I/O latency and are well-suited for HDD volumes. You can maintain high throughput to HDD-backed volumes by maintaining a high queue length when per‐ forming large, sequential I/O. The Linux page cache is another bottleneck to latency. By using direct IO (O_DIRECT), you can avoid multimillisecond latency impacts by bypassing the page cache.
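Whether direct I/O is appropriate depends on the engine and its own caching; as one hedged example, InnoDB can be told to bypass the page cache for its data files via its flush method setting:
wtf@host:~$ grep flush_method /etc/mysql/my.cnf
innodb_flush_method = O_DIRECT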

Storage availability Performance and capacity are critical factors, but unreliable storage must be accoun‐ ted for. Google did an extensive study on HDD failure rates in 2007 titled "Failure Trends in a Large Disk Drive Population" (see Figure 5-2). From this, you can expect about 3 out of 100 drives to fail in their first three months of service. Of disks that make it through the first six months, you will find approximately 1 in 50 fail between six months and one year in service. This doesn’t seem like a lot, but if you have six database servers with eight drives each, you can expect to have a disk failure during that period.

Figure 5-2. HDD failure rates, according to Google

This is why engineers began mirroring drives, à la RAID 1, and enhancing stripe sets with a parity drive. We don't talk much about the parity variants (e.g., RAID 5) because their write penalty is so large. Modern storage subsystems will typically find disks in JBOD or grouped in stripes (RAID 0), mirrors (RAID 1), or mirrors that are striped (RAID 10). With that in mind, it is easy to extrapolate that RAID 1 and RAID 10 are most tolerant of single-disk failures, whereas RAID 0 is most prone to complete service loss (and the chances grow with the number of disks in the stripe). JBOD shows the ability to tolerate failures while still serving the rest of the storage. Availability is not just the MTBF of your volumes. It is also the time it would take to rebuild after a failure, or the mean time to recover (MTTR). The choice of RAID 10 versus RAID 0 is dependent on the ability to easily deploy replacement database hosts during a failure. It can be tempting to stay with RAID 0 for performance reasons. RAID 1 or 10 requires double the IOPS for writes, and if you are using high-end drives, the duplication of hardware can become expensive. RAID 0 is performant and predictable, but somewhat fragile. After all, a five-drive stripe set of HDDs can be expected to have about a 10% chance of failure in the first year. With four hosts, you can reasonably expect a failure every year or two. With a mirror, you can replace the disk without the need to rebuild the database. If you are running a 2 TB database with fairly fast-spinning disks, database backup copy time is measured in hours, and you will likely need more hours for database replication to sync the delta from the last backup. To be comfortable with this, you will need to consider whether you have enough capacity to support peak traffic with one, or even two, failed nodes while this recovery is going on. In a sharded environment, where one dataset is only a percentage of users, this might be perfectly acceptable, meriting more fragile data volumes for the benefits of cost and/or write latency.

Durability Finally, there is durability. When your database goes to commit data to physical disk with guarantees of durability, it issues an operating system call known as fsync()

Hosts | 91 rather than relying on page cache flushing. An example of this is when a redo log or write-ahead log is being generated and must be truly written to disk to ensure recov‐ erability of the database. To maintain write performance, many disks and controllers utilize an in-memory cache. fsync instructs the storage subsystem that these caches must be flushed to disk. If your write cache is backed by a battery, and can survive a power loss, it is much more performant to not flush this cache. It is important to vali‐ date that your particular setup does truly flush data to stable storage, whether that is a non-volatile RAM (NVRAM), write cache or the disk platters themselves. You can do this via a utility such as diskchecker.pl by Brad Fitzpatrick. Filesystem operations can also cause corruption and inconsistency during failure events, such as crashes. Journaling filesystems like XFS and EXT4 significantly reduce the possibility of such events, however. Storage Area Networks In opposition to direct storage, you can use a storage area network (SAN) with an external interface, typically Fibre Channel. SANs are significantly more expensive than direct attached, and by centralizing storage, you reduced management expense and allow a significant amount of flexibility. With top-of-the-line SANs, you get significantly more cache. Additionally, there are a lot of features that can be useful for large datasets. Being able to take snapshots for backups and data copies is incredibly useful. Realistically, data snapshots and move‐ ment are some of the nicest features in modern infrastructures, where SSDs provide better IO than traditional SANs. Benefits of Physical Servers Physical servers are the simplest approach to hosting databases. There are no abstrac‐ tions that hide implementation and runtime details or add additional complexity. In most cases, you have as much control as is possible with your OS, and as much visi‐ bility. This can make operations fairly straightforward. Cons of Physical Servers Still, there can be some drawbacks to using physical servers. First among them is that you can find yourself wasting capacity that has been dedicated to specific servers. Additionally, deployment of these systems can take quite a lot of time, and it can be difficult to ensure that every server is identical from the hardware and software per‐ spectives. With that in mind, let’s discuss virtualization.

92 | Chapter 5: Infrastructure Engineering Virtualization In virtualization, software separates physical infrastructures to create various dedica‐ ted resources. This software makes it possible to run multiple operating systems that run multiple applications on the same server. With virtual machines (VMs), for example, you could alternately run four instances of Linux on one server, each with dedicated computing, memory, networking, and storage. Virtualization allows an infrastructure’s resources, including compute, storage, and networking, to be combined to create pools that can be allocated to virtual servers. This is often referred to as cloud computing. This is what you work with if you are running on a public cloud infrastructure such as Amazon Web Services (AWS). This can also be done within your own data centers. Essentially, whether in a public, private, or hybrid solution, you are able to define what your server resources and the related operating systems look like via code. This allows for consistent deployments for database systems, which means the DBREs users are empowered to build their own database clusters, which have been config‐ ured according to the standards set by the DBRE. These standards include the follow‐ ing:

• OS • Database software version • OS and database configurations • Security and permissions • Software packages and libraries • Management scripts

This is all a wonderful thing, but adding an abstraction layer on top of physical resources does create its own set of complexities to manage. Let’s look at some of them. Hypervisor A hypervisor or virtual machine monitor (VMM) can be software, firmware, or hard‐ ware. The hypervisor creates and runs VMs. A computer on which a hypervisor runs one or more VMs is called a host machine, and each VM is called a guest machine. The hypervisor presents the guest operating systems with a virtual operating platform and manages the execution of the guest operating systems.

Virtualization | 93 Concurrency Databases running within show lower boundaries for concurrency than the same software on bare metal. When designing for these virtualized environments, the focus should be on a horizontally scaled approach, minimizing concurrency within nodes. Storage Storage durability and performance are not what you would expect in the virtualized world. Between the page cache of your VM and the physical controller lies a virtual controller, the hypervisor, and the host’s page cache. This means increased latency for I/O. For writes, hypervisors do not honor fsync calls in order to manage perfor‐ mance. This means that you cannot guarantee that your writes are flushed to disk when there is a crash. Additionally, even though you can easily spin-up a VM in 10 minutes or less, that does not necessarily create the data that an existing database needs to be functional. For instance, if you are deploying a new replica, you will need to bootstrap the data for that replica from somewhere. When looking at storage in virtualized environments, there are two major categories: local storage and persistent block storage. Local storage is ephemeral. Its data cannot survive the life of the VM. Persistent block storage can be attached to any VM and utilized. If the virtual machine is shut down, another VM can attach to that storage. This externalized, persistent storage is ideal for databases. This block storage often will allow snapshots for easy data movement, as well. This block storage is much more network dependent than traditional physical disks, and congestion can quickly become a performance killer. Use Cases With all of these caveats, the DBRE must consider virtual and cloud resources care‐ fully when planning to use them for database infrastructure. When designing for these infrastructures, you must consider all of the aforementioned factors, which we summarize here:

• Relaxed durability means data loss must be considered an inevitability. • Instance instability means automation, failover, and recovery must be very reliable. • Horizontal scale requires automation to manage significant numbers of servers. • Applications must be able to tolerate latency instability.

94 | Chapter 5: Infrastructure Engineering Even with all of this in mind, there can be tremendous value to virtualized and cloud infrastructures for databases. The ability to create self-service platforms that your users can build on and work with is a force multiplier for DBRE resources. This allows dissemination of knowledge and best practices even with only a few DBREs on staff. Rapid deployment also allows for extensive testing of applications and prototyping. This allows for development teams to be much more productive without bottleneck‐ ing on the DBRE for deployments and configuration. This also means developers are less likely to go off on their own when deploying new application persistence tiers. Containers Containers sit on top of a physical server and its host OS. Each container shares the host OS kernel, binaries, and libraries. These shared components are read-only, and each container can be written to through a unique mount. Containers are much lighter than VMs. In fact, they are only megabytes in size. Whereas a VM might take 10 minutes or so to spin up, a container can take seconds to start. For datastores, however, the advantages of a quick spin up in Docker are often out‐ weighed by the need to attach, bootstrap, and synchronize data. Additionally, kernel- level customizations, IO heavy workloads, and network congestion often make a shared OS/host model challenging. Docker is a great tool for rapidly spinning up deployments for tests and development environments, however, and DBREs will still find useful places for it in their toolkits. Database as a Service Increasingly, companies are looking to third-party solutions for their virtualization and cloud services. Taking the self-service model we discussed in Chapter 6 further, you end up with third-party-managed database platforms. All of the public cloud providers offer these, the most famous being Amazon’s Relational Database Service (RDS), which offers MySQL, PostgreSQL, Aurora, SQL Server, and Oracle. In these environments, you are given the opportunity to choose fully deployed database envi‐ ronments to place in your infrastructure. Database as a Service (DBaaS) has gained significant adoption rates because of the idea that the automation of many of the more mundane aspects of operations frees up time from valuable engineering resources. Typical features can include:

• Deployment • Master failover • Patches and upgrades

• Backup and recovery • Exposure of metrics • High performance due to "special sauce," such as Amazon's Aurora

All of this does free up time, but also can lead software engineers (SWEs) to think that database specialists are not needed. This could not be further from the truth. Abstracted services add their own challenges, but more important, they allow you to focus where your specialized knowledge can create the most value. Challenges of DBaaS Lack of visibility is one of the biggest challenges. With no access to the OS, network devices, and hardware, you will be unable to diagnose many significant issues.

DBaaS and Network Time Protocol (NTP) We were using one well-known DBaaS, and our client decided to be one of the first to use a new database version the vendor was providing. We begged the client not to, but incentives were given and we found ourselves beta testing. In this case, an intrepid operations member neglected to synchronize Network Time Protocol (NTP) between all of the DB hosts. After hours of troubleshooting replication lag that could not be explained, we called support to find out what exactly had been going on. That night was a whiskey night!

Although many monitoring systems are now gathering database SQL data at the TCP level to manage data gathering at scale, they must fall back to logs or internal snap‐ shots such as MySQL’s performance schema for data. Additionally, time-tested tracing and monitoring tools like top, dtrace, and vmstat are unavailable. Durability issues are similar to other virtualized environments, and implementation of important components such as replication and backups are often black boxes that rely on your vendor to do the right thing. The DBRE and the DBaaS In the world of marketing, DBaaS platforms are often sold as ways to eliminate the need for expensive and difficult-to-hire/retain database specialists. The DBaaS plat‐ form does allow for more rapid introduction of an operationally sound database infrastructure, which can delay the need for hiring or engaging a specialist. This is far from the elimination of the need for the specialist that is being sold however. If anything, with the DBaaS abstracting away toil and easy-to-solve issues, you create risk that a difficult-to-solve issue will come up before you have the appropriate data‐

96 | Chapter 5: Infrastructure Engineering base specialists dialed into your environment. Additionally, there are key decisions to be made early that do require someone with depth in the database engine you have chosen. These decisions include the following:

• Which database engine to use • How to model your data • An appropriate data access framework • Database security decisions • and growth/capacity plans

So, even though software engineers may be more empowered by the DBaaS, you as the DBRE should work harder than ever to help them choose correctly and to ensure that they understand where your expertise can make the difference in success or fail‐ ure of the DBaaS deployment. DBaaS can be very attractive to an organization, particularly in its early days when every engineering minute is incredibly critical. As a DBRE in such an environment, it is strongly recommended that you consider a migration path and a disaster recovery path to your own infrastructure. Everything that your DBaaS solution can do you and your operations team can automate at the right time, giving you full control and visi‐ bility into your datastores for better or for worse. Wrapping Up In this chapter, we looked at the various combinations of hosts that you might find yourself working with—physical, virtual, containers, and services. We discussed the impact of processing, memory, network and storage resources, and the impacts that under-allocation and misconfiguration can cause. In Chapter 6, we discuss how to manage these database infrastructures through appropriate tooling and process to scale and to manage risk and failure. We will cover configuration management, orchestration, automation and service discovery, and management.


CHAPTER 6 Infrastructure Management

In Chapter 5, we looked at the various infrastructure components and paradigms on which you might run your datastores. In this chapter, we discuss how to manage those environments at scale. We begin with the smallest unit, the individual host’s configuration and definition. From this point, we zoom out to the deployment of hosts and orchestration between components. After this, we zoom out even further to the dynamic discovery of the current state of infrastructure, and the publishing of that data—aka service discovery. Finally, we will go to the development stack, discus‐ sing how to create development environments that are similar to these large produc‐ tion stacks. Gone are the days of one or two boxes that can be manually managed with ease and relative stability. Today, we must be prepared to support large, complex infrastruc‐ tures with only a few sets of hands. Automation is critical in ensuring that we can deploy datastores repeatably and reliably. Application stability and availability, and the speed at which new features can be deployed, relies on this. Our goals must be to eliminate processes that are repetitive and/or manual and to create easily reproduci‐ ble infrastructures via standardized processes and automation. What are good opportunities for this?

• Software installations, including operating system (OS), database, and associated packages and utilities • Configuring software and the OS for desired behaviors and workloads • Bootstrapping data into new databases • Installing associated tools such as monitoring agents, backup utilities, and opera‐ tor toolkits • Testing of infrastructure for appropriate setup and behaviors

• Static and dynamic compliance testing

If we were to boil all of this down, our goals are to ensure that we can consistently build and/or reproduce any component of our database infrastructures, and that we can know the current and previous state of any of these components as we carry out troubleshooting and testing. Admittedly, this is a fairly high-level overview. Should you want to deep-dive, we’d suggest Infrastructure as Code, by Kief Morris (O’Reilly). The goal for this chapter is to explain the various components of infrastructure management via code that you will be asked to contribute to, and point out the ways in which they can make your life as a database reliability engineer (DBRE) easier. Version Control To achieve these goals, we must use version control for all components required in the process. This includes the following:

• Source code and scripts • Libraries and packages that function as dependencies • Configuration files and information • Versions of OS images and database binaries

A version control system (VCS) is the core of any software engineering workflow. The database and systems engineers working with software engineers (SWEs) to build, deploy, and manage applications and infrastructures work together within the VCS. Traditionally, this was the case. The systems and database engineers would often not use any VCS. If they did happen to use one, it would be separate from the SWE team’s VCS, leading to an inability to map infrastructure versions to code versions. Some examples of popular VCS platforms include:

• GitHub • Bitbucket • Git • Microsoft Team Foundation Server • Subversion

The VCS must be the source of truth for everything in the infrastructure. This includes scripts, configuration files, and the definition files you will use to define a database cluster. When you want to introduce something new, you check it into the VCS. When you want to modify something, you check it out, perform your changes,

and commit the changes back into the VCS. After commits are in place, review, testing, and ultimately deployment can occur. It is worth noting that passwords should be masked in some way before being stored in the VCS.
Configuration Definition
To define how your database cluster should be configured and built, you will be utilizing a series of components. Your configuration management application will use a domain-specific language (DSL), though you might also find yourself working with scripts in Windows PowerShell, Python, shell scripts, Perl, or other languages. Following are some popular configuration management applications:

• Chef • Puppet • Ansible • SaltStack • CFEngine

By defining configuration, rather than scripting it, you create easily readable compo‐ nents which you reuse across your infrastructure. This creates consistency, and often reduces the amount of work needed to add a new component to configuration man‐ agement. These applications have primitives, which are called recipes in Chef or manifests in Puppet. These are rolled up into cookbooks or playbooks. These rollups incorporate the recipes with attributes that can be used to override defaults for different needs such as test versus production, file distribution schemes, and extensions such as libra‐ ries and templating. The end result of a playbook is code to generate a specific com‐ ponent of your infrastructure such as installing MySQL or Network Time Protocol (NTP). These files can become quite complex in and of themselves and should share certain attributes. They should be parameterized so that you can run the same definition across different environments, such as dev, test, and production based on your inputs. The actions taken from these definitions and their application also need to be idem‐ potent.

Idempotency
An idempotent action is one that can be run repeatedly with the same outcome. An idempotent operation takes a desired state and does whatever is necessary to bring the component to that state, regardless of its current state. For instance, suppose that you are updating a configuration file to set the size of your buffer cache. You could assume the entry already exists and modify it, which is naive and error prone. Alternatively, you could simply insert the line into the file; but if you insert the line automatically and it already exists, you will duplicate it, so that action is not idempotent either. Instead, have the script check whether the entry exists. If the line does exist, modify it; if it does not, insert it. This is an idempotent approach.
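A minimal shell sketch of that check-then-act pattern, using the buffer cache example above; the file path, option name, and value are illustrative, and configuration management resources such as Ansible's lineinfile or Puppet's file_line encapsulate the same logic:
wtf@host:~$ if grep -q "^innodb_buffer_pool_size" /etc/mysql/my.cnf; then
>   sudo sed -i 's/^innodb_buffer_pool_size.*/innodb_buffer_pool_size = 8G/' /etc/mysql/my.cnf   # line exists: modify it
> else
>   echo "innodb_buffer_pool_size = 8G" | sudo tee -a /etc/mysql/my.cnf                          # line missing: insert it
> fi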

We can break up into sections an example of the configuration definition required by a distributed system such as Cassandra as follows:

• Core Attributes — Installation method, location, hashes — Cluster name and version — Group and user permissions — Heap sizing — JVM tuning and configuration — Directory layout — Service configuration — JMX setup — Virtual nodes • JBOD setup and layout • Garbage collection behavior • Seed discovery • Configuration of the YAML config file • OS resource configs • External services — PRIAM — JAMM (Java metrics) — Logging — OpsCenter

• Data center and rack layout

In addition to idempotency and parameterization, each of these components should have appropriate pre and post tests and integration into monitoring and logging for error management and continuous improvement. Pre and post tests will include validation that state starts and ends up where expected. Additional tests can focus on using features that might be enabled or disabled by the change to see if functional behavior is expected. We assume that operational, perfor‐ mance, capacity, and scale tests are done prior to determine if these changes are even necessary. This means that testing will focus on validation that the implementation worked and desired behaviors are occurring. You’ll find an example of a real-world case study with idempotency on the Salesforce Developers blog. Building from Configuration After you have your server specifications, acceptance tests, and modules for automa‐ tion and building defined, you have everything in source code you need to build your databases. There are two approaches regarding how this should be done, and they are called baking and frying. Baking, frying, recipes, chefs...getting hungry yet? Check out John Willis’s presentation on DevOps and Immutable Delivery. Frying involves dynamic configuration at host deployment time. Hosts are provi‐ sioned, operating systems are deployed, and then configuration occurs. All of the configuration management applications mentioned in the previous section have the ability to build and deploy infrastructure via frying. For instance, if you are frying up a MySQL Galera Cluster, you might see the following:

• Server hardware provisioned (three nodes) • Operating systems installed • Chef client and knife (CLI) installed, cookbooks uploaded • Cookbooks for OS permissions and configuration applied • Cookbooks for default package installations applied • Databag created/uploaded for /clusternode level attributes to be used (IP, init node, package names) • Node roles applied (Galera Node) — MySQL/Galera binaries installed — MySQL utility packages and scripts installed — Basic configuration set up

— Tests run — Services started and shut down — Cluster created/primary node set up — Remaining nodes set up — Tests run • Register cluster with infrastructure services
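As a hedged sketch of kicking off that flow with Chef's tooling (the host name, user, role, and cookbook names are assumptions, and flag names vary between Chef versions):
wtf@host:~$ knife cookbook upload mysql-galera
wtf@host:~$ knife bootstrap db-node-01.example.com -x admin --sudo -r 'role[galera-node]'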

Baking involves taking a base image and configuring that base image at build time. This creates a “golden image,"which becomes the standard template for all hosts built for the same role. This is then snapshotted and stored for future use. Amazon AMIs or virtual machine (VM) images are examples of the artifacts output from this baking process. In this scenario, nothing is dynamic. Packer is a tool from Hashicorp that creates images. The interesting thing about Packer is that it can create images for different environments (such as Amazon EC2 or VMWare images) from the same configuration. Most configuration management utilities can create baked images as well. Maintaining Configuration In an ideal world, use of configuration management will help you mitigate and poten‐ tially eliminate configuration drift. Configuration drift is what happens after a server is deployed after frying or baking. Although all instances of this component might start identical, people will inevitably log in, tweak something, install something, or run a few experiments and leave some trace behind. An immutable infrastructure is one that is not allowed to mutate, or change, after it has been deployed. If there are changes that must happen, they are done to the ver‐ sion controlled configuration definition, and the service is redeployed. Immutable infrastructures are attractive, because they provide the following: Simplicity By not allowing mutations, the permutations of state in your infrastructure are dramatically limited. State is always known. This means that investigations and discovery are much faster, and everything is easily reproducible during troubleshooting. Recoverability State can easily be reintroduced by redeploying the golden image. This reduces mean time to recover (MTTR) significantly. These images are known, tested, and ready to deploy at any time.

That being said, immutable infrastructures can have fairly dramatic overhead. For instance, if you have a 20-node MySQL Cluster and you want to modify a parameter, you must then redeploy every single node of this cluster in order to apply the change after it is checked in. In the interest of practicality and middle ground, some mutations that are frequent, automated, and predictable can be allowed in the environment. Manual changes are still prohibited, keeping a significant amount of the value of predictability and recoverability while minimizing operational overhead.

Enforcement of Configuration Definitions

How are these policies enforced?

Configuration synchronization

Many of the configuration management tools already discussed will provide synchronization. This means that any mutations occurring are overwritten on a schedule as the configuration is forcibly reverted back to standard. This requires the synchronized state to be extraordinarily complete, however, or some areas will be missed, and thus allowed to drift.
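As a hedged illustration of synchronization, the sketch below compares a few managed files against the checksums your configuration repository expects and forces a chef-client run when drift is found. The paths and checksum values are placeholders, and chef-client is used simply as one example of a configuration management client that can resynchronize state.

#!/usr/bin/env python3
"""Sketch of a drift check: compare managed files to their expected checksums
and force a configuration-management run if anything has drifted.
Paths and checksum values below are placeholders, not real values."""
import hashlib
import subprocess

EXPECTED = {
    "/etc/mysql/my.cnf": "0f1e2d3c...",           # hypothetical checksum from the repo
    "/etc/mysql/conf.d/galera.cnf": "9a8b7c6d...",
}

def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def detect_drift() -> list:
    drifted = []
    for path, expected in EXPECTED.items():
        try:
            if sha256(path) != expected:
                drifted.append(path)
        except FileNotFoundError:
            drifted.append(path)
    return drifted

if __name__ == "__main__":
    drifted = detect_drift()
    if drifted:
        print(f"Drift detected in: {drifted}; resynchronizing")
        # Re-running the CM client overwrites local changes back to standard.
        subprocess.run(["chef-client", "--once"], check=True)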

Component redeploys With appropriate tooling, you should be able to identify differences, and redeploy cleanly to eliminate them. Some environments might go so far as to constantly rede‐ ploy, or redeploy after any manual login/interaction has occurred on the component. This is generally more attractive in a baked solution, where the overhead of post- deployment configuration has been eliminated. Using configuration definition and management can help ensure that your individual servers or instances are built correctly and stay that way. There is a higher level of abstraction in the deployment process, which is the definition of cross-service infra‐ structures and orchestration of those deployments. Infrastructure Definition and Orchestration Now that we’ve looked at the configuration and deployment of individual hosts, whether they be servers, VMs, or cloud instances, let’s zoom out to look at groups of hosts. After all, we will rarely be managing a single database instance in isolation. With the assumption that we are always working with a distributed datastore, we need to be able to build, deploy, and operate multiple systems at once. Orchestration and management tools for provisioning infrastructures integrate with deployment (either frying or baking) applications to create full infrastructures, including services that might not use a host, such as virtual resources or platform as a

Infrastructure Definition and Orchestration | 105 service configurations. These tools will ideally create a solution for codifying the cre‐ ation of an entire datacenter or service, giving developers and operations staff the ability to build, integrate, and launch infrastructures from end to end. By abstracting infrastructure configurations into archivable, version-controlled code, these tools can integrate with configuration management applications to automate the provisioning of hosts and applications while handling all of the set up of underly‐ ing infrastructure resources and services required for automation tools to effectively do their jobs. When discussing infrastructure definitions, we often use the term stack. You might have heard of a LAMP stack (Linux, Apache, MySQL, PHP), or a MEAN stack (Mon‐ goDB, Express.js, Angular.js, Node.js). These are generic solution stacks. A specific stack might be focused on a specific application or group of applications. A stack takes an even more specific meaning when you discuss the definition of infrastruc‐ tures for the purpose of automation and orchestration via tooling. That is the defini‐ tion we will be referring to here. The structure of these stacks has a significant impact on how you meet your responsi‐ bilities as the DBRE on your team. Let’s discuss these permutations, and the impacts to the DBRE role. Monolithic Infrastructure Definitions In this stack, all of the applications and services that your organization owns are defined together in one large definition. In other words, all database clusters will be defined in the same files. In such an environment, any number of applications and services will typically be using one or more databases. You might find five different applications with their associated databases in the same definition. There are really no pros to a monolithic infrastructure definition, but there are plenty of cons. From an overall orchestration/infrastructure as code point of view, the prob‐ lems can be elaborated as follows:

• If you want to make a change to the definition, you will need to test against the entire definition. This means that tests are slow and fragile. This will cause people to avoid making changes, creating a calcified and fragile infrastructure.
• Changes are also more prone to break everything rather than being isolated to one component of the infrastructure.
• Building a test or development environment means that you have to build everything together rather than a smaller, isolated section that you can focus on.
• Changes will often be restricted to small groups of individuals who are able to know the entire stack. This creates bottlenecks of changes, slowing velocity.

106 | Chapter 6: Infrastructure Management Teams can find themselves in a monolithically defined stack if they have introduced a new tool, like Terraform, and simply reverse engineered their entire infrastructure into it. When considering an infrastructure definition, you have horizontal consider‐ ations (various tiers within one stack) and vertical considerations (breaking up your stack functionally so that one service is in one stack, rather than putting all services in one). Separating Vertically By breaking out the definition to individual services per definition, as illustrated in Figure 6-1, you can reduce size and complexity. So, what once was one definition file can become two, one for each service. This reduces the failure domain during changes to one service only. It will also reduce testing by half, and the size of your development and test environments commensurately.

Figure 6-1. Separating vertically

If you have multiple applications using the same database tier, which is rather com‐ mon, this will prove more challenging. At this point, you will need to create three def‐ initions. One will be a shared database definition, and the other two will be the individual service definitions, excluding the database tier, as demonstrated in Figure 6-2.

Figure 6-2. Vertically separated service definitions, with shared database

Now your failure domain for changes to the definition file is reduced even further, because each definition is smaller and more focused. However, you still have two applications coupled to the database, which means that there will need to be integration testing to ensure that changes made to the database tier work effectively with the other application stacks. So, testing will still require all applications to be built and deployed. Having applications separated does mean that you have the flexibility to build and test applications serially or in parallel depending on your infrastructure constraints.

Separated Tiers (Horizontal Definitions)

If you take the application definitions that have been separated out, you can break definition files out by tiers, also called horizontal separation. So, a standard web app might have a web server stack, an application server stack, and a database stack. The primary advantage of breaking out your infrastructure definition in this manner is that you have now reduced your failure domain even further. In other words, if you need to change the configuration of your database servers, you don't need to worry about potentially breaking the build for web servers. After you have separated tiers vertically and potentially horizontally, you find yourself dealing with new and interesting complexities. Specifically, you now have communication paths across stacks that require sharing of data. Your database load balancer virtual IPs must be shared with your application servers, but the stacks have their

own definitions. This dynamic infrastructure requires a service catalog to ensure that any component of your infrastructure can effectively share its state with any other component for communication and integration.

Acceptance Testing and Compliance

Utilizing automation and infrastructure as code obviously gives us a lot of benefits. One more that we haven't mentioned yet is acceptance testing and compliance. With images of infrastructure in place, you can utilize tools such as ServerSpec, which uses descriptive language to define tests for your infrastructure images. This brings test-driven development (TDD) to your infrastructure, which is a great opportunity to continue to align infrastructure and software engineering. With a framework such as ServerSpec for automated testing, you can delve into compliance and security, which are ideal customers of this capability. Working with these teams, you can create a test suite focused on database security and compliance. Inspec is a plug-in for ServerSpec that does this work quite well. You can find more information about it at the Chef blog.

Service Catalog

With dynamic environments being built, scaled, and destroyed automatically, there must be a source of truth for current state that all infrastructure components can use. Service discovery is an abstraction that maps specific designations and port numbers of your services and load balancers to semantic names. For instance, mysql-replicas is a semantic name that can include any number of MySQL hosts replicating data from the primary write host. The utility of this is being able to refer to things by semantic names instead of IP addresses, or even hostnames. Information in the catalog can be accessed via HTTP or DNS (a sketch of the HTTP access pattern follows the list). The following are some of the most common service discovery tools available as of this writing:

• Zookeeper
• Consul.io
• Etcd
• Build your own!
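Consul, for example, exposes the catalog over a local HTTP API (and the same names via DNS, such as mysql-replicas.service.consul). The following is a minimal sketch rather than a complete integration: the service name, addresses, and health-check settings are illustrative, and the local agent is assumed to be listening on its default port.

#!/usr/bin/env python3
"""Sketch of using Consul's HTTP API as a service catalog.
The service name and addresses are illustrative; the local agent is
assumed to listen on the default port 8500."""
import requests

CONSUL = "http://127.0.0.1:8500"

def register_replica(node_ip: str, port: int = 3306) -> None:
    """Register this MySQL replica under the semantic name 'mysql-replicas'."""
    payload = {
        "Name": "mysql-replicas",
        "Address": node_ip,
        "Port": port,
        "Check": {"TCP": f"{node_ip}:{port}", "Interval": "10s"},
    }
    requests.put(f"{CONSUL}/v1/agent/service/register", json=payload).raise_for_status()

def list_replicas() -> list:
    """Ask the catalog which hosts currently back 'mysql-replicas'."""
    resp = requests.get(f"{CONSUL}/v1/catalog/service/mysql-replicas")
    resp.raise_for_status()
    return [(s["ServiceAddress"], s["ServicePort"]) for s in resp.json()]

if __name__ == "__main__":
    register_replica("10.0.2.21")
    print(list_replicas())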

There are numerous use cases for such a service catalog that the DBRE might encounter. Here are a few. (We discuss more in later chapters as we explore specific architectures.)

Database failovers
By registering write IPs to the catalog, you can create templates for load balancers. When the IP is switched, load balancer configurations rebuild and reload.

Sharding
Share information about writeable shards with application hosts.

Cassandra Seed Nodes
Let bootstrapping nodes know where to go for seeds.

A service catalog can be very simple, storing service data to integrate services, or it can include numerous additional facilities, including health checks to ensure that data in the catalog points to working resources. You can also store key–value pairs in many of these catalogs.

Bringing It All Together

That was a lot of information, and a lot of it high-level. Let's discuss a day in the life of a DBRE using these concepts for MySQL. Hopefully, this will add some valuable context. For simplicity, we will assume that this environment runs in Amazon's EC2 environment. You've been tasked with bringing up a new MySQL cluster for your primary user database, which is sharded and requires more capacity. Here are the tools you've been given:

• MySQL Community, 5.6
• MySQL MHA for replication management and failover
• Consul for cluster state database

The Terraform files and the Chef cookbooks for MySQL shards for the user database are checked into GitHub, of course. There should be no manual changes that need to occur in order to build this. You grumble about the fact that you should probably automate capacity analysis and deployment of new shards, but this is only your third month and you just haven't had the time yet. Checking your deployment logs, you see the last time a MySQL shard was deployed, and you compare that to the current versions of the Terraform and Chef code to validate that nothing has changed since then. Everything checks out, so you should be good to go. First, you run Terraform with the plan option to show the execution path and verify nothing untoward will happen. Assuming that is good, you go ahead and run the Terraform commands to build the shard. Terraform executes a Chef provisioner that queries Consul for the latest shard_id and increments it by one (a sketch of this bookkeeping appears after the list). Using that shard_id, it runs its way through a series of steps that include the following:

1. Launch three EC2 instances in two availability zones (AZs) using the appropriate Amazon Machine Images (AMIs) for MySQL Shard Hosts.
2. Configure MySQL on these hosts and bring the service up.
3. Register each node into Consul, under the shard_id namespace.
   • The first node will register as master; the following nodes will register as failovers.
   • Start replication using this data.
4. Launch two EC2 instances in two AZs using the appropriate AMIs for MySQL MHA Manager Hosts.
5. Register the Master High Availability (MHA) manager with Consul, under the shard_id namespace.
6. Configure MHA using node data from Consul.
7. Start the MHA replication manager.
8. Run through a series of failover tests.
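Here is a hedged sketch of the shard_id bookkeeping referenced above: read the current maximum from Consul's key–value store and write the increment back using a check-and-set, so two concurrent builds cannot claim the same shard. The key name is hypothetical, and the key is assumed to have been seeded already.

#!/usr/bin/env python3
"""Sketch of claiming the next shard_id from Consul's KV store.
The key name 'mysql/user_db/max_shard_id' is hypothetical."""
import base64
import requests

CONSUL = "http://127.0.0.1:8500"
KEY = "mysql/user_db/max_shard_id"

def claim_next_shard_id() -> int:
    while True:
        resp = requests.get(f"{CONSUL}/v1/kv/{KEY}")
        resp.raise_for_status()          # assumes the key already exists
        entry = resp.json()[0]
        current = int(base64.b64decode(entry["Value"]).decode())
        next_id = current + 1
        # cas=ModifyIndex makes the write fail if someone else updated the key first.
        put = requests.put(
            f"{CONSUL}/v1/kv/{KEY}",
            params={"cas": entry["ModifyIndex"]},
            data=str(next_id),
        )
        put.raise_for_status()
        if put.json() is True:           # Consul returns true/false for CAS writes
            return next_id

if __name__ == "__main__":
    print(f"building shard {claim_next_shard_id()}")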

At this point, you have an MHA managed MySQL cluster that is registered in Consul. Some things will happen automatically from here, including the following:

• Backups start snapshotting from the master automatically, as scripts use Consul. • Monitoring agents are in the AMIs, and automatically begin sending metrics and logs to the operational visibility stack.

Finally, when you are satisfied, you mark the shard as active in Consul. At this point, proxy servers begin adding it to their templates and reloading. Application servers identify it as available and start sending data to it as appropriate.

Development Environments

Testing locally in a development environment or sandbox is crucial to this workflow. You should feel confident in the impacts of your changes before committing them to the VCS. One of our goals with everything we've discussed in this chapter is repeatability. This means that the sandbox must be as close as possible in software and configuration to your actual infrastructure. This means the same OS, the same configuration management, orchestration, and even service catalogs. When discussing deployment, we mentioned Packer. To reiterate, Packer allows you to create multiple images from the same configuration. This includes images for virtual machines on your workstation. Using a tool like Vagrant on your workstation allows you to download the latest images, build the VMs, and even run through a standard test suite to verify that everything works as expected.

After you've made your changes, tested them, pulled down any new changes from your VCS, and tested again, you can commit this right back into the team VCS in preparation for integration and deployment.

Wrapping Up

Infrastructure as code, automation, and version control are crucial skill sets for any reliability engineer, and the DBRE is no exception. With the tools and techniques discussed in this chapter, you can begin eliminating toil, reducing mistakes, and creating self-service deployments for your engineering teams. In Chapter 7, we begin digging deeper into a crucial component of infrastructure: backup and recovery. One key difference in the database tiers is the importance of persistence and availability of data. Although most other environments can be built as artifacts and deployed quickly and easily, databases require large working sets of data to be attached and maintained safely. There is a broad toolkit available to do this, which is what we will go over next.

CHAPTER 7
Backup and Recovery

In Chapters 5 and 6, we focused on infrastructure design and management. This means that at this point we have a good feeling for how to build, deploy, and manage distributed infrastructures running databases. This includes techniques for rapidly adding new nodes for capacity or to replace a failed node. Now, it’s time to discuss the serious meat and potatoes: data backup and recovery. Let’s face it. Everyone considers backup and recovery dull and tedious. Most think of it as the epitome of toil. It is often relegated to junior engineers, outside contractors, and third-party tooling that the team is loathe to interact with. We’ve worked with some pretty horrible backup software before. Trust me, we empathize. Still, this is one of the most crucial processes in your operations toolkit. Moving your precious data between nodes, across datacenters, and into long-term archives is the constant movement of your business’ most precious commodity: its data. Rather than relegating this to a second-class citizen of Ops, we strongly suggest you treat it as a VIP. Everyone should understand not only the recovery targets, but be intimately familiar with operating and monitoring the processes. Many DevOps philosophies propose that everyone should have an opportunity to write and push code to produc‐ tion. We propose that every engineer should participate at least once in the recovery processes of critical data. We create and store copies of data, otherwise known as backups and archives, as a means to the real need. That need is recovery. Sometimes, this recovery is something nice and leisurely, such as building an environment for auditors or configuring an alternate environment. More often though, the recovery is needed to rapidly replace failed nodes or to add capacity to existing clusters. Today, in distributed environments, we face new challenges in the backup and recov‐ ery realm. Now, as before, most local datasets are distributed to reasonable sizes, of

up to a few terabytes at most. The difference is that those local datasets are only one fraction of a larger distributed dataset. Recovering a node is a relatively manageable task, but keeping state across the cluster becomes more challenging.

Core Concepts

Let's begin by discussing core concepts around backup and recovery. If you are an experienced database or systems engineer, some of this might be rudimentary. If so, please feel free to fast forward a bit.

Physical versus Logical

When backing up a database physically, you are backing up the actual files in which the data resides. This means that the database-specific file formats are maintained, and there is usually a set of metadata within the database that defines what files exist and what database structures reside within them. If you back up files and expect another database instance to be able to utilize them, you will need to back up and store the associated metadata that the database relies on in order to make the backup portable. A logical backup exports the data out of the database into a format that is, theoretically, portable to any system. There will usually be some metadata still, but it is more likely to be focused on the point in time at which the backup was taken. An example of this is an export of all of the insert statements needed to populate an empty database and bring it up to date. Another example could be each row in a JSON format. Due to this, logical backups tend to be very time consuming because they involve a row-by-row extraction rather than a physical copy-and-write operation. Similarly, recovery involves all of the normal overhead of the database, such as locking and generation of redo or undo logs. An excellent example of this dichotomy is the difference between row-based and statement-based replication. In many relational databases, statement-based replication means that upon commit, a log of data manipulation language (DML; aka insert, update, replace, delete) statements is appended to. Those statements are streamed to replicas, where they are replayed. Another approach to replication is row-based, or change data capture (CDC).

Online versus Offline

An offline, or cold, backup is one in which the database instance that utilizes the files is shut down. This allows files to be quickly copied with no worries about maintaining a point-in-time state while other processes are reading and writing data. This is an ideal, but very rare, state in which to work.

In an online, or hot, backup, you are still copying all of the files, but you have the added complexity of needing to get a consistent, point-in-time snapshot of the data that must exist for the amount of time it takes the backup to occur. Additionally, if live traffic is accessing the database during the backup, you also must be careful not to overwhelm the input/output (I/O) throughput of the storage layer. Even throttled, you can find that the mechanisms used to maintain consistency add unreasonable amounts of latency to application activity.

Full, Incremental, and Differential

A full backup, regardless of the approach, means that the entire local dataset is backed up fully. On small datasets, this is a fairly trivial event. For 10 terabytes, it can take an impossible amount of time. A differential backup allows you to take a backup of only the data changed since the last full backup. In practice, there is usually more data backed up than just that which has changed, because your data will be in structures such as pages. A page will be of a particular size, such as 16 KB or 64 KB, and will have many rows of data in it. A differential or incremental backup will back up any page that has modified data in it. Thus, larger page sizes will back up significantly more than just the changed data. An incremental backup is similar to a differential backup, except that it will use the last backup, incremental or full, as the point in time at which it will look for changed data. Thus, if you are restoring an incremental backup, you might need to recover the last full backup and one or more incremental backups as well to get to the current point in time. With these concepts in mind, let's discuss various things to consider when deciding on an effective backup and recovery strategy.

Considerations for Recovery

When you first evaluate an effective strategy, you should look back to your Service-Level Objectives (SLOs), as discussed in Chapter 2. Specifically, you need to consider availability and durability indicators. Any strategy that you choose will require you to be able to recover data within the predefined uptime constraints. And, you need to back up fast enough to ensure that you meet the necessary parameters for durability. If you back up every day, and your transaction logs between backups remain on node-level storage, you can very well lose those transactions before the next backup. Additionally, you need to consider how the dataset functions within the holistic ecosystem. For instance, your orders might be stored in a relational system, where everything is committed in transactions and is thus easily recovered in relation to the rest of the data within that database. However, after an order is set, a workflow may be triggered via an event stored in a queuing system or a key–value store. Those systems

Considerations for Recovery | 115 might be eventually consistent, or even ephemeral, relying on the relational system for reference or recoverability. How do you account for those workflows when recov‐ ering? If you are in an environment with rapid development, you might also find that data stored in a backup was written and utilized by a different version of the application than the version running after the restore is done. How will the application interact with that older data? Hopefully the data is versioned to allow for that, but you must be knowledgeable of this and prepared for such eventualities. Otherwise, the applica‐ tion could logically corrupt that data and create even larger issues down the road. Each of these, and many other variables that you cannot plan for, must be taken into account when planning for data recovery. As we discussed in Chapter 3, we simply can’t prepare for every eventuality. But, this is a critical service. Data recoverability is one of the most significant responsibilities of the database reliability engineer (DBRE). So, your plan for data recoverability must be able to be as broad as possible, taking into account as many potential issues as possible. Recovery Scenarios With that in mind, let’s discuss the types of incidents and operations that might require recovery so that we can plan on supporting each need. We can first sort these into planned versus unplanned scenarios. Treating recovery as an emergency tool only will limit your team’s exposure to the tool to emergencies and emergency simu‐ lations. Instead, if we can incorporate recovery into daily activities, we can expect a higher degree of familiarity and success during an emergency. Similarly, we will have more data to determine whether the recovery strategy supports our SLOs. With mul‐ tiple daily runs, it is easier to get a sample set that can include upper bounds and can be represented with some level of certainty for planning purposes. Planned Recovery Scenarios What are the day-to-day recovery needs that can be incorporated? Here is a list that we’ve seen at various sites:

• Building new production nodes and clusters
• Building different environments
• Extract, Transform, and Load (ETL) and pipeline processes for downstream datastores
• Operational tests

When performing these operations, be sure to plug the process into your operational visibility stack (a sketch of capturing these numbers appears after the list below):

Time
How long does each component, as well as the overall process, take to run? Uncompress? Copy? Log applies? Tests?

Size
How big is your backup, compressed and uncompressed?

Throughput
How much pressure are you putting on the hardware?

This data will help you stay ahead of capacity issues, allowing you to ensure that your recovery process stays viable.
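As a sketch of what plugging into the operational visibility stack can look like, the following times each recovery stage and records the backup size. The stage commands and file paths are placeholders, and a real implementation would ship these numbers to your metrics system rather than printing them.

#!/usr/bin/env python3
"""Sketch of instrumenting a recovery run: time each stage and record sizes.
The stage commands, paths, and the final handling of metrics are placeholders."""
import os
import time
import subprocess

def timed(name, func, metrics):
    start = time.monotonic()
    func()
    metrics[f"recovery.{name}.seconds"] = round(time.monotonic() - start, 1)

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

if __name__ == "__main__":
    metrics = {}
    backup = "/backups/latest.tar.gz"                      # hypothetical path
    metrics["backup.compressed_bytes"] = os.path.getsize(backup)
    timed("copy", lambda: run(f"cp {backup} /restore/"), metrics)
    timed("uncompress", lambda: run("tar -xzf /restore/latest.tar.gz -C /restore"), metrics)
    timed("log_apply", lambda: run("xtrabackup --prepare --target-dir=/restore"), metrics)
    timed("tests", lambda: run("/usr/local/bin/post_restore_checks.sh"), metrics)
    # In practice these would be emitted to your metrics system rather than printed.
    for key, value in sorted(metrics.items()):
        print(key, value)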

New production nodes and clusters Whether your databases are part of an immutable infrastructure or not, there are opportunities for regular rebuilds that will, of necessity, utilize recovery procedures. Databases are rarely set into autoscaling automation because of the amount of time it can take for a new node to be bootstrapped and brought into a cluster. Still, there is no reason that a team can’t set up a schedule to regularly introduce new nodes into a cluster to test these processes. Chaos Monkey, a tool developed by Netflix that ran‐ domly shuts down systems, can do this in such a way that the entire process of moni‐ toring, notification, triage, and recovery can be tested. If you’re not there yet, though, you can still do this as a planned part of a checklist of processes your operations team should be performing at regular enough intervals to keep them all familiar with the procedure. These activities allow you to test not only a full and incremental recovery, but incorporation into the replication stream and the process to take a node into ser‐ vice.

Building different environments It is inevitable that you will be building environments for development, for integra‐ tion testing, for operational testing and for demos among others. Some of these envi‐ ronments will require complete recovery, and should utilize node recovery and full cluster recovery. Some will have other requirements, such as subset recovery for fea‐ ture testing and data scrubbing for user privacy purposes. This allows for you to test point-in-time recovery as well as the recovery of specific objects. Each of these are very different from a standard full dataset recovery and are useful for recovering from operator and application corruption. By creating APIs that allow for object-level and point-in-time recovery, you can facilitate the automation and familiarization with these processes.

ETL and pipeline processes for downstream datastores

As with your environment builds, the process of pushing data from production databases into pipelines for downstream analytics and streaming datastores is a perfect place to utilize point-in-time and object-level recovery processes and APIs.

Operational tests

During various testing scenarios, you will need copies of data. Some testing, such as capacity and load testing, requires a full dataset, which is an excellent opportunity for utilizing full-recovery processes. Feature testing might require smaller datasets, which is an excellent opportunity to use point-in-time and object-level restores. Recovery testing itself can become a continuous operation. In addition to utilizing recovery processes in everyday scenarios, you can run restores continuously, allowing automated testing and validation to rapidly surface any issues that might have broken the backup process. When we bring up this process, many people ask how to test the success of a restore. When taking the backup, you can produce a lot of data that can be used for testing, such as the following (a sketch of capturing and checking this data appears after the list):

• The most recent ID in an auto-increment set.
• Row counts on objects.
• Checksums on subsets of data that are insert-only and thus can be treated as immutable.
• Checksums on schema definition files.
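Here is a minimal sketch of capturing a few of these artifacts at backup time and re-checking them after a restore. The table names, the assumption of an auto-increment id column, and the DB-API connection handling are placeholders; row counts are only stable for insert-only data captured at a consistent point.

#!/usr/bin/env python3
"""Sketch of capturing backup-validation artifacts (max auto-increment ID,
row counts, schema checksum) to compare after a restore. Table names and
the DB-API connection setup are placeholders."""
import hashlib

TABLES = ["users", "orders"]   # hypothetical tables

def capture(cursor, schema_dump_path: str) -> dict:
    artifacts = {"row_counts": {}, "max_ids": {}}
    for table in TABLES:
        cursor.execute(f"SELECT COUNT(*), MAX(id) FROM {table}")
        count, max_id = cursor.fetchone()
        artifacts["row_counts"][table] = count
        artifacts["max_ids"][table] = max_id
    with open(schema_dump_path, "rb") as f:
        artifacts["schema_sha256"] = hashlib.sha256(f.read()).hexdigest()
    return artifacts

def verify(cursor, schema_dump_path: str, expected: dict) -> list:
    """Return the artifact names that do not match on the restored copy."""
    actual = capture(cursor, schema_dump_path)
    return [name for name in expected if expected[name] != actual[name]]

# Usage (with any DB-API connection):
#   artifacts = capture(conn.cursor(), "/backups/schema.sql")   # store beside the backup
#   ... after restore ...
#   mismatches = verify(restored_conn.cursor(), "/restore/schema.sql", artifacts)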

As with any testing, this should be a tiered approach. There are some tests that will succeed or fail quickly; these should be the first layer of testing. Examples of this are checksum comparisons on metadata/object definitions, the successful starting of a database instance, and the successful connection to a replication thread. Operations that might take longer, such as checksumming data and running table counts, should be run later into the validation process. Unplanned Scenarios With all of the day-to-day planned scenarios that can be used, the recovery process should be quite finely tuned, well documented, well practiced, and reasonably free of bugs and issues. Thus, the unplanned scenarios are rarely as scary as they could be otherwise. The team should see no difference in these unplanned exercises. Let’s list and dive into each to discuss the possibilities that might cause us to need to exercise our recovery processes:

• User error
• Application errors
• Infrastructure services
• Operating system and hardware errors
• Hardware failures
• Datacenter failures

User error Ideally, user error should be somewhat of a rare occurrence. If you are creating guard rails for engineers, you can add a lot of prevention. Still, there will always be an occa‐ sion when an operator accidentally does damage. Some examples of this include the ubiquitous absence of a WHERE clause when executing an UPDATE or DELETE in the database client. Or, perhaps a data cleansing script is executed in the production, rather than the testing environment. There are also many cases when something exe‐ cutes correctly, just at the wrong time, or against the wrong hosts. All of these are user errors. These errors are often immediately identified and recovered. However, there can be occasions when the impacts of these changes might not be known for days or weeks, thus hindering detection.

Application errors

Application errors are the scariest of the scenarios discussed because they can be so insidious. Applications are constantly modifying how they interact with datastores. Many of these applications also manage external pointers to assets such as files or third-party IDs. It is frighteningly simple to introduce a change that destructively mutates data, removes data, or adds incorrect data in ways that might not be noticed for quite a long time.

Infrastructure services Chapter 6 covers the magic of infrastructure management services. Unfortunately, these systems can be as destructive as they can be helpful, with wide-ranging conse‐ quences from editing a file, pointing to a different environment, or pushing an incor‐ rect configuration.

OS and hardware errors

Operating systems and the hardware they interface with are still systems built by humans, and thus can have bugs and issues arising from undocumented or poorly understood configurations. In the context of data recovery, this is quite true regarding the path of data from the database through OS caches, filesystems, controllers, and ultimately disks. Data corruption or data loss is much more common than

Recovery Scenarios | 119 we think. Unfortunately, our trust and reliance on these mechanisms creates cultures in which data integrity is expected rather than something to be skeptical of.

Silent Corruption This kind of OS and hardware error impact happened to Netflix in 2008. Error detection and correction on disks utilizes error correc‐ tion code (ECC). ECC corrects single-bit errors automatically, and detects double-bit errors. Thus, an ECC can detect an error up to twice the hamming distance of what it can correct. So, if it can cor‐ rect 46 bytes in your 512-byte sector hard drive, it can detect up to 92 bytes of error. What isn’t correctable is reported to the controller as uncorrectable, and the disk controller increments the “uncor‐ rectable error” counter in S.M.A.R.T. Errors larger than 92 bytes are passed straight to the controller as good data. That propagates to backups. Terrifying right? This is what makes cloud and so-called “serverless” computing something that should be approached with great skepticism. When you do not have access to implementation details, you cannot be sure that data integrity is being treated as top priority. Too often it is ignored, or even tuned down for performance. Without knowl‐ edge, there is no power. Checksumming filesystems like ZFS will checksum each block, ensuring detection of bad data. If you are using RAID that involves mirroring or parity, it will even fix the data.

Hardware failures Hardware components fail, and in distributed systems they fail regularly. You get reg‐ ular failures of disks, memory, CPUs, controllers, or network devices. These failures of hardware can cause node failures, or latency on nodes that make the system unusa‐ ble. Shared systems like network devices can affect entire clusters, making them unavailable, or causing them to break into smaller clusters that aren’t aware of the network having been partitioned. This can lead to rapid and significant data diver‐ gence that will need to be merged and repaired.

Datacenter failures

Sometimes, hardware failures at the network level can cascade into datacenter failures. Occasionally, congestion of storage backplanes causes cascading failures, as in the case of Amazon Web Services in 2012. Sometimes hurricanes, earthquakes, and tractor trailers can create conditions that cause the failure of entire datacenters. Recovery from this will test even the most robust of recovery strategies.

120 | Chapter 7: Backup and Recovery Scenario scope Having enumerated the planned and unplanned scenarios that might create the need for a recovery event, let’s further add the dimension of scope to these events. This will be useful to determine the most appropriate response. Here are ranges we’ll consider:

• Localized or single-node
• Cluster-wide
• Datacenter or multiple clusters

In a local or single-node scope, the recovery is limited to a single host. Perhaps you are adding a new node to a cluster for capacity or for the replacement of a failed node. Perhaps you are doing a rolling upgrade, and restores are being done node by node. Any of these are local scope. In a cluster-wide scope, the need to execute recovery is global to all members of that cluster. Perhaps a destructive mutation or data removal occurred and cascaded to all nodes via replication. Or, perhaps you need to build a new cluster for capacity testing. Datacenter or multiple cluster scope indicates that all data in a physical location or region needs recovery. This could be due to failure of shared storage, or a disaster that has caused the catastrophic failure of a datacenter. This might also be deploy‐ ment of a new redundant site for planning purposes. In addition to the locality scope, there is dataset scope. This can be enumerated into three potential types:

• Single object
• Multiple objects
• Database metadata

In single-object scope, one specific object requires recovery of some or all data. The incident discussed previously, in which a DELETE ends up removing more data than planned, is a single-object scope. In multiple objects, the scope is against more than one, and possibly all, objects in a particular database. This can occur in application corruption or during a failed upgrade or shard migration. Finally, there is database metadata scope, in which the data stored in the database is fine, but the metadata that makes the database usable, such as user data, security permissions, or mapping to OS files, is lost. Scenario Impact In addition to defining the scenario requiring recovery and the scope enumerated, it is also crucial to define the potential impacts because they will be significant in deter‐

Recovery Scenarios | 121 mining how the recovery option is approached. You can approach data loss that doesn’t affect the SLO methodically and slowly to minimize escalating the impact. More destructive changes that are causing SLO violation must be approached with an eye toward triage and rapid service restoration before any long-term clean up. We can separate the approaches into three categories:

• SLO impacting: application down, or a majority of users affected
• SLO threatening: some users affected
• Features affected: non-SLO threatening

With recovery scenario, scope, and impact, we have a potential combination of 72 dif‐ ferent scenarios to consider. That’s a lot of scenarios! Too many really to give each the level of focus they need. Luckily, many scenarios can utilize the same recovery approach. Still, even with this overlap, there is no way that we can fully plan for every eventuality. Thus, building a multitiered approach to recovery is required to help make sure we have as extensive of a toolkit as possible. In the next section, we use the information we just went through in this section to define the recovery strategy. Anatomy of a Recovery Strategy There is a reason why we say “recovery strategy” rather than “backup strategy.” Data recovery is the very reason we do backups. Backups are simply a means to the end, and thus are dependent on the true requirement: recovery within parameters. The simple question “Is your database backed up?” is a question that should be followed with the response, “Yes, in multiple ways, depending on the recovery scenario.” A simple yes is naive and promotes a false sense of security that is irresponsible and dangerous. An effective database recovery strategy not only engages multiple scenarios with the most effective approaches, but also includes the detection of data loss/corruption, recovery testing, and recovery validation. Building Block 1: Detection Early detection of potential data loss or corruption is crucial. In our discussion of user and application errors in “Unplanned Scenarios” on page 118, we noted that these problems can often go for days, weeks, or even longer before being identified. This means that backups might even be aged-out by the time the need for them is noticed. Thus, detection must be a high priority for all of engineering. In addition to building early detection around data loss or corruption, ensuring that there is as long of a window as possible in place to recover in case early detection fails is also critical. Let’s look at the different failure scenarios discussed, and identify some real-world approaches to detection and lengthened recovery windows.

User error

One of the biggest ways to reduce the time to identify data loss is to disallow manual or ad hoc changes from being executed in production environments. By creating wrappers for scripts, or even API-level abstractions, engineers can be guided through effective steps for ensuring all changes are as safe as possible, tested, logged, and pushed up to the appropriate teams. An effective wrapper or API will be able to do the following (a sketch follows the list):

• Execute in multiple environments via parameterization
• Provide a dry-run stage, in which execution results can be estimated and validated
• Run a test suite for the code execution
• Validate post-execution that changes met expectations
• Offer soft-deletion or easy rollback via the same API
• Log, by ID, all data modified, for identification and recovery
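The following is a hedged sketch of such a wrapper, covering the dry-run, soft-delete, post-validation, and logging points above. The table, predicate handling, and log destination are placeholders, and a production version would parameterize inputs rather than interpolating a WHERE clause.

#!/usr/bin/env python3
"""Sketch of a guarded data-change wrapper: dry run, soft delete, validate, log IDs.
The SQL, environment handling, and log destination are placeholders."""
import json
import logging

logging.basicConfig(filename="/var/log/data_changes.log", level=logging.INFO)

def run_change(conn, env: str, where_clause: str, dry_run: bool = True):
    cur = conn.cursor()
    # 1. Dry run: report what *would* be touched before mutating anything.
    cur.execute(f"SELECT id FROM orders WHERE {where_clause}")
    ids = [row[0] for row in cur.fetchall()]
    logging.info(json.dumps({"env": env, "matched_ids": ids, "dry_run": dry_run}))
    if dry_run:
        return ids
    # 2. Soft delete instead of a destructive DELETE, so rollback stays easy.
    cur.execute(f"UPDATE orders SET deleted_at = NOW() WHERE {where_clause}")
    conn.commit()
    # 3. Post-validation: confirm the change affected exactly the expected rows.
    cur.execute(
        f"SELECT COUNT(*) FROM orders WHERE {where_clause} AND deleted_at IS NOT NULL"
    )
    assert cur.fetchone()[0] == len(ids), "post-change validation failed"
    return ids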

By removing the ad hoc and manual components of these processes, you can increase the likelihood that all changes will be trackable by troubleshooting engineers. All changes will be logged so that there is traceability and the change cannot simply dis‐ appear into the day-to-day noise. Finally, by soft-staging mutations or deletions and building in easy rollbacks of any data, you give greater windows of time for problems with the change to be identified and corrected. This is not a guarantee. After all, man‐ ual processes can be extremely well logged, and people can forget to set up logging in automated processes, or they can bypass them.

Application errors A key to early detection of application errors is data validation. When engineers introduce new objects and attributes, database reliability engineers should work with them to identify data validation that can be done downstream, outside of the applica‐ tion itself. Like all testing, initial work should focus on quick tests that provide fast feedback loops on critical data components, such as external pointers to files, relationship mapping to enforce referential integrity, and personal identification information (PII). As data and applications grow, this validation becomes more expensive and more valuable. Building a culture that holds engineers accountable for data quality and integrity rather than the storage engines pays dividends in terms of not only flex‐ ibility to use different databases, but also by helping people feel more confident about experimenting and moving fast on application features. Validation functions as a guard rail, helping everyone to feel braver and more confident.

Anatomy of a Recovery Strategy | 123 Infrastructure services Any catastrophic infrastructure impacts that require recovery should be caught rap‐ idly by monitoring the operational visibility stack. That being said, there are some changes that can be quieter and potentially cause data loss, data corruption, or availa‐ bility impacts. Using golden images and comparing them regularly to your infrastruc‐ ture components can help identify straying from the test images quickly. Similarly, versioned infrastructure can help identify straying infrastructure and alert the appro‐ priate engineers or automation.

OS and hardware errors As with infrastructure services, the majority of these problems should be rapidly caught by monitoring of logs and metrics. Edge cases that are not standard will require some thought and experience to identify and add to monitoring for early detection. Checksums on disk blocks is an example of this. Not all filesystems will do this, and teams working with critical data need to take the time to consider the appro‐ priate filesystems that can identify silent corruption via checksumming.

Hardware and datacenter failures As with infrastructure services, these failures should be easily identifiable via the monitoring that we’ve already gone over in Chapter 4. Isn’t it great that we already did that? Building Block 2: Tiered Storage An effective recovery strategy relies on data being placed on multiple storage tiers. Different recovery needs can be served by different storage areas, which not only ensure the right performance, but also the right cost and the right durability for any number of scenarios.

Online, high performance storage This is the storage pool most of your production datastores will run on. It is charac‐ terized by a high amount of throughput, low latency, and thus, a high price point. When recovery time is of the utmost importance, putting recent copies of the data‐ store, and associated incremental backups on this tier is paramount. Generally speak‐ ing, only a few copies of the most recent data will reside here, allowing for rapid recovery for the most common and impactful of scenarios. Typical use cases will be full database copies to bring new nodes into production service after failures, or in response to rapid escalations of traffic that result in a need for additional capacity.

124 | Chapter 7: Backup and Recovery Online, low-performance storage This storage pool is often utilized for data that is not sensitive to latency. Larger-sized disks that have low throughput and latency profiles—and a lower price point—make up this pool. These storage pools are often much larger due to this, and thus more copies of data from further back in time can be stored in this tier. Relatively infre‐ quent, low-impact, or long-running recovery scenarios will utilize these older back‐ ups. Typical use cases will be finding and repairing application or user errors that slipped by early detection.

Offline storage Tape storage or even something like Amazon Glacier are examples of this kind of storage. This storage is off-site, and often requires movement via vehicle to bring it to an area where it can be made available for recovery. This can support business con‐ tinuity and auditory requirements but does not have a place in day-to-day recovery scenarios. Still, due to the size and cost, vast amounts of storage are available here, allowing for the potential of storing all data for the life of the business, or at least for a full legal compliance term.

Object storage Object storage is a storage architecture that manages data as objects, rather than files or blocks. Object storage gives features not available through traditional storage architectures, such as an API available to applications, object versioning, high degrees of availability via replication and distribution. Object storage allows for scalable and self-healing availability of large amounts of objects with full versions and history. This can be ideal for easy recovery of specific objects that are unstructured, and that are not reliant on relationships to other data for coherence. This gives an attractive opportunity to allow for recovery of application or user errors. Amazon S3 is a classic example of an inexpensive, scalable, and reliable object-level storage tier. Each of these tiers plays a part in a comprehensive strategy for recoverability across multiple potential scenarios. Without being able to predict every possible scenario, it is this level of breadth that is required. Next, we will discuss the tools that utilize these storage tiers to provide recoverability. Building Block 3: A Varied Toolbox So, now it is time to evaluate the required recovery processes by going through the scenarios and evaluating options. We know from various sections in this chapter that we have a series of tools available to us. Let’s look at them a little more closely.

Anatomy of a Recovery Strategy | 125 Replication Is Not a Backup! You will note that nowhere do we discuss replication as a way to effectively back up data for recovery. Replication is blind, and can cascade user errors, application errors, and corruption. You must look at replication as a necessary tool for data movement and syn‐ chronization, but not for creating useful recovery artifacts. If any‐ one tells you that they are using replication for backups, give them some side eye and move on. Similarly, RAID is not a backup. Rather, it is a redundancy.

Full physical backups We know that we will need to do full restores at each level of scope: node level, cluster level, and datacenter level. Rapid, portable full restores are incredibly powerful and mandatory in dynamic environments. They allow rapid node builds for capacity or for deployment of replacements during failures. A full backup can be done via full data copies over the network, or via volumes that can easily be attached and detached from specific hosts/instances. To do this, you need full backups. Full backups of a relational database require either the opportunity to lock the data‐ base to get a consistent snapshot from which you can copy, or the ability to shut the database down for the duration of the copy. In an asynchronously replicated environ‐ ment, the replicas cannot be completely trusted to be synchronized with the primary writer, so you should perform these full backups from the primary if at all possible. After the snapshot is created within the database or via a filesystem or infrastructure snapshot, you can copy that snapshot to staging storage. Full backups of an appending write datastore, such as Cassandra, involve a snapshot that utilizes an OS hard link at the OS level. Because the data in these distributed datastores is not on all nodes, the backup is considered an eventually consistent backup. Recovery will require the node being brought back into a cluster at which point regular consistency operations will eventually bring it up to date. A full backup on online, high-performance storage is for immediate replacement into an online cluster. These backups are typically uncompressed because uncompression takes a lot of time. Full backups on online, low-performance storage are utilized for building different environments, such as for testing, or for analytics and data foren‐ sics. Compression is an effective tool to allow for longer timelines of full backups on limited storage pools.

Incremental physical backups As discussed earlier, incremental backups allow for bridging the gap between the last full backup and a place in time after it. Physical incremental backups are generally done via data blocks that have a changed piece of data in it. Because full backups can

126 | Chapter 7: Backup and Recovery be expensive, both in terms of performance impact during the backup and storage, incremental backups allow you to quickly bring a full backup that might be older up to date for use in the cluster.
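As one concrete illustration, Percona XtraBackup is a common tool for full and incremental physical backups of MySQL. The sketch below shows the general shape of a full backup, an incremental against it, and the prepare steps that roll them together; the paths are placeholders and exact flags vary by XtraBackup version.

#!/usr/bin/env python3
"""Sketch of a full + incremental physical backup cycle using Percona XtraBackup.
Paths are placeholders and exact flags vary by XtraBackup version."""
import subprocess

FULL = "/backups/full"
INC = "/backups/inc1"

def xtrabackup(*args):
    subprocess.run(["xtrabackup", *args], check=True)

def take_full():
    xtrabackup("--backup", f"--target-dir={FULL}")

def take_incremental():
    # Only pages changed since the full backup are copied.
    xtrabackup("--backup", f"--target-dir={INC}", f"--incremental-basedir={FULL}")

def prepare_for_restore():
    # Apply the full backup's redo log, then roll the incremental into it.
    xtrabackup("--prepare", "--apply-log-only", f"--target-dir={FULL}")
    xtrabackup("--prepare", f"--target-dir={FULL}", f"--incremental-dir={INC}")

if __name__ == "__main__":
    take_full()
    take_incremental()
    prepare_for_restore()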

Full and incremental logical backups A full logical backup provides for portability and simpler extraction of subsets of data. They will not be used for rapid recovery of nodes, but instead are perfect tools for use in forensics, moving data between datastores, and recovering specific subsets of data from large datasets.
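A hedged sketch of that kind of subset extraction with mysqldump follows; the database, table, predicate, and output path are placeholders.

#!/usr/bin/env python3
"""Sketch of a logical, subset-level export with mysqldump.
Database, table, predicate, and output path are placeholders."""
import subprocess

def dump_subset(database: str, table: str, predicate: str, outfile: str) -> None:
    cmd = [
        "mysqldump",
        "--single-transaction",        # consistent snapshot without locking (InnoDB)
        f"--where={predicate}",        # only the rows we care about
        database, table,
    ]
    with open(outfile, "w") as f:
        subprocess.run(cmd, stdout=f, check=True)

if __name__ == "__main__":
    dump_subset("userdb", "orders", "created_at >= '2017-01-01'", "/exports/orders_2017.sql")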

Object stores

Object stores, like logical backups, can provide for easy recovery of specific objects. In fact, object storage is optimized for this specific use case, and it can easily be used by APIs to programmatically recover objects as needed.

Building Block 4: Testing

For such an essential infrastructure process as recovery, it is astonishing how often testing tends to fall by the wayside. Testing is an essential process to ensure that your backups are usable for recovery. Testing is often set up as an occasional process, to be run on an intermittent basis such as monthly or quarterly. Although this is better than nothing, it allows for long periods of time between tests during which backups can stop working. There are two effective approaches to adding testing into ongoing processes. The first one is incorporating recovery into everyday processes. This way, recovery is constantly tested, allowing for rapid identification of bugs and failures. Additionally, constant recovery creates data about how long your recovery takes, which is essential in calibrating your recovery processes to meet Service-Level Agreements (SLAs). Examples of constant integration of recovery into daily processes include the following:

• Building integration environments
• Building testing environments
• Regularly replacing nodes in production clusters

If your environment does not allow for enough opportunities to rebuild datastores, you can also create a continuous testing process, whereby recovery of the most recent backup is a constant process, followed by verification of the success of that restore. Regardless of the presence of automation, even offsite backup tiers do require occa‐ sional testing.

Anatomy of a Recovery Strategy | 127 With these building blocks, you can create an in-depth defense for different recovery scenarios. By mapping out the scenarios and tools used to recover them, you can then begin evaluating your needs in terms of development and resources. A Recovery Strategy Defined As we discussed earlier in this chapter, we have multiple failure scenarios to prepare for. To do this, we need a rich toolset, and a plan for utilization of each of those tools. Online, Fast Storage with Full and Incremental Backups This portion of the strategy supports the meat and potatoes of daily recovery. When you need to build a new node for rapid introduction into production or for testing, you use this strategy.

Use Cases

The following scenarios are the primary use cases for this portion of the strategy:

• Replacing failed nodes
• Introducing new nodes
• Building test environments for feature integration
• Building test environments for operations testing

Running a daily full backup is often the highest frequency possible due to latency during backups. Keeping up to a week’s worth allows for rapid access to any recent changes and is usually more than enough. This means seven full copies of the data‐ base, uncompressed, plus the amount of data required to track all changes for incre‐ mental backups. Some environments do not have the capacity or money for this, so permutations can occur in retention period and frequency as levers for tuning.
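To make those capacity levers concrete, here is a rough sizing sketch; the numbers are purely illustrative and the model ignores compression and metadata overhead.

def fast_tier_storage_gb(full_size_gb: float,
                         daily_change_gb: float,
                         retention_days: int = 7) -> float:
    """Rough storage needed for N daily uncompressed fulls plus their incrementals."""
    fulls = full_size_gb * retention_days
    # Each retained day also carries roughly one day's worth of changed pages.
    incrementals = daily_change_gb * retention_days
    return fulls + incrementals

# Example: a 500 GB database changing ~20 GB/day, kept for a week,
# needs on the order of 500*7 + 20*7 = 3,640 GB of fast storage.
print(fast_tier_storage_gb(500, 20))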

Detection Monitoring informs you when there is a node or component failure requiring recov‐ ery to new nodes. Capacity planning reviews and projections let you know when you need to add more nodes for capacity purposes.

Tiered storage Online, high-performance storage is required because production failures require rapid recovery. Similarly, testing must be as fast as possible to support rapid develop‐ ment velocity.

128 | Chapter 7: Backup and Recovery Toolbox Full and incremental physical backups provide the fastest recovery option, and are the most appropriate here. These backups are left uncompressed due to recovery time needs.

Testing Because integration testing happens frequently, these recovery scenarios occur fre‐ quently. In virtual environments, daily reintroduction of one node into the cluster allows for similar frequent exercising of recovery processes. Finally, a continuous recovery process is introduced due to the significant importance of this process. Online, Slow Storage with Full and Incremental Backups Here, we have slower storage with cheaper, more plentiful space.

Use cases

The following scenarios are the primary use cases for this portion of the strategy:

• Application errors
• User errors
• Corruption repair
• Building test environments for operations testing

When new features, failed changes, or inappropriate migrations occur and cause damage to data, you need to be able to access and extract large amounts of data for recovery. That is where this tier comes in. This is perhaps the messiest stage of recov‐ ery because there are too many permutations of potential damage to account for. Code often must be scripted during the recovery effort, which itself can lead to more bugs and errors without effective testing. Copying full backups from high-performance to low-performance storage through a compression mechanism is an easy way to get full backups into this portion of our strategy. Due to compression and cheaper storage, keeping up to a month or even longer is possible depending on budget and needs. In highly dynamic environments, the opportunity for missing corruption and integrity issues is much higher, which means you need to account for a longer amount of time.

Detection Data validation is the key for identifying the need for recovery from this pool. When validation fails, engineers can use these backups to identify what happened, and when, and begin extracting clean data for reapplication into production.

A Recovery Strategy Defined | 129 Tiered storage Online, low-performance storage is required because a long window of time is required for this part of the strategy. Cheap, large storage is the key.

Toolbox

Full and incremental physical backups are the most appropriate here. These backups are compressed because storage capacity matters more than recovery time at this tier. Here, you can also utilize logical backups, such as replication logs, in addition to physical ones to allow for more flexibility in recovery.

Testing Because this recovery does not happen as often, continuous automated recovery pro‐ cesses are critical to ensure that all backups are usable and in good shape. Occasional “game day” practice runs of specific recovery scenarios such as a table, or a range of data, are also good to keep teams familiar with the processes and tools. Offline Storage By far the least expensive, this is also the slowest of storage tiers from which to extract data.

Use cases
The following scenarios are the primary use cases for this portion of the strategy:

• Audits and compliance
• Business continuity

So, this part of the solution is really focused on rare but highly critical needs. Audits and compliance often require data going back seven years or more. But they are not time sensitive and can take quite a while to prepare and present. Business continuity requires copies of data kept away from the physical locations of the current production systems to ensure that if there is a disaster, you can rebuild. Although this is time sensitive, it can be restored in staged approaches that allow for flexibility. Copying full backups from low-performance to offline storage through a compression mechanism is an easy way to get full backups into this portion of our strategy. Keeping up to seven years or even longer is not only possible, but required.

Detection
Detection is not a substantial part of this component of the strategy.

Tiered storage
Cheap storage in vast sizes is required because a long window of time is needed for this part of the strategy. Tape or solutions like Amazon Glacier are often the choices here.

Toolbox
Full backups are the most appropriate here. These backups are also compressed; recovery time is not a priority at this tier, so storage cost wins.

Testing
Testing strategies here are similar to the online, slow storage tier.

Object Storage
An example of object storage is Amazon’s S3. It is characterized by programmatic access rather than physical (filesystem) access.

Use cases
The following scenarios are the primary use cases for this portion of the strategy:

• Application errors
• User errors
• Corruption repair

Object storage inspection, placement, and retrieval APIs are given to software engineers for integration into applications and administrative tools so that they can effectively recover from user error and application errors. With versioning, it becomes trivial to recover from deletes, unexpected mutations, and other incidents that would be time sinks for administrators without these tools.
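For example, with versioning enabled on an S3 bucket, undoing an accidental delete amounts to removing the delete marker. The sketch below assumes the boto3 SDK and a versioned bucket; the bucket and key names are placeholders.

    import boto3  # assumes AWS credentials are already configured

    s3 = boto3.client("s3")

    def restore_deleted_object(bucket: str, key: str) -> None:
        """Undo an accidental delete by removing the delete marker left by S3 versioning."""
        versions = s3.list_object_versions(Bucket=bucket, Prefix=key)
        for marker in versions.get("DeleteMarkers", []):
            if marker["Key"] == key and marker["IsLatest"]:
                s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])
                return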

Detection
Data validation and user requests are key for identifying the need for recovery from this pool. When validation fails, engineers can identify the date ranges of the occurrence and programmatically recover from the incident.

Testing
Because object-level recovery becomes a part of the application, standard integration testing should be more than enough to ensure that this works. With these four approaches to data recovery, we are able to provide a fairly comprehensive strategy for recovering from most scenarios, even those we don’t expect or plan for.

There is fine-tuning to be done based on recovery service-level expectations, budget, and resources. But, overall, we’ve set the stage for an effective plan that incorporates detection, metrics and tracking, and continuous testing.

Wrapping Up
You should be finishing this chapter with a solid understanding of the potential risks to your environment that could require data recovery. These risks are legion and unpredictable. One of the most important points is that you can’t plan for everything, and you need to build a comprehensive strategy to ensure that you can tackle anything that comes up. Some of this includes working with software engineers to incorporate recovery into the application itself. In other places, you need to build some pretty solid recovery software yourself. And in all cases, you must build off of the previous chapters on service-level management, risk management, infrastructure management, and infrastructure engineering to get there. In Chapter 8, we discuss release management. It is our hope that going into the rest of this book, data recovery stays in the forefront of your mind. Every step forward in an application and its infrastructure brings new risks to data and stateful services. The prime directive of the DBRE’s world is to ensure that data is recoverable.

CHAPTER 8
Release Management

As we automate and ease the burdens of infrastructure management, the database reliability engineer (DBRE) is able to devote more time to the highly valuable parts of the job. One of these high-value activities is working with software engineers to build, test, and deploy application features. Traditionally, the database administrator (DBA) would be a gatekeeper to production. They would expect to see each database migration, database object definition, and piece of code accessing the database to ensure that it was done correctly. When satisfied, the DBA would plan an appropriate hand-crafted change and husband it into production. You might already be thinking that this is not a sustainable process for environments experiencing significant numbers of deployments and changes to their database structures. In fact, if you’ve been part of one of these processes, you are already keenly aware of how quickly a DBA can go from gatekeeper to bottleneck, leading to burnout on the DBA end and frustration in software engineering. Our goal in this chapter is to look at how DBREs can effectively utilize their time, skills, and experience to support a software engineering process that utilizes continuous integration (CI) and even continuous deployment (CD) without becoming a bottleneck.

Education and Collaboration
One of the first steps the DBRE must take is educating the developer population about the datastores with which they are working. If the SWEs can make better choices about their data structures, their SQL, and their overall interaction strategies, there will be less need for direct intervention by the DBRE. By taking on the role of conduit of database knowledge to the SWE teams, you can have quite a significant impact on the continuous learning processes of your peers. This also fosters better

relationships, trust, and communication, all of which are critical to the success of the technical organization. To be clear, we are not advocating that the DBRE attempt a hands-off relationship with the software engineering team. Instead, we are suggesting an interaction wherein the DBRE uses regular interactions and strategic efforts to create a knowledgeable team that has access to resources and can function autonomously for a high degree of its day-to-day decisions with respect to the databases. Remember to keep everything that you do specific, measurable, and actionable. Define key metrics for your team’s success in this, and as you implement strategies and changes, see how they help the team. Some key metrics to consider in this process are:

• Number of database stories that require DBRE interaction.
• Success/failure of DB story deployments.
• Feature velocity: how quickly can an SWE get a story into production?
• Downtime caused by DB changes.

Agile methodology and DevOps cultures require people of different backgrounds, skill levels, and, of course, professional contexts to collaborate closely. Education and collaboration are a huge part of this process, and are great opportunities for you as the DBRE to shift out of the legacy “DBA” mode and become an integrated part of your technical organization.

Become a Funnel
You will undoubtedly find yourself following the blogs, Twitter feeds, and social accounts of people and organizations that you consider to be exceptional in the world of data and databases. In doing this, you will find articles, Q&A sessions, podcasts, and projects that have relevance and value to what you and your teams are doing. Curate these and share them. Create a regular newsletter, forum, or even a channel in chat where you can post relevant information and bring it up for discussion. Show the engineering team that you and the other DBREs are invested in their success and continued development.

Foster Conversations
The next step is to create active dialogue and interactions with software engineers. This is where you and the team begin to dig into relevant content that you have shared to generate ideas, learn to apply the information, and even improve upon it by identifying gaps and teaming up for further study and experimentation. There are multiple ways to do this, and it will depend significantly on the culture of learning and collaboration in your environment. Here are just a few examples:

• Weekly tech talks
• Brown bag lunches
• Online AMAs (“ask me anythings”)
• Chat channel focused on knowledge sharing

Similarly, you can hold open office hours, where people are encouraged to ask you questions, interact on specific topics, and explore things together.

Domain-Specific Knowledge
Although the previous components provide general foundation and knowledge relevant to the appropriate datastores and architectures in use at your organization, there is still a need for knowledge transfer specifically related to your organization’s domains.

Architecture
We are not fans of documentation that is static and untied to the processes that actually build and deploy our architectures. With configuration management and orchestration systems, you get a lot of documentation for free that is always up to date. Putting tools on top of these to allow for easy discovery, borrowing, and annotation of notes and comments creates a living, breathing document for teams. On top of this comes the ability to understand context and history. There is a reason that certain datastores, configurations, and topologies are chosen. Help engineers find out what architecture is in place, why it is in place, where to find documentation about how to interact with it, and, finally, what trade-offs and compromises have been made to get to where we are now. As the DBRE, it is your job to make this knowledge, context, and history available to engineers who are making decisions daily while working on features without your oversight. Building a knowledge base of design documents creates the structure necessary to build context and history around the architecture. These documents can apply to full projects that require new architectural components, or they might relate to smaller incremental changes or subprojects. For instance, you would definitely need a design document to show the process of moving from statement- to row-based replication, but it would not necessarily have the same requirements as the first Kafka installation to support building a distributed log for event-driven architectures. Creating and disseminating templates for these documents is a team exercise. It is critical, however, that certain pieces of information are included:

Executive summary
For those looking for the basics.

Goals and anti-goals
What was expected out of this project? What was out of scope?

Background
Context a future reader might need.

Design
From high level to quite detailed, you should find diagrams, sample configurations, or similar artifacts.

Constraints
What did you need to keep in mind and work around, such as PCI compliance, IaaS-specific needs, or staffing?

Alternatives
Did you evaluate other options? What methodology did you use, and why were they discarded?

Launch Details
How was it rolled out? What problems arose, and how were they managed? Scripts, processes, and notes go here also.

As you can see, these documents can potentially grow quite large. For some projects, that is OK. Distributed systems and multitier services are complex, and there is a lot of information and context that must be absorbed. Remember that a big goal here is giving that context to engineers without requiring more time from you than necessary.

Data model
Just as important as the architecture, data flow, and physical pipelines is information about the kind of data that is being stored. Letting software engineers know what kind of data is already stored, and where they can find it, can eliminate a significant amount of redundancy and investigative time from the development process. Additionally, this allows you to share how the same data should be represented in various paradigms: relational, key-value, or document oriented. This is also the opportunity to give best practices regarding which datastores are not appropriate for certain kinds of data.

Best Practices and Standards
Giving engineers standards for the activities they engage in regularly is another effective method for optimizing the amount of value you are able to generate. You can do this incrementally as you help engineers and make decisions. Some examples of this include the following:

• Datatype standards

• Indexing
• Metadata attributes
• Datastores to use
• Metrics to expose
• Design patterns
• Migration and DB change patterns

Publishing these as you work with engineers allows for a self-service knowledge base, accessible at any time rather than forcing teams to bottleneck on you.

Tools
Giving software engineers effective tools for their development process is the ultimate enabler. You might be helping them with benchmarking tools and scripts, data consistency evaluators, templates, or even configurators for new datastores. Ultimately, what you are doing is enabling greater velocity in the development process while simultaneously freeing up your time for higher-value efforts. Following are some excellent examples of tools:

• Etsy’s Schemanator
• Percona Toolkit, particularly online schema changes
• SQL tuning and optimization suites
• SeveralNines Cluster Configurator
• Checked-in change plan templates and examples
• Checked-in migration scripts and pattern examples
• Benchmark suites for easy testing, visualization, and analysis

Treat software engineering teams as your customers, and practice lean product development. Get them minimally viable tooling for doing their jobs, and consistently interview, monitor, and measure their successes, failures, pain points, and wishes. This will guide you toward the tools that will give them the most benefit.

Collaboration
If you’re regularly educating, creating tools, and empowering engineers, good relationships will naturally follow. This is critical, because those relationships lead to ongoing collaboration. Any software engineer should be empowered to reach out to the DBRE team to ask for information or for the chance to pair while they work. This gives great value bidirectionally, as the software engineers (SWEs) learn more about how the

DBRE team works and what they look for, and the DBRE team learns more about the software development process. DBREs can facilitate this further by proactively reaching out to engineers. There are stories that obviously depend heavily on DB development and refactoring. This is where DBREs should be focusing their efforts to help guarantee success and efficiency. Ask to pair or to be part of a team on these stories. Similarly, keeping an eye on migrations being committed into mainline will help the DBRE team cherry-pick where it needs to perform reviews. It goes without saying, but making sure that DBREs are not segregated into their own cave, deep underground and away from where code is being built, will help ensure that this collaboration can actually occur. Rotating DBREs and SWEs into each other’s projects and work can also do this. Throughout this section, we have been discussing ways in which you can help software engineers in the development process to be as self-sufficient as possible. As development teams grow, you need to utilize effective education, standards, and tooling to ensure that your teams are making good decisions without needing your direct intervention. Simultaneously, you are able to educate engineers regarding when they do need you to review upcoming changes and solutions so that you can assist at the appropriate times. Next, we discuss how to effectively support the various components of the delivery pipeline as DBREs. Although Continuous Delivery (CD) is not a new concept by any means, organizations have struggled to incorporate databases into the process. In each of the following sections, we discuss how to effectively introduce the database layers into the full delivery cycle.

Integration
Frequent integration of database changes allows for smaller, more manageable changesets and creates quicker feedback loops by identifying breaking changes as soon as possible. Many organizations strive for Continuous Integration (CI), enabling automatic integration of all checked-in changes. A large portion of the value of CI is the automated tests that prove that the database is meeting all expectations for the application. These tests are applied any time code is committed. Throughout the software development life cycle, any change to the database code or components should trigger a fresh build, followed by integration and its tests. You and the software engineering team are responsible for establishing the working definition of the database. Integration continues to verify that the database remains in a working state while the software engineers refactor the data model, introduce new datasets, and find new and interesting ways to query the database.

Doing CI for the database tier proves to be very challenging. In addition to the functional aspects of any applications utilizing database objects, there are operational requirements around availability, consistency, latency, and security. Changes to objects can affect stored code (functions, triggers, or views, among others), and even queries from other parts of the applications. Additionally, advanced features in databases, such as events, can create more fragility. Beyond testing for functionality, there are numerous potential edge cases involving data integrity. Even though constraints can sometimes enforce integrity, those rules must be tested. Even more concerning are environments in which no constraints exist.

Prerequisites
To establish CI at the database level, there are five requirements that you must satisfy. Let’s take a look at each one.

Version control system
Just as with infrastructure code and configurations, all database migrations must be checked in to the same version control system (VCS) as the rest of the application. It is critical that you are able to build from the latest configurations in order to understand how a recent database configuration change could potentially disrupt and break your application builds in new and interesting ways. At the risk of being redundant: everything must be checked in as code. This includes the following:

• DB object migrations
• Triggers
• Procedures and functions
• Views
• Configurations
• Sample datasets for functionality
• Data cleanup scripts

This provides a lot of useful things outside of CI:

• You can easily find all related items in one place.
• It supports all of the automated builds necessary for automated deployments (see Chapter 6).
• You can find all history and versions of the database, which helps for recoveries, forensics, and troubleshooting.

• You know that the application and database versions will be synchronized, at least in an ideal world.

As you continue to run integrations, validating checked-in code and infrastructure changes against a known working state, software engineers can apply the latest database versions to their development environments.

Database build automation
Assuming that you’re utilizing the configuration management and automation techniques discussed in Chapter 6, you should be able to automatically build databases for integration. This includes applying the latest data definition language (DDL) scripts and loading representative datasets for testing. This can prove more challenging than you might expect, because production data often must be cleaned up or scrubbed to ensure that no compliance issues occur from exposing customer data.

Test data
Empty databases almost always perform extremely well. Small datasets often do the same. You will need three different sets of data. First is all metadata needed for lookup tables. This is where you find IDs for customer types, location IDs, workflows, and internal tables. These datasets are generally small and crucial for the application to work correctly. Next, you need a working set of functional data, such as customers or orders. This is generally just enough to let the quick tests that run in the early stages of integration succeed before investing time in the more intensive tests. Finally, you need large datasets to help you understand what things look like under production load. These usually need to be built from production sets and scrubbed to ensure that you don’t expose customer data, accidentally send emails to thousands of users, or create other interesting and exciting opportunities for customer and legal interaction. Metadata and test datasets should be versioned, checked in, and applied as part of the builds. Larger datasets often come from production, and the scripts needed to restore and cleanse the data should be versioned and checked in to ensure that there is synchronization between the application and persistence layers.

Database migrations and packaging
This all presupposes that database changes are applied as migrations (incremental, coded changes). Each set of changes, such as an alter table, adding metadata, or adding a new column family, is checked in and given a sequence number. Because all changes are applied sequentially, you have a version number for the database at any time based on the most recently applied migration.
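A minimal sketch of this bookkeeping follows. It assumes sequence-numbered SQL files in a migrations/ directory and a schema_version table, and uses SQLite purely for illustration; any datastore with a comparable client would work the same way.

    import sqlite3
    from pathlib import Path

    MIGRATIONS_DIR = Path("migrations")   # e.g. 0001_create_users.sql, 0002_add_orders.sql

    def current_version(conn: sqlite3.Connection) -> int:
        conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
        row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
        return row[0] or 0

    def apply_pending(conn: sqlite3.Connection) -> None:
        """Apply checked-in migrations strictly in sequence-number order."""
        applied = current_version(conn)
        for script in sorted(MIGRATIONS_DIR.glob("*.sql")):
            version = int(script.name.split("_", 1)[0])
            if version <= applied:
                continue
            conn.executescript(script.read_text())
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
            conn.commit()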

Traditionally, DBAs would either get a list of changes from developers or do a schema diff between development and production to get the information they needed to apply the necessary changes for a release. The benefit of this is that large, potentially high-impact changes can be managed very carefully by the DB specialists. This can minimize potential downtime and impact during expensive migrations. The negative side of this traditional approach, however, is that it can be challenging to see which changes map to which features. If something must be rolled back, it can be challenging to identify the incremental database change related to the particular feature. Similarly, if a database change fails, all features waiting on those changes will be delayed, affecting the time for stories to get to production. The incremental approach allows for all of the things we want from agile approaches: rapid time to market, incremental and small changes, clear visibility, and fast feedback loops. But this means that SWEs must be more knowledgeable about creating safe migrations, and about when they should bring the DBRE team in to help them. Additionally, there is a risk that migrations might conflict. If two SWEs are modifying the same objects, their migrations will run serially, which could cause two alters instead of one. If the object has a lot of data in it, this could greatly increase migration time. You must consider trade-offs in these cases, which means that SWEs must be aware that they are potentially stepping on one another’s toes.

CI server and test framework
It is assumed that your software integration is already utilizing these. A good CI system will provide all of the necessary functionality for integration. Testing frameworks will provide both the system-level tests and the code component tests. At the system level, frameworks such as Pester for Windows or Robot for Linux are available. Additionally, you can utilize Jepsen, a distributed systems testing framework specifically built to validate data consistency and safety in distributed storage. With these prerequisites in place, you can begin the work of using your company’s CI platform for database migrations. As the name implies, continuous integration means that anytime a database change is committed, integration is performed automatically. For this to happen, and for the engineering team to be confident that the changes will not adversely affect the application’s functionality and service-level expectations, testing becomes the key tool.

Testing
So, you have all engineers checking their database changes into the VCS. The CI server is able to trigger automated database builds synchronized with the application

releases, and you have a testing framework. What’s next? We need to verify that integration works and what kind of effects it will have in the next phase: deployment. Unfortunately, we’re here to tell you that this stuff is hard! Database changes are notorious for affecting huge amounts of code and functionality. That being said, there are ways to build applications that make this easier.

Test-Friendly Development Practices
When designing development processes, you can make any number of choices that make testing easier. We’ve included two here as examples.

Abstraction and encapsulation
There are numerous ways in which you can abstract database access away from SWEs. Why would you do this? Centralizing database access code creates a standard, easily understood way of implementing new objects and accessing existing ones. It also means that you don’t have to hunt through the entire code base in order to make a database change. This simplifies testing and integration tremendously. There are a few different ways to do this abstraction:

• Data access objects (DAOs)
• APIs or web services
• Stored procedures
• Frameworks meant for this

With these in place, your integration can focus on testing the primitives around accessing and updating data first, to see whether changes have affected the ability to use them. As with any testing, you want high-impact, quick-execution tests first, and centralized data access code makes this much easier to accomplish.

Being efficient
Often you might find engineers using a “select *,” or retrieving an entire row of an object to work with. This is done to “future proof” the code or to ensure that whatever might need data gets it. Perhaps they want to be sure that if an attribute is added to the object, they automatically retrieve it. This is dangerous and, like most “future proofing,” is wasteful and puts applications at risk during changes. A “select *” retrieves all columns, and if code is not ready to handle that, it will break. All of the data retrieved also must be shipped over the network, which requires more bandwidth if you are retrieving multiple rows and begin overfilling your TCP packets. Being selective about what you are retrieving is crucial. You can modify object access code when the right time comes, and you’ll be prepared for it when it does.
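To make the point concrete, here is a small, hypothetical data access helper; the orders table and its columns are invented for illustration.

    # Hypothetical read path: ask only for the columns this code actually handles.
    RISKY = "SELECT * FROM orders WHERE customer_id = %s"             # breaks when columns change
    SAFER = "SELECT id, status, total FROM orders WHERE customer_id = %s"

    def fetch_order_summaries(cursor, customer_id):
        """Return lightweight order summaries without dragging every column across the wire."""
        cursor.execute(SAFER, (customer_id,))
        return [{"id": row[0], "status": row[1], "total": row[2]} for row in cursor.fetchall()]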

Post-Commit Testing
The goal of post-commit testing is to validate that changes apply successfully and that the application is not broken. Additionally, impact analysis and rules-based validation for security and compliance can occur at this level. After code has been committed, the build server should immediately build an integration datastore, apply changes, and begin a series of tests that are quick enough to run that the feedback loop to engineers is as tight as possible. This means a quick database build using a checked-in minimal dataset containing all of the metadata, user accounts, and test data necessary to exercise the appropriate functions on all data access objects. This allows for a fast turnaround so that engineers can see whether they’ve broken the build. Early in an organization’s life, much of this might be done manually. As rules come into play, tools and automation can be applied to make these processes faster and more bulletproof.

Pre-build
Prior to applying changes, the following validations against established rules for impact analysis and compliance can be performed (a sketch of such rules-based checks follows the list):

• Validation that SQL is formed correctly
• Validation of the number of rows potentially affected by changes
• Validation of index creation for new columns
• Validation that defaults are not applied to new columns on tables with existing data
• Validation of impact to stored code and referential constraints
• Report that sensitive database objects and attributes are being updated
• Report when rules required for compliance are being violated
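A minimal sketch of what such a rules engine might look like follows. Regular expressions are a deliberately simplistic stand-in; a real implementation would parse the SQL or use the database’s own dry-run facilities.

    import re

    def prebuild_violations(sql: str) -> list:
        """Flag a few of the rule violations listed above for a single SQL statement."""
        findings = []
        statement = " ".join(sql.lower().split())
        if re.match(r"^(update|delete)\b", statement) and " where " not in statement:
            findings.append("update/delete without a WHERE clause")
        if re.search(r"\badd\s+(column\s+)?\w+.*\bdefault\b", statement):
            findings.append("new column with a default on an existing table")
        if re.search(r"\bblob\b", statement):
            findings.append("BLOB datatype in a create/alter statement")
        return findings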

Build
When the build runs, validation of the SQL occurs again; in this case, it is based on actual application of the changes rather than rules-based analysis.

Post-build
After the changes have been applied to the build, you can run functional test suites. You can also create reports that show analysis of impact and any rules violations that occurred in the change.

Full Dataset Testing
The assumption here is that once the application runs against a full production dataset, there is the potential for the service to no longer meet service-level expectations. This means that the test suite should be run against production datasets at appropriate loads. This requires more preparation and resources, so this test suite can be scheduled asynchronously from the standard commit integration tests. Depending on the frequency of integration and code pushes, you might find that a weekly or even daily schedule makes the most sense for these tests. The steps taken for this extensive testing vary, but they will generally follow a sequence such as the following (a minimal orchestration sketch follows the list):

• Provision datastore and application instances
• Deploy code
• Recover full dataset
• Anonymize data
• Hook up metrics collection
• Apply changes to the datastore
• Launch functional, quick tests
• Perform load tests, ramping up concurrency
• Deprovision instances
• Post-test analysis
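As a sketch only, the orchestration can be as simple as running named steps in order and always tearing down instances at the end; the step names below map to the list above, and each one is assumed to wrap whatever provisioning, restore, and test tooling you actually use.

    def run_full_dataset_suite(steps: dict) -> None:
        """Run the staged suite in order; always deprovision and analyze, even on failure."""
        ordered = [
            "provision_instances", "deploy_code", "recover_full_dataset",
            "anonymize_data", "hook_up_metrics", "apply_changes",
            "quick_functional_tests", "load_tests",
        ]
        try:
            for name in ordered:
                steps[name]()          # each value is a callable wrapping your tooling
        finally:
            steps["deprovision_instances"]()
            steps["post_test_analysis"]()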

Some things that you will want to look at in these tests include the following:

• Latency changes for tests compared to previous runs on smaller datasets
• Database access path changes in optimizers that could affect latency or resource utilization
• Database metrics that indicate potential performance or functionality impacts (locking, errors, waits on resources)
• Changes in resource utilization from previous runs

You can automate some of this analysis, such as registration of queries into a centralized datastore and comparison of historical plan changes. Other analysis, such as metrics review, requires an operator to determine whether any changes are acceptable. If any red flags come up, automated or not, the DBREs are able to narrow down the changes requiring analysis by reviewing the changes applied since the last test run.

Although this does not allow for immediate flagging of a specific committed change, it does allow for much quicker identification. In addition to fast and slow analysis of applications, there are additional tests that must periodically be performed on a rapidly evolving datastore. These tests ensure that the evolving database will continue to be a good citizen in the overarching ecosystem. These are downstream tests and operational tests.

Downstream Tests
Downstream tests are used to ensure that any data pipelines and consumers of the datastore are not adversely affected by changes applied as part of the migrations. Like full dataset testing, downstream tests are best done asynchronously from the commit process. Here are some examples of downstream tests:

• Validating event workflows triggered by data in the database
• Validation of extraction, transformation, and loading of data into analytics datastores
• Validation of batch and scheduled jobs that directly interact with the database
• Validation that job times have not increased significantly, potentially affecting required delivery times or downstream processes

Similarly to full-dataset testing, these tests are often much more extensive and require larger datasets for consumption. By running them asynchronously but regularly, it is easier to identify potential changes that have affected the downstream processes flagged in testing. If tests fail, pushes to production can be stopped, and tickets can automatically be put in the DBREs’ queue when rules are violated.

Operational Tests
As datasets increase and schemas evolve, there are opportunities for operational processes to run longer and potentially fail. These process tests include the following:

• Backup and recovery processes
• Failover and cluster processes
• Infrastructure configuration and orchestration
• Security tests
• Capacity tests

These tests should regularly perform automated builds from production datasets, with all pending and committed changes applied before tests are run. Failed tests can advise the build server that there is a problem that must be evaluated and resolved

before changes can be pushed to production. Although it is rarer for database changes to affect these processes, the impact can be severe at the service level, and thus a high level of diligence is required. With a combination of continuous, lightweight builds and tests and strategically scheduled, more intensive tests, you can foster a greater degree of confidence across reliability engineering, software engineering, operations, and management that database changes can be safely introduced into production without direct intervention by the DBRE team. These integration processes are a perfect example of DBREs providing high amounts of value through process, knowledge sharing, and automation to empower software engineers without bottlenecking them. In the next section, we discuss the largest elephant in the room: deployment. Recognizing that a database change is safe is the first step, but safely getting those changes into a production environment is just as important.

Deployment
In the previous section on integration, we touched on the concept of database migrations and some of their pros and cons. Because we discussed how significant these can be, it makes sense to decompose data migrations in such a way that SWEs can easily and incrementally modify environments safely, or at least in as safe a way as possible. In an ideal world, our goals should be to empower SWEs to recognize when their database changes require analysis and management by DBREs in order to be effectively introduced into production. Additionally, we would be able to give those engineers the tools to safely and reliably introduce most changes into production themselves. Finally, we would give SWEs the ability to push their changes to production at any time, rather than during restrictive maintenance windows. Creating a reasonable approximation of this world is what we will review in this section.

Migrations and Versioning
As discussed in “Prerequisites” on page 139, each changeset that is applied to the database should be given a numeric version. This is generally done with incrementing integers that are stored in the database after a changeset is applied. This way, your deployment system can easily look at the database and discover the current version. This allows easy application of changes when preparing to push code. If a code deployment is certified for database version 456 and the current database version is 455, the deployment team knows that it must apply the changeset for 456 prior to pushing code. So, an SWE has committed changeset 456 into the code base, and integration has run successfully with no breaking changes. What comes next?
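Before moving on, here is a minimal sketch of that version gate; the version numbers are taken from the example above, and the function name is hypothetical.

    def changesets_to_apply(current_db_version: int, certified_version: int) -> range:
        """Return the changeset numbers the deploy must apply before pushing code."""
        if current_db_version > certified_version:
            raise RuntimeError("database is ahead of the code being deployed")
        return range(current_db_version + 1, certified_version + 1)

    # Example from the text: a deploy certified for 456 against a database at 455
    # must first apply changeset 456.
    assert list(changesets_to_apply(455, 456)) == [456]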

Impact Analysis
We discussed impact analysis in the previous section under post-commit testing. Some impacts, such as the invalidation of stored code in the database or the violation of security controls, are gates that cannot be passed. The SWE must go back and modify her changes until these impacts have been mitigated. In this section, we discuss the impact of performing the database migration on production database servers. Database changes can affect a production service in multiple ways.

Locking of objects
Many changes can cause a table or even a group of tables to be inaccessible for writes, reads, or both. In these cases, the amount of time the objects are inaccessible should be estimated and determined to be acceptable or not. Acceptable locking really is a matter of the service-level objectives (SLOs) and business needs, and thus is subjective. Previous changes on these objects can be recorded with specific metrics around the time it took to run the change. This provides some objective data with which to estimate impact time, even though the time it takes to change objects will lengthen as dataset size and activity increase. If the impact is unacceptable, the DBRE should work with the deployment team to determine a plan to either reduce the time to a point of acceptability or redirect traffic until the change has successfully finished.

Saturation of resources
A change can also utilize significant amounts of input/output (I/O), which can increase latency for all other transactions utilizing the datastore. This can cause service levels to be violated and eventually cause processes to back up to a point at which the application becomes unusable and other resources are also saturated. This can easily cause a cascading failure.

Data integrity issues
As part of these changes, there are often transitional periods during which constraints might be relaxed or deferred. Similarly, locking and invalidation can cause data to not be stored in the way SWEs would expect.

Replication stalls
Database changes can also cause increased activity and lagged replication. This can affect the usefulness of replicas and even put failover at risk. It is these impacts that we as DBREs must help SWEs proactively identify and avoid.

Migration Patterns
After impact analysis, the SWE should be able to make a decision on the appropriate way to deploy the migration. For many migrations, there is no reason to go through a lot of incremental changes and extensive review work. New objects, data inserts, and other such operations can be easily pushed through to production. After data is in the system, however, changes to or removal of existing data, and modification or removal of objects with data in them, create opportunities for your migration to affect service levels as discussed earlier. It is at this time that the SWE should bring the DBRE in. Luckily, there is a relatively finite set of changes for which you can plan. As you work with SWEs to plan and execute these migrations, you can build a repository of patterns for database changes. At some point, if they happen frequently and painlessly enough, you can automate them. For example, you can set up deployment gates in integration and testing that use rules-based analysis and testing results to determine whether migrations are safe to deploy. Some flagged operations could include the following:

• Updates and deletes without a WHERE clause to filter rows
• Number of rows impacted greater than N
• Alters on tables over a certain dataset size
• Alters on tables flagged in metadata as too busy for live alters
• New columns with defaults
• Certain datatypes in create/alter statements, such as BLOBs (Binary Large Objects)
• Foreign keys without indexes
• Operations on particularly sensitive tables

The more flags and safeguards you put in place to enable safety in production for everyone, the more confidence you create across all teams. This results in greater development velocity. Now, let’s assume that our intrepid SWE who checked in changeset 456 has had his change flagged due to an alter that is deemed to be impactful. At this point, he can use a migration pattern for that operation if one has already been applied and documented. Otherwise, he should create one in collaboration with the DBRE team.

Pattern: locking operations
Adding a column is a very typical operation in most database environments. Depending on the DBMS that you are using, these operations can be quick and simple, without locking the table. Other DBMSs will require a re-creation of the table. When adding the column, you might want to put a default value in the column as well. This

will definitely create a significant impact because the value must be entered into each existing row of the table before the change is completed and the lock is released. One way to avoid some locking operations is to utilize code; for example (a small sketch of the conditional-update approach follows the list):

• Add the empty column
• Perform regression tests
• Utilize conditional code in the select statement at access time to determine whether a row needs updating, rather than performing the update as a batch statement
• Set up a watcher to advise when the attribute is fully populated and the conditional code can be removed
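A sketch of that lazy, access-time backfill might look like the following; the users table, its columns, and the derivation of the new value are all hypothetical.

    # Hypothetical read path: populate the new column on first access instead of one
    # giant batch UPDATE that would hold locks and generate heavy I/O.
    def get_display_name(cursor, user_id):
        cursor.execute(
            "SELECT display_name, legacy_name FROM users WHERE id = %s", (user_id,))
        display_name, legacy_name = cursor.fetchone()
        if display_name is None:
            display_name = legacy_name.title()   # derive the value lazily
            cursor.execute(
                "UPDATE users SET display_name = %s WHERE id = %s",
                (display_name, user_id))
        return display_name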

For some operations, locking of the object is unavoidable. In these cases, you must give engineers a pattern, automated or manual. This might be a tool for performing online changes via triggers and table renames. Or, it might be a rolling migration utilizing proxies and failovers to apply the changes node by node on out-of-service nodes. It can be tempting to have two processes: one that is lightweight and one that has more steps. This way, you only roll out the complex pattern for changes with major impact. However, this can cause you to rely too heavily on one process, leaving the other underpracticed and perhaps buggy. It is best to be consistent, using the process that works most effectively for all locking operations.

Pattern: high resource utilization operations
There are multiple patterns that you can utilize here, depending on the operations being executed. For data modification, throttling by performing the updates in batches is a simple pattern to give engineers for bulk operations. For larger environments, it often makes more sense to utilize code to do lazy updating upon login of a user or querying of a row, for example. For data removal, you can encourage SWEs to utilize soft deletes in their code. A soft delete flags a row as deletable, which means it can be filtered out of queries in the application and removed at will. You can then throttle the deletes, removing them asynchronously. As with bulk updates, for large datasets this might prove to be impossible. If deletes are regularly performed on ranges, such as dates or ID groupings, you can utilize partitioning features to drop partitions. By dropping a table or partition, you do not create undo I/O, which reduces resource consumption. Should you find that DDL operations such as table alters create enough I/O that latency is affected, you should consider this a red flag that capacity might be reaching its limits. Ideally, you would work with operations to add more capacity to the

datastores. However, if this is not possible or is delayed, these DDL operations can be treated like locking operations, with the appropriate pattern applied.
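The throttled, batched removal of soft-deleted rows can be sketched as follows. The orders table and is_deleted flag are invented for illustration, the DELETE ... LIMIT syntax is MySQL-style, and commit behavior will depend on your driver’s transaction settings.

    import time

    def purge_soft_deleted(cursor, batch_size=1000, pause_seconds=1.0):
        """Remove rows already flagged as deleted, in small batches with pauses in between."""
        while True:
            cursor.execute(
                "DELETE FROM orders WHERE is_deleted = 1 LIMIT %s", (batch_size,))
            if cursor.rowcount == 0:
                break
            time.sleep(pause_seconds)   # throttle so foreground traffic keeps its latency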

Pattern: rolling migrations
As discussed in the previous sections, it often makes sense to give engineers the ability to apply changes incrementally across each node in a cluster. This is often called a rolling upgrade, because you are rolling the change through each node. You accomplish this somewhat differently depending on whether the cluster can be written to via any node or only one node can function as the target write node. In a write-anywhere cluster, such as Galera, you can take one node out of service by removing the node from a service directory or proxy configuration. After draining the traffic from the node, the change can be applied. The node is then reintroduced into the appropriate configuration or directory to put it back in service. In a write-leader cluster, where writes go to one node, you would take the followers in the replication chain out of service individually, as in the write-anywhere cluster. However, after this has been done for all nodes but the master, there must be a failover to a node with the changeset already applied. Obviously, both of these choices require a lot of orchestration. Understanding which operations are expensive and might require rolling upgrades is a critical part of choosing a datastore. It is also why many people are exploring database solutions that allow for low-impact schema evolution.

Migration testing
Even though it might seem evident, it is imperative to recognize that if a changeset’s implementation details are modified, the revised migration must be committed and fully integrated before being deployed to post-integration environments, including production.

Rollback testing
In addition to testing migrations and their impact, the DBRE and the teams she supports must consider the failure of migrations or deploys and the rolling back of partial or full changesets. Database rollback scripts should be checked in at the same time as the migrations themselves. There can be autogenerated defaults for some migrations, such as table creations, but there must be an accounting for data that has come in. Therefore, we don’t recommend reverting by simply dropping an object. Renaming tables allows them to still be accessible in case data was written and must be recovered. Migration patterns also ease the process of defining rollbacks. The lack of an effective rollback script can be a gating factor in the integration and deployment

process. To validate that the scripts work, you can use the following deployment and testing pattern (a minimal sketch of the round trip follows the list):

• Apply changeset
• Quick integration tests
• Apply rollback changeset
• Quick integration tests
• Apply changeset
• Quick integration tests
• Longer and periodic testing
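Assuming each step is wrapped in a callable by your build tooling, the round trip is trivial to automate; the function names here are placeholders.

    def rollback_round_trip(apply_changeset, rollback_changeset, quick_tests):
        """Apply, roll back, and re-apply a changeset, running quick tests after each step."""
        for step in (apply_changeset, quick_tests,
                     rollback_changeset, quick_tests,
                     apply_changeset, quick_tests):
            step()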

Much like testing recoveries, testing fallbacks is critical, and you must incorporate it into every build and deploy process.

Manual or Automated
Another advantage of using migration patterns is that a pattern can allow for automatic approval and deployment rather than waiting for review and execution by the DBRE team. It also means that certain patterns can automatically flag the DBREs for implementation outside of the automated process. There is no reason to rush the path to automation, particularly when dealing with critical data. Community best practices posit that anything run frequently should be automated if at all possible, but this is tempered by the impact of failed automation. If you have built an environment that has tested and reliable fallbacks, rapid and practiced recovery processes, and mature engineers, you can begin to take migration patterns and automate the application of those changes. But even just moving toward standardized models, push-button deploys and fallbacks, and guard rails and flags provides significant progress toward our goals.

Wrapping Up
In this chapter, we covered the ways in which a DBRE team provides outsized value to software engineering teams through the development, integration, testing, and deployment phases of software development. We can’t emphasize enough how much of this relies on collaboration and very close relationships between the DBRE, operations, and SWE teams. As the DBRE in this equation, you must consider yourself teacher, diplomat, negotiator, and student. The more you invest in education and relationships, the greater the dividends that will be paid as you apply the knowledge shared in this section.

A natural progression from release management is security. Your data is one of the most significant attack targets in your infrastructure. Every change and feature potentially creates vulnerabilities that must be planned for and mitigated. In Chapter 9, we discuss how to bring value to security planning and processes.

CHAPTER 9
Security

The function of security has always been a significant part of the database administrator’s job. Just as with recovery, the security of the organization’s most critical asset is paramount. Security incidents and attempts are occurring with greater and greater frequency. You need only read the news to learn about high-profile cases of hundreds of thousands (even millions) of user profiles, credit cards, and emails being stolen and resold regularly. In the siloed world, the database administrator (DBA) would focus on his database security controls only, hardening in isolation and assuming that the rest of security was the job of someone else. As the steward of the organization’s data, however, the database reliability engineer (DBRE) must take a more holistic approach to the job. We’ve already spoken about continuous deployment (CD) pipelines, cloud environments, and infrastructure as code in earlier chapters of this book. Each of these areas represents new attack vectors for potential thieves and vandals to get at your data. In this chapter, we craft a paradigm for database security for the DBRE to match today’s organizations and infrastructures. We then approach the craft, discussing the potential attack vectors, a methodology and strategy for mitigation, and a holistic model that the DBRE can champion.

The Purpose of Security
It goes without saying that security is a function as crucial as data recovery, which we discussed in Chapter 7. Depending on the data, stolen data is as bad as corrupted data. But, just as recovery has broad use cases beyond emergency recovery, so does security serve multiple functions.

Protecting Data from Theft
This is the classic use case. In most cases, wherever data is stored, someone wants access to that data outside of its normal use. Individuals both internal and external might want to gain access to databases to resell customer data, to get at competitive secrets, or simply to cause damage using the data that they have acquired. Attack vectors for this include the following:

• Data in the online databases
• Data moving between datastores
• Data in backups and archives
• Data going from datastores to applications and clients
• Data in memory on application servers
• Data going from applications across the internet to users

As we just mentioned, not all thieves are external. Internal users who already have knowledge of systems and authenticated access are even more dangerous than the many imagined bogeymen out on the internet. Regardless of location, the DBRE works with InfoSec, Ops, and software engineers (SWEs) to make certain that data cannot be read, duplicated, or moved out of its appropriate place.

Protecting from Purposeful Damage
Sometimes, the intent of a malicious actor is purely to hurt the organization. Corrupting or manipulating data, shutting down databases, or consuming IT resources until they are no longer accessible are all ways someone can damage an organization’s databases and data. These can take the form of Denial of Service (DoS) attacks, exploitation of bugs that will shut down databases, and access that allows manipulation of data or storage. The good thing about these attacks is that they often are recoverable via backups. But damaging backups is often easier than damaging an online store.

Protecting from Accidental Damage
Although we often think of security as a function to protect us from bad actors, security is just as important in ensuring that someone does not accidentally wander into the wrong environment, schema, object, or row and cause damage unintentionally. A fence can keep outsiders out and help insiders know when they are straying into areas they didn’t mean to enter. Just as internal actors are more dangerous than external ones when you’re protecting data from theft, accidental vandals and saboteurs often come with tools and credentials that can quickly cause catastrophes.

Protecting Data from Exposure
Even without a witting or unwitting actor, there are still risks. In complex, distributed, and decoupled systems, it is rather easy for a bug or a misplaced set of credentials to expose sensitive clear-text data in logs or in the wrong customer’s browser or email, or even to allow unauthorized people to log in to another person’s account. This kind of exposure can rightfully and rapidly erode trust in your organization’s ability to safeguard and husband data.

Compliance and Auditing Standards
Organizations are under intense scrutiny from numerous standards and laws that help to protect customers and individuals. It is security’s job to educate the organization regarding these standards and to ensure that the organization complies. It is a thankless job, and one that often frustrates people looking to focus on new features and scaling. Still, it is essential if the organization doesn’t want to find itself shut down or fined large sums of money.

Database Security as a Function
Throughout this book, we’ve pushed strongly on the notion of cross-functional relationships and approaches to database reliability. The DBRE has become much more of a liaison, subject matter expert, and educator to the rest of the organization. With developer teams growing exponentially, this is the only way to scale. Security experts are often among the most understaffed positions in the organization, making it even more challenging for the DBRE and information security (IS) teams to effectively guard company data in the face of constant development and change.

• Fostering conversations

• Creating domain-specific knowledge bases
• Collaborating via pairing and reviews

You do this to teach SWEs how to develop more effectively and safely on their own. This increases the performance and effectiveness of the application, it reduces the amount of downtime and degradation of service from poor implementations and design, and it increases the velocity of development teams. Similarly, this constant grassroots effort must be applied to educating on database security, building defenses against attacks on the organization’s datastores. This includes the following:

• Secure database access configuration and controls.
• Effective use of security features such as encryption, fine-grained access control, and data management.
• What data the database exposes that can be brought into instrumentation, logs, and telemetry to help expose malicious or damaging activity.
• Database-specific vulnerabilities that must be managed at other layers, including updates as new CVEs are released.

CVEs
The acronym CVE stands for Common Vulnerabilities and Exposures (see the CVE database). You can also generate custom feeds to follow based on topics of interest, such as SQL injection vulnerabilities. This is an excellent resource for staying up to date on newly discovered vulnerabilities and updates to existing ones.

By continued education and collaboration, database security becomes a regularly discussed topic, one that is explored, critiqued, and researched within the organization.

Self-Service
The next step in fostering a rigorous and mature security process that can scale with development team size and feature velocity is to create self-service approaches to database security. You’ll never be able to review every feature, every new service, and every new datastore on your own. Instead, you will find yourself constantly blocking requests as your backlog grows and grows. Partnering with InfoSec to create reusable, approved security patterns that engineers can check out and use at will enables a scalable security process. As we discussed in Chapter 5, Infrastructure Engineering, infrastructure as code allows you to create approved deployments of all datastores that might be created and released into the wild. This means that a large portion of your time is spent on

building these gold standards, researching vulnerabilities, and revising or updating the playbooks in your platform to mitigate those vulnerabilities, including the following:

• Approved software build numbers
• Removal of default accounts and passwords that come with datastores
• Locking down of unnecessary ports
• Setting up effectively constrained access lists to reduce the points of entry to your datastores
• Removing features and configurations that allow exploits via the filesystem or network
• Installing and setting up keys for Secure Sockets Layer (SSL) communications
• Scripts to check and enforce password policies
• Configuring auditing and log forwarding to ensure that all access can be reviewed and protected from tampering

By checking in all of this and making it available to people deploying new datastores, you can allow preapproval of these infrastructures due to the use of the gold standard. The DBRE and InfoSec teams do not need to spend time reviewing and reporting on vulnerabilities with these known installs.

In addition to infrastructure self-service, code libraries such as logging, authentication, password hashing, and encryption can all be checked in and made available. This includes client software.

Building Your Own Database Clients

Similar to self-service for infrastructure, providing self-service clients and libraries is an effective mitigation technique. Clients provided by vendors often already have workarounds that you might not be aware of, thus exposing you unknowingly to vulnerabilities in the future. These clients will often use older protocols, too, to ensure backward compatibility. By writing your own, you reduce the risk of unknown factors, and you take control of a core part of the database tier: the access layer.

Integration and Testing

Integration and testing processes represent excellent opportunities to catch vulnerabilities early and often, rather than at the end of the development process when fixes are exponentially more expensive. Conversely, they represent high-risk opportunities for exploitation, given that an infiltrator who owns the testing and integration server can bypass all tests and inject malicious code quite easily.

During integration, standard tests that have been approved by security can be applied to automatically validate that vulnerabilities are not being introduced. This can include, but is not limited to, the following:

• SQL injection vulnerabilities in database access functions
• Testing the authentication layer for common flaws, including plain-text communications, plain-text credential storage, or connecting as elevated administrative users
• Testing new stored code for exploits such as buffer overflows

In addition to the immediate post-commit tests that are always run, more intensive tests can be run asynchronously on a regular time schedule. This includes penetration tests at the application level, but also rigorous testing within the network for vulnerabilities that can be exploited via unauthenticated and authenticated means that will grant access to the database or operating system (OS).

Operational Visibility

Integrating all outputs of the security function into standard logging and telemetry is crucial. This data comes from everywhere in the stack, including the application, database, and OS layers. You saw much more of this in Chapter 4.

Application layer instrumentation

Tracking every failed and successful SQL statement sent to the database is critical for identifying SQL injection attacks. SQL syntax errors can be a leading indicator: they alert you that someone or something is trying to pass unplanned SQL through the application into the database. Syntax errors in a working, tested application should be very rare. Similarly, SQL injection patterns, if studied, will show strings that often indicate an attack is underway. This includes UNION and LOAD_FILE statements. We discuss this further in "Vulnerabilities and Exploits" on page 160.

Audit data also should be collected around personally identifiable information (PII) or critical data. Using metadata to flag API endpoints as PII/critically impacting or not allows you to collect granular data on accessing, changing, or removing data such as passwords, emails, credit cards, or document blobs. Although auditing will also occur at the database layer, this application-level auditing can easily give appropriate staff members access to see whether application code is being accidentally or purposefully abused.
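As a hedged sketch of this kind of application-level instrumentation, the following scans query log lines for strings the text calls out as injection indicators (UNION, LOAD_FILE, and similar). The pattern list is illustrative and should be tuned to your own workload, since some applications legitimately use these constructs.

import java.util.List;
import java.util.regex.Pattern;

public class InjectionPatternScanner {
    // Strings that rarely appear in legitimate, tested application SQL but
    // frequently show up in injection attempts.
    private static final Pattern SUSPICIOUS = Pattern.compile(
            "(?i)\\b(union\\s+select|load_file|into\\s+outfile|or\\s+1\\s*=\\s*1)\\b");

    // Returns how many log lines look suspicious; feed this count into telemetry
    // so spikes can be alerted on.
    public static long countSuspicious(List<String> queryLogLines) {
        return queryLogLines.stream()
                .filter(line -> SUSPICIOUS.matcher(line).find())
                .count();
    }
}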

Database layer instrumentation

At the database layer, any and all activity should be logged and pushed into the operational visibility stack for analysis. Following are some activities to look for:

Configuration changes
These can occur in a file or in memory. Configuration file changes can open a system up completely to exploitation.

Database user changes
Privilege or password changes and new users should all be inspected for corresponding migrations that have been checked in to code and integrated. Otherwise, these changes could be deliberately creating security holes.

All select, insert, update, and delete activity
Auditing at the database level provides a good supplement and foil to application-level auditing. Excessive querying or modifications, queries coming from unexpected users, and unexpectedly large result sets can all indicate problems.

New database objects, particularly stored code
New or modified functions, procedures, triggers, views, and user-defined functions (UDFs) should all be correlated to database migrations because they can be indications of exploits.

Logins, successful and failed
There should be expected traffic patterns for any database. Application users will come from specific hostgroups, and, generally speaking, no one should be directly logging in to a database. In some environments, you might go so far as to mark a datastore suspect if there is a login that does not occur from an application server, proxy, or other approved client.

Patches and binary changes
Hot patches can occur from a user who has gained OS access. This might be through network buffer overflows or other exploits. These changes can create backdoors and potentially malicious code.

OS instrumentation

Just as with the database, operating systems must be carefully monitored and logged, as well. This includes the following:

Configuration changes
Just as with database changes.

New software, scripts, or files
New or modified software, scripts, and files are almost always bad signs outside of temporary directories, log directories, or other expected landing zones for new files. Comparing these regularly to golden images can indicate malicious activity.

Logins, successful and failed
Just as with database logins.

Patches and binary changes
Just as with database changes.

Comprehensive data gathering combined with effective tools for comparison and anomaly detection is critical for identifying malicious activity that has gotten past security. There is no security strategy comprehensive and up-to-date enough to keep everyone out, so effective visibility is required, and the DBRE team works hand in hand with Ops, InfoSec, and SWE to ensure this instrumentation.

Vulnerabilities and Exploits

We've spoken at a high level about the DBRE's primary job duties for helping build a scalable security function, and we have discussed various potential threats at a very high level. In this section, we discuss the potential vulnerabilities that must be considered and planned for as the DBRE goes about training and educating the organization, building self-service configurations, and setting up monitors, templates, and playbooks for response.

When modeling threats, it is important to classify and prioritize. Balancing and prioritizing these threats against all of your other priorities is critical. There are already structured approaches for doing this. For example, Microsoft has popularized STRIDE for classification and DREAD for prioritization.

STRIDE

STRIDE is a classification scheme for characterizing known threats according to the kinds of exploit that are used (or the motivation of the attacker). The STRIDE acronym is formed from the first letter of each of the following categories:

Spoofing identity
Identity spoofing allows a user to assume another identity in order to bypass access controls. Because most multiuser applications end up connecting to the database as a single user, there is significant risk.

Tampering with data
Users can change data via application POST activities in addition to actions taken with spoofed or assumed identities. Data validation and APIs, even for administrative activities, are critical for protecting against that.

Repudiation
Without proper levels of auditing, customers and internal users can dispute activities that they have taken. This can lead to financial losses in disputes, failing of audits, and an inability to find malicious activities.

Information disclosure
Customer and private details can be revealed to the public, to competitors, and to malicious buyers. This also can include accidental disclosure.

Denial of service
Applications and specific infrastructure components are also subject to denial of service. This can come from expensive operations or simply masses of activities distributed across the world.

Elevation of privilege
Users can potentially move into roles with higher degrees of privilege. At the highest level, application users can achieve root access on servers.

DREAD

DREAD classification allows for risk analysis and prioritization based on the risk presented by each evaluated threat. The DREAD algorithm shown here is used to compute a risk value, which is an average of all five categories; a brief computational sketch follows the category lists.

Damage potential
If a threat exploit occurs, how much damage will be caused?

• 0 = Nothing
• 5 = Individual user data is compromised or affected
• 10 = Complete system or data destruction

Reproducibility
How easy is it to reproduce the threat exploit?

• 0 = Very difficult or impossible, even for administrators of the application
• 5 = One or two steps required; may need to be an authorized user
• 10 = Just a web browser and the address bar is sufficient, without authentication

Exploitability
What is needed to exploit this threat?

• 0 = Advanced programming and networking knowledge, with custom or advanced attack tools

• 5 = Malware exists on the internet, or an exploit is easily performed using available attack tools
• 10 = Just a web browser

Affected users
How many users will be affected?

• 0 = None
• 5 = Some users, but not all
• 10 = All users

Discoverability
How easy is it to discover this threat?

• 0 = Very hard to impossible; requires source code or administrative access
• 5 = Can figure it out by guessing or by monitoring network traces
• 9 = Details of faults like this are already in the public domain and can be easily discovered using a search engine
• 10 = The information is visible in the web browser address bar or in a form
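Because DREAD reduces to simple arithmetic, it is easy to fold into tooling. The sketch below, using the 0-10 ratings defined above, computes the average of the five categories; the sample threat and its ratings are purely hypothetical.

public class DreadScore {
    // Each argument is a rating on the 0-10 scale described above.
    public static double score(int damage, int reproducibility, int exploitability,
                               int affectedUsers, int discoverability) {
        return (damage + reproducibility + exploitability
                + affectedUsers + discoverability) / 5.0;
    }

    public static void main(String[] args) {
        // Hypothetical threat: SQL injection reachable from a public search form.
        System.out.println(score(5, 10, 5, 10, 9)); // 7.8 -> prioritize this work
    }
}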

As we walk through each of the potential attack vectors, using a categorization like this allows you to determine where to focus your energy and resources.

Basic Precautions

In this section, we discuss multiple possible precautions, with examples from various datastores. This includes general mitigating techniques that will be expanded on in the section on strategy. Some mitigation techniques are more general and can be applied to multiple categories. These include the following:

Configuration
Remove all unnecessary features and configurations from the database. Many database systems are feature rich, and most applications will not utilize even a small portion of those features. Shutting these down can reduce attack vectors.

Patching
Continued scanning of vulnerability databases will give you regular security patches to be applied. Keeping these up to date will reduce the risk of exploitation.

Removing unnecessary users
Default users and passwords are well known and create significant risk.

Network and host access
Use firewalls and security groups to minimize which hostgroups have access to the databases, and over which ports. Similarly, using restrictions via roles and privileges to minimize the ability for anyone to access systems is of top priority.

The Dangers of Defaults

At the time of this writing, there has been a staggeringly significant, and entirely preventable, security exploit against MongoDB and ElasticSearch databases listening on public IPs. In 2015, Shodan wrote an article revealing that more than 30,000 MongoDB instances were publicly accessible because the default listening IP was 0.0.0.0 and no authentication was enabled. That's more than 595.2 TB of data exposed simply because no one paid attention to defaults in early versions of the server.

Here are the categories we will be discussing:

• Denial of service
• SQL injection
• Network and authentication protocols

Denial of Service

Denial of Service (DoS) attacks are a family of attacks designed to make a service or application unavailable by throwing so many requests at it that its resources are saturated and it cannot fulfill requests from actual users. This often takes the form of exhausting network bandwidth via a distributed network of clients flooding a network with requests. The other form these attacks can take involves exhausting the resources of a specific server or cluster, such as a database. By consuming all CPU, memory, or disk, the critical server can become unresponsive, effectively taking down all services that depend on it.

These attacks are typically not destructive in that they do not damage or steal data. They are designed to take down a service, whether for general nuisance, for anti-competitive actions, or as a cover to distract InfoSec and Ops teams while other attacks are taking place.

Large network flooding attacks are the norm, and thus most defensive techniques focus on these. This has caused attackers to go up the stack to service components that are more vulnerable, have fewer resources available to them, and that function as lynchpins. Unfortunately, databases are a perfect target here, and thus the DB-DoS (Database Denial of Service) was born. With minimal effort, database logic can be executed at exponential levels, saturating resources in ways that do not look very different from normal elevated traffic.

Following are some potential impacts of a DB-DoS attack that we will want to mitigate:

• Consumption of user connections until application servers are starved out
• Disrupting the optimizer with a huge number of queries that require parsing, hashing, and inspection during optimization
• Autoscaling of resources until a budget measure shuts down the service
• Removal of valid data from caches, causing extensive disk I/O
• Increased memory usage, potentially causing swapping
• Table growth and logs that consume all disk space
• Excessive replication lag due to large amounts of writes
• Starving of OS resources, including file descriptors, processes, or shared memory

These can be instigated easily by using a number of different tactics. The easiest is simply to use the application functionality itself. Here are a few examples:

• Building large shopping carts
• Searches with no input or broad input
• API calls that return slowly, which can tell an abuser that a query might not be optimized or indexed, and thus is a target for repetitive calls
• Adding UNIONs to inputs in forms that can cause huge numbers of joins and scans (this is a SQL injection technique also)
• Sorting large result sets
• Creating edge cases such as a huge number of posts in a forum application, or a huge number of friends in a social application

Similar to abusing application functionality, a knowledgeable attacker who can identify the database in use can often find ways to shut it down. For instance, an attacker can lock out users by logging in incorrectly, run administrative commands via SQL injection that clear caches, or send malformed XML that causes overflows in parsers.

Mitigation

In addition to the standard mitigation techniques that we have already discussed, the most effective approaches to mitigating a DB-DoS are quite similar to the techniques used for surviving heavy traffic and growth issues. There is a tripod that supports the ability to survive these sudden surges in resource utilization, whether from legitimate or illegitimate traffic sources. You will notice that automated capacity scaling is not included. There is always an upper bound to capacity, whether due to hardware, software, or budget, and any DB-DoS can probably get you there.

Resource management and load shedding

Over time, the technical teams will begin to understand the general workload characteristics of their applications. It can reasonably be assumed, then, that they should be able to build a set of tools to effectively handle surges in load. As the DBRE, your responsibility is to help educate and support SWEs in understanding these workloads and prioritizing work to reduce the risks effectively. These tools can include the following:

Client-side throttling (a brief sketch follows this list)
Rather than letting a robot hit an endpoint repeatedly, putting in a throttle on the time between submitting a request and resubmitting it can effectively stop or slow down surges. You can do this via basic counters, exponential degradation, or ratios of calls to retries, or combine it with data coming back from the application regarding service-level quotas being exceeded.

Quality of service
You can also classify traffic going to the application based on criticality. Marking expensive queries, such as search, that can be exploited as less critical allows the application to enforce quality-of-service quotas that make DB-DoSes more difficult to achieve.

Degrade the results
For expensive queries and remote procedure calls (RPCs), you might want to develop two execution paths. For normal loads, a full execution path will be fine, but perhaps you want to reduce the number of rows scanned or shards queried during heavy loads such as what could be generated during a DB-DoS.

Query killers and heavy-handed approaches
If you do not have the ability to do more comprehensive code-based approaches, you might need to use brute force. Killing long-running queries or putting performance profiles at the database layer to reduce the number of resources a query can consume are effective enough, at the cost of a lack of control and a potentially bad user experience that is difficult to mitigate at the code level.
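To make the client-side throttling idea concrete, here is a minimal sketch of an exponential-degradation throttle a client library might wrap around retries. The class name and parameters are illustrative; production implementations usually add jitter and tie into the quota signals mentioned above.

public class RetryThrottle {
    private final long baseDelayMillis;
    private final long maxDelayMillis;
    private int consecutiveFailures = 0;

    public RetryThrottle(long baseDelayMillis, long maxDelayMillis) {
        this.baseDelayMillis = baseDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Call when the service reports an error or an exceeded quota.
    public synchronized void recordFailure() { consecutiveFailures++; }

    // Call on success so the client returns to full speed.
    public synchronized void recordSuccess() { consecutiveFailures = 0; }

    // Delay grows exponentially with consecutive failures, capped at maxDelayMillis.
    public synchronized long nextDelayMillis() {
        if (consecutiveFailures == 0) return 0;
        long delay = baseDelayMillis * (1L << Math.min(consecutiveFailures - 1, 20));
        return Math.min(delay, maxDelayMillis);
    }
}

A caller simply sleeps for nextDelayMillis() before resubmitting a request, which slows a misbehaving client long before it can saturate the database tier.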

Continual improvement of database access and workloads

If you ever needed a justification to devote DBRE and SWE resources to cleaning up the most expensive queries in the database tiers, the potential for DB-DoS is it. Because these queries often might not be part of normal workload patterns, they might be ignored in a methodology where you are tuning based on the aggregate consumption of resources at the database tier. If these outliers are only called infrequently, it is easy to ignore them. But a clever sleuth can find them and exploit them. This means that a good performance process should ideally be looking for the most expensive queries regardless of execution frequency in order to push these into a tuning queue.

Logging and monitoring

Regardless of the aforementioned efforts, a persistent actor can still affect your database. Effective monitoring of execution calls by endpoint should be able to identify significant spikes and reveal them to triage teams for throttling or even shutting down. Similarly, if there are queries or activities that have no upper bounds, monitoring the number of items in IN clauses, memory structures, permanent or temporary tables, or similar structures can help to identify potential problems.

It is important to remember that attacks can include destructive forms in addition to theft or vandalism. DB-DoS attacks can be easily forgotten when planning a security function. Now, let's go on to the next threat to consider.

SQL Injection

SQL injection is a class of exploits in which database code, usually SQL, is injected into an application input. This is done in order to bypass security and execute code in the database that has nothing to do with the application's expected input. SQL injection can be used to exploit bugs that cause buffer overflows. Buffer overflows can shut down a database, perform a DB-DoS, or give a user elevated privileges at the database, and even the OS, level. An example of this can be found here.

Another attack vector that utilizes SQL injection is stored code in the database itself. Stored code such as stored procedures can often execute arbitrary statements at elevated privileges. This can be exploited by an internal user or by someone who has managed to get access via credential guessing or sniffing.

SQL injection can also be used to access data by utilizing UNION statements to get datasets from other tables with the same number of columns as the tables being queried in the original form. For instance, if a search form queries a table with five columns, a UNION injected into the form could allow you to add result sets from any other table with five columns. Data can be effectively stolen without needing to exploit a bug.

Mitigation

SQL injection mitigation at the application layer begins with education of your software engineering team. When coding, software engineers must avoid dynamic queries and must prevent input with malicious SQL from modifying queries.

Prepared statements

The first step is ensuring that engineers use some variant of prepared statements. A prepared statement is also known as a parameterized statement. In a prepared statement, the structure of the query is defined in advance. Form input is then bound to a variable that is used to run the query. The opposite of this is to dynamically define and build SQL at runtime. A prepared statement is safer because the bad actor will not be able to modify query logic. If SQL injection occurs, the injected SQL will simply be treated as a string for comparison, sorting, or filtering, rather than as a separate SQL statement for execution.

Example 9-1. Example prepared statement in Java

String hostName = request.getParameter("hostName");
String query = "SELECT ip, os FROM servers WHERE host_name = ?";

// The user-supplied value is bound to the placeholder, never concatenated into the SQL.
PreparedStatement pstmt = connection.prepareStatement(query);
pstmt.setString(1, hostName);
ResultSet results = pstmt.executeQuery();

Input validation

There are times when prepared statements cannot protect you. Dynamic table names, or inputs that control the ordering of a sort, cannot be prepared in advance. This requires validating the input itself. In such a case, the application checks the input against a defined list of valid table names, or against ASC and DESC, to see whether the query can be safely executed. Validation can also be performed to ensure that zip codes are five integers, strings have no spaces, and lengths are specific or within bounds.
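The following sketch shows what that allowlist-style validation might look like for sort inputs and zip codes. The column names are hypothetical; the point is that anything interpolated into SQL, rather than bound as a parameter, is checked against a closed set.

import java.util.Set;
import java.util.regex.Pattern;

public class InputValidator {
    // Only these identifiers may ever be interpolated into ORDER BY clauses.
    private static final Set<String> SORTABLE_COLUMNS = Set.of("host_name", "ip", "os");
    private static final Set<String> SORT_DIRECTIONS = Set.of("ASC", "DESC");
    private static final Pattern ZIP_CODE = Pattern.compile("\\d{5}");

    public static String validateSort(String column, String direction) {
        String dir = direction == null ? "" : direction.toUpperCase();
        if (!SORTABLE_COLUMNS.contains(column) || !SORT_DIRECTIONS.contains(dir)) {
            throw new IllegalArgumentException("Rejected sort input");
        }
        return column + " " + dir;
    }

    public static boolean isValidZip(String zip) {
        return zip != null && ZIP_CODE.matcher(zip).matches();
    }
}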

Harm reduction

There are other mitigations that you can perform outside of the application that reduce the impact of SQL injection for cases in which precautions are unable to protect the application inputs. You cannot guarantee that SQL injection will not occur, so it is best to establish defense in depth. This includes patching of database binaries in order to reduce the number of exploitable bugs. Eliminating unnecessary stored code and unnecessary privileges from the database users used by the application is critical also. Additionally, giving each application its own database user can reduce the damage a hijacked account can cause.

Monitoring

As with almost any other malfunction or dysfunction, instrumentation and monitoring of that data is crucial to mitigation. It is also critical to log and analyze dumps, stack traces, and query log patterns for UNIONs, semicolons, and other strings that are indicative of SQL injection.

SQL injection is very easy to prevent, and yet it is one of the most common ways to exploit databases. A consistent and ongoing platform of education and collaboration, as well as shared libraries that have been approved by InfoSec and DBRE, is a must for large, growing software engineering teams to ensure safety as well as velocity.

Network and Authentication Protocols

There are a number of ways to attack a database through the various communications protocols available to a malicious actor. If there is a bug in the network protocol, the exploiting user can potentially gain direct server access. This was done in the "hello" bug (CAN-2002-1123), which was in session setup code on TCP port 1433. There are also exploits that can occur after authentication over the network, gaining OS or database access, or elevated privileges. Similarly, database protocols can be rife with vulnerabilities. Some servers actually allow unencrypted communications, which can allow credentials to be stolen and used. Other times, bugs can allow a user to send signals of authentication without actual credentials.

The mitigation techniques discussed at the beginning of this section are all critical to reducing attack vectors in your database infrastructure. Using those techniques and building a deep understanding of authentication protocols and database features in your engineering organization can ensure that your configurations are as secure as possible.

Now that we've given an overview of the potential attack vectors and their mitigation strategies, let's discuss the protection of data in case intrusion, elevation of privileges, or even full access to databases and servers is gained. The natural place to transition to in this conversation is encryption.

Encryption of Data

There are inevitably going to be times when someone malicious, someone who is unwittingly trespassing, or even someone friendly will be able to access data that they shouldn't. You've already locked down all your known network and OS paths, you've minimized access and privileges for every user, you've patched every known bug, and you've blocked all known application holes that could be exploited. Yet, you still need to prepare for the inevitable.

Encryption is the process of transforming data using an agreed-upon set of keys or secrets. Theoretically, only people who have these keys should be able to transform the encrypted data back into a usable format. Encryption is often a last form of defense for data, even after it has been stolen and is in the hands of malicious actors.

In this section, we discuss encryption at three different layers:

• Data in transit
• Data at rest within the database
• Data at rest on the filesystem

Each of these is a potential attack vector that can be secured via encryption. And such encryption methods bring overhead and cost in various forms, requiring consideration and trade-offs in most organizations.

In each of these layers, we must take into account another dimension of data, which is the type of data. Data can be considered sensitive or not, and that sensitivity can be broken out into the categories described in the subsections that follow. If you are storing and managing any of this kind of data, the DBRE team should be working very closely with the InfoSec and SWE teams to ensure that all parties understand the obligations and standards and are building the self-service platforms, libraries, and monitoring required to meet those regulations. Data comes in many different forms, and we will review some of the more common forms of regulated data that you can manage.

Financial Data

This includes account numbers and associated data around accounts that can be used for authentication or identification. It also includes transaction histories and data that reveals the financial status of an individual or organization, such as credit scores, financial reports, balances, and more. Financial data in the United States alone is regulated by numerous laws, bodies, and standards, including PCI DSS requirements for credit card data, GLBA, SOX/J-SOX, NCUA, data privacy and data residency laws, and the Patriot Act. We can find similar bodies in other countries as well.

Personal Health Data

Information about patients and their health is covered within this category. It includes personally identifiable information such as social security numbers, names, and contact information, as well as data regarding patients' health, their treatments and procedures, and their insurance information. Health data in the US is regulated predominantly by the Health Insurance Portability and Accountability Act of 1996 (HIPAA).

Private Individual Data

This is often referred to as Personally Identifiable Information (PII). It includes social security numbers, addresses, phone numbers, and emails. This information can be used for identity theft, harassment, and unwanted contact. The Privacy Act of 1974 was the basis for the US standards currently in place for this data.

Military or Government Data

Any data related to government operations or personnel is considered very sensitive. Information regarding military operations and personnel is more so. There are very strict procedures in place for any organization supporting and storing this.

Confidential/Sensitive Business Data

This includes any data that must be kept secret to protect a business's competitiveness. It includes intellectual property (IP), trade secrets, financials, and performance/activity reporting. Customer and sales information falls into this, also.

Understanding the nature of the data being stored in individual datastores is critical to being able to make appropriate choices around encryption and protections. This is also why it is so important that organizational collaboration and education occurs. Otherwise, in fast-moving organizations, it would be easy for an unwitting engineer to begin putting sensitive data into datastores that have not been locked down for such.

There are also some basic standards that should simply always be applied. Even though we feel embarrassed mentioning these, the aforementioned MongoDB case and its defaults should tell us just how important these reminders are. These include the following:

• Web administrative interfaces speaking directly to databases should always use SSL or a secure proxy service. Transport Layer Security (TLS) is the successor to SSL 3.0, and most people refer to both as SSL; we use SSL to refer to both here. TLS 1.0 suffers from vulnerabilities and should be considered inadequate.
• You should use SSH2 or remote desktop protocol (RDP) for connecting to servers.
• Administrative connections to databases should use a separate administrative network and should use TLS 1.1 or 1.2 if the database allows it.
• Each SSL protocol should use an appropriately strong encryption cipher. The strength of an encrypted session comes from the cipher negotiated between the server and client.

Let's first look at the encryption of data in transit.

Data in Transit

Data must be moved across networks. This is an inevitable and sad fact for the DBRE looking to secure her data. Just as money must be moved in armored trucks, or valuable goods must be transported for delivery, so must data be transported. This is a highly vulnerable time, and depending on where the data is in the transit process, this vulnerability becomes greater. This is also called "data in flight."

Before digging in further, it's worthwhile to ensure that any DBRE worth his or her salt (really hoping you get that joke...) understands the various components and best practices in a cipher suite.

Anatomy of a cipher suite

Each database server will communicate with clients over a negotiated cipher suite. It is important to understand the implications of a particular database's implementation of a cipher suite to ensure that data is protected at the levels necessary. Your InfoSec team should be able to set standards, but if this is left to you, understanding the implications of a specific database's implementation is critical. Here's an example of a cipher suite:

ECDHE-ECDSA-AES128-GCM-SHA256

The first part of the suite, ECDHE, is the key exchange algorithm. This particular example uses the elliptic curve version of the key exchange with ephemeral keys. Other values here could be RSA, DH, and DHE. Ephemeral key exchanges are based on Diffie-Hellman and use per-session, temporary keys during the initial TLS handshake. They provide perfect forward secrecy (PFS), which means a compromise of the server's long-term signing key does not compromise the confidentiality of past sessions. When the server uses an ephemeral key, the server signs the temporary key with its long-term key (the long-term key is the customary key available in its certificate).1 DHE is considered stronger than ECDHE and should be favored.

1 Check out the Transport Layer Protection Cheat Sheet.

The next part, ECDSA, is the signature algorithm that is used to sign the key-exchange parameters. RSA is preferred here over DSA or DSS, which can be very weak depending on the signing entropy source.

Next, AES128 refers to the cipher in the suite. In this case, it is Advanced Encryption Standard (AES) with a 128-bit key. This is followed by the mode of operation for the cipher, in this case Galois/Counter Mode (GCM), which provides authenticated encryption. GCM supports only AES, Camellia, or Aria, and thus those ciphers are ideal. For AES, the National Institute of Standards and Technology (NIST) selected three members, each with a block size of 128 bits and three different key lengths (128, 192, and 256 bits), as the standard.

Finally, SHA256 refers to the keyed message authentication code (MAC) function. SHA-256 is a hashed MAC (HMAC) function used by certification authorities to sign certificates and certificate revocation lists (CRLs). You use this algorithm to create the master secret. Recipients of messages use this to verify that the contents are correct. After a side has sent its finished message and received and validated the finished message from its peer, it can begin to send and receive application data over the connection. SHA2 is a preferred implementation of this algorithm. It includes four kinds of hash functions: SHA224, SHA256, SHA384, and SHA512.

When evaluating a database's implementation of SSL, understanding the cipher list is important, because this list shows the order in which ciphers are scanned until a matching cipher that is available to both the client and server is found.

When evaluating the needs of communications encryption, you must consider not only the types of data discussed previously but also the transit paths and boundaries, specifically:

• Communications within the network
• Communications outside of the network

Each of these transport areas requires consideration and an agreed upon set of assumptions. Each of those assumptions creates a set of requirements, and each of those requirements creates a need for implementation.
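Tying the cipher-suite anatomy back to code, the sketch below shows one way a Java client could restrict the suites it is willing to negotiate, preferring ephemeral ECDHE with AES-GCM. Which suites are actually available depends on the JVM and on what the database driver exposes, so treat this as an assumption-laden illustration rather than a universal recipe.

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;
import java.util.Arrays;

public class CipherSuitePolicy {
    public static SSLParameters restricted() throws Exception {
        SSLContext context = SSLContext.getInstance("TLSv1.2");
        context.init(null, null, null); // default key/trust material and randomness

        SSLParameters params = context.getDefaultSSLParameters();
        // Prefer ephemeral ECDHE key exchange with AES-GCM, per the anatomy above.
        params.setCipherSuites(new String[] {
                "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
                "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
        });
        params.setProtocols(new String[] { "TLSv1.2" });
        return params;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(Arrays.toString(restricted().getCipherSuites()));
    }
}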

Communication within the network

Communications within a secured subnet are generally assumed to be secured at the network layer, and thus most regulations do not require the communications channels (aka the network connection itself) to be further protected. This means that application servers requesting or sending data, replication between servers, and other internal network communications do not need to set up encrypted communications in a secured network. This is good, as this is the majority of the activity a database will be engaged in, and encryption is quite CPU intensive.

That being said, if the database is storing sensitive data, that data is still required to be encrypted somehow. This is ideally done at the application layer when the data is being placed into the database itself. We will discuss this further when we discuss data at rest. If sensitive data within a database cannot be secured, whether due to legacy reasons or other constraints, the organization should consider requiring encrypted communications for anything connecting to the database.

Communications outside of the network

For cases in which you are connecting between two networks owned and managed by your organization, or between your network and the internet, you must use a virtual private network (VPN) using IPSec or SSL when transporting any data, sensitive or not.

Similarly, internet clients should be communicating with load balancers via SSL/TLS for most communications. This does add CPU overhead to clients and load balancers, but it should be relatively rare that you want data being transferred as clear text between your clients and you.

With an understanding of the needs for SSL, let's look at the architectural choices that go into making this happen.

Establishing secure data connections

Modern database systems typically have SSL support of varying degrees. There are exceptions, such as Redis, and it is important when determining whether a datastore fits your needs that you verify this. Sensitive data that is not encrypted should not be cached!

It is worth pointing out that SSL overhead is generally quite minimal, despite the myths that persist. The majority of the computational overhead occurs on initiation of the connection, and it rarely exceeds 2% CPU overhead and an increase of 5 milliseconds of latency. Some ciphers, such as AES, have instructions built into most modern CPUs, substantially increasing speed compared to software-based ciphers.

There is a layered set of approaches that you can apply to securing connections.

Basic connection encryption. At the most basic of levels, you first configure a database server to require secure communications for all connections. With this configuration comes the creation of a certificate authority (CA) certificate. That certificate is used to sign a server public key certificate and a server private key. That same certificate is also used on clients to generate client public key certificates and private keys. With clients storing their keys and servers configured appropriately, all connections are considered encrypted.

Best practices and common sense indicate that keys must not be stored where people can easily hijack and use them. This means that dynamic configuration might be used to directly load keys into an application's memory rather than storing them directly on a client's filesystem. There are better ways, though.
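As a hedged sketch of the basic approach just described, here is what a client connection might look like with the PostgreSQL JDBC driver, which can verify the server certificate against the CA certificate distributed to clients. The hostname, database, account, and file paths are hypothetical, and property names differ across drivers.

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class SecureConnectionExample {
    public static Connection connect() throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "app_user");               // hypothetical account
        props.setProperty("password", "change_me");          // load from a secrets service in practice
        props.setProperty("ssl", "true");
        props.setProperty("sslmode", "verify-full");         // verify certificate and hostname
        props.setProperty("sslrootcert", "/etc/ssl/ca.crt"); // CA certificate used to sign the server cert
        return DriverManager.getConnection(
                "jdbc:postgresql://db.example.internal:5432/inventory", props);
    }
}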

Securely stored secrets. Although securing a connection via SSL is a crucial first step, there are still vulnerabilities. After all, if an actor is able to acquire access to the client hosts, connections using those keys can be utilized to query data. Using a key management infrastructure or a secure secrets service such as HashiCorp's Vault, Amazon's Key Management Service (KMS), or any number of other solutions allows for separation of key storage and management from those accessing the data.

Additionally, other pieces of information used to access the database, such as username, password, IPs, and ports, can be stored remotely in some of these services, ensuring that no credentials are stored on the filesystem where someone can acquire them and use them outside of the application context.

Dynamically built database users. Building on the previous two stages, the natural evolution is using the same secure secrets service, such as Vault, to dynamically create ephemeral user accounts within the datastore. This allows an application host to register and request a user account that will be created at that moment. Using roles such as read-only facilitates various permissions being automatically applied, and these users can have limited lifetimes, ensuring that any access potentially hijacked lasts a limited time. Additionally, you now have the ability to map users to specific application hosts. This allows for auditing of queries and access that would have proved challenging in an environment that utilized shared usernames for a pool of servers.

Using a combination of SSL and VPN technology, all communication paths should be able to be encrypted based on the needs of the data being stored. Additionally, we can take advantage of secret management services to reduce the attack vectors of configuration files and keys sitting on filesystems for anyone with enough OS privilege to read and abuse. This is the concept of protecting data in flight. Now, let's move on to reviewing data at rest. We begin with data in the database itself.

Data in the Database

Also known as "data in use," data in the database must be accessible to applications, analysts, and consumer processes. This means that any encryption solution must allow for accessing of data by those authenticated to do so while preventing anyone who wishes to maliciously access data from doing so. If a user manages to authenticate into a database, he is generally able to read any data for which he has appropriate database read permissions.

A potential attacker can be broken into one of three categories:

• Intruder
• Insider
• Administrator

Intruders gain access to a database or its server to extract valuable information. Insiders belong to trusted groups with privileges in the database or OS and attempt to get information beyond their granted privileges. Administrators are people who have administration-level privileges at the database and/or OS level and use those to get valuable information.

Much of the data listed at the beginning of this section requires even further encryption to ensure that only appropriate users can read it. Let's review the options available to us and their features and potential drawbacks. We'd also like to remind you that encryption standards and best practices hold here just as they do in SSL encryption, as we discussed earlier.

Application-level security

In this approach, specific tables or columns are identified as requiring encryption during threat modeling. Utilizing encryption libraries, the application encrypts data before submitting it to the database for storage. The data is then submitted just as any other value would be. Data is also retrieved similarly, and the application knows to decrypt it before presenting it for use. This encryption and decryption can be done with libraries such as Bouncy Castle and OpenSSL.

Using libraries at the application level provides database portability. Even if the backend changes, you can still perform encryption and decryption just as before. It also allows for control of encryption libraries. InfoSec can place these shared libraries in version control for anyone to use, and there is no further need for compliance auditing of that code because it is approved and used across the organization. Finally, this method allows for selective encryption, leaving other columns and tables unencrypted for easier reporting, indexing, and querying.

The primary disadvantage of this approach is that it is not a blanket approach that is applied to all data in the database. Thus, when new data is modeled and introduced into the application, developers must remember to consider whether this new piece of data requires encryption and then to actually implement it. This approach also requires all other clients in the data pipeline who need to read this data to utilize decryption libraries. Application-level encryption provides the most flexibility at the cost of development velocity.
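A minimal sketch of this pattern, using the JDK's built-in javax.crypto rather than an external library, follows. It encrypts a single column value with AES-GCM before the value is handed to the data access layer; key generation, rotation, and storage are deliberately out of scope here and belong in the key management infrastructure discussed earlier.

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;

public class ColumnCrypto {
    private static final int GCM_TAG_BITS = 128;
    private static final int IV_BYTES = 12;

    private final SecretKey key;
    private final SecureRandom random = new SecureRandom();

    public ColumnCrypto(SecretKey key) { this.key = key; }

    // Encrypts a sensitive column value; the random IV is prepended to the ciphertext.
    public String encrypt(String plaintext) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(GCM_TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return Base64.getEncoder().encodeToString(out);
    }

    // Reverses encrypt(); callers decrypt only after retrieving the value from the database.
    public String decrypt(String encoded) throws Exception {
        byte[] in = Base64.getDecoder().decode(encoded);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(GCM_TAG_BITS, in, 0, IV_BYTES));
        byte[] plaintext = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
        return new String(plaintext, StandardCharsets.UTF_8);
    }
}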

Database plug-in encryption

Plug-in encryption utilizes an encryption package installed in the database itself. This method is independent of the application, requiring less custom coding. Depending on the plug-in, selective encryption at the column level, access control features, and auditing of access are often features also.

Encrypting the database completely with one key is generally not recommended because a user can potentially exploit vulnerabilities and gain access greater than necessary to read data anywhere using that key. For example, an internal user with encryption key access could gain access to an elevated user and access data beyond her security group. Encrypting tables from different security groups using different keys ensures that users can decrypt only those objects within their security group. This means that any plug-ins you look to utilize should have selective encryption and access control features to be considered effective.

Unlike application-level encryption, this does create portability issues between databases. Thus, if you are working in a startup or an environment with frequently changing requirements, this can prove to be too inflexible a solution.

Transparent database encryption

There are some security appliances that will encrypt/decrypt all communications through the database. This is a relatively easy approach that can make it easier to ensure that data is encrypted, but it does enforce the overhead of encryption for all data. That being said, the universality and low-level approach minimizes this overhead.

Query performance considerations

Even though encrypting data is a relatively trivial activity, querying that data can prove to be nonperformant and will affect schema and query design. Data encryption at the column or table level does not easily support range queries or string searches. Thus, queries must take into consideration how data will be filtered and sorted. Most encryption functions do not preserve order, so you cannot use a B-tree index, which is the standard structure for indexing ranges, on encrypted data. You can often use unencrypted fields for filtering in a performant manner. For instance, you can use date range filters to reduce the dataset you need to scan for an encrypted value.

To support more efficient querying of encrypted data, you can store a keyed-hash message authentication code (HMAC) of an encrypted field in your schema, and you can supply a key for the hash function. Subsequent queries of protected fields that contain the HMAC of the data being sought would not disclose the plain-text values in the query. This allows the database to perform a query against the encrypted data in your database without disclosing the plain-text values in the query. This also protects against data manipulation in the database by users who do not have the data needed to create an HMAC.

With indexes on these hashed fields, you get equality performance, but information about the frequency and cardinality of indexed values is revealed. Similarly, a bad actor might be able to infer information about an encrypted database value by its position in an index or even search for other occurrences of the hash. With long-term access, information about the data can be gained by observing and analyzing changes over time. For instance, after data is inserted, a knowledgeable user could infer potential values by events and position in the index.

Thus, depending on the value of the data, you can potentially add obfuscation techniques to reduce the chance that an observer can infer linkage or values based on the hash, its relation to other hashes, and the values of new inserts. This can include adding dummy data with every insert, or batching inserts to prevent inference from incremental, atomic inserts.
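A minimal sketch of the HMAC approach follows: the application stores the HMAC alongside the ciphertext in an indexed column, and equality lookups query the HMAC rather than the plain text. The column name in the usage comment is hypothetical, and the lowercasing shown here is an assumption that depends on your matching requirements.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BlindIndex {
    // Deterministic HMAC-SHA256 of the plaintext value, suitable for an indexed
    // equality-lookup column stored next to the ciphertext.
    public static String hmac(byte[] hmacKey, String value) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(hmacKey, "HmacSHA256"));
        byte[] digest = mac.doFinal(value.toLowerCase().getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest);
    }
    // Usage: SELECT ... FROM users WHERE email_hmac = ?  -- bind hmac(key, email)
}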

Although this is a very high-level overview of schema considerations for performance and security, we felt it important to bring up the concerns and a few sample approaches to mitigation so that database encryption is not considered to be a trivial matter during the planning phase.

Data that is stored within the database still ultimately sits on the filesystem. Similarly, logs, data dumps, and backups are all places on the filesystem that must be considered when protecting data. So, let's look at the encryption of data at rest from the filesystem level.

Data in the Filesystem

Through encryption of data in flight and data in use, we've provided significant levels of protection. Still, there is an opportunity for data to be accessed directly from disk, tape, or other media. Like all other mitigation techniques, there are multiple approaches to this. When considering a solution, it is important to consider how much data is stored, the CPU and latency impacts on reads and writes, and how often the data must be accessed.

Attackers can use direct or indirect attack strategies against data in the filesystem. In direct storage attacks, the bad actor accesses database files directly, outside of the database software. This can include copying datafiles off of the server over the network, physically removing storage devices, or getting data from the backup infrastructure. With indirect attacks, the bad actor can get schema information, log data, and metadata from files used by the database.

In addition to standard network and access-control strategies, encryption of filesystem data is required to ensure that anyone bypassing those steps can't access our tasty data. When considering storage encryption, there are multiple layers that must be considered: the data sitting on a filesystem, the filesystem itself, and the device.

Data encryption above the filesystem

When data is laid down on a filesystem, it can be encrypted automatically. This has similar considerations as discussed in "Data in the Database" on page 174. This generally would apply to data being uploaded to a filesystem, such as a backup or a datafile for importing. By encrypting at this layer, you can always know the encryption status of critical files.

Additionally, this data can be broken into chunks for distribution across multiple storage devices. This is a great option for sensitive data backups and large data dumps, as it prevents a person accessing one storage device from getting a full dataset. Such chunking also can allow for parallelization of read and write operations, allowing for faster recovery. The trade-off is typically going to be the development and maintenance time of such a storage gateway.

Filesystem encryption

Because most datastores create their own files for metadata, logs, and data storage, a system must be in place for encrypting at the filesystem layer. You can do this on top of the filesystem with a stacked encrypted filesystem, directly in the filesystem via built-in encryption mechanisms, or below the filesystem at the block layer.

Stacking an encrypted filesystem on top of an existing filesystem allows for use of any filesystem underneath. This is a quite flexible option because you can use it for specific directories rather than encrypting the entire volume. Examples of Linux-based options include eCryptfs and EncFS. These do require keys to be provided manually or via a Key Management Interface (KMI). Many filesystems, such as ZFS, also have built-in encryption options, though it is critical to see whether they expose unencrypted metadata.

Block-level encryption systems operate below the filesystem, encrypting one disk block at a time. Some options for this in Linux include Loop-AES, dm-crypt, and VeraCrypt. Each of these operates below the filesystem layer using kernel-space device drivers. These tools are useful when you want all data written to a volume to be encrypted regardless of what directory the data is stored in.

All of these solutions do have performance impacts that must be weighed against security requirements. It makes sense to put logs, metadata, and other similar files on encrypted filesystems. But what about data files being accessed from the database itself? Many users find that application-level or column-level encryption of key data gives the necessary encryption within the database. This allows the database files themselves to stay unencrypted for performance reasons. Combining this solution with filesystem encryption of logs, metadata files, and other system-level files creates an effective multilayered solution that does not compromise on performance.

Device-level encryption

You can also utilize storage media that has built-in encryption. This increases storage costs and is of questionable value given that there are many known vulnerabilities, but this layer does add to a defense-in-depth approach to data security.

As you can see by the broad level of discussion, data encryption merits significant engineering cycles for design, implementation, and auditing. In this section, we discussed the protection via encryption of data in transit, data in the database, and data at rest on filesystems. Data encryption is a last bastion of safety to provide protection when access controls, code hardening, and regular patching fail. Recognizing the criticality of this is also a recognition that every layer in security is vulnerable and that the protection of data must be considered at every level to create a reasonable amount of protection. When continuing along the path of encryption, regardless of which level, you can always be sure that you need to consider the following checklist:

• Has all data been classified according to sensitivity?
• Is there a standard on cipher suites, and is it audited for compliance?
• Are new reports of vulnerabilities and exploits being tracked and considered?
• Are new hires in SWE, SRE, Ops, and DBRE aware of centralized libraries and encryption standards?
• Are keys being managed effectively, including rotation, removal, and testing?
• Are you performing regular, automated penetration testing of key components, including logs, backups, critical tables, and database connections?

Like any automated and manual testing, it is important to recognize that you will not be able to test everything. This is why focusing on high risk and easy exploitability allows for focus, prioritization, and tight feedback loops in a continuous process of testing and improvement.

Wrapping Up

You now have a deeper understanding not only of the nuanced layers of database security, but also of how to be an effective security champion in your organization. As with much in this book, the DBRE cannot be solely responsible for this function. That DBA thinking simply doesn't work in high-velocity, dynamic environments that require the DBRE mentality. Instead, you should be actively partnering with every group mentioned in this chapter to provide your own depth of knowledge to the creation of self-service platforms, shared libraries, and team processes.

In this, as in the preceding eight chapters, we have strived to create a solid foundation of not only operations, but also of effective collaboration and support of the other technical organizations that rely on you as a DBRE. Now, we will focus on helping you understand the wide range of database persistence options out there and how they implement key technologies to provide resilient, scalable, and performant data storage and retrieval. Throughout the next section, we will be referring back to the foundation laid out up to now.


CHAPTER 10 Data Storage, Indexing, and Replication

We've been talking about operations for much of this book in preparation for diving into datastores. The most critical thing every datastore has in common with the others is that they...wait for it...store data. In this chapter, we explain the ways a single node structures its data storage, how large datasets are partitioned, and how nodes replicate data between one another. It's going to be quite the chapter!

This book's scope is focused predominantly on reliability and operations, so we will be working on understanding storage and access patterns to facilitate infrastructure choices, to understand performance characteristics, and to make sure that you, as the database reliability engineer (DBRE), have the information required to help engineering teams choose the appropriate datastores for their services. For a much more detailed and nuanced review of this, we strongly suggest that you read Martin Kleppmann's book Designing Data-Intensive Applications (O'Reilly).

Data Structure Storage

Databases traditionally have stored data in a combination of tables and indexes. A table is the main storage mechanism, and an index is an optimized subset of data ordered to improve access times. With the proliferation of datastores now, this has evolved significantly. Understanding how data is written to and read from storage is crucial to being able to configure and optimize your storage subsystems and databases.

When understanding how a database stores data, you actually need to evaluate not only how the raw data is stored, but also how it is retrieved. In large datasets, accessing specific subsets of data at any reasonable level of latency will often require specialized storage structures, called indexes, to accelerate the finding and retrieval of that data. Thus, when looking at storage, we must take into account the storage and Input/Output (I/O) requirements for putting data onto disk and into indexes, as well as the I/O requirements for retrieving that data.

Database Row Storage

Much of the material here is applicable to more traditional relational systems. We will begin with this, and then we'll discuss some of the more prevalent alternative storage options.

In relational databases, data is stored in containers called blocks or pages that correspond to a specific number of bytes on disk. Different databases will use either blocks or pages in their terminology; Oracle Database, for example, stores data in data blocks, and a page or block is a fixed size, just like blocks on disks. In this book, we use blocks to refer to both. Blocks are the finest level of granularity for storing records and are the smallest unit that can be read or written to access data. This means that if a row is 1 K and the block size is 16 K, you will still incur a 16 K read operation. If a database block size is smaller than the filesystem block size, you will be wasting I/O for operations that require multiple pages. This can be visualized in Figure 10-1.

Figure 10-1. Aligned versus nonaligned block/stripe configurations.
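To make this arithmetic concrete, the following short Python sketch (with sizes that are purely illustrative assumptions, not recommendations) shows how row size, database block size, and filesystem block size interact:

    # Minimal sketch: I/O incurred to read a single row, given block sizes.
    # The sizes below are illustrative assumptions, not tuning advice.
    import math

    ROW_SIZE = 1 * 1024          # 1 KB row
    DB_BLOCK_SIZE = 16 * 1024    # 16 KB database block
    FS_BLOCK_SIZE = 4 * 1024     # 4 KB filesystem block

    # Reading one 1 KB row still costs a full database block.
    bytes_read = DB_BLOCK_SIZE

    # A database block maps onto one or more filesystem blocks;
    # misalignment would add one more filesystem block to the count.
    fs_blocks_touched = math.ceil(DB_BLOCK_SIZE / FS_BLOCK_SIZE)

    print(f"Reading a {ROW_SIZE}-byte row costs {bytes_read} bytes "
          f"and touches {fs_blocks_touched} filesystem blocks")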

A block requires some metadata to be stored, as well, usually in the form of a header and trailer or footer. This will include disk address information, information about the object the block belongs to, and information about the rows and activity that have occurred within that block. In Oracle, as of version 11g, Release 2, block overhead

totals 84 to 107 bytes. In MySQL's InnoDB, as of version 5.7, the header and trailer use 46 bytes. Additionally, each row of data will require its own metadata, including information about columns, links to other blocks that the row is spread across, and a unique identifier for the row.1 Data blocks are often organized into a larger container called an extent. For efficiency reasons, an extent is often the allocation unit when new blocks are required within a tablespace. A tablespace is typically the largest data structure, mapped to one or more physical files that can be laid out as required on disk. On systems mapped directly to physical disks, tablespace files can be laid out across different disks to reduce I/O contention. In the paradigms we focus on in this book, such curation of I/O is not necessarily an option. Large, generic RAID structures of stripes and potentially mirrored stripes can maximize I/O without significant time spent on microtuning. Otherwise, assuming rapid recovery and failover are available, focusing on simple volumes or even ephemeral storage provides ease of management and minimal overhead.

B-tree structures Most databases structure their data in a B-tree format. A B-tree is a data structure that self-balances while keeping data sorted; unlike a binary tree, each of its nodes can hold many keys and have many children. The B-tree is optimized for the reading and writing of blocks of data, which is why B-trees are commonly found in databases and filesystems. You can imagine a B-tree table or index as an upside-down tree. There is a root page, which is the start of the index that is built on a key. The key is one or more columns. Most relational database tables are stored on a primary key, which can be explicitly or implicitly defined. For instance, a primary key can be an integer. If an application is looking for data that maps to a specific ID or a range of IDs, this key will be used to find it. In addition to the primary key B-tree, secondary indexes on other columns or sets of columns can be defined. Unlike the primary key B-tree, these indexes store only the data that is indexed rather than the entire row. This means that these indexes are much smaller and can fit in memory much more easily. A B-tree is called a tree because when you navigate through it, you can choose from two or more child pages to get to the data you want. As just discussed, a page contains rows of data and metadata. This metadata includes pointers to the pages below it, also known as child pages. The root page has two or more pages below it, also known as children. A child page, or node, can be an internal node or a leaf node. Internal nodes store pivot keys and child pointers and are used to direct reads through the index to one node or another. Leaf nodes contain the key data. This structure creates a self-balancing tree that can be searched within only a few levels, which

1 Cole, Jeremy, “The physical structure of records in InnoDB”.

allows for only a few disk seeks to find the pointers to the rows that are needed. If the data needed is within the key itself, you don't even need to follow the pointer to the row.

B-tree writes. When inserting data into a B-tree, the correct leaf node is found via search. Nodes are created with room for additional inserts, rather than packing them in. If the node has room, the data is inserted in order in the node. If the node is full, a split has to occur. In a split, a new median is determined and a new node is created. Records are then redistributed accordingly. The data about this median is then inserted into the parent node, which can cause additional splits all the way up to the root node. Updates and deletes also begin with finding the correct leaf node via search, followed by the update or delete. Updates can cause splits if they increase data size to the point where it overflows a node. Deletes can cause rebalancing, as well. Greenfield (new) databases begin with primarily sequential writes and reads. This shows up as low-latency writes and reads. As the database grows, splits will cause I/O to become random, which results in longer-latency reads and writes. This is why we must insist on realistic datasets during testing: it ensures that long-term performance characteristics are exhibited rather than these naive, early ones. Single-row writes require a page to be completely rewritten, at minimum. If there are splits, there can be many pages that must be written. This complex operation requires atomicity yet allows opportunities for corruption and orphaned pages if there is a crash. When evaluating a datastore, it is vital to understand what mechanisms are put in place to prevent this. Examples of such mechanisms include the following:

• Logs of write operations that are written to before the more complex operations of writing to disk, also known as Write Ahead Logs (WAL) • Event logs for reconstruction • Redo logs with before and after images of the mutated data
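Returning to the split behavior just described, here is a deliberately simplified Python sketch of a leaf-node insert. Real B-tree implementations also maintain parent and sibling pointers and crash-safe logging, all of which are omitted here, and the fan-out shown is an assumption chosen for readability:

    import bisect

    MAX_KEYS = 4  # illustrative fan-out; real nodes hold hundreds of keys

    class LeafNode:
        def __init__(self, keys=None):
            self.keys = keys or []

        def insert(self, key):
            """Insert in sorted order; split when the node overflows."""
            bisect.insort(self.keys, key)
            if len(self.keys) > MAX_KEYS:
                return self.split()
            return None

        def split(self):
            """Create a new node around the median and return
            (median_key, new_node) for the parent to record."""
            mid = len(self.keys) // 2
            median = self.keys[mid]
            right = LeafNode(self.keys[mid:])
            self.keys = self.keys[:mid]
            return median, right

    node = LeafNode()
    for k in [10, 20, 5, 30, 25]:
        promoted = node.insert(k)
        if promoted:
            print("split: promote", promoted[0], "to parent")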

With all of this in mind, a crucial variable in configuring your databases for underlying storage is the database block size. We've discussed the importance of aligning database block sizes with the underlying disk block sizes, but that is not enough. If you are using Solid-State Drives (SSDs), for instance, you might find that smaller block sizes provide much better performance while traversing B-trees. An SSD can experience a 30% to 40% latency penalty on larger blocks compared with performance on Hard Disk Drives (HDDs). Because both reads and writes are required in B-tree structures, this must be taken into account. The following is a summary of the attributes and benefits of B-trees:

• Excellent performance for range-based queries.
• Not the ideal model for single-row lookups.
• Keys exist in sorted order for efficient key lookups and range scans.
• Structure minimizes page reads for large datasets.
• By not packing keys into each page, deletes and inserts are efficient, with only occasional splits and merges being needed.
• B-trees perform much better if the entire structure can fit within memory.

When indexing data, there are other options as well. The most predominant one is the hash index. As we mentioned earlier, the B-tree tends to be fairly ubiquitous in relational databases. If you've worked in those environments, you've probably worked with them already. There are other options for data storage, however, and they are moving from experimental to mature. Let's look next at append-only log structures. Sorted-String Tables and Log-Structured Merge Trees BigTable, Cassandra, RocksDB (which is available in MySQL via MyRocks and in MongoDB), and LevelDB are all examples of databases that use sorted-string tables (SSTs) for primary storage. The terms SSTable and Memtable originally appeared in the Google BigTable paper, which has been a source of inspiration for a number of database management systems (DBMSs) since then. In an SST storage engine, there are a number of files, each with a set of sorted key–value pairs inside. Unlike in the block storage discussed earlier, there is no need for metadata overhead at the block or row level. Keys and their values are opaque to the DBMS and stored as arbitrary binary large objects (BLOBs). Because they are stored in sorted order, they can be read sequentially and treated as an index on the key by which they are sorted. There is an algorithm that combines in-memory tables, batch flushing, and periodic compaction in SST storage engines. This algorithm is referred to as a log-structured merge (LSM) tree architecture (see Figure 10-2). This was described by Patrick O'Neil in his paper.

Figure 10-2. Log-structured merge tree structure with a bloom filter

With an LSM, SSTs are written to by periodic flushes of data that has been stored in memory. After data is flushed, sorted, and written to disk, it is immutable. Items cannot be added or removed from the map of key–value pairs. This is effective for read-only datasets because you can map an SST into memory for rapid access. Even if the SST does not fully fit in memory, random reads require a minimal number of disk seeks. To support fast writes, more is required. In contrast to writes on disk, writes to a dataset in memory are trivial because you are just changing pointers. An in-memory table can take writes and remain balanced. This is also referred to as a memtable. The memtable can also act as the first query point for reads before falling back to the newest SST on disk, followed by the next oldest, and then the next, until the data is found. After a certain threshold is reached, which can be time, number of transactions, or size, the memtable will be sorted and flushed to disk. When deletes are done on data that is already stored in an SST, a logical delete must be recorded. This is also known as a tombstone. Periodically, SSTs are merged together, allowing elimination of tombstones and saving of space. This merge and compaction process can be very I/O intensive and often requires significantly more space than the actual working set. Until operations teams are used to these new capacity models, there might be impacts to availability Service-Level Objectives (SLOs).
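A minimal Python sketch of the write and read paths just described might look like the following. The flush threshold and in-memory structures are simplifying assumptions, and compaction and tombstone handling are omitted entirely:

    class TinyLSM:
        """Toy LSM engine: an in-memory memtable plus immutable sorted runs."""

        def __init__(self, flush_threshold=3):
            self.memtable = {}           # mutable, in memory
            self.sstables = []           # newest first; each run is an immutable sorted list
            self.flush_threshold = flush_threshold

        def put(self, key, value):
            self.memtable[key] = value   # in-memory write is cheap
            if len(self.memtable) >= self.flush_threshold:
                self.flush()

        def flush(self):
            # Sort and persist the memtable as an immutable SSTable run.
            run = sorted(self.memtable.items())
            self.sstables.insert(0, run)
            self.memtable = {}

        def get(self, key):
            # Check the memtable first, then runs from newest to oldest.
            if key in self.memtable:
                return self.memtable[key]
            for run in self.sstables:
                for k, v in run:         # a real engine would binary-search or use an index
                    if k == key:
                        return v
            return None

    db = TinyLSM()
    for i in range(5):
        db.put(f"user:{i}", i)
    print(db.get("user:1"), db.get("user:4"))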

There are data-loss possibilities that must be assumed during failure scenarios. Until memtables are flushed to disk, they are vulnerable to crashes. Unsurprisingly, there are similar solutions in SST storage engines as in B-tree-based ones, including event logs, redo logs, and write-ahead logs.

Bloom filters You might imagine that having to search through a memtable and a large number of SSTables to find a record key that doesn’t exist could be expensive and slow. You’d be right! An implementation detail to assist with this is a bloom filter. A bloom filter is a data structure that you can use to evaluate whether a record key is present in a given set, which, in this case, is an SSTable. A datastore such as Cassandra uses bloom filters to evaluate which SSTable, if any, might contain the record key requested. It is designed for speed, and thus there might be some false positives. But, the overall effect is a significant reduction in read I/O. Inversely, if a bloom filter says that a record key does not exist in an SSTable, that is a certainty. Bloom filters are updated when memtables are flushed to disk. The more memory that can be allocated to the filter, the less likely a false positive will occur.
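The behavior is easier to see in a small sketch. This toy Python bloom filter salts Python's built-in hash to simulate multiple hash functions; real implementations such as Cassandra's use purpose-built hash functions and carefully sized bit arrays:

    class BloomFilter:
        def __init__(self, size_bits=1024, num_hashes=3):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = [False] * size_bits

        def _positions(self, key):
            # Derive several bit positions per key by salting the hash.
            return [hash((i, key)) % self.size for i in range(self.num_hashes)]

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = True

        def might_contain(self, key):
            # False means definitely absent; True means "possibly present".
            return all(self.bits[pos] for pos in self._positions(key))

    sstable_filter = BloomFilter()
    sstable_filter.add("user:42")
    print(sstable_filter.might_contain("user:42"))    # True
    print(sstable_filter.might_contain("user:9999"))  # almost certainly False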

Implementations There are a number of datastores that utilize the LSM structure with SSTables as a storage engine:

• Apache Cassandra • Google Bigtable • HBase • LevelDB • Lucene • Riak • RocksDB • WiredTiger

Implementation details will vary for each datastore, but the proliferation and growing maturity of this storage engine has made it an important storage implementation for any team working with large datasets to understand. While reviewing and enumerating data storage structures, we've mentioned logs multiple times as critical for data durability in the case of failures. We also discuss them in Chapter 7. They are also critical for replicating data in distributed datastores. Let's dig deeper into logs and their usage in replication.

Indexing We have already discussed one of the most ubiquitous of indexing structures, the B-tree. SSTs are also inherently indexed. There are some other index structures that you will find out in the database wild.

Hash indexes One of the simplest index implementations is that of a hash map. A hash map is a collection of buckets that contain the results of a hash function applied to a key. That hash points to the location where the records can be found. A hash map is only viable for single-key lookups because a range scan would be prohibitively expensive. Additionally, the hash must fit in memory to ensure performance. With these caveats, hash maps provide excellent access for the specific use cases for which they work.
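A sketch of the idea, using a Python dictionary as the hash map and invented record offsets, shows why point lookups are cheap and range scans are not:

    # Toy hash index mapping keys to record offsets in a data file (offsets are illustrative).
    hash_index = {}

    def index_record(key, file_offset):
        hash_index[key] = file_offset

    def lookup(key):
        # O(1) point lookup: hash the key, jump to the bucket, read the offset.
        return hash_index.get(key)

    index_record("user:17", 4096)
    index_record("user:42", 8192)
    print(lookup("user:42"))   # 8192

    # A range scan such as "user:10".."user:50" cannot use the hash at all;
    # it would have to examine every entry, which defeats the purpose.
    in_range = [k for k in hash_index if "user:10" <= k <= "user:50"]
    print(sorted(in_range))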

Bitmap indexes A bitmap index stores its data as bit arrays (bitmaps). When you traverse the index, it is done by performing bitwise logical operations on the bitmaps. In B-trees, the index performs the best on values that are not repeated often. This is also known as high cardinality. The bitmap index functions much better when there are a small number of values being indexed.

Permutations of B-trees There are permutations of the traditional B-tree index. These are often designed for very specific use cases, including the following:
Function Based
An index built on the results of a function applied to the indexed column(s).
Reverse Index
Indexing values from the end of the value to the beginning to allow for reverse sorting.
Clustered Index
Requires the table records to be physically stored in indexed order to optimize I/O access. The leaf nodes of a clustered index contain the data pages.
Spatial Index
There are a number of different mechanisms for indexing spatial data. Standard index types cannot handle spatial queries efficiently.
Search Index
These indexes allow for searching of subsets of data within the columns. Most indexes cannot search within the indexed value. There are some indexes

designed for this, however, and some entire datastores, such as ElasticSearch, are built for this operation. Each datastore will have its own series of specialized indices that are available, often to optimize the typical use cases within that datastore. Indexes are tremendously critical for rapid access of subsets of data. When evaluating bleeding-edge datastores, it is crucial to understand their limitations in terms of indexes, such as the ability to have more than one index, how many columns can be indexed, or even how those indexes are maintained in the background. Logs and Databases Logs began as a way to maintain durability in database systems. They evolved into the mechanism used to replicate data from primary to replica servers for availability and scalability reasons. Eventually, services were built to use these logs to migrate data between different database engines with a transformation layer between them. This then evolved into a full messaging system, with logs becoming events that a subscribing service could use to perform discrete pieces of work for downstream services. With so many use cases for logs, we'd like to focus specifically on replication in this chapter. Having discussed how data can be stored and indexed on a local server, we now will move on to how that data can be distributed to other servers. Data Replication For this entire book, we have been working under the assumption that you will be working primarily on distributed datastores. This means that there must be a way for data that is written on one node to be moved around between nodes. There are entire books written on this topic alone,2 so we will focus on well-known and utilized examples rather than more theoretical ones. Our goal for you as a DBRE is for you and the engineers you support to be able to look at the replication methods offered and to understand how they work. Knowing the pros and cons and the patterns and antipatterns that are associated with replication options is crucial for a DBRE, an architect, a software engineer, or an operations person to do their jobs well. There are some high-level distinctions in replication architectures that can be used for an initial enumeration. When discussing replication here, we are referring to leaders as nodes that take writes from applications and followers as nodes that receive replicated events to apply to their own datasets. Finally, readers are nodes that applications read data from.

2 See the book Replication Techniques in Distributed Systems (Advances in Database Systems, 1996).

Single Leader
Data is always sent to one specific leader.
Multiple Leader
There can be multiple nodes with a leader role, and each leader must persist data across the cluster.
No Leader
All nodes are expected to be able to take writes.
We will begin with the simplest replication method, single-leader, and build on from there. Single-Leader As the title implies, in this replication model all writes go to a single leader and are replicated from there. Thus, you have one node out of N nodes that is designated as the leader, and the others are replicas. Data flows from the leader out. This method is widely utilized for its simplicity, and you can make a few guarantees. These guarantees include:

• There will be no consistency conflicts because all writes occur against one node. • Assuming that all operations are deterministic, you can guarantee that they will result in the same outputs on each node.

There are some permutations here, such as one leader replicating to a few relay repli‐ cas that then have their own replicas. Regardless, there is one leader taking writes, which is the key attribute of this architecture. There are a few different approaches to replication in single leader. Each approach trades off some level of consistency, latency, and availability. Thus, the appropriate choice will vary based on the applica‐ tions and how they use database clusters.

Replication models When replicating data in single-leader fashion, there are three different models that you can use:
Asynchronous
Optimizes latency over durability.
Synchronous
Optimizes durability over latency.
Semi-synchronous
Compromises between latency and durability.

In asynchronous replication models, a transaction is written to a log on the leader and then committed and flushed to disk. A separate process is responsible for shipping those logs to the followers, where they are applied as soon as possible. In asynchronous replication models, there is always some lag between what is committed on the leader and what is committed on the followers. Additionally, there is no guarantee that the commit point on one follower is the same as the others. In practice, the time gap between commit points might be too small to notice. It is just as easy to find clusters using asynchronous replication for which there is a time gap of seconds, minutes, or hours between leaders and followers. In synchronous replication models, a transaction that is written to a log on the leader is shipped immediately over the network to the followers. The leader will not commit the transaction until the followers have confirmed that they have recorded the write. This ensures that every node in the cluster is at the same commit point. This means that reads will be consistent regardless of what node they come from, and any node can take over as a leader without risk of data loss if the current leader fails. On the other hand, network latency or degraded nodes can cause write latency for the transaction on the leader. Because synchronous replication can have a significant impact on latency, particularly if there are many nodes, semi-synchronous replication can be put in place as a compromise. In this algorithm, only one node is required to confirm to the leader that it has recorded the write. This reduces the risk of latency impacts when one or more nodes are functioning in degraded states while guaranteeing that at least two nodes in the cluster are at the same commit point. In this mode, there is no longer a guarantee that all nodes in the cluster will return the same data if a read is issued on any reader. There is, however, still a guarantee that you can promote at least one node in the cluster to leader status, if needed, without data loss.
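One way to see the trade-off is to model a commit as waiting for some number of follower acknowledgments before returning to the client. The Python sketch below is a simplification that ignores timeouts, retries, and failover, and the simulated delays are assumptions:

    import random, time

    def replicate_to_follower(follower):
        """Pretend to ship and apply a log entry; returns the time taken."""
        delay = random.uniform(0.001, 0.050)   # simulated network + apply time
        time.sleep(delay)
        return delay

    def commit(followers, required_acks):
        """required_acks = 0 -> asynchronous, 1 -> semi-synchronous,
        len(followers) -> fully synchronous."""
        start = time.monotonic()
        acks = 0
        for follower in followers:
            if acks >= required_acks:
                break                          # stop waiting once enough followers confirm
            replicate_to_follower(follower)
            acks += 1
        return time.monotonic() - start

    followers = ["replica-1", "replica-2", "replica-3"]
    for mode, acks in [("async", 0), ("semi-sync", 1), ("sync", 3)]:
        print(mode, round(commit(followers, acks), 4), "seconds of added write latency")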

Replication log formats To achieve single-leader replication, you must use a log of transactions. There are a number of approaches to how these logs are implemented. Each one has benefits and trade-offs, and many datastores might implement more than one to allow you to choose what works best. Let’s review them here.

Statement-based logs. In statement-based replication, the actual SQL or data write statement used to execute the write is recorded and shipped from the leader to fol‐ lowers. This means that the entire statement will be executed on each follower. Pros:

• A statement can affect hundreds or thousands of records, and that's a lot of data to ship; the statement itself is usually much smaller. This can be optimal when replicating across datacenters where network bandwidth is scarce.

• This approach is very portable. Most SQL statements will result in the same outputs even on different versions of the database. This allows you to upgrade followers before upgrading leaders, which is a critical piece of high-availability approaches to upgrades in production. Without backward-compatible replication, version upgrades can require significant downtime while an entire cluster is upgraded.
• You can also use the log files for auditing because they contain entire statements.

Cons:

• A statement might require significant processing time if it is using aggregation and calculation functions on a selected dataset to determine what it will write. Running the statement can take much longer than simply changing the records or bits on disk. This can cause replication delay in serialized apply processes. • Some statements might not be deterministic and can create different outputs to the dataset if run on different nodes.

MySQL statement-based replication is an example of this.

Deterministic Transactions Deterministic means that the processing output of a statement is not dependent on time and cannot be influenced by external factors. If a statement is run on the same dataset and in the same sequence, regardless of which node it is on, it should create the same output. Examples of nondeterministic statements include the use of local time functions, such as now() or sysdate(), or that use random ordering, such as order by rand(). Similarly, stored code such as user-defined functions, stored procedures, and triggers can cause a statement to be nondeterministic and thus not safe for statement-based replication.

Write-ahead logs. A write-ahead log (WAL), also known as a redo log, contains a ser‐ ies of events, each event mapped to a transaction or write. In the log are all of the bytes required to apply a transaction to disk. In systems, such as PostgreSQL, that use this method, the same log is shipped directly to the followers for application to disk. Pros:

• Very fast as the parsing and execution of the statement has already occurred. All that is left is to apply the changes to disk.

• Not at risk of impacts from nondeterministic SQL.

Cons:

• Can consume significant bandwidth in high-write environments. • Not very portable because the format is closely tied to the database storage engine. This can make it challenging to perform rolling upgrades that allow for minimization of downtime. • Not very auditable.

WALs will often use the same logs built for durability and just bolt on a process for replication. This is what gives us the efficiency of this format, but also its lack of portability and flexibility.

Row-based replication. In row-based replication (also called logical), writes are written to replication logs on the leader as events indicating how individual table rows are changed. Columns with new data are indicated, columns with updated information show before/after images, and deletes of rows are indicated as well. Replicas use this data to directly modify the row rather than needing to execute the original statement. Pros:

• Not at risk of impacts from nondeterministic SQL.
• A compromise on speed between the two previous algorithms. Logical translation to physical is still required, but entire statements do not need to run.
• A compromise on portability between the two previous algorithms. Not very human readable, but can be used for integrations and inspection.

Cons:

• Can consume significant bandwidth in high-write environments. • Not very auditable.

This method has also been called change data capture (CDC). It exists in SQL Server and MySQL and is also used in data warehouse environments.

Block-level replication. So far, we have been speaking about replication methods using native database mechanisms. In contrast, block-device replication is an external approach to the problem. A predominant implementation of this is Distributed Replicated Block Device (DRBD) for Linux. DRBD functions as a layer above block devices and propagates writes not only to the local block device, but also to the replicated block device on another node.

Block-level replication is synchronous and eliminates significant overhead in the replicated write. However, you cannot have a running database instance on the secondary node. So, when a failover occurs, a database instance must be started. If the former master failed without a clean database shutdown, this instance will need to perform recovery just as if the instance had been restarted on the same node. So, what we have with block-level replication is synchronous replication with very low latencies, but we lose the ability to use the replicas for scalability or workload distribution. Happily for us, an external replication method, such as block-level replication, can be combined with native replication, such as statement-based or row-level replication. This can give a combination of zero-data-loss replication along with the flexibility of asynchronous replication.

Other methods. There are other methods for replication that are decoupled from the database logs. Extract, Transform, and Load (ETL) jobs used to move data between services will often look for indicators of new or changed rows, such as IDs or timestamps. With these indicators, they will pull out data for loading elsewhere. Triggers on tables can also load a table with changes for an external process to listen on. These triggers can simply list out IDs for changes or give full change data capture information just like a row-based replication approach will. When evaluating options for replication, you will need a combination of options depending on your source datastore, target datastore, and the infrastructure that exists between the two. We will discuss this more in the next section on replication uses.

Single-leader replication uses At this point in datastore maturity, replication is more often than not a requirement rather than an option. But, there are still a variety of reasons to implement replication that can affect architecture and configuration. In single-leader architectures, most of these can be enumerated as availability, scalability, locality, and portability.

Availability. It goes without saying that if a database leader fails, you want to have the fastest recovery option possible to which to point application traffic. Having a live database with a fully up-to-date copy of data is far preferable to a backup that must be recovered and then rolled forward to the failure point. This means that mean time to recover (MTTR) requirements and data-loss requirements must be kept at the fore‐ front when making choices about replication. Synchronous and semi-synchronous replication gives the best options for no data loss with a low MTTR, but they do affect latency. Finding the elusive trifecta of low MTTR, low latency, and no data loss via replication alone is not possible without some external support, such as a messaging system that you can write to in addition to the datastore to allow for recovery of data that might be lost in a leader failover in an asynchronously replicated environment.

Scalability. A single leader creates a boundary on write I/O, but the followers allow reads to scale based on the number of replicas provided. For read-intensive applications that experience a relatively small amount of writes, multiple replicas do create an opportunity for creating more capacity in the cluster. This capacity is bounded, as replication overhead does not allow for linear scalability. Still, this does create an opportunity for increasing runway. To support scalability, the data on replicas must be recent enough to support business requirements. For some organizations, the replication delay inherent in asynchronously replicated systems is acceptable. However, for other requirements, synchronous replication is absolutely required, regardless of the impact to write latency. Locality. Replication is also a way to keep datasets in various locations that are closer to consumers to minimize latency. If you have customers across countries or even on different coasts, the impact of long-distance queries can be significant. Large datasets are not very portable in their entirety, but incremental application of changes keeps those datasets up to date. As we mentioned previously, long-distance replication over bandwidth-starved networks often requires statement-based replication if compression is not enough to manage row-based or WAL entries. Modern networks and compression often alleviate this. Also, semi-synchronous or synchronous algorithms are generally not feasible with long-distance latencies, leading to the choice of asynchronous replication. Portability. There are a number of opportunities in other datastores for the data residing in your leader. You can use replication logs to push into data warehouses, as events for consumers in a data pipeline, or for transformation into other datastores with more appropriate query and indexing patterns. Utilizing the same replication streams as the replicas that are in place for availability and scale ensures that the datasets streaming from the leader are the same. That being said, more custom solutions such as query-based ETL and trigger-based approaches provide filtering of the appropriate subsets of data rather than the entire transaction stream coming out of the replication logs. These jobs also often have significant leeway in the freshness requirements of data, which allows for choices that have less of an impact on latency than other approaches. Based on these needs, you and your engineering teams should be able to select one or more choices for replication. Regardless of which choices you make, there are a number of challenges that can come up in these replicated environments.

Single leader replication challenges There are a number of opportunities for challenges in any replicated environment. Even though single-leader is the simplest of replicated environments, that does not by any means indicate simplicity or ease. In this section, we walk through the most common of these challenges.

Building replicas. With large datasets, the portability of your data can be reduced significantly. We reviewed this in Chapter 7. As the dataset increases, the MTTR also increases, which can lead to a need for a larger number of replicas or a new backup strategy that can keep the MTTR within acceptable levels. Other options include reducing the dataset size in one group of servers by breaking out one dataset into multiple smaller datasets. This is also called sharding. We discuss this further in Chapter 12. Keeping replicas synchronized. Building a replica is only the first step in a replicated environment. While using asynchronous replication, keeping that replica caught up proves to be its own challenge in environments characterized by frequent or large changes to the dataset. As we discussed in "Replication log formats" on page 191, changes must be logged, logs must be shipped, and changes must be applied. Relational databases, by design, typically translate writes to a linearized series of transactions that must be followed strictly to ensure dataset consistency between replicas and leaders. This generally translates into the need for serialized processes applying one change at a time on the replicas. These serialized apply processes on replicas often are unable to catch up or stay caught up with the leader for a number of reasons, including the following:

• Lack of concurrency and parallelism as compared to the leader. I/O resources are often wasted on the replicas. • Blocks to be read in transactions are not in memory on replicas if read traffic is not common. • When distributing writes to leaders and reads to replicas, read traffic concur‐ rency can affect write latency on replicas.

Regardless of the reason, the end result is often called replica lag. In some environments, replica lag might be an infrequent and ephemeral problem that resolves itself quickly, and within SLOs. In other environments, these issues become pervasive and can lead to replicas being unusable for their original purposes. If this occurs, it is an indication that the workload for your datastore has grown too large and must be redistributed via one or more techniques. We discuss these techniques in more detail in Chapter 12. In brief, they are as follows:
Short Term
Increase capacity on the cluster so that the current workload fits within the cluster's capacity.
Medium Term
Break out functions of the database into their own clusters to guarantee that workload bounds fit within the cluster capacity. Also known as functional partitioning or sharding.
Long Term
Break out your dataset into multiple clusters, allowing you to maintain workload bounds so that they fit within the cluster's capacity. Also known as dataset partitioning or sharding.
Long Term
Choose a database management system whose storage, consistency, and durability characteristics make more sense for your workload and SLOs and that will not have the same scaling problems.
As you can see in the choices described, none of these will work indefinitely for continued growth. In other words, they do not scale linearly with workload. Some, like capacity increases or functional partitioning, have shorter runways than others, like dataset partitioning. But even dataset partitioning will eventually find limits in how far it can be pushed. This means that other things must be evaluated to ensure that the bounds never increase to the point of diminishing returns that render the solution obsolete. If you are experiencing replication lag and must mitigate impacts while a longer-term solution is put in place, there are some short-term tactics that can be employed. These include:

• Preloading active replica datasets into memory to reduce disk IO. • Relaxing durability on the replicas to reduce write latency. • Parallelizing replication based on schemas if there are no transactions that cross schemas.

These are all short-term tactics that can allow for breathing room, but they also have trade-offs in terms of fragility, high maintenance costs, and potential data issues, so they must be scrutinized very carefully and used only in great need.

Single leader failovers. One of the greatest values of replication is the existence of other datasets that are caught up and can be used as leaders in the case of a failure or because of the need to move traffic off of the original leader. This is not a trivial operation, however, and there are a number of steps that occur. In a planned failover, these steps include the following:
• Identification of the replica that you want to promote to the new leader.
• Depending on the topology, a preliminary partial reconfiguration of the cluster can be performed to move all replicas to replicate from the candidate leader.
• If asynchronous replication is used, a pause in application traffic to allow the candidate leader to catch up.
• Reconfiguration of all application clients to point to the newly promoted leader.

In a clean, planned failover, this all can appear to be quite trivial if you've effectively scripted and automated certain steps. However, relying on these failovers during failure scenarios can create lots of opportunity for problems. An unplanned failover might look something like this:

1. The leader database instance becomes unresponsive.
2. A monitoring heartbeat process attempts to connect to the database leader.
3. After 30 seconds of hanging, the heartbeat triggers a failover algorithm.
4. The failover algorithm does the following:
• Identifies the replica with the latest commit level as the promotion candidate
• Reattaches the other replicas to the promotion candidate at the appropriate point in the log stream
• Monitors until the cluster replicas are caught up
• Reconfigures application configurations via file or a service and pushes them out
• Initiates rebuilding of a new replica

Within this, there are numerous inflection points. We discuss these further in Chapter 12. Despite these challenges, replication remains one of the most commonly implemented features of databases and thus becomes a critical part of the database infrastructure. This means that it must be incorporated into the rest of your reliability infrastructure.

Single leader replication monitoring Effective management of replication requires effective monitoring and operational visibility. There are a number of metrics that must be collected and presented to ensure that replicas are effective in supporting the organization’s SLOs. Critical areas to monitor include the following:

• Replication lag • Latency impacts to writes • Replica availability • Replication consistency • Operational processes

This is touched upon somewhat in Chapter 4, but is worth mentioning again here.

Replication lag and latency. To understand replication flows, we must understand the relative time it takes to perform replicated operations. In asynchronous environments, this means understanding the amount of time that has elapsed between an operation occurring on the master and the time when that write has been applied to the replica. This time can vary wildly from second to second, but the data is crucial. There are a number of ways that you can measure this. Like any distributed system, these rely at some level on the local machine time. Should system clocks or Network Time Protocol (NTP) drift apart from one another, the information can be skewed. There is simply no way to rely on local time on two machines and assume that they are synchronized. For most distributed databases relying on asynchronous replication, this is not an issue. Times are very close, and this will suffice. But, even in these situations, remembering the fact that time is a very relative concept on each node can assist in forensics on some troubling issues. One common approach to measuring the time from insert on a leader to insert on a replica is to insert a heartbeat row of data and then to measure when that time appears on the replica. For example, if you insert data at 12:00:00 and you regularly poll the replica and see that it has not received that value, you can assume that replication is stalled or lagging. If you query at 12:01:00 and the data for 11:59:00 does exist, but 12:00:00 and beyond doesn't, you know replication is at least a minute behind as of 12:01:00. Eventually, this row will commit, and you can use the next row to measure how far behind the replica currently is. In the case of semi- or fully synchronous replication, you will want to know the impact of these configurations on writes. This can and will be measured as part of your overall latency metrics, but you will also want to measure the time the network hops from the leader to the replica take, because this will be the cost of synchronous writes over the network. The following are critical metrics that you must ensure are being gathered:

• Delay in time from leader to replica in asynchronous replication • Network latency between leader and replicas • Write latency impact in synchronous replication
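A common heartbeat implementation writes a timestamp on the leader and compares it with what the replica has applied. The Python sketch below simulates the leader and replica with in-memory stand-ins; in practice the heartbeat row travels through the actual replication stream, and as noted above the result is only as trustworthy as the clocks involved:

    import time

    # Toy stand-ins for a leader and a replica; in real life these are databases
    # and the heartbeat row travels through the replication stream.
    leader_store = {}
    replica_store = {}

    def write_heartbeat():
        leader_store["heartbeat_ts"] = time.time()

    def apply_replication_stream(delay_seconds):
        # Pretend the replica applies the leader's write after some delay.
        time.sleep(delay_seconds)
        replica_store["heartbeat_ts"] = leader_store["heartbeat_ts"]

    def replication_lag_seconds():
        applied = replica_store.get("heartbeat_ts")
        if applied is None:
            return None                            # nothing replicated yet: stalled or still building
        return max(0.0, time.time() - applied)     # clock skew can push this negative

    write_heartbeat()
    apply_replication_stream(delay_seconds=0.2)
    print(f"approximate replica lag: {replication_lag_seconds():.2f}s")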

These metrics are invaluable for any service. Proxy infrastructures can use replication lag information to validate which database replicas are caught up enough to take pro‐ duction read traffic. A proxy layer that can take nodes out of service not only assists in guaranteeing that stale reads do not occur, but also allows replicas that are behind to catch up without the burden of supporting read traffic. Of course this algorithm must take into account what happens if all replicas are delayed. Do you serve from the leader? Do you shed load at the frontend until enough replicas are caught up? Do

you put the system in read-only mode? All of these are potentially effective options if planned for. Additionally, engineers can use replica lag and latency information to troubleshoot data consistency issues, performance degradations, and other complaints that can result from replication delays. Now, let's look at the next set of metrics, on availability and capacity.

Replication availability and capacity. If you are working with a datastore such as Cassandra that distributes data synchronously based on a replication factor, you also need to monitor and be aware of the number of copies that are available to satisfy a quorum read. For instance, suppose that we have a cluster with a replication factor of 3, which means that a write must be replicated to three nodes. Our application requires that two of the three nodes with this data must be able to return results during a query. This means that if we have two failures, we will no longer be able to satisfy our application's queries. Monitoring replica availability proactively lets you know when you are in danger of failing this. Similarly, even in environments without replication factor and quorum requirements, database clusters are still designed with an eye toward how many nodes must be available in order to satisfy SLOs. Monitoring how cluster size matches these expectations is critical. Finally, it is important to recognize when replication has broken completely. Although monitoring replication lag with heartbeats will inform you that replication is falling behind, it will not alert you that something has occurred and the replication stream is broken. There are a number of reasons that replication might break:

• Network partitions • Inability to execute Data Manipulation Language (DML) in statement-based rep‐ lication, including: — Schema mismatch — Nondeterministic SQL causing dataset drift that violates a constraint — Writes that went accidentally to the replica causing dataset drift • Permissions/security changes • Storage space starvation on replica • Corruption on replica

The following are examples of metrics to gather:

• Actual number of available copies of data versus expected number • Replication breakage requiring repair of replicas

• Network metrics between the leaders and replicas
• Change logs for the database schemas and user/permissions
• Metrics on how much storage is being consumed by replication logs
• Database logs that provide more information about issues such as replication errors and corruption

With this information, automation can use replica availability metrics to deploy new replicas when you are underprovisioned. Operators can also more quickly identify the root cause of breakage to determine if they should repair or simply replace a rep‐ lica or if there is a more systemic issue that must be addressed.

Replication consistency. As we discussed earlier, there are possible scenarios that can occur that will cause your datasets to be inconsistent between leader and replica. Sometimes, if this causes a replication event to fail during the apply phase, you will be alerted to this via replication breaking. What is even worse though, is silent corrup‐ tion of data that you do not detect for quite a long time. You will recall from Chapter 7 that we discussed the importance of a validation pipe‐ line for maintaining consistency of datasets with business rules and constraints. You can utilize a similar pipeline to ensure that data is identical across replicas. Like data validation pipelines for consistency, this is often neither simple nor inexpensive in terms of resources. This means that you must be selective in determining which data objects are reviewed and how often. Data that is append-only, such as SSTs or even insert-only tables in B-tree structures, is easier to manage because you can create checksums on a set of rows based on a primary key or date range and compare these checksums across replicas. As long as you let this run frequently enough that you don’t fall behind, you can be relatively sure that this data is consistent. For data that allows for mutations, this can prove more challenging. One approach is to run and store a database-level hashing function on the data after a transaction is completed in the application. When incorporated into the replication stream, a hash will create identical values in each replica if the data replicated appropriately. If it didn’t, the hashes will be different. An asynchronous job that compares hashes on recent transactions can then alert if there is a difference. These are just a few ways to monitor replication consistency. Creating patterns for your software engineers (SWEs) to use, as well as a classification system of data objects to help them determine if a table requires a place in a validation pipeline, will help to ensure that you don’t use too many precious resources. Sampling or just doing

recent time windows may also be effective depending on the type of data that you are storing.3
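For append-only or insert-only data, the comparison can be as simple as hashing the same key range on each node and comparing digests. In this Python sketch, the row sets are invented stand-ins for what you would fetch from the leader and a replica in deterministic primary-key order:

    import hashlib

    def range_checksum(rows):
        """Hash an ordered set of (primary_key, value) pairs for one key range.
        Rows must be fetched in the same deterministic order on every node."""
        digest = hashlib.sha256()
        for pk, value in rows:
            digest.update(repr((pk, value)).encode("utf-8"))
        return digest.hexdigest()

    # Illustrative row sets; in practice these come from the leader and a replica
    # for the same primary-key or date range.
    leader_rows = [(1, "alice"), (2, "bob"), (3, "carol")]
    replica_rows = [(1, "alice"), (2, "bob"), (3, "carol")]

    if range_checksum(leader_rows) != range_checksum(replica_rows):
        print("range mismatch: schedule a repair or a deeper comparison")
    else:
        print("range consistent")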

Operational processes. Finally, it is important to monitor the time and resources required to perform operational processes critical to replication. Over time, as datasets and concurrency grow, these processes can grow more burdensome on a number of dimensions. If you exceed certain thresholds, you might be at risk of not being able to maintain replication freshness or to keep an appropriate number of replicas online at any time to support traffic. Some of these metrics include:
• Dataset size
• Backup duration
• Replica recovery duration
• Network throughput used during backup and recovery
• Time to synchronize after recovery
• Impact to production nodes during backups

By sending events with appropriate metrics every time a backup, recovery, or synchronization occurs, you can create reports to evaluate and potentially predict when your dataset and concurrency will cause your operational processes to become unusable. You can also utilize some basic predictive analytics on how durations or consumption of resources can change based on changes in dataset size or concurrency. Outside of predictive automation, regular reviews and tests can help operations staff to evaluate when their operational processes are no longer scaling. This will allow you to either provision more capacity, redesign systems or processes, or rebalance dataset distributions to maintain effective times that support availability and latency SLOs. Although there will inevitably be other metrics or indicators that you want to measure with respect to your data replication, these are a good working set to ensure that your replication is working effectively and is supporting the SLOs against which you designed it. Single-leader replication is by far the most common implementation of replication due to its relative simplicity. Still, there are times when availability and locality needs are not met by this approach. By allowing writes into a database cluster from more than one leader, the effects of leader failovers can be reduced, and leaders can be put in different zones and regions to allow for better performance. Let's now review the approaches and challenges of this requirement.

3 Download the paper “Replication, Consistency, and Practicality: Are These Mutually Exclusive?”.

Multi-Leader Replication There are really two different approaches to breaking free from the single-leader paradigm of replication. The first method is what we can call multidirectional replication, or traditional multileader. In this approach, the concept of a leader role still exists, and leaders are designed to take writes and propagate them to replicas as well as to the other leader. Typically, there will be two leaders distributed into different datacenters. The second approach is write-anywhere, meaning that any node in the database cluster can effectively take reads or writes at any time. Writes are then propagated to all other nodes. Regardless of which solution is attempted here, the end result is more complex because you must add a layer of conflict resolution. When all writes are going to one leader, you are working with the premise that there can be no chance of conflicting writes going to different nodes. But, if you allow writes to multiple nodes, there is a chance that conflicts can occur. This must be planned for appropriately, causing increased application complexity.

Multileader use cases If the end result of multileader replication is complexity, what requirements could be worth that cost and risk? Let’s look at them here.

Availability. When a leader failover occurs in single-leader asynchronous replication, there is generally an impact to the application of anywhere between 30 seconds on the low end and 30 minutes or even one or more hours on the high end, depending on how the system is designed. This is due to the need for replication consistency checks, crash recovery, or any of a number of other steps. In some cases, this disruption to service might simply be unacceptable, and there are not resources or ability to change the application to tolerate the failovers more trans‐ parently. In this case, the ability to load balance writes across nodes becomes poten‐ tially worth the inevitable complications.

Locality. A business might need to run active sites in two different regions to ensure low latency for a global or distributed customer base. In a read-heavy application, you can still often do this via single-leader replication over a long-distance network. However, if the application is write intensive, the latency impact of sending writes across those long-distance networks might be too great. If this is the case, putting a leader in each datacenter and managing conflict resolution can prove to be the best approach. Disaster recovery. Similar to locality and availability, there are times when an applica‐ tion is so critical that it must be separated across datacenters to ensure availability in the infrequent case of a failure at the datacenter layer. You can still accomplish this

goal with single-leader replication, but only if the secondary region is used for reads only, as discussed earlier, or if it is used only for redundancy. Few businesses can afford to spin up an entire datacenter without using it, however, so multileader replication is often chosen to allow both datacenters to actively take traffic and support customers. With a greater percentage of infrastructures running in cloud services, or with global distribution requirements, it is almost an inevitability that you will eventually need to evaluate multileader replication for one of the aforementioned reasons. Often, the physical implementation of multileader replication can be supported natively or with a third-party piece of software. The challenge comes in managing the inevitable conflicts that will occur.

Conflict resolution in traditional multidirectional replication Traditional multidirectional replication bears the closest resemblance to single leader. Essentially, it just pushes the writes in both directions as you allow writes to go to more than one leader. It sounds good and meets all of the use cases we just discussed. But if you are using asynchronous replication, which is the only feasible approach in an environment incorporating multiple datacenters and slow network connections, there can and will be problems. During times of replication latency or partitioned networks, applications that rely on the stored state in the database will be using stale state. On repair of the replication lag or the network partition, the writes that have been built using different versions of state must be resolved. So how do you and your SWEs manage the problem of conflicting writes in a multileader replication architecture? Very carefully. As with most problems, we can work on this with a few approaches.

Eliminate conflicts. The path of least resistance is always avoidance. There are times when you can perform writes or direct traffic in such a way that there simply are no conflicts. Here are a few examples:
• Give each leader a subset of primary keys that can be generated only on that specific leader. This works well for insert/append-only applications. At its simplest, this might look like one leader writing odd-numbered incrementing keys and the other leader writing even-numbered ones (see the sketch after this list).
• Affinity approaches in which a specific customer is always routed to a specific leader. You can do this by region, unique ID, or any number of other ways.
• Use a secondary leader for failover purposes only, effectively writing to only one leader at a time but maintaining a multileader topology for ease of use.
• Shard at the application layer, putting full application stacks in each region to eliminate the need for active/active cross-region replication.
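As a concrete illustration of the first approach above, each leader can be handed its own key space so that inserts can never collide. The Python sketch below mirrors the common odd/even auto-increment configuration (in MySQL, for example, this is typically done with auto_increment_increment and auto_increment_offset):

    class KeyAllocator:
        """Each leader hands out keys from its own arithmetic sequence,
        so inserts on different leaders can never collide."""

        def __init__(self, offset, increment):
            self.next_key = offset
            self.increment = increment

        def allocate(self):
            key = self.next_key
            self.next_key += self.increment
            return key

    # Two leaders, two datacenters: one issues odd keys, the other even keys.
    leader_a = KeyAllocator(offset=1, increment=2)
    leader_b = KeyAllocator(offset=2, increment=2)

    print([leader_a.allocate() for _ in range(4)])  # [1, 3, 5, 7]
    print([leader_b.allocate() for _ in range(4)])  # [2, 4, 6, 8]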

Of course, just because you configure things this way doesn't mean it will always work. Configuration mistakes, load balancer mistakes, and human errors are all possible and can cause replication to break or data to be corrupted. Thus, you still need to be prepared for accidental conflicts even if they are rare. And as we've discussed before, the rarer the error, the more dangerous it can be.

Last write wins. For the case in which you will not be able to avoid potential write conflicts, you need to decide how you want to manage them when they occur. One of the more common algorithms provided natively in datastores is Last Write Wins (LWW). In LWW, when two writes conflict, the write with the latest timestamp wins. This seems pretty straightforward, but there are a number of issues with timestamps.

Timestamps—Sweet Little Lies Most server clocks use wall clock time, which relies on gettimeofday(). This data is provided by hardware and NTP. Time can flow backward instead of forward for many reasons, such as the following:

• Hardware issues • Virtualization issues • NTP not being enabled, or upstream servers might be wrong • Leap seconds

Leap seconds are rather horrifying. POSIX days are defined as 86,400 seconds in length. Real days are not always 86,400 seconds, however. Leap seconds are scheduled to keep days in line, by skipping or double-counting seconds. This can cause tremendous problems, which is why Google spreads the adjustment out over a day to keep time monotonic.

There are times when LWW is relatively safe. If you can perform immutable writes because you know the correct state of your data at the time of write, using LWW can work. But, if you are relying on state you’ve read in the transaction to perform a write, you are at significant risk of data loss in the case of a network partition. Cassandra and Riak are examples of datastores with LWW implementations. In fact, in the Dynamo paper, LWW is one of the two options described for handling update conflicts.
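A Python sketch of LWW resolution shows both its appeal and its danger: whichever write carries the later timestamp silently discards the other, so clock skew between nodes decides which data survives. The values and timestamps here are invented for illustration:

    from collections import namedtuple

    Version = namedtuple("Version", ["value", "timestamp", "node"])

    def last_write_wins(a, b):
        """Return the surviving version; the other is silently discarded."""
        return a if a.timestamp >= b.timestamp else b

    # Two leaders accepted conflicting updates while partitioned.
    from_dc_east = Version(value={"cart": ["book"]}, timestamp=1700000000.123, node="east")
    from_dc_west = Version(value={"cart": ["book", "mug"]}, timestamp=1700000000.119, node="west")

    winner = last_write_wins(from_dc_east, from_dc_west)
    print("kept:", winner.node, winner.value)   # west's "mug" is lost forever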

Custom resolution options. Due to the constraints of basic algorithms that rely on timestamps, more custom options must often be taken into account. Many replicators will allow for custom code to be executed when a conflict is detected after a write.

The logic required to automatically resolve write conflicts can be quite extensive, and even so, there can be opportunities for making mistakes. Using optimistic replication, which allows for all mutations to be written and replicated, you can allow background processes, the application, or even users to determine what to do to resolve those conflicts. This can be as simple as choosing one version or another of the data object. Alternatively, you could do a full merge of the data.

Conflict-free replicated datatypes. Due to the complexity of logic in custom code for conflict resolution, many organizations might balk at the work and the risks. There is a class of data structures, however, that are built to effectively manage writes from multiple replicas that might have timestamp or network issues. These are called conflict-free replicated datatypes (CRDTs). CRDTs provide strong eventual consistency because they can always be merged or resolved without conflicts. CRDTs are effectively implemented in Riak as of this writing and are utilized in very large implementations of online chat and online betting. As we can see here, conflict resolution in multileader environments is absolutely possible but not a simple problem. The complexity involved in distributed systems is very real and requires a significant amount of engineering time and effort. Additionally, mature implementations of these approaches might not be available in the datastores with which you and your organization work most effectively. So, be very careful before going down the rabbit hole of multileader replication.
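As an illustration of why CRDTs merge cleanly, here is a sketch of one of the simplest, a grow-only counter (G-Counter). Each replica increments only its own slot, and merging is an element-wise maximum, so merges are commutative, associative, and idempotent regardless of delivery order.

class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge by element-wise max."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {replica_id: 0}

    def increment(self, amount: int = 1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter"):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # replicas converge without conflict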

Write-anywhere replication There is an alternate paradigm to the traditional multidirectional replication. In a write-anywhere approach, there are no leaders. Any node can take reads or writes. Dynamo-based systems, such as Riak, Cassandra, and Voldemort are examples of this approach to replication. There are certain attributes of these systems that we will go over in more detail now:

• Eventual consistency • Read and write quorums • Sloppy quorums • Anti-entropy

Different systems will vary on their implementations of these, but together they form an approach to leaderless replication as long as your application can tolerate unor‐ dered writes. There are usually tunables that help modify the behavior of these sys‐ tems to better match your needs, but the presence of unordered writes is an inevitability.

Eventual consistency. The phrase “eventual consistency” is often touted in relation to the class of datastores known as NoSQL. In distributed systems, servers and networks will fail. These systems are distributed to allow for continued availability, but at a cost in data consistency. With a node down for minutes, hours, or even days, nodes easily diverge in terms of the data stored within them.4 When systems come back up, they will resolve using the methods discussed in the previous section on conflict resolution, including the following:

• LWW via timestamps or vector clocks5 • Custom code • Conflict-free replicated datatypes

Although there is no guarantee that data is consistent across all nodes at any time, that data will eventually converge. When you build the datastores, you configure how many copies of the data must be written to provide quorum during failures. That being said, eventual consistency still must be proven to work. There are plenty of opportunities for data loss, whether through misunderstanding of the conflict res‐ olution techniques used and the results of their application or through bugs. Jepsen is a great test suite that shows how to effectively test data integrity in a distributed data‐ store. You can find some additional reading at the following:

• Jepsen’s Distributed Systems Safety Research • Martin Fowler’s “Eventual Consistency” • Peter Bailis and Ali Ghodsi’s “Eventual Consistency Today: Limitations, Exten‐ sions, and Beyond”

Read and write quorums. One key factor in write-anywhere replication is an understanding of how many nodes must be available to deliver or accept data to maintain consistency. At the client or database levels, there is generally an ability to define quorum. Historically, a quorum is the minimum number of members of an assembly necessary to conduct the business of that group. In the case of distributed systems, this means the minimum number of readers or writers necessary to guarantee consistency of data. For instance, in a cluster of three nodes, you might want to tolerate one node's failure. This means you require a quorum of two for reads and writes.

4 Vogels, Werner, “Eventually Consistent”, practice. 5 Baldoni, Roberto and Raynal, Michel, “Fundamentals of Distributed Computing: A Practical Tour of Vector Clock Systems”, Distributed Systems Online.

When making decisions about quorums, there is a simple formula. N is the number of nodes in a cluster, R is the number of read nodes available, and W is the number of write nodes. If R + W is greater than N, you have an effective quorum to guarantee at least one good read after a write. In our example of three nodes, this means that you need at least two readers and two writers, given that 2 + 2 > 3. If you lose two nodes, you have only 1 + 1, or 2. That is less than 3, and thus you don't have quorum, and the cluster should not return data on read. If, on reading two nodes, the application receives two different results (either on one node or divergent data), repair will be done using the defined conflict resolution methods. This is called a read repair. There is a lot more to understanding quorums and all of the theory and practice of distributed systems. For more reading, we recommend the following:

• The Load, Capacity, and Availability of Quorum Systems • Quorum Systems: With Applications to Storage and Consensus
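The R + W > N rule described above is easy to encode and worth checking against your own cluster sizes. A small sketch:

def has_read_after_write_quorum(n: int, r: int, w: int) -> bool:
    """True if every read set must overlap every write set (R + W > N)."""
    return r + w > n

# Three-node cluster, tolerating one node failure: quorum reads and writes of 2.
assert has_read_after_write_quorum(n=3, r=2, w=2)      # 2 + 2 > 3
assert not has_read_after_write_quorum(n=3, r=1, w=1)  # 1 + 1 <= 3: stale reads possible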

Sloppy quorums. There will be times when you have nodes up, but they do not have the data needed to meet quorum. Perhaps N1, N2, and N3 are configured to take writes, and N2 and N3 are down, but N1, N4, and N5 are available. At this point, the system should stop allowing writes for that data until a node can be reintroduced into the cluster and quorum is resumed. However, if it is more important to continue receiving writes, you can allow a sloppy quorum for writes. This means that another node can begin receiving writes to get quorum met. Once N2 or N3 are brought back into the cluster, the data can be propagated back to them via a process called a hinted hand-off. Quorums are trade-offs between consistency and availability. It is absolutely crucial that you understand how your datastore actually implements quorums. You must understand when sloppy quorum is allowed and what quorums can lead to strong consistency. Documentation can be misleading, so testing the realities of the imple‐ mentations is part of your job.

Anti-entropy. Another tool for maintaining eventual consistency is anti-entropy. Between read repairs and hinted hand-offs, a Dynamo-based datastore can maintain eventual consistency quite effectively. However, if data is not read very often, inconsistencies can last for a very long time. This can put the application at risk of receiving stale data in the case of future failovers. Thus, there needs to be a separate mechanism for synchronizing data. This process is called anti-entropy. An example of anti-entropy is the Merkle tree, which you can find implemented in Riak, Cassandra, and Voldemort. The Merkle tree is a balanced tree of object hashes. By building hierarchical trees, the anti-entropy background process can rapidly identify different values between nodes and repair them.
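To give a feel for how hash trees speed up anti-entropy, here is a simplified, flattened sketch: keys are hashed into buckets, bucket digests are compared first, and only buckets whose digests differ need to be compared key by key. Real implementations build multiple tree levels over much larger key ranges.

import hashlib

def bucket_of(key: str, buckets: int) -> int:
    """Deterministically assign a key to a bucket."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets

def bucket_digests(data: dict, buckets: int = 4) -> dict:
    """Digest the contents of each bucket of key/value pairs."""
    contents = {i: [] for i in range(buckets)}
    for key in sorted(data):
        contents[bucket_of(key, buckets)].append(f"{key}={data[key]}")
    return {i: hashlib.sha256("|".join(items).encode()).hexdigest()
            for i, items in contents.items()}

def differing_buckets(replica_a: dict, replica_b: dict, buckets: int = 4) -> list:
    """Compare digests only; descend into a bucket only when its digest differs."""
    da, db = bucket_digests(replica_a, buckets), bucket_digests(replica_b, buckets)
    return [i for i in range(buckets) if da[i] != db[i]]

a = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}
b = dict(a, **{"user:2": "bobby"})  # one divergent value
print(differing_buckets(a, b))      # only the bucket containing user:2 needs repair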

These hash trees are modified on write, and are regularly cleared and regenerated to minimize the risk of missing inconsistent data. Anti-entropy is critical for datastores that store a lot of cold, or infrequently accessed, data. It is a good complement to hinted hand-offs and read repair. Making sure that anti-entropy is in place for these datastores will help to provide as much consistency as possible in your distributed datastore. Although there is a significant difference in the implementation details of these systems, the recipes come down to the components discussed earlier. Assuming that your application can tolerate unordered writes and stale reads, a leaderless replication system can provide excellent availability and scale. Having reviewed the three most common approaches to replicated datastores, you and your supported teams should have a solid high-level understanding of the approaches taken to distributing your data across multiple systems. This allows you to design systems that meet your organization's needs based on your team's experience and comfort zones, needs for availability, scale, performance, and data locality. Wrapping Up This chapter was a crash course in data storage. We've looked at storage from how we lay data down on disk to how we push it around clusters and datacenters. This is the foundation for database architecture, and armed with this knowledge, albeit at a high level, we will dive even more deeply into the attributes of datastores to help you and your teams choose the appropriate architectures for your organization's needs.


CHAPTER 11 Datastore Field Guide

Technically, a datastore is just that—storage of data and the associated software and structure to allow it to be stored, modified, and accessed. But we are specifically speaking of datastores that today's organizations would use to fulfill these purposes with nontrivial numbers of users accessing nontrivial amounts of data at nontrivial levels of concurrency. A field guide is traditionally carried by a reader looking to identify flora, fauna, or other objects in nature. Carried out into the field, it helps the user distinguish between a wide range of similar objects. Our goal in this chapter is to help you understand the identifying characteristics of various datastores. Armed with this information, we hope that you can go into the world understanding the best use cases for these datastores—as well as their appropriate care and feeding. In this chapter, we begin by defining attributes and categories of a datastore that are pertinent to the developers of applications that write and consume data. After this, we dive into the categories that would be of greater interest to architects and operators of datastores. Although we believe anyone developing, designing, or operating datastores should be aware of all of the attributes of that datastore, we recognize that people often evaluate these things from their specific job roles. Our goal is not to be comprehensive here, because there are any number of datastores out there in use. Instead, we hope to familiarize you with a good sampling and to give you the tools to do further investigation based on your own needs and objectives.

Conceptual Attributes of a Datastore There are numerous ways to categorize a datastore. How you do so really depends on your job and how you might interact with the datastore. Do you build features in applications that query, store, and modify data? Do you query and analyze data for decision making? Do you design the systems on which the database will run? Do you administer, tune, or monitor the database? Each role has a certain view of the database and the data within. In the world of ORMs and serverless architectures exposing APIs, there has been a movement toward abstracting away the datastores from the consumers who use them. We don't agree with this. Understanding each attribute of the datastore you (or someone else) are choosing, and the implications thereof, is critical to doing your job well. There is no such thing as a free lunch, and each attractive feature will come with a trade-off or caveat. Ensuring that the teams working with these datastores are fully educated about this is a crucial function. The Data Model For most software engineers (SWEs), the data model is one of the most important categorizations. How the data is structured and how relationships are managed are crucial to those building applications on top of it. This also significantly affects how you manage database changes and migrations, as the different models often manage such changes very differently. There are four prevalent data models covered in this section: relational, key–value, document, and navigational (or graph). Each has its own uses, limitations, and quirks. The relational model has historically been the most prevalent. With significant time in production in a huge number of shops, it can be considered the most well understood, the most stable, and the least risky of the choices available.

The relational model The relational model has been around since its initial proposal by E.F. Codd, who published his paper “A Relational Model of Data for Large Shared Data Banks” in 1970, a year after an internal IBM paper. Because the purpose of this guide is not to give you a full background but rather to help you understand systems you encounter today, we will focus on relational systems in modern organizations. The basic premise of relational database models is that data is represented as a series of relationships, based around unique keys that are the core identifiers for a piece of data. The relational model creates consistency of data across tables with constraints on relationships, cardinality, values, and the requirements for certain attributes to exist or not. The relational model is formalized and includes various levels of strictness,

also known as normalization. The reality is that many of these theoretical requirements fall by the wayside as performance and concurrency come into play.1 Well-known relational databases include Oracle, MySQL, PostgreSQL, DB2, SQL Server, and Sybase. More alternative players in the field include Google Spanner, Amazon Redshift, NuoDB, and Firebird. Many of these alternative systems are classified as NewSQL. These are considered to be a subclass of relational database management systems that seek to break some of the barriers of concurrency and scale while maintaining consistency guarantees. This will be discussed further in this chapter.2 The relational model provides a very well-known approach to data retrieval. By supporting joins, one-to-many relationships, and many-to-many relationships, developers have a high level of flexibility in how they define their data model. This can also lead to much more challenging approaches to schema evolution, as the addition, modification, or removal of tables, relationships, and attributes can all require a large amount of coordination and many moving parts in order to be accomplished. This can lead to expensive and risky changes, as discussed in Chapter 8. Many software teams choose to operate an object-relational mapping (ORM) layer to facilitate work by mapping the relational model to the objects defined at the software layer. Such ORMs can be great tools for developer velocity, but they can prove problematic to the database reliability engineer (DBRE) team in multiple ways.

ORMs and You ORMs have matured greatly over the past decade, and as a DBRE, you don’t need to be as leery of them as you used to be. Still, there are some gotchas that you will need to consider.

• ORMs tie reads and writes to tables. This makes any number of optimizations for one part of the workload more challenging because it affects the entire workload. • ORMs can hold transactions much longer than necessary, causing significant impact to finite resources because snapshots are maintained excessively. • ORMs can create a huge number of unnecessary queries. • ORMs can create convoluted and poorly performing queries.

Beyond these obvious issues, one of the greater challenges is that the ORM abstracts the database away, which eliminates the collaboration needed to scale an organization past the physical number of database administrators (DBAs) who work for them. The

1 Codd, E.F., “The relational model for database management: version 2”, ACM Digital Library. 2 Ibid.

The ORM allows constraints to be ignored and logic to be obfuscated, and it creates barriers for DBREs trying to understand the application's interactions with the datastores.3
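The "huge number of unnecessary queries" gotcha from the list above is most often the classic N+1 pattern, where lazy loading issues one query per parent row instead of a single join. The sketch below reproduces it with plain SQLite so you can count the statements an ORM might silently generate on your behalf.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Octavia'), (2, 'Ursula');
    INSERT INTO books VALUES (1, 1, 'Kindred'), (2, 2, 'The Dispossessed');
""")

# N+1 pattern: one query for the authors, then one query per author,
# which is what lazy loading in an ORM frequently produces behind your back.
queries = 0
authors = conn.execute("SELECT id, name FROM authors").fetchall()
queries += 1
for author_id, _name in authors:
    conn.execute("SELECT title FROM books WHERE author_id = ?", (author_id,)).fetchall()
    queries += 1
print("lazy loading issued", queries, "queries")  # 3 for 2 authors; N+1 in general

# The same result in a single statement.
rows = conn.execute("""
    SELECT a.name, b.title FROM authors a JOIN books b ON b.author_id = a.id
""").fetchall()
print("a join issued 1 query for", len(rows), "rows")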

All of this leads many software engineers and architects to think of relational systems as inflexible and an impediment to developer velocity. This is far from accurate, however, and later in this chapter we present a more accurate list of pros and cons and bust some of the myths prevalent in many such lists.

The key–value model A key–value model stores data as a dictionary or hash. A dictionary is analogous to a table and contains any number of objects. Each object can store any number of attributes or fields within it. Like a relational database, these records are uniquely identified with a key. Unlike relational databases, there is no way to create mappings between objects based on those keys. The key–value datastore sees an object as a blob of data. It isn't inherently aware of the data it holds, and thus each object can have different fields, nested objects, and an infinite amount of variety. This variety comes at a cost, including the potential for inconsistency because rules are not enforced at the common storage layer. Similarly, efficiencies in datatypes and indexing are unavailable. On the other hand, a lot of the overhead inherent to managing various datatypes, constraints, and relationships is gone. If the application doesn't need this, efficiencies can be realized. Examples of key–value stores can be quite varied. One example is Dynamo. In 2007, Amazon published the Dynamo paper as a set of techniques to build a highly available, distributed datastore. Once we've gone over all of the attributes, we will discuss Dynamo in more detail. Dynamo-based systems include Aerospike, Cassandra, Riak, and Voldemort. Other key–value implementations include Redis, Oracle NoSQL Database, and Tokyo Cabinet.
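A sketch of the key–value contract, assuming nothing about the values: the store sees opaque blobs, so nothing stops two objects in the same "dictionary" from having completely different shapes, and nothing enforces references between them.

import json

store = {}  # stands in for a key-value datastore: key -> opaque blob

def put(key: str, obj: dict):
    store[key] = json.dumps(obj).encode()   # the store only ever sees bytes

def get(key: str) -> dict:
    return json.loads(store[key])

put("user:42", {"name": "Dana", "email": "[email protected]"})
put("user:43", {"nick": "ro", "follows": ["user:999"]})  # different shape, dangling reference

# No schema, constraints, or relationships are enforced at the storage layer;
# the application carries all of that responsibility.
print(get("user:42")["name"], get("user:43")["follows"])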

The document model The document model is technically a subset of the key–value model. The difference with the document model is that the database maintains metadata about the structure of the document. This allows for datatype optimization, secondary indexing, and other optimizations. Document stores store all information about the object together, rather than across tables. This allows for all data to be retrieved from one call, rather than requiring joins, which, although declaratively easy, can consume significantly more resources. This also typically eliminates the need for an ORM layer.

3 Ireland, Christopher, et. al, “A Classification of Object-Relational Impedance Mismatch”, IEEE Xplore.

On the other hand, this means that document stores inherently require denormalization if different views of the object are required. This can cause bloat and create consistency issues. Additionally, external tools are required to enforce data governance, as the schema no longer exists as a self-documenting system.4
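A sketch of that trade-off: one nested document returns everything about an order in a single fetch, but showing the same customer on two orders means embedding a copy of the customer data in each, and keeping those copies in sync falls to the application.

import copy

orders = {
    "order:1001": {
        "customer": {"id": 7, "name": "Dana Liu"},
        "items": [{"sku": "A-11", "qty": 2}, {"sku": "B-20", "qty": 1}],
        "status": "shipped",
    }
}

# One call retrieves the whole aggregate -- no joins needed.
print(orders["order:1001"]["customer"]["name"], orders["order:1001"]["status"])

# A second order for the same customer duplicates the embedded customer data.
orders["order:1002"] = {
    "customer": copy.deepcopy(orders["order:1001"]["customer"]),
    "items": [{"sku": "C-3", "qty": 5}],
    "status": "pending",
}

# If the customer changes their name, every embedded copy must be updated,
# or reads will return inconsistent answers -- the denormalization cost.
orders["order:1001"]["customer"]["name"] = "Dana Liu-Okonkwo"
print({o["customer"]["name"] for o in orders.values()})  # two different names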

Data Governance Data governance is the management of the availability, integrity, and security of the data that an organization saves and uses. Intro‐ duction of new data attributes is something that should be consid‐ ered carefully and documented. The use of JSON for data storage allows new data attributes to be introduced too easily and even accidentally.

The navigational model Navigational models began with hierarchical and network databases. Today, when referring to navigational models, we are almost always discussing the graph data model. A graph database uses nodes, edges, and properties to represent and store data and the connections between objects. The node holds the data about a specific object, the edge is the relationship to another object, and properties allow additional data about the node to be added. Because relationships are directly stored as part of the data, links can easily be followed. Often, an entire graph can be retrieved in one call. Graph stores, like document stores, often map more directly to the structure of object-oriented applications. They also eliminate the need for joins and can prove to have more flexibility in terms of data model evolution. Of course, this works only for data that is ideal for graph-appropriate queries. Traditional queries can prove to be far less performant.5 Each of these models has its place in a certain subset of applications. We will summarize the options and trade-offs within. First, let's look at transactional support and implementation attributes. Transactions How a datastore handles transactions is also a considerably important attribute to understand and consider. A transaction is effectively a logical unit of work within a database that can be considered to be indivisible. All of the operations in the transaction must be executed, or rolled back, to maintain consistency within the datastore.

4 Vera, Harley, et al., “ for NoSQL Document-Oriented Databases”. 5 Stonebraker, Michael and Held, Gerald, “Networks, Hierarchies and Relations in Data Base Management Systems”.

Being able to trust that all aspects of a transaction will be committed or rolled back greatly simplifies error handling logic in database-driven applications. These guarantees of the transactional model allow developers to ignore certain aspects of failure and concurrency that would consume significant developer cycles and resources. If you've worked predominantly with traditional relational datastores, you probably take the existence of transactions for granted. This is because almost all of these datastores are built on the ACID model, introduced by IBM in 1975 and explained next. All reads and writes are considered to be transactions, and they utilize the underlying architecture of database concurrency to achieve this.

ACID An ACID database provides a set of guarantees that, when put together, create the acronym ACID. These guarantees are (A)tomicity, (C)onsistency, (I)solation, and (D)urability. In 1983, Andreas Reuter and Theo Härder coined the acronym, building on work by Jim Gray, who enumerated Atomicity, Consistency, and Durability but left out Isolation. These four properties describe the major guarantees of the transac‐ tion paradigm, which has influenced many aspects of development in database sys‐ tems. It is of the utmost importance when working with a datastore to understand how it defines and implements these concepts because there can be a significant amount of ambiguity and diversity. With this in mind, it behooves us to consider each property and to understand the variations to be found in the wild.6

Atomicity Atomicity refers to the guarantee that an entire transaction will be committed, or written, to the datastore or that the entire transaction will be rolled back. There is no such thing as a partial write or rollback in an atomic database. Atomicity, in this con‐ text, does not refer to atomic operations as you might find in software engineering. That term refers to the guarantee of isolation from concurrent processes seeing work in progress rather than only the before and after results. There are many reasons that a transaction might fail and require rollback. The client process might terminate mid-transaction, or perhaps a network fault could terminate the connection. Similarly, database crashes, server faults, and numerous other opera‐ tions could require a partially completed transaction to be rolled back. PostgreSQL implements this by using pg_log. Transactions are written in pg_log and given a state of in progress, committed, or aborted. Should a client abandon or rollback

6 Vieira, Marco, et al., “Timely ACID Transactions in DBMS”.

a transaction, it will be marked as aborted. Backend processes will also periodically mark transactions as aborted if there are no backends mapped to them. It is important to note that you can consider writes atomic only if the underlying disk page writes are atomic. There is significant disagreement on the atomicity of sector writes. Most modern disks will channel power to writing a sector even during a disk failure. But depending on the layers of abstraction between the physical drive and the actual writes being flushed to disk, there are still plenty of opportunities for data loss.

Consistency The guarantee of consistency is a guarantee that any transaction will bring the database from one valid state to another. A transaction being written can assume that it will not be able to violate the defined rules. Technically, consistency is defined at the application level rather than in the database. Traditional databases do, however, give the developer tools to enforce this consistency, including constraints and triggers. Constraints can include foreign keys with cascading, not null, uniqueness constraints, datatypes and lengths, and even specific values being allowed in a specific field. It is interesting and frustrating that consistency is used elsewhere in the realms of databases and software. The CAP theorem also uses the term consistency, but in a very different way. Similarly, you will hear this term when discussing hashing and replication.
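A quick, runnable illustration of those enforcement tools using SQLite, which supports NOT NULL, UNIQUE, CHECK, and foreign key constraints once the pragma is enabled; the datastore refuses transitions into an invalid state rather than trusting the application to avoid them.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE players (
        id INTEGER PRIMARY KEY,
        team_id INTEGER NOT NULL REFERENCES teams(id),
        jersey INTEGER CHECK (jersey BETWEEN 0 AND 99)
    );
    INSERT INTO teams (id, name) VALUES (1, 'Portland');
""")

for stmt in [
    "INSERT INTO teams (id, name) VALUES (2, NULL)",          # violates NOT NULL
    "INSERT INTO players (team_id, jersey) VALUES (99, 10)",  # violates the foreign key
    "INSERT INTO players (team_id, jersey) VALUES (1, 500)",  # violates the CHECK
]:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError as err:
        print("rejected:", err)  # the invalid state never enters the database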

Isolation The isolation guarantee is a promise that the concurrent execution of transactions results in the same state that would occur if you were to run those transactions serially and sequentially. ACID databases do this via a combination of techniques that can include write locks, read locks, and snapshots. Collectively, this is called concurrency control. In practice, there are multiple types of isolation that can lead to different behaviors in the database. Stricter versions can significantly impact the performance of concurrent transactions, whereas more relaxed ones might lead to better performance at the cost of less isolation.7 The ANSI/ISO SQL standard defines four possible levels of transaction isolation. Each level can potentially provide a different outcome for the same transaction. These levels are defined in terms of three potential occurrences that are permitted or not at each isolation level:

7 Adya, Atul et al., “Generalized Isolation Level Definitions”.

Conceptual Attributes of a Datastore | 217 Dirty read With a dirty read, you can potentially read uncommitted, or dirty, data that is being written in another transaction from another client. Nonrepeatable read With a nonrepeatable read, within the context of a transaction, if you perform the same read twice, you could potentially get different results based on other concurrent activities in the database. Phantom read With a phantom read, within the context of a transaction, you perform the same read twice, and the data returned the second time is different from the first. This is different from a nonrepeatable read because with a phantom read, data you have already queried does not change, but more data is returned by your query than before. To avoid these phenomena, there are four potential isolation levels that can be uti‐ lized: Read Uncommitted This is the lowest isolation level. Here, dirty reads, dirty writes, and nonrepeata‐ ble and phantom reads are all allowed. Read Committed In this isolation level, the goal is to avoid dirty reads and dirty writes. In other words, you should not be able to read, or overwrite, uncommitted data. Some databases will avoid dirty writes via write locks acquired on selected data. Write locks are held until the data is committed, and read locks are released after select. Dirty reads are usually implemented by keeping two copies of the data being written in the transaction, one of older committed data to be used for reads from other transactions and one for the data that has been written but not committed. In read committed isolation, you can still experience nonrepeatable reads how‐ ever. If uncommitted data is read once and then read again after it has been com‐ mitted, you will see different values within the context of your own transaction. Repeatable Reads To achieve read committed isolation level and to avoid nonrepeatable reads, you must implement additional controls. If a database is using locks to manage con‐ currency control, a client would need to keep read and write locks until the end of the transaction. This would not maintain a range lock, though, so it would be possible to get phantom reads. As you can imagine, this lock-based approach is heavy handed and can lead to significant performance impact on highly concur‐ rent systems.

218 | Chapter 11: Datastore Field Guide The other way to accomplish this is via snapshot isolation. In snapshot isolation, after a transaction is started, the client will see an image of the database based on the current time. Additional writes will not show in the snapshot, allowing for long-running queries to have consistent, repeatable reads. Snapshot isolation uses write locks but not read locks. The goal is to ensure that reads do not block writ‐ ers, and vice versa. Since this requires more than just two copies, it is referred to as multiversion concurrency control (MVCC). In repeatable read snapshot isolation, write skew can still occur. In write skew, two writes can be allowed on the same column or columns in a row from two different writers who have read the columns they are updating. This results in rows that can have data from two transactions. Serializable This is the highest isolation level and is meant to avoid all of the aforementioned phenomena. Like in repeatable read, if locks are the focus of concurrency control, read and write locks are held for the duration of the transaction. There are addi‐ tions, however, and the locking strategy is called 2-phase locking (2PL). In 2PL, a lock can be shared or exclusive. Multiple readers can hold shared locks for reading. However, to get an exclusive lock for a write, all shared read locks must be released after a commit. Similarly, if a write is occurring, shared locks for reads cannot be acquired. In this mode, it can be quite common in high- concurrency environments for transactions to be stuck waiting for a lock. This is called a deadlock. Additionally, range-locks must also be acquired for queries using ranges in their WHERE clauses. Otherwise, phantom reads occur. 2PL can dramatically affect latency for transactions. When many transactions are waiting, system-wide latency can increase significantly. Thus, many systems do not truly implement serializability and stick to repeatable read. The non-lock-based approach builds on snapshot isolation and is called serial snapshot isolation (SSI). This approach is an optimistic serialization, whereby the database waits until commit to see if any activities have occurred to cause a seri‐ alizability issue, most often a write collision. This can significantly reduce latency in systems for which concurrency violations are few. However, if these are regular things, the constant rollback and retries can be quite significant. Because each isolation level is stronger than those below in that no higher isolation level allows an action forbidden by a lower one, the standard permits a DBMS to run

a transaction at an isolation level stronger than that requested (e.g., a “Read Committed” transaction may actually be performed at a “Repeatable Read” isolation level).

Variability in Isolation As we mentioned, there are significant differences between datastores in their implementation of the ANSI isolation standards:

• PostgreSQL: Has read committed, repeatable read, and serializable levels. Uses SSI for serializable. • Oracle: Only has read-committed and serializable options. Serializable is closer to repeatable-read than actual serializable. • MySQL w/InnoDB: Has read committed, repeatable read, and serializable levels. Uses 2PL for serializable but does not detect lost updates.8
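Because defaults differ as well (MySQL/InnoDB defaults to repeatable read, PostgreSQL to read committed), it is worth setting the level explicitly when it matters. A hedged sketch using a generic DB-API connection named conn (hypothetical); the level names are the ANSI ones, but the exact statement syntax and the point at which it may be issued differ by engine, so verify against your driver and server documentation.

def run_repeatable_read(conn, work):
    """Run `work(cursor)` inside a repeatable-read transaction (sketch)."""
    cur = conn.cursor()
    # MySQL/InnoDB style: set the level before the transaction starts.
    # PostgreSQL instead allows SET TRANSACTION as the first statement inside one.
    cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    cur.execute("BEGIN")
    try:
        result = work(cur)   # all reads in `work` should see one consistent snapshot
        conn.commit()
        return result
    except Exception:
        conn.rollback()
        raise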

We have only scratched the surface of isolation, isolation anomalies, and isolation implementations. We have a few delightful recommended reads for you for further dives at the end of the chapter.

Durability The durability guarantee promises us that as soon as a transaction has been committed, it remains committed. Whether there is a power loss, database crash, hardware fault, or any other issue, the transaction stays durable. Obviously, the database cannot promise that the underlying hardware will support this durability. As discussed in Chapter 5, there are numerous opportunities for the database to believe it has synchronized to disk, when the reality is very different. Durability is linked closely to atomicity, as durability is required for atomicity. Many databases implement a write-ahead log (WAL) to capture all writes before they are pushed to disk. This log is used to undo a transaction as well as to reapply it. If there is a failure, upon restarting, the database can check this log against the system to determine whether the transaction must be undone, completed, or ignored. Much like isolation levels, there are times when durability can and should be relaxed to accommodate performance. For true durability, flush to disk must occur on every commit. This can become prohibitively expensive and is not required for all transactions and writes. For instance, in MySQL, you can tune the InnoDB log flush to run periodically rather than after each commit.
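As a concrete example of relaxing durability, MySQL's innodb_flush_log_at_trx_commit variable controls whether the InnoDB log is flushed to disk at every commit (1, the fully durable default) or roughly once per second (0 or 2), trading a small window of possible loss for throughput; sync_binlog does the same for the binary log. A sketch of inspecting and changing it, assuming a DB-API connection `conn` to a MySQL server with sufficient privileges:

# Hypothetical connection `conn`; the variable names are real MySQL/InnoDB
# settings, but validate the trade-off for your workload before relaxing
# durability in production.
cur = conn.cursor()

cur.execute("SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'")
print(cur.fetchone())  # e.g. ('innodb_flush_log_at_trx_commit', '1') -- flush on every commit

# Accept up to roughly one second of lost transactions on a crash in exchange
# for much cheaper commits (log written at commit, fsynced periodically).
cur.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 2")
cur.execute("SET GLOBAL sync_binlog = 100")  # fsync the binary log every 100 commits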

8 See the post “If Eventual Consistency Seems Hard, Wait Till You Try MVCC” on Baron Schwartz’s blog.

Similarly, you can do this for replication logs.9 Even though we have stayed fairly high-level here, it should be apparent just how much detail is hidden and taken for granted in systems that support transactions. As a DBRE in an organization, it is critical for you to ensure familiarity with the implementation not only for yourself, but also for the development organization. Often the details of these implementations are not readily apparent from documentation, and further tests via such tools as Jepsen and Hermitage can assist you in this discovery process. Similarly, this knowledge can help you in choosing appropriate configurations when there are options to relax durability or use weaker isolation. Alternatively, knowing when database defaults do not meet your application's needs can be just as important. BASE As engineers have looked to alternatives to traditional relational systems, the term BASE has begun to be used as a foil to ACID. BASE stands for basically available, soft state, and eventual consistency. This focuses on nontransactional systems that are distributed and might have fairly nontraditional replication and synchronization capabilities. Unlike ACID systems, there might not ever be a clear state while the system is up and taking traffic. Similarly, without concurrency control needs for transactions, write throughput and concurrency can be dramatically increased at the expense of atomicity, isolation, and consistency.10 Having looked at the data models and transactional models available to datastores, we've covered the conceptual attributes most relevant to developers. Still, there are numerous other attributes that must be considered when evaluating not only database choice but also the entire operational ecosystem and infrastructure around those databases (Table 11-1).

Table 11-1. Datastore conceptual attribute summary

Attribute              MySQL              Cassandra           MongoDB          Neo4J
Data model             Relational         Key–Value           Document         Navigational
Model maturity         Mature             2008                2007             2010
Object relationships   Foreign keys       None                DBRefs           Core to model
Atomicity              Supported          Partition level     Document level   Object level
Consistency (node)     Supported          Unsupported         Unsupported      Strong consistency
Consistency (cluster)  Replication based  Eventual (tunable)  Eventual         XA transaction support

9 Sears, Russell, and Brewer, Eric, “Segment-Based Recovery: Write-ahead logging revisited”. 10 Roe, Charles, “The Question of Processing: An ACID, Base, NoSQL Primer”.

Table 11-1. Datastore conceptual attribute summary (continued)

Attribute   MySQL             Cassandra            MongoDB             Neo4J
Isolation   MVCC              Serializable option  Read-uncommitted    Read committed
Durability  For DML, not DDL  Supported, tunable   Supported, tunable  Supported, WAL

Now, a lot of this is overly simplified, and a healthy skepticism of the functionality of a feature, supported by testing the efficacy of those claims, will help to clarify things as you evaluate a datastore for your application. Even with these caveats, there are some definite differences that will help to determine the appropriate choice for your application. The next step is to evaluate the internal attributes of a datastore to get the full picture. Internal Attributes of a Datastore There are numerous ways to describe and categorize a datastore. The data model and transactional structures are attributes that directly affect application architecture and logic. They thus tend to be a big focus of developers who are looking for velocity and flexibility. The internal, architectural implementations of these databases tend to be black boxes, or at least only features on a glossy marketing brochure. Still, they are crucial to choosing the appropriate datastore for the long term. Storage We went over storage in detail in Chapter 10. Each datastore will have one or more options available to it for laying data down on disk. This often comes in the form of storage engines. The storage engine manages the reading and writing of data, locking, concurrent access to data, and any processes needed to manage data structures, such as B-tree indexes, log-structured merge (LSM) trees, and Bloom filters. Some databases, like MySQL and MongoDB, offer multiple storage engine options. For example, in MongoDB, you can use MMap, WiredTiger (with B-tree or LSM structures), or RocksDB, which is based on LSM trees. Storage engine implementations will vary significantly, but their attributes can generally be broken down to the following:

• Write performance • Read performance • Durability of writes • Storage size

Evaluating storage engines based on these attributes will help to determine which to choose for your datastore. There are often trade-offs between read and write performance as well as durability. There are also features that can be implemented to increase the effective durability of the storage engine.

Understanding these, and of course benchmarking and testing the veracity of claims of durability, is of the utmost importance. The Ubiquitous CAP Theorem Section Often, when people discuss these attributes, they will refer to Eric Brewer's CAP theorem (see Figure 11-1). The CAP theorem states that any networked shared-data system can have at most two of three properties or guarantees: (C)onsistency, (A)vailability, or network (P)artition tolerance. As with the terms in ACID, these terms are overly generalized. Each one is not truly either/or and is, in fact, a continuum. Many will refer to a system as CP or AP, meaning that it is designed to encompass two specific properties while trading off the other. Yet, if you dive into those systems, you will find their implementations of each specific attribute to be incomplete, having only achieved a portion of Availability or Consistency.11

Figure 11-1. Brewer’s CAP theorem: Consistency, Availability, and Partition tolerance

CAP is meant to help us understand the trade-offs between consistency and availability. Network partitions in distributed systems are an inevitability because networks are inherently unreliable. When a partition occurs, the node(s) on one side of the partition will inevitably lose consistency if they allow state to be updated. If consistency is preferred,

11 See Martin Kleppmann’s article “Please stop calling databases CP or AP”.

one side of the partition must become unavailable. Let's look at each term more carefully to understand what those attributes can potentially encompass.

Consistency Recall that we discussed consistency in the section on transactions. It is the C in ACID. Annoyingly, ACID consistency is not the same as CAP consistency. In ACID, consistency means that a transaction preserves all database rules and constraints. In CAP, consistency means linearizability. Linearizability guarantees that a set of operations on an object in a distributed system will occur in real-time order. Because operations can be reads and writes, this means that these operations must appear as they occur to the rest of the users in the system. Linearizability is a guarantee of sequential consistency within real time.12 ACID consistency, like CAP consistency, cannot be maintained across a network partition. This means that ACID-based transactional datastores can guarantee consistency only by sacrificing availability in the face of a network partition.13 BASE systems were developed, amongst other reasons, to be able to tolerate network partitions without sacrificing availability.

Availability The availability aspect of the CAP theorem refers to the ability to process requests. Normally, most distributed systems can provide consistency and availability. But, in the face of a network partition where a subset of nodes is split off from another subset of nodes, the decision must be made to remain available at the expense of consistency. Of course, no system can maintain 100% availability over time, reflecting what we said before regarding availability as a continuum rather than either/or.

Partition tolerance A network partition is a temporary or permanent disruption in connectivity that ends up disrupting communication between two subsets of the network infrastructure. In effect, this often will create two smaller clusters. Each of these clusters can believe it is the last cluster standing, allowing writes to continue. This leads to two divergent datasets and is also referred to as a split brain. The CAP theorem was published to help people understand the trade-offs between consistency and availability in distributed datastores. In practice, network partitions encompass a small amount of time in the life cycle of a datastore. Consistency and availability can and should be delivered together. When partitions do occur, however,

12 See Peter Bailis’s article “Linearizability versus Serializability”. 13 See Eric Brewer’s post “CAP Twelve Years Later: How the ‘Rules’ Have Changed”.

the system must be able to detect, manage, and recover to restore consistency and availability. It is worth noting that the CAP theorem does not account for latency or performance at all. Latency can be as crucial as availability, and bad latency can also be a potential cause of consistency issues. Long enough latency can pass a boundary that forces a system to enter into the failure state associated with network partitions. There is a trade-off with regard to latency that is often made more explicitly than those of Consistency and Availability. In fact, the other important reason that BASE systems and the NoSQL movement came about was the requirement for increased performance at scale. Now that we have familiarized ourselves with the CAP theorem, let's discuss how that might influence our database taxonomy. We could go with the concept of CP versus AP, but we've already discussed the oversimplification of such an approach. Rather, let's look at how distributed systems maintain both consistency and availability. Consistency Latency Trade-offs In a distributed system, the system is said to be strongly consistent if all nodes see all transactions in the same order in which they were written. In other words, the system is linearizable. The CAP theorem specifically discusses how the distributed datastore favors consistency or availability in the event of a network partition. Consistency is required throughout the life cycle of the datastore, however, and cannot be looked at just through the CAP paradigm.14

• Send writes to all nodes at once, synchronously. • Send writes to one node, the role of primary. Replication occurs asynchronously, semi-synchronously, or synchronously. • Send writes to any node, which functions as primary for that transaction only. Replication occurs asynchronously, semi-synchronously, or synchronously.

14 See Daniel Abadi’s cover feature, “Consistency Tradeoffs in Modern Distributed Database System Design”.

Internal Attributes of a Datastore | 225 When writing to any node in a cluster, there is the opportunity for consistency to be broken without some coordinator process, such as Paxos, that can order the writes effectively. This inherently adds more latency to the transaction. This is one way in which strong consistency is maintained while latency impacts are traded off. Inver‐ sely, if latency is more crucial than ordering, consistency could be sacrificed at this stage. This is the ordering-latency trade-off. When sending writes to one node that must propagate to other nodes, there is the possibility that the primary node accepting writes is unavailable due to it being shut down/crashed or due to it having unacceptable amounts of load that can lead to time- outs. Retries or waits increase latency. However, you can configure load balancers or proxies to send writes to another node after a timeout. Writes going to another node can cause consistency issues, however, as conflicts can occur if the original transac‐ tion processed but did not provide confirmation. The amount of retries or increased time-out windows impact latency, and eventually availability, while maintaining con‐ sistency. This is the primary timeout retry trade-off. When reading from a node, you can also experience time-outs or unavailability. Sending reads to other nodes in asynchronously replicated environments can lead to stale reads and thus consistency issues. Increasing time-outs and retries reduces the risk of inconsistent results but at the cost of increased latency. This is the reader time- out retry trade-off. When writing to all nodes synchronously, whether via ordered processor or replica‐ tion, you also are incurring additional latency due to the overhead of all transactions being shipped to other nodes. If these nodes are on a congested network or are com‐ municating across networks, this latency can be very high. This is the synchronous replication-latency trade-off. A compromise for this trade-off is semi-synchronous replication. This reduces potential latency impacts by reducing the number of nodes and network connections that might affect the replication. Semi-synchronous is a compromise, however, because by increasing latency, you have increased risk of data loss, trading off availability. This is the semi-synchronous availability-latency trade-off. Each of these shows opportunities for tuning a system toward greater consistency or reduced latency. These trade-offs are crucial for the times when the systems are behaving outside of a network partition and are servicing requests. Availability Similarly to consistency and its relationship to latency, we have availability. There is availability in the face of a network partition as in the CAP theorem. But, there is also daily availability in the face of node-level, multinode, or entire cluster failures. When discussing availability in distributed systems, we find it useful to refer to yield and harvest rather than simply availability. Yield refers to the ability to get an answer to your question. Harvest refers to the completeness of the dataset. Rather than simply

226 | Chapter 11: Datastore Field Guide considering whether a system is up or down, you can evaluate which approach is best —reducing yield or reducing harvest in the face of failures. The first question to ask yourself in a distributed system is whether it is acceptable to reduce the harvest to maintain the yield. For example, is it acceptable to deliver 75% of the data in a query if 25% of your node capacity is down? If you are delivering a large amount of search results, this might be acceptable. If so, this allows a greater tolerance for failure, which could mean reducing replication factors in your Cassan‐ dra ring. Similarly, if your harvest must stay close to 100%, you need to distribute more copies of your data. This means not just more replicas but more availability zones within which replicas should exist. You can also see this in the decomposition of applications into their own sub- applications. Whether you see this in functional partitioning or microservices, the result is that one failure can be isolated from the rest of the system. This will often require programming work, as well, but it is an example of reducing harvest to main‐ tain yield. Understanding the storage mechanisms and the way your datastore implements the trade-offs of consistency, availability, and latency give you the “under the covers” understanding of the datastore that complements the conceptual attributes we’ve already covered. The engineers and architects who are responsible for the perfor‐ mance and functionality of the application are most concerned with the conceptual attributes. Operational and database engineers are often focused on making sure that the internal attributes (see Table 11-2) meet the Service Level Objectives (SLOs) that have been set forth for them by the business.

Table 11-2. Datastore internal attribute summary

Attribute                 MySQL                        Cassandra                            MongoDB                   Neo4J
Storage engines           Plugins, B-tree primarily    LSM only                             Plugins, B-tree, or LSM   Native graph storage
Distributed consistency   Focused on consistency       Eventual, secondary to availability  Focused on consistency    Focused on consistency
Distributed availability  Secondary to consistency     Focused on availability              Secondary to consistency  Secondary to consistency
Latency                   Tunable based on durability  Optimized for writes                 Tunable for consistency   Optimized for reads

Wrapping Up Hopefully this field guide has given you a good list of attributes, and the variety therein, for the wild datastore. This should be useful to you whether you are considering a new application, learning an existing one, or evaluating the request by a developer team for the newest cool datastore. Now that we have climbed the ladder from storage to datastore, it's time to move on to data architectures and pipelines.


CHAPTER 12 A Data Architecture Sampler

Now that we’ve gone over storage engines and individual datastores, let’s broaden our view to look at how those datastores can fit within multisystem architectures. Rare is the architecture that only involves one datastore. The reality is that there will be mul‐ tiple ways to save data, multiple consumers of that data, and multiple producers of data. In this chapter, we present you a delightful little sampler of architectural compo‐ nents that are often used to enable our datastores followed by a few data-driven archi‐ tectures that are found in the wild and the problems they attempt to solve. Although this will be no where close to comprehensive, it should give you an excel‐ lent overview of the ecosystem and what to look for. This chapter will help you understand the effective uses for these components as well as the ways in which they can affect your data services, both positively and negatively. Architectural Components Each of these components falls within the purview of the day-to-day duties of the database reliability engineer (DBRE). Gone are the days when we can ignore all of the components around the data ecosystem. Each of these components has a definitive impact on overall data service availability, data integrity, and consistency. There is no way to ignore them when designing services and operational processes. Frontend Datastores The frontend database is the bread and butter of much of what we have been discus‐ sing throughout this book. Users of your applications typically query, insert, and modify data in these datastores through the data access layer. Historically, many applications are designed to function as if these databases were always available. This

229 means that anytime these frontend datastores are down or so busy that they are slow enough to affect customer experience, the applications become unusable. Historically, these systems have been referred to as OnLine Transactional Processing (OLTP) systems. They were characterized by a lot of quick transactions, and thus they were designed for very fast queries, data integrity in high concurrency, and scale based on the number of transactions they can handle concurrently. All data is expected to be real time with all of the necessary details to support the services using them. Each user or transaction is seeking a small subset of the data. This means query patterns tend to focus on finding and accessing a small, specific dataset within a large set. Effective indexing, isolation, and concurrency are critical for this, which is why it tends to be fulfilled by relational systems. A frontend datastore is also characterized by the fact that its data is primarily popula‐ ted by the users themselves. There are also user-facing datastores that are predomi‐ nantly for analytics, often historically referred to as OnLine Analytics Processing (OLAP). These are discussed in the downstream analytics section. We have already discussed the various attributes that most of these datastores employ: storage structure, data model, ACID/BASE paradigms, and trade-offs between availability, consistency, and latency. Additionally, we must consider overall operability and how they integrate with the rest of the ecosystem. Typical attributes required include the following:

• Low-latency writes and queries • High availability • Low Mean Time to Recover (MTTR) • Ability to scale with application traffic • Easy integration with application and operational services

As you might imagine, this is a pretty steep bar for any architecture all by itself. These requirements can rarely be met without help from other components in the infra‐ structure, which we will review. Data Access Layer An application is often broken into presentation and business logic tiers. Within the business logic tier is what is known as the data access layer (DAL). This layer provides a simplified access to the persistent datastores used for the read and write compo‐ nents of the application. This often exhibits as a set of objects with attributes and methods that refer to stored procedures or queries. This abstraction hides datastore complexity from software engineers (SWEs).

An example of a DAL is the use of data access objects (DAOs). DAOs provide interfaces to the database by mapping application calls to database operations. By keeping this persistence logic in its own place, SWEs can test data access discretely. Similarly, you can provide stubs instead of databases, and the application can still be tested. The common thought regarding this approach is that it requires a lot more coding in Java Database Connectivity (JDBC) or other equivalents. Still, by staying closer to the database, it gives you the ability to code effectively when performance must be achieved via specific methods. The other negative often given is that it requires the developer to have a greater understanding of the schema. We happen to think that this is a positive thing and that the more developers understand the schema, the better for everyone involved. Another example of a DAL is the Object-Relational Mapper (ORM). As we've made clear, we don't like ORMs for any number of reasons. There are some benefits, however. The ORM can provide a lot of features, including caching and auditing. It is critical to understand what your SWE teams are using, and what flexibility or constraints are introduced in data access coding and optimization.
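A bare-bones DAO, using SQLite for brevity: the persistence logic lives in one class, the rest of the application deals in plain objects, and tests can substitute a stub that implements the same two methods.

import sqlite3
from dataclasses import dataclass

@dataclass
class User:
    id: int
    email: str

class UserDAO:
    """All persistence logic for users lives here; callers never see SQL."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT)")

    def save(self, user: User):
        self.conn.execute("INSERT OR REPLACE INTO users (id, email) VALUES (?, ?)",
                          (user.id, user.email))

    def find_by_id(self, user_id: int):
        row = self.conn.execute(
            "SELECT id, email FROM users WHERE id = ?", (user_id,)).fetchone()
        return User(*row) if row else None

dao = UserDAO(sqlite3.connect(":memory:"))
dao.save(User(1, "[email protected]"))
print(dao.find_by_id(1))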

Layer 4 and 7

When we discuss layers, we are discussing layers of the Open Systems Interconnection (OSI) model. This model defines the standard for networking.

A Layer 7 (L7) proxy operates at the highest layer of the OSI model, also known as the application layer (in this case, the HTTP layer). At L7, the proxy has access to significantly more data from the TCP packet than an L4 proxy does. L7 proxies can understand the database protocol, perform protocol-aware routing, and be significantly customized. Some of this functionality can include the following (a simplified read/write-splitting sketch follows the list):

• Health checking and redirection to healthy servers
• Splitting of reads and writes to send reads to replicas
• Rewriting queries to optimize those that cannot be tuned in code

• Caching query results and returning them
• Redirecting traffic to replicas that are not lagged
• Generating metrics on queries
• Performing firewall filtering on query types or hosts
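To make the routing behavior concrete, here is a minimal sketch, in Python, of the read/write splitting and lag-aware replica selection just listed. It is illustrative only: the backend pool, the lag threshold, and the crude statement classification are assumptions, and a real proxy would maintain this state via continuous health checks rather than a static dictionary.

    import random

    # Hypothetical view of the backend pool; in a real proxy this state would
    # be refreshed continuously by health checks and replication-lag probes.
    BACKENDS = {
        "primary": {"host": "db-primary", "healthy": True},
        "replicas": [
            {"host": "db-replica-1", "healthy": True, "lag_seconds": 0.4},
            {"host": "db-replica-2", "healthy": True, "lag_seconds": 12.0},
            {"host": "db-replica-3", "healthy": False, "lag_seconds": 0.2},
        ],
    }

    MAX_ACCEPTABLE_LAG = 1.0  # seconds; tune to your staleness tolerance


    def is_read_only(statement: str) -> bool:
        """Very rough L7 classification of a SQL statement."""
        return statement.lstrip().upper().startswith(("SELECT", "SHOW"))


    def route(statement: str) -> str:
        """Return the host that should serve this statement."""
        if is_read_only(statement):
            candidates = [
                r for r in BACKENDS["replicas"]
                if r["healthy"] and r["lag_seconds"] <= MAX_ACCEPTABLE_LAG
            ]
            if candidates:
                return random.choice(candidates)["host"]
        # Writes, and reads with no acceptable replica, go to the primary.
        return BACKENDS["primary"]["host"]


    print(route("SELECT * FROM orders WHERE id = 42"))   # a healthy, low-lag replica
    print(route("UPDATE orders SET status = 'shipped'")) # db-primary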

All of this functionality does come at a cost, of course. The trade-off in this case is latency. So, deciding on an L4 proxy versus an L7 proxy will depend on the needs of your team for functionality as well as latency. A proxy can help mitigate the effects of technical debt by fixing things at a different layer. But, this can also cause technical debt to be ignored for a longer amount of time, and it can make your application less portable to other datastores as things evolve.

Availability

One major function of a proxy server is the ability to redirect traffic during the failure of a node. In the case of a node serving as a replica, a proxy can run a health check and pull the node out of service. In the case of a primary or write failure, if there can be only one writer, the proxy can stop traffic to allow a safe failover to occur. Either way, the use of an effective proxy layer can dramatically reduce the MTTR of a failure. This assumes that you've set up your proxy layer to be fault tolerant. Otherwise, you've simply added a new failure point.

Data Integrity

If a proxy is simply directing traffic, there will be little impact to the data's integrity. There are some opportunities for improving and affecting this, however. In an asynchronous replication environment, an L7 proxy can pull any replicas out of service that are lagging behind in replication. This reduces the chance of stale data being returned to an application. On the other hand, if the proxy is caching data to reduce latency and increase capacity on the database nodes, there can be a chance for stale data to be returned from that cache if it is not effectively invalidated after writes. We will discuss this and other caching issues in the caching section.

Scalability

A good proxy layer can dramatically improve scale. We've already discussed the scaling patterns, which include distributing reads across multiple replicas. Without a proxy, you can perform rudimentary load distribution, but it is not load or lag aware, and thus not as useful. But, using a proxy to distribute reads is a very effective approach for workloads that are heavy on reads. This assumes that the business makes enough money to pay for all of those replicas and that effective automation is put in place to manage them.

Another way a proxy layer can improve scalability is by load shedding. Many database servers suffer under a large number of concurrent connections. A proxy layer can act as a connection queue, holding a large number of connections while only allowing a certain number of those to do work in the database. While this might seem counterintuitive because of the latency introduced by queueing, constraining connections and work can allow greater throughput.
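Below is a minimal sketch of the connection-queue idea, assuming a threaded application and a generic execute callable standing in for your database driver; the slot limit and timeout are illustrative, not recommendations.

    import threading

    # Accept many client requests, but only let a bounded number of them do
    # work against the database at once. Names and limits here are illustrative.
    MAX_ACTIVE_QUERIES = 32
    _slots = threading.BoundedSemaphore(MAX_ACTIVE_QUERIES)


    def run_query(execute, statement, timeout_seconds=5.0):
        """Run a statement only once a work slot is free; shed load on timeout."""
        if not _slots.acquire(timeout=timeout_seconds):
            # Queue is saturated: fail fast rather than pile more work onto an
            # already overloaded database.
            raise RuntimeError("database busy, request shed")
        try:
            return execute(statement)  # e.g., a call into your DB driver
        finally:
            _slots.release()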

Latency

Latency is a crucial consideration when adding another tier to the transaction flow. An L4 proxy adds minimal latency, but an L7 proxy adds significantly more. On the other hand, there are ways in which that cost can be amortized with latency improvements. These improvements include caching regularly executed queries, avoiding overly loaded servers, and rewriting ineffective queries. The trade-offs will vary dramatically across applications, and it will be up to you, the architects and engineers, to make those decisions. As with most trade-offs, we recommend simplicity over rich features unless they are absolutely needed. Simplicity and lower latency can be incredibly valuable to your organization.

Now that we've looked at the data access and proxy layers—the layers that help get an application to the database—let's talk about the applications that function downstream from the database. These are the systems that consume, process, transform, and generally create value from those frontend datastores.

Event and Message Systems

Data does not exist in isolation. Because transactions occur in a primary datastore, there are any number of actions that must occur after a transaction is registered. In other words, these transactions function as events. Some examples of actions that might need to be taken after a transaction include the following:

• Data must be put into downstream analytics and warehouses
• Orders must be fulfilled
• Fraud detection must review a transaction
• Data must be uploaded to caches or Content Delivery Networks (CDNs)
• Personalization options must be recalibrated and published

Those were just a few examples of the possible actions that can be triggered after a transaction. Event and message systems are built to consume data from the datastores and publish those events for downstream processes to act on. Messaging and event software enables communication between applications via asynchronous messages. These systems produce messages based on what they detect in

the datastore. Those messages are then consumed by other applications that are subscribed to them.

A diverse group of applications performs this function. The most popular as of this writing is Apache Kafka, which functions as a distributed log. Kafka allows for significant horizontal scale at the producer, consumer, and topic level. Other systems include RabbitMQ, Amazon Kinesis, and ActiveMQ. At its simplest, this can be an Extract, Transform, and Load (ETL) job or jobs that are constantly or periodically polling for new data in the datastore.
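As a rough illustration of the simplest case, here is a hedged sketch of a polling ETL-style producer. The fetch_rows_after and publish callables, the topic name, and the event shape are all hypothetical stand-ins for your datastore query and your message bus client (Kafka, Kinesis, and so on).

    import json
    import time

    def poll_and_publish(fetch_rows_after, publish, poll_interval=5.0):
        """Periodically look for new rows and publish them as events."""
        last_seen_id = 0
        while True:
            for row in fetch_rows_after(last_seen_id):
                event = {
                    "type": "order_created",  # illustrative event type
                    "id": row["id"],
                    "payload": row,
                }
                publish(topic="orders", key=str(row["id"]),
                        value=json.dumps(event, default=str))
                last_seen_id = max(last_seen_id, row["id"])
            time.sleep(poll_interval)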

Availability

An event system can positively affect the availability of a datastore. Specifically, by pushing the events and the processing of those events out of the datastore, we are eliminating one mode of activity from the datastore. This reduces resource utilization and concurrency, which could otherwise affect the availability of core services. It also means that event processing can occur even during peak activity periods, because it does not risk disturbing production.

Data integrity

One of the biggest risks when moving data across systems is the risk of data corruption and loss. In a distributed message bus with any number of data sources and consumers of that data, data validation is an incredible challenge. For data that cannot be lost, the consumer must write a copy of some sort back into the bus. An audit consumer can then read those messages and compare them to the originals. Just like the data validation pipeline we discussed in the recovery section, this is a lot of work in terms of coding and resources. But, it is absolutely necessary for data that you cannot afford to lose. Of course, this can be sampled for data that can tolerate some loss. If loss is detected, there needs to be a way to notify downstream processes that they must reprocess a specific message. How that occurs will depend on the consumer.

Similarly, it is important to verify that the storage mechanism for events or messages is durable enough to maintain persistence for the life of the message. If there can be data loss, there is a data integrity issue. The inverse of this is duplication. If data can be duplicated, an event may be reprocessed. If you cannot guarantee that the processing is idempotent, and thus can be rerun with the same results if the event is reprocessed, you might be better off using a datastore that can be indexed appropriately to manage duplicates.
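One common way to make reprocessing safe is to make consumers idempotent, for example by deduplicating on an event identifier. The following is a minimal sketch; the in-memory set is a stand-in for a durable, indexed store, and the event shape is assumed.

    # Duplicate deliveries of the same event are detected by event ID and
    # skipped, so reprocessing the stream is safe.
    processed_ids = set()


    def handle(event):
        event_id = event["id"]
        if event_id in processed_ids:
            return  # duplicate delivery; already applied
        apply_side_effects(event)   # your actual processing logic
        processed_ids.add(event_id)


    def apply_side_effects(event):
        print("processing", event["id"])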

Scalability

As just discussed in "Availability" on page 234, by pulling the events and their subsequent processing out of the frontend datastore, we are reducing the overall load on

the database. This is workload partitioning, which we have discussed in the scaling patterns as a step on the path to scale. By decoupling orthogonal workloads, we eliminate multimodal workload interference.

Latency

Pulling event processing out of the frontend datastores is an obvious win: it reduces potential conflicts that can hurt frontend application latency. The time it takes to get the events from the frontend datastore to the event processing system is additional latency for the processing of those events, however. The asynchronous nature of this process means that applications must be built to tolerate a delay in processing.

So, now that we have reviewed how to get to the datastores and the glue that connects data between the frontend datastore and the downstream consumers, let's look at some of those downstream consumers.

Caches and Memory Stores

We have already discussed how incredibly slow disk access is in comparison to memory. This is why we strive to get all datasets of our datastores to fit in memory structures like buffer caches rather than reading them from disk. That being said, for many environments, the budget to keep a dataset in memory just does not exist. For data that is too large for your datastore's cache, caching systems and in-memory datastores are worth considering.

Caching systems and in-memory datastores are fairly similar in terms of the basics. They store data in RAM rather than on disk, providing rapid access for reads. If your data is infrequently changed and you can tolerate the ephemeral nature of in-memory storage, this can prove to be an excellent option. Many in-memory datastores will offer persistence by copying data to disk via background processes that run asynchronously from the transaction. But, that introduces a high risk of data being lost in a crash before it can be saved.

In-memory datastores often have additional features, such as advanced datatypes, replication, and failover. Newer in-memory stores are also optimized for in-memory access, which can prove faster than even a dataset in a relational system that fits fully in the database cache. Database caches still must validate the freshness of their stores and manage concurrency and ACID requirements. Thus, an in-memory datastore might prove the best fit for systems requiring the lowest latency.

There are three approaches to populating caches. The first approach is putting data in cache after it has been written to a persistent datastore like a relational database. The second approach is writing to cache and persistence at the same time in a double-write. This approach is fragile due to the opportunity for one of the two writes to fail. Expensive mechanisms for guarantees, including post-write validation or two-phase

commit, are required to make this work. The final approach is writing to cache first and then letting that persist to disk asynchronously. This is often called a write-behind (or write-back) approach. Let's discuss how each of these approaches can affect your database ecosystem.
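The three population approaches can be sketched roughly as follows. The db, cache, and queue objects and their methods are hypothetical stand-ins for your persistence store, cache client, and background write queue; this is a sketch of the flow, not a complete implementation.

    def write_then_cache(db, cache, key, value):
        """Approach 1: persist first, then populate the cache."""
        db.save(key, value)
        cache.set(key, value)


    def double_write(db, cache, key, value):
        """Approach 2: write both places (conceptually simultaneously).
        Fragile if either write fails."""
        db.save(key, value)
        cache.set(key, value)
        # In practice you would need post-write validation or a transactional
        # mechanism to detect and repair the case where only one write landed.


    def write_behind(db, cache, queue, key, value):
        """Approach 3: write to cache first, persist asynchronously."""
        cache.set(key, value)
        queue.enqueue((key, value))  # a background worker drains this to db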

Availability

Use of a cache can positively affect availability by allowing reads to continue even in the case of a datastore failure. For read-heavy applications, this can be very valuable. On the other hand, if a caching system has been providing improved capacity and/or latency and then fails, the persistence tier behind it might prove to be inadequate for the traffic load sent back to it. Maintaining availability at the caching layer becomes just as critical as availability at the datastore, meaning that you have twice the complexity to manage.

Another major issue is that of the thundering herd. In a thundering herd, all of the cache servers have a very frequently accessed piece of data invalidated due to a write or due to a time-out. When this happens, a large number of servers then simultaneously send read requests to the persistence store so that they can refresh their cache. This can cause a concurrency backup that can overload the persistence datastore. When that happens, it might prove impossible to serve the reads from the cache or the persistence tier.

You can manage thundering herds with multiple approaches. Ensuring that cache time-outs are offset from one another is a simple, if not very scalable, approach. Adding a proxy cache layer that can limit direct access to the datastore is a more manageable approach. At this point, you have a persistence tier, a proxy cache tier, and a cache tier. As you can see, scaling can rapidly become quite complex.
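Offsetting time-outs can be as simple as adding jitter to cache expirations so that hot keys do not all expire at the same instant. Here is a minimal sketch; the TTL, the jitter fraction, and the cache.set signature are assumptions.

    import random

    BASE_TTL = 300          # 5 minutes
    JITTER_FRACTION = 0.2   # spread expirations across +/-20%


    def set_with_jitter(cache, key, value):
        """Store a value with a randomized TTL to desynchronize expirations."""
        jitter = BASE_TTL * JITTER_FRACTION
        ttl = BASE_TTL + random.uniform(-jitter, jitter)
        cache.set(key, value, ttl=int(ttl))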

Data integrity

Data integrity can be quite the sticky situation with caching systems. Because your cached data is generally a somewhat static copy of potentially ever-changing data, you must trade off how frequently you refresh data, and the resulting impact on your persistence store, against the chance that you are showing stale data to anything querying the cache.

When putting data into cache after saving it, you must be prepared for the chance of stale data. This approach is best suited for relatively static data that rarely needs to be invalidated and recached. Examples would include lookup datasets such as geocodes, application metadata, and read-only content, such as news articles or user-generated content.

When putting data into persistence and cache at the same time, you are eliminating the opportunity for stale data. This does not eliminate the need for validation that the data is not stale, however. Directly after a write, and periodically thereafter, you must

continue to run validation checks to ensure that you are providing the consumer with the correct data.

Finally, when writing data to your cache first, followed by a write to your persistence store, you must have a means of reconstructing the write in the event that your cache crashes before it can send the write forward to the persistence tier. Logging all writes and treating them as events that can trigger validation code is one approach to doing this. Of course, at this point, you are in many ways reproducing the complexity that many home-built datastores already provide. Thus, you must carefully consider whether the write-behind approach is worth the complexity required to validate and maintain integrity across datastores.

Scalability

One primary reason for using caches and in-memory databases is that you are scaling the read dimension of your workload. So, yes, by adding a caching tier you can effectively achieve greater degrees of scale at the cost of complexity in your environment. But, as we discussed in the availability section, you are now creating a dependency on this tier to successfully achieve your Service-Level Objectives (SLOs). If your cache servers fail, or become invalidated or corrupted, you cannot assume that your persistence store alone will be able to sustain the read load for your application.

Latency

Outside of scalability, the other reason to use a caching tier is to reduce latency for reads. This is a great use of cache or in-memory technology, but because the cache absorbs most reads, you rarely get to observe how your persistence stores behave without the cache in front of them. It is worth scheduling and performing periodic tests with read traffic bypassing the cache in test environments to see how your persistence tier handles both write and read workloads simultaneously. If failing back to your persistence store is a valid contingency in production, you will want to test these failures in production as well as in test.

Caching and in-memory datastores are tried-and-true components of many successful data-driven architectures. They play well with event-driven middleware and can truly level up your application's scalability and performance. That being said, each is an additional tier to manage in terms of operational expense, risk of failures, and risk to data integrity. This cannot be overlooked, which often happens because many cache systems are just so darned easy to implement. It is your job as the DBRE to ensure that your organization takes the availability and integrity responsibilities of this subsystem seriously.

Each of these components plays a vital role in the availability, scalability, and enhanced functionality of your datastores. But, each one of these increases architectural complexity, operational dependencies, and risk of data loss and data integrity

issues. Those trade-offs are crucial when making architectural decisions. Now that we have looked at some of the individual components, let's look at some of the architectures used to push data from frontend production through the datastore and into any number of downstream services.

Data Architectures

The data architectures in this chapter are an example set of data-driven systems that are designed to accept, process, and deliver data. In each, we will discuss the basic principles, uses, and trade-offs. Our goal in this is to give context to how the datastores and associated systems that we have been discussing throughout this book exist in the real world. It goes without saying that these are simply examples. Real-world applications will always vary tremendously based on each organization's needs.

Lambda and Kappa

Lambda is a real-time big data architecture that has gained a certain level of ubiquity in many organizations. Kappa is a response pattern that seeks to introduce simplicity and take advantage of newer software. Let's go over the original architecture first and then discuss the permutation.

Lambda architecture

The Lambda architecture is designed to handle a significant volume of data that is processed rapidly to serve near-real-time requests, while also supporting long-running computation. Lambda consists of three layers: batch processing, real-time processing, and a serving (query) layer, as illustrated in Figure 12-1.

Figure 12-1. Lambda architecture

If data is written to a frontend datastore, you can use a distributed log such as Kafka to create a distributed and immutable log for the Lambda processing layers. Some data is written directly to log services rather than going through a datastore. The processing layers ingest this data.

Lambda has two processing layers, so it can support fast queries with "good enough" rapid processing while also allowing for more comprehensive and accurate computations. The batch-processing layer is often implemented via MapReduce jobs, which have a latency that simply cannot be tolerated by real-time or near-real-time queries. A typical datastore for the batch layer is a distributed filesystem such as Apache Hadoop. MapReduce then creates batch views from the master dataset.

The real-time processing layer processes the data streams as fast as they come in, without requiring completeness or 100% accuracy. This trades data quality for latency in order to present recent data to the application. This layer is the delta that fills in data that is lagging from the batch layer. After batch processing completes, the data from the real-time layer is replaced by the batch layer. This is usually accomplished with a streaming technology such as Apache Storm or Spark, backed by a low-latency datastore such as Apache Cassandra.

Finally, we have the serving layer. The serving layer is the layer that returns data to the application. It includes the batch views created from the batch layer and indexing to

ensure low-latency queries. This layer is implemented by using HBase, Cassandra, Apache Druid, or another similar datastore.

In addition to the value of low-latency results from the real-time layer, there are other benefits to this architecture, not least of which is that the input data remains unchanged in the master dataset. This allows for reprocessing of the data when code and business rules change.

The most significant drawback to this architecture is that you need to maintain two separate code bases, one for the real-time layer and one for the batch-processing layer. This complexity comes with much higher maintenance costs and a risk of data integrity issues if the two code bases are not always synchronized. Frameworks have emerged that can compile a single code base to both the real-time and batch-processing layers. Additionally, there is the complexity of operating and maintaining both systems.

Another valid criticism of this architecture is that real-time processing has matured significantly since the Lambda architecture was introduced. With newer streaming systems, there is no reason that semantic guarantees cannot be as strong as those of batch processes without sacrificing latency.
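To make the batch/real-time split concrete, here is a hedged sketch of what a serving-layer read might look like for a page-view count: the complete-but-stale batch view plus the recent delta from the real-time view. The batch_view and realtime_view objects and their methods are hypothetical stand-ins for whatever stores back them.

    def page_views(batch_view, realtime_view, page_id, batch_high_watermark):
        """Total views = batch total + real-time delta since the last
        completed batch run."""
        batch_total = batch_view.get(page_id, 0)
        realtime_delta = realtime_view.count_since(page_id, batch_high_watermark)
        return batch_total + realtime_delta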

Kappa architecture

The original concept of the Kappa architecture (Figure 12-2) was first described by Jay Kreps when he was at LinkedIn. In Lambda, you use a relational or NoSQL datastore to persist data. In a Kappa architecture, the datastore is an append-only, immutable log such as Kafka. The real-time processing layer streams the data through a computational system and feeds it into auxiliary stores for serving. The Kappa architecture eliminates the batch-processing system, with the expectation that the streaming system can handle all transformations and computations.

Figure 12-2. Kappa architecture

One of the biggest benefits of Kappa is that it reduces the complexity and operational expense of Lambda by eliminating the batch-processing layer. It also aims to reduce the pain of migrations and reorganizations. When you want to reprocess data, you can start a reprocessing job, test it, and switch over to it.

Lambda and Kappa are examples of patterns that you can use to process and present large amounts of data in real time. Next, we look at some architectural patterns that are data driven and are built as alternatives to the traditional approach of applications that speak directly to their datastores.

Event Sourcing

Event Sourcing is an architectural pattern that completely changes how you retrieve and insert stored data. In this case, the abstraction layer for data storage is taken lower, which creates flexibility in the creation and reconstruction of data views.

In the event-sourcing architectural pattern, changes to entities are saved as a sequence of state changes. When state changes, a new event is appended to the log. In a traditional datastore, changes are destructive, replacing the previous state with the current one. In this model, with all mutations recorded, the application can reconstruct current state by replaying the events from the log. This datastore is called an event store.

This is more than just a new log for mutations. Event sourcing and distributed logs are a new data modeling pattern. Event sourcing complements traditional storage

such as relational or key–value stores by exposing a lower level of data storage that functions as events rather than stateful values that can be overwritten.

The event store functions not only as a distributed log of events and database of record, but also as a message system, as we reviewed earlier in the chapter. Downstream processes and workflows can subscribe to these events. When a service saves an event in the event store, it is delivered to all interested subscribers. You can implement the event store in relational or NoSQL datastores or in a distributed log such as Kafka. There is even an event store called EventStore, an open source project that stores immutable, append-only records. The choice will depend a lot on the rate of change and the length of time for which all events must be stored prior to snapshotting and compaction.

Event sourcing offers a number of benefits. Unlike the world of destructive mutations, it is quite simple to audit the life cycle of an entity. It also makes debugging and testing much easier. With event stores, even if someone accidentally removes a table or a chunk of data, you have the distributed log to recreate tables wholesale or on the fly based on a specific entity that is missing. There are challenges, however, not least of which is managing schema evolution of the entities, because a change can invalidate previous events that have been stored. External dependencies can also be challenging to recreate when replaying an event stream.

With the benefits of event sourcing, many shops might find use for implementing it even if many of their applications still use the more traditional datastores instead of the event store. Over time, giving full historical access via API for auditing, reconstruction, and different transformations can provide significant benefits.

CQRS

There is a natural evolution from using an event store as an ancillary storage abstraction to using an event store as the core data storage layer. This is command query responsibility segregation (CQRS). The driver for CQRS is the idea that the same data can be represented for consumption using multiple models or views. This need for different models comes from the idea that different domains have different goals, and those goals require different contexts, languages, and ultimately views of the data.

You can accomplish this by using event sourcing. With a distributed log of event state changes, worker processes that subscribe to those events can build effective views to be consumed. CQRS can also enable some other very useful behaviors. Instead of just building new views, you can build views of that data in different datastores that are optimized for the query patterns that consume them. For instance, if the data is text that you want to search, putting it into a search store like ElasticSearch for one view can create an optimized view for the search application. You are also creating independent scaling patterns for each aggregate. By using read-optimized data stores for

queries and an append-only log that is optimized for writes, you are effectively using CQRS to distribute and optimize your workloads.

There is a great potential for unnecessary complexity in this architecture, however. It is entirely possible to segregate data that needs only one view or to over-segregate into more views than are absolutely necessary. Focusing only on the data that actually requires the multimodel approach is important to keep complexity down.

Ensuring that writes or commands return enough data to effectively find the new version numbers of their models can go a long way in helping reduce complexity within the application as well. Having commands return success/failure status, errors, and a version number that can be used to fetch the resulting model version will help with this. You can even return data from the affected model as part of a command, which might not be exactly according to the theory of CQRS but can make everyone's life much easier.

You do not need to couple CQRS with event sourcing. Event sourcing as the core data storage mechanism is quite complicated. It is important to make sure that only data that naturally functions as a series of state changes is represented in that manner. You can perform CQRS using views, database flags, external logic, or any number of other approaches that are less complex than event sourcing itself.

This was just a sampling of data-driven architectures that you might find yourself working with or designing. The key in each is recognizing the life cycle of the data, and finding effective storage and transport to get that data to each component of the system. Data can be presented in any number of ways, and most of today's organizations will eventually need to be able to accommodate multiple presentation models while maintaining the integrity of the core dataset.

Wrapping Up

With this sampler, we hope you see some of the opportunities to create even more functionality, availability, scale, and performance within your datastores. There are plenty of opportunities to lock yourself into complex architectures that cost a tremendous amount to maintain. The worst case, of course, is the loss of data integrity, which should be a real concern throughout your career.

With this chapter, we are coming to the conclusion of our book. In the final chapter, we will bring this all back together with some guidance on how you can continue to develop your own career and the culture of database reliability within your organizations.


CHAPTER 13 Making the Case For DBRE

Throughout this book, we have attempted to show how the landscape of database engineering has shifted over the years. Within the context of that landscape, we have enumerated the operational and developmental disciplines that the database reliability engineer (DBRE) must be involved with and how to begin to do so. Finally, we attempted to lay out the current state of storage, replication, datastores, and architectures, or at least a reasonable subset, to broaden your minds and knowledge.

The reason we feel there is a real need for an emphasis on reliability, in not just the job title of the DBRE but in everything they do, is that the database is a place where risk and chaos simply have no place. A lot of what is now commonplace in our day-to-day work—virtualization, infrastructure as code, containers, serverless computing, and distributed systems—came about from taking risks in areas of computing where risk could be tolerated. Now that they are ubiquitous, it is up to the stewards of one of the organization's most precious resources, the data, to find paths to bring databases into these paradigms.

A lot of this work is still aspirational. There is only so much risk that can be tolerated within any organization when data comes into play. Thus, how we introduce these concepts to the rest of the organization, or how we respond to others doing so, becomes an actual discipline and job function for us. It is not enough to have the vision and the intent; we must simultaneously find ways to introduce this vision in such a way that it will be successful.

In this chapter, we show you how to shepherd a culture of database reliability into your organizations, now and into the future. You will get some ideas on ways to become involved in a wide range of functions within the organization while representing and speaking up for database reliability engineering.

A Culture of Database Reliability

What does a culture of database reliability look like, and how can you promote it? There are many practices that people think of when they think of reliability culture that are not specific to the database world:

• Blameless post-mortems
• Automating away repetitive work
• Structured and rational decision making

This all makes sense, and everyone within an operations or site reliability engineering (SRE) organization should constantly be working toward this. But what should we, as DBREs or people who want to become DBREs, foster in our own environments? Here, we go over some approaches to become involved, to inject reliability culture, and to bring our expertise as DBREs to the rest of the organization.

Breaking Down Barriers

The DBA who stays isolated from the other teams that interact with the datastore will simply not be successful. To be effective, we absolutely need to be active team members and partners at a much higher layer of abstraction than the one in which we have traditionally functioned. There is an inherent challenge in this because the database role is generally not a heavily staffed one. Throughout the book, we have stressed that the database role simply cannot scale to the number of developers and operations engineers who will be working with the datastores.

There are areas in which the DBRE can prove to be a highly effective member of cross-functional teams and work. Any time the DBRE becomes involved in cross-functional work, there should be a goal of contributing subject matter expertise, eliminating constraints on DBRE resources, or learning more about other functions to improve our own ability to function within the organization.

The architectural process

It goes without saying that those with deep database expertise should be involved more at all layers of the architectural process, particularly the design phase. The DBRE can provide incredibly valuable data on choosing the right datastore, preferably one that has already been significantly tested and proven in production. As we discussed in Chapters 7 and 11, the DBRE's job is to truly vet the datastores that might be brought into service in their organization.

In larger organizations that require self-service in order to build and deploy services, the DBRE has a particular power in determining which storage services are put into the self-service catalog. By working with the rest of the technical organization, the

DBRE can help support the mandate of using these approved services that have been tested thoroughly for edge cases, scale, reliability, and data integrity. Sometimes, this can also come with the caveat that all organizations can build and deploy services outside of the catalog, or at different tiers in the catalog, but that they must accept different Service-Level Agreements (SLAs) in order to do so. For example:

Tier 1 Storage
Tested in core services in production. Use of self-service patterns means that Ops and DBRE can provide 15-minute Sev1 response SLAs for escalation, and will guarantee the highest SLAs on availability, latency, throughput, and durability.

Tier 2 Storage
Tested in production on non-critical services. Use of self-service patterns means that Ops and DBRE can provide 30-minute Sev1 response SLAs for escalation, and will guarantee reduced Service-Level Objectives (SLOs) on availability, latency, throughput, and durability.

Tier 3 Storage
Not tested in production. Escalation support from Ops and DBRE is on a best-effort basis, and no SLOs can be guaranteed. Must be fully supported by software engineering (SWE) teams.

If your organization doesn't support such self-service platforms, the DBRE must work harder on ensuring that every team is aware of the methods they use in evaluating datastores and architectures and the value that is provided by doing so. Although premature optimization is always a danger, one of the most important values provided by DBREs is helping to guarantee that architectural datastore decisions do not hamstring a service in the future as it hits scaling inflection points.

In many organizations, this begins as a mandatory checklist on a technical project that requires DB review before the project goes into production. However, it is easy for this review to happen too late in the life of a project, beyond the point where you can change anything. This is where the art of politics and ambassadorship comes into play. Finding time to evaluate datastores and publish best practices, trade-offs, and patterns for the most common or upcoming datastores in the organization is an important way of showing people the value of DBREs. It also is a great chance to work with SWEs and architects to build this data and publish it to build more mindshare. With this coming out regularly, when a new datastore that you have not evaluated comes into an architectural conversation, people will be more encouraged to sponsor an evaluation as part of their project and resources, with you as a matrixed mentor and advisor. The key is getting people to see value, and not constraint, in using the DBRE.

There are metrics that you can use to measure how well DBREs and the organization are doing in this regard. Here are a few examples:

• How many architectural projects used DBREs or their approved templates? (post-mortem)
• What storage was used and deployed? (post-mortem)
• How many hours of DBRE work occurred during each phase or user story? (post-mortem)
• Availability, throughput, and latency metrics grouped by storage tiers and engines to provide proof of reliability

Database development

Much like architecture, DBREs being involved in database development early in the development cycle can be a force multiplier in the success of a project. We discussed a lot of this opportunity and value in Chapter 8. One of the biggest barriers to this is SWEs forgetting to discuss their designs with DBREs. Other times, SWEs might feel they do not need such guidance. One of the biggest wins in this struggle is to assist the SWE teams in seeing the value in the work that they do with DBREs.

Embedding DBREs, whether full-time or just at certain times in a project's life cycle, and pairing them with software engineers is a great opportunity to help those SWEs see the value of DBRE collaboration. Even if the DBRE is not particularly strong with coding, her input on data modeling, database access, and use of features can prove invaluable, while also building relationships between organizations. Pairing SWEs with DBREs who are performing reviews, implementations, or on-call work to support data-driven applications can similarly foster relationships, empathy, and cross-team knowledge that will accelerate development.

We also discussed the importance of providing best practices and patterns for the functions that each SWE performs in integration with the data layer. For instance, you can implement checklists for models, queries, or feature usage that require the SWE to indicate whether he used patterns or not. These checklists can raise flags on stories or features that might need review prior to pushing into production.

There are metrics that you can use to measure how well DBREs and the organization are doing in this regard. Here are some examples:

• Development pairing hours between SWEs and DBREs
• User stories that used DBRE involvement
• Feature metrics, such as latency and durability, mapped to use of DBRE or not
• On-call shifts with an SRE paired

Production migrations

Everyone wants more error-free migrations. There is no doubt about that. But often the rate of deployment rapidly outstrips the time DBREs have to support those deployments. Backlogs end up being bundled into large, fragile changesets that can introduce dramatic risk, or migrations are done without significant DBRE review. As we discussed in Chapter 8, an effective way to manage this is to build process and tools that enable SWEs to make better choices about what can be implemented through normal deployment mechanisms, what DBREs should implement, and what should be reviewed by DBREs when ambiguity is present.

The easiest first step is to incrementally create a library of heuristics that can indicate whether changes are safe or dangerous. Even though the bottleneck is not removed immediately, creating a mandate to build this library together with DBREs will begin to build traction over time. Having a review board periodically go through the most recent changes, the resulting heuristics and guidance, and the success or failure of those changes can prove to be an effective watchdog on this process. You can do this as part of regular post-mortems of changes, both successful and unsuccessful ones.

Another way to continue to incrementally enable the SWE organization to be as autonomous as possible is to build a database of migration patterns that can be applied heuristically to upcoming changes. By doing this with SWEs over time as you pair together on changes, you can build a living document that the SWEs not only use, but also feel enabled to build upon themselves over time. Again, doing post-mortems and reviews to validate the success of these patterns is crucial.

You can build further upon this by providing guardrails for implementations that heuristics and migration patterns indicate can be performed by SWEs. These guardrails give confidence to everyone—SWEs, operations staff, DBREs, and leadership. The more you show success and build trust, the further this can go. By enabling the autonomy of SWEs, you can find that your relationships with those teams become much healthier as you prove to be more of an asset than a hindrance. Continuing to impart the depth of your expertise in database storage and access will continue to optimize their velocity and the relationship. You can do this individually via pairing or via more educational approaches, such as workshops, knowledge shares, and documents.

No amount of enablement will change the fact that there are some migrations that the DBRE must take on. Even at this phase, there are approaches you can take to educate and enable others. Again, pairing with the engineers during migration planning and execution can be an excellent approach. Pairing with operations engineers as well can be highly valuable, as the more people who can assist and eventually own complex production implementations, the better.

Even without significant automation, there are plenty of ways for you and the DBRE team to continue to drive more reliable, error-free change in a way that does not cause the development pipeline to stagnate. The addition of technology, tools, and code can take this even further after trust and the repeatability of manual processes have been refined.

There are metrics that you can use to measure how well DBREs and the organization are doing in this regard. Here are a few examples:

• Migration pairing hours between SWEs/Ops staff and DBREs
• Count of migrations requiring DBREs versus all migrations
• Failure or success of migrations and impact

Infrastructure design and deployment

In the section on architecture, we discussed working with engineering to choose tested and trustworthy datastores. Similarly, you must work constantly with operations and infrastructure staff to ensure that they have everything necessary not only to host those datastores, but also to deploy and maintain them. In Chapter 5 we discussed the various parts of this function in detail, and in Chapter 6, we discussed the software and tools needed to manage those infrastructures at scale. But, we are still in the early stages of doing this for datastores, particularly distributed ones.

As with production implementations and giving software engineers more autonomy, so much of introducing this into the organization is about building trust through incremental steps. The first steps that can provide significant value to the DBRE team and the organization are using the same code repositories and versioning systems to manage your scripts, configuration files, and documentation. Then, you can work with the operations team to begin configuring and deploying empty datastores via configuration management and orchestration. This will still require you to finalize those datastores with the actual data, but it is an incremental step forward.

Throughout this, by pairing with operations, you can do testing for proper configurations, security testing, load testing, and even more advanced tests for data integrity and replication. The more you familiarize the entire team with how your databases work and how they break, the better. Availability and failure testing is also a critical area in which to bring in other teams to work with you.

Finally, you can begin to give primary on-call shifts to operations staff and even senior developers managing their own infrastructures. With you and your team mirroring and pairing with them, they can rapidly gain confidence in working with these infrastructures while minimizing risk. It is only when the team truly feels confident, knowing the ins and outs of maintaining all of this, that you can begin to automate the

riskier components, such as data loading, replication reworking, and primary node failovers.

There are metrics that you can use to measure how well DBREs and the organization are doing in this regard:

• Count of infrastructure components that are managed via configuration management
• Count of infrastructure components that are integrated into orchestration platforms
• Counts of successful and failed provisioning runs
• Metrics on resource consumption—all subsystems used by the datastores
• On-call shifts managed by non-DBREs
• Incidents managed by non-DBREs and the Mean Time to Restore (MTTR)
• Escalation counts to DBREs

So much of the success of this work is in relationships, empathy, trust, and shared knowledge. We know that many DBAs are used to functioning in isolation, but with these steps, you and your team can bring database work into the sunshine. No longer should it be a murky, scary function that only the bravest or most foolish engineers are willing to tackle. The key to this is repetitive exposure, constant incremental trust building, and pairing with others.

Data-Driven Decision Making

Trust cannot be built without excellent data on the impacts of changes. The Deming Cycle of plan, do, check, and act requires the observability we reviewed in Chapter 4. Remember to define clear, appropriate metrics for determining success, gather baselines with every change, and then take the time to analyze the results with a skeptical eye. Using your knowledge of the organization's SLOs, as discussed in Chapters 2 and 3, is crucial to understanding the changes you must make, as well as the metrics and results you need in order to prove to the rest of the organization the potential value of those changes and the value of the effort required to get there.

Hopefully you find yourself in an organization that has already seen the value of data-driven decision making and thus has already implemented the platform for observing, the processes for analyzing, and the discipline to consistently execute. Similarly, we hope that you are in an organization that has already defined clear, useful SLOs to drive your decision making. But, if not, you will need to begin with these practices to be able to drive deeper and more potentially far-reaching changes.

Data Integrity and Recoverability

We discussed the criticality of data integrity and the ability to recover from loss or corruption in Chapter 7. Too often, organizations view this as the responsibility of the DBRE, but we know that this is an impossible task for the DBRE organization alone. Being the champion of a data integrity program will often fall upon the DBRE organization. Convincing the SWE organization of the importance of allocating resources for data validation pipelines and recovery APIs is a constant responsibility. Serendipitously, if you are breaking down the barriers between architecture and software development and addressing the lack of DBRE involvement in early phases of work, you will find yourself with the relationships and the trust to incrementally build the shared code and the knowledge required to implement an effective data integrity pipeline.

This is not an easy sell. Our experience with this is that most SWEs feel data integrity is the domain of the DBRE only. And constrained organizations will balk at the development of validation pipelines and recovery APIs. Thus, you will find yourself having to make the case while implementing "poor man's" solutions that can gather the data around data-integrity issues. Similarly, tracking the effort spent on manual recovery of data can go a long way toward convincing leadership to commit resources for recovery APIs and validation pipelines.

As you can see, the successful evolution of database reliability requires incremental and comprehensive organizational shifts. Choosing the areas that consume the most of your time, and that create the largest constraints on other organizations, is a skill you would be smart to practice. Then, building incremental points of change to build trust and create improvements will build momentum. But this all takes time, trust, and a lot of experimentation regarding what works for your organization's risk levels, and what doesn't.

Wrapping Up

We'd like to thank you for taking the time to read through this book. Both of us are so passionate about evolving one of the most burdensome and byzantine of technical careers. Although a good portion of this book is aspirational, or still being proven in the wild, we believe that the DBRE movement is one that can drive so much value to data-driven services and organizations. Our hope is that you are inspired to explore these shifts in your organization and that you are eager to learn more. We have tried to give further reading and exploration options throughout, as this framework is flexible. But, most important, we hope we've helped you see that there is opportunity to bring the time-honored role of DBA into the modern world and into the future. The role of DBA isn't going away, and whether you are new to this career or a tried-and-scarred veteran, we want you to have a long career ahead of you as you drive value to every organization you are a part of.


258 | Index communications outside the network, F 172 failovers, 7, 44 communications within the network, automated failover for MySQL, 44 172 database failovers, using service catalogs, dynamically built database users, 174 110 securely stored secrets, 173 single leader, 197 financial data, 169 failures military or government data, 170 emphasizing resiliency over elimination of personal health data, 169 failure, 43 private individual data, 169 key questions about in evaluating a SLO for environments, different, building, 117 availability, 21 ephemeral environmennts, distributed, trend‐ treated as normal scenario in resilient sys‐ ing to the norm, 52 tems, 21 ephemeral key exchanges, 171 filesystems ephemeral storage, 94, 183 checksumming to detect bad data, 120 production datastores on, 3 encryption of data in, 177-179 ephemeral user accounts, 174 data encryption above the file system, error rates, 65 177 errors, 68 device-level encryption, 178 (see also utilization, saturation, and errors) filesystem encryption, 178 database and client logs, information on operations causing corruption and inconsis‐ asserts and errors, 79 tency, 92 error correction code (ECC), 120 financial data, encryption of, 169 monitoring for datastore connection layer, five 9’s, 15 73 fragmentation, memory allocation and, 83 esteem (for databases), 9 framing, 41 Etcd, 109 frontend datastores, 229 event routers/processors, 57 frying, 103 event stores, 241 fsync function call (OS), 91 events, 60 hypervisors and, 94 event and message systems, 233-235 full backups, 115, 129 availability and, 234 full logical backups, 127 data integrity and, 234 full physical backups, 126, 129 latency and, 235 functional partitioning, 8, 196 scalability and, 235 functions (mathematical) applied to metrics, 59 event sourcing architectural pattern, 241 CQRS and, 242 in application monitoring, 68 G Galois/Counter mode (GCM), 171 logging for database hosts, 70 garbage collection, 20 monitoring database asserts and events, 79 downtime from, 25 eventual consistency, 76, 126, 207 gauges, metrics from, 59 (see also consistency) globc's malloc, 83 maintaining using anti-entropy, 208 government data, encryption of, 170 exploitability, 161 Graphite, 57 exposure of data, protecting against, 155 graphs information disclosure, 161 graph data model, 215 extract, transform, and load (ETL), 116, 194 output from operational visualization plat‐ processes for downstream datastores, 118 forms, 61 query-based ETL, 195 Gregg, Brendan, 68

Index | 259 group factors affecting risk assessment, 34 I/O scheduler, 82 guardrails, building, 8 memory allocation and fragmentation, Etsy's guardrails, 9 83 guest machine, 93 networks, 86 non-uniform memory access, 85 H storage, 87 Hadoop, 39, 63 storage availability, 90 THP defragmentation and, 84 storage capacity, 88 hard disk drives (HDDs), 184 storage latency, 90 failure rates, Google study on, 90 storage throughput, 89 I/O latency and, 90 swapping, 84 IOPs, 89 user resource limits, 82 hardware physical servers, 81 errors from, 119 benefits of, 92 early detection of, 124 cons of, 92 failures of, 120 shared OS/host model in containers, 95 early detection, 124 storage area networks (SANs), 92 harvest, 226 human error and root cause of incidents, 50 hash maps, 188 human factors affecting risk assessment, 33 hashed MAC (HMAC), 171, 176 hypervisors, 93 health checks excessive, 64 I in off-premises monitoring, 64 I/O operations, storage latency and, 90 Health Insurance Portability and Accountabil‐ I/O schedulers (for database hosts), 82 ity Act of 1996 (HIPAA), 169 idempotent actions, 102 hello bug (CAN-2002-1123), 168 identification of key failure points in a service, hierarchy of needs, 7-11 43 esteem, 9 identity spoofing, 160 love and belonging, 8 images, 104 self-actualization, 10 creating multiple images from same config‐ survival and safety, 7 uration, using Packer, 111 high and low bounds, means, and cardinality of infrastructure images, detecting problems data, 78 with, 124 high cardinality, 188 infrastructure images, testing, 109 high resolution for key metrics, 54 immutable infrastructures, 104 hinted hand-offs, 208 impact analysis histograms, 59 in deployment, 147 hit ratios for cached/uncached data in data‐ data integrity issues, 147 bases, 76 locking of objects, 147 HMAC (hashed MAC), 171, 176 replication stalls, 147 Honeycomb, 57 saturation of resources, 147 horizontal scaling, 8 in post-commit testing, 143 horizontally separated infrastructure defini‐ implementation of risk controls, 44 tions, 108 in-memory datastores, 235-238 host machine, 93 incident management, 46 hosts, 81-92 incremental backups, 115, 124, 129 access to, 163 incremental logical backups, 127 operating a system and kernel, 82 incremental physical backups, 126, 129 durability, 91 indexing, 78, 188

260 | Index bitmap indexes, 188 prerequisites for establishing CI at database hash indexes, 188 level, 139 permutations of B-trees, 188 CI server and test framework, 141 using B-tree structures, 183-185 database build automation, 140 indicators, 15 database migrations and packaging, 140 (see also service-level indicators) test data, 140 InfluxDB, 57 version control system, 139 information disclosure, 161 integration testing, 129 information secuity (IS) teams, 155 internode communications, 86 infrastructure as code, 100 introspection, 9 infrastructure design and deployment, 250 inventory, systems and environment, 39 infrastructure engineering, 81-97 IOPS (input and output operations per second), containers, 95 89 Database as a Service (DBaaS), 95-97 high IOPS with SSDs, 90 hosts, 81-92 isolation (transactions), 217-220 operating a system and kernel, 82 physical servers, 81 J security, 156 Java Database Connectivity (JDBC), 231 virtualization, 93-95 Java Virtual Machines (JVMs), 20 concurrency, 94 use in databases for managing memory, 77 hypervisors, 93 JBOD, 88, 91 storage, 94 jemalloc, 83 use cases, 94 Jepsen testing framework, 141 infrastructure management, 99-112 journaling fileysystems, 92 acceptance testing and compliance, 109 journaling, monitoring for databases, 75 building from configuration, 103 configuration definition, 101 development environments, 111 K infrastructure definition and orchestration, Kafka, 234 105-109 kappa architecture, 240 monolithic infrastructure definitions, kernel, operating for database hosts, 82-86 106 key exchange algorithms, 171 separated tiers (horizontal definitions), key management infrastructure secure services, 108 173 separating vertically, 107 key-value model, 214 maintaining configuration, 104-105 keys for filesystem encryption, 178 service catalog, 109 killing long-running queries, 165 using the concepts for MySQL, 110 version control, 100 L infrastructure services lambda architecture, 238 damages from, 119 last write wins algorithm, 205 detection and recovery from, 124 latency, 15 InnoDB storage engine, 75 critical nature of, 18 (see also MySQL) impacts of caches and in-memory data‐ using mutexes/semaphores to monitor, 77 stores, 237 input validation, 167 impacts of database proxies on, 233 instrumentation, 9 impacts of event and message systems, 235 integration, 138-141 increases due to swapping, 84 catching security vulnerabilities, 157 indicators, 17-20

Index | 261 latency distributions, 18 logical backups, 114 low latency with single-leader replication, full and incremental, 127 194 logical replication, 193 minimizing, using single-leader replication logs, 60, 189 for locality, 195 application, 68 monitoring, 28, 65 capturing errors for datastore connection monitoring for databases, 74 layer, 73 replication, 75 for database hosts, 70 replication lag and, 199 replication log formats in single-leader rep‐ SLO, sharers in, 24 lication, 191 storage, 90 statement-based logs, 191 throughput and, 16, 23 write-ahead logs, 192 trade-offs with consistency, 225-226 using to mitigate DoS attacks, 166 versus response time, 16 love and belonging needs, 8 laws, bodies, and standards regulating financial data in the U.S., 169 M layers (OSI model), 231 MAC (see message authentication code; leaders (in replication), 189 HMAC) leap seconds, 205 major impact risks (imminent SLO violation), LevelDB, 185 41 Linux systems malloc libraries, 83 Distributed Replicated Block Device manual database changes, 4 (DRBD), 193 Maslow, Abraham, 7 filesystem encryption, 178 master/replica setup for databases, 52, 74 I/O scheduler, 82 mathematical functions applied to metrics, 59 nodes, 85 mean time between failures (MBTF), 21 page cache, storage latency and, 90 mean time to recover (MTTR), 21 resources to monitor, 69 emphasizing over MTBF, 43 storage stack, 87 reducing with immutable infrastructure, swapping, 84 104 Transparent Huge Pages memory manage‐ replication and, 194 ment system, 83 memory understanding, importance of, 69 caches and memory stores, 235-238 load shedding, 233 memory allocation and fragmentation, 83 local or single-node scope, recovery in, 121 non-uniform memory access, 85 local storage in virtualized environments, 94 swapping and, 84 locality memory contexts (PostgreSQL), 83 use of single-leader replication for, 195 memory structures (in datastores), 76 using multi-leader replication for, 203 memtables, 75, 185, 186 locking (database), 22 Merkle trees, 208 2-phase locking (2PL), 219 message authentication code (MAC), 171 locking of objects in database migrations, HMAC, 176 147 message systems, 233-235 locking operations during migrations, 148 metadata monitoring locking and concurrency, 77 for data blocks, 182 log-structured merge (LSM) trees with testing in database integration, 140 SSTables, 185-187 metrics, 59 datastores that utilize as storage engines, central collector for, 57 187 collecting too many, 55

262 | Index DBRE and the architectural process, 247 of single-leader replication, 198 focusing, 56 operational processes, 202 from application monitoring, 66 replication availability and capacity, 200 monitoring between database, system, stor‐ replication consistency, 201 age, and application layers, 62 replication lag and latency, 199 throughput and latency, for datastores, 74 of throughput, 28 MHA-managed MySQL cluster, 110 testing, using Chaos Monkey, 117 migrations, 249 to identify DoS attacks, 166 and versioning, 146 top goal for service-level management, 25 impact analysis on production service, 147 monitoring services, third-party, 63 patterns in, 148-151 monolithic infrastructure definitions, 106 allowing for automatic deployment, 151 multi-leader replication, 203-209 high resource utilization operations, 149 conflict resolution in, 204 locking operations, 148 custom options for, 205 migration testing, 150 last write wins, 205 rollback testing, 150 use cases, 203 rolling migrations, 150 availability, 203 military data, encryption of, 170 disaster recovery, 203 minimum viable monitoring set, 62 locality, 203 minor impact risks, 41 write-anywhere replication, 206 mission-critical data, replication of, 63 multiple object scope for recovery, 121 moderate impact risks, 41 multiversion concurrency control (MVCC), MongoDB, 63 219 connections, use of resources on operating mutexes, 77 system, 72 MySQL, 63 master and replica roles, 74 connections, use of resources on operating memory allocation libraries, 83 system, 72 non-uniform memory access (NUMA) and, frying up a Galera Cluster, 103 86 InnoDB storage engine security exploits against databases listening block overhead, 183 on Public IPs, 163 memory allocation library, 83 monitoring master and replica roles, 74 alerts on reaching threshold of downtime, NUMA and, resolving at Twitter, 85 25 statement-based replication, 192 bootstrapping, 61-66 THP defragmentation and, 84 availability monitors, 64 using infrastructure concepts for, 110 common progression, 62 using mutexes/semaphores to monitor customer experience, anything that dis‐ InnoDB storage engine, 77 rupts, 65 data safety, 63 N characteristics of traditional monitoring Nagios, 57 systems, 51 navigational model, 215 Database as a Service (DBaaS) and, 96 network interface cards (NICs), 87 for node or component failure requiring Network Time Protocol (NTP), 101, 199 recovery to new nodes, 128 Database as a Service (DBaaS) and, 96 for SQL injection attacks, 167 timestamps and, 205 of availability, 25 networks of cost and efficiency, 28 access to, 163 of latency, 28 communications outside, encryption of, 172

Index | 263 communications within, encryption of, 172 datastore connections opening resources database performance and availability and, on, 72 86 errors from, 119 security vulnerabilities in network and early detection of, 124 authentication protocols, 168 instrumentation of, 159 New Relic, 57 shared OS/host model in containers, 95 nodes understanding, importance of, 69 in Linux systems, 85 operational processes, monitoring for single- new production nodes, building, 117 leader replication, 202 replacing failed nodes and introducing new operational tests, 145 nodes, 128 in planned recovery scenario, 118 non-uniform memory access (NUMA), 85 operational visibility, 49-80 solving NUMA and MySQL at Twitter, 85 bootstrapping your monitoring process, nondeterministic statements, 192 61-66 nonrepeatable reads, 218 data in, 57-60 noop scheduler (Linux), 82 events, 60 normalization, 212 logs, 60 NoSQL datastores, 207 telemetry/metrics, 59 notifications data out, 60 from operational visibility platforms, 61 Database as a Service (DBaaS) and, 96 testing, using Chaos Monkey, 117 database asserts and events, 79 NTP (see Network Time Protocol) database objects, 78 NUMA (see non-uniforrm memory access) database queries, 79 framework for, 56 O in database security, 158-160 object storage, 125, 131 application layer instrumentation, 158 object stores, 127 database layer instrumentation, 159 object-relational mappers (ORMs), 213, 231 OS instrumentation, 159 SQL dynamically generated for, 66 instrumenting the application, 66-68 object-relational mapping (ORM), 213 instrumenting the datastore, 71 objects, locking, 147 instrumenting the server or instance, 68-71 observability, 9 internal database visibility, 74-78 off-premises monitoring, 64 commits, redos, and journaling, 75 offline storage, 125, 130 locking and concurrency, 77 OnLine Analytics Processing (OLAP), 230 memory structures, 76 OnLine Transactional Processing (OLTP) sys‐ replication state, 75 tems, 230 throughput and latency metrics, 74 online versus offline backups, 114 new rules, 51-56 online, fast storage with full and incremental distributed ephemeral environments backups (recovery strategy), 128 trending to the norm, 52 online, high performance storage, 124 keep your architecture simple, 55 online, low-performance storage, 125 store at high resolutions for key metrics, online, slow storage with full and incremental 54 backups (recovery strategy), 129 treating OpViz systes like BI systems, 52 Open Systems Interconnection (OSI) model, planned recovery operations, 116 231 operations OpenSSL, 175 and software development, eliminating bar‐ operating systems riers between, 5 core overview, 6

264 | Index optimistic replication, 206 use of single-leader replication for, 195 OpViz (see operational visibility) with large datasets, 196 Oracle POSIX days, 205 storage of data in data blocks, 182 post-build testing, 143 THP defragmentation and, 84 post-commit testing, 143 ordering-latency trade-off, 226 PostgreSQL outputs from operational visibility platform, 57, connections, use of resources on operating 60 system, 72 master and replica roles, 74 P memory allocation library, 83 Packer, 104, 111 non-uniform memory access (NUMA) and, page table entries, 83 86 page tables, 83 troubleshooting connection speeds for, 73 pager fatigue, 34 pre-build testing, 143 pages (memory allocation), 83, 182 predictive analytics Transparent Huge Pages (THPs) in Linux, for availability data, 26 83 for latency data, 28 parameterized statements, 167 preferred node, 85 partition tolerance (CAP theorem), 224 prepared statements, 167 patching primary timeout retry trade-off, 226 of database binaries to reduce exploitable prioritization of risks, 40 bugs, 167 Privacy Act of 1974, 170 security patches, keeping up to date, 162 private individual data, encryption of, 169 PCIe bus flash solutions, 89 privilege escalation, 161, 166, 168 perfect forward secrecy (PFS), 171 product development, workload shifts due to, persistent block storage in virtualized environ‐ 26 ments, 94 production migrations, 249 personally identifiable information (PII) production nodes and clusters, building new, audit data on, 158 117 encryption of, 169 protecting the data, 2 Pester for Windows, 141 new approach to, 3 pets versus cattle metaphor, 5 proxies (see database proxies) phantom reads, 218 Puppet, 101 physical backups, 114 full, 126 Q incremental physical backups, 126 quality of service quotas, making DoS attacks physical servers more difficult, 165 as database hosts, 81 queries benefits of, 92 monitoring database queries, 79 pipeline processes for downstream datastores, perfomance considerations querying 118 encrypted data, 176 planned recovery scenarios, 116 query killers and heavy-handed approaches to building different environments, 117 reducing DoS attacks, 165 ETL and pipeline processes for downstream queueing theory, 57 datastores, 118 quorums, 207 new production nodes and clusters, 117 sloppy, 208 operational tests, 118 plug-in encryption, 175 R portability RAID 0, 88

Index | 265 RAID levels, 91 release management, 133-152 random IOPs, 89 deployment, 146-151 read and write quorums, 207 impact analysis, 147 sloppy quorums, 208 manual or automated, 151 read committed isolation level, 218 migration patterns, 148-151 read failures (datastores), 43 migrations and versioning, 146 read repair, 208 education and collaboration, 133-138 read uncommitted isolation level, 218 integration, 138-141 reader timeout retry trade-off, 226 testing, 141-146 reads downstream tests, 145 dirty, 218 full dataset testing, 144 nonrepeatable, 218 operational tests, 145 phantom, 218 post-commit testing, 143 real user monitoring (RUM), 26 test-friendly development practices, 142 recovery, 115-132 reliability engineers, 1 advantages of an immutable infrastructure, remote desk protocol (RDP), 170 104 repeatable reads isolation level, 218 anatomy of a recovery strategy, 122-128 replica lag, 196 building block 1, detection, 122 replicas building block 2, tiered storage, 124 building, in single-leader replication, 196 building block 3, varied toolbox, 125 keeping synchronized, in single-leader rep‐ building block 4, testing, 127 lication, 196 considerations for, 115 replication, 7, 189-209, 225 data integrity and recoverability, 252 backups and, 126 defined recovery strategy, 128 failures of, in datastores, 43 object storage, 131 for mission-critical data, 63 offline storage, 130 issues in Database as a Service (DBaaS), 96 online, fast storage with full and incre‐ monitoring replication state in databases, 75 mental backups, 128 multi leader, 203-209 online, slow storage with full and incre‐ conflict resolution in, 204 mental backups, 129 use cases, 203 disaster recovery, using multi-leader replica‐ write-anywhere replication, 206 tion, 203 replication stalls caused by database scenarios for, 116-122 changes, 147 planned recovery, 116 row-based versus statement-based, 114 scenario impact, 121 safety checks in monitoring, 64 scenario scope, 121 single leader, 190-202 unplanned scenarios, 118 block-level replication, 193 testing success of restores, 118 challenges in, 195 Redis monitoring, 198-202 memory allocation library, 83 replication log formats, 191 non-uniform memory access (NUMA) and, replication models, 190 86 row-based replication, 193 SSL and, 173 statement-based logs, 191 redo logs, 192 uses of, 194 redos, monitoring for databases, 75 write-ahead logs, 192 reductive bias, 32 reproducibility (exploits), 161 relational model, 212 repudiation, 161 object relational mapping (ORM) layer, 213 resiliency

266 | Index robusness versus, in availability, 21 improving with database proxies, 232 striving for resilience in handling risks, 35 single-leader replication and, 195 traits of resilient systems, 21 scaling resource management and load shedding to patterns of, 8 mitigate DoS attacks, 165 premature, 8 response time versus latency, 16 scaling out (see horizontal scaling) Riak, 74 scaling up (see vertical scaling) last write wins, 205 secret management services, 173 risk analysis and prioritization (DREAD), 161 security, 153-179 risk assessment database security as a function, 155-160 considerations, 32 education and collaboration, 155 availability of resources, 33 integration and testing, 157 group factors, 34 operational visibility, 158-160 human factors, 33 self-service approaches, 156 unknown factors and complexity, 32 encryption of data, 168-179 risk elimination, 44 confidential/sensitive business data, 170 risk management, 31-47 data in the database, 174-177 bootstrapping a working process, 36-45 data in the filesystem, 177-179 architectural inventory, 39 data in transit, 170-174 control and decision making, 42 financial data, 169 prioritization of risks, 40 military or government data, 170 service risk evaluation, 37 personal health data, 169 ongoing iterations of the process, 45 private individual data, 169 what not to do, 35 new attack vectors, 153 risk mitigation, 44 purpose of, 153-155 Robot for Linux, 141 compliance and auditing standards, 155 robustness versus resiliency (in availability), 21 protecting data from exposure, 155 RocksDB, 185 protecting data from theft, 154 rollbacks protecting from accidental damage, 154 failure of migrations/deploys, 150 protecting from purposeful damage, 154 monitoring, 77 safety of data, 7 rolling migrations, 150 vulnerabilities and exploits, 160-168 rolling upgrades, 22, 150 basic precautions against, 162 rows denial of service (DoS), 163-166 database row storage, 182-185 DREAD algorithm, 161 row-based replication, 193 network and authentication protocols, RSA algorithm, 171 168 RUM (real user monitoring), 26 SQL injection, 166-168 STRIDE classification for known threats, S 160 safety of data, monitoring, 63 self-actualization (for databases), 10 SaltStack, 101 self-service for scale, 3 SANs (see storage area networks) semantic names, 109 saturation (see utilization, saturation, and semaphores, 77 errors) semi-synchronous availability-latency trade-off, scalability, 8 226 impacts of caches and in-memory data‐ semi-synchronous replication, 191 stores, 237 Sensu, 57 impacts of event and message systems, 235 sequential IOPs, 89

Index | 267 serial snapshot isolation (SSI), 219 replication models, 190 serializable isolation level, 219 row-based replication, 193 variability in implementations of, 220 statement-based logs, 191 serverless computing models uses of, 194 data integrity and, 120 availability, 194 operational expertise and, 7 locality, 195 servers portability, 195 instrumenting the server or instance, 68-71 scalability, 195 physical servers as database hosts, 81 write-ahead logs, 192 (see also hosts) single-object scope for recovery, 121 ServerSpec, 109 site reliability engineers (SREs), 3 service catalog, 109 size (backups), compressed and uncompressed, service delivery reviews, 46 117 service discovery, 109 SLAs (service-level agreements), 13 service discovery tools, 109 calibrating recovery processes to meet, 127 service issues, 17 sloppy quorums, 208 service risk evaluation, 37 SLOs (service-level objectives), 13 service-level agreements (see SLAs) considerations for recovery, 115 service-level indicators, 15 defining, 17-24 availability, 16 additional considerations, 24 cost or efficiency, 16 availability indicators, 20 durability, 16 latency indicators, 17-20 latency, 15 throughput indicators, 23 throughput, 16 defining for cost or efficiency, 17 service-level management, 13-29 focusing on metrics directly related to, 56 defining service objectives, 17-24 monitoring and reporting on, 25-29 monitoring and reporting on SLOs, 25-29 availability, 25 need for SLOs, 13 cost and efficiency, 28 service-level objectives (see SLOs) latency, 28 severe impact risks (immediate SLO violation), throughput, 28 40 recovery scenario impacts on, 122 SHA-256, 171 snapshot isolation, 219 sharding, 8, 79, 196, 196 serial snapshot isolation (SSI), 219 using infrastructure management concepts soft deletes, 149 for, 110 software and operations, eliminating barriers using service catalog for, 110 between, 5 signal from noise, sorting, 58 software engineering, 6 single-leader replication, 190-202 making your data a first-class citizen in, 8 block-level replication, 193 software engineers challenges in, 195 collaboration with DBREs in release man‐ building replicas, 196 agement, 137 keeping replicas synchronized, 196 creating active dialogue and interactions single leader failovers, 197 with, 134 monitoring, 198 educating about datastores, 133 operational processes, 202 educating and collaborating with in data‐ replication availability and capacity, 200 base security, 156 replication consistency, 201 necessity of learning operations, 6 replication lag and latency, 199 solid-state drives (SSDs), 184 other methods, 194 I/O latency and, 90

268 | Index I/O scheduling, 82 storage capacity, 88 IOPs, 89 storage latency, 90 sorted string tables (SSTs) and log-structured storage throughput, 89 merge trees, 185-187 tiered storage in a recovery strategy, 124 datastores using as storage engine, 187 object storage, 125 Speed Matters, 18 offline storage, 125, 130 spike erosion, 18 online, fast storage, 128 spoofing identity, 160 online, high performance storage, 124 SQL (Structured ) online, low-performance storage, 125 tracking SQL calls, 66 online, slow storage, 129 traditional SQL analysis, 67 storage area networks (SANs), 92 SQL injection attacks, 166-168 storage engines, 222 identifying, 158 datastores utilizing LSM structure with mitigation, 166 SSTables as, 187 harm reduction, 167 stored code, attack vector using SQL injection, using input validation, 167 166 using prepared statements, 167 stored procedures, 142 monitoring for, 167 STRIDE, 160 vulnerability to, 158 Structured Query Language (see SQL; SQL SSH2, 170 injection attacks) SSL (Secure Sockets Layer), 170 summaries, metrics stored in, 59 evaluating database implementation of, 172 survival and safety needs (databases), 7 SSL/TLS swapping, 84 for communications outside the network, disabling, 85 172 symmetric multiprocessing (SMP), 85 OpenSSL, 175 synchronization of configuration, 105 support by modern database systems, 173 synchronous replication models, 191 SSTables, 75, 185 synchronous replication-latency trade-off, 226 standardization, 56 synthetic monitoring, 27 and use of physical servers as database system engineering, 6 hosts, 92 databases as commodity service compo‐ T nents, 5 tampering with data, 160 elimination of toil through, 4 tcmalloc, 83 statement-based logs, 191 TCP/IP, tuning for databases, 87 steal time, monitoring, 70 telemetry/metrics, 59 storage, 87, 181, 222 Terraform, 107 data structure storage, 181-189 using with MySQL, 110 database row storage, 182-185 test-driven development (TDD), 109 indexing, 188 testing logs and databases, 189 acceptance testing and compliance for infra‐ sorted-string tables and log-structured structure, 109 merge trees, 185-187 building test environments for feature inte‐ in tiers, 247 gration and operations testing, 128 in virtualized environments, 94 building test environments for operations major demands or objectives for, 88 testing, 129 object storage recovery strategy, 131 configurations, 103 object stores, 127 in a recovery strategy, 127 storage availability, 90 in object storage recovery strategy, 131

Index | 269 in release management, 142-146 use for replication, 194, 195 downstream tests, 145 2-phase locking (2PL), 219 full dataset testing, 144 operational tests, 145 U post-commit testing, 143 unplanned recovery scenarios, 118 test-friendly development practices, 142 resulting from application errors, 119 integration and testing, catching security resulting from datacenter failures, 120 vulnerabilities, 157 resulting from hardware failures, 120 migration, 150 resulting from OS and hardware errors, 119 operational tests in planned recovery sce‐ resulting from user error, 119 nario, 118 UPDATE statement without a WHERE clause, rollbacks of partial or full changesets, 150 119 test data for database integration, 140 user errors, 119, 129, 131 testing frameworks, 141 detection and recovery from, 123 theft of data, protecting against, 154 user resource limits (for database hosts), 82 throughput user-centric approach to SLOs, 24 assessing in planned recovery scenario, 117 users defined, 16 dynamically built database users, 174 indicators, 23 removing unnecessary users, 162 monitoring, 28 utilization, saturation, and errors, 68 monitoring for databases, 74 for aggregate hosts on a distributed system, storage, 89 69 tickets and tasks, 61 high resource utilization operations during time (for planned recovery processes), 117 migrations, 149 timeouts, 226 monitoring for datastore connection layer, timestamps 71 in last write wins, 205 saturation, 72 issues with, 205 saturation of resources during database TLS (Transport Layer Security), 170 deployment, 147 toil, 4 USE page for Linux, 70 tombstones, 186 utilization for database memory structures, tools 76 giving to software engineers for develop‐ ment process, 137 varied toolbox for recovery, 125 V tracing, distributed tracing of performance, 66 version control, 100 transactions, 215-221 version control systems (VCS), 100 ACID, 216 prerequisite for continuous integration of atomicity, 216 databases, 139 consistency, 217 versioning, migrations and, 146 durability, 220 vertical scaling, 8 isolation, 217-220 vertically separated infrastructure definitions, functioning as events, 233 107 translation lookaside buffers (TLBs), 83 virtual infrastructures, database instance lifecy‐ Transparent Huge Pages (THP), 83 cles and, 52 Transport Layer Security (TLS), 170 virtual machine monitors (VMMs) (see hyper‐ triggers visors) causing a statement to be nondeterministic, virtual machines (VMs), 93 192 containers versus, 95

270 | Index images for, downloading and building with whitebox monitoring, 57 Vagrant, 111 whitebox testing, 57 virtualization, 93-95 write-ahead logs, 192, 220 concurrency, 94 write-anywhere replication, 206 database clusters, configuration standards, anti-entropy, 208 93 eventual consistency, 207 hypervisors, 93 read and write quorums, 207 storage in virtualized environments, 94 sloppy quorums, 208 use cases for database infrastructure, 94 write-through approach, 236 virtualized systems, 3 writes monitoring, 70 conflict resolution in multidirectional repli‐ testing in a defined recovery strategy, 129 cation, 204 visibility (see operational visibility) conflict-free replicated datatypes, 206 visualizations, output from OpViz, 59, 61 custom options for, 205 Vivid Cortex, 79 eliminating conflicts, 204 vulnerabilities, 160-168 last write wins, 205 basic precautions against, 162 write failures, evaluation of, 44 common vulnerabilities and exposures (CVEs), 156 Y denial of service (DoS) attacks, 163-166 yield, 226 DREAD classification algorithm, 161 in network and authentication protocols, 168 Z SQL injection, 166-168 Zipkin, 66 STRIDE classification for known threats, Zookeeper, 109 160 W WHERE clause (DELETE and UPDATE state‐ ments), absence of, 119

About the Authors

Laine Campbell works as Senior Director of Production Engineering at Fastly. She was also founder and CEO of PalominoDB/Blackbird, a consultancy servicing the database needs of a number of companies, including Obama for America, Activision Call of Duty, Adobe Echosign, Technorati, Livejournal, and Zendesk. She has 18 years of production experience, running databases and distributed systems at scale.

Charity Majors works as the CEO/cofounder of honeycomb.io. Honeycomb combines the raw accuracy of log aggregators, the speed of metrics, and the flexibility of APM (application performance metrics) to provide the world's first truly next-generation analytics service. She previously ran operations at Parse/Facebook, managing a massive fleet of MongoDB replica sets as well as Redis, Cassandra, and MySQL. She also worked closely with the RocksDB team at Facebook to develop and roll out the world's first Mongo+Rocks deployment using the pluggable storage engine API.

Colophon

The animal on the cover of Database Reliability Engineering is a Suffolk punch, also known as a Suffolk horse or Suffolk sorrel. This English breed of draught horse is always chestnut in color and has an energetic gait. The Suffolk punch was developed in the 16th century for farm work. Though the breed gained popularity in the early 20th century, it fell out of favor in the middle part of the century because of the mechanization of agriculture. Suffolk punches are around 65 to 70 inches tall and weigh around 2,000 pounds; they are shorter but more massive than other British draught breeds such as Clydesdales or Shires.

Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from the Museum of British Quadrapeds. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.