The Site Reliability Workbook: Practical Ways to Implement SRE

Presented by Google Cloud
Launch Day Edition
Companion to the Bestselling SRE Book

The Site Reliability Workbook
Practical Ways to Implement SRE

Edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara & Stephen Thorne

Praise for The Site Reliability Workbook

This new workbook will help people to take the sometimes theoretical and abstract concepts covered in Site Reliability Engineering out of the special context of the Googleplex and see how the same concepts work in other organizations. I’m especially excited to see more detail in the analysis of toil, how to apply SRE principles to data pipelines, and the case study reports discussing practical service level management.
-- Kurt Andersen, Site Reliability Engineer, LinkedIn

This practical hands-on guide to implementing SRE is valuable for engineers at companies of all sizes. It’s excellent to see this workbook being shared so that we can all move forward and build more reliable systems together. I was impressed with the level of detail shared; you can pick this book up and get started implementing SRE practices today.
-- Tammy Bütow, Principal SRE, Gremlin

A timely reminder, from the team that made SRE a required practice for everyone operating at scale, that reliability is created by people. This book is full of practical examples of how to optimize for reliability by focusing on the interactions between users and engineers and between technology and tools, without losing sight of feature velocity. The result is a compelling, interesting, and thought-provoking companion to Site Reliability Engineering.
-- Casey Rosenthal, CTO, Backplane.io

Google’s first book explained the what and why of SRE. This book shows you how to implement SRE at any company, startup or giant. Great work by the editorial team.
-- Jonah Horowitz, SRE at Stripe

In 2016, Google dropped Site Reliability Engineering on the operations world, and the operations world was never the same. For the first time people had access to over 500 pages of distilled information on what Google does to run its planet-wide infrastructure. Most people liked the book, a handful didn’t, but nobody ignored it. It became a seminal work and an important touchstone for how people thought about SRE (especially the Google implementation of it) from that point on. But it was missing something…. Now in 2018, Google returns to fill in a crucial piece of the puzzle: in their first volume they described what they do, but that didn’t help those who couldn’t see themselves in Google’s story. This book aims to demonstrate how Google does SRE, and how you can do it, too.
-- David N. Blank-Edelman, editor of Seeking SRE: Conversations about Running Production Systems at Scale and cofounder of the global set of SREcon conferences

Beijing Boston Farnham Sebastopol Tokyo

The Site Reliability Workbook
edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, and Stephen Thorne

Copyright © 2018 Google LLC. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].
Acquisitions Editor: Nikki McDonald
Developmental Editor: Virginia Wilson
Production Editor: Kristen Brown
Copyeditor: Rachel Monaghan
Proofreader: Kim Cofer
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

August 2018: First Edition

Revision History for the First Edition
2018-06-08: First Release
2018-06-22: Second Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492029502 for release details.

This work is part of a collaboration between O’Reilly and Google. See our statement of editorial independence.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Site Reliability Workbook, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-02950-2
[LSCH]

Table of Contents

Foreword I
Foreword II
Preface

1. How SRE Relates to DevOps
    Background on DevOps
    No More Silos
    Accidents Are Normal
    Change Should Be Gradual
    Tooling and Culture Are Interrelated
    Measurement Is Crucial
    Background on SRE
    Operations Is a Software Problem
    Manage by Service Level Objectives (SLOs)
    Work to Minimize Toil
    Automate This Year’s Job Away
    Move Fast by Reducing the Cost of Failure
    Share Ownership with Developers
    Use the Same Tooling, Regardless of Function or Job Title
    Compare and Contrast
    Organizational Context and Fostering Successful Adoption
    Narrow, Rigid Incentives Narrow Your Success
    It’s Better to Fix It Yourself; Don’t Blame Someone Else
    Consider Reliability Work as a Specialized Role
    When Can Substitute for Whether
    Strive for Parity of Esteem: Career and Financial
    Conclusion

Part I. Foundations

2. Implementing SLOs
    Why SREs Need SLOs
    Getting Started
    Reliability Targets and Error Budgets
    What to Measure: Using SLIs
    A Worked Example
    Moving from SLI Specification to SLI Implementation
    Measuring the SLIs
    Using the SLIs to Calculate Starter SLOs
    Choosing an Appropriate Time Window
    Getting Stakeholder Agreement
    Establishing an Error Budget Policy
    Documenting the SLO and Error Budget Policy
    Dashboards and Reports
    Continuous Improvement of SLO Targets
    Improving the Quality of Your SLO
    Decision Making Using SLOs and Error Budgets
    Advanced Topics
    Modeling User Journeys
    Grading Interaction Importance
    Modeling Dependencies
    Experimenting with Relaxing Your SLOs
    Conclusion

3. SLO Engineering Case Studies
    Evernote’s SLO Story
    Why Did Evernote Adopt the SRE Model?
    Introduction of SLOs: A Journey in Progress
    Breaking Down the SLO Wall Between Customer and Cloud Provider
    Current State
    The Home Depot’s SLO Story
    The SLO Culture Project
    Our First Set of SLOs
    Evangelizing SLOs
    Automating VALET Data Collection
    The Proliferation of SLOs
    Applying VALET to Batch Applications
    Using VALET in Testing
    Future Aspirations
    Summary
    Conclusion

4. Monitoring
    Desirable Features of a Monitoring Strategy
    Speed
    Calculations
    Interfaces
    Alerts
    Sources of Monitoring Data
    Examples
    Managing Your Monitoring System
    Treat Your Configuration as Code
    Encourage Consistency
    Prefer Loose Coupling
    Metrics with Purpose
    Intended Changes
    Dependencies
    Saturation
    Status of Served Traffic
    Implementing Purposeful Metrics
    Testing Alerting Logic
    Conclusion

5. Alerting on SLOs
    Alerting Considerations
    Ways to Alert on Significant Events
    1: Target Error Rate ≥ SLO Threshold
    2: Increased Alert Window
    3: Incrementing Alert Duration
    4: Alert on Burn Rate
    5: Multiple Burn Rate Alerts
    6: Multiwindow, Multi-Burn-Rate Alerts
    Low-Traffic Services and Error Budget Alerting
    Generating Artificial Traffic
    Combining Services
    Making Service and Infrastructure Changes
    Lowering the SLO or Increasing the Window
    Extreme Availability Goals
    Alerting at Scale
    Conclusion

6. Eliminating Toil
    What Is Toil?
    Measuring Toil
    Toil Taxonomy
    Business Processes
    Production Interrupts
    Release Shepherding
    Migrations
    Cost Engineering and Capacity Planning
    Troubleshooting for Opaque Architectures
    Toil Management Strategies
    Identify and Measure Toil
    Engineer Toil Out of the System
    Reject the Toil
    Use SLOs to Reduce Toil
    Start with Human-Backed Interfaces
    Provide Self-Service Methods
    Get Support from Management and Colleagues
    Promote Toil Reduction as a Feature
    Start Small and Then Improve
    Increase Uniformity
    Assess Risk Within Automation
    Automate Toil Response
    Use Open Source and Third-Party Tools
    Use Feedback to Improve
    Case Studies
    Case Study 1: Reducing Toil in the Datacenter with Automation
    Background
    Problem Statement
    What We Decided to Do
    Design First Effort: Saturn Line-Card Repair
    Implementation
    Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
    Implementation
    Lessons Learned
    Case Study 2: Decommissioning Filer-Backed Home Directories
    Background
    Problem Statement
    What We Decided to Do
    Design and Implementation
    Key Components
    Lessons Learned
    Conclusion

7. Simplicity
    Measuring Complexity
    Simplicity Is End-to-End, and SREs Are Good for That
    Case Study 1: End-to-End API Simplicity
    Case Study 2: Project Lifecycle Complexity
    Regaining Simplicity
    Case Study 3: Simplification of the Display Ads Spiderweb
    Case Study 4: Running Hundreds of Microservices on a Shared Platform
    Case Study 5: pDNS No Longer Depends on Itself
    Conclusion

Part II.
