Distributed Tracing in Practice: Instrumenting, Analyzing, and Debugging Microservices

Austin Parker, Daniel Spoonhower, Jonathan Mace, and Rebecca Isaacs with Ben Sigelman

Distributed Tracing in Practice
by Austin Parker, Daniel Spoonhower, Jonathan Mace, and Rebecca Isaacs, with Ben Sigelman

Copyright © 2020 Ben Sigelman, Austin Parker, Daniel Spoonhower, Jonathan Mace, and Rebecca Isaacs. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: John Devins
Development Editor: Sarah Grey
Production Editor: Katherine Tozer
Copyeditor: Chris Morris
Proofreader: JM Olejarz
Indexer: Sue Klefstad
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2020: First Edition

Revision History for the First Edition
2020-04-13: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492056638 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Distributed Tracing in Practice, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Lightstep. See our statement of editorial independence.

978-1-492-09503-3 [LSI]

Table of Contents

Foreword
Introduction: What Is Distributed Tracing?

1. The Problem with Distributed Tracing
    The Pieces of a Distributed Tracing Deployment
    Distributed Tracing, Microservices, Serverless, Oh My!
    The Benefits of Tracing
    Setting the Table

2. An Ontology of Instrumentation
    White Box Versus Black Box
    Application Versus System
    Agents Versus Libraries
    Propagating Context
    Interprocess Propagation
    Intraprocess Propagation
    The Shape of Distributed Tracing
    Tracing-Friendly Microservices and Serverless
    Tracing in a Monolith
    Tracing in Web and Mobile Clients

3. Open Source Instrumentation: Interfaces, Libraries, and Frameworks
    The Importance of Abstract Instrumentation
    OpenTelemetry
    OpenTracing and OpenCensus
    OpenTracing
    OpenCensus
    Other Notable Formats and Projects
    X-Ray
    Zipkin
    Interoperability and Migration Strategies
    Why Use Open Source Instrumentation?
    Interoperability
    Portability
    Ecosystem and Implicit Visibility

4. Best Practices for Instrumentation
    Tracing by Example
    Installing the Sample Application
    Adding Basic Distributed Tracing
    Custom Instrumentation
    Where to Start—Nodes and Edges
    Framework Instrumentation
    Service Mesh Instrumentation
    Creating Your Service Graph
    What’s in a Span?
    Effective Naming
    Effective Tagging
    Effective Logging
    Understanding Performance Considerations
    Trace-Driven Development
    Developing with Traces
    Testing with Traces
    Creating an Instrumentation Plan
    Making the Case for Instrumentation
    Instrumentation Quality Checklist
    Knowing When to Stop Instrumenting
    Smart and Sustainable Instrumentation Growth

5. Deploying Tracing
    Organizational Adoption
    Start Close to Your Users
    Start Centrally: Load Balancers and Gateways
    Leverage Infrastructure: RPC Frameworks and Service Meshes
    Make Adoption Repeatable
    Tracer Architecture
    In-Process Libraries
    Sidecars and Agents
    Collectors
    Centralized Storage and Analysis
    Incremental Deployment
    Data Provenance, Security, and Federation
    Frontend Service Telemetry
    Server-Side Telemetry for Managed Services

6. Overhead, Costs, and Sampling
    Application Overhead
    Latency
    Throughput
    Infrastructure Costs
    Network
    Storage
    Sampling
    Minimum Requirements
    Strategies
    Selecting Traces
    Off-the-Shelf ETL Solutions

7. A New Observability Scorecard
    The Three Pillars Defined
    Metrics
    Logging
    Distributed Tracing
    Fatal Flaws of the Three Pillars
    Design Goals
    Assessing the Three Pillars
    Three Pipes (Not Pillars)
    Observability Goals and Activities
    Two Goals in Observability
    Two Fundamental Activities in Observability
    A New Scorecard
    The Path Ahead

8. Improving Baseline Performance
    Measuring Performance
    Percentiles
    Histograms
    Defining the Critical Path
    Approaches to Improving Performance
    Individual Traces
    Biased Sampling and Trace Comparison
    Trace Search
    Multimodal Analysis
    Aggregate Analysis
    Correlation Analysis

9. Restoring Baseline Performance
    Defining the Problem
    Human Factors
    (Avoiding) Finger-Pointing
    “Suppressing” the Messenger
    Incident Hand-off
    Good Postmortems
    Approaches to Restoring Performance
    Integration with Alerting Workflows
    Individual Traces
    Biased Sampling
    Real-Time Response
    Knowing What’s Normal
    Aggregate and Correlation Root Cause Analysis

10. Are We There Yet? The Past and Present
    Distributed Tracing: A History of Pragmatism
    Request-Based Systems
    Response Time Matters
    Request-Oriented Information
    Notable Work
    Pinpoint
    Magpie
    X-Trace
    Dapper
    Where to Next?

11. Beyond Individual Requests
    The Value of Traces in Aggregate
    Example 1: Is Network Congestion Affecting My Application?
    Example 2: What Services Are Required to Serve an API Endpoint?
    Organizing the Data
    A Strawperson Solution
    What About the Trade-offs?
    Sampling for Aggregate Analysis
    The Processing Pipeline
    Incorporating Heterogeneous Data
    Custom Functions
    Joining with Other Data Sources
    Recap and Case Study
    The Value of Traces in Aggregate
    Organizing the Data
    Sampling for Aggregate Analysis
    The Processing Pipeline
    Incorporating Heterogeneous Data

12. Beyond Spans
    Why Spans Have Prevailed
    Visibility
    Pragmatism
    Portability
    Compatibility
    Flexibility
    Why Spans Aren’t Enough
    Graphs, Not Trees
    Inter-Request Dependencies
    Decoupled Dependencies
    Distributed Dataflow
    Machine Learning
    Low-Level Performance Metrics
    New Abstractions
    Seeing Causality

13. Beyond Distributed Tracing
    Limitations of Distributed Tracing
    Challenge 1: Anticipating Problems
    Challenge 2: Completeness Versus Costs
    Challenge 3: Open-Ended Use Cases
    Other Tools Like Distributed Tracing
    Census
    A Motivating Example
    A Distributed Tracing Solution?
    Tag Propagation and Local Metric Aggregation
    Comparison to Distributed Tracing
    Pivot Tracing
    Dynamic Instrumentation
    Recurring Problems
    How Does It Work?
    Dynamic Context
    Comparison to Distributed Tracing
    Pythia
    Performance Regressions
    Design
    Overheads
    Comparison to Distributed Tracing

14. The Future of Context Propagation
    Cross-Cutting Tools
    Use Cases
    Distributed Tracing
    Cross-Component Metrics
    Cross-Component Resource Management
    Managing Data Quality Trade-offs
    Failure Testing of Microservices
    Enforcing Cross-System Consistency
    Request Duplication
    Record Lineage in Stream Processing Systems
    Auditing Security Policies
    Testing in Production
    Common Themes
    Should You Care?
    The Tracing Plane
    Is Baggage Enough?
    Beyond Key-Value Pairs
    Compiling BDL
    BaggageContext
    Merging
    Overheads

A. The State of Distributed Tracing Circa 2020
B. Context Propagation in OpenTelemetry
Bibliography
Index

Foreword

Human beings have struggled to understand production for exactly as long as human beings have had production software. We have these marvelously fast machines, but they don’t speak our language and—despite their speed and all of the hype about artificial intelligence—they are still entirely unreflective and opaque. For many (many) decades, our efforts to understand production software ultimately boiled down to two types of telemetry data: log data and time series statistics. The time series data—also known as metrics—helped us understand that “something terrible” was happening inside of our computers. If we were lucky, the logging data would help us understand specifically what that terrible thing was.

But then everything changed: our software needed more than just one computer. In fact, it needed thousands of them. We broke the software into tiny, independently operated services and distributed those fragmented services across the planet, atomized among the millions of computers housed in massive datacenters. And with so many processes involved in every end-user request, the logs and statistics from individual machines told only a sliver of the story. It felt like we were flying blind.

I started working on distributed tracing in early 2005. At the time, I was a 25-year-old software engineer working—somewhat grudgingly, if I’m being candid—on a far-flung service within the Google AdWords backend infrastructure. Like the rest of the company, I was trying to write software capable of withstanding a punishing load from the outside world (by this point, Google was already a verb and we had scaled well into uncharted territory for commodity hardware). We were running microservices before that term had been invented, and when we needed some new abstraction layer or infrastructure, we were almost always forced to write it in-house (for one thing, GitHub hadn’t even been incorporated yet).

To make a long story short, we kept the ship afloat…but it was a mess. And nobody except for the old-timer super-geniuses (read: not me) had any clue where the bodies were buried or how it all actually fit together.

That’s when I met Sharon Perl, truly by accident. She had been a research scientist at DEC’s Systems Research Center in the 1990s (i.e., when it was cool!) and came to Google in the very early days: 2001, if I remember correctly. In that short impromptu conversation with Sharon, I asked her what she was working on, and she rattled off a list of interesting systems software projects: a distributed blob store, a Google-scale identity service, a distributed lockservice…and then this thing called Dapper. Dapper was “a distributed tracing system,” whatever that was. Needless to say, I had never heard of a distributed tracing system—in 2005, hardly any nonacademic had—but it sounded fascinating.

At the time, Dapper was just a prototype that Sharon codeveloped with Mike Burrows and Luiz Barroso. They had patched Google’s internal RPC subsystem and control-flow packages in order to propagate a few GUIDs alongside each request as the request bounced from service to service. It wasn’t fully operational, but an early proof-of-concept showed that the fundamentals were sound. For the first time, an ordinary Google engineer actually had some hope of understanding what happened to an individual web request in the 150 milliseconds it took to touch hundreds or thousands of distinct microservices.

I was hooked. Here was something truly novel, powerful, and—from a personal standpoint—wildly understaffed! So I started to dig into the Dapper codebase, clean things up, round edges, and deal with more than my share of internal bureaucracy (among other things, Dapper had a daemon running with root privileges on every piece of production hardware at Google and, wisely, they put some process around that sort of thing). Suffice it to say, after a year or two of wall time and some phenomenal work from a team of engineers, we were able to deploy Dapper across all of Google’s backend software. To the best of my knowledge, this was the first time any organization had run distributed tracing continuously for a production system at scale.

…and so we deployed Dapper across all of Google, and we solved observability.

If only! The truth is that Dapper was a point solution to some painful yet isolated problems. In the early days, it was hard to get people to even use it, much less benefit from it. My team’s KPI was the number of weekly logins, and I remember when the number hovered in the low double digits, month after month. We would dream up clever analytical features, deploy them, wait for the thundering herds of enthusiastic users, and then feel disappointed.

Eventually we did find a way to increase usage and thus organizational value to Google, but it wasn’t with new analytical features or insightful visualizations. In fact, it was something really basic that only required a few hundred lines of code: one of my colleagues integrated links to relevant Dapper traces into a tool that Google engineers already used many times a day. It turns out that some small fraction of people would click on those links, and sometimes they found something really valuable on the other side.

That was it. Just a simple integration into an existing workflow. Beyond the data engineering and instrumentation challenges, distributed tracing is hard because it’s often thought of not just as a new set of telemetry but as a distinct, segregated product experience. No matter how compelling that product experience is, developers (like all people) are creatures of habit who do not want to learn a new tool to check proactively. Tracing data and insights must fit into the context of preexisting workflows and tasks to be done. This is the best way to give tracing-oriented insights the exposure needed to justify the investment in a fundamentally new data source.

Distributed tracing is still in its infancy. Thinking back to the early days of the Dapper project, when I was just ramping up on the codebase, I asked Luiz Barroso if he could spare 30 minutes to help me understand a few things. Luiz was already quite distinguished, but was (and remains) humble, friendly, and generous with his time, so he agreed. When I met with him, I must have sounded a bit naive, but I was also unfathomably excited about what I wanted to do to Dapper. I wanted to build in a just-in-time sampling mechanism, create a declarative language for user-defined queries that execute across application services, integrate kernel traces, and more. I asked him what he thought. Ever the voice of wisdom, he let me down easy and explained that simply getting Dapper into production would be a major accomplishment and would take years. “Start there,” he said.

Luiz was right about that. Fifteen years later, much of our industry hasn’t gotten a whole lot further than that, at least in production. Distributed tracing is worth it, but it’s hard! Still, it’s a very young discipline, and the last section of this book provides a window on what’s still to come. In another 15 years, we will look back on distributed tracing circa 2020 as both critical and primitive. By understanding where the technology is going, we’ll be better able to position ourselves to adapt to the dynamic landscape surrounding tracing and observability in general. Stepping back, it’s important to remember that nobody works with “just one microservice.”

Our industry moved to microservices so that our dev teams could operate with independence, and to a certain extent, we got our wish—at least where continuous integration/continuous deployment is concerned. But this “independence” was an illusion; in production, these microservices are in fact highly interdependent, and a failure or slowdown in one service propagates across the stack of microservices, leaving chaos and confusion (and many frantic Slack messages) in its wake.

Distributed traces must be part of the solution to this problem. They are the only window we have into how the hundreds of services in deep, multilayered microservice architectures actually interact as they fulfill end-to-end user requests. They may be a relative newcomer to the telemetry world compared to time series stats and vanilla logs, but they are also the most vital when it comes to understanding the larger system. Without tracing data, we are reduced to guess-and-check across seas of disorganized logging data and metrics dashboards.

Yet it’s not nearly as simple as adding distributed tracing. While healthy observability in distributed systems must involve distributed traces, we still need to figure out how. How do we make distributed tracing useful? How do we adopt it? How do we integrate it into our existing workflows and processes? And how do we future-proof these efforts?

These are fascinating and challenging questions, and they are the subject of this book. We hope you enjoy it.

— Ben Sigelman Cofounder and CEO of Lightstep and cocreator of Dapper

Introduction: What Is Distributed Tracing?

If you’re reading this book, you may already have some idea what the words distributed tracing mean. You may also have no idea what they mean—for all we know, you’re simply a fan of bandicoots (the animal on the cover). We won’t judge, promise. Either way, you’re reading this to gain some insight into what distributed tracing is, and how you can use it to understand the performance and operation of your microservices and other software. With that in mind, let’s start out with a simple definition.

Distributed tracing (also called distributed request tracing) is a type of correlated logging that helps you gain visibility into the operation of a distributed software system for use cases such as performance profiling, debugging in production, and root cause analysis of failures or other incidents. It gives you the ability to understand exactly what a particular individual service is doing as part of the whole, enabling you to ask and answer questions about the performance of your services and your distributed system as a whole.

That was easy—see you next book!

What’s that? Why’s everyone asking for a refund? Oh… We’re being told that you need a little more than that. Well, let’s take a step back and talk about software, specifically distributed software, so that we can better understand the problems that distributed tracing solves.

Distributed Architectures and You

The art and science of developing, deploying, and operating software is constantly in flux. New advances in computing hardware and software have dramatically pushed the boundaries of what an application looks like over the years. While there’s an interesting digression here about how “everything old is new again,” we’ll focus on changes over the past two decades or so, for the sake of brevity.

Prior to advances in virtualization and containerization, if you needed to deploy some sort of web-based application, you would need a physical server, possibly one dedicated to your application itself. As traffic increased to your application, you would either need to increase the physical resources of that server (adding RAM, for example) or you would need multiple servers that each ran their own copy of your application.

With a monolithic server process, this horizontal scaling often led to unfavorable trade-offs in cost, performance, and organizational overhead. Running multiple instances of your server meant you were duplicating all functionality of the server, rather than scaling individual subcomponents independently. With traditional infrastructure, you were often forced to make a decision about how many minutes (or hours!) of degraded performance was acceptable while you brought additional capacity online—servers aren’t cheap to run, so why would you run at peak capacity if you didn’t need to? Finally, as the size and complexity of your application increased, along with the number of developers who were working on it, testing and validating new changes became more difficult. As your organization grew, it became unreasonable for developers to understand a single codebase, not to mention the shape of the entire system. Increasingly smaller changes increased the odds of a ripple effect that led to total application failure as their impact radiated out from one component to another.

Time marched on, however, and solutions to these problems were built. Software such as virtualization was created to abstract away the details of physical hardware, allowing a single physical server to be split into multiple logical servers. Docker and other containerization technologies extended this concept, providing a lightweight and user-friendly abstraction over heavier-weight virtual machines, moving the question of “who deploys this software” from operators to developers. The popularization of cloud computing and its notion of on-demand computing resources solved the problem of resource scaling, as it became possible to increase the amount of RAM or CPU cores for a given server at the click of a button. Finally, the idea of microservice architectures came about to address the complexity imposed by ever-larger and more complicated software-oriented businesses by structuring large applications around loosely coupled independent services.

Today, it’s arguable that most applications are distributed in some fashion, even if they don’t use microservices. Simple client-server applications themselves are distributed—consider the classic question of “A call to my server has timed out; was the response lost, or was the work not done at all?” Additionally, they may have a variety of distributed dependencies, such as datastores that are consumed as a service offered by a cloud provider, or a whole host of third-party APIs that provide everything from analytics to push notifications and more.

Why is distributed software so popular?

The arguments for distributed software are pretty clear:

Scalability
A distributed application can more easily respond to demand, and its scaling can be more efficient. If a lot of people are trying to log in to your application, you could scale out only the login services, for example.

Reliability
Failures in one component shouldn’t bring down the entire application. Distributed applications are more resilient because they split up functions through a variety of service processes and hosts, ensuring that even if a dependent service goes offline, it shouldn’t impact the rest of the application.

Maintainability
Distributed software is more easily maintainable for a couple of reasons. Dividing services from each other can increase how maintainable each component is by allowing it to focus on a smaller set of responsibilities. In addition, you’re freer to add features and capabilities without implementing (and maintaining) them yourself—for example, adding a speech-to-text function in an application by relying on some cloud provider’s speech-to-text service.

This is the tip of the iceberg, so to speak, in terms of the benefits of distributed architectures. Of course, it’s not all sunshine and roses, and into every life, a little rain must fall…

Deep Systems

A distributed architecture is a prime example of what software architects often call a deep system.1 These systems are notable not because of their width, but because of their complexity. If you think about certain services or classes of services in a distributed architecture, you should be able to identify the difference. A pool of cache nodes scales wide (as in, you simply add more instances to handle demand), but other services scale differently. Requests may route through three, four, fourteen, or forty different layers of services, and each of those layers may have other dependencies that you aren’t aware of. Even if you have a comparatively simple service, your software probably has dozens of dependencies on code that you didn’t write, or on managed services through a cloud provider, or even on the underlying orchestration software that manages its state.

The problem with deep systems is ultimately a human one. It quickly becomes unrealistic for a single human, or even a group of them, to understand enough of the services that are in the critical path of even a single request and continue maintaining it. The scope of what you as a service owner can control versus what you’re implicitly responsible for is illustrated in Figure P-1. This calculus becomes a recipe for stress and burnout, as you’re forced into a reactive state against other service owners, constantly fighting fires, and trying to figure out how your services interact with each other.

1 [Sig19]

Figure P-1. The service that you can control has dependencies that you’re responsible for but have no direct control over.

Distributed architectures require a reimagined approach to understanding the health and performance of software. It’s not enough to simply look at a single stack trace or watch graphs of CPU and memory utilization. As software scales—in depth, but also in breadth—telemetry data like logs and metrics alone don’t provide the clarity you require to quickly identify problems in production.

The Difficulties of Understanding Distributed Architectures

Distributing your software presents new and exciting challenges. Suddenly, failures and crashes become harder to pin down. The service that you’re responsible for may be receiving data that’s malformed or unexpected from a source that you don’t control because that service is managed by a team halfway across the globe (or a remote team). Failures in services that you thought were rock-solid suddenly cause cascading failures and errors across all of your services. To borrow a phrase from Twitter, you’ve got a microservices murder mystery (see Figure P-2) on your hands.

Figure P-2. It’s funny because it’s true.

To extend the metaphor, monitoring helps determine where the body is, but it doesn’t reveal why the murder occurred. Distributed tracing fills in those gaps by allowing you to easily comprehend your entire system by providing solutions to three major pain points:

Obfuscation
As your application becomes more distributed, the coherence of failures begins to decrease. That is to say, the distance between cause and effect increases. An outage at your cloud provider’s blob storage could fan out to cause huge cascading latency for everyone, or a single difficult-to-diagnose failure at a particular service many hops away that prevents you from uncovering the proximate cause.

Inconsistency
Distributed applications might be reliable overall, but the state of individual components can be much less consistent than it would be in monolithic or non-distributed applications. In addition, since each component of a distributed application is designed to be highly independent, the state of those components will be inconsistent—what happens when someone does a deployment, for example? Do all of the other components understand what to do? How does that impact the overall application?

Decentralized
Critical data about the performance of your services will be, by definition, decentralized. How do you go looking for failures in a service when there may be a thousand copies of that service running, on hundreds of hosts? How do you correlate those failures?

The greatest strength of distributing your application is also the greatest impediment to understanding how it actually functions! You may be wondering, “How do we address these difficulties?” Spoiler: distributed tracing.

How Does Distributed Tracing Help?

Distributed tracing emerges as a critical tool in managing the explosion of complexity that our deep systems bring. It provides context that spans the life of a request and can be used to understand the interactions and shape of your architecture. However, these individual traces are just the beginning—in aggregate, traces can give you important insights about what’s actually going on in your distributed system, allowing you not only to correlate interesting data about your services (for example, that most of your errors are happening on a specific host or in a specific database cluster), but also to filter and rank the importance of other types of telemetry. Effectively, distributed traces provide context that helps you filter problem-solving down to only things that are relevant to your investigation, so you don’t have to guess and check multiple logs and dashboards. In this way, distributed tracing is actually at the center of a modern observability platform, and it becomes a critical component of your distributed architecture rather than an isolated tool.

So, what is a trace? The easiest way to understand is to think about your software in terms of requests. Each of your components is in the business of doing some sort of work in response to a request (aka RPC, from remote procedure call) from another service. This could be as prosaic as a web page requesting some structured data from a service endpoint to present to a user, or as complex as a highly parallelized search process. The actual nature of the work doesn’t matter too much, although there are certain patterns that we’ll discuss later on that lend themselves to certain styles of tracing. While distributed tracing can function in most distributed systems, as we’ll discuss in Chapter 4, its strengths are best demonstrated in modeling the RPC relationships between your services.

In addition to the RPC relationships, think about the work that each of those services does. Maybe they’re authenticating and authorizing user roles, performing mathematical calculations, or simply transforming data from one format to another. These services are communicating with each other through RPCs, sending requests and receiving responses. Regardless of what they’re doing, one thing that all of these services have in common is that the work they’re performing takes some length of time. The basic pattern of services and RPCs is illustrated in Figure P-3.

Figure P-3. A request from a client process to a service process.

We call the work that each service is doing a span, as in the span of time that it takes for the work to occur. These spans can be annotated with metadata (known as attributes or tags) and events (also referred to as logs). The RPCs between services are represented through relationships that model the nature of the request and the order in which the requests occur. This relationship is propagated by way of the trace context, some data that uniquely identifies a trace and each individual span within it. The span data that is created by each service is then forwarded to some external process, where it can be aggregated into a trace, analyzed for further insights, and stored for further analysis. A simple example of a trace can be seen in Figure P-4, showing a trace between two services, as well as a subtrace inside the first service.

Figure P-4. A simple trace.
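To make the idea of trace context concrete, here is a minimal sketch—not taken from this book’s sample code—of how a trace identifier and span identifier might ride along with a request from one service to the next. It uses Python and the W3C Trace Context “traceparent” header purely as an illustration; the helper names are hypothetical, and in practice a tracing library’s propagators do this work for you.

    import secrets

    def new_context():
        # Hypothetical helper: IDs are random hex strings, as in most tracers.
        return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

    def inject(ctx, headers):
        # One common wire format is the W3C Trace Context "traceparent" header:
        # version-traceid-parentspanid-flags
        headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"
        return headers

    def extract(headers):
        _version, trace_id, parent_span_id, _flags = headers["traceparent"].split("-")
        # The downstream service starts its own span, sharing the trace ID and
        # recording the caller's span ID as its parent.
        return {"trace_id": trace_id,
                "parent_span_id": parent_span_id,
                "span_id": secrets.token_hex(8)}

    # Service A attaches context to an outgoing RPC...
    outgoing_headers = inject(new_context(), {})
    # ...and service B reconstructs it from the incoming request.
    child_context = extract(outgoing_headers)

However the propagation is implemented, the important property is the same: a small, unique identifier travels with every request, which is what lets a collector stitch the spans emitted by many services back into a single trace.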

Distributed tracing mitigates the obfuscation present in distributed architectures by ensuring that each logical request through your services is presented as that—a single logical request. It ensures that all of the data relevant to a given execution of your business logic remain coupled at the point that they’re analyzed and presented. It addresses the inconsistency issue by allowing queries to be made using relationships between services along specific APIs or other routes, letting you ask questions like “What happens to my API when this other service is down?” Finally, it addresses the decentralization issue by providing a method to ensure that processes can contribute trace data independently to a collector that can centralize it later, allowing you to visualize and understand requests that may be running across multiple datacenters, regions, or other distributions.

With all this in mind, what are some things that you can do with distributed tracing? What makes it crucial to understanding distributed systems? We’ve compiled several real-world examples:

• A major transactional email and messaging company implemented distributed tracing across its backend platform, including tracing calls to Redis. This trace data quickly showed that there was an unneeded loop in its calls to Redis, which was fetching data from the cache more than necessary. In removing this unneeded call, the company reduced the time it took to send an email by anywhere from 100 to 1,000 milliseconds! This worked out to be a roughly 85% reduction in the time for every single email sent—and this is for a platform that was sending over one billion emails a day! Not only was the company able to discover this unneeded call, it was able to validate the impact of removing it on other services and quantify the value of the work.

• An industrial data company was able to use distributed trace data to easily compare requests captured during an incident—when a primary database was overloaded—to an earlier baseline. The ability to view aggregate statistical data about historical performance along with the context of individual requests during the regression dramatically reduced the time required to determine the root cause of the incident.

• A major health and fitness company implemented distributed tracing across its applications. As its engineers analyzed the performance of their document database, they were able to identify repeated calls that could be consolidated, leading to reduced latency and more efficient code.

• A delivery platform used distributed tracing to troubleshoot latency issues from managed services that its system relied on. It was able to positively identify an issue with its cloud provider’s Kafka pipeline before the vendor did, enabling a rapid response to the incident and restoration of desired performance.

These are just a handful of examples—distributed tracing has also demonstrated value in teams that are trying to better understand their continuous integration systems by providing visibility into their test pipeline, into the operation of global-scale search technologies at Google, and as a cornerstone of open source projects like OpenTelemetry. The question truly is: what is distributed tracing to you?

Distributed Tracing and You

Distributed tracing, again, is a method to understand distributed software. That’s a lot like saying that water is wet, though—not terribly helpful, and reductive to a fault. Indeed, the best way to understand distributed tracing is to see it in practice, which is where this book comes in!

In the coming chapters, we’ll cover the three major things you need to know to get started with implementing distributed tracing for your applications and discuss strategies that you can apply to solve the problems caused by distributed architectures. You’ll learn about the different ways to instrument your software for distributed tracing and the styles of tracing and monitoring available to you. We’ll discuss how to collect all of the data that your instrumentation produces and the various performance considerations and costs around the collection and storage of trace data. After that, we’ll cover how to generate value from your trace data and turn it into useful, operational insights. Finally, we’ll talk about the future of distributed tracing.

By the end of this book, you should understand the exciting world of distributed tracing and know where, how, and when to implement it for your software. Ultimately, the goal of Distributed Tracing in Practice is to allow you to build, operate, and understand your software more easily. We hope that the lessons in this text will help you in building the next generation of monitoring and observability practice at your organization.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, datatypes, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a general note.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download on GitHub. If you have a technical question or a problem using the code examples, please send email to [email protected].

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Distributed Tracing in Practice by Austin Parker, Daniel Spoonhower, Jonathan Mace, and Rebecca Isaacs with Ben Sigelman (O’Reilly). Copyright 2020 Ben Sigelman, Austin Parker, Daniel Spoonhower, Jonathan Mace, and Rebecca Isaacs, 978-1-492-05663-8.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. Visit http://oreilly.com for more information.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/distributed-tracing.

Email [email protected] to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

Austin Parker
Special thanks to everyone at O’Reilly who helped make this possible—our editors, Sarah Grey, Virginia Wilson, and Katherine Tozer; the production staff who worked tirelessly indexing, revising, redrawing, and making sure things fit on the page. Thanks to our technical reviewers for their insights and feedback; you made this a better book! I’d also like to thank Ben Sigelman and the rest of the crew at Lightstep for all their support—truly, without y’all, none of this would have happened. I’d like to thank my parents, for having me, and to my Dad for being a daily inspiration. Love you both. To my wife: <3. In solidarity, Austin.

Daniel Spoonhower
I’d like to thank everyone at Lightstep for supporting Austin and me through this work, and especially those that answered my many questions about their experience implementing and using tracing. I’d like to thank Bob Harper and Guy Blelloch for helping me to understand the value of clear writing (and for giving me some practice in writing under a deadline). I’d also like to thank my family for helping me find the time to work on this book.

Rebecca Isaacs
I would like to acknowledge the experience, advice and good ideas of my colleagues, many of whom have high expectations for the utility of distributed tracing in production settings. I would also like to thank Paul Barham for his insights and wisdom about tracing and analysis of distributed systems.


Chapter 1. The Problem with Distributed Tracing

I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS. —James Mickens1

The concept of tracing the execution of a computer program is not a new one in any sense. Being able to understand the call stack of a program is fairly critical, you might say, to all manner of profiling, debugging, and monitoring tasks. Indeed, stack traces are likely to be the second most utilized debugging tool in the world, right behind print statements liberally scattered throughout a codebase. Our tools, processes, and technologies have improved over the past two decades and demand new methodologies and patterns of thinking, though.

As we recalled in the Introduction, modern architectures such as microservices have fundamentally broken these classic methods of profiling, debugging, and monitoring. Distributed tracing stands ready to alleviate these issues to fix the holes in our tools that we have destroyed with our tools. There’s just one problem—distributed tracing can be hard. Why is this the case? Three fundamental problems generally occur when you’re trying to get started with distributed tracing.

First, you need to be able to generate trace data. Support for distributed tracing as a first-class citizen in your runtime may be spotty or nonexistent. Your software might not be structured to easily accept the instrumentation code required to emit tracing data. You may use patterns that are antithetical to the request-based style of most distributed tracing platforms. Often, distributed tracing initiatives are dead on arrival due to the challenges of instrumenting an existing codebase.

1 [Mic13]

Another problem is how you collect and store the trace data generated by your software. Imagine hundreds or thousands of services, each emitting small chunks of trace data for each request, potentially millions of times per second. How do you capture that data and store it for analysis and retrieval? How do you decide what to keep, and how long to keep it? How do you scale the collection of your data in time with requests to your services?

Finally, once you’ve got all of this data, how do you actually derive value from it? How do you translate the raw trace data that you’re receiving into actionable insights and actions? How do you use trace data to provide context to other service telemetry, reducing the time required to diagnose issues? Can you turn your trace data into value for other parts of the business, outside of just engineers?

These questions, and more, stymie and confuse many who are trying to get started with distributed tracing. The result of a distributed tracing deployment is a tool that grants you visibility into your deep system and the ability to easily understand how individual services in a request contribute to the overall performance of each request. The trace data you’ll generate can be used not only to display the overall shape of your distributed system (see Figure 1-1), but also to view individual service performance inside a single request.

Figure 1-1. A service map generated from distributed trace data.

As Figure 1-2 shows, you’ll be able to inspect requests as they move from frontend clients into backend services and understand how—and why—latency or errors are occurring, and what impact they’re having on the entire request. These traces provide a wealth of information that you’ll find invaluable when troubleshooting problems in production, such as metadata that indicates which host or region a particular service is running on. You’ll have the ability to search, sort, filter, group, and generally slice and dice this trace data how you please in order to quickly troubleshoot problems or understand how different dimensions are impacting your service performance.

Figure 1-2. A sample trace demonstrating a request initiated by a frontend web client.

So, how do you get from here to there? What do you need to build a successful distributed tracing deployment?

The Pieces of a Distributed Tracing Deployment

To answer these questions and help you organize your thinking about the subject, we’ve broken down distributed tracing deployments into three main areas of focus, which is also how we’ve organized the book. These three pieces build off of each other, but may be generally useful to different people at different times—by no means do you need to be an expert on all three! Inside each section you’ll find helpful explanations, lessons, and examples of how to build and deliver a distributed tracing deployment at your organization that should help in building confidence in your systems and software.

Instrumentation, Chapter 2
Distributed tracing requires traces. Trace data can be generated through instrumentation of your service processes or by transforming existing telemetry data into trace data. In this section, you’ll learn about spans, the building blocks of request-based distributed traces, and how they may be generated by your services. We’ll discuss the state of the art in instrumentation frameworks such as OpenTelemetry, a widely supported open source project that offers an instrumentation API (and more) that allows for easy bootstrapping of distributed tracing into your software. In addition, we’ll discuss the best practices for instrumenting legacy code as well as greenfield development.

Deployment, Chapter 5
Once you’re generating trace data, you need to send it somewhere. Deploying tracing for your organization requires an understanding of where your software runs—for end users and their clients, as well as on servers—and how it’s operated. You’ll need to understand the security, privacy, and compliance implications of collecting and storing trace data. You may encounter trade-offs in overhead relating to how much data is kept, and how much is discarded through a process known as sampling. We’ll discuss best practices around all of these topics and help you figure out how to quickly deploy tracing infrastructure for your system.

Delivering value, Chapter 7
Once your services are generating trace data and you’ve deployed the necessary infrastructure to collect it, the real fun begins! How do you combine traces with your other observability tools and techniques such as metrics and logs? How do you measure what matters—and how do you define what matters to begin with? Distributed tracing provides the tools you’ll need to answer these questions, and we’ll help you figure it out in this section. You’ll learn how to use traces to improve your baseline performance, as well as how tracing assists you in getting back to that baseline when things catch on fire.

All that said, there’s still an open question here: how does distributed tracing relate to microservices, and distributed architectures more generally? We touched on this in Introduction: What Is Distributed Tracing?, but let’s digress for a moment to review the relationship between these things.

Distributed Tracing, Microservices, Serverless, Oh My!

There’s a certain line of thinking about microservices, now that we’re several years past them being the “hot thing” in every analyst’s portfolio of “Top Trends for 20XX”—namely, that the battle has been won. The exploding popularity of cloud computing, Kubernetes, containerization, and other development tools which enable rapid provisioning and deployment of hardware (or hardware-like abstractions) has transformed the industry, undoubtedly. These factors can make it feel like asking the question “Should I use microservices?” would be to out oneself as a fool or charlatan.

Take a step back here and we’ll look at some real-world data. First and foremost, there’s some evidence that containers aren’t exactly as popular in production as the hype may make them seem: only 25% of developers use them in production.2 Quite a few engineering organizations are still using traditional monoliths for a lot of their work. Why? One reason may be, ironically enough, the lack of accessible distributed tracing tools. Developer and author Martin Fowler identifies three primary considerations for those adopting microservices: the ability to rapidly provision hardware, the ability to rapidly deploy software, and a monitoring regime that can detect serious problems quickly.3 The things we love about microservices (independence, idempotence, etc.) are also the things that make them difficult to understand, especially when things go wrong. Serverless technologies add further confusion to this equation by giving you less visibility into the runtime environment of a particular function and often being stubbornly resistant to monitoring through your favorite tools.

How, then, should we consider distributed tracing arrayed against these questions? First, distributed tracing solves the monitoring question raised by Fowler by providing visibility into the operation of your microservice architecture. It allows you to gain critical insights into the performance and status of individual services as part of a chain of requests in a way that would be difficult or time-consuming to do otherwise. Distributed tracing gives you the ability to understand exactly what a particular, individual service is doing as part of the whole, enabling you to ask and answer questions about the performance of your services and your distributed system.

Traditional metrics and logging alone simply can’t compare to the additional context provided by distributed tracing. Metrics, for example, will allow you to get an aggregate understanding of what’s happening to all instances of a given service, and even allow you to narrow your query to specific groups of services, but fail to account for infinite cardinality.4 Logs, on the other hand, provide extremely fine-grained detail on a given service, but have no built-in way to provide that detail in the context of a request. While you can use metrics and logs to discover and address problems in your distributed system, distributed tracing provides context that helps you narrow down the search space required to discover the root cause of an incident while it’s occurring (when every moment counts).

As we mentioned in the Introduction, trying to manage and understand a complex, microservice-based distributed architecture can lead to stress and burnout. If you’re thinking about migrating to microservices, are in the middle of a transition from a monolith to microservices, or are already tasked with wrangling an immense microservice architecture, then you might be experiencing this stress too, when considering how to understand the health and performance of your software. Distributed tracing might not be a panacea, but as part of a larger observability strategy, it can become a critical component of how you operate reliable distributed systems.

2 [Sta19]
3 [Fow14]
4 Cardinality is a mathematical term that refers to the number of elements in a set or group. In the context of metrics, it’s the number of unique combinations of metric names and key/value attributes attached to those names. We’ll discuss this more in later chapters.

The Benefits of Tracing

What are the specific benefits you can achieve with distributed tracing? We’ll talk about this throughout the rest of the book, but let’s review the high-level quick wins first:

• Distributed tracing can transform the way that you develop and deliver software, no doubt about it. It has benefits not only for software quality, but for your organization’s health.

• Distributed tracing can improve developer productivity and your development output. It is the best and easiest way for developers to understand the behavior of distributed systems in production. You will spend less time troubleshooting and debugging a distributed system by using distributed tracing than you would without it, and you’ll discover problems you wouldn’t otherwise realize you had.

• Distributed tracing supports modern polyglot development. Since distributed tracing is agnostic to your programming language, monitoring vendor, and runtime environment, you can propagate a single trace from an iOS-native client through a C++ high-performance proxy through a Java or C# backend to a web-scale database and back, all visualized in a single place, using a single tool. No other set of tools allows you this freedom and flexibility.

• Distributed tracing reduces the overhead required for deployments and rollbacks by quickly giving you visibility into changes. This not only reduces the mean time to resolution of incidents, but decreases the time to market for new features and the mean time to detection of performance regressions. This also improves communication and collaboration across teams because your developers aren’t siloed into a particular monitoring stack for their slice of the pie—everyone, from frontend developers to database nerds, can look at the same data to understand how changes impact the overall system.

Setting the Table

After all that, we hope that we have your attention! Let’s recap:

• Distributed tracing is a tool that allows for profiling and monitoring distributed systems by way of traces, data that represents requests as they flow through a system.

• Distributed tracing is agnostic to your programming language, runtime, or deployment environment and can be used with almost every type of application or service.

• Distributed tracing improves teamwork and coordination, and reduces time to detect and resolve performance issues with your application.

To realize these benefits, first you’ll need some trace data. Then you’ll need to collect it, and finally you’ll have to analyze it. Let’s start at the beginning, then, and talk about instrumenting your code for distributed tracing.


Chapter 2. An Ontology of Instrumentation

When you sketch a system diagram, what do you start with? We often start with a simple box representing a single service, as in Figure 2-1.

Figure 2-1. A visual representation of a service, component, function, or what have you.

This box is extended, added to, and connected to a variety of other boxes through dashed and solid lines, arrows, and other logically connected fabric. At the end of the day, we can’t really escape the idea of our software being this series of connected boxes on some sort of plane. Often, we aren’t able to conceptualize our boxes as much more than a simple function that accepts inputs, does something to them, and sends the output to another box off in the distance (see Figure 2-2).

We already instrument our boxes in various ways during development to understand what’s happening inside each one—after all, practically no software is bug-free, and any system accepting input from users is likely to receive something the developer did not expect. You can think of instrumentation as anything that assists you in monitoring or measuring the performance and state of an application—so you’ll write some logs that indicate when a user inputs invalid parameters to your function, or perhaps when an operation is disallowed.

Figure 2-2. Multiple services, components, functions, etc. linked together visually.

Now, if you’re reading this, we’ll assume you’re interested in instrumenting your code for tracing, which has its own set of concerns and edge cases that you’ll need to address. In this chapter, we’ll cover the two critical things you need to understand in order to begin instrumenting an application and talk about the trade-offs between them. Finally, we’ll demonstrate how to apply the instrumentation techniques we present in order to trace a simple service that communicates over HTTP.

As mentioned in the Introduction, a span is a unit of work per‐ formed by a service. We’ll discuss more concrete representations of them in Chapter 3, but in the upcoming sections we represent them as JSON blobs.

White Box Versus Black Box
The first major topic we'll cover is the distinction between white box and black box instrumentation. Remember the little square from Figure 2-1 that contained the service we were interested in monitoring? Turn it on its side and imagine that it's a box—like in Figure 2-3.

Figure 2-3. A visual representation of a service from Figure 2-1, but in three dimensions.

To an outside observer, such as a user or consumer of our service, this box is completely opaque. You may have some guarantees, such as, "If I put something in this box, I'll get something else out of it," but the actual mechanism of this operation is unknown and unknowable to you, the user. Consider a scenario where, as an external end user, you wanted to know the performance of the mechanism inside the box—all you're able to do is measure the length of time it takes from when you put something in to when you're given your result. You don't have any real ability to model what's happening inside the box—other than pure conjecture, which isn't really useful in this circumstance—and thus the amount of data you can reasonably make inferences about is fairly small. We consider this sort of instrumentation to be operating against a black box. Let's apply this metaphor to a daemon process running on a system and the various dimensions we can measure it in. As an operator of this process, I can view interesting and potentially valuable data about the process by inspecting /proc/<pid>/status—for example, the amount of memory mapped to the process (as VmSize).1 I can view open file handles, calculate the percentage of CPU utilization by the process over a fixed time range with some fancy math, and do all sorts of things. However, little of this helps me trace the application. For that, I'll need to observe the set of observable inputs to my process:

• I/O devices
• System calls
• Network activity
• External libraries
• Process operations

We can discard several of these in the context of distributed systems; as a matter of fact, we can generally focus on just one—network activity. By and large, distributed applications running across multiple physical or virtual servers will have the majority of their inputs defined by an RPC (remote procedure call) that is delivered across a LAN or WAN link. This does not diminish the utility or importance of these other forms of input (they can be critical for debugging or deep, kernel-level tracing), but it does help to focus our discussion. Thus, one example of a black box trace would be to observe, through some sort of proxy, the incoming and outgoing network traffic of a process. If we know that our black box accepts requests in the format /api/:operation/:resourceId, and

1 Or through Get-Process in PowerShell, for our Windows friends!

responds with some message, we could use the proxy to create a span that looks something like Example 2-1.

Example 2-1. Black box trace

{
  'operationName': '/api/<operation>',
  'duration': '<duration>',
  'tags': [
    {'resource': '<resourceId>'},
    {'service': '<serviceName>'},
    {'wasSuccess': true}
    // And so forth—pid, other metrics
  ]
}

By analyzing the traffic as it enters and exits the process, we can collect the data required to build a useful span. Up to this point, we’ve been talking about black box instrumentation, presupposing that we don’t know what’s going on inside the box. What if we opened it up to look inside, as in Figure 2-4? We can easily create and validate a hypothesis due to our knowledge of the inner workings of a service we write—after all, we wrote it! It is this knowledge of the service and the ability to modify it that comprises white box instrumentation.

Figure 2-4. When you open the box, you can view all of the inputs, outputs, and how they’re transformed.

With the ability to look inside the service and modify its code, we can instrument our software in much more powerful ways. The ability to fully comprehend the internal workings of the service, the data model that it operates on, and the exact call graph that comprises its execution flow allows for writing trace instrumentation that is more comprehensive and more useful than we could otherwise achieve. Recall our earlier example—when limited to merely observing the inputs and outputs, we could lack critical pieces of information for our span, such as the relationships between external RPC calls created by our service or requests to other components of our distributed

application, such as a database. With white box instrumentation, we do not have to think of the entire internal transaction as a single logical whole, but can consider it as almost a subtrace of our greater transaction.
Given this, you might be wondering, "Why wouldn't I always use white box instrumentation?" Quite simply: sometimes, you cannot. This is common in larger engineering teams with more legacy software to maintain—consider a modern API frontend that is backed by a legacy mainframe application. Even if you could make changes to your legacy services (which is not always possible), would you want to? In these circumstances, you might be able to represent only the work performed in your legacy component via black box instrumentation. Keep in mind, you do not necessarily need to create your black box spans from the system that operates a service; a typical pattern is to use the calling service with its white box instrumentation to create a separate span for the black box process.
Application Versus System
Our second topic is the distinction between application and system instrumentation. Much has been written about the difference between application and system monitoring, and instrumentation for distributed tracing follows similar lines. We'll briefly review the distinction and discuss how it applies to instrumentation for distributed tracing.
Traditionally, the people who operated applications and the people who operated the servers that ran those applications had different concerns. A system operator might be concerned with the health of disk drives, the amount of memory available on a server, or other system metrics. Meanwhile, an application operator would have more prosaic questions—is the application responding to requests, and is it performing acceptably? The application operator, thus, might monitor their application by using a script to access the application over the network every few seconds and report any failures. The system operator, however, would use similar scripts to query the server's operating system to understand when a disk was running out of space. In the context of distributed tracing, we're usually less concerned with these sorts of metrics (or, to be more accurate, we gather them through other sources), but they shouldn't be ignored. Indeed, we can think of this as more of a question about what component generates our traces. In short, do we generate spans in our application code, or does some service or subsystem that is running our application code do it for us? Consider our simple service from the last section—we know it has some inputs and produces some outputs. Let's add some more boxes to our service diagram in Figure 2-5.

Figure 2-5. A simple system diagram of an API server, service proxy, and worker process.

All of these services could be independently instrumented for traces, emitting spans as a single request passes through them—in fact, this is a very common pattern and the easiest way to get started with tracing distributed systems. However, applications are often ignorant of what’s going on outside of their immediate context—indeed, a stateless service would be ignorant of anything happening outside of the context of a specific request! In this case, we need to pull back a bit, looking not only at the serv‐ ices that are running, but the substrate they exist in. This is where system instrumen‐ tation comes into play. Systems can be many exciting things—consider container orchestration systems such as Kubernetes or managed platforms such as DC/OS. We can use these systems to generate trace data in the form of spans or context baggage that provides turn-key instrumentation to applications or to enhance the quality of spans emitted by applica‐ tion code. Since these orchestrators and platforms act, practically, as the operating system for services running on them (see Figure 2-6), you’re able to extract useful system data such as memory usage, CPU share utilization, and other data external to the process or its container and share that with the application or emit it as a separate span for analysis.

Figure 2-6. The services from Figure 2-5, but running inside a platform that exposes system data.
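To make the idea of system-level enrichment concrete, here is a rough sketch of a service reading its own VmSize from /proc (the same field mentioned earlier in this chapter) and attaching it to a span as an attribute. The OpenTelemetry Python API is used only for illustration; in practice an agent, sidecar, or the platform itself would more likely supply this kind of data.

    from opentelemetry import trace

    def read_vm_size_kb():
        # Linux-only: scan /proc/self/status for the VmSize field (reported in kB).
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmSize:"):
                    return int(line.split()[1])
        return None

    tracer = trace.get_tracer("worker")

    with tracer.start_as_current_span("process-job") as span:
        vm_size_kb = read_vm_size_kb()
        if vm_size_kb is not None:
            # Attach the system-level measurement to the application span.
            span.set_attribute("process.vm_size_kb", vm_size_kb)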

So, when should you use application instrumentation and when should you use system instrumentation? Ideally, you'd use both. At the time of this writing, it's generally easier to get started with pure application instrumentation, but new technology such as service meshes dramatically reduces the difficulty of implementing system instrumentation. In the future, we expect that managed orchestration platforms such as Google Kubernetes Engine or AWS Fargate will provide seamless span context propagation to services running on them.
Agents Versus Libraries
The third topic we'll cover is the distinction between agent-based and library-based instrumentation. What do we mean by agents and libraries? Remember, a white box presupposes that the person writing instrumentation has access to the source code of the application being instrumented and can use that knowledge to generate more logically accurate instrumentation. This closely maps to the concept of instrumenting with a library. Conversely, the black box assumes you don't have that interior knowledge of the application source—agents are more analogous to this case, since they operate outside of the process itself.
The terms agent and library are used quite a bit in the modern tracing space, sometimes interchangeably, with confusing results. There are libraries that brand themselves as agents, agents that call themselves libraries, and a lot of things in between. We suggest that the biggest distinction between the two comes down to intent. The intent of a library is to make it easier to write instrumentation that can be shared across multiple services, easing the pain of adoption for creating distributed traces across a polyglot distributed system. Agents, conversely, intend to make it very easy to trace and observe existing systems without rewriting code.
A library-based instrumentation approach can be characterized by its reliance on an application-level dependency on some shared, standardized library that is used throughout its services. These libraries provide a sane and standardized API for handling the key components of creating instrumentation and propagating context. Libraries can support a polyglot heterogeneous application by defining a relatively small API that supports the least-common set of features shared by all of the target languages. Indeed, it's generally possible with a library-based strategy to write your instrumentation against a thin interface wrapper and only depend on a concrete implementation of your library at runtime via dependency injection. That said, library-based instrumentation generally relies on developers to write instrumentation code.
Agent-based instrumentation relies on some sort of external process or processes to instrument processes at runtime. There are a wide variety of agents and strategies for instrumenting with agents, but there are really two major methods agents use to instrument a service directly. The first is some external process or monitoring service

that injects code into your service and uses this to create a trace of the service as various functions are called. The second method is through some sort of in-process agent that is imported to the runtime environment of a process and uses a system of user-defined rules to trace specific actions. Of special note is indirect usage of agents to capture data that can be transformed into trace data—one such interesting application is to extend the black box approach and use some sort of existing data source for service state, like structured or unstructured log files, which the agent then transforms into trace data.
Really, as with everything else in this chapter, you're going to need to blend these approaches. Even with modern code, if you haven't been considering how to instrument your services from the jump, there's going to be an implementation cost to add tracing libraries to your existing services and applications. Some of your older services may not be able to have tracing added at all, and will require agents in order to be instrumented. That said, even if you have fairly modern software and services, you can jump-start tracing with agents to more quickly prove value or to prop up overall system visibility for teams that don't have high-quality library-based instrumentation.
Propagating Context
So far, we've discussed various strategies for instrumenting our services to emit spans that describe the work our services are doing. These spans, as standalone pieces of data, aren't very exciting. In order to create traces, we will need some way to communicate certain details about our spans to other services or other parts of our process. The mechanism by which we communicate these details to other services is generally known as context propagation. Let's start by talking about what we're propagating, then we'll move into how it's propagated.
Let's assume we have a simple service proxy that provides functions around user management. What would a span look like? (These span representations are covered in more detail in Chapter 3 when we discuss OpenTelemetry.) Our span in Example 2-2 has some pretty basic information—the operation we're concerned with, a tag—but there's something new as well. We've added a spanId field that provides an identifier for this particular span. Conceptually, each of our services is going to emit a span for the work that it's doing, represented in Figure 2-7.

Example 2-2. Basic span

{
  operationName: "api/getUser",
  spanId: "09f42f7e-e606-4923-831b-7dd612683720",
  tags: [
    {
      key: "userName",
      value: "testUser"
    },
  ],
  // Start time, duration, etc.
}

Figure 2-7. Relationship between the API proxy and the datastore services.

What’s going on in our datastore service? Let’s look at Example 2-3.

Example 2-3. Datastore service

{
  operationName: "getUserFromStore",
  spanId: "dac303fb-6c1c-4816-ac86-ce717cee1714",
  tags: [
    {
      key: "userId",
      value: 105832
    }
  ],
  // Start time, duration, etc.
}

One of the benefits of distributed tracing is that your spans exist largely independently of each other. As we'll talk about in later chapters, you can collect and centralize data from multiple sources, so we want a way for each of these spans to know the relationship between them that doesn't require us to send too much data around. For distributed RPCs like this, it's popular to send the trace context between services in an HTTP header, and have the child service create a span with a defined parent-child relationship. Our second span's data structure would look something like Example 2-4, in contrast.

Example 2-4. Span with a defined parent-child relationship

{
  operationName: "getUserFromStore",
  spanId: "dac303fb-6c1c-4816-ac86-ce717cee1714",
  parentSpanId: "09f42f7e-e606-4923-831b-7dd612683720",
  tags: [
    {
      key: "userId",
      value: 105832
    }
  ],
  // Start time, duration, etc.
}

The trace context (sometimes simplified to just context) is covered more in Chapter 3 and later in this chapter. For now, think of it as a set of globally unique identifiers for a trace and each of its spans. Typically, these identifiers are a bunch of random bits or a large random number.
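To make this concrete, here is a quick sketch of generating the kind of random identifiers a trace context carries: a 128-bit trace ID and a 64-bit span ID, shown here in the W3C traceparent header layout that comes up later in this section.

    import secrets

    trace_id = secrets.token_hex(16)   # 128 bits of randomness, hex-encoded
    span_id = secrets.token_hex(8)     # 64 bits of randomness, hex-encoded

    # W3C trace context layout: version, trace-id, parent-id (the current span), flags
    traceparent = f"00-{trace_id}-{span_id}-01"
    print(traceparent)
    # e.g., 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01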

Those are the basics—let's dive in for a more detailed look at the two different types of propagation: interprocess and intraprocess.
Interprocess Propagation
One key notion of microservice architectures is that we can think of each service as fairly independent of its peers. A service should do one logical thing reliably and robustly. This allows our services to scale horizontally in response to demand or other signals. This notion maps very well to span-based distributed tracing—each service, logically, would have a single span that corresponds to the work being performed by that service. It's often helpful to conceive of this by considering the RPCs that make up your microservices as a sort of call stack. Imagine an application with a few components, building off what we've been discussing earlier, as in Figure 2-8.

Figure 2-8. A service diagram from client to datastore.

Now let's manually trace a request through the system, starting at the client:

    client → api-proxy → auth → api-server → datastore

Any transaction you make, logically, needs to follow this series of RPC calls: the client talks to the api-proxy, the api-proxy authenticates the request and passes it along to the api-server, which talks to the datastore, which returns the result all the way back to the client. The work each of those services performs would be a single, logical span. We can intuit that we'll need some mechanism to propagate the trace context through this chain of requests so that each subsequent call can use that information to form a parent-child relationship with its predecessor. For the purposes of this section, let's presume that our services communicate with each other using HTTP. However, the principles that we're discussing aren't limited to interprocess communication over HTTP—they can be performed over a variety of transport methods such as gRPC, Apache Thrift, SOAP, etc.

Why HTTP?
Here and throughout the book, we tend to use HTTP and RESTful API idioms when discussing RPCs. This is mostly a concession to the relative simplicity of HTTP, the familiarity that most of our readers have with it, and the fact that a RESTful API with HTTP as the message-passing system maps naturally onto distributed traces.

Ultimately, when you’re making an RPC between any two of these services, you need two things to happen. The caller needs a way to take the current span context and serialize the required information to propagate the trace context to the next hop. The callee needs a way to discover the span context (if it exists) and create a child span using this data. The first action, serializing the current span context, is also known as an inject operation, whereas the latter is an extract. We’re “injecting” the span context into the transport, and “extracting” it back out. These inject and extract operations happen at the edge of our service in code—you’d generally want to use some sort of middleware in your HTTP service that would automatically perform these operations when a new request is created or received.
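As a sketch of what these inject and extract operations look like in practice, the following uses the OpenTelemetry Python API (whose module layout differs from the alpha release described in Chapter 3, though the semantics are the same); the downstream URL and handler names here are hypothetical.

    import requests
    from opentelemetry import trace
    from opentelemetry.propagate import extract, inject

    tracer = trace.get_tracer("api-proxy")

    def handle_request(incoming_headers, path):
        # Extract: recover the caller's span context from the incoming HTTP headers.
        parent_context = extract(incoming_headers)
        with tracer.start_as_current_span("handle " + path, context=parent_context):
            return call_downstream("http://api-server.internal" + path)

    def call_downstream(url):
        # Inject: serialize the active span context into the outgoing request
        # headers so that the next hop can create a child span from it.
        outgoing_headers = {}
        inject(outgoing_headers)
        return requests.get(url, headers=outgoing_headers)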

The terms trace context and span context are used interchangeably throughout the text. In general, they denote the same context— unique identifiers about a trace and span.

Broadly speaking, inject and extract semantics are fairly universal—you either pass in a span context, or get one back as the result. The span context is simple enough—it’s an object that contains the identifier for a span. The exact implementation of a span context varies somewhat across implementations, but open source projects such as OpenTracing have defined a span context as an object containing a SpanID, a

TraceID, and an array of baggage (which contains arbitrary key-value pairs). In general, we want these identifiers to be fairly unique—a span ID should be unique within a TraceID, and a TraceID should be unique within a very large space. How large? That depends entirely on the number of traces your system is generating, but a random 64-bit value is usually sufficient. The W3C's forthcoming general specification for trace context is standardizing around 128-bit identifiers, such as UUIDv4, which allows for very low probabilities of collisions (to reach a 50% probability of a single collision with UUIDv4, you'd need to generate 2.71 quintillion identifiers—that's one billion IDs a second for 85 years!) and should be sufficient for any single system.
Aside from IDs, there's baggage, as mentioned earlier—this is a convenient way to propagate information from earlier services to later ones. Imagine that you would like to propagate some piece of information from the client through every span, such as a user ID, a version, etc. You could use baggage to do this, but use it with care! Everything you put into baggage will exist on every hop after you add it, and the overhead of pushing that additional data across the network may incur noticeable performance penalties.
How should we use these methods? It's best that they happen in a fairly touch-free way. A good practice is to include middleware in your HTTP request pipeline that attempts to extract the span context from each incoming request and adds it to the request object. In your route handler, you can then look for the incoming span context and create a new child span. Similarly, wrapping your outgoing HTTP requests with a function that looks for an existing span and injects it into the outgoing request ensures that the next service down the line can pick it up if properly instrumented. A productive way to begin instrumenting an existing application is through this strategy, as a matter of fact—we'll talk more about that later.
It is very important that your team or organization develops standards for the format of propagated contexts. Eventually, the W3C's standardization efforts will reduce this burden, but as of this writing, you'll need to ensure that upstream and downstream service owners all agree on the format for your span context. Standardizing around some shared code for performing inject/extract works well and is easy to do in more homogeneous environments. In a more polyglot world, such as one where you've got microservices written and running in a variety of languages, make sure that documentation is clear and widely shared about the specific headers that will carry your trace context, and the format of that data. Open source telemetry frameworks also ease the burden of dealing with this problem, which we'll cover in Chapter 3.
Intraprocess Propagation
Where interprocess propagation is concerned with passing a trace context between different services, intraprocess propagation is concerned with passing the trace context around inside a single process. Why would we want to do this? If our microservice

applications are being designed like we all hope they are, we would have a single span per service, right?
Not all applications are pretty arrangements of microservices. We would hazard a guess that most applications aren't. We're seeing a greater and greater number of what we call "gentrified" applications—new, greenfield development tacked on to large brownfield monoliths. These hybrid applications often do involve a great deal of microservices surrounding the core monolith—and wouldn't you know it, we'd like to trace both those microservices and the calls they make inside the monolith. Even in microservices, it's extremely likely that we'll want to trace individual functions, or requests into our database. In addition, not all microservices are equally micro—think about a worker service that parallelizes work across multiple threads or across multiple remote services. In all of these cases, it would be highly beneficial to be able to propagate a trace inside a service in order to create spans that more accurately represent the work being performed.
The basic concepts here are very similar to the ones discussed in the section on interprocess propagation, with one critical difference. Since we aren't making RPCs, we don't need to concern ourselves with injecting or extracting our span context and serializing or deserializing it across a process boundary. In general, we're more concerned with the scope of a span. The details of this are fairly language-specific, but we'll give an overview of the concept. In a multithreaded or asynchronous processing scenario, we can define an active span as the span that is in scope of the work that our process is doing at any given point. Consider the pseudocode in Example 2-5.

Example 2-5. Active span

async function bigSearch(*context, key, dataset...) {
  for dataset d {
    let result = await d.findInSet(*context, key)
  }
  return result
}

async function (dataset) findInSet(*context, key) {
  while *context.isNotCancelled {
    let found = d.find(key)
    if found {
      *context = *context.Cancel
      return found
    }
  }
}

Our bigSearch function can take an arbitrary number of datasets, a context, and a key to look up in those sets. For each set, it starts a thread and begins to search for the key. When the key is found, it cancels the context and returns the result, causing all other searches to short-circuit as well. We can visualize this in the timing graph shown in Figure 2-9, which would be generated by the pseudocode in Example 2-6.

Figure 2-9. A span timing diagram demonstrating early cancellation of child spans.

How do you create these relationships in spans? As mentioned, we can create parent-child relations between spans using a span context, and the same principle applies here. Since we don't have to cross an RPC boundary, the simplest way is to pass the span context as a parameter to child functions, like in Example 2-6.

Example 2-6. Span context as a parameter to child functions

async function bigSearch(*context, key, dataset...) {
  let span = startSpan("bigSearch")
  span.setTag("searchKey", key)
  for dataset d {
    let result = await d.findInSet(*context, key, span.context)
  }
  span.finish()
  return result
}

async function (dataset) findInSet(*context, key, spanContext) {
  let childSpan = startSpanFromContext("findInSet", spanContext)
  while *context.isNotCancelled {
    let found = d.find(key)
    if found {
      *context = *context.Cancel
      childSpan.log("found key in dataset", d)
      childSpan.finish()
      return found
    }
  }
  childSpan.setTag("cancelled", true)
  childSpan.finish()
}

The preceding pseudocode is written in the style of Go or other languages where a user-managed process context object is available. In languages such as Java or C#, thread-local storage would offer similar functionality. The key thing to note is that you need to pass the span context into child functions that you wish to create spans for and use whatever language facility exists to do so. It can be as simple as passing a single span object around via function parameters.
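For comparison, here is a minimal sketch (using the OpenTelemetry Python API rather than the book's pseudocode) of the same parent-child pattern when the library tracks the active span for you; it previews the scope manager idea discussed next, since child functions need no explicit span context parameter.

    from opentelemetry import trace

    tracer = trace.get_tracer("search-worker")

    def big_search(key, datasets):
        with tracer.start_as_current_span("bigSearch") as span:
            span.set_attribute("searchKey", key)
            for dataset in datasets:
                result = find_in_set(dataset, key)
                if result is not None:
                    return result
            return None

    def find_in_set(dataset, key):
        # Because bigSearch's span is still active on this thread, this span
        # automatically becomes its child; no spanContext argument is required.
        with tracer.start_as_current_span("findInSet"):
            return dataset.get(key)  # hypothetical lookup standing in for d.find(key)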

Each of the child spans in this case would have a single parent, but it's kind of ugly to have to modify our method signatures to accept span contexts everywhere. Worry not, there's an easier way to do it through a mechanism known as a scope manager, but that's dependent on the specific technology and language you're using. In general, a scope manager removes the manual part of this process by using thread-local storage to automatically preserve a reference to the active span for a function and can use that reference to create new child spans from the active one. We'll discuss explicit implementations of scope managers in Chapter 3. For now, just know that you don't have to limit your traces simply to RPCs!
The Shape of Distributed Tracing
Now that we've introduced some of the basic concepts around instrumenting services, let's dive in to make it a bit more real. You might have an understanding of differing forms of instrumentation and how they can interact with each other, but how do we tie these concepts back to the sort of software and services you might be familiar with? In general, there's a certain pattern—or shape—of how distributed traces act, and interact, with common software architectural styles. Let's review them and give some pointers on which sort of instrumentation approaches would work best.
Tracing-Friendly Microservices and Serverless
Distributed tracing and microservices/serverless were, quite literally, made for each other. Historically, distributed tracing technology has been built by large engineering organizations with thousands or tens of thousands of microservices that are operated and maintained by hundreds or thousands of people. It may surprise you, then, that not all microservices are created equal when it comes to being traced. As a matter of fact, span-based distributed request tracing has several characteristics that you can hew to in order to create more useful traces from your microservices or serverless architectures. We'll cover some quick dos and don'ts here.
Generally with microservices or serverless, consider white box instrumentation first
Your microservices should be small enough that you can make the necessary changes to allow for library-based instrumentation practices, where your tracing code is integrated into the service itself. This allows for trace data that's not only

more accurately scoped to the actual function of the service, but also offers you the opportunity to build out tracing over time as services are updated to take advantage of it.
Capture semantic information about each service in your trace data
A well-traced microservice should attach relevant semantic attributes (such as the OpenTelemetry span.kind attribute) to the spans that it generates. This gives you a more complete and accurate view of what your services are doing and is especially beneficial in a microservice architecture as you have fewer guarantees about trace data consumers being aware of what your service does.
Create attributes for the important things!
Think about what you'd want to know if you were trying to debug a production fire at 3 a.m., then add that. Some recommended attributes/tags are hostname, region or datacenter, and service.version. One great idea that we're starting to see more of is a README attribute, or some other pointer to an internal system that indicates where people can ask questions or find out more information about the service.
Ensure that ingress and egress are traced.2
As we've mentioned, it's best practice to have the process of creating spans for incoming and outgoing requests be handled automatically by your RPC library.
Finally, start with what you know
If you have particular trouble spots, latency-sensitive services, or other areas of interest, then the quickest way to make useful traces is to start building them from there. It can often be easier to add a little bit of instrumentation to understand the problems you're having now rather than waiting for a large-scale instrumentation plan to be developed and executed, and we'll go into more detail on this later.
There are also some things to avoid:
Don't neglect to set "rules of the road"
The most critical aspect of successful instrumentation is that each service is able to create spans that are part of a larger distributed trace. This means that one of your first steps should be to ensure that context propagation headers and formats are being used. Chapter 4 will dive deeper into a variety of open source frameworks for distributed tracing; we'd suggest using one of those.
In general, don't try to trace extremely long operations
Distributed request tracing works best when the entire traced operation takes place in a fairly short (minutes) time span. There are several reasons for this, such as data retention periods for trace analyzers and sampling considerations (which we'll get into later), but for now let's just say that it's not a great fit. If you

2 Ingress and egress refer to traffic that enters, or exits, a network boundary, respectively. Broadly, these are used to refer to any or all incoming or outgoing requests from a service.

are trying to trace operations with an extremely long execution time, don't fret, there are options to address those use cases.
If you own a service that isn't in the critical path, don't assume you don't need to do anything
You might think that your service isn't important, or that it doesn't have a role to play in a trace. At the very least, you'll want to ensure that all of your services pass through any tracing headers that may be coming from their callers to their callees. We would suggest, though, that if you're going to go ahead and do that, it doesn't take a lot more work to wrap your microservice in a span and send it on its way, giving everyone a more complete view of the work of the application.
Finally, don't hold trace data locally for very long
This is more of a concern with a serverless service, but it's good advice in general—preferably, you're regularly exfiltrating your trace data from your service to some external collector. Some of this is to ensure that your analysis system is able to capture and analyze each request in its entirety while that trace is still relevant. More importantly, this practice reduces the potential for lost data if an instance of your service becomes unavailable due to a crash or some other disaster.
Tracing in a Monolith
We've stated that distributed tracing is primarily the domain of microservices, so some of you might be looking forlornly at your monoliths of yore, silently mouthing "Is tracing not for me, then?" Humble reader, hope is not lost—but you will need different tactics to instrument a monolith.
When instrumenting monoliths, you should first take stock of what you already have and consider why you're adding tracing. We see a few rationales for this. One is that you're decomposing your monolith and have decided to adopt tracing, but need to extend traces into the monolith from the new microservice components. Another is that you're adopting tracing at a different layer of the system (for example, at your client/frontend) and would like to capture end-to-end performance data to identify hotspots. Whatever your reasons are, instrumenting a monolith looks both similar and dissimilar to instrumenting a microservice.
As mentioned earlier, you should take stock of how you're observing the monolith already to determine the best path forward. Are your existing metrics and logs valuable? Studying how your on-call teams and engineers are using existing observability data to inform their on-call practice can be useful in understanding where tracing would be beneficial to your monolith. For example, one common pattern we've seen with engineers who retrofit tracing into their monolithic applications is to use agents or other out-of-process services to capture log data and marshal it into trace data. This is a fairly low-touch method of adding tracing to a monolith, yes, but if the logs

you're using aren't valuable to begin with, then the trace data you generate is also going to be of limited value.
That said, what distinguishes tracing in a monolith from tracing in a microservice? First, and most critical, is your instrumentation methodology. Attempting to trace a monolith as a black box can be challenging to the point of futility. Many monolithic applications are characterized by a high level of concurrency and parallel processing on different threads or thread-shaped objects, requiring some level of white box introspection of what's happening inside the process itself. Ingress and egress operations may be extremely difficult to quantify due to the complexity of the monolith, especially if the monolith exposes multiple versions of an API that each support different RPC styles—consider a monolithic service that exposes a v1…vn API where each version adds and/or deprecates a particular RPC transport (SOAP, JSON over HTTP, Apache Thrift, etc.).

The Right Abstraction to Instrument
In general, divorced from whatever framework you're using, it's best to instrument at the layer of abstraction below the one you're trying to understand and inspect. Instrumenting below the layer of inspection gives you more visibility into your system with less effort. In addition, if you've instrumented below the layer you're inspecting, then it is generally easier to pull a context "up" in order to get more details in a particular service or part of your code than it is to push an existing context "down" into the underlying framework.

One strategy that can serve you well when instrumenting a monolithic service for tracing is to rely on agent-based approaches to inject instrumentation into the service framework layer. Consider the Spring Framework for Java—if your application is built on Spring, you’ll get more bang for the buck by instrumenting the Spring classes themselves as opposed to your own code. Conveniently enough, these popular frame‐ works often have open source instrumentation available, saving you the trouble of implementing it yourself. Framework instrumentation can get you started, and in some cases might be enough to understand the broad performance shape of requests, but will often need to be paired with some level of manual instrumentation to capture the nuances of your business logic. There aren’t many silver bullets here, to be blunt— you’ll need to carefully consider the structure of your application code, and how calls propagate through your service. You need to pay special attention to the intraprocess propagation of spans and traces inside your service, as there aren’t necessarily clean lines of separation like you’d get in a microservice architecture. Along those same lines, spend some time thinking about exactly what functions or intraprocess calls need to be their own spans, and which can be combined with a

common parent span. It may not be necessary, or desirable, to add a trace for every single function call in your monolith. It usually isn't!
Finally, it can be helpful to create a model of your monolith's various internal components and use those to orient your thinking around what "should be traced." Consider a simple ecommerce monolith that provides some sort of UI, an inventory component, an account management component, and an order management component (see Figure 2-10). In a monolith, each of these components may share code with others, and they may have very fuzzy boundaries in some cases, but they're the logical divisions available to you that can provide context to trace data. For example, it might be that some piece of shared functionality fails more often when called from the account management component versus the order management component—having trace data that corresponds to the component instead of the function will make it easier to identify the reason.

Figure 2-10. A shopping cart/ecommerce monolithic application.

There are a lot of things you'll want to do the same way as you would if you were tracing a microservice, though. You'll still want to capture semantic information about your service (or internal component) in each span, and you'll want to create attributes for hostname, IP address, etc. If you're using agents or framework-based instrumentation for your monolith, much of this work may be done for you. Ultimately, there's nothing that stops you from tracing monoliths or having trace data from monoliths be a part of a larger trace. In the worst case, where you don't have any ability to modify or introspect your existing process, you can always fall back on wrapping the monolith in some sort of request proxy and generating trace data through that—it's always better to have a span than no span.
Tracing in Web and Mobile Clients
Most of this book has been written from the implied perspective of a backend service, running in your datacenter or on a cloud provider. What about when your code doesn't run on something you control quite so neatly? As mentioned earlier, one of

the benefits of distributed tracing is that it allows for visibility into the complete, end-to-end life cycle of a request—not just a narrow slice of it. So, what does a distributed trace look like when it begins at the beginning?
Mobile computing through cell phones, tablets, and other small-scale devices has become the gateway to software for millions of people, and so we've seen a rise in client-side single-page applications and native mobile applications. You should consider how you're tracing these parts of your application, and be aware of some of their special requirements. For the sake of brevity, we'll refer to these client applications as frontend services for the rest of this section.
Instrumenting a frontend service looks an awful lot like instrumenting any other service, with the same sort of stipulations and watchwords—be sure to trace egress (so, any communication between your client and your server), add semantic tags, and so forth. The biggest question is one of verbosity. How much detail in your frontend trace data is too much? There's a wealth of information available in Web APIs through the Performance interface on the amount of time it takes to load resources (such as HTML, CSS, JavaScript, and image assets), how long redirects and other HTTP methods take to execute, and how long it takes to create the DOM and display it to an end user. This can be extremely valuable information to include in a trace of your frontend service. However, it's possible that it can be an overwhelming amount of data for consumers of your traces—both humans and machines. We'll discuss the trade-offs around collecting and storing trace data in later chapters, but suffice to say that there's some amount of overhead for creating, collecting, and storing trace data.
One option is, of course, to collect everything you can and leverage your analysis tools to cut through the chaff. This is becoming more popular as the sophistication of analysis tools increases. Another option is to create two separate traces—a trace that represents the work being done by the frontend client, such as performance and timing information around rendering the DOM or drawing UI elements, and a separate trace that represents the work being done by the frontend communicating with backend components, such as loading data from an API. These traces can be linked by a single, shared attribute. Consider the humans who are consuming your trace data in order to make a decision here—how much data do they need in order to understand the performance profile of a request, and what are they responsible for?
Verbosity isn't the only challenge in instrumenting frontend services, of course. Many of the others are challenges endemic to the nature of mobile or web services in general—inconsistent network availability, a significantly wider variance in the performance of end-user hardware, etc. Here are some common scenarios to watch out for when instrumenting frontend services:

Loss of WAN connectivity
Your instrumentation and reporting should check to make sure they can report spans before they start recording, and potentially NoOp (short for no operation) to avoid unneeded allocations and work if there's no chance that the trace data can be sent to your backend.
Unexpected loss of focus for your service
In general, background processes on mobile devices are constrained in what they can do or how long they can do it. Make sure that your tracer is listening for events that indicate a loss of focus, and hurry up to flush whatever trace data is available to your trace analyzer.
Debounce your spans
If your application has a button, someone is probably going to click it in frustration. You should be debouncing any sort of long-running request, so make sure that span creation is also debounced to avoid creating a lot of unnecessary spans (although it might be fun to count how many rage taps you get and monitor that!).
Be careful with PII
As privacy laws and regulations continue to develop, you should be extremely careful about what you're logging or recording from a device. We are not lawyers (and we're especially not your lawyers), so you should check the specifics with your legal counsel, but the European Union's General Data Protection Regulation (GDPR) asserts that users have a right to understand how you're using their personal data, to request access to that personal data, and to request that their data be deleted. In the interest of not having to delete swaths of telemetry data at the request of your end users, it would behoove you to simply not track personal data to begin with. Again, talk to a lawyer for details.
If you feel a bit lost, don't worry—we just dumped a lot on you. It might be useful to keep these pointers in mind and reference them after you've finished the book and have a better understanding of not only instrumentation, but also collection and analysis of trace data. In Chapter 3, we'll discuss various open source telemetry and tracing frameworks that you can use to address many of the challenges raised when tracing microservices, serverless, monolithic, or frontend services.


CHAPTER 3 Open Source Instrumentation: Interfaces, Libraries, and Frameworks

As a technology, tracing—even distributed tracing—isn’t brand-new. Developers have been building distributed systems in some form or another for decades and have turned to tracing solutions to understand these systems. One thing many of these sol‐ utions have in common, however, is that they tend to be very focused. Sometimes this focus is for a particular technology stack or language; sometimes it relies on the usage of a particular middleware provider; sometimes it’s just a home-rolled solution that’s been maintained for years by engineers who aren’t with your company anymore. More recently, we’ve seen this sort of behavior continue to proliferate aided by cloud vendors and other platform providers. Some might say, “Well, what’s the problem here?” After all, for a great many people, these proprietary or otherwise closed solutions work fine, or at least are providing value to their users. While that certainly can be the case, these solutions are often more brittle than they appear at first glance. The foremost argument against propriet‐ ary instrumentation is that it often leaves you at the mercy of the authors of the instrumentation solution when it comes to adapting your software for new languages, methodologies, and challenges caused by increasing scale or business requirements. Remember—one of the few constants in life, and especially in computing, is change. At the time of this writing, Microsoft is one of the largest contributors to open source software. If you went back in time 20 years and told someone that, they’d likely be shocked! Relying on the current state of vendors and cloud providers and popular languages or runtimes will invariably lock you into decisions that you might not want to be locked into. There is a solution to this dilemma, one that’s been widely adopted by users, vendors, and software authors—the open source model of instrumentation. What we mean, in

practice, is that rather than relying on languages and runtimes to provide the tools for instrumenting software, the community in general provides them. In this chapter, we'll discuss the current state of the art when it comes to these open source solutions, as well as some of the historical context and forerunner projects that you're likely to encounter in the wild. We'll also cover the API and methodology behind instrumenting software with these packages, such as OpenTelemetry.
The Importance of Abstract Instrumentation
Why is it important to embrace abstract instrumentation in the first place? It's a good question, especially if you're already using a platform or tech stack that provides some sort of tracing functionality, and are thinking about adopting it to monitor your services. Historically, this information has been available at several locations that tend to "make sense" (for example, at the boundaries of your services, or via some ingress layer such as a load balancer, web proxy, or other routing service). These trace identifiers, such as the X-Amzn-Trace-Id of Amazon's Elastic Load Balancer (ELB) or the fairly opaque tracing headers supplied by Microsoft's IIS (Internet Information Services), are able to provide a fairly thorough view into a single request as it moves through your services. That said, there are a few critical areas where these legacy tracing methods fall short.
The first test that these tracing methods fail is portability. Take, for example, IIS Request Tracing. It might not surprise you to learn that using this requires you to also use IIS as your web server/proxy, which also implies that your software runs on Windows Server. While we won't suggest that there's no software that runs (and generates perhaps a massive amount of business value) on Windows Server, we would also suggest that it's not quite as popular these days as it once was. The associated costs of this lock-in can be deleterious, however. You may delay or defer maintenance and upgrades to your monitoring, opening you up to security vulnerabilities. Advancements in monitoring platforms may provide you with improved insights and lower costs, but you may be unable to take advantage of them because you'll be locked into your existing instrumentation. Finally, even if you're generally happy with the level of control and insight that your proprietary instrumentation grants you, you may find it difficult to convince new team members of its greatness if they've learned to appreciate newer tools.
The second test that proprietary tracing methods fail is being adaptable to distributed applications. This is obvious when attempting to instrument a client-server relationship over some external link, such as the internet. Without some abstract instrumentation on both sides (the mobile application or client web application, and the backend server), you'll often have to resort to manual hacks to fit and transform data being generated by two potentially separate systems. Often, these proprietary systems will begin their traces at the ingress point rather than at the client, which can

segregate your data rather than provide a single, end-to-end view into a request. These systems often lack internal extensibility, which is to say that you may find it challenging or impossible to create subtraces of service functionality outside of the top-level HTTP request. Additionally, you'll find that these systems may struggle at instrumenting transport methods that aren't carried across HTTP, leaving you with a tangled mess of incompatible trace data being generated by your HTTP, gRPC, SOAP, or other carriers. This can stymie refactoring for performance or integration with new services and technologies.
The third and final test that these methods fail is they promote vendor lock-in by necessity or by design. Don't like the analysis tools available? Too bad! Don't care for a particular instrumentation API? You're stuck with it unless you wrap the provided API in your own. Even this can only do so much, depending on the underlying design of the system. One surprisingly common situation we've seen is that this lock-in can paralyze teams that are spinning up new services because there may be business reasons that prevent the allocation of additional analytic capacity. A popular pricing model for monitoring, for example, relies on the number of hosts or containers that are being monitored at any given time—if you're locked into a vendor with this pricing strategy, it can actually act as a damper on the amount of new instrumentation added to your application due to cost concerns. While this may seem like a savvy way to save a few bucks while developing a new and unproven idea, you don't know if this new service will take off (or, worse, become a sneaky failure-prone piece of code causing difficulties, headaches, and many late-night alerts for other teams).
Abstract, open source instrumentation solves these problems neatly. Your instrumentation is portable to any underlying operating system (OS) that supports the language you're writing it in—Linux, Windows, macOS, iOS, whatever! Since it's open source, even if it doesn't work right now, you have the ability to fork it and add support if required. Thankfully, the major open source instrumentation interfaces and frameworks have wide support for the majority of general purpose programming languages in use today…although if you wanted to port it to Perl, we're sure someone would be grateful. As you might expect from a widely supported framework, open source and abstract instrumentation are very adaptable to your changing requirements and needs. As your services decompose into smaller services, you'll find that abstract instrumentation fits in very neatly with your new service boundaries, regardless of how they're being written, deployed, or run. Abstract instrumentation makes it easier to integrate your instrumentation with other technologies as well, such as service meshes or container orchestration platforms. Not to mention, as you integrate new services into your distributed application, abstract instrumentation grants you the flexibility to share a common language of instrumentation across disparate teams. Finally, and most importantly, abstract instrumentation prevents vendor lock-in of analysis or instrumentation. Since the instrumentation APIs, trace data format, propagation

headers, wire format, and more are defined openly and publicly, you'll be able to use them (either directly or through a shim) with any sort of analysis system you can imagine. At the time of this writing, almost every major monitoring vendor supports at least one open source tracing format—and usually more—thus allowing you the flexibility of write once, run anywhere with your instrumentation code.
Now that we've gone over the benefits of these instrumentation frameworks, let's take a look at the most popular ones available, starting with the newest: OpenTelemetry.
OpenTelemetry
Writing an instrumentation library is, in a word, difficult. While the actual process of collecting and generating telemetry from a service is conceptually fairly straightforward, implementing that process in a highly performant way that also can get buy-in from a diverse group of users is extremely challenging. The foremost reason is that however much any two pieces of software have in common, they're likely to be very different. While this might be somewhat reductionist, it's a useful thing to keep in mind when discussing the challenges of writing instrumentation libraries for a general audience. However, there are several good reasons that you might prefer a general-purpose instrumentation library over a bespoke one:

• A general-purpose library is more likely to be performant in the general use case.
• The authors of the general-purpose library are more likely to have considered edge cases and other situations.
• Using a general-purpose library can save you months of maintenance headaches that come from adapting, extending, and using a bespoke one.

As your software becomes more complex, and development cycles become more strained, the rationale for the general-purpose library grows. You probably don’t have the time or desire to implement your own telemetry collector or API. You might not have the expertise to create or maintain bespoke telemetry libraries in every language used by your application, or you may find the organizational dynamics of creating an internal standard insurmountable. Finally, you probably don’t want to have to rein‐ vent the wheel in terms of generating telemetry data from your dependencies—RPC frameworks, HTTP libraries, etc. OpenTelemetry solves these problems and a host of others for you. The primary goal of OpenTelemetry is to provide a single set of APIs, libraries, agents, and collectors that you can use to capture distributed tracing and metric telemetry from your appli‐ cation. In doing so, OpenTelemetry imagines a world where portable, high-quality telemetry data is a built-in feature of cloud native software.

34 | Chapter 3: Open Source Instrumentation: Interfaces, Libraries, and Frameworks OpenTelemetry was formally announced in May 2019 as the next major release of both OpenTracing and OpenCensus. These two projects had similar goals but differ‐ ent ways of achieving them. The seeds for OpenTelemetry were planted in the fall of 2018 in several wide-ranging Twitter threads that crystallized the major stumbling block facing both projects—the appearance of a “standards war” between the two. Open source project authors, seeing that there were two incompatible standards for instrumentation, would defer adding tracing to their libraries and frameworks in the absence of consensus about which they should focus on. (For more information, see “OpenTracing” on page 43 and “OpenCensus” on page 48.) These and other disagreements led to back-channel negotiations and discussions between the founders of each project along with a neutral mediator. A small technical team was formed to prototype a merged API, which became the initial prototype of OpenTelemetry. Spring 2019 saw work continue on the prototype, along with efforts to codify the new governance structure—taking lessons from other successful open source projects, like Kubernetes. After the announcement in May, contributors from a wide variety of companies, including Microsoft, Google, Lightstep, and Datadog, worked in concert to formalize the specification, application programming interface (API), software development kit (SDK), and other components. Which brings us to now. OpenTelemetry, at the time of this writing, is still in an alpha phase. The project anticipates a beta by the time you’re reading this, but as is the nature of open source, the timeline may change. With that in mind, we’ll focus this section mostly on the distributed tracing components in OpenTelemetry, along with critical parts of its design. OpenTelemetry comprises three major components. These are the API, the SDK, and the Collector. These components implement the OpenTelemetry specification and data model, and are designed to be interoperable and composable with each other. What does this mean? In short, parts of these components can be “swapped out” with differing implementations, as long as those reimplementations conform to the speci‐ fication and the data model. As Figure 3-1 shows, there are two main parts of any given OpenTelemetry library. The API contains the interfaces required for writing instrumentation code along with a minimal (or NoOp) implementation of the SDK. Generally, the API will be pack‐ aged with the SDK, which implements the core functionality of the API such as man‐ aging span state and context, serializing and deserializing span context from the wire, and other features. External to the SDK are exporters, plug-ins that translate and transmit the OpenTelemetry trace data to a suitable backend service for analysis. Each of these components is decoupled from the others; you can use the API with no SDK (see Figure 3-2), for example, or selectively reimplement parts of the SDK.

Figure 3-1. A generic OpenTelemetry library.

Figure 3-2. Minimal implementation of OpenTelemetry.
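The minimal configuration in Figure 3-2 is exactly what a library author targets: depend only on the API, and let the hosting application decide whether an SDK is installed. The following is a rough, illustrative sketch rather than canonical code—it reuses the alpha-era Java API calls that appear in Example 3-1 later in this chapter (OpenTelemetry.getTracerFactory()), and the class and method names (OptimizedSearch, find) are invented for the example. The rationale is explained in more detail next.

import io.opentelemetry.OpenTelemetry;
import io.opentelemetry.trace.Span;
import io.opentelemetry.trace.Tracer;

import java.util.List;

// A library class that depends only on the opentelemetry-api artifact.
// If the hosting application registers an SDK, these calls emit real spans;
// otherwise they resolve to the built-in minimal (NoOp) implementation.
public final class OptimizedSearch {
  private static final Tracer TRACER =
      OpenTelemetry.getTracerFactory().get("com.example.optimized-search");

  public int find(List<String> haystack, String needle) {
    Span span = TRACER.spanBuilder("OptimizedSearch.find").startSpan();
    try {
      // The (hypothetical) search logic being traced.
      return haystack.indexOf(needle);
    } finally {
      span.end();
    }
  }
}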

If you're writing code that will primarily be used as a library, then you would use the API by itself. If you're writing a service that will run—either alone or in concert with other services—you'd use the API and SDK.

The rationale is fairly simple. Let's say you're writing a library that performs some convenient function, like an optimized search on some set of data. Your users want to be able to trace your code—giving them insight into the number of iterations it requires to find the desired search term, for example. You can accomplish this very easily by adding tracing instrumentation and creating either a single span that represents the work being done by your library, or multiple spans representing each iteration (if you parallelized the algorithm in some way, this latter approach might be more useful). However, you want your library to be fairly performant and to have few external dependencies; after all, it's an optimized piece of code. In this case, you'd only take a dependency on the OpenTelemetry API package. When someone whose application uses the OpenTelemetry SDK imports your library, your library automatically swaps over to using the full implementation rather than the minimal one, allowing your end users to trace the activity of your code.

The OpenTelemetry API provides three major things: distributed context propagation and management, application tracing, and application metrics. We'll focus on the first two in this book.

First, tracing. The primary building block of spans in OpenTelemetry is the Tracer. The Tracer provides methods for creating and activating new Span objects along with the ability to track and manage the active Span in the process context. Each Tracer is configured with Propagator objects, which allow for transferring the span context across process boundaries. The API provides a TracerProvider, which allows for the creation of new Tracer objects, each of which has a required name and an optional version. OpenTelemetry refers to this concept as named tracers, which act as a namespacing mechanism for multiple logical components inside a single process. For example, if you were to instrument a service that used an HTTP framework to communicate with other services, you would name the tracer for your service's business logic something like myService, while instrumentation for the HTTP framework would be named opentelemetry.net.http. This helps prevent collisions in span names, attribute keys, and other values. Optionally, you can assign a version string to your tracer that should correspond to the version of the instrumentation library itself (i.e., semver:1.0.0).

Each tracer has to provide three methods—getting the current span, creating a new span, and activating a given span as the current one. In addition, it should provide methods to configure other important tracing components, like propagator objects. When creating a new span, the tracer will first check whether there's an active span and, if so, create the new one as its child. A span or span context can also be provided explicitly as the parent when creating a new span.

Each span is required to contain a span context, which is an immutable data structure that contains identifiers for the trace and span, along with other flags and state values:

TraceID: A 16-byte array with at least one nonzero byte.

SpanID: An 8-byte array with at least one nonzero byte.

TraceFlags: Details about the trace. Present in all traces, unlike Tracestate.

Tracestate: System-specific configuration data, which allows for multiple tracing systems to participate in the same trace.1

IsValid: A Boolean flag, which returns true if the TraceID and SpanID are valid (nonzero).

IsRemote: A Boolean flag, which returns true if the span context was propagated from a remote parent.

The span is a data structure that represents a single operation in a trace. Each trace contains a root span that represents the end-to-end latency of a request and, optionally, subspans that correspond to suboperations. The span encapsulates information like the name of the operation, its span context, a parent span, the start and end timestamps of the operation, a map of attributes, links to other spans, a list of events with timestamps, and a status.

Some of these are more self-explanatory than others. The start time of a span should be set to the time that it was created, but you can override this with an arbitrary timestamp as well. Once a span has been created, you can change its name, set attribute keys and values, and add links to other spans and events—but only before the span has finished. Once an end time has been set, these values become immutable. Since the span is not intended to propagate information inside a process, implementations should not provide access to span fields other than the span context.

There are a few new concepts here, so we'll break them down. The name of the span is required when creating a new span and is one of the only absolutely required parameters. Depending on the implementation, a span may be automatically created as the child of the current active span, but you also have the option to indicate that it should be a new root span. The span kind field is used to

1 For more details, see the W3C documentation.

38 | Chapter 3: Open Source Instrumentation: Interfaces, Libraries, and Frameworks describe the relationship between a given span and its parents and children in the trace. The two properties that it describes are if the span is the parent or child of a remote operation and if the span represents a synchronous call. A single span should only have a single span kind in order for the field to be meaningful to analysis sys‐ tems. See Table 3-2 for a complete description of span kind values. Attributes are a collection of key-value pairs that can be created either at span cre‐ ation time or during the lifetime of the span. In general, you want to set known attributes at span creation. Links are between arbitrary amounts of spans that have some causal relationship. They can exist between spans in a single trace or across multiple traces. When would you use link objects? First, you may want to use them to represent batch operations, where a single span was initiated from multiple incoming spans, each representing a single item in the batch. Additionally, a link can declare the relationship between an originating trace and a following trace. Consider a trace entering a trusted boundary of a service such as remote client code, like a web browser, and being forced to generate a new trace rather than relying on the incom‐ ing context. The root span in the new trace would be linked to the old trace. Finally, as mentioned earlier, the start and stop timestamps are required but are gen‐ erally automatically generated. You can tell the API to create a span with an arbitrary start and stop timestamp, which is useful if you’re creating some sort of proxy that transforms existing telemetry data (such as a log file) into a trace. Once you’ve created a tracer and a span, what can you do with them? The API provides several required methods:

• Get the SpanContext.
• Check whether the span IsRecording information.
• SetAttributes on the span.
• AddEvents to the span.
• SetStatus of the span.
• UpdateName of the span.
• End the span.

While some of these are fairly self-explanatory, such as ending the span, others are more nuanced. IsRecording is one of these—this method returns a Boolean value indicating whether the span is recording events, attributes, etc. The intended design of this flag is to avoid potentially expensive computation of attributes or events when a span is not being recorded. An interesting wrinkle to this is that the flag is inde‐ pendent of the sampling decision of the trace. An individual span may record events even if the trace that it is a member of has been sampled out (based on flags in the span context). The rationale is that you may want to record and process the latency of

OpenTelemetry | 39 all requests with instrumentation while sending only a subset of the instrumented requests to the backend. SetStatus lets you modify the status of a span operation. By default, a span will have a status of Ok, indicating that the operation that the span rep‐ resents completed successfully. You can see a full list and description of valid statuses (which will look familiar if you’ve used gRPC) in Table 3-1 and Table 3-2. Keep in mind that you can also create your own status codes through the API, which would be useful for creating statuses that map to your own RPC system.
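As a rough sketch of how these methods fit together, using the same alpha-era Java API as Example 3-1 (the exact method names and the Java surface of the Status type varied across alpha releases, and tracer, order, process(), and buildSummary() are hypothetical placeholders):

Span span = tracer.spanBuilder("process-order").startSpan();
try {
  if (span.isRecording()) {
    // Only pay for expensive attribute computation when the span is recording.
    span.setAttribute("order.summary", buildSummary(order));
  }
  process(order);
  span.setStatus(Status.OK);
} catch (Exception e) {
  // Record the failure before rethrowing.
  span.setStatus(Status.INTERNAL.withDescription(e.getMessage()));
  throw e;
} finally {
  span.end();
}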

Table 3-1. OpenTelemetry span status canonical codes

Ok: The operation completed successfully.
Cancelled: The operation was cancelled (typically by the caller).
Unknown: An unknown error occurred.
InvalidArgument: The client specified an invalid argument. Differs from FailedPrecondition, as this indicates that the arguments were invalid regardless of the system state.
DeadlineExceeded: The deadline (timeout) expired before the operation could complete.
NotFound: The entity requested could not be found.
AlreadyExists: The entity already exists (if we were trying to create it).
PermissionDenied: The caller was authenticated, but did not have permission to execute the desired operation.
ResourceExhausted: Some resource was exhausted, such as an API rate limit, per-user quota, or a physical resource like disk space.
FailedPrecondition: The operation was rejected because the system is not in the appropriate state for the execution of the requested operation.
Aborted: The operation was aborted.
OutOfRange: The operation was attempted outside a valid range. Unlike InvalidArgument, this error indicates the problem may be fixed if the system state changes.
Unimplemented: The requested operation is not implemented or supported in this service.
Internal: An internal error occurred.
Unavailable: The requested service is unavailable.
DataLoss: Unrecoverable data loss or corruption occurred.
Unauthenticated: The request is not valid due to invalid or missing authentication credentials.

Table 3-2. OpenTelemetry SpanKind reference

CLIENT: This span represents a synchronous request to a remote service; it is the parent of its associated SERVER span. (Asynchronous: False)
SERVER: This span represents a synchronous request from a client on the remote service; it is the child of its associated CLIENT span. (Asynchronous: False)
PRODUCER: This span represents the parent of an asynchronous request. It is expected to end before its associated CONSUMER span. (Asynchronous: True)
CONSUMER: This span represents the child of an asynchronous request. It is the child of an associated PRODUCER span. (Asynchronous: True)
INTERNAL: This span does not represent any RPC; instead, it is an internal operation within a service and has no interaction with remote parents or children. (Asynchronous: n/a)
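In the Java API, the span kind is typically set when the span is built. A brief sketch, assuming the alpha-era API used in this chapter, where the enum lived at Span.Kind (its location and naming may differ in later releases):

// A span describing an outbound call to a remote pricing service.
Span clientSpan = tracer.spanBuilder("GET /price")
    .setSpanKind(Span.Kind.CLIENT)
    .startSpan();
// ... perform the request ...
clientSpan.end();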

SetAttributes allows you to add key-value pairs to a span. These attributes, which are also commonly referred to as tags, are the primary method of aggregating and indexing spans in a backend analysis system. Attribute keys must be strings, and attribute values may be strings, Boolean, or numeric values. If you try to set an attribute that already exists, the new value will overwrite the existing one. AddEvents allows you to add timestamped events to a span. An event is roughly analogous to a log statement. These events can also have attributes, allowing you to make structured event data. That was a lot of ground to cover, so let’s illustrate with Example 3-1.

Example 3-1. Attributes

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.opentelemetry.OpenTelemetry;
import io.opentelemetry.exporters.jaeger.JaegerGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.export.SimpleSpansProcessor;
import io.opentelemetry.trace.Span;
import io.opentelemetry.trace.Tracer;

public class OpenTelemetryExample {
  // Get a tracer from the tracer factory
  private Tracer tracer = OpenTelemetry.getTracerFactory()
      .get("OpenTelemetryExample");
  // Export traces to Jaeger
  private JaegerGrpcSpanExporter jaegerExporter;
  private String ip;
  private int port;

  public OpenTelemetryExample(String ip, int port) {
    this.ip = ip;
    this.port = port;
  }

  private void setupJaegerExporter() {
    // Set up a gRPC channel to export span data to Jaeger
    ManagedChannel jaegerChannel = ManagedChannelBuilder.forAddress(ip, port)
        .build();
    // Build the Jaeger exporter
    this.jaegerExporter = JaegerGrpcSpanExporter.newBuilder()
        .setServiceName("OpenTelemetryExample")
        .setChannel(jaegerChannel)
        .setDeadline(30)
        .build();
    // Register the Jaeger exporter with the span processor on our tracer
    OpenTelemetrySdk.getTracerFactory()
        .addSpanProcessor(SimpleSpansProcessor.newBuilder(this.jaegerExporter)
            .build());
  }

  private void makeSpan() {
    // Generate a span
    Span span = this.tracer.spanBuilder("test span").startSpan();
    span.addEvent("about to do work");
    // Simulate some work happening
    doWork();
    span.addEvent("finished doing work");
    span.end();
  }

  private void doWork() {
    try {
      Thread.sleep(1000);
    } catch (InterruptedException e) {
    }
  }

  public static void main(String[] args) {
    OpenTelemetryExample example = new OpenTelemetryExample("localhost", 14250);
    example.setupJaegerExporter();
    example.makeSpan();

    // Wait for things to complete
    try {
      Thread.sleep(1000);
    } catch (InterruptedException e) {
    }
  }
}

All spans created by this tracer will be prefixed with the name you enter here.

There are other span processors available in OpenTelemetry—this one sends each span as it finishes. Alternatively, you can use a batching processor that sends groups of spans on some time interval.

You could also add attributes or other metadata here.
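For reference, swapping in the batching processor mentioned in the note above is a one-line change at registration time. This is a sketch against the same alpha-era SDK used in Example 3-1; the batching class was named BatchSpansProcessor there (later releases renamed it), and its builder exposes batch size and scheduling options that are omitted here.

// Instead of SimpleSpansProcessor, register a batching processor that
// queues finished spans and exports them in groups.
OpenTelemetrySdk.getTracerFactory()
    .addSpanProcessor(
        BatchSpansProcessor.newBuilder(this.jaegerExporter).build());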

As you can see, the basics are pretty straightforward. Create a tracer, register an exporter, then create spans. You’ll need to do a bit more to instrument your actual services, however (more about instrumenting a real service in Chapter 4). What haven’t we discussed about OpenTelemetry? First, we didn’t touch on the metrics component, since this text is focused on distributed tracing. Second, we didn’t discuss the distributed context components of OpenTelemetry in detail—you can read more about those in Appendix B. OpenTelemetry is the new standard for instrumenting your code for distributed trac‐ ing. Its broad base of support from major cloud and observability vendors ensures that it will have the necessary resources for maintenance and improvements over time, and it’s expected to gain adoption rapidly in existing and new open source frameworks and libraries. We’ll take you through its predecessors now and help you understand not only the differences between them and OpenTelemetry, but also the similarities. OpenTracing and OpenCensus OpenTracing and OpenCensus are both highly successful open source projects and have been broadly adopted by developers for instrumenting their distributed systems. In this section, we’ll discuss the specifications and APIs of these frameworks and some of the drawbacks that led to the creation of OpenTelemetry. OpenTracing OpenTracing was launched in 2016 with the goal of fixing the broken state of tracing instrumentation.2 While large tech companies like Google had used distributed trac‐ ing for over a decade, overall adoption was slow. The OpenTracing authors saw this as a failure at the point of instrumentation—the wide variety of processes that a request would pass through all required instrumentation to interoperate, and the existing instrumentation options would necessarily bind you to a specific tracing vendor. As we’ve mentioned, the trace context must remain unbroken through the entire request in order to provide end-to-end visibility. This was the primary rationale behind OpenTracing: to provide a standard mechanism for instrumentation that

2 [Sig16]

OpenTracing and OpenCensus | 43 wouldn’t bind any particular package, library, or service author to a particular tracing vendor. Prior efforts in this space focused on standardization of data formats and context encoding rather than APIs to manage spans and propagation of the trace context between services. While this could be useful, the authors determined that it wasn’t required for the widespread adoption of distributed tracing. Indeed, the problem was (and, in large part, still is) that the point of instrumentation matters a great deal. Instrumenting your application code and business logic can be useful, yes—but instrumenting the middleware and the frameworks your application relies on can be much more valuable: you benefit from the instrumentation without additional effort during development and can extend the instrumentation into your business logic easily. How did the authors seek to accomplish this? To achieve the goal of vendor neutrality, OpenTracing could not be overly opinionated about data for‐ mats, context propagation encoding, or other factors. Instead, they built a semantic specification that was portable across programming languages and provided an inter‐ face package that others could implement. The overall design can be seen in Figure 3-3.

Figure 3-3. The OpenTracing ecosystem design.3

What does the OpenTracing API look like? It’s primarily focused on span and context management; it has a fairly constrained API surface. There are three main objects defined in the OpenTracing API: tracer, span, and span context. We’ll discuss each of them in turn. A tracer is capable of creating spans and responsible for serialization and deserializa‐ tion of them across process boundaries. Tracers must satisfy all of the following requirements:

3 Figure based off of an image at OpenTracing.

• Start a new span.
• Inject a span context into a carrier.
• Extract a span context from a carrier.
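Concretely, those three tracer operations look roughly like the following in Java. This sketch assumes a registered tracer is available via GlobalTracer (as in Example 3-2); depending on your opentracing-java version, the map carrier class may be TextMapAdapter or the older TextMapInjectAdapter/TextMapExtractAdapter, and the header names produced depend entirely on the tracer implementation.

import io.opentracing.Span;
import io.opentracing.SpanContext;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapAdapter;
import io.opentracing.util.GlobalTracer;

import java.util.HashMap;
import java.util.Map;

Tracer tracer = GlobalTracer.get();

// 1. Start a new span.
Span clientSpan = tracer.buildSpan("fetch-user").start();

// 2. Inject its span context into a carrier (here, a map of outgoing HTTP headers).
Map<String, String> headers = new HashMap<>();
tracer.inject(clientSpan.context(), Format.Builtin.HTTP_HEADERS,
    new TextMapAdapter(headers));

// 3. On the receiving service, extract the context and use it as the parent.
SpanContext parentCtx = tracer.extract(Format.Builtin.HTTP_HEADERS,
    new TextMapAdapter(headers));
Span serverSpan = tracer.buildSpan("handle-request").asChildOf(parentCtx).start();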

A span started by a tracer must have a name—a human-readable string that repre‐ sents the work being done by the span. The specification prescribes that the “opera‐ tion name should be the most general string that identifies a (statistically) interesting class of Span instances.” As we’ve mentioned, this is because the name is the primary aggregation key for your traces. A span may also be created with references to other span context objects, an explicit starting time, and key-value pairs of tag data. Span contexts are also operated on by the tracer for context propagation purposes through the inject and extract methods. First, we should define what a carrier is: a data structure that “carries” the encoded span context, such as a text map or blob of binary data. OpenTracing requires three formats for injection and extraction: text map, HTTP headers, and binary. Text map and HTTP headers are very similar in that they’re both string-to-string maps, but HTTP headers require that both the keys and values satisfy RFC 7230. Binary is a sin‐ gle arbitrary blob of bytes that represent the span context. In practice, it is this last part—injecting and extracting a span context—that has caused much consternation in the distributed tracing community. As OpenTracing did not specify a data format for the context headers (in part due to significant exist‐ ing work in the space from projects such as Zipkin, which we’ll discuss in “Other Notable Formats and Projects” on page 53), several different keys are commonly seen in the wild. These include x-ot-span-context (a binary blob used by Envoy), the B3 headers (x-b3-TraceID, x-b3-spanid, etc.) and Jaeger’s uber-TraceID, which are both HTTP headers, and more (Jaeger is an OpenTracing tracer and trace analyzer). In addition, many organizations that had an existing tracing implementation would use or reuse their tracing headers and adapt them to the OpenTracing API. As described earlier, many of these issues are made moot by the adoption of W3C Trace‐ Context, which provides a universal standard for propagating trace state over the wire, but it’s likely that we’ll see these legacy headers in use for years to come. OpenTracing’s second primary object type is the span. A span implementation must satisfy the following requirements:

• Retrieve the span context of the span.
• Overwrite the name.
• Finish (or complete) the span.
• Set a tag on the span.
• Create a log message on the span.

OpenTracing and OpenCensus | 45 • Set an item in the span baggage. • Get an item in the span baggage.

Some of these methods are fairly self-explanatory, such as overwriting the name of the span. One important note is that after a span is finished, no methods other than retrieving the span context may be called on it. As with starting a span, finishing a span accepts an optional explicit timestamp—if unsupplied, the current time will be used. Logs are an interesting field on spans because they can accept an arbitrary value, as opposed to tags, which can only accept string, numeric, or Boolean value types. This means, for example, that complex objects can be logged and, subject to the capabili‐ ties of the trace analyzer or OpenTracing implementation, interpreted. Finally, we come to baggage. Baggage items are key-value pairs where both the key and value must be a string. Unlike tags or logs, baggage items are applied to the given span, its span context, and all span objects that directly or transitively reference that span. When you add baggage items, they’re attached to the span context rather than to the span itself, so when you inject that context and extract it on the other side of an RPC, the baggage has gone along for the ride and can be retrieved. Baggage is a pow‐ erful tool because it allows developers to easily pass values throughout their system, so it should be used with care. Some interesting applications of baggage are to pass values from a client system (such as client OS or application version) to a backend system where they can be used to apply more metadata to spans, or even for condi‐ tional logic statements in the backend such as selecting which method handles a given request. Finally, the span context. We’ve discussed it in this chapter and others, and its history in OpenTracing is complex. Originally, it only exposed a single method—an iterator for all baggage items. The authors left the actual implementation largely up to authors of tracers that implemented OpenTracing in a bid for compatibility. Over time, the specification was extended to offer accessors for the TraceID and SpanID (ToTraceID and ToSpanID, respectively), which would return a string representation of the trace and span ID values. This was not extended to every language prior to the develop‐ ment of OpenTelemetry, however, so it’s unlikely to be seen in the wild. For the most part, as far as the specification is concerned, a SpanContext is an opaque identifier. What does it look like in practice? Let’s go through a small example in Java (see Example 3-2).

Example 3-2. SpanContext

import io.jaegertracing.Configuration;
import io.jaegertracing.Configuration.ReporterConfiguration;
import io.jaegertracing.Configuration.SamplerConfiguration;
import io.jaegertracing.internal.JaegerTracer;
import io.jaegertracing.internal.samplers.ConstSampler;
import io.opentracing.Span;
import io.opentracing.util.GlobalTracer;

...

SamplerConfiguration samplerConfig = SamplerConfiguration.fromEnv()
    .withType(ConstSampler.TYPE)
    .withParam(1);

ReporterConfiguration reporterConfig = ReporterConfiguration.fromEnv()
    .withLogSpans(true);

Configuration config = new Configuration("helloWorld")
    .withSampler(samplerConfig)
    .withReporter(reporterConfig);

GlobalTracer.register(config.getTracer());

...

try (Span parent = GlobalTracer.get().buildSpan("hello").start()) {
  parent.setTag("parentSpan", true);
  parent.log("event", "hello world!");
  try (Span child = GlobalTracer.get().buildSpan("child").asChildOf(parent)
      .start()) {
    child.setTag("childSpan", true);
  }
}

Since OpenTracing doesn’t provide an implementation of the tracer, you’ll need to import the Jaeger packages.

The OpenTracing packages here are for the span and global tracer API.

Since OpenTracing doesn’t have a first-class sampling API, Jaeger provides it.

OpenTracing’s GlobalTracer provides a single instance of the tracer class (single‐ ton) to the process.

buildSpan takes a single argument, the name of the span.

OpenTracing in Java supports a try-with-resources pattern that can finish a span automatically when it goes out of scope. Automatic context management in the Java tracer implicitly forms a parent-child relationship between these two spans.
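One part of the API that Example 3-2 doesn't exercise is baggage. A minimal sketch, assuming the same Jaeger-backed GlobalTracer setup; the key and value are arbitrary, and downstreamSpan is a hypothetical stand-in for a span created from the extracted context on another service:

// On the client side: attach a baggage item to the span (and thus its span context).
Span span = GlobalTracer.get().buildSpan("checkout").start();
span.setBaggageItem("client.version", "1.2.3");

// After inject/extract, spans downstream in the same trace can read it back.
String clientVersion = downstreamSpan.getBaggageItem("client.version");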

As you can see, the API for OpenTracing is fairly small—too small, ironically enough. What drove adoption of OpenTracing was also what made it hard to use, in many ways. A common scenario for a new developer who had heard about distributed

OpenTracing and OpenCensus | 47 tracing was to discover the OpenTracing website and try to install an OpenTracing package for their language, only to find that it didn’t actually do anything. OpenTracing provided mock and NoOp tracers in each language for the benefit of testing and validation, but there was no simple or easy way to “get started” without first understanding the design of the library. In addition, some of the trade-offs made for the sake of simplicity turned out to be difficult for end users to cope with. Imple‐ menters of the OpenTracing API would often add nonstandard features that patched holes or added convenience for users, breaking some of the fundamental promises of vendor neutrality. OpenTracing also presented users with a tracing-only framework. There was no asso‐ ciated metrics API, for example, to allow for the recording of counters, gauges, or other common metric primitives. Users wanted more. OpenCensus In early 2018, Google released an open source version of its internal Census project, naming it OpenCensus. Census was designed under different circumstances and with different constraints than OpenTracing. The goal of the Census project was to pro‐ vide a uniform method for instrumenting and capturing trace and metric data from Google services automatically. The Census team built deep integrations into technol‐ ogies such as gRPC, affording any developer who used these technologies basic trac‐ ing and metrics for no additional work. The design of Census, thus, was extremely different from the thin API offered by OpenTracing. Census was an entire SDK for tracing and metrics, providing a full implementation in addition to the API, tightly coupled and deeply entwined with gRPC for activities such as context propagation. Open-sourcing Census was in many ways an effort to extend the existing Google tracing infrastructure to external users of Google services—since Google services such as Spanner were traced using Census, external requests that were also traced using Census could be connected seamlessly. In addition, maintaining a tracing and metrics framework and integrating it with a variety of tools and vendors can be extremely costly, making the economics of open- sourcing the project a win for Google. The fundamental design of OpenCensus differed significantly from OpenTracing, as Figure 3-4 shows. In addition to a tracing API, as mentioned, a metrics API was included. Context propagation could be handled automatically thanks to all imple‐ mentations using the same format for propagating trace context. It supported the automatic collection of traces and metrics from integrated frameworks along with a local viewer for this data (called “zPages”), making it more immediately useful out of the box. Finally, rather than relying on runtime-swappable implementations of its API to capture and export data to a trace analyzer, it provided a pluggable exporter model that allowed it to upload data to almost any backend. Like the design, the

48 | Chapter 3: Open Source Instrumentation: Interfaces, Libraries, and Frameworks specification is different—rather than focus on a few primitive types, it defines several different components that build upon each other, as shown in Figure 3-5.

Figure 3-4. OpenCensus ecosystem design.4

Figure 3-5. OpenCensus library design.5

The OpenCensus libraries are built on a base of in-process context propagation. Where OpenTracing left this mechanism as an exercise to the implementer, OpenCensus required explicit or implicit propagation of subcontexts inside a process. Where a language-supported generic context exists (such as Golang's context.Context), that implementation must be used. All other APIs are built on this generic context. The Tracing API is extremely similar, however, to the OpenTracing API in terms of the creation and construction of span objects. The SDK provides a tracer that spans can be started from (with a required name, as in OpenTracing), but the way you do this is different from OpenTracing. OpenCensus allows for the creation of root spans and child spans—spans that do or do not have a parent. When creating a span, the span may be attached to or detached from the underlying context. Unlike OpenTracing, there is no mechanism to modify

4 Figure based off of an image on GitHub 5 Figure based off of an image on GitHub

OpenTracing and OpenCensus | 49 the start or stop time of a span at creation or completion. There is also no explicit span context object; instead, SpanID and TraceID are fields set on the span itself. OpenCensus additionally defines several unique fields such as Status and SpanKind, semantic fields that describe the operation status (for example, OK, CANCELLED, and PERMISSION_DENIED) and its type (SERVER, CLIENT, or UNSPECIFIED). A full list of fields follows:

• SpanID
• TraceID
• ParentSpanID
• StartTime
• EndTime
• Status
• Link
• SpanKind
• TraceOptions
• Tracestate
• Time Events (Annotations and Message Events)

Of note is the Time Events field, which represents a collection of events that occurred during the lifetime of the span. Annotations contain both attributes (key-value pairs) as well as a log message. In addition to the Span API, OpenCensus provides an API to control sampling. In-process sampling allows you to record only a certain number of spans based on various conditions. Samplers can be configured on the global tracer, or set per span. There are four provided samplers:

• Always (always return true for a sampling decision)
• Never (always return false for a sampling decision)
• Probabilistic (random chance of returning true or false based on a rate, by default 1 in 10,000)
• RateLimiting (attempts to sample at a given rate over a time window, default of 0.1 traces/second)
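Both configuration points—the global default and a per-span override—look roughly like this in Java. This is a sketch: the rates shown are arbitrary, and Example 3-3 later in this section uses the same global TraceParams pattern with alwaysSample().

import io.opencensus.common.Scope;
import io.opencensus.trace.Tracing;
import io.opencensus.trace.config.TraceConfig;
import io.opencensus.trace.samplers.Samplers;

// Globally: sample roughly 1 in 10,000 traces.
TraceConfig traceConfig = Tracing.getTraceConfig();
traceConfig.updateActiveTraceParams(
    traceConfig.getActiveTraceParams().toBuilder()
        .setSampler(Samplers.probabilitySampler(1.0e-4))
        .build());

// Per span: force sampling for an especially interesting operation.
try (Scope scope = Tracing.getTracer().spanBuilder("checkout")
    .setSampler(Samplers.alwaysSample())
    .startScopedSpan()) {
  // ... do work ...
}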

In addition to the Tracing and Context APIs, OpenCensus defines a tagging API. These are used by the Data Aggregation API (part of the stats package) in order to configure how data is aggregated and broken down in views. Since the focus of this text is primarily on distributed tracing, we won’t dwell on this topic other than to

50 | Chapter 3: Open Source Instrumentation: Interfaces, Libraries, and Frameworks distinguish it from OpenTracing span tags. Other important differences are that OpenCensus deliberately elides many of the details about interprocess propagation from the spec, instead delegating it to specific propagator plug-ins (such as the go.opencensus.io/plugin/ochttp/propagation/b3 package). So we’ve talked about how the tracing API is shaped, but what does it look like in practice? Example 3-3 offers a Java sample:

Example 3-3. Tracing API

import io.opencensus.trace.AttributeValue;
import io.opencensus.common.Scope;
import io.opencensus.trace.Span;
import io.opencensus.trace.Status;
import io.opencensus.exporter.trace.zipkin.ZipkinTraceExporter;
import io.opencensus.trace.Tracer;
import io.opencensus.trace.Tracing;
import io.opencensus.trace.config.TraceConfig;
import io.opencensus.trace.config.TraceParams;
import io.opencensus.trace.samplers.Samplers;

import java.util.HashMap;
import java.util.Map;

...

ZipkinTraceExporter.createAndRegister(
    "http://localhost:9411/api/v2/spans", "tracing-to-zipkin-service");

TraceConfig traceConfig = Tracing.getTraceConfig();
TraceParams activeTraceParams = traceConfig.getActiveTraceParams();
traceConfig.updateActiveTraceParams(
    activeTraceParams.toBuilder().setSampler(
        Samplers.alwaysSample()).build());

Tracer tracer = Tracing.getTracer();
try (Scope scope = tracer.spanBuilder("main").startScopedSpan()) {
  System.out.println("About to do some busy work...");
  for (int i = 0; i < 10; i++) {
    doWork(i);
  }
}

...

private static void doWork(int i) {
  Tracer tracer = Tracing.getTracer();

  try (Scope scope = tracer.spanBuilder("doWork").startScopedSpan()) {
    Span span = tracer.getCurrentSpan();

    try {
      System.out.println("doing busy work");
      Thread.sleep(100L);
    } catch (InterruptedException e) {
      span.setStatus(Status.INTERNAL.withDescription(e.toString()));
    }

    Map<String, AttributeValue> attrs = new HashMap<>();
    attrs.put("use", AttributeValue.stringAttributeValue("demo"));
    span.addAnnotation("Invoking doWork", attrs);
  }
}

OpenCensus handles the work of creating traces, so you only need to import an exporter to an analysis system.

In order to export traces to an analysis system, you need to create and register the exporter. These options differ by analysis system.

Notice that the sampling process is handled by OpenCensus—you’ll still always sample in these test cases.

There’s no global tracer equivalent in OpenCensus (although some helper meth‐ ods exist), so you need to grab a reference to a tracer.

Similar to OpenTracing, OpenCensus supports try-with-resources patterns to automatically manage span life cycle.

This span is implicitly a child of the span in main, since it’s executed inside a scoped span.

Notice that annotations are roughly equivalent to OpenTracing logs, but with slightly different usage semantics.
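To make that comparison concrete, here is the same structured event written against both APIs, assuming a span from each library is already in scope (span for OpenCensus, otSpan for OpenTracing):

// OpenCensus: a timestamped annotation with typed attributes.
Map<String, AttributeValue> attrs = new HashMap<>();
attrs.put("cache.hit", AttributeValue.booleanAttributeValue(false));
span.addAnnotation("looked up user", attrs);

// OpenTracing: the roughly equivalent structured log fields.
Map<String, Object> fields = new HashMap<>();
fields.put("event", "looked up user");
fields.put("cache.hit", false);
otSpan.log(fields);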

As you can see, OpenCensus provides a “batteries included” experience that Open‐ Tracing lacked. However, this part of its design is what makes it unacceptable for some use cases. The inability to replace parts of the SDK with differing implementa‐ tions, for example, meant that it couldn’t find purchase in certain vendor ecosystems. Bundling metrics and tracing APIs also proved a difficult pill to swallow for implementers who wanted to use only one part of the OpenCensus package. Tight coupling between the API and SDK made integration challenging for third-party library authors, who didn’t necessarily want to have to ship the full SDK with their libraries. Ultimately, the biggest flaw was simply that the open source telemetry com‐ munity was split between two separate projects, rather than one single and unified effort.

52 | Chapter 3: Open Source Instrumentation: Interfaces, Libraries, and Frameworks While OpenTelemetry is the new and current standard for tracing instrumentation, it isn’t the only one. It’s extremely likely that you’ll continue to see OpenTracing and OpenCensus in the wild for years to come. In addition, you might encounter other technologies, instrumentation libraries, and propagation standards in the wild. We’ll briefly discuss three of the most popular ones next. Other Notable Formats and Projects Distributed tracing isn’t a completely new concept, it’s worth repeating. Very large- scale distributed systems have created the need for some way to correlate and track a request across multiple processes or servers. With that in mind, we’d like to briefly discuss a few of the other popular systems you might see and give you some resources on how to use them. X-Ray X-Ray is an Amazon Web Services (AWS) product that provides distributed tracing for applications running in the AWS ecosystem. One advantage of X-Ray is its deep integration into the AWS client SDK, allowing for seamless tracing of calls to a variety of AWS managed services. In addition, X-Ray provides a suite of analytical tools for the trace data, such as a trace visualizer and a service map. At a high level, X-Ray shares a lot in common with span-based tracing systems with a few differences in naming conventions. Rather than spans, X-Ray uses the term seg‐ ment to refer to a unit of work being traced. Segments contain information about the resource running an application such as the hostname, request/response details, and any errors that occurred during operation. In addition, segments can have arbitrary annotations and metadata added by developers to assist in categorizing and analyzing them. In lieu of using individual spans to capture work done inside a single request, X-Ray introduces a concept known as the subsegment, which captures detailed timing information about downstream calls, be they remote or internal. All segments for a single logical request are rolled up into a single trace, which you should be familiar with by now. X-Ray uses a proprietary tracing header, X-Amzn-TraceID, which is propagated by the X-Ray SDK and all other AWS services. This single key contains all information about the trace, such as its root trace identifier, sampling decision, and parent segment (if applicable). Functionally, X-Ray relies on a daemon process in conjunction with the X-Ray SDK to collect telemetry data. This daemon must be present or available to receive seg‐ ment data from your services, which it can then forward to the X-Ray backend for trace and display. To learn more about X-Ray, see its developer documentation.

Other Notable Formats and Projects | 53 Zipkin Twitter developed Zipkin and released it to the wider open source community in 2012. It’s notable for being one of the first popular implementations of Dapper-style tracing released under an open source , and many of the conventions that are supported by the wider distributed tracing world owe a debt to Zipkin for popularizing them. Overall, Zipkin includes a trace analysis backend, a collector/ daemon process, and client libraries and integrations with popular RPC and web frameworks. Much of the terminology used in Zipkin is portable to other tracing systems, owing to their shared heritage from the Dapper paper. A span is a single unit of work; a trace is a collection of spans; and so forth. One of the most enduring parts of Zipkin is the popularization of B3 HTTP headers as a defacto standard for passing trace context across the wire.6 These headers are effectively superseded by the W3C TraceContext specification, but it’s likely that you’ll see them in the wild—especially since they’re supported by Open‐ Telemetry as well. The critical B3 headers are as follows: X-B3-TraceID 64- or 128-bit hex string X-B3-SpanID 64-bit hex string X-B3-ParentSpanID 64-bit hex string (header absent if there is no parent) X-B3-Sampled Boolean of “1” or “0,” optional Interoperability and Migration Strategies In a sufficiently large organization, one of the most challenging parts of distributed tracing might be getting everyone to agree on a single standard. The relative ease of integrating tracing into a team’s services has made it very attractive for SRE and DevOps practitioners to implement. This ease of integration, however, hasn’t neces‐ sarily translated into ease of maintainability. Over the past decade, distributed tracing has gone from a niche technology employed by a select group of large, modern software enterprises to a necessary component of modern microservice architectures. Part of this growth has involved changes in

6 Why are they called B3 headers? The original internal name of Zipkin at Twitter was “BigBrotherBird.”

54 | Chapter 3: Open Source Instrumentation: Interfaces, Libraries, and Frameworks technology and tooling, with both proprietary and open source projects being announced, being adopted, growing, and eventually being eclipsed by newer projects which start the whole cycle over again. To combat this and be resilient in the face of future improvements and updates to the distributed tracing landscape, we need to develop ideas on how to maintain and upgrade distributed tracing systems. The first thing to consider is whether you’re looking for true interoperability between different tracing systems or trying to migrate and standardize around a single new system. We’ll discuss the interop case first. In general, traces are most useful when they can combine the entirety of a single logi‐ cal request as it moves through your system. However, sufficiently complex systems (or sufficiently bifurcated organizations) might not have a desire or ability to create the necessary conditions to trace requests through the entire call stack. Just because that’s the case today doesn’t mean it will be the case tomorrow or the day after that, though. The primary obstacle to interoperability is usually a lack of information about the different systems used already, and more specifically, the method those sys‐ tems use to propagate trace context. Your first step in achieving interoperability, then, should be to catalog the services that you’re aware of and note a few things about them:

• If traced, the header format used for propagating trace context
• RPC framework(s) they use to communicate with other services, and if those frameworks are transparently passing headers
• Existing tracing instrumentation libraries (Zipkin, X-Ray, OpenTracing, OpenCensus, etc.) that are direct or second-order dependencies
• Clients for other services, such as databases

Documenting these tracing dependencies will reveal what parts of your system can communicate with each other and where you might see gaps in instrumentation (for instance, if you have an RPC framework that does not forward incoming headers, then you’d see a trace break at that point). With this information, you can begin to make other decisions, such as what trace context header format makes the most sense for your environment. Even if you don’t have standard headers, there are approaches you can take in your instrumentation to support seamless context propagation. One popular method is to implement a propagator stack in your instrumentation library or RPC framework. This allows you to add new propagators while preserving support for existing ones. Example 3-4 illustrates creating this stack in OpenTracing.

Interoperability and Migration Strategies | 55 Example 3-4. Creating a propagator stack in OpenTracing public final class PropagatorStack implements Propagator { Format format; List propagators;

public PropagatorStack(Format format) { if (format == null) { throw new IllegalArgumentException("Format cannot be null."); } this.format = format; propagators = new LinkedList(); }

public Format format() { return format; }

public PropagatorStack pushPropagator(Propagator propagator) { if (propagator == null) { throw new IllegalArgumentException("Propagator cannot be null."); } propagators.add(propagator); return this; }

public SpanContext extract(C carrier) { for (int i = propagators.size() - 1; i >=0; i--) { SpanContext context = propagators.get(i).extract(carrier); if (context != null) { return context; } } return null; }

public void inject(SpanContext context, C carrier) { for (int i = 0; i < propagators.size(); i++) { propagators.get(i).inject(context, carrier); } } }

The propagator stack will stop extracting the span context once it finds its first match, but will inject headers for all registered propagators. You could theoretically modify this behavior to return multiple extracted contexts if you had multiple independent tracers in a given service. Standardizing tracing headers from the top down is usually the most successful strategy, however, rather than attempting to manage individual services’ and teams’ preferences.
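Using the stack then looks something like the following sketch. The B3Propagator and JaegerPropagator classes are hypothetical stand-ins for whatever propagator implementations your tracer provides (or that you write yourself), and how the finished stack is registered depends entirely on your tracer's configuration API.

// Accept and emit both B3 and Jaeger-style headers for HTTP carriers.
PropagatorStack<TextMap> httpPropagators =
    new PropagatorStack<>(Format.Builtin.HTTP_HEADERS);
httpPropagators
    .pushPropagator(new B3Propagator())
    .pushPropagator(new JaegerPropagator());

// Register httpPropagators wherever your tracer accepts a propagator/codec
// for the HTTP_HEADERS format.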

56 | Chapter 3: Open Source Instrumentation: Interfaces, Libraries, and Frameworks When standardizing or migrating, the calculus changes somewhat. You may still find it useful to do much of the documentation work mentioned earlier, but in service of estimating the effort required to perform a migration. If your existing tracing system is largely home-brewed—perhaps it uses custom headers or trace/span identifier for‐ mats—you’ll want to identify the number of services traced by that system versus the new one and see if you can create some sort of shim between your old code and the new code. Depending on the design of your existing tracing system, you may want to consider using OpenTelemetry as a standard API but rewrite parts of your library to meet its specification, plugging in those components in lieu of the reference SDK. If you’re already using OpenTracing or OpenCensus and simply want to migrate to OpenTelemetry, then you have several options as well. Do you want to use the Open‐ Telemetry SDK in lieu of your existing OpenTracing tracer? Then you’d need to use the OpenTelemetry bridge component to make it appear as an OpenTracing tracer to your existing instrumentation, and switch out the tracer in each of your services. You can achieve a more gradual migration by ensuring that you’re using compatible header formats (W3C, B3, or one through a custom propagator) in your old and new instrumentation, then deploying new services with OpenTelemetry and leaving your old services as is for now. As long as your trace analyzer supports ingesting traces from both frameworks, you should see a single trace containing spans from both your old and new services. Another migration strategy, especially useful if you’re migrating to a new platform— for example, containerizing existing services in order to run them on Kubernetes—is to replace your existing tracing with a more black box approach as a prelude to rein‐ strumenting the logic. By using the tracing features built into service meshes, for example, you can trace requests between your containerized services without replac‐ ing any existing instrumentation or filling in any instrumentation gaps. Over time, you can extend the spans into the service code and rip out any existing instrumenta‐ tion when it’s convenient, while getting the immediate benefit of seamless traces that extend across all of your newly containerized services. Why Use Open Source Instrumentation? Regardless of your strategy, the best way to ensure that your tracing system is main‐ tainable and extensible is using open source standards and frameworks. Proprietary, or home-brewed, tracing systems are almost always more difficult and costly to main‐ tain than something with broad community support. To conclude this chapter, we’ll discuss the rationale behind choosing open source instrumentation.

Why Use Open Source Instrumentation? | 57 Interoperability When implementing any sort of distributed telemetry, interoperability is a prime concern. You may have two services or two thousand, but unless there’s a way to guarantee that the telemetry data from any arbitrary service is compatible with the telemetry data from any other service, you’re going to have a bad time trying to understand performance across all of your services. Open source instrumentation addresses this by not only providing a single set of concepts and libraries for all of your services, but also allowing for telemetry capture that extends past the bound‐ aries of your business logic. The first case is the simpler one to understand—while it’s certainly possible to create your own request tracing system through correlation identifiers, this approach can be brittle and difficult to maintain as you scale. One of the benefits of open source solu‐ tions is that they make many of these decisions for you! You don’t have to sit down and decide whether you want a correlation ID based on a universally unique identi‐ fier (UUID) or a collision resistant unique identifier (CUID), for example. You can guarantee that each new service being instrumented is speaking the same language when it comes to context ID generation and propagation. This also avoids frustrating migration strategies when you extend your traces out to new endpoints or frontends; you can guarantee that identifiers for traces are consis‐ tently generated, allowing for seamless extension. Contrast this with more log-based approaches to tracing such as Distributed Diagnostic Context, which can work great as long as your entire system is relatively homogenous. Extending these through a polyglot system can involve a lot of time massaging logs into the same format, and that’s before you get into the challenges of retention and indexing that log data. The second case for the superior interoperability of OSS instrumentation is how it integrates into other OSS software and libraries. Projects such as OpenTelemetry, by virtue of being vendor-neutral, are attractive instrumentation options for other OSS projects that wish to provide telemetry data to their end users. You can look at Open‐ Tracing as an example of this: the OpenTracing registry indexes hundreds of integra‐ tions and plug-ins that instrument other OSS projects, from database clients to distributed messaging queues to network libraries, and more. These integrations allow you to get started quickly instrumenting a new, or existing, service and ensure that instrumentation can be extended into your business logic. The registry’s popular‐ ity led to OpenTelemetry adopting a similar registry. Keep in mind that OpenTeleme‐ try is broadly backward-compatible with OpenTracing instrumentation, so be sure to check both! Portability As the observability space continues to mature, there will inevitably be a growing amount of projects and vendors who provide the ability to analyze distributed tracing

58 | Chapter 3: Open Source Instrumentation: Interfaces, Libraries, and Frameworks data. With this in mind, it’s critical that your instrumentation be portable between different analysis systems. In short, you don’t want to have to rewrite your tracing code when you change the analysis tool you’re using to ingest that data. OpenTelemetry is an excellent example of this in practice. Your service emits teleme‐ try data to a collector running as an agent, and those agents are able to export to a variety of backends. This gives you the ability to write instrumentation once and have it work with no configuration changes in a variety of environments, and even send that telemetry to multiple endpoints simultaneously simply by changing the configu‐ ration of the collector. For example, you could run a local analysis tool and have a local agent collecting telemetry data during development—then, with no changes to the code, have that same telemetry data go to a commercial analysis backend. Ecosystem and Implicit Visibility To echo some earlier points, the ecosystem of an open source community is going to be extremely valuable to your instrumentation journey. We won’t belabor the basic value proposition of open source software in this text, but suffice to say many hands make light work. Realistically, for as much as we harp on it, writing instrumentation and other “main‐ tainability” code isn’t a priority outside of its immediate utility. When you’re writing a service, you’ll absolutely use as much logging as you can get away with in order to figure out why things aren’t working the way you expect them to, but how often do you go back in and delete “unnecessary” logging statements? I’d expect it’s more fre‐ quently than you think! It’s natural to think that the amount of telemetry you’re adding is too much because development cycles tend to be very granular, especially initially. This also applies to modifying code; one of the quickest ways to understand the control flow of a program is to add some simple print statements in an if state‐ ment and see which get output for a certain input value or control flow. When debug‐ ging issues, we tend to create “windows” that look into the code at a specific, narrow angle. Many of those windows remain, but they’re all too focused on specific, already existing problems to be much help in understanding overall performance. So, how do we resolve these two ideas? On one hand, we want to build observable software. On the other, we’re not sure what we should care about observing and don’t do a great job of looking at the right thing, especially ahead of time. Open source instrumentation, again, helps address this tension. First, it provides a rich ecosystem of existing instrumentation that we can rely on to trace the important things in our lower-level dependencies (such as RPC frameworks). This existing instrumentation is generally lightweight and easy to add to our service, and satisfies much of the boilerplate associated with distributed tracing. Second, it allows for implicit visibility into our requests through this ecosystem. If you’re using some sort of service mesh as part of your application, that service mesh is capable of creating and extending traces

Why Use Open Source Instrumentation? | 59 between all of your services, giving you implicit visibility into your entire backend system with no code overhead. If you start to combine this with other components, such as client-level tracing, RPC framework tracing, DB client tracing, and so forth, then you’ll gain implicit visibility into your entire application. Exploiting the OSS ecosystem is an excellent strategy to quickly bootstrap useful information about your system. However, there’s more to distributed tracing than simply throwing a bunch of libraries at the problem and seeing what sticks—you’ll want to move from implicit to explicit visibility into your requests and call stack. You’ll want to create and use custom tags and attributes from your business logic in order to profile and understand what’s going on in your code and application. In Chapter 4, we’ll talk about some of the best practices for instrumenting your services and how you can supercharge your telemetry.

60 | Chapter 3: Open Source Instrumentation: Interfaces, Libraries, and Frameworks CHAPTER 4 Best Practices for Instrumentation

The first step of any journey is the hardest—including the journey of instrumenting your applications for distributed tracing. Questions pile upon questions: What should I do first? How do I know I’m doing things right? When am I done? Every application is different, but this chapter offers some general advice and strategies to create best practices for instrumenting applications. Best practices do not exist in a vacuum. The data your instrumentation generates will be collected by a trace analysis system, which will analyze it and process it. As the instrumenter, it’s critical that you provide it with the best data possible! We’ll first discuss an application that lacks instrumentation in order to ground our discussion. Then, we’ll talk about the first steps for instrumenting an existing applica‐ tion—looking at the nodes and edges—and some common ways to accomplish that. We’ll go over best practices for creating spans and the sort of information you’ll want to add to them. We’ll discuss how you’d use tracing as part of application develop‐ ment to validate that your architecture is working the way you expect it to work. Finally, we’ll give you some signals to let you know when you’ve hit “too much” instrumentation. Tracing by Example It’s a truism that the best way to learn is by doing. To help make sense of how you should instrument a microservices application for distributed tracing, it stands to rea‐ son that you must first have a microservices application. We’ve built a sample applica‐ tion that we will use to illustrate some techniques and best practices. In this section, we’ll describe how you can run the service on your computer in order to follow along with the examples provided, and demonstrate some basic principles of instrumenta‐ tion that can be applied more generally to instrument your own services.

61 Installing the Sample Application We've developed a small microservice application to demonstrate the important concepts required to instrument an application. To run it, you'll need an up-to-date version of the Go runtime and Node.js installed on your computer. You'll also need to download a copy of the source code for the application, which can be found at this GitHub repository—you can check it out using the Git version control software, or download and extract a zip archive of the files. Once you've got a local copy of the source files, running the software is fairly straightforward: in a terminal window, execute go run cmd/<service>/main.go from the microcalc directory to run each service. To run the client application, you'll need to execute npm install in the web subdirectory, then npm start. The application itself is a basic calculator with three components. The client is a web application for the browser written in HTML and JavaScript that provides an interface to the backend service. The next major component is an API proxy that receives requests from the client and dispatches them to the appropriate worker service. The final component, the operator workers, is a set of services that each receive a list of operands, perform the appropriate mathematical operation on those operands, and return the result. Adding Basic Distributed Tracing Before you add tracing, look at the code itself and how it functions. We'll look at the code in order—first, the web client, then the API service, and finally the workers. Once you have an understanding of what each piece of code does, it becomes easier to understand not only how to instrument the service, but why (see Figure 4-1).

Figure 4-1. The design of MicroCalc.

The client service is very straightforward—a simple HTML and JavaScript frontend. The HTML presents a form, which we intercept in JavaScript and create a XMLHttpRe quest that transmits data to the backend services. The uninstrumented version of this code can be seen in Example 4-1. As you can see, we’re not doing anything terribly

62 | Chapter 4: Best Practices for Instrumentation complicated here—we attach a handler to the form element and listen for the submit event that is emitted when the Submit button is pressed.

Example 4-1. Uninstrumented client service const handleForm = () => { const endpoint = 'http://localhost:3000/calculate' let form = document.getElementById('calc')

const onClick = (event) => { event.preventDefault();

let fd = new FormData(form); let requestPayload = { method: fd.get('calcMethod'), operands: tokenizeOperands(fd.get('values')) };

calculate(endpoint, requestPayload).then((res) => { updateResult(res); }); } form.addEventListener('submit', onClick) } const calculate = (endpoint, payload) => { return new Promise(async (resolve, reject) => { const req = new XMLHttpRequest(); req.open('POST', endpoint, true); req.setRequestHeader('Content-Type', 'application/json'); req.setRequestHeader('Accept', 'application/json'); req.send(JSON.stringify(payload)) req.onload = function () { resolve(req.response); }; }); };

Your first step when instrumenting this should be to trace the interaction between this service and our backend services. OpenTelemetry helpfully provides an instru‐ mentation plug-in for tracing XMLHttpRequest, so you’ll want to use that for your basic instrumentation. After importing the OpenTelemetry packages, you then need to set up your tracer and plug-ins. Once you’ve accomplished that, wrap your method calls to XMLHttpRequest with some tracing code, as seen in Example 4-2.

Tracing by Example | 63 Example 4-2. Creating and configuring your tracer

// After importing dependencies, create a tracer and configure it const webTracerWithZone = new WebTracer({ scopeManager: new ZoneScopeManager(), plugins: [ new XMLHttpRequestPlugin({ ignoreUrls: [/localhost:8090\/sockjs-node/], propagateTraceHeaderCorsUrls: [ 'http://localhost:3000/calculate' ] }) ] }); webTracerWithZone.addSpanProcessor( new SimpleSpanProcessor(new ConsoleSpanExporter()) ); const handleForm = () => { const endpoint = 'http://localhost:3000/calculate' let form = document.getElementById('calc')

const onClick = (event) => { event.preventDefault(); const span = webTracerWithZone.startSpan( 'calc-request', { parent: webTracerWithZone.getCurrentSpan() } ); let fd = new FormData(form); let requestPayload = { method: fd.get('calcMethod'), operands: tokenizeOperands(fd.get('values')) }; webTracerWithZone.withSpan(span, () => { calculate(endpoint, requestPayload).then((res) => { webTracerWithZone.getCurrentSpan().addEvent('request-complete'); span.end(); updateResult(res); }); }); } form.addEventListener('submit', onClick) }

Notice that we’re starting a new span here. This encapsulates our entire logical request from client to server; it is the root span of the trace.

64 | Chapter 4: Best Practices for Instrumentation Here we wrap our call to calculate, which will automatically create a child span. No additional code is required in calculate.

Run the page in web with npm start and click Submit with your browser console open—you should see spans being written to the console output. You’ve now added basic tracing to your client service! We’ll now look at the backend services—the API and workers. The API provider ser‐ vice uses the Go net/http library to provide an HTTP framework that we’re using as an RPC framework for passing messages between the client, the API service, and the workers. As seen in Figure 4-1, the API receives messages in JSON format from the client, looks up the appropriate worker in its configuration, dispatches the operands to the appropriate worker service, and returns the result to the client. The API service has two main methods that we care about: Run and calcHandler. The Run method in Example 4-3 initializes the HTTP router and sets up the HTTP server. calcHandler performs the logic of handling incoming requests by parsing the JSON body from the client, matching it to a worker, then creating a well-formed request to the worker service.

Example 4-3. Run method func Run() { mux := http.NewServeMux() mux.Handle("/", http.HandlerFunc(rootHandler)) mux.Handle("/calculate", http.HandlerFunc(calcHandler)) services = GetServices()

log.Println("Initializing server...") err := http.ListenAndServe(":3000", mux) if err != nil { log.Fatalf("Could not initialize server: %s", err) } } func calcHandler(w http.ResponseWriter, req *http.Request) { calcRequest, err := ParseCalcRequest(req.Body) if err != nil { http.Error(w, err.Error(), http.StatusBadRequest) return }

var url string

for _, n := range services.Services { if strings.ToLower(calcRequest.Method) == strings.ToLower(n.Name) { j, _ := json.Marshal(calcRequest.Operands) url = fmt.Sprintf("http://%s:%d/%s?o=%s",

Tracing by Example | 65 n.Host, n.Port, strings.ToLower(n.Name), strings.Trim(string(j), "[]")) } }

if url == "" { http.Error(w, "could not find requested calculation method", http.StatusBadRequest) }

client := http.DefaultClient request, _ := http.NewRequest("GET", url, nil) res, err := client.Do(request) if err != nil { http.Error(w, err.Error(), http.StatusInternalServerError) return } body, err := ioutil.ReadAll(res.Body) res.Body.Close() if err != nil { http.Error(w, err.Error(), http.StatusInternalServerError) return }

resp, err := strconv.Atoi(string(body)) if err != nil { http.Error(w, err.Error(), http.StatusInternalServerError) return }

fmt.Fprintf(w, "%d", resp) }

Let’s start at the edge of this service and find instrumentation for the RPC framework. In Example 4-4, since we’re using HTTP for communicating between services, you’ll want to instrument the HTTP framework code. Now, you could write this yourself, but it’s generally a better idea to look for open source instrumentation for these com‐ mon components. In this case, we can utilize the OpenTelemetry project’s existing othttp package to wrap our HTTP routes with tracing instrumentation.

Example 4-4. Using the OpenTelemetry project’s existing othttp package to wrap our HTTP routes with tracing instrumentation std, err := stdout.NewExporter(stdout.Options{PrettyPrint: true}) traceProvider, err := sdktrace.NewProvider( sdktrace.WithConfig(

66 | Chapter 4: Best Practices for Instrumentation sdktrace.Config{ DefaultSampler: sdktrace.AlwaysSample() } ), sdktrace.WithSyncer(std)) mux.Handle("/", othttp.NewHandler(http.HandlerFunc(rootHandler), "root", othttp.WithPublicEndpoint()) ) mux.Handle("/calculate", othttp.NewHandler(http.HandlerFunc(calcHandler), "calculate", othttp.WithPublicEndpoint()) )

Handle errors and such appropriately. Some code has been deleted for clarity.

First, we need to register an exporter to actually view the telemetry output; this could also be an external analysis backend, but we’ll use stdout for now.

Then, register the exporter with the trace provider and set it to sample 100% of spans.

What does this do for us? The instrumentation plug-in will handle quite a bit of “con‐ venience” tasks for us, like propagating spans from incoming requests and adding some useful attributes (seen in Example 4-5) such as the HTTP method type, response code, and more. Simply by adding this, we’re able to begin tracing requests to our backend system. Take special note of the parameter we’ve passed into our instrumentation handler, othttp.WithPublicEndpoint—this will slightly modify how the trace context from the client is flowed to our backend services. Rather than persisting the same TraceID from the client, the incoming context will be associated with a new trace as a link.
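If you want to see what that looks like in code, the sketch below starts a server-side span in a new trace and records the extracted client context as a link rather than as a parent. Note that this uses a more recent layout of the OpenTelemetry Go packages than the examples in this chapter, so treat the import paths and option names as assumptions about a current SDK release rather than the API shown elsewhere here.

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/trace"
)

// handleCalculate is a hypothetical handler: it starts a span in a fresh trace
// and attaches the caller's span context as a link instead of a parent.
func handleCalculate(ctx context.Context) {
    remote := trace.SpanContextFromContext(ctx) // context extracted from the incoming request
    tracer := otel.Tracer("api")
    _, span := tracer.Start(context.Background(), "calculate",
        trace.WithSpanKind(trace.SpanKindServer),
        trace.WithLinks(trace.Link{SpanContext: remote}),
    )
    defer span.End()
    // ... dispatch to the worker service ...
}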

Example 4-5. JSON span output

{ "SpanContext": { "TraceID": "060a61155cc12b0a83b625aa1808203a", "SpanID": "a6ff374ec6ed5c64", "TraceFlags": 1 }, "ParentSpanID": "0000000000000000", "SpanKind": 2, "Name": "go.opentelemetry.io/plugin/othttp/add", "StartTime": "2020-01-02T17:34:01.52675-05:00", "EndTime": "2020-01-02T17:34:01.526805742-05:00", "Attributes": [ { "Key": "http.host",

Tracing by Example | 67 "Value": { "Type": "STRING", "Value": "localhost:3000" } }, { "Key": "http.method", "Value": { "Type": "STRING", "Value": "GET" } }, { "Key": "http.path", "Value": { "Type": "STRING", "Value": "/" } }, { "Key": "http.url", "Value": { "Type": "STRING", "Value": "/" } }, { "Key": "http.user_agent", "Value": { "Type": "STRING", "Value": "HTTPie/1.0.2" } }, { "Key": "http.wrote_bytes", "Value": { "Type": "INT64", "Value": 27 } }, { "Key": "http.status_code", "Value": { "Type": "INT64", "Value": 200 } } ], "MessageEvents": null, "Links": null, "Status": 0, "HasRemoteParent": false,

68 | Chapter 4: Best Practices for Instrumentation "DroppedAttributeCount": 0, "DroppedMessageEventCount": 0, "DroppedLinkCount": 0, "ChildSpanCount": 0 }

In calcHandler, we’ll want to do something similar to instrument our outgoing RPC to the worker service. Again, OpenTelemetry contains an instrumentation plug-in for Go’s HTTP client that we can use (see Example 4-6).

Example 4-6. API handler

client := http.DefaultClient
// Get the context from the request in order to pass it to the instrumentation plug-in
ctx := req.Context()
request, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
// Create a new outgoing trace
ctx, request = httptrace.W3C(ctx, request)
// Inject the context into the outgoing request
httptrace.Inject(ctx, request)
// Send the request
res, err := client.Do(request)

This will add W3C tracing headers to the outgoing request, which can be picked up by the worker, propagating the trace context across the wire. This enables us to visu‐ alize the relationship between our services very easily, since spans created in the worker service will have the same trace identifier as the parent(s). Adding tracing to the worker services is equally straightforward because we’re simply wrapping the router method with the OpenTelemetry trace handler, as shown in Example 4-7.

Example 4-7. Adding the handler

// You also need to add an exporter and register it with the trace provider,
// as in the API server, but the code is the same
mux.Handle("/",
    othttp.NewHandler(http.HandlerFunc(addHandler), "add",
        othttp.WithPublicEndpoint()),
)

The instrumentation plug-ins handle a great deal of the boilerplate that we need to be concerned with in this and other languages—things like extracting the span context from the incoming request, creating a new child span (or a new root span, if appro‐ priate), and adding that span to the request context. In the next section, we’ll look at how we can extend this basic instrumentation with custom events and attributes from our business logic in order to enhance the utility of our spans and traces.

Tracing by Example | 69 Custom Instrumentation At this point, we’ve got the critical parts of tracing set up in our services; each RPC is traced, allowing us to see a single request as it travels from our client service to all of our backend services. In addition, we have a span available in our business logic, car‐ ried along the request context, that we can enhance with custom attributes or events. What, then, shall we do? In general, this is really up to you, the instrumenter. We’ll discuss this in more detail in “Effective Tagging” on page 80, but it’s helpful to add custom instrumentation for a few things in your business logic—capturing and log‐ ging error states, for example, or creating child spans that further describe the func‐ tioning of a service. In our API service, we’ve implemented an example of this by passing the local context into a different method (ParseCalcRequest), where we cre‐ ate a new span and enhance it with custom events as shown in Example 4-8.

Example 4-8. Enhancing a span with custom events var calcRequest CalcRequest err = tr.WithSpan(ctx, "generateRequest", func(ctx context.Context) error { calcRequest, err = ParseCalcRequest(ctx, b) return err })

In Example 4-9, you can see what we’re doing with the passed context—we get the current span from the context and add events to it. In this case, we’ve added some informational events around what the function actually does (parsing the body of our incoming request into an object), and changing the span’s status if the operation failed.

Example 4-9. Adding events to the span func ParseCalcRequest(ctx context.Context, body []byte) (CalcRequest, error) { var parsedRequest CalcRequest

trace.CurrentSpan(ctx).AddEvent(ctx, "attempting to parse body") trace.CurrentSpan(ctx).AddEvent(ctx, fmt.Sprintf("%s", body)) err := json.Unmarshal(body, &parsedRequest) if err != nil { trace.CurrentSpan(ctx).SetStatus(codes.InvalidArgument) trace.CurrentSpan(ctx).AddEvent(ctx, err.Error()) trace.CurrentSpan(ctx).End() return parsedRequest, err } trace.CurrentSpan(ctx).End() return parsedRequest, nil }

70 | Chapter 4: Best Practices for Instrumentation Now that you have a basic handle on how to add instrumentation to an application, let’s step back a bit. You might be thinking that “real” applications are obviously more complex and intricate than a purpose-built sample. The good news, however, is that the basic principles we learned and implemented here are generally applicable to instrumenting software of any size or complexity. Let’s take a look at instrumenting software and how to apply these basic principles to microservice applications. Where to Start—Nodes and Edges People tend to start at the outside when solving problems—whether they’re organiza‐ tional, financial, computational, or even culinary. The easiest place to start is at the place that’s closest to you. The same approach applies to instrumenting services for distributed tracing. Practically, starting from the outside is effective for three major reasons. The first of these is that the edges of your service are the easiest to see—and, thus, manipulate. It’s fairly straightforward to add things that surround a service even if it’s hard to modify the service itself. Second, starting from the outside tends to be organizationally effi‐ cient. It can be difficult to convince disparate teams to adopt distributed tracing, especially if the value of that tracing can be hard to see in isolation. Finally, dis‐ tributed tracing requires context propagation—we need each service to know about the caller’s trace, and each service we call out to needs to know that it’s included in a trace as well. For these reasons, it’s highly useful to begin instrumenting any sort of existing application by starting from the outside and moving in. This can take the form of framework instrumentation or service mesh (or equivalent component) instrumentation.

Trade-offs of Outside-In Instrumentation In general, changing code or infrastructure outside a process that you’re instrument‐ ing may require more effort than changing it for just that process. In addition, sweep‐ ing changes to your infrastructure or framework code are often more difficult to review, as they’re less incremental. If it’s logistically or organizationally difficult to approach instrumentation outside-in, you’ll still get benefits from adding instrumen‐ tation to one service at a time at the cost of having broken traces. Make sure to check that other services are transparently forwarding the tracing headers that they’ll be receiving in this case!
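To make that last point concrete, here is a minimal sketch of a service that has no tracer of its own but still forwards common tracing headers onto its outgoing call, so traces passing through it are not broken. The header list and the downstream URL are assumptions for the example; forward whichever propagation format your tracers actually emit.

import (
    "io"
    "net/http"
)

// Headers used by W3C Trace Context and B3 propagation; adjust to match your tracers.
var traceHeaders = []string{"traceparent", "tracestate", "b3", "x-b3-traceid", "x-b3-spanid", "x-b3-sampled"}

func proxyHandler(w http.ResponseWriter, r *http.Request) {
    out, err := http.NewRequestWithContext(r.Context(), http.MethodGet, "http://downstream:8080/work", nil)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    // Copy the caller's tracing headers through untouched.
    for _, h := range traceHeaders {
        if v := r.Header.Get(h); v != "" {
            out.Header.Set(h, v)
        }
    }
    resp, err := http.DefaultClient.Do(out)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    defer resp.Body.Close()
    io.Copy(w, resp.Body)
}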

Where to Start—Nodes and Edges | 71 Framework Instrumentation In any distributed application, services need to communicate with each other. This RPC traffic can use a variety of protocols and transport methods—structured data over HTTP, Protocol Buffers over gRPC, Apache Thrift, custom protocols over TCP sockets, and more. There must be some equivalency on both sides of this connection: your services need to be speaking the same language when they talk! There are two critical components when it comes to instrumentation at the framework level. First, our frameworks must allow us to perform context propagation, the transmission of trace identifiers across the network. Second, our frameworks should aid us in creating spans for each service. Context propagation is perhaps the easier challenge to solve. Let's take another look at MicroCalc to discuss it. As shown in Figure 4-2, we're only using one transport method (HTTP), but two different ways of passing messages—JSON and query parameters. You can imagine that some of these links could be done differently; for instance, we could refactor the communication between our API service and the worker services to use gRPC, Thrift, or even GraphQL. The transport itself is largely irrelevant; the requirement is simply that we are able to pass the trace context to the next service (the sketch following Figure 4-2 shows one way to do this over gRPC metadata).

Figure 4-2. The protocols used for inter-service communication in MicroCalc.
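To make that point concrete, the following sketch (not part of MicroCalc) carries an OpenTracing span context over gRPC metadata instead of HTTP headers. The carrier is just a set of string key/value pairs, so any transport that can ship strings alongside a request will work; the function name is illustrative and the actual gRPC call is omitted.

import (
    "context"

    opentracing "github.com/opentracing/opentracing-go"
    "google.golang.org/grpc/metadata"
)

// withTraceMetadata serializes the span context to text and attaches it as
// outgoing gRPC metadata.
func withTraceMetadata(ctx context.Context, span opentracing.Span) context.Context {
    carrier := opentracing.TextMapCarrier{}
    if err := span.Tracer().Inject(span.Context(), opentracing.TextMap, carrier); err != nil {
        return ctx // fall back to an unannotated context
    }
    for k, v := range carrier {
        ctx = metadata.AppendToOutgoingContext(ctx, k, v)
    }
    return ctx
}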

Once you identify the transport protocols your services use to communicate, con‐ sider the critical path for your service calls. In short, identify the path of calls as a request moves through your services. In this stage of analysis, you’ll want to focus on components that act as a hub for requests. Why? Generally, these components are going to logically encapsulate operations on the backend and provide an API for mul‐ tiple clients (such as browser-based web clients or native applications on a mobile device). Therefore, instrumenting these components first allows for a shorter timeline to derive value from tracing. In the preceding example, our API proxy service meets these criteria—our client communicates directly through it for all downstream actions. After identifying the service you’ll instrument, you should consider the transport method used for requests coming into, and exiting, the service. Our API proxy service exclusively communicates via structured data using HTTP, but this is simply

72 | Chapter 4: Best Practices for Instrumentation an example for the sake of brevity—in the real world, you’ll often find services that can accept multiple transports and also send outgoing requests through multiple transports. You’ll want to be acutely aware of these complications when instrument‐ ing your own applications. That said, we’ll look at the actual mechanics of instrumenting our service. In frame‐ work instrumentation, you’ll want to instrument the transport framework of your service itself. This can often be implemented as some sort of middleware in your request path: code that is run for each incoming request. This is a common pattern for adding logging to your requests, for example. What middlewares would you want to implement for this service? Logically, you’ll need to accomplish the following:

• Check whether an incoming request includes a trace context, which would indi‐ cate that the request is being traced. If so, add this context to the request. • Check whether a context exists in the request. If the context exists, create a new span as a child of the flowed context. Otherwise, create a new root span. Add this span to the request. • Check whether a span exists in the request. If a span exists, add other pertinent information available in the request context to it such as the route, user identifi‐ ers, etc. Otherwise, do nothing and continue.

These three logical actions can be combined into a single piece of middleware through the use of instrumentation libraries such as the ones we discussed in Chap‐ ter 3. We can implement a straightforward version of this middleware in Golang using the OpenTracing library, as Example 4-10 shows, or by using instrumentation plug-ins bundled with frameworks like OpenTelemetry, as we demonstrated in “Trac‐ ing by Example” on page 61.

Example 4-10. Tracing middleware

// statusRecorder wraps http.ResponseWriter to record the handler's status code.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (sr *statusRecorder) WriteHeader(code int) {
    sr.status = code
    sr.ResponseWriter.WriteHeader(code)
}

func TracingMiddleware(t opentracing.Tracer, h http.HandlerFunc) http.HandlerFunc {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Extract the caller's span context, if any, from the incoming headers
        spanContext, _ := t.Extract(opentracing.HTTPHeaders,
            opentracing.HTTPHeadersCarrier(r.Header))
        // With no extracted context, this starts a new root span
        span := t.StartSpan(r.Method, opentracing.ChildOf(spanContext))
        defer span.Finish()
        span.SetTag("route", r.URL.EscapedPath())
        // Attach the span to the request context for downstream code
        r = r.WithContext(opentracing.ContextWithSpan(r.Context(), span))
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        h(rec, r)
        span.SetTag("status", rec.status)
    })
}

Where to Start—Nodes and Edges | 73 This snippet accomplishes the goals laid out earlier—we first attempt to extract a span context from the request headers. In the preceding sample, we make some assumptions, namely that our span context will be propagated using HTTP headers and not any sort of binary format. OpenTracing, generally, defines these headers to be of the following formats:

ot-span-id
    A 64- or 128-bit unsigned integer
ot-trace-id
    A 64- or 128-bit unsigned integer
ot-sampled
    A Boolean value indicating if the upstream service has sampled out the trace

Please note that these are not the only types of headers that can contain a span context. You can learn more about other popular header formats in "OpenTracing and OpenCensus" on page 43. As we learned in Chapter 2, the span context is critical to propagating a trace throughout our services, which is why we first extract it from the incoming request. After extracting any incoming headers, our middleware then creates a new span, named after the HTTP operation being performed (GET, POST, PUT, etc.), adds a tag indicating the route being requested, then adds the new span to the Go context object. Finally, the middleware continues the request chain. As the request resolves, it records the response code (captured by the wrapped response writer) on the span, which is then closed implicitly through our deferred call to Finish. Let's imagine that we stopped here. If you were to add this middleware to the API proxy service along with a tracer and trace analyzer, what would you see? Well, every single incoming HTTP request would be traced, for one. This would give you the ability to monitor your API endpoints for latency on every incoming request, a valuable first step when monitoring your application. The other win here is that you have now propagated your trace into the context, allowing further function or RPC calls to add information or create new spans based off of it. Meanwhile, you will still be able to access latency information, per API route, and use that to inform you of performance issues and potential hotspots in your codebase. There are trade-offs with instrumenting the framework, however. Framework instrumentation relies heavily on the ability to make code changes to your services themselves. If you can't modify the service code, you can't really instrument the transport framework. You may find framework instrumentation difficult if your API proxy acts simply as a translation layer—for example, a thin wrapper that translates JSON over HTTP to a proprietary or internal transport—in this case, the general principle would apply, but you may lack the ability to enrich a span with as much data as you would want. Finally, framework instrumentation may be difficult if you do not have

74 | Chapter 4: Best Practices for Instrumentation components that centralize requests—for example, a client that calls multiple services directly, rather than through some proxy layer. In this case, you could use the client as the centralization point, and add your initial instrumentation there. Service Mesh Instrumentation When discussing the trade-offs of framework instrumentation, the first consideration we mentioned was "What if you can't change the code?" This isn't an unreasonable or outlandish hypothetical. There are a variety of reasons that the person instrumenting software isn't able to modify the service they're attempting to instrument. Most commonly this is a challenge for larger organizations, where the people monitoring the application are separated from the people making the application by geography, time zone, and so forth. How, then, to instrument code that you can't touch? In short, you instrument the part of the code that you can touch and go from there. You should first understand what a service mesh is—if you know, feel free to skip ahead a paragraph or so. A service mesh is a configurable infrastructure layer designed to support interprocess communication among services. It performs this, generally, through sidecar proxies, processes that live alongside each service instance and handle all of the interprocess communication for their associated service. In addition to service communications, the service mesh and its sidecars can handle monitoring, security, service discovery, load balancing, and more. In essence, the service mesh allows for a separation of developer concerns from operations concerns, allowing teams to specialize and focus on writing performant, secure, and reliable software. Now that we're on the same page, let's talk about what service mesh instrumentation looks like. As indicated earlier, one of the critical features of the sidecar proxy is that all interprocess communication flows through the proxy. This allows us to add tracing to the proxy itself. As it happens, this functionality works out of the box in many modern service mesh projects such as Istio, but at a more hypothetical level, the mechanics look remarkably similar to how framework instrumentation works. On incoming requests, pull the span context from the headers, create a new span using this context, and add tags that describe the operation—finishing the span when the request resolves. The biggest advantage of this style of instrumentation is that you can get a complete picture of your application. Recall our discussion of framework instrumentation—we started at a centralization point, and then continued outward from there. By instrumenting at the service mesh, all of the services that are managed by the service mesh will be part of the trace, giving you much greater insight into your entire application. In addition, service mesh instrumentation is agnostic to the transport layer of each service. As long as the traffic is being passed through the sidecar, it will be traced.

Where to Start—Nodes and Edges | 75 That said, there are trade-offs and drawbacks to service mesh instrumentation. Pri‐ marily, service mesh instrumentation acts as a black box form of instrumentation. You have no idea what’s happening inside the code, and you can’t enrich your spans with data outside of the data that’s already there. Realistically, this means you can ach‐ ieve some useful implicit findings—tagging spans with HTTP response codes, for example, and presuming that any status code that represents a failed request (like HTTP 500) will be an error—but requires specialized parsing or handling to get explicit information into a span. The other flaw with service mesh instrumentation is that it’s difficult for services to enrich the spans coming from the service mesh. Your sidecar will pass tracing headers into your process, yes—but you will still need to extract those headers, create a span context, and so forth. If each service is creating its own child spans, you can very quickly get into a state where your traces have become extremely large and begin to have a real cost for storage or processing. Ultimately, service mesh instrumentation and framework instrumentation are not an either/or decision. They work best together! Not all of your services, realistically, will need to be instrumented out of the box, or potentially ever. Let’s talk about why. Creating Your Service Graph Regardless of which methodology you use to begin instrumenting your application, you should consider the first milestone you’d like to achieve. What do you want to measure? We would argue that tracing is primarily a way to measure the performance and health of individual services in the context of a larger application. To understand that context, however, you need to have some idea of the connections between your services and how requests flow through the system. Thus, a good first milestone would be to build a service graph for your complete application or some significant subset of it, as Figure 4-3 illustrates. This comparison should demonstrate the necessity of understanding your service graph. Even when services are simple, with few dependencies, understanding your service graph can be a critical component of improving your MTTR (mean time to recovery) for incidents. Since much of this is bound to unrelated factors, such as the amount of time it takes to deploy a new version of a service, reducing the time spent in diagnosis is the best way to reduce overall MTTR. A key benefit of distributed trac‐ ing is that it allows you to implicitly map your services and the relationships between them, allowing you to identify errors in other services that contribute to the latency of a particular request. When applications become more complicated and intercon‐ nected, understanding these relationships stops being optional and starts becoming fundamental.

76 | Chapter 4: Best Practices for Instrumentation Figure 4-3. A comparison of MicroCalc versus a more complex microservice graph.

In the sample application, you can see that the dependencies between services are fairly straightforward and easy to understand. Even in this simple application, being able to build the entire graph is highly valuable. Let’s imagine that you used a combi‐ nation of techniques in order to instrument each of our services (API proxy, authenti‐ cation service, worker services, etc.) and have a trace analyzer that can read and process the spans generated from our application. Using this, you can answer ques‐ tions that would be difficult if you didn’t have access to these service relationships. These can range from the mundane (“What services contribute most to the latency of this specific operation?”) to the specific (“For this specific customer ID, for this spe‐ cific transaction, what service is failing?”). However, if you limit yourself to merely tracing the edges of your services, you’re in a bit of a pickle. You can only identify failures very coarsely, such as if a request failed or succeeded.

Where to Start—Nodes and Edges | 77 So, how do you fix this? You have several options. Certainly one is to begin adding instrumentation to the service code itself. As we’ll discuss in the next section, there’s an art and a science to creating spans that are useful for profiling and debugging traced code. The other is to leverage the edges you’ve traced, and mine them for more data. We’ll present three more advanced mechanisms that use the concepts of frame‐ work and mesh instrumentation to fill in the gaps of your service mesh. The first method is to increase the level of detail in our framework-provided spans. In our example HTTP middleware, we recorded only a small amount of detail about the request such as the HTTP method, route, and status code. In reality, each request would potentially have a great deal more data recorded. Are your incoming requests tied to a user? Consider attaching the user identifier to each request as a tag. Service- to-service requests should be identified with some semantic identifiers provided by your tracing library such as OpenTelemetry’s SpanKind attributes or specific tags that allow you to identify the type of a service (cache, database, and so forth). For database calls, instrumenting the database client allows you to capture a wide variety of infor‐ mation such as the actual database instance being used, the database query, and so forth. All of these enrichments help build your service graph into a semantic repre‐ sentation of your application and the connections between it. The second method is to leverage existing instrumentation and integrations for your services. A variety of plug-ins exist for OpenTelemetry, OpenTracing, and OpenCen‐ sus that allow for common open source libraries to emit spans as a part of your exist‐ ing trace. If you’re facing a daunting instrumentation journey, with a large amount of existing code, you can use these plug-ins to instrument existing frameworks and cli‐ ents alongside higher-level instrumentation at the service mesh/framework layer. We list a sample of these plug-ins in Appendix A. The third method is through manual instrumentation, which we covered in “Custom Instrumentation” on page 70, and the same principles apply. You’ll want to ensure that a root span is propagated into each service that you can create child spans from. Depending on the level of detail required for a service, you may not need multiple child spans for a single service; consider the pseudocode in Example 4-11.

Example 4-11. A pseudocode method to handle resizing and storing of images func uploadHandler(request) { image = imageHelper.ParseFile(request.Body()) resizedImage = imageHelper.Resize(image) uploadResponse = uploadToBucket(resizedImage) return uploadResponse }

In this case, what do we care about tracing? The answer will vary based on your requirements. There’s an argument for having most of the methods being called here

78 | Chapter 4: Best Practices for Instrumentation have their own child spans, but the real delineation here would be to restrict child calls to methods that are outside the scope of responsibility for a given team. You can imagine a situation where as our service grows we may factor the functions that parse and resize images out of this into another service. As we’ve written, you’ll probably want to simply encase this whole method in a single span and add tags and logs based off the responses to your method calls, something like Example 4-12.

Example 4-12. Manually instrumenting a method func uploadHandler(context, request) { span = getTracer().startSpanFromContext(context) image = imageHelper.ParseFile(request.Body()) if image == error { span.setTag("error", true) span.log(image.error) } // Etc. }

Any or all of these methods can be intermingled to build a more effective and repre‐ sentative service graph that not only accurately describes the service dependencies of your application but semantically represents the nature of these dependencies. We’ve discussed adding or enriching spans; next, we’ll look at how to create these spans, and how to determine the most important and valuable information you should add to a span. What’s in a Span? Spans are the building blocks of distributed tracing, but what does that actually mean? A span represents two things: the span of time that your service was working and the mechanism by which data is carried from your service to some analysis sys‐ tem capable of processing and interpreting it. Creating effective spans that unlock insights into the behavior of your service is one part art, one part science. It involves understanding best practices around assigning names to your spans, ensuring that you’re tagging spans with semantically useful information, and logging structured data. Effective Naming What’s in a name? When it comes to a span, this is a very good question! The name of a span, also known as the operation name, is a required value in open source tracing libraries, in fact, it is one of the only required values. Why is this the case? As we’ve alluded to, spans are an abstraction over the work of a service. This is a significant difference from the way you might think of a request chain, or a call stack. You should not have a one-to-one mapping between function name and span name.

What’s in a Span? | 79 That said, what’s in a span name? First, names should be aggregable. In short, you want to avoid span names that are unique to each execution of a service. One antipattern we see, especially in HTTP services, is implementers making the span name the same as the fully matched route (such as GET /api/v2/users/1532492). This pattern makes it difficult to aggregate operations across thousands or millions of executions, severely reducing the utility of your data. Instead, make the route more generic and move parameters to tags, such as GET /api/v2/users/{id} with an associated tag of userId: 1532492. Our second piece of advice is that names should describe actions, not resources. To use an example, let’s think back to MicroCalc. We could add a datastore, which could be blob storage, could be SQL, could be anything for any number of purposes, like a user database or a history of previous results. In lieu of naming a span based on the resource it’s accessing, mutating, or otherwise consuming you’ll be far better served by describing the action and tagging the span with the resource type. This allows for queries against your spans across multiple types, allowing for interesting analytical insights. An example would be the difference between the names WriteUserToSQL and WriteUser. You can imagine a situation where these independent components are switched out for testing (suppose we wanted to trial a NoSQL or cloud datastore for our users?); having this less proscriptive name would allow for comparisons between each backing store. Following these two pieces of advice will ensure that your spans are more useful down the line as you analyze them. Effective Tagging You’re not required to tag your spans, but you should. Tags are the main way you can enrich a span with more information about what’s happening for a given operation, and unlock a lot of power for analytics. While names will help you aggregate at a high level (so you can ask questions like “What’s my error rate for looking up users across all services?”), tags will allow you to slice and dice that information to better under‐ stand the why of your query. Data with a high cardinality should be exposed in your span as a tag, rather than something else—placing high-cardinality data in a name reduces your ability to aggregate operations, and placing it inside of log statements often reduces its indexability. So, what makes an effective tag? Tags should be externally important, which is to say, they should have meaning to other consumers of your trace data. While there are ways to use tags and traces in development, the tags you emit into a production trac‐ ing system should be generally useful to anyone trying to understand what your ser‐ vice is doing. Tags should be internally consistent as well: using the same keys across multiple serv‐ ices. In our mock application, we could theoretically have each service report the same piece of information (a user ID, let’s say) using different tag keys—userId,

80 | Chapter 4: Best Practices for Instrumentation UserId, User_ID, USERID, and so forth—but this would be difficult to create queries about in external systems. Consider building helper libraries that standardize these keys, or settle upon a format that comports with your organization's coding standards. In addition to the consistency of tag keys, ensure that tag data is kept as consistent as possible within a tag key. If some services report the userId key as a string value, and others as an integer value, problems could arise in your analysis system. Furthermore, ensure that if you're tracking some numerical value, you add the unit of the value to the key. For example, if you're measuring the size of a returned message in kilobytes, message_size_kb is more useful than message_size. Tags should be succinct rather than verbose—don't put stack traces in tags, for example. Remember, tags are critical to querying your trace data and building insights, so don't neglect them! Effective Logging Naming and tagging of spans both assist in your ability to derive insights from your traces. They help you build a relational graph of sorts, showing you what happened (through names) and why it happened (through tags). Logs could be thought of as the how it happened piece of this puzzle, offering developers the ability to attach structured or unstructured text strings to a given span. Effective logging with spans has two central components. First, ask yourself what you really should be logging. Named and tagged spans can significantly reduce the number of logging statements required by your code. When in doubt, make a new span rather than a new logging statement. For example, consider the pseudocode in Example 4-13.

Example 4-13. Named and tagged spans func getAPI(context, request) { value = request.Body() outgoingRequest = new HttpRequest() outgoingRequest.Body = new ValueQuery(value) response = await HttpClient().Get(outgoingRequest) if response != Http.OK { request.error = true return } resValue = response.Body() // Go off and do more work }

Without tracing, you would want to log quite a bit here—for example, the incoming parameters, especially the value you care about inspecting. The outgoing request body would possibly be interesting to log. The response code would definitely be

What’s in a Span? | 81 something you’d look to log, especially if it’s an exceptional or error case. With a span, however, there’s significantly less that’s valuable as a log statement—the incoming parameter value, if it’s generally useful, could be a tag such as value:foo, the response code would certainly be one, and so forth. That said, you might still be interested in logging the exact error case that’s happening there. In this situation, con‐ sider making a new span for that external request instead. The rationale here is two‐ fold: this is an edge of your application code, and, as discussed earlier, it’s a good practice to trace the edges. Another reason is that a log statement would be less interesting in terms of data than another span. HTTP GET may seem like a simple operation, and it often is when we think about using it. Consider what’s happening behind the scenes, though—DNS lookups, routing through who-knows-how-many hops, time spent waiting on socket reads, and so forth. This information, if made available in a span through tags, can provide more fine-grained insight into performance issues and is thus better served by being a new span rather than a discrete log operation. The second aspect to effective logging in your spans is, when possible, write struc‐ tured logs and be sure your analysis system is capable of understanding them. This is more about ensuring the usability of your spans down the line than anything else—an analysis system can turn structured logging data into something more readable in a GUI, and provides options for performing complex queries (i.e., “show me all logs from Service X where an event was logged with a particular type of exception” or “are any of my services emitting logs at an INFO level?”).

Security and Compliance Considerations Attributes, tags, events, logs, and even span names can potentially contain personally identifiable information (PII). This data, depending on local and federal regulations, may be protected by law. You should pay careful attention to exactly what data is being added to your spans, especially if you are using a third-party analysis tool for your trace data. Your organization may also specify certain rules and regulations that pertain to the amount of time that diagnostic data may be retained for legal discovery or other purposes. Consult relevant legal advisers in your organization or locality for more information to determine exactly what can, and can’t, be stored.

Understanding Performance Considerations The undesirable side effect of creating these rich, detailed spans is that they all have to go somewhere, and that takes time. Let’s consider a text representation of a typical span for an HTTP request (see Example 4-14).

82 | Chapter 4: Best Practices for Instrumentation Example 4-14. Typical span for an HTTP request

{
    context: {
        TraceID: 9322f7b2-2435-4f36-acec-f9750e5bd9b7,
        SpanID: b84da0c2-5b5f-4ecf-90d5-0772c0b5cc18
    },
    name: "/api/v1/getCat",
    startTime: 1559595918,
    finishTime: 1559595920,
    tags: [
        { key: "userId", value: 9876546 },
        { key: "srcImagePath", value: "s3://cat-objects/19/06/01/catten-arr-djinn.jpg" },
        { key: "voteTotalPositive", value: 9872658 },
        { key: "voteTotalNegative", value: 72 },
        { key: "http.status_code", value: 200 },
        { key: "env", value: "prod" },
        { key: "cache.hit", value: true }
    ]
}

This is less than 1 KB of data—about 600 bytes. Encoding it in base64 brings that up to around 800 bytes. We’re still under 1 KB, so that’s good—but this is just one span. What would it look like for an error case? A stack trace would probably balloon us up from sub 1 KiB to around 3–4 KiB. Encoding a single span is, again, fractional

What’s in a Span? | 83 seconds—(time openssl base64 reports cpu 0.006 total)—which isn’t that much when you get down to it. Now multiply that by a thousand, ten thousand, a hundred thousand…eventually, it adds up. You’re never going to get any of this for free, but never fear, it’s not as bad as it might seem. The first thing to keep in mind is that you don’t know until you know —there’s no single rule we can give you to magically make everything perform better. The amount of overhead you’re willing to budget for and accept in your application’s performance is going to vary depending on a vast amount of factors that include:

• Language and runtime • Deployment profile • Resource utilization • Horizontal scalability

With that in mind, you should consider these factors carefully as you begin to instrument your application. Keep in mind that the stable use case and the worst-case performance profile will often look extremely different. More than one developer has found themselves in a hairy situation where some external resource was suddenly and unexpectedly unavailable for a long period of time, leading to extremely ungraceful and resource-intensive service crash loops or hangs. A strategy you can use to combat this is to build safety valves into your internal tracing framework. Depending on your sampling strategy, this "tracing safety valve" could be a cutoff on new span creation if the application is in a persistent failing state, or a gradual reduction in span creation to an asymptotic point.

Graceful Degradation of Span Creation Traditionally, distributed tracing has ameliorated persistent failure states and garbage data through per-process or per-application sampling strategies around span creation. New dynamic sampling approaches that move this decision out of the process allow you to collect 100% of the trace data from each of your services, but present some unique challenges in how you should handle span creation while in a persistent (non- recoverable) failure. You should consider the span creation rate (and size of those spans) during persistent failure of a service and use that to make a decision—in a low- throughput service where span count is measured in tens or hundreds a minute, you may be OK without a backoff.

Additionally, consider building in some sort of method to remotely disable the tracer in your application code. This can be useful in a variety of scenarios beyond the aforementioned unexpected external resource loss; it can also be helpful when want‐ ing to profile your service performance with tracing on versus with tracing off.
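There is no standard API for such a switch, but it can be as simple as a wrapper around your tracer guarded by an atomic flag. The sketch below wraps an OpenTracing tracer; the switchableTracer type and whatever flips the flag (an admin endpoint, a configuration watcher) are hypothetical, not part of any library.

import (
    "sync/atomic"

    opentracing "github.com/opentracing/opentracing-go"
)

// switchableTracer delegates to a real tracer when enabled and to a no-op
// tracer when disabled, so span creation can be stopped without a redeploy.
type switchableTracer struct {
    enabled int32 // 1 = tracing on, 0 = tracing off; accessed atomically
    real    opentracing.Tracer
    noop    opentracing.NoopTracer
}

func (t *switchableTracer) SetEnabled(on bool) {
    var v int32
    if on {
        v = 1
    }
    atomic.StoreInt32(&t.enabled, v)
}

func (t *switchableTracer) StartSpan(name string, opts ...opentracing.StartSpanOption) opentracing.Span {
    if atomic.LoadInt32(&t.enabled) == 1 {
        return t.real.StartSpan(name, opts...)
    }
    return t.noop.StartSpan(name, opts...)
}

func (t *switchableTracer) Inject(sc opentracing.SpanContext, format interface{}, carrier interface{}) error {
    return t.real.Inject(sc, format, carrier)
}

func (t *switchableTracer) Extract(format interface{}, carrier interface{}) (opentracing.SpanContext, error) {
    return t.real.Extract(format, carrier)
}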

84 | Chapter 4: Best Practices for Instrumentation Ultimately, the biggest resource cost in tracing isn’t really the service-level cost of cre‐ ating and sending spans. A single span is most likely a fraction of the data that’s being handled in any given RPC in terms of size. You should experiment with the spans you’re creating, the amount of data you’re adding to them, and the rate at which you’re creating them in order to find the right balance between performance and information that’s required to provide value from tracing. Trace-Driven Development When tracing is discussed as part of an application or service, there’s a tendency to “put it off,” so to speak. In fact, there’s almost a hierarchy of monitoring that is applied, in order, to services as they’re developed. You’ll start off with nothing, but quickly start to add log statements to your code, so you can see what’s going on in a particular method or what parameters are being passed around. Quite often, this is the majority of the monitoring that’s applied to a new service up until it’s about ready to be released, at which point you’ll go back in and identify some metrics that you care about (error rate, for example) and stub those in, right before the whole ball of wax gets shoved into your production deployments. Why is it done this way? For several reasons—some of them good. It can be very diffi‐ cult to write monitoring code when the code you’re monitoring shifts and churns under your feet—think of how quickly lines of code can be added, removed, or refac‐ tored while a project is in development—so it’s something that developers tend not to do, unless there’s a very strong observability practice on their team. There’s another reason, though, and it’s perhaps the more interesting one. It’s hard to write monitoring code in development because you don’t really know what to moni‐ tor. The things you do know to care about, such as an error rate, aren’t really that interesting to monitor and often can be observed through another source, such as through a proxy or API gateway. Machine-level metrics such as memory consump‐ tion of your process aren’t something most developers have to worry about, and if they do, those metrics are going to be monitored by a different component rather than by their application itself. Neither metrics nor logs do a good job of capturing the things you do know about at the beginning of your service’s development, such as what services it should be com‐ municating with, or how it should call functions internally. Tracing offers an option, allowing for the development of traces as you develop your application that both offer necessary context while developing and testing your code and provide a ready-made toolset for observability within your application code. In this section, we’ll cover the two high-level parts of this concept: developing using traces, and testing using traces.

Trace-Driven Development | 85 Developing with Traces No matter what language, platform, or style of service you write, they’ll all probably start at the same place: a whiteboard. It’s this surface that you’ll use to create the model of your service’s functions, and draw lines that represent the connections between it and other services. It makes a lot of sense, especially in the early prototyp‐ ing phases of development, to start out in such a malleable place. The problem comes when it’s time to take your model and translate it into code. How do you ensure that what you’ve written on the board matches up with your code? Tra‐ ditionally the usage of test functions is recommended, but this is perhaps too small of a target to really tell you anything useful. A unit test, by design, should validate the behavior of very small units of functionality—a single method call, for example. You can, of course, write bigger unit tests that begin to validate other suppositions about your functions, such as ensuring that method A calls method B calls method C…but eventually, you’re just writing a test that exercises every code path for spurious reasons. When you try to test the relationship your service has to antecedent and dependent services, it gets even more convoluted. Generally, these would be considered integra‐ tion tests. The problem with integration tests to verify your model is twofold, how‐ ever. The first problem is that if you begin to mock services out, you’re not testing against the actual service, just a mock replacement that follows some preordained command. The second, and perhaps larger, problem is that integration tests are nec‐ essarily going to be limited to your test environment and have poor support for com‐ municating across process boundaries (at least, without going through a bunch of hoops to set up an integration test framework or write your own). If unit tests and integration tests won’t work, then what will? Well, let’s go back to the original point—it’s important to have a way to validate your mental model of your application. That means you should be able to write your code in a way that allows you to ensure both internal methods and external services are being called in the cor‐ rect order, with the correct parameters, and that errors are handled sanely. One com‐ mon mistake we’ve observed, especially in code with significant external service dependencies, is what happens on persistent service failure. You can see examples of this happening in the real world all the time. Consider the outages that occurred as a result of AWS S3 buckets becoming persistently unavail‐ able for hours upon end several years ago. Having trace data available to you, both in test and production, allows you to write tools that quickly compare the desired state of your system with reality. It’s also invaluable when trying to build chaos systems as part of your continuous integration/continuous delivery (CI/CD)—being able to find the differences between your steady-state system and the system under chaos will dramatically improve your ability to build more resilient systems.

86 | Chapter 4: Best Practices for Instrumentation Tracing as a part of your development process works similarly to tracing anywhere else in your codebase, with a few notable conveniences. First, recall our earlier discus‐ sion of how to start tracing a service (“Where to Start—Nodes and Edges” on page 71). The same principle applies to writing a new service and to instrumenting an existing one. You should apply middleware to each incoming request that checks for span data, and if it exists, create a new child span. When your new service emits out‐ bound requests, it too should inject your current span context into the outgoing request so that downstream services that are tracing-aware can take part in the trace. The changes to the process tend to come between these points because you’ll be faced with challenges around how much to trace. As we’ll discuss at the end of this chapter, there is such a thing as too much tracing. In production especially, you want to limit your traces to the data that matters for exter‐ nal observers and users when they view an end-to-end trace. How, then, to accurately model a single service with multiple internal calls? You’ll want to create some sort of verbosity concept for your tracer. This is extremely common in logging, where log levels exist such as info, debug, warning, and error. Each verbosity specifies at mini‐ mum that the log statement must meet to be printed. The same concept can apply to traces as well. Example 4-15 demonstrates one method in Golang to create verbose traces, configurable via an environment variable.

Example 4-15. Creating verbose traces

var traceVerbose = os.Getenv("TRACE_LEVEL") == "verbose"

...

func withLocalSpan(ctx context.Context) (context.Context, opentracing.Span) {
    if traceVerbose {
        pc, _, _, ok := runtime.Caller(1)
        callerFn := runtime.FuncForPC(pc)
        if ok && callerFn != nil {
            span, ctx := opentracing.StartSpanFromContext(
                ctx, callerFn.Name(),
            )
            return ctx, span
        }
    }
    return ctx, opentracing.SpanFromContext(ctx)
}

func finishLocalSpan(span opentracing.Span) {
    if traceVerbose {
        span.Finish()
    }
}

Setting trace verbosity isn't just limited to Go—aspects, attributes, or other dynamic/metaprogramming techniques can be utilized in languages with support for them. The basic idea is as presented, though. First, ensure that the verbosity level is set appropriately. Then, determine the calling function and start a new span as a child of the current one. Finally, return the span and the language context object as appropriate. Note that in this case, we're only providing a start/finish pair—this means that any logs or tags we introduce will not necessarily be added to the verbose child, but could be added to the parent if the child doesn't exist. If this isn't desirable, then consider creating helper functions to log or tag through, as sketched below, to avoid this behavior.
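If you do want verbose-only logs and tags, one option is a pair of helpers that write through to the span only when verbosity is enabled. The following is a minimal sketch along those lines; the function names (logLocalEvent, tagLocal) are our own, not part of OpenTracing or any library, and the sketch reuses the traceVerbose flag from Example 4-15.

import (
    "github.com/opentracing/opentracing-go"
    otlog "github.com/opentracing/opentracing-go/log"
)

// logLocalEvent attaches a log entry only when verbose tracing is enabled,
// so that verbose-level detail never lands on the parent span by accident.
func logLocalEvent(span opentracing.Span, event string) {
    if traceVerbose && span != nil {
        span.LogFields(otlog.String("event", event))
    }
}

// tagLocal behaves the same way for tags.
func tagLocal(span opentracing.Span, key string, value interface{}) {
    if traceVerbose && span != nil {
        span.SetTag(key, value)
    }
}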

Using our verbose traces is fairly straightforward as well (see Example 4-16).

Example 4-16. Using verbose traces

import (
    "context"
    "fmt"
    "log"
    "net/http"

    "github.com/opentracing-contrib/go-stdlib/nethttp"
    "github.com/opentracing/opentracing-go"
)

const serverPort = ":8080" // example port

func main() {
    // Create and register tracer; GlobalTracer is a no-op unless a real
    // implementation has been registered with opentracing.SetGlobalTracer.
    tracer := opentracing.GlobalTracer()

    mux := http.NewServeMux()
    fs := http.FileServer(http.Dir("../static"))

    mux.HandleFunc("/getFoo", getFooHandler)
    mux.Handle("/", fs)

    mw := nethttp.Middleware(tracer, mux)
    log.Printf("Server listening on port %s", serverPort)
    log.Fatal(http.ListenAndServe(serverPort, mw))
}

func getFooHandler(w http.ResponseWriter, r *http.Request) {
    foo := getFoo(r.Context())
    fmt.Fprint(w, foo) // Handle response
}

func getFoo(ctx context.Context) string {
    ctx, localSpan := withLocalSpan(ctx)
    // Do stuff with ctx here
    finishLocalSpan(localSpan)
    return "foo"
}

In this example, we’re creating a simple HTTP server in Golang and tracing it with the go-stdlib package. This will parse incoming HTTP requests for tracing headers and create spans appropriately, so the edges of our service are being handled. By adding the withLocalSpan and finishLocalSpan methods, we can create a span that’s local to a function and only exists when our trace verbosity is set appropriately.

These spans could be viewed in a trace analyzer while performing local development, allowing you to verify that calls are happening in the way you think they should, ensuring that you can observe your service as it calls other services (or is called by them), and, as a bonus, letting you use open source frameworks as a default answer to questions like "What logging/metrics/tracing API should I be using?", since these signals can be emitted through your telemetry API. Don't reinvent the wheel if you don't need to, after all!

Testing with Traces

Trace data can be represented as a directed acyclic graph. While it's usually rendered as a flame graph, a trace is simply a directed acyclic graph of a request, as illustrated in Figure 4-4. A directed acyclic graph, or DAG, may be extremely familiar to you if you have a computer science or mathematics background; it has several properties that are extremely useful. DAGs are finite (they have an end) and they have no directed cycles (they don't loop back on themselves—graphs that do are cyclic). One other useful property of DAGs is that they are fairly easy to compare to each other.

Figure 4-4. Comparison of a flame graph view of a trace versus a DAG view.

Knowing this, what are the possibilities? First, you may be asking, "So what?" As discussed earlier, integration testing and other forms of higher-level tests are sufficient and necessary to ensure the operation of our service as it is deployed. That said, there are several reasons you might want to consider adding trace comparisons to your testing repertoire. The easiest way to apply trace data as a form of testing is through simple diffs between environments. Consider a scenario where we deploy a version of our application to a staging or preproduction environment after

testing it locally. Further consider that we export our trace data in some sort of flat file, suitable for processing, as shown in Example 4-17.

Example 4-17. Exporting trace data

[
  {
    name: "getFoo",
    spanContext: { SpanID: 1, TraceID: 1 },
    parent: nil
  },
  {
    name: "computeFoo",
    spanContext: { SpanID: 2, TraceID: 1 },
    parent: spanContext{ SpanID: 1, TraceID: 1 }
  },
  ...
]

In a real system, we might expect these to be out of order or otherwise not exist in a sorted state, but we should expect that the call graph for a single API endpoint will be the same between the two environments. One potential application, then, is to perform a topological sort on each set of trace data, then compare them by length or through some other diffing process. If our traces differ, we know we have some sort of problem, because the results didn't match our expectations. Another application of this would be to identify, proactively, when services begin to take dependencies on your service. Consider a situation where our authentication service, or search service, was more widely publicized to other teams. Unbeknownst to us, they start to take a dependency on it in their new services. Automated trace diffing would give us proactive insight into these new consumers, especially if there's some sort of centralized framework generating and comparing these traces over time. Still another application is simply using tracing as the backbone for gathering service level indicators and objectives for your service, and automatically comparing them as you deploy new versions. Since traces inherently understand the timing of your service, they're a great way to keep track of performance changes across a wide variety of requests as you iterate and further develop services.
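To make the diffing idea concrete, here is a rough sketch of such a comparison in Go. It assumes the flat-file records from Example 4-17 have been unmarshaled into the ExportedSpan type below (which is ours, not part of any tracing library), and it reduces each trace to a set of parent-to-child edges before comparing:

// SpanRef and ExportedSpan mirror the flat-file records in Example 4-17.
type SpanRef struct {
    SpanID  int
    TraceID int
}

type ExportedSpan struct {
    Name    string
    Context SpanRef
    Parent  *SpanRef // nil for the root span
}

// callEdges reduces a trace to a set of "parent -> child" operation-name
// edges, ignoring ordering and timing.
func callEdges(spans []ExportedSpan) map[string]bool {
    names := map[int]string{}
    for _, s := range spans {
        names[s.Context.SpanID] = s.Name
    }
    edges := map[string]bool{}
    for _, s := range spans {
        if s.Parent == nil {
            edges["root -> "+s.Name] = true
        } else {
            edges[names[s.Parent.SpanID]+" -> "+s.Name] = true
        }
    }
    return edges
}

// diffTraces reports edges that appear in one environment but not the other.
func diffTraces(local, staging []ExportedSpan) []string {
    a, b := callEdges(local), callEdges(staging)
    var diffs []string
    for e := range a {
        if !b[e] {
            diffs = append(diffs, "missing in staging: "+e)
        }
    }
    for e := range b {
        if !a[e] {
            diffs = append(diffs, "missing locally: "+e)
        }
    }
    return diffs
}

An empty result means the call graphs match; anything else is a difference worth investigating before promoting the build.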

Ultimately, a lot of this is speculative—we're not aware of anyone using distributed tracing heavily as a part of their test suites. That doesn't mean it isn't useful, but as a new technology, not every facet of it has been explored and exploited yet. Maybe you'll be the first!

Creating an Instrumentation Plan

For better or worse, most people come to distributed tracing and monitoring late in the development of an application or piece of software. Some of this has to do with the nature of iterative development—when you're creating a product or service, it can be difficult to understand what it is you need to know until you've spent some time actually building and running it. Distributed tracing adds a wrinkle to this as well, because developers will often come to it as a solution to problems that are cropping up due to scale and growth, be it in terms of service count or organizational complexity. In both of these cases, you'll often have some large, presumably complicated set of services already deployed and running, and you'll need to know how you can leverage distributed tracing in order to improve the reliability and health of not only your software, but also your team. Perhaps you're starting greenfield development on some new piece of software and are adding distributed tracing out of the gate—you'll still need a plan for how to add and grow tracing throughout your team and organization. In this section, we'll discuss how to make an effective case for instrumenting new or existing services and get buy-in from your (and other) teams, the signals that indicate when you've instrumented enough, and finally how to sustainably grow instrumentation throughout your services.

Making the Case for Instrumentation

Let's assume that you're already sold on the idea of distributed tracing by virtue of reading this book. The challenge, then, becomes convincing your colleagues that it's as good an idea as you think it is, because they're going to have to do some work as well to ensure that their services are compatible with tracing. When making the case to other teams about the benefits, and costs, of distributed tracing, it's important to keep in mind many of the instrumentation lessons that we've discussed in this chapter. In short, instrumentation can be valuable even if it's fairly basic. If every service emits one span with some basic attributes that require no runtime overhead (i.e., string values that can be precalculated at service initialization), then the total added overhead to each request is simply the propagation of trace context headers, a task that adds 25 bytes on the wire and a negligible amount of cycles to decode afterwards.

The benefit—end-to-end tracing of a request—is extremely helpful for such a small price. This request-centric style of distributed tracing has found adopters at companies such as Google, which has used Dapper to diagnose anomalies and steady-state performance problems in addition to attribution for resource utilization.1 Numerous other engineering teams and organizations, large and small, have adopted distributed tracing in order to reduce MTTR for incidents and other production downtime. In addition, distributed tracing is extremely valuable as part of a larger monitoring and observability practice, where it enables you to reduce the search space of data that you need to investigate in order to diagnose incidents, profile performance, and restore your services to a healthy state. It can be useful to think of distributed tracing as a "level playing field" when it comes to service performance. Especially when interacting in a polyglot environment, or in a globally distributed enterprise, there can be challenges in ensuring that everyone is on the same page in terms of performance data. Some of these challenges are technical, but many are political. The proliferation of vanity metrics is particularly notable here; you can measure quite a few things about your software's performance that don't matter, and you may already be doing so in order to achieve nebulous "quality" goals set for reasons beyond our ken. Distributed tracing data, however, provides critical signals by default in a standardized way for all of your services, and does so without requiring synthetic endpoints or similar approaches to ensuring service health. This trace data can then be used to bring some peace and sanity to a possibly broken process. Of course, the first step to delivering that trace data is service instrumentation, so you'll need to start there. It doesn't have to be difficult to instrument your services. Good tools—open source and proprietary—will ease the instrumentation burden. We detail these in Appendix A, with examples of automatic instrumentation as well as library integrations for popular frameworks that enable tracing—sometimes with no code changes required. You should be aware of your frameworks and shared code when making the case for instrumentation so that you can leverage these existing tools. In our experience, one of the most persuasive arguments for distributed tracing is to simply instrument some existing microservice framework already used by your organization and demonstrate how services using it can be traced by simply updating a dependency. If you have an internal hackathon or hack day, this can be a fun and interesting project to tackle! No matter how you do it, the case for instrumentation ultimately comes down to the case for distributed tracing in general. As we've mentioned, there are plenty of interesting applications for tracing outside performance monitoring: tracing as part of your development cycle, tracing in testing other applications. You could use

1 [Sam16]

distributed tracing as part of your CI and CD framework, timing how long certain parts of your builds and deployments take. Tracing could be integrated into task runners for creating virtual machines or provisioning containers, allowing you to understand what parts of your build and deploy life cycle take the most time. Tracing can be used as a value-add for services that provide some sort of API as a service—if you're already tracing the execution time of your backend, you could make some version of that trace data available to your customers in order to help them profile their software as well. The possibilities for tracing are limitless, and the case for instrumenting your software should reflect that.

Instrumentation Quality Checklist

When instrumenting an existing service or creating guidelines on how to instrument new services, it can be useful to have a checklist of items that are important to ensuring quality instrumentation throughout your entire application. We've included a recommended one in the book's repository, but you're welcome to use it as a jumping-off point for your own.

Instrumentation Checklist

Span Status and Creation
• All error conditions under a given span appropriately set the span status to an error state.
• RPC framework result codes are mapped to span status (i.e., Internal Error, Not Found, etc.).
• All spans that are started are also finished, even in the case of unrecoverable errors if possible.
• Spans should only represent work that is semantically important to the request life cycle of a service; try not to create spans around endpoints only receiving synthetic traffic, like a /status or /health endpoint.

Span Boundaries
• Egress and ingress spans have appropriate labels (SpanKind is set).
• Egress and ingress spans have appropriate relationships (client/server, consumer/producer).
• Internal spans are appropriately labeled and do not imply a remote call.

Attributes
• Spans include a version attribute for the service they represent.
• Spans that represent work by a dependency have an attribute for that dependency's version.

• Spans should include attributes identifying underlying infrastructure:
  — Hostname/FQDN
  — Container name, if appropriate
  — Runtime version
  — Application server version, if appropriate
  — Region or availability zone
• Attributes are namespaced where appropriate (i.e., to prevent collisions between key names where the semantic meaning of the key differs between services in a request).
• Attributes with numerical values should include the unit of measurement in the key name (i.e., payload_size_kb versus payload_size).
• Attributes should not contain any PII.

Events
• Descriptive event messages that would be useful for upstream or downstream service users should be added:
  — Request-response payloads (sanitized)
  — Stack traces, exceptions, and error messages
• Long-running operations (such as waiting for a mutex) should be wrapped in events; one when the operation begins, and one when it ends.
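To make a couple of these items concrete, here is a minimal sketch in Go using the OpenTracing API: it maps an HTTP-style result code onto the span, flags error states, and finishes the span regardless of outcome. The helper name is ours, not part of any library.

import (
    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
)

// finishWithResult records the result code, marks the span as errored when
// appropriate, and always finishes the span, even on failure.
func finishWithResult(span opentracing.Span, statusCode int, err error) {
    ext.HTTPStatusCode.Set(span, uint16(statusCode))
    if err != nil {
        ext.Error.Set(span, true)
        span.LogKV("error.message", err.Error())
    }
    // Checklist items such as namespaced, unit-suffixed attributes
    // (e.g., auth.payload_size_kb) and a service version tag are best set
    // where the span is started; they're omitted here for brevity.
    span.Finish()
}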

Much of what’s in our instrumentation checklist is drawn from other parts of this chapter, so we won’t elaborate too much on it. A few notes to call out:

• Many open source instrumentation libraries or framework instrumentation libraries will, by default, instrument every incoming request or endpoint defined in your service code, including diagnostic endpoints. Generally, you'll want to implement a filter or sampler on your service to prevent spans being created for these endpoints unless you have some pressing need for them (see the sketch after this list).
• Be very careful about exposing PII in your attributes and events; the costs of noncompliance can be severe, especially if you're transmitting trace data to a third party for analysis and storage.
• Version attributes are extremely valuable, especially when doing trace comparisons, as they allow you to easily diff a request across two or more versions of a service in order to discover performance regressions or improvements.
• Integrating your feature flags and other experiments with your trace data is a useful way to understand how those experiments are changing the performance and reliability of your service.
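As a concrete (and deliberately simple) sketch of the first point, you can route diagnostic endpoints around the tracing middleware so that no spans are ever started for them. The paths below are examples, and this builds on the nethttp middleware used in Example 4-16.

import (
    "net/http"

    "github.com/opentracing-contrib/go-stdlib/nethttp"
    "github.com/opentracing/opentracing-go"
)

// withTracingExceptDiagnostics traces every request except those hitting
// diagnostic endpoints, which are served directly with no span created.
func withTracingExceptDiagnostics(tracer opentracing.Tracer, next http.Handler) http.Handler {
    traced := nethttp.Middleware(tracer, next)
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        switch r.URL.Path {
        case "/health", "/status", "/metrics":
            next.ServeHTTP(w, r)
        default:
            traced.ServeHTTP(w, r)
        }
    })
}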

Feel free to adapt this checklist with specific information that makes it useful for your team, and include it on service rollout checklists.

Knowing When to Stop Instrumenting

We've touched on the costs of instrumentation several times in this chapter; let's take a deeper look. At a high level, instrumentation is a trade-off like anything else in software. You're trading some measure of performance for, hopefully, a much higher level of insight into the operation of your service, at a level that's easy to communicate to others in your team or organization. In this section we point out a few notable antipatterns to watch out for when you're instrumenting. There's the risk that the trade-offs become too costly and lead you to stop instrumenting, or to oversample and lose resolution on your traces. One antipattern is implementing too high a default resolution. A good rule is that your service should emit one span per logical operation it performs. Does your service handle authentication and authorization for users? Logically, break down this function—it's handling an incoming request, performing some lookups in a datastore, transforming the result, and returning it. There are two logical operations here—handling the request/response and looking up the data. It's always valuable to separate out calls to external services (in this example, you might only have a single span if the datastore is some sort of local database), but you may not need to emit a span for marshaling the response into a new format that your caller expects. If your service is more complicated, adding more spans can be OK, but you need to consider how consumers of your trace data will find them valuable, and whether they're collapsible into fewer spans. The corollary to this point is that you may want to have the ability to increase the verbosity of spans emitted by a service—refer back to "Trace-Driven Development" on page 85 for ideas on how to increase or decrease the resolution of your spans. This is why we say default resolution; you want to ensure that the default amount of information emitted is small enough to integrate well into a larger trace, but large enough to ensure that it contains useful information for consumers that might not be on your team (but might be affected by issues with your service!). Another antipattern is not standardizing your propagation format. This can be challenging, especially when integrating legacy services or services written by a variety of teams. The key value of a trace is its interconnected nature. If you have 20, 50, 200, or more services using a mishmash of tracing formats, you're going to have a bad time trying to get value out of your traces. Avoid this by standardizing your tracing practices as much as possible, and by providing shims for legacy systems or between different formats. One method to combat nonstandard propagation formats is to create a stack of tracing propagators that can be aware of different headers (such as X-B3 or opentracing) and select the appropriate one on a per-request basis. You might find that it's less work to actually update existing systems to the new format than to create compatibility layers—use your best judgment and your organization's existing standards and practices to figure out what's right for you.
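One way to sketch such a stack in Go with OpenTracing is to configure one tracer per wire format and try each in turn when a request arrives. The function names here are ours, and the code assumes the tracers slice is non-empty and configured elsewhere (for example, one tracer speaking B3/Zipkin headers and one speaking your newer standard):

import (
    "net/http"

    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
)

// extractFromAny returns the first tracer that can extract a span context
// from the incoming headers, along with that context.
func extractFromAny(tracers []opentracing.Tracer, h http.Header) (opentracing.Tracer, opentracing.SpanContext) {
    carrier := opentracing.HTTPHeadersCarrier(h)
    for _, t := range tracers {
        if sc, err := t.Extract(opentracing.HTTPHeaders, carrier); err == nil {
            return t, sc
        }
    }
    return nil, nil
}

// startServerSpan starts a server-side span as a child of whichever context
// was found, falling back to a fresh trace when none of the formats match.
func startServerSpan(tracers []opentracing.Tracer, r *http.Request, op string) opentracing.Span {
    if t, sc := extractFromAny(tracers, r.Header); t != nil {
        return t.StartSpan(op, ext.RPCServerOption(sc))
    }
    return tracers[0].StartSpan(op)
}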

The final advice, going back to the section title, is knowing when you should stop. Unfortunately, there's no cut-and-dried answer here, but there are some signals you should pay attention to. In general, you should consider what your service's breaking point is without sampling any of your trace data. Sampling is a practice where a certain percentage of your traces aren't recorded for analysis in order to reduce the overall load on your system. A discussion of sampling appears in "Sampling" on page 124, but we would advise that you don't consider the sample rate when writing instrumentation. If you're worried about the number of spans created by your service, consider using verbosity flags to dynamically adjust how many spans are being created, or consider "tail-based" sampling approaches that analyze the entire trace before making a sampling decision. This is important because sampling is the easiest way to accidentally throw away potentially critical data that might be useful when debugging or diagnosing an issue in production. In contrast, a traditional sampling approach will make the decision at the beginning of a trace, so there's no reason to optimize around "will this or won't this be sampled"—your trace is going to be thrown away in its entirety if it is sampled out. A sign that you need to keep going is if the inter-service resolution of your trace is too low. For example, if you're eliding multiple dependent services in a single span, you should keep instrumenting until those services are independent spans. You don't necessarily need to instrument the dependent services, but your RPC calls to each of them should be instrumented, especially if each of those calls is terminal in your request chain. To be more specific, imagine a worker service that communicates with several datastore wrappers—it may not be necessary to instrument those datastore wrappers, but you should have separate spans for each of the calls from your service to them in order to better understand latency and errors (am I failing to read, or failing to write?). Stop tracing if the number of spans you're emitting by default begins to look like the actual call stack of a service. Keep instrumenting if you've got unhandled error cases in your service code—being able to categorize spans with errors versus spans without errors is vital. Finally, you should keep instrumenting if you're finding new things to instrument. Consider modifying your standard bug-handling process to include not only writing new tests to cover the fix, but also writing new instrumentation to make sure that you can catch the issue in the future.

Smart and Sustainable Instrumentation Growth

It's one thing to instrument a single service, or a demonstration application that's meant to teach you some concepts about tracing. It's another, altogether more challenging task to figure out where to go from there. Depending on how you start your instrumentation journey, you may quickly find yourself in untested waters, struggling to figure out how to provide value from tracing while simultaneously growing its adoption within your organization or team. There are several strategies you can employ to grow instrumentation inside your application. These strategies, broadly, can be grouped into technical and organizational solutions. We'll first address the technical strategies, then talk about the organizational ones. There is some overlap between the two—as you might expect, technical and organizational solutions work hand in hand to enable each other. Technically, the best way to grow instrumentation throughout your application is to make it easy to use. Providing libraries that do the heavy lifting required to set up tracing and integrate it into your RPC frameworks or other shared code makes it easy for services to adopt tracing. Similarly, creating standard tags, attributes, and other metadata for your organization is a great way to ensure that new teams and services adopting tracing have a road map to quickly understand and gain value from tracing as they enable it. Finally, look at adopting tracing as part of your development and testing process—if teams are able to start using tracing on a day-to-day basis, then it becomes part of their workflow, and it'll be available once they deploy their services to production. Ultimately, the goal of instrumentation growth should be tied to the ease of adopting instrumentation. You're going to find it challenging to grow the adoption of tracing if it's a lot of work for individual developers to implement. Every major engineering organization to adopt distributed tracing (including Google and Uber) has made tracing a first-class component of its microservice architecture by wrapping its infrastructure libraries in tracing code. This strategy allows instrumentation to grow quite naturally—as new services are deployed or migrated, they'll automatically gain instrumentation. Organizationally, there's a bit more to talk about. All of the technical solutions presented earlier aren't going to be worth much without organizational buy-in. How, then, should you develop that buy-in? The easiest option, and one we've seen be incredibly successful, is simply a top-down mandate to use distributed tracing. Now, this doesn't necessarily mean you should start emailing your VP of Engineering; in many cases, that isn't the most effective strategy. If you have a platform team, SRE team, DevOps team, or other infrastructure engineers, these teams can be a successful place to look for the impetus to grow tracing throughout your software. Consider how problems are communicated and managed in your engineering organization. Who has performance management as part of their portfolio? These people can be allies and advocates for the initial implementation of tracing across all your services. If your SRE team uses tools such as launch checklists, add tracing compatibility to the checklist and start to roll it out that way. You should also consider how your tracing

is performing when you conduct postmortems on incidents—were there services that weren't traced that should have been? Was there data that was critical to resolving the incident that wasn't present in the spans? Instrumentation beyond the basics can also be a defined goal for your teams, measured like any other aspect of code quality. It's also useful to track improvements to instrumentation rather than simply adding new services—effective instrumentation is just as important as ubiquitous instrumentation. Ensure that a process exists for end users of your traces to suggest improvements, especially to shared libraries, in order to drive continuous improvement. Pay careful attention to existing instrumentation code during refactors, especially refactors that modify instrumentation itself. You don't want to lose resolution on your traces because someone removed spans accidentally! This is an area where building tests around your instrumentation is valuable, as you can easily compare the state of your traces before and after changes, and automatically warn or notify developers of unexpected differences. Ultimately, instrumentation is a critical part of distributed tracing, but it's only the first step. Without instrumentation, you're not going to have the necessary trace data to actually observe and understand requests as they move through your system. Once you've instrumented your services, you'll suddenly be presented with a fire hose of data. How can you collect and analyze that data in order to discover insights and performance information about your services in aggregate? Over the next few chapters, we'll discuss the art of collecting and storing trace data.

CHAPTER 5
Deploying Tracing

Understanding how to instrument code so that your application generates quality telemetry is no small feat, so congratulations on making it this far! However, that telemetry won't amount to much without the rest of a tracing deployment to consume those spans and use them to provide value to you and other developers. We'll spend this and the following chapter looking underneath the hood of tracers and consider some of the common components in those implementations as well as some of the trade-offs required. While not many readers are likely to be considering building new tracing solutions from scratch, a high-level understanding of what's going on will help you to choose the best tracer for your organization and to maximize the value that it can bring. Distributed tracing can offer a lot to organizations where individual teams work independently. Tracing problems across the many layers of your application helps you quickly identify which service is the performance bottleneck or is responsible for a regression. However, this independence can also be an obstacle to getting started with distributed tracing: if you're responsible for deploying tracing across your organization, you'll need to get these teams to work together. You'll get the most value from tracing if it's deployed consistently across your organization and your application. However, you'll face two sets of challenges in doing so. First, you need to overcome some organizational barriers: getting data from some teams may require those teams to instrument their services, change configuration, or maybe just redeploy their services. You may also need them to follow conventions about tags or at the very least propagate tracing context. Second, you'll need to make sure you have the right infrastructure in place. While this can be partially outsourced to a vendor, some parts of the tracing system will still run on your infrastructure or could affect the performance of your application. And even if you do manage to offload most of the work, understanding trade-offs in tracing

system design can help you evaluate different vendors and perhaps get some insight into how their pricing reflects the underlying costs of processing, storing, and analyzing traces.

Organizational Adoption

If your organization is building a distributed application, chances are that organization is not small. Successful adoption of distributed tracing will require work not just from you and your team, but from many teams across your organization—and perhaps even outside of your organization. In this context, you're looking not just to get the most value from distributed tracing early on, but to get the most demonstrable value. That is, you'll need to produce evidence of the value of tracing for your organization. To do so, you'll need to be thoughtful about where and how you deploy it. Though it's often considered a tool for "end-to-end understanding," tracing can also provide value for individual teams; showing an example of this is the best way to convince other teams to adopt it as well. In addition, deploying tracing at scale will also require considerable computing and storage resources. We will cover the costs of these resources in more detail in Chapter 6, but when thinking about how to deploy tracing, it's important that you choose a tracing solution that can meet your organization's needs not just in terms of features, but in how those features perform at scale. It's also important to make sure that the incremental cost of doing so doesn't outpace the incremental value offered by tracing.

Start Close to Your Users

There's no better way to ensure that tracing has business value than to start close to the users of your application. How do they interact with your application? Through a mobile app? A single-page web app or a more traditional one? Through a specialized device? Perhaps your organization only provides an API to your users, in which case that API is as close as you'll be able to get. In any case, if you measure performance close to your users, you can be sure that you are measuring something that matters. In our experience, we've seen both large and small organizations fail to take this approach and instead choose a service to instrument because it was easy. Unfortunately, while the instrumentation was indeed easy, it didn't help build evidence that tracing should be a priority for the organization, and it didn't help to surface challenges that developers faced as they continued to roll out tracing. It is possible to start too close to your users for your initial foray into tracing. For example, if your mobile app is on a slow release cadence, it might take too long to get the initial version of tracing deployed, or be too slow to iterate on instrumentation. Web apps or nightly builds of mobile apps can be good choices. Get as close to your users as you reasonably can, but make sure you can still move quickly.

You should also consider specific types of requests or transactions that are important to your users and your business. For example, it might be tempting to choose an asynchronous request type that's used to record some analytics about user behavior: this might seem easy (it's a simple request and changes to it do not receive much scrutiny) and low risk (changes are unlikely to negatively impact users). However, you also have far less to gain by starting with this type of request. Instead, start with a type of request that represents an important user conversion. For example, if your application is part of an ecommerce solution, start with the point at which a purchase is made.

Start Centrally: Load Balancers and Gateways

If you can't start with a mobile app, web app, or other client of your application, choose a part of your backend systems that is still relatively close to your users: ingress load balancers or API gateways. Ingress load balancers, especially HTTP (or "level 7") load balancers, are good candidates for quickly getting started with tracing. Load balancers are designed to efficiently pass through traffic, and it's relatively easy for them to generate spans in addition to the other metrics and logs they might already be emitting. Many widely used load balancers have support for tracing built in or have existing plug-ins that make adding tracing easy. For example, Envoy supports several tracers out of the box; Linkerd ships with support for the OpenCensus collector; NGINX supports an OpenTracing plug-in that can be used with several tracing systems. HTTP load balancers can add a number of interesting tags automatically, including the request path, method, and protocol, as well as status codes indicating the success or failure of the request. These tags can be valuable data sources when using distributed tracing to understand application performance. Note that TCP (Transmission Control Protocol, or "level 3/4") load balancers provide significantly less value, since they do not have access to HTTP (or other application-level) request data. We have seen few examples where instrumenting TCP load balancers provides value as part of a distributed tracing solution. API gateways also provide an opportunity to gather rich telemetry that is relatively close to the user and also broad in scope. Teams who manage API gateways often feel a lot of pain that can be alleviated with tracing and will be willing allies. In particular, they are often held accountable for the performance of the APIs that sit beneath the gateway or, at the least, get paged frequently when those upstream systems perform poorly. If a gateway service can emit spans for the endpoints it serves and for its view of the services for which it's acting as a gateway, these spans can be used to attribute slow performance to other backend services (and their respective teams). This is important

because API gateways frequently make calls to several upstream systems, including authorization and other common services, in addition to the services providing the business logic for the API requests that they are serving. An immediate effect of this approach is that the team that owns the gateway service can more confidently identify which other teams need to improve performance. This is an example of how tracing can benefit even a single team: by tying the performance of a service to the performance of its dependencies, even very short traces can provide value. When starting a tracing deployment with either ingress load balancers or API gateways, these initial traces can help inform your next steps in that deployment. For example, if a particular upstream service is frequently a bottleneck for request latency, that service would be a logical next step for instrumentation (and that instrumentation hopefully will provide results that directly impact user-perceived latency).

Leverage Infrastructure: RPC Frameworks and Service Meshes

Finally, a third approach to starting with tracing in a larger organization is to leverage the infrastructure that connects services. If your organization has a standard framework for making RPC calls or you use a service mesh, you can get at least half the work done for a broad (though not deep) integration with tracing very quickly. As we discussed in Chapter 4, RPC frameworks and service meshes provide standard ways of connecting services, and often provide support for security, service discovery, load balancing, and (relevant to the current topic) generating telemetry. Many frameworks and service meshes already support tracing out of the box or can be easily extended to do so. As with ingress load balancers, they may also be able to add some additional information to those spans, possibly including request information and error codes. RPC frameworks and service meshes can also facilitate propagating context between services: they can make sure that span and trace IDs are included among the headers or other metadata. Extracting context is another story, however. Since many of these frameworks are not part of handling requests on the server side, you'll need to find another way to do this. If there is already common code used across your organization as part of handling requests, then you can plug in additional middleware at that point to extract trace context. Without any common request-handling code, however, you will need to make some changes to the services themselves to extract context from requests as they arrive. If you don't have either of these options, you'll be left with a lot of single-span traces and not much to say about getting better end-to-end visibility. In addition to instrumenting your framework, choose one service to start with and make sure that context is being propagated properly through that service. At the very least, the team that owns that service will be able to understand how the performance of requests it handles relates to its upstream services. As with the previous approach,

even without deep instrumentation into all of your organization's services, a broad integration with tracing can help inform the next steps for rolling out tracing.

What About Service Orchestration?

We have discussed load balancers, service meshes, and other infrastructure, but you may be wondering why we don't mention Kubernetes or other orchestration tools as part of deploying tracing. (After all, users of Kubernetes certainly need distributed tracing, right?) While it might be possible in the future, today orchestration systems don't have much to offer directly to speed the adoption of distributed tracing. (Of course, you might use Kubernetes or other platforms as part of the infrastructure on which the components of a tracing solution run, but here we're talking about collecting telemetry.) The fundamental reason for this is that most orchestration platforms, including Kubernetes, focus on making sure that the right code is running in the right places rather than on how individual requests are handled. When they do help with things like service discovery, they aim to get out of the way as quickly as possible. This is all to say, these platforms focus on the control plane rather than the data plane. In many ways, orchestration tools like Kubernetes complement observability tools like distributed tracing. Distributed tracing provides an understanding of what is happening in a distributed system, while orchestration tools provide a means to control these systems and effect change. Once you know which service deployment is causing a problem, Kubernetes will help you to quickly roll it back.

Make Adoption Repeatable

After you've shown that one team can be successful with tracing, the next step is, well, to help a second team do so. Take what you've learned from that first team and use it to make the next steps easier. This can mean both choosing which teams to go to next and knowing how to approach them. For example, did any other teams get pulled in to help address performance problems with the first roll-out of tracing? (Were any other teams implicated in those problems?) Are there other teams that are facing similar problems to those addressed by tracing in the initial roll-out? Also consider what worked and didn't work for that first team. If there was a particular type of request for which tracing offered a lot of value, look for an analogue of that type of request. If particular tags were invaluable, make sure that those are included for each new service. If particular use cases came up a lot, consider how to automate anything necessary for new teams to take advantage of those. If you didn't instrument any of the frameworks or standard libraries used across your organization as part of your initial deployment, now is a good time to start planning

that work. With the evidence you gathered in support of tracing, it should be easy to make the case for the organization to invest in tracing. One thing that we didn't worry about with your first deployment was how to standardize the process; this might be the time to start. As we discussed in Chapter 4, tracing is most effective when there are standards around how operation names appear and the types of tags that are included in spans. While this wasn't necessary for the first team to adopt tracing (and in fact, wouldn't have provided any value), it may become important for the second team—and certainly for the third, fourth, and fifth teams—to get the most value from tracing. It's also a good time to consider adding tracing to launch checklists so that all new services will be built and deployed with tracing enabled. While the goal of the initial roll-out was to prove the value of tracing, moving forward it should be to reduce the friction for each subsequent team and to help make sure that tracing is used consistently from team to team.

Tracer Architecture

As you roll out tracing across your organization, the tracer you use must be prepared for greater and greater load. Whether you are deploying an open source tracer or adopting a proprietary solution, understanding what's going on underneath the hood will help you deploy and scale up your tracing solution smoothly. Though the architecture will vary from implementation to implementation, this section presents a high-level view of the major components of most tracers. Figure 5-1 shows a simplified view of a tracer architecture. While some components may be combined in some implementations, the work of the tracer can be logically divided as follows:

In-process libraries
Application code generally creates spans not by making network requests, but instead by using an SDK. As discussed in previous chapters, this SDK may be used as part of explicit instrumentation or may use reflection or other features of dynamic languages to automatically create spans.

Sidecars and agents
Some parts of the tracing system run near the application and help to forward data quickly to the rest of the tracer.

Collectors
As opposed to sidecars or agents, which are generally stateless, collectors may temporarily store spans, filter them, compute aggregate statistics about them, or prepare them for storage and analysis.

Centralized storage and analysis
This is where the real magic happens: where traces are assembled from their constituent spans and where global statistics about application performance are computed. It also provides the user interface that lets developers search traces and that visualizes those traces.

We'll cover each of these components in turn.

Figure 5-1. Simplified tracer architecture.

In-Process Libraries

While applications could send spans over the network to a tracing system as they are generated, there's a lot to be gained from buffering them and sending them in batches: some repeated parts of the spans need only be sent once, and sending batches can reduce network overhead. We can't expect every service to reimplement this functionality, so nearly all tracers provide some form of buffering as part of their SDKs. In addition, these libraries handle discovery of other components of the tracer as well as network errors by retrying attempts to send spans (and reporting those errors appropriately). They may also offer a number of performance optimizations that are important for high-volume services: optimizations in how spans are buffered (including how those buffers are synchronized) and improvements in performance achieved through continuously streaming many spans over a single network connection (reducing network overhead, latency, and local memory consumption). Depending on your position within your organization and the service in question, deploying new tracing libraries might be a breeze or a Sisyphean task. For the services

that are deployed weekly (or even daily), adding a new dependency might be a relatively easy task. If you are part of a platform team, doing so might involve a little more work, including making the case for the change, creating the pull request in an unfamiliar codebase, getting approval, and deploying the change. In any case, the actual process will depend on the language, platform, and package (or dependency) management system. On some platforms, especially those that use interpreters, just-in-time compilers, or dynamic linking, it's possible to deploy in-process libraries with no code changes (and no recompilation). For example, Java's Special Agent offers a means to dynamically link relevant instrumentation and tracer libraries into the Java virtual machine (JVM) by inspecting the modules that have already been loaded. In those cases, redeploying services with new configurations may be sufficient. (Note that services on these platforms make excellent candidates for your initial roll-out of tracing!)

Sidecars and Agents

The terms sidecar and agent can be used to mean a number of things, so if you see them, take a closer look to be sure you understand what functions that tracer component is implementing. In many cases, these are relatively stand-alone components that, while deployed close to your application, are isolated enough to have little opportunity to interfere with application performance. A common motivation for creating this type of component is to move as much functionality as possible from the in-process library to a sidecar, alleviating the need to reimplement this functionality for every language or platform. Some of this functionality may include discovering other tracer components and handling network errors. Depending on how you've set up your infrastructure, this might take the form of a host-level daemon (for example, using systemd) or a sidecar container running alongside your service container. One important consideration will be budgeting for the resources required to run these sidecars: though it's likely to be a small amount of CPU per service instance, the total cost can quickly add up as you deploy tracing for every service in your application. Even in a containerized environment, you might consider running only one sidecar per host (for example, using a Kubernetes DaemonSet). This will potentially save on some resources (since you will need to run fewer sidecars) and can help collect better telemetry throughout the service life cycle (running within the same Kubernetes Pod can mean that your tracing sidecar can't observe service shutdown, since it is shut down at the same time). On the other hand, running only one sidecar per host can mean that one noisy service instance on that host can generate so much telemetry that the tracing sidecar is overwhelmed and data from it—and other service instances—is dropped.

Historically, the term "agent" was used to describe processes that interacted more directly with—or even were part of—your service instances, often plugging into the runtime of the service itself. While most modern tracing systems use the terms "client" or "SDK" for this sort of functionality, you still might see it in some cases. One reason that many of these systems stopped using this term was that it can be associated with plug-ins that would significantly impact performance, in some cases so much so that they could only be used in staging or QA environments and never in customer-facing production environments. Java's Special Agent was so named because it offers many of the advantages of a more traditional agent (primarily that it's easy to install) without the runtime overhead.

Collectors

Span data is often emitted in a format that's not optimized for storage or analysis. At the same time, not all data will be kept in most tracing deployments: most of these spans will not offer enough value, and, as we'll describe in Chapter 6, the costs of doing so are quite high. As such, it's important to perform some translation, sampling, and aggregation of spans. For these reasons, and generally to provide a level of abstraction between the application and the rest of the tracing components, many tracers include a component called a collector. What distinguishes a collector from a sidecar or agent is that it's usually running further away from application services and may even run on dedicated hosts or virtual machines (VMs). The function of a collector varies from implementation to implementation, but we will cover some common cases here. Most in-process libraries and sidecars transmit spans in a way that minimizes performance impact on the service process itself. This usually means minimizing the amount of computation required by the service (since CPU is often the scarcest resource). However, this is often not the most efficient way to transmit spans from the point of view of network consumption, nor is it usually a convenient format for querying, storing, or analyzing spans. One common function of a collector is to translate incoming spans into a format more amenable to these processes. For example, a simple translation would be to compress spans using a generic compression algorithm. Another, more domain-specific compression technique would be to create dictionaries of commonly appearing strings (for example, common service and operation names and common tags) and then forward those dictionaries and the spans that reference them. Collectors may accept spans in a number of different formats and translate them into a single, uniform format. Others may forward spans to multiple tracer systems, possibly even in different formats. In cases where the in-process library or sidecar is shared across multiple tracer implementations (as in the case of OpenTelemetry), the collector may be the first component to introduce a tracer-specific format.

Collectors are also often responsible (at least in part) for sampling spans: that is, in implementations where only a subset of spans are processed, collectors are responsible for selecting which spans should be forwarded to other tracer components. There are a number of different ways that spans can be sampled, including uniformly at random or based on attributes of those spans or other information. In some cases, sampling can be performed either in-process or in a sidecar, but since there is often a need to change the parameters controlling how spans are sampled, moving this functionality to a smaller set of centrally controlled processes can make managing this configuration easier. In other cases, spans must be buffered for a longer period of time before they can be sampled, requiring more memory than would typically be available in a sidecar. Collectors may also be responsible for computing some aggregate statistics about spans. For example, the spans that are not sampled may be accounted for in various ways, including the total number of spans received from a given service, the number of spans where an error occurred or with a given tag key and value, and information about the latency of some or all spans (including median and standard deviation, or even as a histogram). Computation of these statistics might be too expensive to do within the service process or in a sidecar. These statistics will also be more accurate if they are computed before significant sampling takes place. In all cases, the goal of computing these aggregates is to preserve some information about as many spans as possible, while reducing the total amount of data that must be forwarded to other tracer components. Collectors may be deployed in a number of different ways, depending on what functionality they provide. In some cases, they may require significant resources to store, index, or otherwise process spans. When they do, they'll often be deployed on dedicated hosts (to isolate them from the rest of the application). If they perform significant sampling, it might be beneficial to deploy them on the same network as the application, because this can reduce network costs, especially if other tracer components are not deployed on the same network (and even if they are still on dedicated hosts). On the other hand, if a tracer implementation performs few of the kinds of functionality described in this section, this functionality may be built into the sidecar or agent. This is often the case when they store very little state (and especially when that state is specific to a service instance). Doing so simplifies tracer deployment, since there is one less component to deploy.

Centralized Storage and Analysis

Finally, tracers must take these spans (and any other information computed from them) and provide some value to developers. This work is usually performed not just by one component but by a set of components. Together they are responsible for

gathering all of the tracing telemetry from your application, storing it, analyzing it, and visualizing it in useful ways. The number of components and their function will vary widely based on the tracing implementation you choose. Likewise, how you deploy these components of a tracer will also vary significantly based on this choice. Tracers may store spans and other data in a variety of ways. Tracers may include databases of spans or traces as well as time series data (including request rates and latency). These storage systems will typically provide at least a few different indices on these spans and time series. They may also include temporary storage of spans that is used during the ingestion process (for example, as a message queue). Tracers will also include components that can search these storage systems in response to developer queries or automated analyses. In many cases, tracers are used as part of an incident management process, and in those cases, they must be able to answer questions about requests that have occurred in the last few minutes—if not the last few seconds. As such, the process of receiving, storing, indexing, and analyzing traces must also be completed in minutes or seconds after a request is completed. Due to the volume of data, the tracer components that ingest and store data often require significant effort to deploy and maintain. The most important aspect of these tracer components is that they centralize the functionality described earlier: since the whole point of traces is to provide cross-service visibility, some components of the tracer must bring together data from across all services. In most tracer implementations, the in-process libraries, sidecars, and collectors all have only a narrow view of the application, perhaps from one service instance or from a handful of them. It's the role of these centralized storage and analysis components to take whatever data has been forwarded to them and build a unified view of the application and its behavior.

Incremental Deployment

As you begin your journey toward adopting distributed tracing, think about which aspects of the implementation could present the biggest challenges for your organization. Perhaps your organization uses a large number of languages and frameworks; perhaps teams in your organization make similar technology choices, but manage releases using very different processes; perhaps some tracing use cases are much more important than others; or perhaps your organization operates at a scale much larger than most other organizations. Consider what makes your organization different from others, and how these differences might impact choices related to tracing. Despite all of these potential variations, we can unilaterally endorse the use of open source APIs, SDKs, and libraries. A great deal of effort has gone into building these tools in ways that minimize potential performance impact on your application, while at the same time maximizing your options for other choices about tracer implementations. A number of different open source and commercial tracer implementations

Tracer Architecture | 109 work with projects like OpenTracing, OpenCensus, and OpenTelemetry; they make a great choice even if you decide that your best path forward is to build a new tracer from scratch! Using an open source API and SDK will also enable you to compare tracer imple‐ mentations very early in the process: the functionality offered by different tracers can vary a lot. As you are making the case for tracing within your organization, make sure that you are using results of specific implementations. As part of that compari‐ son, also make sure that you test at scale: you don’t want to find out several months into your project that your favorite tracer can’t handle traffic from your production workload. Also, remember that—at least from your organization’s point of view—the adoption of distributed tracing will continue for some time. Make sure that the initial invest‐ ment in terms of instrumentation and implementation makes sense for both the short- and long-term value that your organization will be deriving from it. Data Provenance, Security, and Federation So far, when we’ve discussed instrumenting your application and deploying a tracer, we made a simplifying assumption that all of the code was under your control. In fact, that code runs in a variety of environments, including not only your datacenter or virtual private cloud but also the phones and computers of your users and the environments managed by any service providers that you might leverage as part of your application. If telemetry is generated by code that runs outside of your control, you should ask additional questions about the quality of that data. Frontend Service Telemetry One powerful aspect of distributed tracing is its capacity to tie performance from your frontend services (that is, your mobile app and web clients) together with per‐ formance from your backend services, providing a complete view of how your appli‐ cation is working. Capturing telemetry from frontend services lets you measure performance from as close to your users as you can get. However, because of its source, frontend telemetry must be given special treatment in a couple of ways, and in the context of our current discussion, it’s important to ask how much you can and should trust it. Frontend telemetry can be suspect first because of the quality of the platforms them‐ selves. Mobile devices and desktops are notoriously prone to inaccurate clocks; that is, the time reported by a mobile device may be seconds or even minutes different from that reported on your backend servers. This leads to a kind of clock skew where a span reported by a frontend service might look as if it doesn’t start until after its child span (as reported by a backend service). The top trace in Figure 5-2 shows an

110 | Chapter 5: Deploying Tracing example of how this might appear; in the example span A is generated by a frontend service, and the other four spans come from a backend service, including spans B and C, which are child spans of A. While A starts before C (as it should), it appears to start after B, which should not be possible. There are two common methods to address this problem. First, tracing solutions can attempt to measure this clock skew using adaptations of the Network Time Protocol.1 This involves recording timestamps both on the tracing library (running as part of the frontend service) and within the tracing implementation, and comparing these timestamps to estimate the clock skew. This skew can then be removed from frontend spans by adding or subtracting it from the timestamps that appear in the span. In our experience, this method is effective most of the time. However, when it does fail (usu‐ ally due to one of the many uncertainties of running code on a mobile device), the results can be very confusing, and can offer results that are even less accurate than the original timestamps. (It’s important to call out when this is happening so that devel‐ opers can disable it and restore the original timestamps.)
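To make the arithmetic concrete, here is a minimal sketch of this kind of skew estimation, assuming the frontend SDK and the tracing backend exchange four timestamps (frontend send, backend receive, backend send, frontend receive). The function names and span fields are illustrative, not any particular tracer's API:

// Minimal sketch of NTP-style skew estimation. All timestamps are in ms:
// t0 = frontend send, t1 = backend receive, t2 = backend send, t3 = frontend receive.
function estimateSkewMs(t0, t1, t2, t3) {
  // Positive result: the backend clock is ahead of the frontend clock.
  return ((t1 - t0) + (t2 - t3)) / 2;
}

function correctFrontendSpan(span, skewMs) {
  // Shift frontend timestamps onto the backend's timeline. In practice a
  // sanity check (like the causal method described next) is layered on top.
  return { ...span, startTime: span.startTime + skewMs, endTime: span.endTime + skewMs };
}

const skewMs = estimateSkewMs(1000, 6003, 6010, 1020);   // backend roughly 5s ahead
console.log(correctFrontendSpan({ name: 'A', startTime: 900, endTime: 1400 }, skewMs));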

Figure 5-2. Example tracing showing clock skew before (top) and after (bottom) correction.

1 [Mil17]

Data Provenance, Security, and Federation | 111 A second method to address clock skew between frontend and backend services is to leverage the causal relationships between client and server spans. By definition a cli‐ ent span must start before its associated server span starts. And while it’s not always the case, in many applications server spans end before their associated client spans end. Given this, the timestamps in frontend spans can be adjusted to maintain these invariants. Unlike an estimate of clock skew, these adjustments yield more predictable results. However, these adjustments are quite coarse: they don’t establish exactly when a frontend span should start, they only put an upper bound on it. As such, it’s difficult to distinguish how much network time is consumed by the request as opposed to the response. In our work on implementing tracing solutions, we’ve used a combination of these two techniques: we use clock skew estimation to try to build a precise model of per‐ formance, but use causal relationships as a check to make sure that these estimates make sense. The bottom trace in Figure 5-2 shows what that same trace might look like after timestamps in A have been corrected. Frontend telemetry can also be suspect because of the actions of malicious users. That is, these users might try to manipulate your telemetry to disguise performance prob‐ lems or distract you from other work by creating spans that indicate spurious issues. Malicious users might also try to overwhelm your tracing system by instigating a denial-of-service attack against it. One course of action that we’ve observed is to simply ignore the problem. The cost of generating sufficient false telemetry is high enough—and the impact of that telemetry sufficiently low—to discourage any users from carrying out such an attack. Figure 5-3 (A) shows what this might look like: spans are sent directly over the internet to trac‐ ing backends, in parallel with requests made to other application backends. Depend‐ ing on how you are using spans from frontend services, this may be appropriate for your organization. One variation of this approach is to make sure that any spans from frontend services (that is, from untrusted sources) can be segregated from those originating from back‐ end services. That way, even if your tracing system were to be attacked in some way, you could still be confident that spans originating from backend services were accu‐ rate (and simply ignore frontend ones until you could put another solution in place). OpenTelemetry’s concept of a “public endpoint” (as discussed in Chapter 4) is another way of marking this sort of trust boundary. By setting that attribute, you’re indicating to the tracing system that you don’t completely trust the context that was propagated to this point.

112 | Chapter 5: Deploying Tracing Figure 5-3. Three ways of sending spans from frontend services (for example, a mobile app) to tracing backends.

An alternative to trusting frontend services is to synthesize these spans on the back‐ end. That is, rather than sending spans directly from frontend services, those front‐ end services send enough data (for example, span start and end times) to the backends that the backends can reconstruct those spans and forward them to the tracing implementation. Figure 5-3 (B) shows what this might look like. (Not shown in the figure are spans sent from the application backends to the tracing system.) This might mean extending the API provided by the application backends, but because such an API is much narrower than a generic tracing API, it limits the kinds of false information that an attacker can provide and also increases the cost of doing so (since the attack must be customized to your application).
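As a rough illustration of how narrow such an endpoint can be, the following Node.js sketch accepts only a trace ID and a pair of timestamps and rebuilds a span from them. The route, payload shape, and forwardToTracer helper are hypothetical, standing in for whatever export mechanism your tracer provides:

const http = require('http');

// Placeholder for however reconstructed spans are handed to the tracer.
function forwardToTracer(span) {
  console.log('synthesized span:', span);
}

http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/telemetry/page-load') {
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      try {
        // Accept only the fields needed to rebuild the span; ignore everything else,
        // which limits what a malicious client can inject.
        const { traceId, startTime, endTime } = JSON.parse(body);
        forwardToTracer({
          traceId: String(traceId),
          name: 'page-load',
          kind: 'client',
          startTime: Number(startTime),
          endTime: Number(endTime),
        });
        res.writeHead(204);
      } catch (err) {
        res.writeHead(400);
      }
      res.end();
    });
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);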

Data Provenance, Security, and Federation | 113 This approach also has a couple of other advantages: you might be able to save on the number and size of network requests from frontend services; you can also more easily upgrade the format of your frontend telemetry, since it will only require a backend deployment to do so. However, taking this approach means that when your backends are down, you will receive neither any backend telemetry nor any frontend telemetry. It is also a lot more work, since you are in some ways reimplementing part of the frontend tracing SDK. Finally, the most secure way to accept spans from frontend services is (perhaps unsurprisingly) to authenticate these requests. For example, you can set up an authenticating proxy that can validate the users from which spans are being sent, as shown in Figure 5-3 (C). In fact, you probably already have one of these proxies, so authenticating telemetry might just be a question of setting up a new route in its configuration. This approach has a couple of drawbacks: it doesn’t work if your application accepts anonymous traffic and it is still possible for an attacker to send spurious data through such a proxy (though it’s much harder to do so at scale). And of course, if an attacker compromises your authentication system, they could leverage that to confuse or over‐ whelm your tracing system…but in that case, you’ve probably got bigger problems to consider. Server-Side Telemetry for Managed Services In some cases, you may now be able to let others take on some of the work of deploy‐ ing tracing for you: some managed service providers are now starting to emit teleme‐ try describing the services that they provide. This means that you need to manage neither these services nor their telemetry. While this practice is still in its infancy, it’s an important step forward for observability. Managed service providers, whether they offer data storage, data analysis, or integra‐ tion with other services, are a great way to quickly implement parts of your applica‐ tion or even extend it. And while these service providers probably have an obligation to provide some sort of baseline performance, there can still be variations in this per‐ formance that affect your application (positively or negatively). Even if not, it can still be useful to understand how managed service performance is related to the perfor‐ mance of the rest of your application. Even without the cooperation of service providers, you can instrument your applica‐ tion to provide some visibility into how these services are performing by adding client-side spans to your application. The span labeled GET / shown in Figure 5-4 is an example of this. It’s tagged kind: client to indicate that it represents time waiting for an external request to complete. Without any managed service spans, this would be the bottom span in this trace: you’d have no indication of what’s hap‐ pening within this request.

114 | Chapter 5: Deploying Tracing Figure 5-4. An example trace that includes a span from a managed service provider.

However, with managed service spans, you would also see the span labeled GetObject (which is also tagged as kind: server), as shown in the figure. This can then be used as part of the approaches described in Chapter 8 and Chapter 9, for example, using the difference between the client and service spans to measure the impact of the net‐ work on request latency. In fact, this trace looks like any other trace you’ll see in this book. Depending on the level of visibility that the service provider is comfortable with, you might also glean some information about how the request was handled. As with frontend spans, there can be some concerns about trusting the data that comes from devices beyond your organization’s control. Since you have an explicit relationship with this service provider, you can probably trust it not to intentionally pollute your data. However, it still would be worthwhile to identify these spans in some way: should a problem occur, doing so will enable you to isolate data from external sources. There are still several challenges that need to be addressed to make it easy to integrate managed service spans. For example:

• How are spans transmitted to the tracing backends? (Directly from the service provider or not?) • How is tracing context propagated to the managed service? (As part of request metadata?) • What naming and other conventions are used for operation names, tags, and other span data?

Ultimately, these are all questions about how to federate tracing. And while these problems existed before (for example, when integrating several open source projects), they are more acute when the software is not only written by different organizations but also managed independently. In part, efforts like OpenTelemetry can help address these problems. However, they will also require some standardization (or at least coordination) of tracing solutions as they concern not just the format of the telemetry data but also how it is ingested.
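Returning to the GET / and GetObject spans in Figure 5-4, the network estimate described there amounts to a simple subtraction; the span shape below is illustrative:

// Rough sketch: estimating network and queuing overhead for a managed-service
// call by comparing the client span with the provider's server span.
function networkOverheadMs(clientSpan, serverSpan) {
  const clientDuration = clientSpan.endTime - clientSpan.startTime;
  const serverDuration = serverSpan.endTime - serverSpan.startTime;
  return clientDuration - serverDuration;
}

const getRoot = { name: 'GET /', kind: 'client', startTime: 0, endTime: 180 };
const getObject = { name: 'GetObject', kind: 'server', startTime: 40, endTime: 160 };
console.log(networkOverheadMs(getRoot, getObject)); // => 60 ms spent outside the provider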

Data Provenance, Security, and Federation | 115 Managed service telemetry is still very new and only available for a handful of serv‐ ices. If you see it, thank your provider…and take advantage of it! Summary Properly instrumenting your services is the first step in distributed tracing, and doing so will generate a lot of data, more than you or any other developer has time to look at. Success in distributed tracing requires something to sift through that data and find the insights you need to understand and improve application performance; that’s the role of the rest of the tracer implementation. In this chapter, we considered some of the human factors at play in deploying tracing within a larger organization, as well as a high-level architecture of many tracer imple‐ mentations. A major factor driving the design of tracers is managing infrastructure costs—computing, network, and storage. We’ll dive into these costs and how different tracers handle them in more detail in the next chapter.

116 | Chapter 5: Deploying Tracing

CHAPTER 6
Overhead, Costs, and Sampling

Defining the right set of spans to trace to understand your application can be a challenge—though a challenge worth rising to—but once you’ve done so, you’ll find yourself faced with another challenge: managing the torrent of spans as they’re emitted from your application. Even when your application is generating data at the right volume, it’s still important to understand the impact on the performance of your application and the cost of your computing infrastructure. The first tenet of distributed tracing—like all observability tools—should be to “first, do no harm.” Tracing can be implemented in a way that has negligible impact on your application, but managing the cost of the infrastructure can be more difficult.

Not all spans have equal value. Many spans represent run-of-the-mill requests that are (hopefully) bountiful within your application. While it’s useful to measure the performance of these requests and perhaps to have a few examples, chances are that just a handful will be sufficient. On the other hand, spans related to a rarely occurring bug or to a small but important user can provide critical insight into what’s happening and why.

Above all, it’s important that the set of spans representing a single request is preserved as an atomic unit. If only a part of a request is available, then tracing has failed in its goal of providing end-to-end information about what’s happening. This means that while many spans might appear to be low value because they represent ordinary cases, they provide context for other less-ordinary spans within the same request. If not all spans have equal value, then selecting the right spans is crucial to managing costs and making sure you are getting a return on your investment in tracing.

In this chapter, we will make a few assumptions about the architecture of a tracing solution to aid our discussion. Most tracing solutions include most or all of the components shown in Figure 5-1, though they may be implemented in different ways.

117 First, most tracing solutions include some sort of SDK that enables application devel‐ opers to create and annotate spans. Some solutions may also include an agent that runs close to application services as a sidecar process or on the same host. A set of collectors begin the aggregation process, while finally spans are analyzed and stored by some central service. Application Overhead The first place the cost of tracing can show up is in the performance of the applica‐ tion itself. Ideally, tracing SDKs would have no effect on your application’s perfor‐ mance, but without some care, tracing can have an impact on both your application’s latency and its throughput. Latency Latency, or the time required to process a request, is of utmost importance to applica‐ tion owners: reducing latency is one of the primary metrics that lead developers to look to distributed tracing. However, gathering the data required to build traces can also contribute to latency. Understanding that impact is an important part of choos‐ ing the right granularity for spans. Creating and finishing spans as well as adding tags and logs all can generate latency. High-performance tracers perform only the absolutely necessary work on application threads and move the remaining work to a background thread. Still, building spans often requires additional function calls, allocation, or updates to shared data struc‐ tures. For example:

• Creating a span might require allocating a new object, adding a reference to the string representing the name of the operation, reading a value from a perfor‐ mance timer, and possibly updating some thread-local state. • Logging an event may require serializing a data structure into a generic format that can be sent over the network. • Finishing a span might require reading a value from a performance timer, updat‐ ing a field in an object, and storing that object in a shared buffer.

Allocating additional objects can certainly have an impact, especially if tracing instru‐ mentation is added to a performance-critical piece of code. In a garbage-collected language, this allocation may trigger a collection which can have further negative impact on latency. Logging has the greatest potential for impact here as it often involves the greatest amount of data. Best practices for logging mean that an event should be structured: application authors should log each event with a well-defined structure, including field names:

118 | Chapter 6: Overhead, Costs, and Sampling

span.LogEvent({'request_id': req.id,
               'error_code': 404,
               'message': 'document not in corpus'});

However, these events must be converted to a generic format that can be sent over the network to downstream tracer systems. Usually this means that each event is serialized into a single string, often JSON. This requires additional allocation and time to convert integers and other binary data into appropriate representations.

One strategy for reducing the cost of this serialization is to defer this work or move it to a background thread. While attractive, this can have unintended consequences if any of the parameters passed as part of the event are subsequently modified. Consider Example 6-1, where incremental progress is logged as part of a span.

Example 6-1. Incremental progress logged as part of a span

status = {progress: 0.1, complete: false};
span.LogEvent({'message': 'work started', 'status': status});
doSomething();
status.progress = 0.2;

If the event is not serialized until after the last line of that example, the data that later appears as part of the trace will not reflect the state of the application when LogEvent was called. Some tracers distinguish between an ordinary event and one which might be serial‐ ized out of band. However, application developers often miss this subtle distinction, and many tracers opt for a simple interface where all events may be serialized out of band. As such, many tracer APIs force users to express each field of an event as a scalar value. If that’s not the case, best practice is to avoid passing shared, mutable data structures when logging with tracers. In a multithreaded application, using a shared buffer can become a source of conten‐ tion as many threads try to add spans to that buffer. This can be mitigated by either batching spans locally before adding them to a shared buffer or by using a lock-free data structure as the implementation of that buffer. While tracing can contribute to application latency, following best practices usually means that its impact is quite small and might even be too small to be measured. For example, if spans are only created as part of network calls, the latency changes will be in the noise: one additional function call, allocating a handful of bytes, and perform‐ ing an atomic compare-and-swap instruction are all orders of magnitude faster than making a round-trip RPC, even within a single datacenter. When combining these best practices with reusable buffers and lock-free structures, and moving work to a background thread, most developers can safely ignore the latency impact of tracing even in user-facing production systems.
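Returning to Example 6-1, one way to avoid this hazard (a sketch, not any particular tracer's recommended API) is to copy the mutable fields into scalar values at the moment the event is logged:

// Snapshot mutable state when the event is logged, so later mutations
// cannot leak into the trace. span.LogEvent follows the illustrative API
// used in Example 6-1.
status = {progress: 0.1, complete: false};
span.LogEvent({'message': 'work started',
               'progress': status.progress,     // scalar copy
               'complete': status.complete});   // scalar copy
doSomething();
status.progress = 0.2;  // no longer visible to the already-logged event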

Application Overhead | 119

Performance Impact of Other Kinds of Tracing

Throughout this book, we focus on distributed tracing, but there are many other kinds of tracing that developers use to understand application performance. Kernel tracing and browser tracing are two examples. Many of these techniques and the associated tools focus on a single process, diving deep into the performance of that process. They often provide extremely fine-grained performance data, even down to the level of an individual function or line of code. To provide that level of resolution, these tools integrate tightly with the language compiler or runtime and may have significant impact on performance. If enabled in user-facing or large-scale deployments, this could result in poor user experience or additional infrastructure costs.

When developers first learn about distributed tracing and how it’s being used in production systems, they may be surprised to learn how little effect it has on performance. This is largely due to the granularity of instrumentation: since distributed tracing focuses on events such as interprocess communication, the overhead is negligible when compared to the duration of the events themselves.

Throughput

Tracing can also affect application performance by reducing throughput, the number of requests that a fixed amount of infrastructure can handle in a fixed period of time. This can result in increased infrastructure costs, since additional computing power is required to handle the same number of requests. Throughput is often an important concern for high-volume services, where these costs can be significant.

As described in “Latency” on page 118, one of the primary ways that latency impact is managed is by moving work to a background thread. This thread (or threads) is then responsible for serializing span data so that it can be sent downstream. It is also responsible for buffering spans to reduce per-span network overhead as well as retrying failed network requests.

As in other contexts, buffering has its trade-offs. A larger buffer trades off network overhead for increased memory usage. In addition, larger buffers introduce a delay between when events occur in the application and when they can be observed in a tracing tool. Best practices indicate that application events should be reflected in observability tools within one minute: any more delay can mean that developers and operators can’t adequately understand whether the changes they are making (for example, rollbacks) are having the desired effects.

For mobile clients, power is another critical resource, and many devices will periodically power down mobile data radios to save power. Tracer SDKs should be careful not to buffer spans for too long in case the radio is powered down in the meantime (and sending those spans would cause it to be powered up again).

120 | Chapter 6: Overhead, Costs, and Sampling Given these concerns about memory and power, and the fact that other parts of the tracing pipeline may also introduce additional delay, most tracer libraries buffer spans for at most a few seconds, usually less than a second. Network costs may also be reduced by maintaining long-lived connections to downstream tracer systems, for example by streaming spans to backends, effectively reducing the delay to nothing while also keeping memory impact low. Tracers may compress span data to reduce the network costs, though this is another example of trading computing resources for network ones. Often, spans emitted from a single process will share many operation names, tag keys, and even some tag values. For example, a service will generally only serve a static number of endpoints, which in part determine the unique operation names found in the spans emitted by that process. Often spans will include a tag indicating the language or platform, the host, or the datacenter, all of which will be shared by all spans emitted by that process. If these strings are sent only once as part of a request containing many spans, a signifi‐ cant amount of bandwidth can be saved. Tracers may also use a generic compression technique, such as gzipping the entire span buffer, before sending it over the wire. In all of these cases, tracers may incur slightly more computational overhead in exchange for lowering network usage. Finally, tracer SDKs and agents may also reduce the impact on throughput by only emitting a subset of spans that could be generated. We’ll discuss this as part of sam‐ pling strategies later in this chapter.
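As a sketch of the shared-string idea described above, the following dictionary-encodes a batch of spans so that repeated operation and service names are sent once per batch; the wire format shown is illustrative, not any real tracer's protocol:

// Dictionary-encode a batch of spans: repeated strings are stored once and
// referenced by index.
function encodeBatch(spans) {
  const strings = [];
  const indexOf = new Map();
  const intern = (s) => {
    if (!indexOf.has(s)) {
      indexOf.set(s, strings.length);
      strings.push(s);
    }
    return indexOf.get(s);
  };
  const encoded = spans.map((span) => ({
    op: intern(span.operation),
    svc: intern(span.service),
    start: span.startTime,
    duration: span.durationMs,
  }));
  return { strings, spans: encoded };
}

const batch = encodeBatch([
  { operation: 'GET /checkout', service: 'frontend', startTime: 0, durationMs: 12 },
  { operation: 'GET /checkout', service: 'frontend', startTime: 5, durationMs: 9 },
]);
console.log(JSON.stringify(batch));
// Both spans reference string 0 ('GET /checkout') and string 1 ('frontend') by index.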

Tracing Overhead at Google Google’s distributed tracing system, Dapper, was built to measure latency in a high- volume distributed system. Many first-time tracing users believe that tracing cannot be used in a production system because the overhead on the application will be too high. Despite its enormous scale, Google deployed Dapper as part of every web search request. As described in its technical report,1 it measured the effects on latency and throughput at a variety of different sampling rates. Even when each server is handling tens of thousands of requests per second, Google found that the latency and through‐ put impact of sampling 1 in 16 requests was within the experimental error. Dapper often sampled more aggressively—not because of the impact on application overhead, but because of the infrastructure costs associated with storing these spans. When run‐ ning on hundreds of thousands of servers concurrently, even with 1-for-16 sampling, Dapper generated far too much data to store at a reasonable cost.

1 [Sig10]

Application Overhead | 121 Infrastructure Costs While the effect of tracing on the application itself can be minimized relatively easily, the cost of the network and storage required to collect, store, and eventually analyze traces is a more significant design and engineering challenge. We’ll walk through a simplified—and in many ways, naive—model to help make these costs more concrete. This is not meant as a guide for analyzing the cost of trac‐ ing your application, since it bakes in many assumptions about the size and imple‐ mentation of an application. Hopefully, however, it gives a sense of the relative scale of the different ways that tracing can affect your infrastructure costs. Assuming an individual span (including tags and logs) is 500 bytes in size, we can start to estimate these costs by computing the approximate data rate required to trace your application. You can do so based on the number of end-user requests (that is, the number of interactions that your users have with your application) and the num‐ ber of services in your application (including browser or mobile apps and backend services such as an authentication service, a user database service, or a payment ser‐ vice). For example, if your application serves two thousand end-user requests each second and consists of 20 services, it would generate 20 MB of span data every second or 72 GB every hour. Network Tracing solutions aim to move span data away from applications as quickly as possi‐ ble, both because this helps minimize the impact on the application and because it makes these spans available for analysis (and therefore the source of insights for trac‐ ing users) more quickly. As discussed earlier, tracers can trade off some network costs by incurring additional computation, but there are a number of additional design choices to be made in how these spans are collected. Not all network costs are the same. Network transfer within a datacenter or virtual private cloud (VPC) is typically free, and is usually only limited by the capacity of an individual machine (often around a few GB per second). Sending data out of a VPC, on the other hand, can quickly incur large costs. Sending data within a region or a continent might cost as little as $0.01 per GB, but costs quickly rise when sending data across continents or public networks—by a factor of 10 or even 20. Assuming a cost of $0.10 per GB, a tracer that sends every span over the internet would incur about $173 in network fees per day for our small example application. As a basis for comparison, even a conservative estimate of the infrastructure required to run the application itself is less than this amount. For example, if we assume that each instance of a service can serve 500 requests per second and that every service participates in every request, 80 VM instances are required. At $0.04 per hour (on- demand prices) these instances would only cost about $77 per day. Again, we are

122 | Chapter 6: Overhead, Costs, and Sampling making a number of assumptions about the behavior of an application (and your application’s performance might vary by an order of magnitude), but this example shows that the network costs required to send all spans over the internet are on the same scale as the computing costs required to run an application. This is far beyond what most organizations are willing to spend. Where spans are stored can have a huge impact on cost. Storing them close to the application can reduce these costs, but this can complicate global analysis of traces. Most tracers take a sample of spans to reduce these costs. Storage No matter where spans are stored, tracers must store them somewhere. As a baseline, assume you use a simple block storage system (such as AWS S3 or Google Cloud Storage), and your provider charges $0.02 to store 1 GB for a month. Assuming that storing spans for one month is sufficient, the cost of using this solution for storing spans generated by our example application would be about $35 per day. However, these storage solutions are among the simplest—and cheapest—solutions. Making use of spans requires much more than just storing them: providing function‐ ality to search spans, bulk access for analysis of similar or related spans, and aggregat‐ ing metrics about groups of spans all require some way of indexing spans once they are stored. These indices of course also require storage and other resources, so this estimate must be seen as the lower bound on what a naive implementation would cost. As in the case of networking resources, most tracing solutions implement some form of sampling to reduce storage costs.
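For reference, the back-of-the-envelope model used in the last two sections can be written out directly; all of the constants below are the assumptions stated above, not real pricing:

// The back-of-the-envelope model above, with its assumptions made explicit.
const BYTES_PER_SPAN = 500;
const REQUESTS_PER_SECOND = 2000;
const SERVICES = 20;                       // roughly one span per service per request
const EGRESS_COST_PER_GB = 0.10;
const STORAGE_COST_PER_GB_MONTH = 0.02;

const bytesPerSecond = BYTES_PER_SPAN * REQUESTS_PER_SECOND * SERVICES; // 20 MB/s
const gbPerDay = (bytesPerSecond * 86400) / 1e9;                        // ~1,728 GB/day
const networkCostPerDay = gbPerDay * EGRESS_COST_PER_GB;                // ~$173
const storageCostPerDay = gbPerDay * STORAGE_COST_PER_GB_MONTH;         // ~$35 (kept one month)

console.log({ gbPerDay, networkCostPerDay, storageCostPerDay });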

Tracing Costs at Google Most of the sampling implemented with Dapper was aimed at reducing network and storage costs. The default sampling rate in the process itself was 1 in 1024, and spans were typically reduced by another factor of 10 before traces were stored durably. Even at these sampling rates, Dapper stored traces in regional repositories to reduce network costs. At the time when Dapper was originally deployed, most requests were handled by a collection of services running within a single region, so all of the spans for a given trace would be generated in a single region. The team was frugal with which attributes could be used to search for spans. While they originally implemented two different indices (one for service and one for host), they found that usage pat‐ terns didn’t justify the cost of maintaining these indices separately. They later com‐ bined these two into a single composite index to bring costs in line with value.2

2 [Sig10]

Infrastructure Costs | 123 Sampling So far in this chapter, we’ve discussed the many different costs that a tracing solution might incur. We concluded that if every span of every request was captured, stored, and indexed, the cost of tracing could be greater than the cost of running the applica‐ tion itself. It is paramount, therefore, that a tracing solution reduce the amount of telemetry data in some way. Tracers use several strategies to do so, but the most widely used and effective is to collect and process a subset of spans by sampling them. Almost all sig‐ nificant application behavior will surface in more than one request, so as long as the tracing solution can capture at least one example of each interesting behavior, users can use those examples to find and address bugs and performance problems. Minimum Requirements The first and most important consideration when determining how to sample spans is to make sure that the tracing solution is building complete traces. By complete, we mean that if a tracer chooses to capture a given request, it must collect every span from that request. Failing to do so means that users will be left with as many ques‐ tions as they started with. The trace in Figure 6-1 shows three spans as part of an example of an incomplete trace. These three spans are labeled A (the root of the trace), C (missing parent span B), and E (missing parent span D). From these, the user can possibly infer that the latency is due to C or E, but the three configurations in the figure show there are sev‐ eral possibilities for why C and E were invoked. Their closest common ancestor might be A, B, or D, and each of these possibilities might lead to users taking different actions to understand the causes of slowness. Furthermore, additional spans as chil‐ dren of C or E that did not appear in the trace might describe the ultimate cause of the latency. Sampling consistently across a distributed system is difficult because, like in many challenges in a distributed system, it requires coordination. Either sampling decisions must be made globally or the tracer component associated with each service must come to the same sampling decision, often by sharing some information about the request. This must be done in an extremely efficient way since it must be done for every request handled by the application.

124 | Chapter 6: Overhead, Costs, and Sampling Figure 6-1. An incomplete trace.

Because the cost of sending spans over the network between datacenters is high, sam‐ pling decisions must be made in—or very close to—the application: at the very least, the decision must be made before spans leave the datacenter or VPC. This poses an especially large problem for applications that span (no pun intended!) regions or even cloud providers.

Sampling | 125 Strategies As sampling is nearly ubiquitous in tracing solutions, it can be useful to classify each solution by the methods and kinds of data it uses to sample traces. Which traces are sampled affects which kinds of analysis can be performed and of course the results of those analyses. Consider which sorts of use cases are most important to you as part of choosing a tracing solution and sampling approach.

Up-front sampling A simple way to determine which spans to keep is to make sampling decisions before any spans have been generated for a given request. This is often called up-front or head-based sampling, as the decision is made at the beginning or “head” of the request. In some cases, it is referred to as unbiased sampling when sampling decisions are made without even looking at the request. In its original instantiation, Dapper followed this strategy by flipping a coin at the beginning of each request and passing the result of that coin flip to every other ser‐ vice that participated in the request. With this strategy, each service knows at the moment it begins processing a request whether to capture spans as part of handling that request. Solutions can vary their sampling rates based on some features of a request. For example, the tracer for a lower-throughput service or endpoint might be configured to use a higher sampling rate. In this case, as a user of a tracing solution, you are essentially writing a set of rules that govern how data is sampled. Sampling rates can also be varied dynamically to achieve a desired output rate: this enables you to budget for a given amount of infrastructure (mostly network and stor‐ age) and to make the most of that infrastructure, regardless of the actual throughput of the application. Up-front sampling is a strong approach for use cases that only consider traces in aggregate (especially high-throughput services) and for which there is little diversity within the population of traces. Up-front sampling was chosen as part of the imple‐ mentation of Dapper as Google search requests are both plentiful and relatively homogeneous. While up-front sampling is both simple and efficient, it suffers from two deficiencies. First, requiring that sampling decisions are passed to every service that participates in the request means there is a high degree of uniformity in the application code. It also requires that the root span of each request can be determined with high confidence. If the result of each coin flip is not propagated properly or if it is inadvertently ignored, incomplete traces will result. At Google, a unified codebase (with a small set of lan‐ guages and a single RPC framework) was sufficient. In addition, Google also runs a

126 | Chapter 6: Overhead, Costs, and Sampling standard set of frontend servers, making it easy to determine where traces should start for nearly all requests. Second, up-front sampling means that sampling decisions are made with no informa‐ tion about what will happen in the course of handling the request. The parameters of the request itself can be used to inform that decision, but important signals such as the duration of the request or even if the request succeeded are, obviously, not known until after the request has completed. This can be mitigated to some extent by per‐ forming the sampling in two stages. A first pass of sampling is made within each ser‐ vice instance to reduce the amount of data to a more manageable level. This partially sampled data can then be forwarded to a centralized solution which—with the advan‐ tage of a now complete picture of each request and its response, including latency and response code—can make a second sampling decision. However, if any sampling occurs within the service instances themselves (that is, in the first pass), requests with interesting behavior that occur infrequently will be lost. In the next section, we will consider a solution where sampling decisions are made only in a central location.
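A minimal sketch of up-front sampling might look like the following, assuming the decision is carried in a propagation header; the header name and sampling rate are illustrative:

// Up-front sampling: the decision is made once, at the root of the request,
// and then propagated so every service comes to the same answer.
const SAMPLE_RATE = 1 / 16;
const SAMPLED_HEADER = 'x-trace-sampled';   // hypothetical propagation header

function sampleDecisionForIncomingRequest(headers) {
  if (SAMPLED_HEADER in headers) {
    // Not the root: honor the decision made upstream.
    return headers[SAMPLED_HEADER] === '1';
  }
  // Root of the request: flip the coin here, exactly once.
  return Math.random() < SAMPLE_RATE;
}

function outgoingHeaders(sampled) {
  // Every downstream call must carry the decision, or traces will be incomplete.
  return { [SAMPLED_HEADER]: sampled ? '1' : '0' };
}

const sampled = sampleDecisionForIncomingRequest({});   // root request
console.log(sampled, outgoingHeaders(sampled));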

Response-based sampling To address many of the shortcomings of up-front sampling, many tracers make each sampling decision based on features of the response or on information derived from the request as a whole, including the response. For example, part of the response may indicate that the request failed. This failure may be used as a trigger to sample this request (since failed requests may be especially valuable in tracking down issues). This strategy is sometimes called tail-based sampling since the sampling decision is made at the end or the “tail” of the request.3 In addition to errors, response-based sampling may also use the duration of the request as part of making sampling decisions. For example, tracers may set a thres‐ hold and keep all (or a significant portion of) traces whose duration exceed that threshold. Response sizes or other application-specific features of responses may also be used as part of making those decisions. Response-based sampling is significantly harder to implement than up-front sam‐ pling. Unlike up-front sampling, whether a span is going to be sampled may not be known until seconds (or more) after the part of the request corresponding to that span has finished. For example, a deeply nested span may be ready seconds before the root span of its trace has finished—and seconds before a sampling decision for that trace can be made. In that case, that child span must be temporarily stored in some

3 “Tail-based” can be an especially confusing term for those with a statistics background, as “tail” can refer to the narrow part of an asymmetric distribution. In that case, “tail-based” would mean sampling requests that have a higher latency compared with other requests.

Sampling | 127 way. This can consume resources from the application (if it’s stored locally) or addi‐ tional network bandwidth (if it’s not).
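A response-based decision, reduced to a sketch, might look like the following once an entire trace has been assembled; the thresholds and trace shape are illustrative:

// Tail-based decision, made only after the whole request is known.
function shouldSampleTrace(trace, latencyThresholdMs = 1000, baselineRate = 0.001) {
  const rootDurationMs = trace.root.endTime - trace.root.startTime;
  if (trace.spans.some((span) => span.error)) return true;   // keep all errors
  if (rootDurationMs > latencyThresholdMs) return true;      // keep slow requests
  return Math.random() < baselineRate;                       // a few healthy baselines
}

const trace = {
  root: { startTime: 0, endTime: 250 },
  spans: [{ error: false }, { error: true }],
};
console.log(shouldSampleTrace(trace)); // => true (contains an error)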

Centralized sampling decisions Since the purpose of tracing is to construct a global view of your application, it shouldn’t be surprising that sampling techniques require global knowledge of your application. Even passing a single bit from service to service, as in the case of up-front sampling, is a form of global coordination. It follows that as more information is cen‐ tralized, more sophisticated sampling methods can be applied. A naive approach where we first centralize all spans and then perform any sampling would, as discussed, be quite expensive. However, we can achieve similar results by first centralizing only portions of spans, making sampling decisions using that central‐ ized data, and then communicating those sampling decisions back out so that they can be implemented in a distributed fashion. The first step in implementing this hybrid approach is to make sure that the spans which are selected are still available at the time that sampling decisions are made. This requires that spans are buffered for at least as long as the duration of the longest request your application is expected to handle. For example, if your application han‐ dles interactive search queries that may take as long as 30 seconds, then this buffer must be large enough to hold at least 30 seconds’ worth of spans (and probably more) to account for the time it takes to make and communicate the sampling decisions themselves. In other applications, a buffer of five or even 10 minutes might be more appropriate. For the purposes of illustration, assume that the tracer will buffer spans for one minute. For example, consider two spans that are part of the same trace: span A represents handling an HTTP request and span B represents a database query performed as part of handling that request. In addition, assume that significant additional computation occurs after that database query. Figure 6-2 shows what this trace might look like.

Figure 6-2. A trace with a timeline of buffer events.

128 | Chapter 6: Overhead, Costs, and Sampling In Figure 6-2, span A doesn’t finish until 10 seconds after span B finishes. If a prop‐ erty of span A (for example, the response code of the HTTP request) is used to select A as part of the sample, that decision cannot be made until at least 10 seconds after span B finishes. It follows that span B must be kept in the buffer for at least 10 seconds. Typically, such a buffer is implemented as an in-memory cache in which spans are appended to the buffer as they arrive. Memory is fast and cheap enough to store these spans for a short period of time. Once the size of the buffer reaches its limit, the old‐ est span is overwritten each time a new span is added. Though these spans could be written to local disk, the additional value of storing them durably is relatively low, and the cost of doing so is usually large enough to make this approach prohibitively expensive. As the number of spans that are generated in one minute can be quite large, it should be clear that we cannot implement this buffer inside of the service process: the amount of memory required would be large enough to have a significant impact on the service itself. Sometimes the buffer is implemented as a sidecar process or, more commonly, in a separate container or on a dedicated virtual machine (shown in Figure 6-1 as a “collector”). As discussed earlier, for high-throughput applications, this buffer would also need to be located close to the application—within the same datacenter or VPC—to keep network costs under control. A key part of this approach is to make sure that all spans that are part of a trace can be easily identified. In fact, this was already a requirement, since the spans that make up a trace will arrive from many different sources and as they are collected and ana‐ lyzed, spans must be sorted into their respective traces. As part of propagating con‐ text between processes, each span must have a TraceID or other means to identify which trace it belongs to. Tracing solutions can use TraceIDs to indicate which spans should be sampled. Since TraceIDs are much smaller than the spans themselves, and because the number of sampled spans is typically much smaller than the total number of spans, this technique can dramatically lower infrastructure costs. Figure 6-3 shows how part of the flow of spans from collectors to the central analysis and storage com‐ ponents can be replaced with TraceIDs. If some or all trace IDs are forwarded to the central analysis component, these can be sent back to other collectors to ensure that sampling decisions are made consistently.
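A collector-side buffer of this kind might be sketched as follows, keyed by TraceID and bounded in size, with selected traces forwarded only when a central decision arrives; the sizes and span shape are illustrative:

// Collector-side buffer keyed by TraceID. Spans wait here until a central
// sampling decision arrives, or until they are evicted to make room.
const MAX_BUFFERED_SPANS = 100000;
const buffer = new Map();        // traceId -> array of spans (Map preserves insertion order)
let bufferedSpanCount = 0;

function bufferSpan(span) {
  if (!buffer.has(span.traceId)) buffer.set(span.traceId, []);
  buffer.get(span.traceId).push(span);
  bufferedSpanCount += 1;
  // Evict whole traces, oldest first, when the buffer is full.
  while (bufferedSpanCount > MAX_BUFFERED_SPANS) {
    const oldestTraceId = buffer.keys().next().value;
    bufferedSpanCount -= buffer.get(oldestTraceId).length;
    buffer.delete(oldestTraceId);
  }
}

// Called when the central component reports which TraceIDs were selected.
function onSamplingDecision(sampledTraceIds, forwardSpans) {
  for (const traceId of sampledTraceIds) {
    const spans = buffer.get(traceId);
    if (spans) {
      forwardSpans(spans);       // send only the selected traces upstream
      bufferedSpanCount -= spans.length;
      buffer.delete(traceId);
    }
  }
}

bufferSpan({ traceId: 'abc', name: 'HTTP GET', startTime: 0, endTime: 30 });
onSamplingDecision(['abc'], (spans) => console.log('forwarding', spans.length, 'spans'));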

Sampling | 129 Figure 6-3. An updated collection architecture showing propagation of TraceIDs. Selecting Traces Once a tracing solution offers a way to make sampling decisions based on a wide range of request characteristics, we are left with a choice: which characteristics should we use? Since the decision to sample a trace begins with the decision to sample a span (which will eventually become part of that trace), we should choose span characteris‐ tics that lead to valuable traces. As we’ve mentioned, one approach is to select traces which indicate problems—those traces that are slow or have errors—since these often offer the most actionable infor‐ mation. However, there are a few problems with this approach. First, it is also useful to have healthy requests, which can serve as a baseline for understanding problematic requests. Second, what constitutes “slow” may depend on the operation that the span represents: we might expect certain more expensive operations to generally be slower than others, meaning that fast operations will rarely or never be sampled. Third, though a given span may be tagged with an error, an operation further up the stack may recover from that error, meaning that the trace as a whole is not that interesting. Rather than selecting spans which are slow relative to all other spans, many tracing solutions focus on spans which are slow relative to other spans for the same service and operation. Similarly, spans that indicate errors at the root of a trace or as the top- level span for a given service are more likely to be of value. As such, tracing solutions may build models of performance for each service and operation, and then select spans (and build traces) relative to these models. In organizations with a few services per team and at most dozens of operations per service, this approach often provides a lot of value to users.

130 | Chapter 6: Overhead, Costs, and Sampling Tracing solutions that take this approach, however, must put some safety mechanisms in place to handle intentional or unintentional misuses of these fields. For example, in our experience building and managing tracing solutions both in-house and as a com‐ mercial product, we’ve seen many examples of other kinds of data leaking into the operation field. In one case, a developer set the operation to the URL of the request, which included a user ID. This led to hundreds of thousands of “operations” and an attempt by the tracing system to capture representative slow traces for every one of them. Obviously, this was both expensive and offered little value to users. For implementers and power users of metrics tools, this will be a familiar problem: maintaining state for each element of a high-cardinality set is expensive. As a result, tracing solutions that attempt to find interesting examples for each service and opera‐ tion should include a safeguard such as choosing only the most commonly occurring operations, up to some limit. In fact, once such a safeguard is implemented, it can be applied to arbitrary tags, enabling the tracing solution to capture representative sam‐ ples for a wide range of traces. Once you’ve chosen what sorts of characteristics make a span (and therefore, a trace) interesting, you’re left only with the choice of how frequently to sample them. This choice can largely be driven by costs: set a budget for infrastructure costs, then con‐ figure your tracing solution to periodically sample as many traces as can fit into that budget. In addition, you may also consider selecting traces based on some external events, such as when a new release is pushed out or when some production configuration is changed. Service failures are often caused by changes to one or more parts of an application, and ensuring that you have traces to help explain what’s happening dur‐ ing these failures can be invaluable. The moment when a failure is known to have occurred—for example, when an alert fires—is also an excellent opportunity to cap‐ ture a number of traces. As with all biased sampling, it’s important that tracing tools account for bias when computing statistics about latency, error rates, or other aspects of application perfor‐ mance. For example, if your tracing solution biases toward slower requests, it’s critical that these requests get less weight if they are used to compute the average latency for your service. Off-the-Shelf ETL Solutions Can off-the-shelf extract-transform-load (ETL) tools be used to implement the inges‐ tion portion of a tracing solution? While it is certainly possible to do so, most generic ETL tools are built to process data that is much more homogeneous. As we’ve dis‐ cussed, not all spans provide the same value: spans which are slow or have errors may provide important clues to improving performance. Likewise, ordinary spans that are

Off-the-Shelf ETL Solutions | 131 part of the same trace as a slow span (or a span that failed) may also provide helpful context. In addition, the true value of a span might not be known until sometime after it is collected by a tracing solution, seconds or even minutes later. Many ETL tools expect that each piece of data can be processed independently or that collections can be processed uniformly, and therefore are not a good fit for distributed tracing. Unsurprisingly, generic ETL solutions also offer less flexibility for trading off differ‐ ent sorts of resources (for example, network versus storage). In our experience, using a generic solution for the collection typically will require an order of magnitude more infrastructure resources than a solution built specifically for tracing. On the other hand, once spans have been sampled and grouped together into traces, there are many opportunities to use off-the-shelf tools to analyze and store them. Though detailing it is beyond the scope of this book, many of the use cases described in Chapters 7 and 8 can be implemented using off-the-shelf tools.

Other Approaches to Reducing Data Volume (and Therefore Costs) One of the main challenges in building a tracing solution is managing costs: sifting through spans to find information that will be valuable to users without doubling your storage or network bills. Though we have focused on sampling as the main tech‐ nique for reducing data volume—and therefore managing costs—there are other techniques that tracing solutions can also take advantage of. It might typically be considered out of scope for a tracing solution, but gathering sta‐ tistics about spans is a powerful way to derive some information from a set of spans without recording every detail of that set or even of a single span. The rate that a given operation occurs at, the frequency of specific tag values, or a histogram of latency for a class of spans can all be represented efficiently (often requiring less space than a single span) and can offer powerful insights into what’s happening in an application. Similarly, once a trace has been sampled, you need not store every detail of that trace. Often just knowing which operations were on the critical path (and how much each contributed) or the presence of certain interesting tags can be valuable. If this infor‐ mation can be extracted before traces are stored durably—and the trace itself dis‐ carded—storage costs can be significantly reduced.

Summary Like other observability tools, distributed tracing tools must minimize the impact they have on application performance. Fortunately, there are straightforward ways of doing this: even at scale, tracing can be implemented in a way that has very little over‐ head on the application itself, meaning that it’s safe for you to use tracing in your

132 | Chapter 6: Overhead, Costs, and Sampling

production environments. This is important because it is difficult to reproduce many failures and other issues outside of production in distributed systems—and increasingly, in almost any modern application. Using tracing in production will enable you to find these issues much more quickly.

While the impact on the application is small, the cost of processing and storing trace data can be large. The value of traces is often in the details they provide, whether following requests across services or in the tags and events associated with spans. A trace can become much larger than the request and response that it describes, so storing every trace would be expensive from an infrastructure point of view.

However, not all trace data offers the same value. Traces representing slow or failed requests may offer a lot more value. Sampling a subset of traces (and discarding the rest) is a commonly used technique to make sure you are collecting the right traces while keeping costs under control.

As either an implementer or a user of a tracing solution (or both), you should be aware of different sampling techniques, including when sampling decisions are made and, most importantly, what information is used to inform those decisions. Different techniques will offer different performance trade-offs and provide better support for different sets of use cases.

Since tracing takes a request-centric view of observability, it offers an opportunity to serve as the backbone for other types of telemetry, including metrics and logs. Tracing can help ensure you are maximizing the value of all of your observability tools by putting telemetry data in context. In Chapter 7, we’ll focus on what you should expect from observability tools in general and how tracing relates to—and can amplify the benefits of—these other tools.

Summary | 133

CHAPTER 7 A New Observability Scorecard

Engineers at organizations like Google and Twitter originally promoted observability as a method not just for monitoring their production systems but for being able to understand the behavior of those systems using a relatively small number of signals. Borrowed from control theory, the term observability formally means that the inter‐ nal states of a system can be inferred from its external outputs. This became neces‐ sary within these organizations as the complexity of their systems grew so large—and the number of people responsible for managing them stayed relatively small—that they needed a way to simplify the problem space. In addition, as part of site reliability engineering (SRE) organizations, many of the engineers that were responsible for observability were not working on the software directly, but on the infrastructure responsible for operating it and making it reliable. As such, a model for understand‐ ing software performance from a set of external signals was appealing and, ultimately, necessary. Despite a formal definition, observability continues to elude the understanding of many practitioners. For many, the term is equated with the tools used to observe soft‐ ware systems: metrics, logging, and (as will come as no surprise to the reader) dis‐ tributed tracing. These three tools became known as the “three pillars of observability,” each a necessary part of understanding system behavior. Though often implemented as separate tools, they are usually used in conjunction as part of an observability platform. Metrics, logging, and tracing tools are built around three different corresponding data sources, and are often compared based on what can be effectively or efficiently derived from each of those data sources. In the end, however, users are more interes‐ ted in what they can learn from an observability tool than where the data came from. Users turn to observability tools to understand the relationships between causes and effects in the distributed systems. That is, they are usually more interested in what

135 can be accomplished with a tool than how it works or where the data comes from. In this chapter, we’ll look at the three pillars of observability in turn, consider their limi‐ tations, and build a framework for assessing observability tools in general. The Three Pillars Defined Before examining the trade-offs and alternatives, it’s useful to understand these three observability tools as they are most often deployed today. Metrics Metrics, broadly defined, are collections of statistics about services that enable devel‐ opers and operators to understand the gross behavior of those services and how they are being used. Examples include request rate, average duration, average size, queue size, number of requests, number of errors, and number of active users. These values are usually captured as time series, so that operators can see and under‐ stand changes to metrics over time. Changes can then be correlated to other coinci‐ dent events, which in turn can indicate what corrective actions to take. In today’s production environments, metrics are typically aggregated every minute or even six to twelve times per minute. To enable operators to react quickly enough to maintain three—or even four or more—“nines” of uptime, metrics must be aggrega‐ ted and visualized within at most one minute but ideally even more quickly.

Metrics and Service Level Indicators

The authors of Site Reliability Engineering describe how to measure the level or quality of service provided by a service or application in terms of service level indicators (SLIs).1 As a quantitative measure of service levels, SLIs are a subset of metrics and, like other metrics, will typically be measured as a time series. Examples of SLIs include some of the examples given in this chapter, including request duration (or latency) and error rate.

Of course, there are many other types of metrics, and often the same tools will be used to measure both SLIs and non-SLI metrics. This can lead to some confusion on the part of users of these tools: while SLIs should be measured and compared with your service level objectives (SLOs), other metrics are not things that you should be optimizing (unless doing so helps you meet one of your SLOs). As such, we encourage readers to label dashboards and metrics clearly as SLIs in cases where they are, in fact, indicators of service levels.

1 [Bey16]

To help developers and operators understand more about changes to metrics, developers add labels as they record metrics. Typically in the form of a key-value pair, each label describes the circumstances of the change to the metric more specifically. For example, request latency may be labeled with the version of the service handling the request, the host on which the request is handled, or the datacenter in which the host was running. Using labels, a metric such as latency can be broken down into submetrics, one for (say) each host. This can enable an operator to pinpoint problems, for example, by establishing that an overloaded host is responsible for slow requests.
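As a rough illustration of recording labeled metrics, the following sketch uses the Prometheus Go client (an assumption on our part; the discussion here is not tied to any particular metrics library). The metric name, label names, and label values are hypothetical placeholders.

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestLatency is a histogram labeled by service version and host, so that it
// can later be broken down into submetrics along either dimension.
var requestLatency = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency, labeled by version and host.",
        Buckets: prometheus.DefBuckets,
    },
    []string{"version", "host"},
)

func main() {
    prometheus.MustRegister(requestLatency)

    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // In practice, label values would come from build info and the environment.
        timer := prometheus.NewTimer(requestLatency.WithLabelValues("v1.4.2", "host-17"))
        defer timer.ObserveDuration()
        w.Write([]byte("hello"))
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}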

Types of metrics: Counters and gauges Most metrics fall into two categories: counters and gauges. Counters are values which describe, well, the number of events of a particular type that have occurred. Examples include the number of requests and the number of bytes transferred. From a develo‐ per’s point of view, the primary operation associated with a counter is to increment it. Counters can be manipulated in several ways. For example, it’s easy to aggregate changes to a counter from multiple sources, simply by adding them together. From a counter, it’s also easy to compute a rate of change by considering the values of a counter at different points in time. For example, though recorded as a counter, the number of requests is usually visualized as a rate, such as requests per minute or requests per second. Gauges are metrics that describe the state of a part of your software as a numeric value and at a particular moment in time. Examples include the duration of a request, the size of a queue, or the current number of active users. From a developer’s point of view, the primary operation associated with a gauge is to describe its current value (or set it). Gauges are somewhat more challenging than counters to aggregate, as doing so nec‐ essarily discards information. When combining gauge values from several sources, metrics systems capture statistics about them, including average, minimum, and maximum. Gauges’ values may also be combined using histograms. The range of values can be divided into discrete buckets, and the number of instances in each bucket can be recorded. This effectively turns gauges into counters (or, more precisely, into a set of counters), which can then be easily combined by simply adding together the counts for each bucket. Histograms are useful in that other statistics may be derived from them. For example, using bucket counts, a metrics tool can estimate the 99th- percentile value of a gauge.
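To make the bucketing idea concrete, here is a small, self-contained sketch (not tied to any particular metrics library) that turns raw gauge observations into per-bucket counters and estimates a high percentile from them. The bucket boundaries and observed values are arbitrary examples.

package main

import (
    "fmt"
    "sort"
)

// Histogram counts observations per bucket, turning gauge-style values into a
// set of counters that can be merged across hosts simply by adding them together.
type Histogram struct {
    Bounds []float64 // upper bounds, sorted ascending (e.g., latency in ms)
    Counts []int     // Counts[i] holds observations that fall into bucket i
    Total  int
}

func NewHistogram(bounds []float64) *Histogram {
    sort.Float64s(bounds)
    return &Histogram{Bounds: bounds, Counts: make([]int, len(bounds))}
}

// Observe places a value into the first bucket whose upper bound contains it.
func (h *Histogram) Observe(v float64) {
    h.Total++
    for i, b := range h.Bounds {
        if v <= b {
            h.Counts[i]++
            return
        }
    }
    // Real implementations keep an explicit overflow bucket for larger values.
}

// Quantile estimates the value below which a fraction q of observations fall by
// scanning cumulative bucket counts; the estimate is that bucket's upper bound.
func (h *Histogram) Quantile(q float64) float64 {
    target := q * float64(h.Total)
    cumulative := 0
    for i, c := range h.Counts {
        cumulative += c
        if float64(cumulative) >= target {
            return h.Bounds[i]
        }
    }
    return h.Bounds[len(h.Bounds)-1]
}

func main() {
    h := NewHistogram([]float64{50, 100, 250, 500, 1000, 5000})
    for _, ms := range []float64{42, 61, 88, 95, 120, 480, 2300} {
        h.Observe(ms)
    }
    fmt.Println("estimated 99th percentile (ms):", h.Quantile(0.99))
}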

Metrics tools
The challenges of implementing metrics tools stem largely from how to aggregate data. What is the window across which values are aggregated? How are the windows from different sources aligned?

Logging
Logging can describe any activity where systems capture events as text or structured data and either print these events out or store them for future use and analysis. In the context of observability, logging usually means centralized logging, where event data from each service instance is transmitted to a single system so that it can be analyzed and searched uniformly.

To support this collection and analysis, log entries must have a timestamp that indicates when the event occurred. Aside from that, however, there is usually little standard structure to logs. While there are standard schemas for some specific use cases (for example, web server logs), as an observability tool, log structure will depend on how it’s used by the application and how logs are created by developers. Example 7-1 shows an example of logging in Java (using Log4j).

Example 7-1. Java logging example

import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.LogManager;

...
Logger LOGGER = LogManager.getLogger();
LOGGER.info("Hello, world!");

Similarly, Example 7-2 shows an example of logging in Go.

Example 7-2. Go logging example

import (
    "bytes"
    "log"
)

...
var (
    buf    bytes.Buffer
    logger = log.New(&buf, "logger: ", log.Lshortfile)
)
logger.Print("Hello, world!")

Logging conventions Partly due to the volume of data produced through logging, there are a few conven‐ tions in common use. First, most logging systems support some notion of a “level” or “severity.” Typical examples of levels include “informational,” “warning,” and “error.”

138 | Chapter 7: A New Observability Scorecard These can be used as part of an automatic or manual filtering process. For example, a logging tool may only keep info-level logs for a few hours, but might store errors forever. A second common convention is to add some correlation ID or other means to create links between related log entries. In a distributed system, this is often used to trace requests as they pass from one service to another. For example, all logs related to a single end-user request may include a unique identifier for that request. Defined in the broadest sense, even a summary of logging tools would be far beyond the scope of this section or chapter: logging is used for a wide range of use cases, such as revenue tracking, security auditing, and business metric tracking. To give a sense of what these tools look like as part of an observability suite, we’ll briefly describe a common implementation using Elasticsearch, Logstash, and Kibana (often called an “ELK stack”). In an ELK stack, log data is generated by many different sources—including server logs, event APIs and streaming publish-subscribe (pub-sub) systems, to name a few— and fed into Logstash. Logstash transforms these log entries by providing additional structure and normalizing data across different sources. The results are then forwar‐ ded to Elasticsearch where they are indexed so that they can be searched easily. Finally, Kibana is an analysis and visualization tool used to build dashboards based on the data stored in Elasticsearch. The primary challenge of implementing a logging tool is storing the data in a way that is cost-effective and will allow the data to be searched efficiently. Most log entries will be of little value, but finding the log which contains a rare error or a suspicious trans‐ action will have tremendous value. A specific type of search is to find all of the logs associated with a single request. Distributed Tracing After the previous chapters, hopefully you have some sense of the purpose, scope, and implementation of distributed tracing. For the purposes of this chapter, it’s useful to define distributed tracing as it was implemented as part of Google’s Dapper project.2 Using this definition, distributed tracing consists of collecting request data from the application and then analyzing and visualizing this data as traces. Tracing data, in the form of spans, must be collected from the application, transmit‐ ted, and stored in such a way that complete requests can be reconstructed. In Chap‐ ters 5 and 6, we discussed the high-level architecture of distributed tracing implementations and some challenges of these implementations. As with centralized

2 [Sig10]

logging systems, many of these challenges are associated with transmitting and storing large amounts of data. In some ways, tracing is a specialized form of logging that attempts to address these costs by tightly coupling the collection process to application request handling. As tracing tools are often focused on use cases around performance analysis, they build in some common fields that describe the timing of application events (for example, request duration). As visualization tools, distributed tracing tools often show requests using a flame graph (or, when presented upside down, an icicle graph) or a tree showing the timing relationships between spans.

The current practice of distributed tracing as part of distributed systems software development stems largely from the Dapper project at Google and the subsequent OpenZipkin and Jaeger projects (from Twitter and Uber, respectively). Many other organizations have deployed the open source tools OpenZipkin and Jaeger. All of these tools combine some SDK or agents with a collection pipeline and a storage system (usually off-the-shelf) to store and process the data. In the case of Dapper, this storage system was BigTable, while OpenZipkin and Jaeger can be deployed using Cassandra, Elasticsearch, or even (for OpenZipkin) MySQL.

Though not billed as “distributed tracing,” many application performance management (APM) tools provide functionality that overlaps with distributed tracing and may include traces as part of that. However, because they take a simpler approach to data collection, they often fail to provide good results for large distributed systems. In particular, they may not be able to pass through context or may need to use a very small sample to avoid adversely affecting performance (leading to broken traces or other missing data).

The challenges of implementing and deploying a tracing solution depend largely on the use cases you are considering. As Dapper was focused primarily on long-term performance optimization, the primary challenges were related to managing the costs associated with transmitting and storing traces, largely due to Google’s scale. In organizations with more heterogeneous development environments, just getting span data from the application in a way that can be used to generate complete traces can be a significant challenge.

Fatal Flaws of the Three Pillars

As we’ve considered each of these types of tools in turn, we’ve discussed some challenges in implementing each in ways that are effective and efficient. However, it’s useful to consider the challenges and limitations of these tools in a more systematic and holistic way. Doing so will enable us to better understand the trade-offs in building, operating, and using these tools. It will also challenge the notion that they are really three separate and independently defined categories of tools.

Design Goals

When designing an observability solution as a whole, there are three areas that we should consider. These areas are broader than any specific use cases and instead focus on how value is derived from a solution and how that value relates to cost. For any solution, we should be able to assess how it performs in each of these areas:

• Does it account for every transaction? If so, then the impact of every transaction can be measured using the solution.
• Is it immune to cardinality issues? If so, the solution allows users to analyze arbitrary subsets of transactions.
• Does its cost grow proportionally with business value? If so, as the volume of business grows, the cost of the observability solution grows proportionally.

Accounting for every transaction

An observability solution accounts for all of the data if even the rarest events can be observed. This includes, for example, infrequent errors or the behavior of the smallest of customer segments. Solutions that sample transactions may miss these rare but still valuable events.

Immunity from cardinality issues

An observability solution is immune to cardinality issues if users can ask questions about and compare behaviors for arbitrary subsets of the data. Cardinality means the number of different elements in a set; in the context of observability, it refers to the number of different labels (or equivalently, tags) present on data points. Cardinality issues are the challenges of managing and querying data with many different labels. For example, can a user compare performance data between service instances running on two different hosts? When there are dozens of hosts? Hundreds? Thousands? Can performance data be further broken down by software version? Can we compare performance across customer segments or even among individual customers? Solutions that aggregate data before these questions are asked may miss opportunities to answer them; if the data is not aggregated, it may incur high resource costs.

Cost growing proportionally with business value

An observability solution’s cost grows proportionally with business value if its cost per transaction stays constant even as the number of transactions increases. In the case of modern distributed systems, including microservice- or serverless-based architectures, developers may add new observability data sources as they add new services or functions; each of these will create more data and therefore cost more,

even as the number of transactions stays fixed. However, in doing so, the observability solution now consumes a larger portion of the value of each such transaction.

Assessing the Three Pillars

With these design goals in hand, we can succinctly evaluate each of the three pillars of observability, as shown in Table 7-1.

Table 7-1. Fatal flaws of the three pillars, summarized

                                 Metrics   Logs   Distributed tracing
Accounts for every transaction   ✓         ✓      -
Immune to cardinality issues     -         ✓      ✓
Cost grows proportionally        ✓         -      ✓

Fatal flaws in metrics It is straightforward to build a metrics tool that accounts for every transaction: statis‐ tics like count, mean, and standard deviation can be easily computed in a distributed fashion. Moreover, the cost of doing so is modest since the amount of data that must be transmitted and stored is small and, in fact, constant with respect to the number of transactions. However, the cost of processing and storing metrics increases with the cardinality of the dataset. That is, as the number of different labels grows, and more importantly, as the number of combinations of different labels grows, the cost of managing these sta‐ tistics increases significantly: the number of values that must be maintained grows exponentially with the number of different labels. For example, suppose that a metrics tool is capturing request counts from five differ‐ ent hosts, and that each request is labeled with all of the following:

• Its hostname
• The response class (no error, not authorized, bad request, server unavailable, or internal error)
• The client host from which the request was issued (say that there were also five)

Using these labels, metrics can then be used to answer questions such as:

• Which host served the most requests?
• Which host served the most internal errors?
• Which client host made the most bad requests?

142 | Chapter 7: A New Observability Scorecard To answer all of these (and other arbitrary) questions about request counts, a metrics tool must aggregate 5 × 5 × 5 = 125 different counters. This small example might not seem prohibitively costly, but in real-world use cases, there might be dozens of differ‐ ent dimensions across which a user might want to examine behaviors and hundreds or thousands of different values for each of these dimensions. Tracking millions or billions of different combinations is prohibitively costly. As a result, most metrics sol‐ utions limit the number of different label combinations that you can track. While these solutions can be powerful tools for understanding when something has gone wrong, they are not as useful for understanding why something has gone wrong.
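To see how quickly label combinations multiply, the arithmetic above generalizes: the number of counters a metrics tool must maintain is the product of the cardinalities of all of its labels. The short sketch below uses hypothetical label counts to show the growth.

package main

import "fmt"

// seriesCount multiplies label cardinalities to give the number of distinct
// counters a metrics backend would need to maintain.
func seriesCount(cardinalities ...int) int {
    n := 1
    for _, c := range cardinalities {
        n *= c
    }
    return n
}

func main() {
    // The small example from the text: 5 hosts x 5 response classes x 5 client hosts.
    fmt.Println(seriesCount(5, 5, 5)) // 125

    // A larger, hypothetical deployment: 500 hosts, 20 versions, 8 response
    // classes, and 10,000 customers.
    fmt.Println(seriesCount(500, 20, 8, 10000)) // 800000000
}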

Fatal flaw in logging Centralized logging solutions account for every transaction by definition: they record every event as a log entry. Moreover, they are generally immune from cardinality issues as the cost of storing each new event is proportional only to the size of that event (and not the number or content of previous events). Unlike metrics, there is no hidden cost in adding labels, tags, or other structure to individual logs. While there is some cost in maintaining indices of log data (to enable users to find relevant entries), search engines such as Google have developed a number of techniques for searching extremely large datasets efficiently using arbitrary queries. Many of these techniques can be applied to log data. However, while this cost is predictable, it is not small in aggregate nor is it scalable for an organization adopting microservices, serverless, or other distributed architectures. For logging to explain problems with an individual transaction, it must surface all of the logs associated with that transaction. That means, even for a fixed number of transactions, adding to a system’s complexity (for example, by adding a new service) will increase the cost of logging even if the number of transactions remains constant. As a result, developers are often discouraged from adding verbose logging to applica‐ tions, and in most cases logging may be severely curtailed or even disabled in produc‐ tion environments to save on cost. This is particularly problematic as there can be significant value in some of these production logs since many performance problems only emerge at scale or in the context of hard-to-predict customer behavior.

Fatal flaws in distributed tracing Because distributed tracing tools are built from the ground up to support distributed applications, maintaining complete records of individual transactions (including across services) is a primary capability. As such, it’s easy to control costs simply by changing the fraction of transactions for which traces are transmitted and stored. Dapper showed that sampling an extremely small portion of transactions—as few as 0.1% or even 0.01%—still has tremendous value for understanding performance and driving optimization work. If new services are added in a way that significantly increases the number of spans, the sampling rate can simply be turned down. And

like logging, there are no issues with adding high-cardinality tags or other annotations to spans. However, this sampling has a critical downside: it’s no longer possible to reconstruct a complete picture of application performance from these samples. Nor is it possible to examine types of transactions that occur rarely. For example, if an operator is trying to understand 99.9th-percentile latency performance, the chances of 1-in-10,000 sampling finding helpful examples are vanishingly small.

As such, distributed tracing is often used as a tool to explain gross performance problems rather than determining when they are occurring. And while it can be quite valuable for high-volume applications with relatively homogeneous users (like Google), it doesn’t offer many advantages over centralized logging for lower-value applications, nor does it help provide visibility into small user segments or infrequently occurring errors.

Three Pipes (Not Pillars)

The reader may be asking some questions about our framing of the problem at this point: for example, why define distributed tracing as a technique that must use some form of sampling? Why require that metric time series be computed in advance and stored as separate streams? You are absolutely correct to be asking such questions! While implementers of these tools have each focused on optimizing for specific use cases, in fact, they are not really three separate tools but three different techniques for collecting and managing telemetry data.

If we instead view these three techniques as pipes—three ways of transmitting and storing data—rather than three separate pillars, there are many opportunities to answer interesting and valuable questions at a reasonable cost. While it is far beyond the scope of this chapter to describe the design of a unified observability platform, the following examples show how metrics, logging, and distributed tracing data can be used to provide functionality usually associated with one of the other pillars:

• Logs may be visualized as traces. In cases where each entry is tagged with a transaction or request identifier (sometimes called a correlation ID), it’s straightforward to describe a query to find all of the logs associated with an individual transaction. If the logs also contain request latency (or if pairs of logs can be used to infer latency), then they can be visualized as a flame (or icicle) graph.
• Time series may be derived from spans (see the sketch after this list). For example, the durations of spans may be extracted either as they are generated or after the fact and then displayed as a time series of latency. (If sampling is used, then the results must be scaled appropriately.) Other metric time series can also be derived from tags as they occur in spans.

• Logs may also be extracted from spans. In cases where volume is sufficiently low, span annotations can be extracted and centralized with other logs.
• Metric time series may be computed from logs. For example, Facebook’s Scuba database ingests (and stores) millions of events per second and then lets users query and view metrics derived from these events as time series.3
• Changes to metric counters or gauges can be added as span annotations. While metrics are usually aggregated very close to the source to reduce the cost of transmitting and storing them, associating changes to spans offers additional possibilities in how they are analyzed.
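As a small illustration of the second idea in this list (deriving a time series from spans), the sketch below buckets span durations by minute and computes a per-minute average latency. The span fields are a hypothetical, pared-down representation, and a real pipeline would also scale the results to account for sampling.

package main

import (
    "fmt"
    "time"
)

// Span is a minimal, hypothetical span record; real tracing backends carry much more.
type Span struct {
    Service  string
    Start    time.Time
    Duration time.Duration
}

// latencySeries derives a per-minute average-latency time series from spans.
func latencySeries(spans []Span) map[time.Time]time.Duration {
    sums := map[time.Time]time.Duration{}
    counts := map[time.Time]int{}
    for _, s := range spans {
        bucket := s.Start.Truncate(time.Minute)
        sums[bucket] += s.Duration
        counts[bucket]++
    }
    series := map[time.Time]time.Duration{}
    for bucket, sum := range sums {
        series[bucket] = sum / time.Duration(counts[bucket])
    }
    return series
}

func main() {
    now := time.Now().Truncate(time.Minute)
    spans := []Span{
        {"checkout", now.Add(5 * time.Second), 120 * time.Millisecond},
        {"checkout", now.Add(20 * time.Second), 180 * time.Millisecond},
        {"checkout", now.Add(70 * time.Second), 900 * time.Millisecond},
    }
    for bucket, avg := range latencySeries(spans) {
        fmt.Printf("%s average latency %s\n", bucket.Format(time.Kitchen), avg)
    }
}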

While these three different ways of organizing observability data present many trade-offs, we shouldn’t ask users to commit to these trade-offs as part of selecting tools: these are the concerns of tool implementers. At the same time, as users of observability tools we will ultimately pay for these choices. As we will discuss in the next section, we must first approach the problem from the point of view of the outcomes we want to achieve. Only then can we adequately assess which tools are required and how well they are performing.

Observability Goals and Activities

The fragility of the three pillars is perhaps unsurprising. It is a trio of convenience: there is no fundamental law of nature that requires there to be exactly three observability tools. Despite this, the three pillars are often used as a checklist for infrastructure teams whose responsibility is to provide tools to the rest of their organizations.

Unfortunately, it is quite possible to deliver suitable implementations of all three pillars but still leave gaps in the observability platform. That is, developers and operators may not be able to understand the behavior of their applications and services despite having metrics, logging, and tracing tools, because even together, those tools often do not achieve the primary goals of observability.

Two Goals in Observability

There are ultimately only two goals in using any observability tool:

• Improving baseline performance
• Restoring baseline performance (after a regression)

By improving baseline performance, developers hope to improve user experience, lower infrastructure costs, or both. For user-facing applications, performance often

3 [Abr13]

Observability Goals and Activities | 145 means request latency, though it might also include other, longer transactions. This sort of optimization is a process usually undertaken over the course of days, weeks, or even months. Observability tools are critical for improving baseline performance first in measuring that performance (that is, in establishing the baseline) and then by directing develop‐ ers toward the parts of their software where they can be more effective in improving performance. With a monolithic application, developers may simply use a CPU pro‐ filer to understand which parts of the application are taking the most time. In a dis‐ tributed system, it’s often unclear when slow parts of a request are actually impacting user experience. We will consider examples in Chapter 9 that show how distributed tracing is critical in determining how to plan and execute work in improving baseline performance. In contrast to the planned work behind improving baseline performance, restoring baseline performance is, almost by definition, unplanned. Regressions in perfor‐ mance, including application outages, can result in a loss of revenue, negatively impact the brand, and degrade user trust. As such, regressions occurring in produc‐ tion systems must be corrected as soon as possible. Depending on your organization’s service level goals, you may have only minutes to detect, understand, and mitigate performance regressions. Observability tools are also critical in restoring baseline performance. Often prob‐ lems in one service may not be detected in that service itself but still negatively affect the performance of other services. (This can be true when both services are managed by the same organization or when they are managed by two different ones.) As we discuss in Chapter 9, distributed tracing is also critical in effectively and quickly responding to performance regressions. Two Fundamental Activities in Observability While improving and restoring baseline performance might feel like very different kinds of goals, they are both based on the same two fundamental activities:

• Measuring the impact of performance on users
• Explaining variation in those measurements

To some users of monitoring tools, especially infrastructure- and network-monitoring tools, it might come as a surprise to see such a narrow definition of what we care about: that we focus only on the impact on users. There are hundreds of other types of measurements that could describe the behavior of a production software system: from host, storage, and network utilization, to queue sizes, open connections, garbage collector overhead, and many other types of performance.

146 | Chapter 7: A New Observability Scorecard In the case of improving baseline performance, if you are not working on something that will improve the performance as observed by your users (or reduce cost), well, you are wasting your time. And if you just got woken up at 3 a.m. for something that is not impacting your users, you should have a few words with the rest of your team about what constitutes an urgent alert! While we’re strict about measuring performance impact on users, we’re less so about what exactly “performance” means in this context. Generally, any behavior of the soft‐ ware that users might notice is worth measuring: not just request latency but quanti‐ tative user experience and even correctness can all be considered types of performance. Focusing on user impact is closely related to the mantra of “alerting on symptoms rather than causes.” And of course, your tools should measure all these other potential causes, but as a user of observability tools, you need concern yourself with them only inasmuch as they assist you in the second fundamental activity: explaining variations in user-impacting performance. The danger of looking too closely at all of these other metrics is that many of them may describe problems that will distract you from more important work: in any production system of reasonable scale, there are always doz‐ ens—if not hundreds—of things going wrong, but (hopefully) only a small fraction of them are impacting your users at any given time. As a concrete example of this kind of distraction, consider the giant dashboards that can be seen in operations centers or on the monitors of many on-call engineers. Filled with rows and rows of time series, these dashboards capture many signals that might tell an operator what is going wrong. Even—and perhaps especially—during a user-impacting incident, these dashboards often will show many graphs that are changing simultaneously: when something is going wrong, it will often affect many different aspects of performance. However, these graphs are merely showing correlated failures and not bringing you any closer to the cause of the problem. While there may be one or two people on your team who can look at these sorts of dashboards and infer the root cause of the problem, it doesn’t bode well for the team if these people ever go on vacation. Worse, many members of your team may look at these dashboards and assume that they are exhaustive: that they include every possible explanation for the cause of a problem. After all, they were built by smart, experienced people! This can lead to the assumption that, if the dashboard can’t explain the problem, it must be a “networking glitch” or another problem that can’t be explained, and therefore shouldn’t be investi‐ gated. Unfortunately, today’s software systems are too complex and too dynamic for developers—even seasoned ones—to anticipate every possible cause. While variation in performance is often first described as a time-dependent phenom‐ ena (“our service started slowing down at 5:03 p.m.” or “error rate jumped at 11:30

Observability Goals and Activities | 147 a.m.”), time is merely a proxy for some other precipitating event, whether that be a new release of your service, a change in one of your service’s dependencies, or another change to the environment. It’s critical that observability tools measure performance in ways that let these events be distinguished from each other. This is the root of the requirement around cardin‐ ality discussed earlier: the ability to compare performance across releases, hosts, cli‐ ents, or other dimensions that can impact that performance. The number of signals that can explain performance variation is orders of magnitude larger than the number of ways that performance can impact users. While a typical application might only measure a handful of ways that users can be affected, there can easily be thousands or even tens of thousands of explanations for why users are being impacted. The problem is not a signal-to-noise problem (where there is a lack of information) but a too-much-signal problem: there are just too many data sources, each of which will probably help to explain a performance problem at some point, even if they are not helpful in explaining today’s problem. We seem to still be rather far from the point where tools can automatically pinpoint the root cause of a performance problem. As such, the role of an observability tool in explaining variation is often to help narrow the search space. This can take the form of providing suggestions as to what likely causes might be. It can also mean allowing developers and operators to interactively query data in such a way that they can form hypotheses and then build evidence to support or repudiate them. Explaining performance variation leads us to the purpose of an observability tool: to take action to improve performance. The type of variation will lead to the type of action to be taken. If slow requests are associated with a new release, that release should be rolled back; if slow requests are all occurring on the same compute node, a new node should be provisioned to augment or replace the old one. Ultimately, being able to explain variation in performance will enable you to control the impact on your users. A New Scorecard With these goals and activities in mind, we are now in a better position to think about the trade-offs of metrics, logging, and distributed tracing and to judge the efficacy of an observability platform in general. We will do so looking at how solutions address each of the activities described earlier. First, to help users measure the impact of performance on users (or put another way, to measure symptoms), observability tools should be judged on the following characteristics:

• Statistical fidelity
• Cardinality limits
• Volume limits
• Time limits

While no tool can provide perfect fidelity with no limits (and at reasonable cost), striking the right balance between these is critical to providing value. Second, to help users explain variations in these measurements, observability tools must be able to help quickly narrow the search space of possible explanations. While there is no fixed set of ways to do that, there are some high-level approaches that observability tools can take to facilitate this:

• Providing context
• Prioritizing by impact
• Automating correlation

These are more qualitative than the ways we can judge how well observability tools measure impact. They are outlined later in this chapter and will be described in more detail in the context of specific use cases in subsequent chapters.

Statistical fidelity

By fidelity, we mean that when summarizing the behavior of a large population of requests, observability tools maintain enough information to understand the overall “shape” of the behavior, often shown using histograms.

Statistics like mean and standard deviation are great tools for summarizing behaviors that follow normal or other simple distributions. However, software measurements like latency are rarely normal—they often have long tails—and are often multimodal. For example, request latency will often have multiple modes based on whether required data can be found in a cache. When trying to understand or improve latency, it’s important to understand whether the current objective would be best served by increasing the number of cache hits or improving the latency of a cache miss.

Histograms provide a simple visual way of capturing these discrete behaviors and understanding their relative frequency. Histograms can also be used to derive statistics such as different percentiles.

High-Percentile Latency

While they are certainly more difficult to measure than other statistics (such as count, mean, or standard deviation), high-percentile measurements are critical for understanding and improving performance, particularly in the case of request latency. As such, it’s best practice to measure not just mean latency, but also the latency of the slowest requests, such as the 95th, 99th, or even 99.9th percentile.

In a distributed system, the combinatorics of fanning out a request across hundreds of services means that, while only 1% of requests to an individual service may be slow, the chances of these slow requests impacting users may be much higher.

More qualitatively, those users experiencing the slowest 1% of requests can act as bellwethers for the rest of your user base. These are often expert users who are pushing the limits of your system. As the data volume grows or the complexity of user interactions increases, many more users may experience larger latencies.
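The fan-out effect is easy to quantify under a simple (and admittedly idealized) independence assumption: if a request touches N downstream calls and each call has probability p of landing in the slow tail, the chance that the request hits at least one slow call is 1 − (1 − p)^N. A quick sketch:

package main

import (
    "fmt"
    "math"
)

// probSlow returns the chance that at least one of n independent calls lands
// in the slowest fraction p of requests.
func probSlow(p float64, n int) float64 {
    return 1 - math.Pow(1-p, float64(n))
}

func main() {
    // Only 1% of calls to any single service are slow, but a request that fans
    // out across 100 such calls is slow almost two times out of three.
    fmt.Printf("%.0f%%\n", 100*probSlow(0.01, 100)) // ~63%
}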

Cardinality limits The ability to break down performance across different dimensions is at the core of explaining variation. As such, an observability solution must be able to ingest and analyze data with many different labels or tags that represent these dimensions. While there will likely be some limits on the cardinality of the data, observability solutions should be measured on their capacity to use labels, tags, and other metadata to explain variation in performance.

Volume limits

An observability solution’s volume limits are defined by how many events can be captured every minute and how much detail can be captured with each event. In cases where not all events are captured, what mechanisms are used to select events (that is, to sample them)? The processes used to select events can affect which are available for subsequent analysis as well as what statistical inferences can be drawn from the sample.

Time limits

Not all observability data can (or should) be kept forever. The amount of history that an observability solution stores is sometimes called its horizon. Depending on the use case, different horizons may be appropriate. In the case of validating a new release, hours or maybe days of data will be sufficient. Measuring the impact of a quarter-long optimization project will obviously require a longer horizon. Infrastructure capacity planning might require a year or more of data to account for a business’s seasonality.

150 | Chapter 7: A New Observability Scorecard The other important aspect of time is how soon a request is reflected in measure‐ ments and analysis. As part of restoring baseline performance for a highly available service, it’s critical that measurements be made within a minute or less. For other use cases, processing data as part of a daily or weekly batch may be sufficient.

Provide context Many (maybe even most) software problems are caused not by a failure of a single component but by an unanticipated interaction between two or more components. One of the most common cases is where one component fails gracefully but another component, one that depends on the first, does not handle this failure properly. As such, observability solutions must put failures and other performance problems in context. What was the original request that led to the problem? What sequence of actions led to the current state? A distributed trace is one example of this sort of con‐ text, but context can span multiple requests or even hosts. For example, an operator trying to understand why one service instance is overloaded would benefit from understanding that several other instances have recently crashed (leading a load bal‐ ancer to redirect traffic).

Prioritize by impact As mentioned earlier, the complexity of distributed systems means that there will always be many things going wrong, including many requests (or parts of requests) that are slow. An observability solution should help prioritize both the problems and their potential causes by the impact they are having on users. One example is to prioritize performance problems based on their contributions to the critical path of requests. The critical path is the parts of a request that, if sped up, would improve the latency experienced by an end user. Since, by definition, speeding up parts of a request that are not on the critical path would have no impact on end users, it’s not worth spending time on them.

Automate correlation Human operators have important knowledge about the behavior of software systems as well as insights into their failures, but the number of signals that are emitted from distributed systems is far too large to sift through manually. An observability solution can help human operators focus on what matters by promoting signals which corre‐ late with performance issues and, conversely, by filtering out signals which have no correlated changes. For example, if the rate of errors observed during a slow roll-out is much higher for the new version, then showing the difference in performance based on version can be invaluable. (On the other hand, showing a breakdown by host might lead to incorrect conclusions.)
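A simplified sketch of the kind of correlation an observability tool might automate: grouping spans by the value of one tag (here, a hypothetical version tag) and comparing error rates across the groups. A real tool would also check whether the differences are statistically significant before promoting them.

package main

import "fmt"

// Span is a pared-down, hypothetical span record carrying only what this sketch needs.
type Span struct {
    Tags  map[string]string
    Error bool
}

// errorRateBy groups spans by the value of one tag and returns the error rate
// for each group: the raw material for "this version is failing more often."
func errorRateBy(spans []Span, tag string) map[string]float64 {
    errs := map[string]int{}
    totals := map[string]int{}
    for _, s := range spans {
        v := s.Tags[tag]
        totals[v]++
        if s.Error {
            errs[v]++
        }
    }
    rates := map[string]float64{}
    for v, total := range totals {
        rates[v] = float64(errs[v]) / float64(total)
    }
    return rates
}

func main() {
    spans := []Span{
        {Tags: map[string]string{"version": "v1.3"}, Error: false},
        {Tags: map[string]string{"version": "v1.3"}, Error: false},
        {Tags: map[string]string{"version": "v1.4"}, Error: true},
        {Tags: map[string]string{"version": "v1.4"}, Error: false},
    }
    for version, rate := range errorRateBy(spans, "version") {
        fmt.Printf("%s: %.0f%% errors\n", version, 100*rate)
    }
}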

Observability Goals and Activities | 151 The Path Ahead In this light, when making choices about observability tools, we should not ask sim‐ ply “Is this a good metrics tool?” (or a good logging tool or a good tracing tool) but instead “Is this a good observability tool?” Doing so means accepting broader defini‐ tions of each of these tools and of distributed tracing in particular. Unlike metrics and logs, distributed tracing starts from the assumption that we are trying to observe a distributed system. It is critical that any observability tool used in a distributed system provide a holistic view of the application, and there is an opportunity for distributed tracing to play a much larger role in how applications are developed and run. As we consider use cases and potential future work in subsequent chapters, we will continue to take a broader view of tracing: not just as it was deployed as tools like Dapper at Google, but as a tool that takes advantage of multiple sources of informa‐ tion and combines them in ways that are timely and cost-efficient for developers and operators. This will mean leveraging metric data to understand possible causes out‐ side of the application or deriving metrics from traces to show operators when appli‐ cation behavior is changing. It will also mean leveraging log data as another source of information for tracing and showing users specific events data as part of explaining what happened. While the three pillars of observability offer an easy way to categorize existing tools, breaking free of this paradigm offers many more possibilities for how we approach observability problems. As we consider a range of use cases in the next few chapters, consider how these different data sources—individually and especially when com‐ bined—can help connect cause and effect in production systems.

CHAPTER 8 Improving Baseline Performance

In any other production system—whether it’s a software system or a factory—the process by which the product is created has a profound impact on the cost of produc‐ tion and on the product itself. In modern software applications, production costs are mostly related to computing resources and other infrastructure, including the costs of buying and running servers in a private datacenter or renting them from a cloud pro‐ vider. How that software is delivered also affects user experience. In this chapter, we consider how to reduce costs and improve user experience using distributed tracing. In particular, we focus on improving baseline performance: that is, how the software performs over the course of weeks, months, or quarters. Understanding baseline per‐ formance will enable you to plan engineering work over the next few weeks or months effectively, maximizing your chances of having a positive impact. (In con‐ trast, the following chapter will focus on approaches to restoring performance to that baseline when something has gone wrong.) In the previous chapter, we discussed distributed tracing in the context of the “three pillars of observability.” In particular, we said that software developers and operators have the most to gain from distributed tracing and other observability tools when those tools take advantage of all three forms of performance telemetry: metrics, logs, and traces. As such, the approaches in this chapter will consider distributed tracing as a means for not merely viewing traces, but for analyzing and visualizing telemetry using tracing data as a way to put that telemetry in the context of application requests. While we will start with looking at individual traces as a way of understand‐ ing application performance, we will quickly progress to approaches that automate many manual steps and take advantage of hundreds or thousands of traces. Before we can start to analyze performance data, however, we must first establish how we measure performance, including the statistics tools required to do so for a large volume of requests.

Measuring Performance

For user-facing applications, it’s critical to measure performance as it affects users. As such, request latency is a critical measure for these applications, and moreover, latency should be measured as close to the user as possible. Measuring latency in the user’s browser or mobile app is better than measuring it at the load balancer: this will enable you to see the effects of the network on performance. Even better would be to measure latency between a user’s interaction and when new results are rendered on their screen: this lets you see the impact of multiple requests to your backends as well as any computation done in the client.

Economic Value of Lower Latency Numerous studies have shown that even barely perceptible increases in latency can have significant effects on revenue and other kinds of user conversions. Experiments performed by Google showed that increasing the time to load a page of search results by half a second reduced the total number of searches by 20%, and conversely, if Goo‐ gle made a page faster, it would see a roughly proportional increase in the amount of usage.1 Research by Akamai showed that even an increase in latency as little as 100 milliseconds could reduce ecommerce conversion by as much as 7%.2 Pinterest showed that a 40% reduction in visitor wait time resulted in a 15% increase in conver‐ sion to sign-up.3 While it may be easy to focus on reducing infrastructure costs when thinking about the economic value of software performance, improving user experience is not just good for users, but has real—and measurable—business value as well.

In measuring performance, you should also consider which users’ performance you care about. Are you interested in improving performance for your average users? For your power users? Perhaps for a specific customer segment? This choice will affect how you go about measuring performance. If you are interested in improving performance for most users or are working to reduce costs, targeting median latency might be a good place to start. Working to improve median latency can not only improve performance for many users, it’s also a good place to start if you are looking to reduce overall costs. Because many requests tend to cluster around the middle of the distribution, reducing the amount of

1 [May10]
2 [Aka17]
3 [Med17]

154 | Chapter 8: Improving Baseline Performance computation required for each of these requests can have a big impact on overall compute requirements. If you want to improve performance for the subset of your users that are having the worst experience, measuring high-percentile latency (as discussed later) will get you pointed in the right direction. It may seem counterintuitive to spend your time improving latency for only (say) 1% of requests. However, there are several important reasons to consider doing so:

• If you are responsible for a service in the middle of your application, 1% of the requests that your service is handling can affect many more than 1% of end users. This occurs most often when requests from users are fanned out (directly or indirectly) across many instances of your service. The chances of a slow request affecting a given user increase with the number of instances serving that part of the request, since the slowest such instance will determine the overall latency observed by that user.
• Users experiencing high-percentile latency often serve as bellwethers for the rest of your users. The slowest requests are usually hitting parts of your application that are already performing at or near their limits. As request volume increases, more and more requests will suffer from similar problems; though you are improving only 1% of requests today, they will represent a larger portion of requests in the near future.
• Our experience has shown that users experiencing high latency tend to do so because they are using larger datasets or are issuing more complex queries. These also tend to be high-value users, including those responsible for an outsize portion of your revenue.

Though not always considered part of “performance,” the portion of user requests that fail (or have some unrecoverable error) is also important to measure. (If you require that all “performance” measures be expressed in terms of “faster” or “slower,” then consider failed requests to be infinitely slow.) Requests fail for any number of reasons in distributed systems, including software bugs in service (locally or further down the stack), user errors, network failures, and underprovisioned services. In many ways, understanding errors is the bread and butter of distributed tracing: isolating them requires understanding the dependencies between services and how they interact with each other. The two other “golden” metrics of application performance, traffic rate and satura‐ tion, cannot be associated with individual requests in the same way that latency or the presence of errors can be. (An individual request can certainly be part of what causes traffic to spike, but it’s difficult to blame that request more than any other.) These metrics typically play a larger role in applications that are more focused on through‐ put than on latency: this is true of many batch-processing systems and some

Measuring Performance | 155 high-volume delivery systems (including video processing). In these applications, if the throughput falls below a given threshold then user experience may suffer, but most optimization is focused on maintaining that throughput while reducing costs. Note, however, that traffic rate and saturation can also be important signals in explaining and improving latency and error rates; we will consider how they can be integrated into distributed tracing to do so. Whatever measures of performance matter for your business, organization, or users, once you’ve determined what they are, measuring and setting goals for them is criti‐ cal to prioritizing your work. Though mostly beyond the scope of this book, estab‐ lishing SLIs is a way of formalizing performance measurements in a way that can be measured precisely by determining:

• What you are measuring (for example, median or 99th-percentile latency)
• What you are measuring it for (for example, which service, endpoint, or operation)
• Over what time period you are measuring it (for example, the last five minutes)

If integrated into your distributed tracing solution, SLIs can also help to make auto‐ mated decisions about data collection and sampling and even guide you toward the root causes of performance problems. Before we look at that, however, let’s introduce a few concepts from statistics. Percentiles Throughout this chapter and the next, we will frequently refer to high-percentile latency or in some cases, 99th-percentile latency. (If you are familiar with percentiles, feel free to skip this section.) Like statistics such as average and maximum, percentiles are a way of summarizing the performance of a large set of requests. Unlike average or maximum, however, percentiles offer a much more powerful way of comparing sets (or “populations”) of requests. Two common places where you may have seen percentiles and similar statistics are academic tests that were “graded on a curve” or measures of the height or weight of young children. Percentiles can be defined as follows: each percentile gives the value of some meas‐ urement (in our case, often latency) for which a given portion of the population lies below that value. For example, if the 50th-percentile latency is 100 milliseconds (ms), then 50% of the requests considered were faster than 100 ms. Similarly, if the 90th- percentile latency was 1 second, then 90% of the requests considered were faster than 1 second (see Table 8-1).

Table 8-1. Example request latencies and statistics

Request latency (ms):
87, 89, 91, 93, 102, 138, 174, 260, 556, 5000

Selected statistics (ms):
50th percentile   120
Average           659
90th percentile   1000
Maximum           5000

Given the request latencies shown in Table 8-1, we can compute the average, 50th and 90th percentiles, as shown. (Note that we’ve chosen to use a definition of percentiles that interpolates between values.) Percentiles provide more flexibility than just looking at an average value. Simply put, looking at the average only works well when the average is representative of the data as a whole. In measuring software performance, it’s not uncommon for values to be widely distributed. And when a few values are very widely distributed, they can have an outsize impact on the average. In our simple example, because there is one extremely large value, the average is larger than all of the values but that one. The 50th percentile (or median) is often more robust to outliers than the average.4 Percentiles are also useful when looking to understand and improve the performance of the slowest requests. It can be tempting to look at the maximum latency; however, the maximum will often be determined by how timeouts are configured or even by some aspect of the telemetry or monitoring system. For example, if you are investi‐ gating a request that took exactly 60 seconds, chances are you will find that some part of the request repeatedly failed until the entire request was aborted (meaning that you are not really debugging a slow request but a failed one). Looking at requests that are

4 You can assess how wide a distribution is using its standard deviation; however, most latency distributions are not normal distributions. As we’ll discuss in the next section, looking at the whole distribution can be even better than just looking at a single percentile, including the median.

among the slowest 1%, 5%, or 10% (depending on your overall request volume) will usually offer better candidates for improving latency.

Beware: Computing with Percentiles In addition to being less familiar to many engineers and developers, using percentiles also presents some challenges for implementers of distributed tracing and other observability tools. Averages and maximums can be measured on separate hosts (or even in separate datacenters) and then easily combined: averages by weighting them appropriately, and maximums simply by taking the largest. Unlike statistics such as average and maximum, however, there’s no straightforward way of combining separate measurements of the 50th (or any other) percentile. Users should be careful when observability tools claim to do so.
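To make the interpolating definition used for Table 8-1 concrete, here is a sketch of a linear-interpolation percentile calculation (one common convention among several).

package main

import (
    "fmt"
    "math"
    "sort"
)

// percentile returns the pth percentile (0-100) of values, interpolating
// linearly between the two nearest ranks.
func percentile(values []float64, p float64) float64 {
    sorted := append([]float64(nil), values...)
    sort.Float64s(sorted)
    rank := p / 100 * float64(len(sorted)-1)
    lower := int(math.Floor(rank))
    if lower >= len(sorted)-1 {
        return sorted[len(sorted)-1]
    }
    frac := rank - float64(lower)
    return sorted[lower] + frac*(sorted[lower+1]-sorted[lower])
}

func main() {
    latencies := []float64{87, 89, 91, 93, 102, 138, 174, 260, 556, 5000}
    fmt.Println("50th percentile:", percentile(latencies, 50)) // 120
    fmt.Println("90th percentile:", percentile(latencies, 90)) // about 1000
    fmt.Println("99th percentile:", percentile(latencies, 99)) // dominated by the single outlier
}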

Computing percentiles for small datasets can be confusing and often misleading. In our work, we’ve come across developers from time to time who are trying to measure 99th- (or even 99.9th-) percentile latencies for datasets with only dozens of examples. In trying to keep our example in Table 8-1 small, we are ourselves guilty of this: in our example of 10 points, the 90th percentile is determined by only two points (the two largest)! When a percentile (or any statistic, for that matter) is determined by only a few points, noise in the measurement of those points can carry over into that percentile. In addition, differences in how different tools compute percentiles (for example, using interpolation or the nearest point) can also make it difficult to compare results. In those cases, it’s easy to make incorrect conclusions. Try to avoid computing per‐ centiles that will be determined by only a handful of points. Histograms Though slightly more expensive in terms of network and storage costs, histograms provide much more detail than just a handful of statistics. This is especially important in software performance where the behavior of a service is often much more complex than just a single Gaussian or “bell” curve. While they might at first seem unfamiliar if you are more used to reading time series graphs, they will quickly become a go-to tool for improving performance. Unlike a time series graph, where the unit of measurement—in our case, latency—is on the vertical axis, in a histogram the unit of measurement is on the horizontal axis (see Figure 8-1).

Figure 8-1. Example of a histogram.

Each bar in the histogram represents a subset of the overall population: for example, the requests with a latency between 100 and 150 ms. The height of the bar indicates the size of that subset; this is sometimes labeled “frequency.” In the example, there are around 10,000 requests with latencies between 100 ms and 150 ms. Like many latency histograms, the example is plotted on a log-log scale. A logarithmic horizontal axis lets us see a much wider range of latencies (from only a few milliseconds up to a minute) while a logarithmic vertical axis lets us see patterns in even small subsets of requests. While the 99th- and other high-percentile latencies are more robust to outliers than statistics like the maximum, they can still fail to reveal many aspects of application performance. For the requests in Figure 8-1, the 99th-percentile latency is around 3 seconds. However, this would still be the case even if the cluster of requests between 100 and 150 ms slowed down to around 1 second instead. Likewise, it would still be the case even if the cluster of requests around 5 seconds had occurred at 10 seconds instead. Ultimately, the 99th percentile is really just showing a simple division of your requests: 99% of them are faster and 1% are slower. It says nothing about how requests are distributed within these two sets. The shape of a latency histogram, however, describes the performance of your service or application in far more detail. In several of the analyses later in this chapter and in the following chapter, we’ll show how to use them to divide requests into meaningful classes and then to understand the reasons behind the performance differences between those classes.
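As a sketch of the log-log presentation described here (the latencies below are synthetic), logarithmically spaced bins can be generated with NumPy and plotted with Matplotlib:

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic latencies: a fast cluster around ~120 ms and a small slow tail near ~5 s.
    rng = np.random.default_rng(0)
    latencies_ms = np.concatenate([
        rng.lognormal(mean=np.log(120), sigma=0.2, size=10_000),
        rng.lognormal(mean=np.log(5_000), sigma=0.3, size=200),
    ])

    # Logarithmically spaced bin edges from 1 ms to 60 s.
    bins = np.logspace(np.log10(1), np.log10(60_000), num=80)

    plt.hist(latencies_ms, bins=bins)
    plt.xscale("log")   # logarithmic horizontal axis (latency)
    plt.yscale("log")   # logarithmic vertical axis (frequency)
    plt.xlabel("latency (ms)")
    plt.ylabel("frequency")
    plt.show()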

Defining the Critical Path

In any distributed system of reasonable complexity, there are always many services and RPCs that are slow. So many, in fact, that even the hardest-working developers would not be able to track down all of these issues and fix them before more were introduced. And indeed they shouldn’t! Most of these issues have no impact on users and so fixing them is of little value to your business or organization. Before working to address latency of any particular service, you should be confident that doing so will affect performance as observed by users. (In Chapter 7 we referred to this as “prioritizing by impact.”)

A common method for understanding which services are impacting user-visible performance is to first determine the critical path of each slow request. Originally developed as part of project management, the critical path (when applied to requests in a distributed software system) describes the parts of processing the request that, when taken together, determine the overall duration of the request. In the terms of distributed tracing, the critical path of a trace is a subset of spans of that trace or even parts of those spans. One definition says that span A is part of the critical path at time t if and only if two conditions are true:

• A’s parent is blocked on A’s completion at time t
• A is not blocked on any child span’s completion at time t

This is a convenient way of thinking about the critical path; in a way, it describes the “bottom edge” of a trace. In some cases, however (and as we’ll describe later), there is some ambiguity when there are multiple concurrent child spans. To avoid this ambiguity, we define the critical path as follows. Span A is part of the critical path at time t if and only if reducing the length of A at time t reduces the overall latency of the request. Figure 8-2 shows an example trace with the critical path shaded. You can see that the length of the critical path is the same as the length of the overall request, and in this case, the length of the longest span as well. In this example, span A contributes to the critical path at several points, including at the beginning and end of the request. Spans B, D, and E are each entirely on the critical path. Span C is partially on the critical path, but only before and after D.

Figure 8-2. Example trace with critical path.

If you were looking to reduce the latency of this request, you should look at spans B, D, or E or at the parts of spans A and C that are not blocked on one of the other spans. Figure 8-3 shows a second trace. In this example, two of the spans represent the same work, but as perceived by the client (client A) and the server (server A).

Figure 8-3. Trace with client span on the critical path.

Looking at the time where client A is on the critical path, you might be asking “Why is the client doing any work here?” In this case, the delay is most likely due to network latency, though different tracing tools may show this in different ways. As a user you must often consider: is the latency being caused by the service directly? Or is it a delay caused by something external (like the network)? Or is it caused by contention for some other resource (for example, the CPU or a lock)? Additional metadata on spans can help shed light on these cases, as discussed later.

Figure 8-4 shows a third trace with the critical path shaded. In this case, span A has two child spans (B and C) that represent concurrently executing work. At time t, A could be said to be blocked on either of the child spans (and neither of those two are blocked on any other spans), so by the first definition of “critical path” either one could be considered to be on the critical path. In this case, however, reducing the length of B would not reduce the overall length of the request; only reducing C would have that effect. When there are multiple child spans that describe work that is being executed concurrently, the longest such span is the only one on the critical path.

Figure 8-4. Example trace with concurrent child spans and critical path.

Note that the definition only says that some reduction of the length of a span on the critical path will reduce the overall latency of the request, but not that any reduction will result in an equal reduction in overall latency. For example, consider a span that is on the critical path and is 1 second in duration. Reducing its length by 500 ms doesn’t necessarily reduce the time of the entire request by 500 ms. In the trace shown in Figure 8-4, when the length of span C is reduced to less than the length of B, it will no longer be on the critical path (and therefore reducing its duration will no longer reduce overall latency). Remember that as part of optimizing the work represented by an individual span, you may change which spans appear on the critical path altogether (and therefore need to change your plans for further optimization work).
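To make the definition concrete, the following sketch (independent of any particular tracing system; the Span class and timestamps are invented) walks backward from the end of each span and always descends into the child that finished last, which produces the critical path for traces like those discussed here. Production tools must also handle clock skew, asynchronous spans, and spans that outlive their parents.

    from dataclasses import dataclass, field

    @dataclass
    class Span:
        name: str
        start: float          # timestamps in ms
        end: float
        children: list = field(default_factory=list)

    def critical_path(span):
        """Return (name, start, end) segments on the critical path of `span`,
        walking backward from its end and descending into the child that
        finished last."""
        segments = []
        cursor = span.end
        for child in sorted(span.children, key=lambda c: c.end, reverse=True):
            if child.end > cursor:
                continue                                      # concurrent with a longer sibling
            if child.end < cursor:
                segments.append((span.name, child.end, cursor))  # parent's own work
            segments.extend(critical_path(child))
            cursor = child.start
        if cursor > span.start:
            segments.append((span.name, span.start, cursor))
        return segments

    # Roughly the shape of Figure 8-4: two concurrent children, C longer than B.
    b = Span("B", 10, 40)
    c = Span("C", 10, 70)
    a = Span("A", 0, 80, children=[b, c])
    print(critical_path(a))   # A (70-80), C (10-70), A (0-10); B never appears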

Understanding Causal Relationships

When using the span-based model of distributed tracing popularized by Google’s Dapper, spans contain explicit references to their parents. From this information, a tree can be built showing where spans started relative to each other. However, some aspects of the relationships between spans must still be assumed. When a child span starts, is the parent truly blocked on the work represented by the child? Or is this an asynchronous request with other work continuing concurrently? Understanding the difference will partly depend on the developers who are looking at the traces understanding the code that generated them.

In addition, adopting conventions about how concurrent work is represented with spans can also help. For example, whenever two or more threads (or workers, etc.) are processing concurrently, use a separate span for each of them that is distinct from their parent. This means that if one thread makes an asynchronous call and then continues processing, two child spans should be created. Doing so will make it easier to understand whether the original thread ever blocks on the asynchronous call and enable automatic trace analysis to better understand what’s happening.
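As a rough illustration of this convention using the OpenTelemetry Python API (the span names and the worker function are invented, and SDK/exporter setup is omitted), a thread that issues an asynchronous call and keeps working creates two distinct child spans rather than folding both into the parent:

    import threading
    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    def call_inventory_service(span):
        try:
            pass  # ... make the RPC here (hypothetical work) ...
        finally:
            span.end()  # the async child span ends when the call completes

    with tracer.start_as_current_span("handle-request"):
        # Child span #1: the asynchronous call, owned by the worker thread.
        async_span = tracer.start_span("inventory-lookup")
        threading.Thread(target=call_inventory_service, args=(async_span,)).start()

        # Child span #2: work that continues concurrently on the original thread.
        with tracer.start_as_current_span("render-response"):
            pass  # ... keep processing ...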

Approaches to Improving Performance

With these concepts in hand, we’ll now consider a number of different approaches to improving baseline performance. Since our focus is on distributed tracing, we will consider ways to isolate performance problems to a single component within a distributed system. We’ll assume that you are familiar with conventional approaches to software optimization and that, if tasked with improving the speed or efficiency of a single function, class, or module, you can do so using debuggers, profilers, and other tools. Our goal with these approaches is to help you identify that function, class, or module, so that you can then apply those tools.

Individual Traces

The most basic use of distributed tracing is to consider individual requests and look for unexpected behavior, common antipatterns, or other opportunities for improvement. The first questions you should ask when looking at optimizing an individual trace are:

• Are there any operations on the critical path that could be optimized?
• Could queries on the critical path be cached?
• Could the functionality offered by the request be refactored to split expensive operations from commonly needed ones?

As we’ve noted, asking any of these questions without understanding the critical path could mean a lot of effort expended with little improvement to what your users are experiencing. When optimizing a trace such as that shown in Figure 8-2, you should also consider the relative lengths of these spans. In this case, since D is more than twice as long as E, a 20% improvement to D will offer more than twice the benefit of a 20% improve‐ ment to E. (Similarly, caching the result of D could yield as much as twice the benefit of caching the result of E, assuming equivalent cache hit rates.) Focus your optimiza‐ tion work on the spans that make up the largest parts of the critical path.

Understanding the performance impact of refactoring requires some additional explanation. To make our example more concrete, imagine in Figure 8-2 that span B represents some authentication operation, span C some computation that determines what’s changed since the user last logged in, and span E a lookup of the user’s display preferences. Often, when you are maintaining an API endpoint, both the underlying performance and uses of that endpoint will change over time. In this example, perhaps the original use case required these operations to be bundled together (or perhaps bundling them reduced network overhead), but now the endpoint is being invoked frequently just to perform the lookup of the display preferences. In that case, those preferences could be returned much more quickly if the endpoint is refactored into two parts: one that determines what’s changed and one that returns those preferences. Sometimes, as in cases like this, “optimization” simply means doing less.

Another of the most common problems discovered when starting with distributed tracing is shown in Figure 8-5. The trace on the left shows a root span with six subspans (labeled A through F). These spans represent sequential calls to other services, possibly including queries of one or more remote databases. It’s often the case that these calls are independent of each other (that is, no one of them depends on the results of any of the others), and that the calls or queries can be performed concurrently. If they are performed concurrently, as shown in the trace on the right side of the figure, the overall latency of the request can be greatly reduced.

Figure 8-5. Trace showing sequence of independent subspans (concurrent on right).
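When the calls really are independent, the change is often mechanical. Here is a sketch (with invented placeholder coroutines) of turning the sequential pattern on the left of Figure 8-5 into the concurrent one on the right:

    import asyncio

    async def call_service(name):
        # Placeholder for an independent RPC or database query.
        await asyncio.sleep(0.1)
        return name

    async def handler_sequential():
        # Left side of Figure 8-5: total latency is roughly the sum of the calls.
        return [await call_service(n) for n in "ABCDEF"]

    async def handler_concurrent():
        # Right side of Figure 8-5: total latency is roughly the slowest call.
        return await asyncio.gather(*(call_service(n) for n in "ABCDEF"))

    print(asyncio.run(handler_concurrent()))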

Sometimes several subspans are each implemented efficiently when considered in isolation, but when considered as part of some larger request, they do redundant work; Figure 8-6 shows an example. Both spans labeled A on the left side represent the same computation. (Taken together these two A spans represent a majority of the critical path.) This code can be refactored to perform that computation only one time—with the results passed down to each of the two subspans—reducing the total time required to process the request, as shown on the right side of the figure. Doing so reduces the total amount of work performed during the request: this is an example where tracing can improve both latency and throughput.

Figure 8-6. Trace showing redundant work performed in two subspans (refactored on right).

A final example of an individual trace that can offer an opportunity for optimization comes not from optimizing the code itself but from changing the configuration of that code. Recall that Figure 8-3 shows a trace where there is a large difference between the duration of a client span and a server span for a single RPC, and that this difference can often be a result of network latency. Sometimes this network latency is unavoidable, but at other times it can be a result of a misconfiguration. For example, one service might be calling another service but routing requests to an instance in another datacenter instead of a local one or, similarly, it might be querying a database replica in the wrong region. This frequently occurs when service configuration is heedlessly copied from one datacenter to another. The example shows that you can discover this by inspecting the tags on each of the spans found in the trace. If the region tag differs between the client and server spans, that’s a good indication that there might be an opportunity to reduce overall latency.

Biased Sampling and Trace Comparison

In Chapter 6, we discussed sampling and how it can be used as a mechanism to control costs. In summary, collecting all traces (and all spans) is not necessary for understanding or improving application performance—and is usually prohibitively expensive. Traces may be sampled in a way that biases toward those with valuable information, usually by selecting those that represent slow requests or requests with errors. Here we will consider sampling in the context of improving performance indicators that matter to your business, organization, or users: your SLIs. Moreover, once the right traces are selected, you can compare them to quickly identify the root causes of issues.
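A tail-based sampling decision of the kind described here might look something like the following sketch; the threshold, baseline rate, and trace representation are illustrative assumptions, not part of any particular tracing system.

    import random

    P99_THRESHOLD_MS = 2_000   # illustrative value, e.g., derived from recent history
    BASELINE_RATE = 0.01       # keep 1% of ordinary traces

    def should_keep(trace):
        """Decide, once a trace is complete, whether to keep it."""
        if any(span.get("error") for span in trace["spans"]):
            return True                                  # always keep errors
        if trace["duration_ms"] >= P99_THRESHOLD_MS:
            return True                                  # always keep the slowest traces
        return random.random() < BASELINE_RATE           # uniform sample of the rest

    trace = {"duration_ms": 150,
             "spans": [{"name": "write-cache", "error": False}]}
    print(should_keep(trace))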

While looking at a single trace can be a good approach to improving median latency, improving the performance of the slowest requests is best done by considering at least two requests: one close to the median latency and one that has a high-percentile latency. Sampling requests uniformly at random, however, is unlikely to yield many examples of slow requests, especially if the distribution of requests looks like that shown in Figure 8-1. That is, if the latency of requests were uniformly distributed, then uniform sampling would be a reasonable approach, but since latency is almost never uniformly distributed, sampling should be biased to make sure that infrequent—but still valuable—traces are collected. For improving slow requests, make sure that a sufficient number of 99th-percentile (or even 99.9th-percentile) latency requests are sampled. Note that this might even mean that these requests are sampled just as frequently as median requests, even if they are many times less likely to occur.

Similarly, it is important to bias trace samples toward requests with errors. Services typically try to keep the portion of errors to a fraction of a percent, so again, uniform sampling is unlikely to pick up a reasonable set of examples.

There are other, more application-specific, features that might also be good candidates for driving sampling bias. For example, if you are running an experiment with only 0.1% of your users, biasing toward that set of users is important in understanding the performance of code running that experiment.

Once you have a sample of traces (as few as just two), you can then compare them to understand what is causing the slow ones to be slow. Typically it’s not the case that every subspan is proportionally longer in the slow request, but that one or two subspans are much longer.

Figure 8-7 shows two traces for an /api/update-inventory request, one on the top that takes 186 ms and one on the bottom that takes 1.49 seconds (in both cases as observed by the client). Looking at the spans that contributed to the critical path, most of the spans in the two traces are approximately the same length. The exception is the write-cache operation, which takes more than 28 times longer in the slower example. Further investigation would be required to understand why this operation takes longer in some cases, but comparing these two traces has more or less eliminated any other theory as to why the slower trace is slow.

Figure 8-7. Two traces, showing fast and slow responses for an example API request.

Trace Search

In addition to automated sampling based on latency, errors, or other features of spans, users may also want to search for spans in a manual, ad hoc way. For example, users may be working to eliminate specific classes of errors. Or users may be testing new code using their own user accounts and want to search for the traces associated with those accounts. For traces that are generated as part of a CI/CD pipeline, they may be tagged with build labels or deployment information so that users can find traces associated with failures quickly. Being able to search for specific traces is important when you, as a user, have a hypothesis that you want to validate or refute. To support these use cases, it’s necessary to index traces in such a way that a relevant trace can be found easily and efficiently.

There is ample work on indexing structured data like traces; however, in many cases, these indexes can become almost as large as the trace repository itself. Since storage is one of the major costs associated with distributed tracing, understanding how to balance the needs of users with this cost is important.

The Dapper work at Google showed that a single index based on service and then host and time was able to meet many of the needs of our users.5 This was because, first, most users were responsible for a small number of services and were focusing on the performance of just one at a time, and second, some other observability tool had clued them in to where or when a problem was occurring. Initially, we also provided an index which would enable lookup based first on host, but did not see enough interest to justify the additional cost. Tools that take similar approaches may limit use cases to those focused on a single service.

Tracing tools can also provide some assistance in building search queries quickly. Just as search engines suggest queries as we start typing, tracing tools can suggest relevant traces. For example, once a user has selected a service, a tool can suggest operations that that service implements as well as tags associated with that service. Thus, tracing tools also support a form of discovery about the data found in traces. For example, suggestions might help users understand which sorts of errors are occurring in their service or enumerate which operations their services depend on. This process can help guide users toward hypotheses about which factors are impacting baseline performance.

Once a user has a hypothesis, the next step is to look for evidence that supports or refutes that hypothesis. Searching through traces is the way to find that evidence. Some examples of hypotheses that can be explored using tracing:

• Long requests are often blocked on retries (look for spans tagged with retry=true)
• Cache misses account for the majority of latency when they occur (look for spans with cache=miss)
• Requests that time out waiting for RPCs (look for spans with errors that indicate timeouts or RPC cancellation)

While validating and refuting existing hypotheses are important use cases for distributed tracing, we will explore a set of analyses later where tracing tools also help users to form new hypotheses about application performance.

5 [Sig10]

Multimodal Analysis

In the rest of this chapter, we’ll describe use cases that depend not on just one or two traces, but a statistically significant collection of them. This may be dozens, hundreds, or even thousands of traces, depending on the homogeneity of the requests. Histograms offer a convenient way of visualizing the behavior of many traces in a way that provides much more information than just a few statistics. Here we’ll show how to use multimodal analysis to break down performance into a small set of broad categories that can be used to inform the next steps of optimization.

A modality (in the context of a histogram) is what’s shown as a peak in a graph: each modality represents a set of traces that have approximately the same latency. A histogram is multimodal if it has multiple peaks. As noted earlier, a latency histogram for a single service or even a single operation rarely has a simple bell-shaped curve. This is because latency is usually determined by a number of discrete factors, including the following:

• Which network type the client is using, including mobile data (3G, 4G, 5G) or broadband internet
• Whether an existing connection or session can be reused
• Whether the request can be serviced from a cache
• Whether the request involves mutating any persistent state
• Whether an upstream request timed out and must be retried

For example, requests that may be satisfied from a cache may be 10 times faster on average than those that cannot. A service that uses a cache will usually have a multi‐ modal latency distribution where one peak corresponds to cache hits and another to cache misses. Because a service may be affected by more than one of these factors, it’s not uncommon to see histograms with five or more modalities. Figure 8-8 shows a multimodal latency histogram. On the left side requests are pre‐ sented as a single distribution; on the right side, requests are broken down into three groups. (Notice that the height of each bar on the left is the sum of the heights of the corresponding bars on the right.) This histogram shows what you might see if requests were divided by client network type: most broadband requests are faster than most 4G requests (and nearly all 3G requests); most 4G requests are faster than most 3G ones.

Figure 8-8. Multimodal histogram (left: combined, right: as separate components).

Understanding what distinguishes one modality from another is key to improving performance. In some cases, multimodal analysis will help to make sure you are com‐ paring the right sets of requests (it’s unlikely you’ll be able to make requests that mutate persistent state faster than those that don’t). In others, it will help you focus on the right set of solutions (speeding up backend processing time is less likely to improve 3G user latency than reducing the number of requests or the sizes of the results). In still others, it will help you manage performance and cost trade-offs (increasing cache size may improve latency but require additional computing resources). Multimodal analysis offers more precision than just comparing requests based on their latency: comparing a fast and a slow request will often provide less insight than comparing traces from two different modalities. Multimodal analysis is also impor‐ tant in that it enables users to move from just “reducing latency” to more concrete next steps such as reducing payload size or improving cache hit rate. Once you can focus on a more specific set of requests, you can use the other techniques described earlier to find the root causes of slower requests.

Histogram Bin Width and Multimodal Analysis

While we haven’t discussed at length many of the choices involved in using histograms, one implementation detail that is especially important to multimodal analysis is bin size. The bin size is the width of the bars in the histogram. The smaller the bins, the more bins there will be, and the more detail the histogram can offer. Using larger bins can sometimes be problematic as more information about each request is lost: each bucket will represent a more diverse set of examples.

However, using too many bins can cause random noise in the original sample to appear as multiple modalities in the histogram. For example, it might be that few requests have a latency of 117 ms, even though many requests have latencies of 116 ms and 118 ms. This is probably not a result of two truly different behaviors in your application, but just a result of the fact that, even within a uniform population, measurements will still show some variation—just like you can’t expect the next coin flip following a “heads” to always be a “tails.”

Bin size need not be fixed, even within a given histogram. We’ve found using bins that are narrower at the small end of the axis and wider at the large end is a good fit for understanding request latency, especially if latency will be graphed on a logarithmic scale. This also helps users focus on behavior that will improve performance in a meaningful way since, for example, shaving off only a few milliseconds will have little impact on a request that took seconds to complete.

Aggregate Analysis

Collections of traces can also be used to draw conclusions about performance in ways that are less susceptible to small variations in individual traces. In the preceding example of comparing two traces, there was only one major difference (and so making the comparison was relatively straightforward), but often there will be many differences, some significant, and others not. By taking a larger sample of traces, we (or rather, our tracing tools) are able to better see the patterns that can lead to meaningful improvements in performance.

A simple form of aggregate analysis is just to look at the most common errors within those traces. This can be especially fruitful when looking at a sample set that contains only failed requests, as common errors are likely culprits to consider when trying to eliminate those failures (though see the following section for how this approach can be further improved). Note that when we say “errors within those traces” we mean errors that occur within any span in a given trace. This is where some of the true power of tracing starts to show: while metrics would enable you to observe an increase in errors in, say, both a mobile client and a backend, tracing enables you to know that these two types of errors are both occurring within the same requests.

One of the most effective forms of aggregate analysis that we observed both at Google and elsewhere is aggregate critical path analysis. In this analysis, once given a sample of traces, we identify a set of classes of spans for which we want to measure the performance impact. This is usually the set of services or the operations that occur in that sample of traces (though in some cases those classes could be even more fine-grained). The result of the analysis will tell us where we should focus our optimization efforts to improve latency across the sample.

The analysis proceeds as follows. For each trace, we compute what percentage of the critical path of that trace was contributed by each class of spans. Table 8-2 shows how much of the critical path can be attributed to each span in the trace in Figure 8-2. Assuming that the labels in that trace correspond to services, in this one trace 40% of the critical path was contributed by A, 10% by B, etc. Once these percentages are calculated for each trace, they are averaged across all traces in the set.

Table 8-2. Percentages of the critical path contributed by each span in the trace in Figure 8-2

Span  Percentage of critical path
A     40%
B     10%
C     20%
D     20%
E     10%

Assuming that the other traces in the sample under analysis looked similar to this example, optimizations to services A and, to some extent, C and D offer the best opportunities to reduce the latency of traces in the population from which the sample was drawn. B and E offer less opportunity simply because they contribute less to the critical path: even reducing the duration of B to zero would only result in a 10% improvement on average to request latency. Aggregate critical path analysis can also be performed on the absolute durations that each span contributes to the critical path, rather than percentages. Perhaps obviously, this will bias the result toward optimizations that have the biggest impact on the slow‐ est traces in the sample. While this might be your intent, it would be better to limit the initial sample of traces to better match what you would like to optimize (for example, to those traces with 99th-percentile latency or higher) as this will reduce the impact of a few outliers in the sample. There are a few other things to look out for when doing aggregate critical path analy‐ sis. The first really applies to any aggregate analysis: be sure that you’ve chosen an appropriate sample. One common mistake is to take a sample over a period of time that doesn’t adequately represent the requests you’d like to optimize, perhaps by not including requests from your peak traffic periods. Requests that occur during peak traffic periods are much more likely to demonstrate where resource contention occurs. A second warning is to look for how network time is attributed (explicitly or implic‐ itly) to spans. In our running example in Figure 8-2, we determined that the span labeled A contributed a significant amount of time to the critical path. However, it also appears that this span makes three RPCs. If spans B, C, and E are all spans gener‐ ated by the servers of those RPCs (rather than the clients) then it’s likely that some of that time attributed to A was time waiting for data to be transmitted over the network. While there may be some changes that can be made to A to reduce this time (for example, by compressing or otherwise reducing the size of payloads), optimizing the code of A itself will likely have little effect.
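Aggregate critical path analysis might be implemented roughly as follows, reusing the critical_path sketch from “Defining the Critical Path” (the trace representation is hypothetical): compute each service’s share of the critical path per trace, then average those shares across the sample, as in Table 8-2.

    from collections import defaultdict

    # Assumes the critical_path(span) sketch shown earlier, which yields
    # (service_name, start, end) segments for the root span of one trace.

    def critical_path_fractions(root):
        """Fraction of one trace's critical path contributed by each service."""
        per_service = defaultdict(float)
        for service, start, end in critical_path(root):
            per_service[service] += end - start
        total = sum(per_service.values())
        return {service: t / total for service, t in per_service.items()}

    def aggregate_critical_path(roots):
        """Average those fractions across a sample of traces."""
        sums = defaultdict(float)
        for root in roots:
            for service, fraction in critical_path_fractions(root).items():
                sums[service] += fraction
        return {service: total / len(roots) for service, total in sums.items()}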

Correlation Analysis

We finish this chapter with a final analysis—one of the most powerful that distributed tracing can offer developers looking to improve baseline performance. While many of the techniques described earlier enable you to validate or refute existing hypotheses about performance, correlation analysis takes things one step further by creating new hypotheses, along with evidence to support them. This is an entirely novel workflow for many developers: rather than having to rely on intuition and guess what could be improved, tracing tools can guide you directly to the opportunities with the highest potential to improve performance. You of course must bring your experience and expertise about the application, but rather than starting from a blank page, you are offered a “draft” to begin editing.

In the previous section, we noted that looking at common errors within a sample of failed requests can help to find which type or types of errors were the root causes of those failed requests. While this approach can be useful at times, it is prone to mistakes. In particular, while it accounts for when errors occur, it fails to account for when they do not occur. To understand the problem, consider the following example. Suppose that there are two types of errors that appear in traces as described here. Error 1 occurs in 90% of traces that represent failed requests; Error 2 occurs in 100% of all traces (including both successful and failed requests). When considering a sample of traces that exclusively represents failed requests, it may appear that Error 2 is the more likely culprit, since it appears in every failed request, while Error 1 only appears in 90% of the sample. However, since Error 2 occurs in every request—including those that succeed—it’s unlikely to be the cause of failures in our sample: this error is apparently recoverable, since it is recovered from in the successful requests.

The question we are really trying to ask is not “Which type of error is more likely to occur in failed requests?” but “Which type of error more strongly correlates with failure?” Though correlation is not causation, it is a powerful tool in discovering root causes.

To perform correlation analysis, we need not just one sample of traces but two. One sample should represent the class of traces that you want to eliminate or at least reduce, for example, failed or slow requests. The second sample should represent the complement of the first sample: usually a set of successful or fast requests. (You can think of this setup as being like a good scientific experiment, with both an experimental group and a control group.)

In addition to two sample sets, we also need a set of features with which we can look for correlations. In distributed tracing, these features will be things like the services, operations, errors, and tags associated with the spans that make up these traces. They also include the durations of those spans as well as what percentage of the critical path they are responsible for. This analysis can consider features of any span in the trace: for example, even if we were investigating the top span shown in Figure 8-3, the tags of the two spans below it might be critical (pun intended!) to the analysis.

Once we have the two sample sets (let’s call them A and B) and a set of features, carrying out the analysis simply means looking at each feature and asking, what’s the likelihood that it occurs in sample A but not in sample B? This yields a “coefficient of correlation” for each feature. A coefficient of 1.0 means that a given feature appeared in every trace in sample A and never in a trace in sample B, while a coefficient of –1.0 means that a given feature appeared in every trace in sample B and never in a trace in sample A. A coefficient of 0.0 means that the feature was equally likely to appear in both samples (including all of the time, not at all, or anywhere in between). The closer to 1.0 or –1.0 the coefficient of correlation is, the more likely that feature can explain the difference between the two samples.

Going back to our example with two types of errors, we can now say that Error 1 has a coefficient of correlation of 0.9, while Error 2 has a coefficient of correlation of 0.0. This tells us that Error 1 is a much better place to start looking when trying to understand why requests have failed. Of course, errors are only one type of feature that can help us understand what’s gone wrong. As noted earlier, a span’s tags and contribution to the critical path are among the important features that can be used to drive this analysis. This serves as a good reminder of the importance of good instrumentation (as covered in Chapter 4)!

Table 8-3 is an anonymized example taken from a production service and includes several tags. It shows the results of a correlation analysis when looking at traces for a single operation whose latency is in the 99th percentile or greater.

Table 8-3. Example showing coefficients of correlation for several (anonymized) tags

Feature                   Coefficient of correlation
org_name: Acme            0.41
project_name: acme-prod   0.41
operation: fetching       –0.39
total_rows_read: 0        –0.37

This example shows a relatively strong correlation between latency and a single organization (in this case, a set of users). Perhaps the organization has a large amount of data or makes particularly complex queries; in either case, these queries might be quite expensive. Further investigation would be necessary to understand why. (Unsurprisingly, this also correlates with the largest project in this organization: often this analysis may yield one or two redundant tags.) In this example, latency is negatively correlated with queries for which no rows were read (total_rows_read: 0), indicating that these queries were usually not among the 1% slowest queries. This also might provide some clue as to why slow queries were slow (perhaps a new index is required?). In any case, a next step might be to look at some traces that meet these criteria.

One concern you might have when adding all of these tags is that the number of tags and the number of values of those tags (that is, their cardinality) might become large, leading to excessive costs in managing all of this trace data. (In particular, total_rows_read might have thousands of different values.) In Chapter 7, we identified cardinality limits as an important way of comparing observability tools. Happily, most distributed tracing tools—even those supporting correlation analysis—can easily support this sort of cardinality.

As a developer, you should add many tags to spans, including as many of the following as makes sense for your application:

• Software versions (including versions of platforms and third-party components), active experiments, and other “feature flags”
• User cohorts, segments, and other classifiers of user behavior
• Where a computation is running (for example, host, cluster, or datacenter)
• Any resources where contention may occur while servicing a request (for example, database tables, connection pools, even locks)
• Metrics of CPU, disk, or network load on a host, VM, or container

Any of these features might explain variation in performance, so including them in application telemetry is an important first step in providing the raw data for the anal‐ ysis described here. It’s worth calling out the last list item; while these sorts of metrics haven’t historically been part of distributed tracing, associating them with spans can help explain cases where a request is slow not because of any computation performed as part of the request itself, but simply because computationally intensive requests were running nearby (for example, on the same host). This is an example of where a good observability tool will span the “three pillars” described in the previous chapter: ingesting multiple types of data (in this case metrics and spans) can support more powerful analyses. You can also use multimodal analysis to identify the sample sets to use in correlation analysis. In the preceding example, we compared the 1% slowest requests with the remaining 99%. However, the division between fast and slow will rarely fall on such an arbitrarily defined boundary. In fact, by the nature of multimodal distributions, there will be several different kinds of “fast” or “slow” requests. The quality of a corre‐ lation analysis will be much higher if at least one of the sample sets represents a single type of behavior.

To take advantage of multimodal analysis, you should first consider a histogram of latencies and then use the traces in the slowest peak (or if that peak is too small, either several slow peaks or the largest of the slower peaks) as one of your sample sets. Use the remainder of the traces as your other sample set and continue with the analysis as described in this section.
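One simple formulation that is consistent with the Error 1/Error 2 numbers above is the difference between a feature’s frequency in the two samples. A minimal sketch (the trace and feature representations are invented for illustration):

    def feature_frequencies(traces):
        """Fraction of traces in which each feature appears at least once."""
        counts = {}
        for trace in traces:
            for feature in set(trace["features"]):   # e.g., "error:timeout", "cache:miss"
                counts[feature] = counts.get(feature, 0) + 1
        return {f: c / len(traces) for f, c in counts.items()}

    def correlation_coefficients(sample_a, sample_b):
        """+1.0: appears in every trace in A and never in B; -1.0: the reverse."""
        freq_a = feature_frequencies(sample_a)
        freq_b = feature_frequencies(sample_b)
        features = set(freq_a) | set(freq_b)
        return sorted(((freq_a.get(f, 0.0) - freq_b.get(f, 0.0), f) for f in features),
                      reverse=True)

    slow = [{"features": ["error:retry", "region:us-east"]},
            {"features": ["error:retry"]}]
    fast = [{"features": ["region:us-east"]}]
    print(correlation_coefficients(slow, fast))   # error:retry scores 1.0

Sorting by the coefficient puts the features most likely to explain the difference between the two samples at the top of the list, which is how results like Table 8-3 are typically presented.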

Fully Automated Analysis

This chapter describes an analysis that can automatically enumerate hypotheses that explain slow or failed requests. Can the process of identifying performance problems and fixing them be completely automated? In our experience, the answer (fortunately or unfortunately, depending on your perspective) is “no.”

First, there might be several tags which are all correlated with higher latency, but only one of which will be something that can be “fixed” (and is not just another side effect of the problem). Understanding how to map tags to source code, configuration, infrastructure, and user behavior requires knowledge that’s not present in telemetry itself.

Second, our experience has shown that it is difficult to identify sample sets automatically, even using a form of automated multimodal analysis. Many different aspects of application performance may intersect to produce complex distributions, and it may take some trial and error as well as knowledge about the application to find the right thresholds for creating sample sets.

Finally, even when the reason that a set of requests is slow can be determined automatically, the path to making them faster frequently cannot. For example, say that slow requests are highly correlated with a particular set of accounts. Deactivating those accounts is obviously one way to improve overall latency but not a reasonable option. Instead it may require research to understand what those users are trying to accomplish, looking for workarounds, or even developing new features to eliminate these slow requests.

Though some actions may be taken automatically (for example, rolling back problematic releases), human developers and operators still have an important role to play (for now, anyway).

Summary

Improving baseline performance is ultimately about defining what aspects of performance matter to you and your users, discovering the biggest factors in determining that performance, and making changes to your application to address any performance issues.

As part of discovering which factors impact performance—and especially latency—you should consider the critical path of each request. Being able to determine the critical path of each request is a key advantage of using distributed tracing; using it will help ensure your efforts to improve performance pay off.

Traditional uses of distributed tracing focus on analyzing individual requests. However, more powerful analyses use hundreds or thousands of requests to look for patterns in traces. They not only validate (or refute) user-defined hypotheses but also help generate new hypotheses that explain opportunities to improve performance. Whether user-defined or automatically generated, those hypotheses can leverage telemetry not just from one service, but from every service that generates spans that make up those traces.


CHAPTER 9 Restoring Baseline Performance

In the previous chapter, we discussed approaches to improving baseline performance, usually with the goal of improving user experience, reducing costs, or both. In this chapter, we’ll consider how distributed tracing can help when a change—intentional or not—has caused a degradation in performance, and you need to restore perfor‐ mance to its previous levels quickly. The way that your organization approaches problems like this may vary, but most organizations will follow some sort of incident response plan. Such a plan involves identifying when an incident occurs (either a partial or complete interruption in ser‐ vice or a significant performance degradation), how team members are notified, how they respond, and (once the incident is over) what sorts of follow-up are required. While there are other types of incidents besides those related to performance (for example, security breaches), we will frame many of the approaches here in terms of incident response. In this chapter we will also focus on performance from the perspective of a single ser‐ vice. Most developers are responsible for at most a small number of services, and so it’s natural to frame performance issues in terms of the performance of a single ser‐ vice. Of course, what ultimately matters is the overall application performance as per‐ ceived by your users; much of this chapter will discuss how to relate application performance to the performance of individual services. As the focus of this chapter is on restoring baseline performance, we assume that per‐ formance has recently changed. And as software is (generally) deterministic, changes in performance are usually driven by changes to the software or to the environment in which it’s running. While that might sound very broad, take comfort: changes that impact service performance typically come from one of four different areas:

Changes to the service itself
    New deployments or configuration changes
Changes in user behavior
    New deployments or configuration changes to downstream services that play the role of “users” of that service (including both end users and other services); new behavior in response to new features or external events
Changes to upstream dependencies
    New deployments, configuration changes, or changes to traffic from other services that share those dependencies (both direct and indirect)
Changes to underlying infrastructure
    Changes to host, container, or network configuration; colocation of services contending for the same resources

Mitigating performance problems typically means identifying the root cause (or causes) of the problem and then undoing that change. In the case of a new deployment, a configuration change, or possibly an infrastructure change, this often means rolling back that change. In the case of a change in user behavior, it might mean disabling a new feature or even blocking certain types of requests or queries. In both cases, it can also mean provisioning additional resources to account for slower code or more expensive queries.

To undo these changes, it will be critical to identify which team of developers was responsible for the original change. You must effectively communicate with this team, both describing the problem and providing enough evidence to convince them to stop their current work and roll back the change (or disable the feature, etc.). And in all of these steps, time is of the essence since performance regressions can have reputational, economic, and even legal repercussions.

Defining the Problem

Before we can discuss how to restore baseline performance, we must first define it. In Chapter 8, we started with the “four golden signals”—latency, failure rate, traffic rate, and saturation—but focused mostly on the first two. While traffic rate and saturation are often used by developers and operators to understand and predict system health, they have less direct impact on end users of an application: their impact often manifests in terms of latency. We will continue to focus on latency and failure rate here.

In that chapter, we also defined SLIs as a precise way of measuring performance. Defining a baseline for SLIs really starts with declaring our intention for what performance should be. This means defining an SLO. An SLO is an SLI together with a goal for what the value of that indicator should be.

For example, if one of your SLIs is 99th-percentile latency for your service as measured over the last 5 minutes, then an SLO might be that this latency is less than 1 second. Or if one of your SLIs is the error rate for your service as measured over the last 10 minutes, then an SLO might be that this rate is less than 0.1% of all requests.

When we discussed errors in the previous chapter, we covered how an error raised by one service might impact the behavior of another. However, we must also consider the case where a service fails to respond at all. To understand this case, it’s important to measure error rate not just by looking at how a service reports its own error rate but by measuring it from outside that service. When measured this way, we can talk about how much of the time the service is available to answer requests, or its availability. Thus another SLO might be that a service is available 99.9% of the time as measured once per minute over the course of a calendar month.
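A minimal sketch of evaluating SLOs like these from the root spans of recent requests; the field names, window, and thresholds are illustrative assumptions, not a prescribed implementation.

    import numpy as np

    def evaluate_slos(root_spans):
        """root_spans: root spans from a recent window of requests, each with a
        duration in ms and an error flag (hypothetical field names)."""
        latencies = [s["duration_ms"] for s in root_spans]
        errors = sum(1 for s in root_spans if s["error"])

        p99 = np.percentile(latencies, 99)
        error_rate = errors / len(root_spans)

        return {
            "p99_latency_slo_met": p99 < 1_000,        # 99th percentile under 1 second
            "error_rate_slo_met": error_rate < 0.001,  # fewer than 0.1% of requests fail
        }

    spans = [{"duration_ms": 180, "error": False} for _ in range(999)]
    spans.append({"duration_ms": 4_500, "error": True})
    print(evaluate_slos(spans))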

Business Metric-Based SLOs

In this chapter, we focus on SLOs based on metrics like latency and error rate that can easily be derived from spans and are present in many distributed tracing solutions. However, you should also consider SLOs based on the metrics that are most important to your business.

For example, ecommerce applications might look at purchase rates or time-to-purchase as SLOs. Products with a mobile app component might consider key interactions with the app, including cold start, both in rate and duration.

If spans are tagged with information about user actions (whether a purchase was successful, etc.) these metrics can be derived from traces, and—importantly—regressions can be tied to specific (sets of) requests. This will ensure that you are investigating the right sorts of regressions effectively.

Choose an SLO with an eye toward historic values: there’s no sense in setting an objective that you will immediately fail to meet. But SLOs should also account for the expectations of your users. In some domains (for example, for some financial applications), users are happy to sacrifice availability if they are given higher assurances that an application will perform correctly. In other domains (for example, observability platforms), users may tolerate some loss of precision if the application is highly available.

The final part of defining service levels is to describe what happens when you fail to meet your SLOs. This usually takes the form of a service level agreement (SLA). An SLA is an SLO—an objective—together with some consequence for failing to meet that objective. SLAs often involve some sort of monetary compensation. For example, if you fail to meet your SLO, you may be required to refund part of your customers’ fees. In other cases, it may give your customers the option to terminate their contract with you before the end of its term. Unlike SLIs and SLOs, which purely relate to your application’s performance, SLAs put this performance into the context of real-world consequences.

Setting SLOs and SLAs helps you to establish baseline performance and to understand when you should take action. If your goal is to keep 99th-percentile latency below 1 second, and it’s increased from 200 ms to 250 ms, it’s probably not worth waking up in the middle of the night for (though it might be a good place to start the next time you have some free time to spend improving performance).

While it may be tempting to define baseline performance simply as “however it’s working today,” using a more rigorous definition will enable you to be more confident in determining when you need to take action to restore that performance. Perhaps more importantly, it will help you determine when you don’t need to take action: after all, making changes to production systems always involves risk, and you probably have better things to do with your time.

Human Factors

Before we dive into the details of how to use distributed tracing to determine which changes are causing a performance regression, it’s useful to consider how tracing can support people and processes in incident response. Unlike improving baseline performance, responding to an incident is unplanned work. Since time is usually of the essence, facilitating communication and human interaction will play a much larger role in incident response than in typical engineering work. Determining who has the knowledge to understand a problem and who will do the work is often just as difficult as debugging the code itself. Moreover, once those decisions have been made, they must be communicated effectively in the moment and recorded so that they can be understood after the fact.

(Avoiding) Finger-Pointing

No one likes to be at fault for a regression. In fact, teams are incentivized to measure performance in a way that can quickly show that they are not responsible for regressions. As a result, when a regression occurs, many teams will produce evidence that they are not at fault. This leads to many teams blaming others…but without evidence.

For example, suppose that service A provides a shared storage solution that is used by service B (among others). Service B’s latency is degrading, and the team that owns service B is blaming service A. Service A’s metrics show that it is still meeting its SLO, and the team that owns service A is claiming that service B is probably misusing the API. Perhaps this reminds you of a meeting you have attended?

Traces can help resolve this sort of conflict. In this example, they would provide a way of measuring the performance of service A from the perspectives of both service A and service B. It might be that service A is meeting its SLO in general but not for service B. Or we might see evidence that, in fact, requests from service B to service A could be optimized. Or it might be that the real culprit is neither service but a third party, like the network or another client of service A that is abusing a shared resource. Using distributed traces helps bring the conversation back to facts and, importantly, ensures that everyone involved is talking about the same requests.

“Suppressing” the Messenger

In many cases, there may be many services between the point at which the offending change was made and where that change is adversely affecting an SLI. For example, many services may pass through an error that occurred further down the stack. It may look as if the error rate of these services has increased during an incident (and in fact, it has), but these errors don’t indicate a problem with that service, only that an error has occurred elsewhere.

When no changes were made to these intermediary services leading up to the incident, then no changes to those services can likely resolve it. However, the teams responsible for these intermediary services may still be interrupted from their day-to-day work (including being paged); they may be called into meetings to discuss the incident; or their time may be wasted in other ways.

MTTR describes how long it takes to address a regression. A term you may hear less frequently is mean time to innocence (MTTI), used to describe how long it takes to exonerate teams whose services were potentially—but ultimately not—at fault in an incident. You can think of MTTI as the cost paid across your organization for a lack of understanding about which service was responsible for a given incident.

The solution to these issues is to “suppress” the messenger. That is, rather than shooting the messenger, we should look for ways to limit the participation of teams whose services are merely conduits for errors or other problematic requests. We use “suppress” in this context in the same way that we might direct a compiler to suppress certain kinds of errors or warnings: while they might be valuable at times, in these cases they are just another source of noise. Traces can help to exonerate these teams quickly: when looking at a trace, it’s easy to see the first point at which an error occurred and where that error was simply passed through.

Incident Hand-off

Unfortunately, some incidents will last for hours or even days. In these cases, one person cannot reasonably lead—or even participate in—the response for its full duration. When that happens, it’s necessary to hand off responsibility for leading, investigating, or communicating about an incident. One major challenge in incident hand-off is making sure information about the incident is transferred from one set of humans to another.

What makes that information transfer particularly challenging is that often that information is incomplete. (After all, if you understood everything about what was happening, you could make short work of the incident and probably avoid the hand-off altogether.) Traces can help in this case as well. If you can identify a class of requests that seem to be problematic and then capture a set of traces for those requests, you need not understand every detail of what’s happening with those requests for them to be useful. Because traces capture not only a lot of detail about what’s happening but also the causal relationships between your and other services, subsequent responders can mine them to ask questions that you didn’t even consider. That is, traces offer a way to capture a dataset that is both narrow and broad: narrow because it represents only the problematic requests, and broad because it’s not limited to the avenues of inquiry that you had time to pursue.

Good Postmortems

If you are an adherent of DevOps culture (and you should be!), you probably write and discuss postmortems for each incident in which your team participates. A postmortem is an opportunity to document events leading up to the incident, how it was handled, what was good about the response, and what could have been better. A good postmortem will focus on the facts: to avoid repeating either the incident itself or the mistake made in response to it, it’s important to understand precisely what happened.

As is sometimes repeated in SRE circles (and playing off of a British motivational poster from World War II), “keep calm and gather data for the postmortem.” It can be difficult in the heat of incident response to keep good records of exactly what you discover. For example, if an upstream service is discovered to have deployed a new release coincidentally with the performance regression, that team may be pressured into quickly rolling back that release. When the performance regression disappears minutes later, we might assume that the new release was responsible for that regression. In fact, that release might have been responsible, but ideally we would capture some evidence of that. However, we might not have adequate staff to both respond to the incident and build meticulous records of what was happening. Instead, we should consider tools that let us quickly capture evidence—or even potential evidence—of what was going wrong during the incident.

Distributed tracing can provide an easy way to collect that evidence. Due to the very detailed nature of each trace, even just a handful of traces can provide ample information. For example, if spans are tagged with a software release version, the traces of even a few slow requests can provide evidence that that release was responsible for the regression. Whether you record information for your postmortems in shared docs, in chat rooms, or using other tools, dropping in links to potentially useful traces is well worth your time, even while in the heat of the moment of incident response. They can be used to validate theories of what was happening during the incident. These traces can serve as powerful visual aids during postmortems and operational reviews.

Approaches to Restoring Performance

With the right SLOs in place, you can be confident that you have made a good start at understanding what baseline performance looks like and when your service has deviated from it. As much as it might feel like a solo endeavor at times, incident response is always a group effort. We continue with several approaches that show how distributed tracing can help determine and mitigate the root cause of performance regressions. In most cases, each approach will apply to a wide range of types of changes. For example, whether you’ve just deployed a new version of your service, one of your upstream dependencies has, or one of your downstream users has, tracing can help you identify if and when those changes are affecting your service’s performance. Tracing can also serve as a communication tool, and each of the following approaches provides ways of facilitating communication during or after incident response.

Integration with Alerting Workflows

Incident response is often triggered by an automated alert. That is, some SLI has crossed a predetermined threshold, escalation policies and on-call schedules have been consulted, and someone’s phone starts ringing or buzzing (or both). If this is your phone, you stumble out of bed, open your laptop, and begin investigating. At a minimum, this alert includes the SLI and the threshold. From there, it’s up to you to begin building theories about what’s going wrong and how you might mitigate that problem. Typically this involves opening one or more observability tools and looking for signs of the problem. In the rest of the chapter, we’ll consider a number of different ways that distributed tracing can help identify the root causes of regressions and restore baseline performance. These techniques are particularly effective if some preliminary results are included as part of alerts. In that case, you can simply click on a link to begin investigation.

This should be relatively straightforward for all of the approaches we describe, from the simplest to the most sophisticated. For example, if an alert is triggered because an error rate has spiked, there must be at least one failed request. Or if an SLI like latency has crossed some threshold, then there will be a number of requests slower than that threshold that can form the basis of an analysis. Including traces of those requests (as well as the results of other analysis) as part of an alert can save you valuable time. While raw metrics and logs can also be included as part of an alert, neither offers the context that distributed traces can provide.

It’s beyond the scope of this book to give a complete accounting of best practices around alerting. However, it’s still worth a few words to remind you to alert on symptoms, not causes. This means that alerts should be triggered based on things that your users can observe, likely the same metrics that you choose to be part of your SLOs and SLAs. Conveniently, there are typically only a handful of symptoms that matter for most services (meaning that you have only a small set of alerts to maintain and document). On the other hand, the number of possible causes of a performance regression is orders of magnitude larger. It’s the role of tracing—and for that matter, any observability tool—to reduce the number of possibilities that you must consider when looking for a root cause. If information that helps expedite this process can be included in the alert itself, all the better!

This can be taken one step further to help make sure the right people are alerted when there is a regression. For example, if an SLO is about to be violated, then someone should probably be alerted. But should it be the owner of the API gateway (which lies closest to the user and the definition of the SLO) or of the backend service that ultimately served the requests (and returned the errors)? Probably better to start with the backend service owner, since that’s where the errors originated. They can always bring a member of the API gateway team into the investigation if necessary. Using information in the trace, alerting systems can route alerts to those developers and operators most likely to be able to address the root cause. This is a way of automating the technique described earlier as “suppressing” the messenger: it can shave valuable minutes off of your MTTR and also significantly reduce the number of interruptions to others’ work (and sleep) across your organization.

Individual Traces

Looking at individual traces is one of the simplest ways of leveraging tracing as part of incident response and root cause analysis. Individual traces are particularly useful when failures are black and white: for example, when a breaking change is deployed to your service or to another. This means that all requests (or at least a significant number of them) are failing, so it’s easy for you to identify one such request.

Figure 9-1 shows an example of a failed request. An error is propagated from the span labeled D up through B up to the top of the trace. (Spans which resulted in an

186 | Chapter 9: Restoring Baseline Performance error are loosely outlined.) Looking at the logs associated with span D, we see that the error is related to the response that it got from span E (that is, some invariant that it expected was not met). This could mean one of a couple of things: either D represents a recent deployment or E represents a recent deployment. Once you’ve determined which service has recently deployed a new release, mitigating the issue is just a matter of finding the appropriate service owner and getting them to roll back that deploy‐ ment. And as noted earlier, sending along the trace shown in the figure will be a great way to motivate them to do so! This example uses a combination of traces and logs: the trace helps identify the impact of an error, and logs provide additional informa‐ tion to help pinpoint the problem. In isolation, neither would be as powerful.

Figure 9-1. Trace showing an error propagating up the stack.
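Identifying where in a trace an error originated, as opposed to where it was merely passed through, is also something that can be automated. The following is a minimal sketch, not taken from any particular tracing backend: the span model and field names are assumptions, and a real implementation would work against your tracing system’s data model. It reports spans that are marked with an error but have no erroring children, which are the most likely origins of a failure.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, simplified span model; a real tracing backend exposes richer data.
@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    error: bool = False

def error_origins(spans: List[Span]) -> List[Span]:
    """Return spans that report an error but have no erroring children.

    Spans whose children also report errors are treated as messengers: they
    merely propagate a failure that started further down the stack.
    """
    children = {}
    for span in spans:
        if span.parent_id is not None:
            children.setdefault(span.parent_id, []).append(span)

    origins = []
    for span in spans:
        if not span.error:
            continue
        erroring_children = [c for c in children.get(span.span_id, []) if c.error]
        if not erroring_children:
            origins.append(span)
    return origins

# Example shaped like Figure 9-1: the error starts at D and propagates up through B.
trace = [
    Span("a", None, "api-gateway", "GET /checkout", error=True),
    Span("b", "a", "orders", "create-order", error=True),
    Span("c", "a", "auth", "verify-token"),
    Span("d", "b", "inventory", "reserve-items", error=True),
    Span("e", "d", "inventory-db", "query"),
]
for span in error_origins(trace):
    print(f"error likely originated in {span.service}/{span.operation}")
```

Applied to a trace like the one in Figure 9-1, this kind of check surfaces the deepest erroring span (D in that example), and the result can feed the alert-routing and “suppressing the messenger” techniques described earlier.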

The risk of using individual traces during incident response is that, because you are moving fast, it can be easy to overgeneralize from a single trace and draw an incorrect conclusion about the cause of the incident. For example, perhaps the error shown in Figure 9-1 has been occurring for a long time and what’s changed recently is how that error is handled. Or perhaps a change in user behavior is triggering that error. (In either case, it would still be a good idea to eventually fix the error, but your focus in responding to an incident should be to find the safest way to mitigate the problem, not necessarily the cleanest way of fixing a bug.) Some of the approaches later in this chapter will show how to leverage more than one trace—even hundreds or thousands —to avoid this sort of “premature generalization.” Biased Sampling As described in the previous chapter, biasing trace sampling based on the SLIs that are important to your business or organization is an effective way to derive value from distributed tracing. As in that chapter, biasing toward slower or failed requests can help make sure outliers are available for analysis. However, in the context of restoring baseline performance, there are a few other kinds of bias that can also be valuable.

Approaches to Restoring Performance | 187 Most regressions in performance are caused by changes to the application or its envi‐ ronment, so biasing sampling with an eye toward those changes can be useful. For example:

• If you are about to make an infrastructure configuration change, make sure you have adequate traces from before and after the change. • As you deploy a new version of your service, make sure you have traces from both the old and the new versions. • If you are starting an experiment or slowly rolling out a new and significant fea‐ ture, make sure you have traces from before, during, and after the change.

To bias sampling during these sorts of events, there are a couple of possible approaches to consider. First, your tracing tool may allow you to dynamically trigger adjustments to the sampling algorithm using an API call or a configuration change. For example, you may be able to temporarily increase the number of traces collected from certain hosts. If you do so for hosts running a new version of your service, you will bias sampling toward that new version. A second approach is to add tags to spans to make these changes apparent in the tele‐ metry data itself. For example, adding tags that describe infrastructure or software versions means that you can easily see when a given request hit a new version or an old one (or in some cases, both). What’s especially powerful about this approach is that it also means that the tracing tool can detect these changes and automatically change the sampling algorithm to account for this. For example, if a new version of a service is being incrementally deployed, spans from the new version will have a tag that’s never been seen before. This can be used to increase the sampling rate for traces from the new version and also to increase the sampling rate for traces from the old version (relative to services not undergoing any changes at the moment). Having both facilitates better comparisons between the two versions. Integration with your infrastructure, deployment, and experiment management tools can help make this sort of biased sampling easier to achieve. Infrastructure and deployment tools can set environment variables that can be used to include infra‐ structure configuration and software versions as part of spans. Experiment manage‐ ment tools (including feature flagging systems) can be configured or extended to annotate spans with the current active experiments and feature configuration. In any case, when there are planned changes to your software or infrastructure, making sure your distributed tracing solution is aware (one way or another) of those changes will help ensure that you have the data necessary to understand their impact.
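As a concrete illustration of the tagging approach, the sketch below uses the OpenTelemetry Python API to attach release information to each span. It assumes the SDK and an exporter are configured elsewhere, and the environment variable names and attribute keys are placeholders rather than established conventions.

```python
import os

from opentelemetry import trace

# Assumes the OpenTelemetry SDK and an exporter are configured elsewhere; the
# environment variable names and attribute keys are placeholders for illustration.
SERVICE_VERSION = os.getenv("SERVICE_VERSION", "unknown")
IS_CANARY = os.getenv("CANARY_DEPLOYMENT", "false")

tracer = trace.get_tracer(__name__)

def handle_request(request):
    with tracer.start_as_current_span("handle-request") as span:
        # Tagging spans with the release and rollout status lets the tracing
        # backend compare versions and bias sampling toward whatever just changed.
        span.set_attribute("service.version", SERVICE_VERSION)
        span.set_attribute("deployment.canary", IS_CANARY)
        # ... application logic goes here ...
        return "ok"
```

Many tracers also let you attach this kind of information once per process (for example, as resource-level attributes) instead of tagging every span by hand; either way, the goal is to make version and rollout information visible in the telemetry so that sampling and analysis can key off it.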

188 | Chapter 9: Restoring Baseline Performance Leveraging Tracing with Chaos Engineering One additional opportunity for integrating tracing with other production tools is in chaos engineering: the practice of deliberately injecting failures into your distributed system to understand how your software—and your team—will respond to them. While not technically part of incident response or restoring baseline performance, this practice can help you prepare for when real issues do occur. Distributed tracing can help provide the data necessary to address any issues that are discovered through this technique. Of course, the failures introduced as part of chaos engineering should be rare enough that they don’t affect user-visible performance. Unfortunately, this also means that requests with injected failures are unlikely to be collected using uniform sampling techniques. As in the case of new releases and other planned changes, adding tags to spans to indicate when failures are deliberate can help address this issue and make sure that traces are collected for analysis. Tracing can also help recognize cases when injected faults are affecting users. Injected failures that are propagated all the way to the top of the stack can be discovered by looking for traces that contain both a span with an injected failure and a root span with an error. If these cases occur, a team member can be immediately alerted or (even better) that type of injected failure can be disabled.
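To make that last point concrete, here is a rough sketch of such a check. The trace representation and the fault.injected tag are illustrative assumptions; a real implementation would query your tracing backend for this pattern instead.

```python
def injected_faults_reaching_users(traces):
    """Yield trace IDs where an injected fault propagated to the root span.

    Each trace is assumed to be a list of span dicts with 'trace_id',
    'parent_id', 'error', and 'tags' fields (an illustrative schema); the
    'fault.injected' tag is assumed to be set by the chaos engineering tool.
    """
    for spans in traces:
        root = next(s for s in spans if s["parent_id"] is None)
        injected = any(s["tags"].get("fault.injected") == "true" for s in spans)
        if injected and root["error"]:
            yield root["trace_id"]

# One trace where the injected fault was absorbed, and one where it was not.
traces = [
    [{"trace_id": "t1", "parent_id": None, "error": False, "tags": {}},
     {"trace_id": "t1", "parent_id": "root", "error": True,
      "tags": {"fault.injected": "true"}}],
    [{"trace_id": "t2", "parent_id": None, "error": True, "tags": {}},
     {"trace_id": "t2", "parent_id": "root", "error": True,
      "tags": {"fault.injected": "true"}}],
]
print(list(injected_faults_reaching_users(traces)))  # ['t2']
```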

Real-Time Response Unlike planned performance improvement work, where you have plenty of time to carefully gather data to support your decisions, it’s difficult to anticipate what sorts of data will be necessary when responding to an incident, nor will you have as much time as you’d like to collect it. However, distributed tracing tools can provide real- time search functionality to get you up to speed quickly on what’s happening. Figure 9-2 shows an example of trace search functionality offered by Zipkin. (Though you can see from the figure that this tool offers more than just real-time search, we’ll ignore the “lookback” option in this section.) Using this functionality, you can look for traces that include a particular service, whose latency exceeds a given threshold, or that contain some particular tag or other metadata. Using this, you can find traces that are examples of slow or failed requests and—as you start to build a theory as to what’s gone wrong—traces that exhibit other features that you believe might help explain the issue. For example, you might suspect a canary release is responsible for a regression, so searching for traces handled by that canary is a good place to start. By performing multiple searches, you can also compare traces to begin to understand their differences.

Approaches to Restoring Performance | 189 Figure 9-2. Screenshot of Zipkin’s trace search functionality.
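The same searches can usually be scripted against the tracing backend rather than run through the UI. As a rough sketch, Zipkin exposes an HTTP API with query parameters that mirror the search form; the host, service name, and tag used here are placeholders, and the exact query syntax may differ across backends and versions.

```python
import requests

# Placeholders: adjust the Zipkin host, service name, and tags to your deployment.
ZIPKIN = "http://zipkin.example.com:9411"

params = {
    "serviceName": "inventory",
    "minDuration": 500_000,  # microseconds, i.e., only traces slower than 500 ms
    "annotationQuery": "error and version=1.14.2",  # tag-based filter; syntax varies
    "limit": 100,
}
traces = requests.get(f"{ZIPKIN}/api/v2/traces", params=params).json()
print(f"found {len(traces)} matching traces")
```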

This functionality is similar to that provided by many log aggregation tools—both provide for ad hoc searches over a large corpus of diagnostic data. However, in most cases, logs that provide evidence of the problem will rarely provide evidence of the cause of the problem. Tracking down logs from service to service can be difficult, but (in contrast) distributed tracing tools can easily put those logs in the context of an end-to-end request. That is unless, of course, you’ve managed to add correlation IDs to all of your logs. If you have included these sorts of identifiers throughout your logs, you’ve essentially turned your logging system into a tracing one (though probably not a very efficient one), as discussed in Chapter 7. Different tracing tools collect and store data in different ways, so the exact functional‐ ity offered by your tool might be different than what’s shown in Figure 9-2. Despite this, we can still speculate about what’s possible with distributed tracing based on the following two observations:

• Network performance (and in particular, local network performance) continues to rapidly improve. • The data required for real-time response is, almost by definition, short-lived.

The first observation means that it should be possible to extract a large portion, if not all, of the tracing data from the application at a reasonable cost. This means that even rare events can be extracted and available for real-time search and analysis. The sec‐ ond means that—at least for real-time analysis—tracing solutions are not required to persist data for extended periods of time. This can greatly reduce the overall cost of storing traces and other telemetry. This raises a question: what does “real-time” mean in distributed tracing, anyway? Given that most changes to a distributed application are instigated by a human and take at least a few seconds to propagate and take effect, getting results in less than a second is probably not necessary (especially since many requests might even take a second or two). On the other hand, waiting even one or two minutes for new data to

190 | Chapter 9: Restoring Baseline Performance appear in an observability tool can be excruciating if you are trying to understand if the release you just deployed has fixed a regression. Having to wait hours can make such a tool all but useless! Ultimately, what “real-time” means will depend on other aspects of your incident response process (including how fast you can detect changes and act to address them) and on the commitments you’ve made to your users. If you are running a highly available application, every minute of downtime matters, and so does every minute of delay between when a change happens and when you can investigate it. If we expect that developers will be alerted within a few minutes of when regressions occur, then most trace searches that occur as part of incident response will also occur within a few minutes of those requests. All tracing tools should capture and store some historical traces to help establish a baseline; however, tracing solutions that spe‐ cialize in supporting real-time response can keep a much more detailed record of what’s occurred in the last few minutes. In those cases, those responding to an inci‐ dent have access to nearly arbitrary data about what’s happened—precluding the need to know what queries will be run in advance or to set up filters or triggers to make sure useful traces are captured. By temporarily storing these traces, tracing solutions can do so at a reasonable cost. Knowing What’s Normal While setting an SLO can help set a simple expectation for baseline performance, there will obviously be much more variation in actual performance, even under nor‐ mal conditions. For example, user behavior will likely follow diurnal or weekly cycles which in turn will affect performance. Given those expected changes, having a more refined definition of baseline performance is key to effective incident response. One way to better understand what’s normal is to compare current performance to one hour ago, one day ago, or one week ago. (If your business is a retailer, you likely need to consider other sorts of seasonality as well.) Understanding current performance relative to what’s normal will enable you to make better decisions. For example, suppose your traffic rate is steadily growing and that this increased load is (you believe) also causing increased response latency. Though it is still below your SLO, if latency were to continue to climb, you will soon cross that threshold. Should you provision more instances to account for this load? Or wait it out (and avoid unnecessary infrastructure costs)? Really the question you are asking is, are you at peak traffic or not? The answer to this question is largely based on when the peak occurred yesterday, last week, or last year. Figure 9-3 shows two examples of how visualizations of historical data can be used to understand baseline performance and how that performance has changed. The left side is a time series graph showing periodic behavior in both latency (top) and traffic rate (bottom). Two time series are overlaid for both metrics. One graph (black) shows

Approaches to Restoring Performance | 191 how the service performed last week. The other graph (which ends at the vertical line corresponding to Friday evening) shows how the service is performing so far this week. The traffic rate (bottom) shows both diurnal and weekly cycles, with the high‐ est peaks in the evenings over the weekend. The vertical bar represents the current time. From looking at this graph at (1), you can determine that load is close to, but not quite at, peak, so provisioning a few more instances might be a good idea.

Figure 9-3. Using historical performance to establish what’s normal.

Interestingly, there was also a spike in 95th-percentile latency (top) that occurred ear‐ lier in the week at (2). This did not correspond with an increase in traffic nor was there a spike at this point in time during the previous week. This indicates a deviation from normal and that some other event (for example, a deployment) may have caused the change in latency. The right side of Figure 9-3 is a histogram with an overlay of past performance. The bars show the current distribution of latencies (say, for requests in the last hour). The line shows the distribution of requests for the same time period yesterday. In this case, there is a second peak in today’s graph that did not appear in yesterday’s graph. This would indicate that there is some new aspect of service performance that didn’t exist yesterday or that some users’ behavior has changed in a significant way. This analysis of histograms is related to the multimodal analysis described in Chap‐ ter 8. In this case, however, rather than simply comparing two or more peaks within a single distribution, we are comparing the number and sizes of the peaks in two differ‐ ent distributions. Note that “what’s normal” need not be expressed solely in terms of software perfor‐ mance like latency or error rates. For example, tracking aggregate user behavior can also play a role in determining what’s normal. Knowing typical conversion rates and session lengths can help to establish whether your users are deviating from their usual behavior. Any deviation might be a key symptom of a regression in software performance—or it might even be the cause of a change in software performance. In

either case, understanding user behavior helps put application performance into the larger context of how it matters to your business or organization.

Automated analysis of traces can also help establish if performance has deviated from normal. Up until now we’ve considered changes in performance mostly as measured by metrics like high-percentile latency. However, while looking at percentiles is often better than looking at just median latency, focusing only on percentiles can still cause you to miss significant changes in performance. Here, we consider how an automated analysis can look at the shape of a latency distribution and determine whether there’s likely to have been a change.

Earlier we considered how current performance could be compared with past performance by overlaying two latency histograms. While this can be a useful way to identify many kinds of changes, it still has its limits. For example, Figure 9-4 shows latency histograms for two samples. Suppose we were trying to decide if the second sample represents a change in performance from the first or just more of the same. There is less than a 5% difference in 99th-percentile latency and less than a 1% difference in 99.9th-percentile latency between the two samples. From these measurements and looking at the two histograms, you might think that the two samples represent the same underlying service behavior.

However, a huge portion of the requests on the right side (nearly half!) are significantly faster than those on the left side. Most of these fast requests fall into the first bucket in the histogram (as indicated by the arrow) and were only tens of milliseconds in duration. (Remember that, like many of the histograms shown in the book, the vertical axis is plotted on a logarithmic scale.) This could be due to the introduction of a cache or some similar optimization. Regardless of what caused it, if neither metrics like high-percentile latency nor visual inspection are sufficient to tell when something has changed, we must consider other tools.

Figure 9-4. Two visually similar but very different histograms.

Approaches to Restoring Performance | 193 Fortunately, there are a number of robust statistical techniques for determining whether two sample sets are likely to be taken from the same distribution. While a thorough description of these techniques (and how you might choose among them) is beyond the scope of this book, we’ll consider one technique and describe how you might use it to address this problem. What you should take away is that, first, these techniques exist; and second, they have powerful applications to measuring perfor‐ mance changes. The technique that we will consider here uses the Kolmogorov-Smirnov (K-S) statis‐ tic. The K-S statistic measures the difference between two distributions as a single scalar number. To describe this technique we first need to define one other concept from the study of statistics. So far we have showed a number of histograms as a way of visualizing a distribution; here, we will show a cumulative distribution function (CDF). This differs from histograms in two ways. First, as the name suggests, we accumulate counts, so each point on a CDF is the sum of all of the corresponding his‐ togram points to its left. Second, the vertical axis is normalized, so it ranges from zero to one. Framing sample sets as CDFs will enable us to more easily compare them. Figure 9-5 shows two cumulative distribution functions. While the upper line grows more slowly initially than the lower one, it quickly overtakes it. If rendered in a histo‐ gram, this would be shown as a steeper peak that appeared farther to the left. The K-S statistic is generated from the largest vertical distance between the two CDFs, as shown by the arrow. The larger that distance, the more likely that the two CDFs were drawn from different distributions; that is, that there was an actual change in perfor‐ mance. As you can imagine, such a measure can be quickly and automatically compu‐ ted for a large number of histograms. How is such a statistic used? Is there a threshold for the K-S statistic (or any similar statistic) that, when crossed, would mean that a regression has occurred? In general, no, as the performance of many services will change over time. Even if a service has experienced the largest change in performance across your application, that doesn’t necessarily mean that anything is wrong or, even if something is wrong, that that ser‐ vice is the root cause of the issue. However, techniques like the K-S statistic are useful for organizing information. For example, sorting services (or operations, etc.) accord‐ ing to the K-S statistic or other measures of change can help human responders iden‐ tify root causes.
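You rarely need to implement such a test yourself. As a brief sketch, SciPy’s two-sample K-S test can compare two latency samples directly; the samples below are synthetic stand-ins for span durations drawn from the baseline and regression trace sets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic latency samples in milliseconds; in practice these would be span
# durations drawn from the baseline and regression trace sets.
baseline = rng.lognormal(mean=4.0, sigma=0.5, size=5000)
regression = np.concatenate([
    rng.lognormal(mean=4.0, sigma=0.5, size=2600),
    rng.uniform(10, 40, size=2400),  # a new population of much faster requests
])

statistic, p_value = stats.ks_2samp(baseline, regression)
print(f"K-S statistic: {statistic:.3f}, p-value: {p_value:.2g}")
# A large statistic (and tiny p-value) suggests the two samples were drawn from
# different distributions; that is, performance has likely changed.
```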

194 | Chapter 9: Restoring Baseline Performance Figure 9-5. Cumulative distribution functions of two sample sets; the arrow shows the Kolmogorov–Smirnov statistic1. Aggregate and Correlation Root Cause Analysis The technique described in the previous section is one example of an aggregate analy‐ sis: it used not just one or two traces but a statistically significant sample to determine if a change in performance has occurred. Once you’ve determined that there has been a change in performance, restoring baseline performance usually comes down to finding the root cause or causes of that change. (Sound simple enough?) In Chapter 8, we showed how aggregate analyses can be used to improve baseline performance; these approaches also have powerful applications in understanding changes to your application and its environment and, ultimately, in addressing performance regressions. The first step in any of these approaches is to divide traces into two sample sets; these will form the basis of any comparison. In the context of restoring baseline perfor‐ mance, the first sample set should come from the regression itself. This is usually easy if the regression is happening right now. If you know that the regression is present in only a portion of requests (for example, if you know there is a problem with a canary release) then sampling from that portion is also important.

1 Source: Wikipedia.

Approaches to Restoring Performance | 195 The second sample set should represent baseline performance. This will likely be a set of traces from before the regression started, perhaps from an hour, a day, or a week ago: leverage your understanding of what’s normal for your application and choose a baseline set from a similar point in whatever cyclic behavior your application and users exhibit. As in the preceding section, using overlaid time series or histogram graphs can help to understand how performance varies over time, and therefore help you identify a sample set to represent baseline performance. Recall that correlation analysis means taking these two sample sets and then finding features of traces that appear in one of those sets but not the other. Although such correlation is not a guarantee of causation, it is often a strong clue that can lead you to the cause of a regression. Latency, errors, tags, and other metadata can all be valuable features of traces that you can use to understand what’s changed. It’s worth a reminder that part of the power of distributed tracing is that it puts performance changes in the context of what’s hap‐ pening throughout your application. When determining which attributes of traces should be used as part of correlation analysis, remember that these should come from every span in these traces, not just the ones corresponding to your service. Even the existence of certain spans within a trace can be a powerful signal in understanding what’s changed. As an example, suppose that you believe that request latency has increased for your service, and you compare a sample of traces that have occurred in the last five minutes to those that occurred an hour ago. Table 9-1 shows a small set of features that might be identified by this analysis.

Table 9-1. Example of correlation analysis used to determine what has changed

Feature                                        Coefficient of correlation
service: inventory, service.version: 1.14.2    0.65
runinfo.host: vm73                             0.41
service: inventory, service.version: 1.14.1    –0.65

The table shows that there is a strong correlation between a new version of the inventory service and the latest traces; it also shows a strong correlation between the previous version of that service and traces from an hour ago. This provides enough evidence to take steps to mitigate the problem—by rolling back the most recent release of the inventory service—and that should be sufficient to let you defer further work until after business hours and to assign it to the appropriate team. (This analysis also shows a correlation between the latest traces and one of the hosts provided as part of the infrastructure. This could be because the new instance of the inventory service was deployed on that host.)
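A table like this can be produced with a fairly small amount of code. The sketch below is one illustrative way to do it, assuming each trace has already been reduced to a set of feature strings; it treats each feature as a binary indicator per trace and correlates that indicator with membership in the regression set.

```python
import numpy as np

def feature_correlations(baseline_traces, regression_traces):
    """Correlate binary trace features with membership in the regression set.

    Each trace is assumed to already be reduced to a set of feature strings,
    e.g. {"service: inventory, service.version: 1.14.2", "runinfo.host: vm73"}.
    Returns features sorted by the strength of their correlation coefficient.
    """
    traces = list(baseline_traces) + list(regression_traces)
    labels = np.array([0] * len(baseline_traces) + [1] * len(regression_traces))

    results = {}
    for feature in set().union(*traces):
        present = np.array([1 if feature in t else 0 for t in traces])
        if present.all() or not present.any():
            continue  # a feature present (or absent) everywhere carries no signal
        results[feature] = np.corrcoef(present, labels)[0, 1]

    return sorted(results.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Tiny illustrative example.
baseline = [{"service.version: 1.14.1", "runinfo.host: vm12"}] * 40
regression = ([{"service.version: 1.14.2", "runinfo.host: vm73"}] * 25
              + [{"service.version: 1.14.1", "runinfo.host: vm12"}] * 15)
for feature, coefficient in feature_correlations(baseline, regression)[:3]:
    print(f"{feature}: {coefficient:+.2f}")
```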

196 | Chapter 9: Restoring Baseline Performance Someone eventually needs to understand the root cause of this issue, and correlation analysis can be used to further refine your understanding of what was happening. Similar to the analysis described in the previous chapter, you might compare slow requests from the last few minutes to faster requests from before the regression. This can identify specific operations that led to higher latency in the new version and help the team responsible for the inventory service understand how to address them. Aggregate analysis can also help to identify more subtle changes in service perfor‐ mance. When comparing two sets of traces, we can look at the latency contributions related to arbitrary tags, operations, and services and how those contributions changed over time. One method for doing so is to consider how the critical path changes between the baseline and regression sets. In this method, the critical path is computed for each trace. (The definition of the “critical path” is covered in Chapter 8.) The average con‐ tribution for each service and operation is then determined for each of the two sets. Services and operations are ranked by the differences between these two averages. Table 9-2 shows an example of this method. In this example, the inventory service continues to have a big impact on the changes in performance. Here the write-cache operation contributes more than five times more to the critical path in the regression set than in the baseline set. This is strong evidence that this operation is the cause of the performance regression.

Table 9-2. Critical path contributions in baseline and regression traces

Service/Operation             Baseline    Regression    Change
inventory/write-cache         63.1 ms     368 ms        +305 ms
inventory-db/update           1.75 ms     2.26 ms       +516 µs
memcached/set                 4.94 ms     4.71 ms       –230 µs
inventory/update-inventory    15.2 ms     14.8 ms       –470 µs
inventory/database-update     32 ms       30.6 ms       –1.4 ms
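A sketch of how such a ranking might be computed is shown below. It assumes the critical path has already been computed for each trace (as discussed in Chapter 8) and is available as a mapping from service/operation to the milliseconds contributed.

```python
from collections import defaultdict

def average_critical_path(traces):
    """Average critical-path contribution per service/operation over a trace set.

    Each trace is assumed to be a dict mapping "service/operation" to the
    milliseconds it contributed to that trace's critical path.
    """
    totals = defaultdict(float)
    for contributions in traces:
        for operation, ms in contributions.items():
            totals[operation] += ms
    return {operation: total / len(traces) for operation, total in totals.items()}

def rank_critical_path_changes(baseline_traces, regression_traces):
    baseline = average_critical_path(baseline_traces)
    regression = average_critical_path(regression_traces)
    operations = set(baseline) | set(regression)
    changes = {op: regression.get(op, 0.0) - baseline.get(op, 0.0)
               for op in operations}
    # The biggest absolute changes are the most promising places to look first.
    return sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True)
```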

Another method for understanding performance changes is to consider every tag found in either set of traces. For each tag, all spans in each set on which that tag appears are enumerated and the average duration of those spans is computed for each set. Tags are then ranked based on the difference between those two averages. Table 9-3 shows an example of what this might look like. Four tags are shown with changes in average duration between a few hundred milliseconds and more than a second. The tag item-type: new undergoes the biggest change between the baseline set and the regression, moving from 114 ms to 1.24 seconds. This would be a great place to start looking for what code changed between releases. In other cases, these results might indicate a change not in the application itself but in the behavior of

Approaches to Restoring Performance | 197 some clients (client.browser) or in contention for a resource (db.instance or runinfo.host).

Table 9-3. Average duration for spans in baseline and regression traces

Tag                           Baseline (ms)    Regression (ms)    Change (ms)
item-type: new                114              1240               +1130
client.browser: mozilla68     111              548                +437
db.instance: cassandra.4      117              464                +348
runinfo.host: vm123           116              453                +337
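The per-tag method can be sketched in much the same way, this time averaging the duration of every span on which a given tag appears; again, the span representation is an assumption for illustration.

```python
from collections import defaultdict

def average_duration_by_tag(spans):
    """Average duration (ms) of spans, grouped by each tag key/value they carry.

    Spans are assumed to be dicts with a 'duration_ms' field and a 'tags' dict.
    """
    durations = defaultdict(list)
    for span in spans:
        for key, value in span["tags"].items():
            durations[f"{key}: {value}"].append(span["duration_ms"])
    return {tag: sum(ds) / len(ds) for tag, ds in durations.items()}

def rank_tag_changes(baseline_spans, regression_spans):
    baseline = average_duration_by_tag(baseline_spans)
    regression = average_duration_by_tag(regression_spans)
    shared_tags = set(baseline) & set(regression)
    changes = {tag: regression[tag] - baseline[tag] for tag in shared_tags}
    return sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True)
```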

Both the contribution to the critical path and total duration can play roles in finding the root cause—or causes—of a performance problem. The critical path is often bet‐ ter in isolating what code is consuming more resources as part of the regression; total duration can detect changes that occurred outside a single piece of code. Sometimes a change is solely a result of new code or configuration being deployed but sometimes it is also a result of the interaction between old code and new code. For example, a new version of your mobile app may make API calls using a different set of parame‐ ters. This change may show up as a tag on the span emitted from the mobile app or from the API gateway, but that span may not contribute much time to the critical path. Using both methods will help you to mitigate the problem (perhaps by provi‐ sioning additional resources) as well as put a longer-term fix into place (perhaps by optimizing the old code). Aggregate root cause analysis overcomes many of the problems with analyses based on just a few traces: when using a small number of examples, it’s easy to build—and justify—false theories that “explain” performance regressions. Moreover, automating this process can eliminate many of the problems with doing this sort of analysis man‐ ually: it can take a long time for developers to consider a significant number of traces. With approaches like aggregate root cause analysis, distributed tracing not only ena‐ bles developers to validate or refute hypotheses that they may have created through intuition or previous experience but can also help developers to form those hypotheses in the first place. This is especially important in a distributed system, as there may be thousands (or even millions) of different signals that can potentially point to the root cause, but even more so during an incident, when intuition and previous experience aren’t always enough.

198 | Chapter 9: Restoring Baseline Performance Summary Some readers may be surprised at the approaches described in this and the previous chapters, since they might not be considered traditional applications of distributed tracing. Many developers think of tracing only as an option of last resort, to be used when other tools (including metric and log aggregation tools) fail. This could be because tracing is still unfamiliar to many developers, as tracing is the newest of the three so-called “pillars” of observability and has only become an important tool since the adoption of microservices and other distributed architectures. This is unfortunate because tracing has a lot to offer to developers. When we consid‐ ered a scorecard for observability tools in Chapter 7, we noted the importance of pro‐ viding context, prioritizing by impact, and automating correlation. Observability tools must be able to:

• Show how performance problems in one service are related to the behavior of other services • Show how a service’s performance impacts user-visible performance (or not) • Automatically identify which changes in a distributed application might be the root causes of performance issues

All three of these are important in prioritizing work on improving performance and in responding to incidents quickly; in the latter case, they’re especially so since these capabilities mean that developers can more effectively and efficiently communicate with each other and across teams. Unfortunately, it’s not sufficient to simply layer distributed tracing on top of existing observability tools. This might be enough to enable developers to look at individual traces, but as we’ve shown, the real power of tracing comes through approaches like aggregate analysis that use hundreds or thousands of traces to draw conclusions. As many of our examples demonstrate, for distributed tracing to really deliver on all of the aspects of our scorecard, it must be used in a way where trace data is used along‐ side metrics and logs. Put another way, while looking at traces can offer some insight to developers, the real value of tracing ultimately comes from using trace data—along with metrics and logs —to quickly understand when performance problems are occurring and to identify the root causes of those problems. It does so by explaining variation in performance: both offering hypotheses that succinctly describe the cause and providing evidence to support those hypotheses. Tracing makes this possible by making the relationships between causes and effects explicit. It ties different kinds of telemetry together using end-to-end requests to reveal the structure of your application. Without tracing, you will often see a

Summary | 199 (potentially large) set of metrics that are all changing at the same time, and you won’t be able to figure out which are the root causes. With tracing, you can identify which metrics are relevant to the issue you are trying to address and have the context to understand why. While tracing by itself isn’t a complete observability solution, it is a necessary part of observability for any distributed system.

CHAPTER 10
Are We There Yet? The Past and Present

If you’ve read this book, then let us offer our congratulations! We have officially cov‐ ered the full range of technical topics, and you are ready to put distributed tracing to good use in your own applications. We’re now going to turn our gaze toward the future, and discuss some new challenges that distributed tracing might be able to solve in the future (possibly with some tweaks to the way things work under the hood). We’ll also look back at how some of the concepts described in this book came to be. They certainly didn’t materialize out of thin air! Rather, distributed tracing as we know it is the result of a gradual evolution—a process that is not yet over. What les‐ sons can we learn from the journey so far? And in what ways might distributed trac‐ ing continue to evolve? Of course, we cannot predict the future with total certainty. We can, however, point out places where careful decision-making today might make your life substantially easier down the line. We’ve already emphasized this sort of judicious decision-making throughout the book, such as keeping instrumentation implementation-agnostic, and pushing instrumentation to the framework level where possible. For the future, it’s all about making ourselves robust to what might happen. What kinds of new use cases might distributed tracing solve? How might we use or repurpose the constituent pieces of distributed tracing? For many of these questions, we can find possible answers by looking to the world of distributed systems research. Researchers are often proposing new designs and opti‐ mizations, identifying new classes of problems, and presenting new perspectives on old problems. In the rest of this book, we will discuss some of this research.

201 Distributed Tracing: A History of Pragmatism There’s a surprising contrast between how long distributed tracing has been around and how long distributed systems have been around. Distributed systems have been around for more than half a century. So too have problems related to understanding their behavior and their performance. By comparison, distributed tracing tools only started to emerge a little over a decade ago! What’s more, only within the past few years have we started to see standards for distributed tracing, open source frameworks, and the growth of a community. Request-Based Systems To understand why this is, we have to look back at where today’s microservice archi‐ tectures came from. Although the history of distributed systems stretches back deca‐ des, many of the systems that we have today can trace their origins back to the late 1990s, when the explosive growth of the internet was in full swing. This growth led to a huge proliferation of request-response systems, such as two-tier websites, with a web server frontend and a database backend. While request-response systems as a concept were certainly not new, the web put this type of system front and center. Response Time Matters A subtler change occurred with the growing prominence of request-response sys‐ tems. Previously, the most common approach to evaluating systems’ performance was to measure system health in aggregate, by considering system-wide metrics such as throughput and relating them to system-wide measures such as performance coun‐ ters over time. For internet systems, though, throughput wasn’t the most important metric. Instead, it was response latency. First and foremost, it was important for the system to respond promptly to requests, because there’s usually a human with a limited atten‐ tion span sitting at the other end. Some of the major internet companies even quanti‐ fied this: in 2006, Google found that increasing page load time from 400 ms to 900 ms caused a 20% drop in traffic.1 Other studies in recent years have measured similar effects.2

1 [Lin06] 2 [Sou09]

202 | Chapter 10: Are We There Yet? The Past and Present Request-Oriented Information This shift in focus also changed which information was most useful for system analy‐ sis and troubleshooting. Understanding the factors contributing to response latency required being able to drill down into slow requests to see where the slowdown came from and why. Requests were a new dimension for reasoning about systems, orthogo‐ nal to any one machine or process in aggregate. However, this new request-oriented perspective also presented new challenges. Teas‐ ing out request-oriented information wasn’t easy, because request-response systems execute (and interleave) many requests concurrently. Many of the existing approaches to performance analysis at the time weren’t quite the right fit, often because they focused on aggregate system measures and throughput. Distributed tracing grew out of this need. Researchers and practitioners were interes‐ ted in analyzing performance and troubleshooting problems in request-response sys‐ tems. Some pieces of distributed tracing came from these early explorations. Eventually, simple request-response systems evolved into the complex microservice architectures we have today. Likewise, some of the early request-oriented approaches to analysis and troubleshooting were adapted and generalized so that they could extend to these new scenarios. This is the origin of distributed tracing. Instead of thinking of distributed tracing as a standalone entity, we should regard it as a pragmatic combination of different design pieces, chosen because they make the most sense for the systems and environments that we have today. Although distributed tracing is not the only approach we could have taken, it is probably the best approach for the kind of analysis we want to do today. Notable Work Many people have contributed to distributed tracing over the years. From this large body of work, there are four pieces that have been especially influential, because they embody key ideas that have shaped today’s distributed tracing frameworks. The first is a research prototype called Pinpoint. The second is an industry research prototype called Magpie. The third is a research prototype called X-Trace. The fourth is a production system from Google called Dapper.

Notable Work | 203 Pinpoint Pinpoint was a research prototype developed in 2002 by researchers at the University of California at Berkeley and Stanford University.3 Its goal was to identify the root causes of problems in internet services. Pinpoint represented a shift in approaches to problem-solving. The authors recog‐ nized the growing mismatch between the problem-solving techniques of the time, and the new class of dynamic, always-on, constantly evolving internet services. Before Pinpoint, a common approach to root cause analysis was to statically model systems. But in a constantly evolving internet service, keeping models both up to date and cor‐ rect became a gargantuan task. Today, these challenges might seem obvious. Imagine trying to statically model your microservice architecture, along with all of its dependencies! Imagine the logistical nightmare of keeping this model up to date with every change committed! Pinpoint was one of the first to argue for a request-oriented and data-driven approach. Rather than proactively modeling the system, the authors thought it better to measure the system and use the recorded data to troubleshoot problems after the fact. Sound familiar? Key to Pinpoint was being able to group information on a per-request basis. To do this, Pinpoint assigned each request a unique request ID, which it maintained in thread-local storage—an idea that eventually became the trace contexts that we prop‐ agate today. Pinpoint didn’t fully explore these ideas, though; it was only intended for a single-machine Java Enterprise environment, so maintaining request IDs could be done easily within the middleware. Pinpoint did not yet deal with context propaga‐ tion between machines or user-created threads. Magpie Magpie was an industrial research prototype developed in 2004 by researchers at Microsoft Research Cambridge.4 Its goal was to record detailed end-to-end traces (much like the traces we get today with Jaeger or Zipkin) and annotate those traces with fine-grained information about the resources consumed during execution (such as I/O and CPU measurements). Much of its technical focus was on disentangling concurrent requests that execute within the same software components. Magpie was a more broadly applicable tool than Pinpoint, because the authors designed it for arbitrary, heterogeneous .NET applications. Like Pinpoint, Magpie needed request-oriented information. However, it ran headfirst into a problem that

3 [Che02] 4 [Bar04]

204 | Chapter 10: Are We There Yet? The Past and Present Pinpoint had carefully sidestepped: in Magpie’s heterogeneous environment, there was no ubiquitous middleware and no easy way to propagate request IDs; the only option was exhaustive source code instrumentation. The authors faced a tough deci‐ sion, and ultimately decided not to propagate request IDs. Instead, Magpie inferred correlations mostly from existing outputs. Specifically, Magpie integrated with Windows XP’s Event Tracing for Windows (ETW), a lightweight event logging framework which already recorded many thread-, networking-, and resource-related events. Relationships between events could often be inferred already from the events, including events occurring in the same threads, and between some concurrent activities (such as kicking off a new thread). The authors only had to add additional events in a few choice locations to complete the end-to-end picture of a request (e.g., if there was network communication, the sender and receiver would need to explicitly log a connection ID on both ends). From these events, all that remained was a postprocessing step to construct traces from the recor‐ ded events. Magpie relied on a developer-supplied scheme to describe how to parse events and extract correlation IDs. Magpie is an interesting system because it grappled with the difficulty of making a general-purpose tool. Pinpoint could take shortcuts because it only dealt with J2EE applications; Magpie could not. Since Magpie, other research has explored inference- based approaches; for example in 2014 Facebook presented a similar system called “The Mystery Machine.”5 However, while inferring request structure is an appealing alternative to doing exhaustive instrumentation, the downside is that it’s a less scalable and more brittle approach, since it depends on event parsing and correct developer-supplied schemas. Ultimately, distributed tracing does use context propagation, and practitioners have collectively decided that it is worth the instrumentation effort. Although request traces alone are a useful starting point, Magpie also demonstrated the value in incorporating fine-grained information like resource usage into traces. This is something we can think about doing today, because we often have other sour‐ ces of information outside of distributed tracing (such as ETW events), and we can enhance their value using that information to augment traces.

5 [Cho14]

Notable Work | 205 X-Trace X-Trace was a tracing framework developed in 2007 by researchers at the University of California at Berkeley.6 Its main goal was troubleshooting requests whose execu‐ tions spanned many different machines, layers, and administrative domains. X-Trace began to crystalize some key pieces of distributed tracing that are used today, and its open source implementation is still in use. In the time between Pinpoint (2002) and X-Trace (2007), internet services continued to evolve, growing more complex, more heterogeneous, and more distributed. The need for request-oriented tracing grew in importance, but the approaches taken by tools like Pinpoint and Magpie were starting to show cracks. The main issue was their tight integration with the environments they operated in. It was difficult or impossi‐ ble to incorporate information spanning different operating systems, programming languages, or layers (such as the network). Today this is something we have grown to expect—there is very little we can assume in common across our microservices! X-Trace therefore focused on generality: how can we get request traces in such a het‐ erogeneous environment? Its guiding design philosophy was to demand as little from the people using the tool as possible, by imposing minimal assumptions and require‐ ments. To achieve this, X-Trace made two important design choices: first, a standard for context propagation, so that information recorded in different components could be combined coherently; and second, out-of-band report collection, to separate the recording of information from its usage. To avoid having to infer any relationships between events, X-Trace proposed includ‐ ing a parent ID as well as a request ID. The parent ID was dynamic, and would be updated every time a new event was recorded. Each event would explicitly record the ID of its causal predecessor. By including a parent ID, X-Trace’s backend components could deterministically reconstruct the order and concurrency of events happening during the request. The backends did not have to rely on post-processing to infer relationships using timing or knowledge of the system’s internals. It meant that developers of different system components could make different choices about how to log events, in terms of level of detail as well as the information contained in events. The only requirement was to incorporate and propagate the X-Trace metadata. X-Trace also separated the recording of information from its usage. This meant developers using it would not have to commit up front to a particular diagnosis tech‐ nique or use case. The authors foresaw how administrators of different components would also want control over their portion of the trace data (thus not exposing

6 [Fon07]

206 | Chapter 10: Are We There Yet? The Past and Present detailed internal information about their systems). Out-of-band data collection achieved this by imposing an abstraction boundary between data generation and the backend concerns of data collection and storage. Distributed tracing frameworks today follow the same philosophy as X-Trace did: only include the minimal pieces necessary to capture traces. That way, frameworks can still be used even by extremely heterogeneous systems. X-Trace’s parent ID is analogous to the parent span IDs used in today’s distributed tracing frameworks. Dapper Dapper is a tracing framework developed for internal use at Google. It was described in a 2010 whitepaper and is still in use today.7 You might already be familiar with Dapper, as its design forms the basis of today’s most popular distributed tracing frameworks. Even back before 2010, Google’s internal systems were a lot like today’s microservice architectures: heavily RPC-based, with a single request invoking many different serv‐ ices, often in parallel. Google faced the same challenges that had motivated prior work, and set about building a distributed tracing framework that would give them request-oriented visibility into large, heterogeneous production systems. Dapper was the name of this distributed tracing framework. It elaborated on some of the concepts presented by Pinpoint, Magpie, and X-Trace, while simultaneously deal‐ ing with new operational challenges that arose from practical experience. A key concept introduced by Dapper was the span model of tracing. By now you’ll be familiar with the concept of spans, as a key building block of distributed tracing. However, prior to Dapper, distributed traces were based on the notion of events— instantaneous points in time that are very similar to individual logging statements. Events are useful for describing what happened, but they don’t directly lead to action‐ able data, which was important for Dapper. Instead, its authors observed that well- defined segments of request execution (such as individual RPCs) aligned very well with their goal of diagnosing performance problems, especially those relating to request latency. Treating spans as a first-class primitive at the instrumentation level meant that traces would immediately expose timing information about the most important and meaningful parts of a request. Before Dapper, the authors of X-Trace had argued that distributed tracing should impose a minimal set of requirements. Dapper, however, imposed an additional requirement on its users, by incorporating spans as a first-class concept. With this careful design change, it substantially improved the utility of the resulting traces.

7 [Sig10]

Notable Work | 207 Dapper was the first publicly described distributed tracing framework used in pro‐ duction by a large company. In addition to refining the tracing models of prior works, the paper also brought to light operational challenges that hadn’t been previously considered, including the need for trace sampling, trade-offs surrounding runtime overheads, security concerns, and how to make trace data accessible to users. Where to Next? As the nature of distributed systems changed over the years, so too did the require‐ ments of distributed tracing frameworks. The initial work primarily focused on tech‐ nical requirements around how to get request-oriented information. Gradually, they grew to also incorporate practical requirements, such as how to ease adoption and how to increase trace data utility. As distributed tracing frameworks saw more wide‐ spread production use, operational requirements came to light involving scalability and tracing backends. Although these requirements have shaped the design of distributed tracing frame‐ works, they aren’t set in stone. In particular, new system designs and architectures continue to surface, such as serverless computing, and the increasing representation of streaming systems. As our computing systems change, they may influence or necessitate changes to our distributed tracing frameworks. Likewise, we do not have all the answers when it comes to troubleshooting dis‐ tributed systems. Distributed tracing gives us valuable visibility, and requests have proven to be a useful dimension along which to capture information. However, meth‐ ods to extract value from distributed tracing data are still in flux and less established than the techniques for getting the data. New advances in trace analysis may provoke changes to what data is captured by systems. In the remaining chapters, we’ll take a look at recent research which examines some of these questions. These works bring to light new requirements and new approaches to address existing requirements. Whether these new ideas will ultimately prevail still remains to be seen.

CHAPTER 11 Beyond Individual Requests

You’ve already seen how traces capture useful information about the end-to-end behavior of individual requests. This includes the time taken by each individual RPC, how much data was transferred at each hop, timeouts, and error responses. By inspecting a single trace carefully, you can often explain why the request took the time that it did. For example, you might see that a particular request missed in the cache. Perhaps a service returned an exceptionally large response record that took a long time to serialize and deserialize. Maybe there’s a straggler in a large RPC fanout that responds many milliseconds after its peers. Perhaps the trace reveals the dreaded staircase pattern, where RPC calls that should be parallel are in fact executing serially. Any one of these situations would reveal something important about that particular trace, but as the span timing diagram in Figure 11-1 illustrates, it’s hard to interpret these behaviors in isolation. What you can’t tell from individual traces is how often the situation occurs, and in response to which types of requests. Therefore, should you—the service operator or owner—take some action to fix the problem, or is it a one-off that is unlikely to happen again in your lifetime? Which of the suspicious- looking parts of a trace are actually unusual? By comparing a single trace to an aggre‐ gate, or one aggregate set to another, you can learn contextual information that helps answer such questions. The benefits of aggregate traces don’t stop there. In addition to giving context for interpreting an individual trace, groups of traces can enlighten us about the system as a whole, even when everything is working normally. One of the most common appli‐ cations for trace aggregations is to extract the dependencies between services in a production system (see Figure 11-2), while others include capacity planning, A/B testing, and detecting workload trends. In “Aggregate Analysis” on page 171, we talked about using aggregate critical path analysis to discover where to focus

optimization efforts, and in “Correlation Analysis” on page 173, we showed how correlation analysis can help with root cause diagnosis.

Figure 11-1. Interpreting individual traces without data is hard.

Figure 11-2. A service dependency graph; the edge weights indicate the fraction of traces containing that dependency.

To recap, trace aggregations provide context in two complementary ways. First, they show whether a particular measurement or behavior is anomalous compared to the typical trace as well as the prevalence of that characteristic. Second, aggregations cap‐ ture what is normal in the system, defining a reference point for typical behavior and generating data for tasks such as provisioning, alerts, and tracking trends over time.

210 | Chapter 11: Beyond Individual Requests In the rest of this chapter we will dig a little deeper into how to use the context pro‐ vided by aggregate traces to focus debugging efforts and extract insights about system behavior. The Value of Traces in Aggregate Let’s look at some concrete examples to illustrate how trace aggregations are useful in practice. Example 1: Is Network Congestion Affecting My Application? You have a congested link in your datacenter network, indicated by some metric such as counters from the network switch. You want to find out what impact, if any, the congestion is having on your application traffic. Specifically, given a trace of a request that is potentially affected, you would pose the question: “Is this latency high?” To answer, you first need to decide what “high” means, and typically the 95th or 99th percentile would be a good choice of definition (see “High-Percentile Latency” on page 150). Then, take a control group of traces (say, a similar population over a period when the link was not congested), and compare the 99th-percentile latency with that of traces captured during the congestion. Conversely, you may use the same information about what defines high latency to pick example traces for closer inspection. In this example, you could examine com‐ munication patterns in traces exhibiting high latency to decide whether the slowness is correlated with the problematic network link. Example 2: What Services Are Required to Serve an API Endpoint? You’ve decided to serve some of your API endpoints out of a different datacenter and you wish to avoid cross-datacenter traffic by keeping together the microservices used to serve the same endpoint. To help in planning, you want to identify how many microservices participate in serving each endpoint. “What is the average number of services involved in requests to a particular API endpoint?” To answer this question using aggregated traces, you would group by endpoint requests and then count the number of unique services that were present before tak‐ ing the average of each group of traces. Because you’re looking at the aggregate, you’re not misled by natural variations in the number of services involved. Dynamic depen‐ dencies, such as when a cache miss causes a backend storage service to be invoked, are common in microservice architectures, and a single trace may not contain the entire set of dependent services. In addition to system-wide insights, you can use contextual knowledge derived from the aggregate set of traces to be more effective when debugging with traces. Focusing attention on traces in the tail of the relevant distribution is often a good start.

The Value of Traces in Aggregate | 211 For instance, if you observe that the RPC send and receive times in a slow trace are in the 99th percentile of the distribution for that value across all traces, then that’s a strong clue that the problem might lie in the network. On the other hand, if you see a slow RPC but note that it is orders of magnitude larger than the average size across all traces, you may reasonably conclude that the propagation delay in this case is as expected and not (necessarily) relevant to the problem you are debugging. Notice that both cases have involved using an aggregation function over some property of the traces (99th percentile of transmission time and average message size, respectively) to provide context for debugging. Being able to check whether the characteristics of a specific trace are abnormal means that you can focus your debugging in the most promising places. Being able to check whether the suspicious stuff occurs in enough traces to matter directs debugging efforts to the most impactful problems. Organizing the Data The examples we described in the previous section rely on applying aggregation to different components of the traces. You answered the first question (“Is this latency high?”) by looking at the distribution of a single value—the duration—that is a basic property of each trace. By contrast, the question “What is the average number of serv‐ ices for requests to a particular API endpoint?” requires applying an aggregate func‐ tion to a value derived from counting the number of services contained within each trace. What this means in practice is that the best way to organize aggregate trace data to be able to answer such a range of questions is not obvious. If you only want to answer the first type of question, then you can store trace attributes in flat tables, for example in a relational database. If the second type is more important, you might consider using a graph store (recall that traces are really graphs). In either case, it may also be useful to run user-defined queries that apply to arbitrary tags, and so you will proba‐ bly want a way to support them too. Clearly, how best to make traces available for aggregate analysis depends on the par‐ ticular requirements of the operating environment. For this reason, we are not going to say that one way is better than another, but instead we will use a strawperson solu‐ tion—taking the “flat tables” approach—to illustrate the trade-offs involved more concretely. A Strawperson Solution We’ll start by picking SQL as our query language, leaving the specific storage backend unspecified—for example, it could be a relational database or a Hive data warehouse. In the following examples, we’ll use nonstandard, SQL-like pseudocode.
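To make the flat-tables idea concrete before writing any queries, here is a minimal sketch of the two tables this section assumes. The column names are the ones used by the queries that follow; the types, and the idea of spelling the schema out as DDL at all, are purely illustrative, and a real pipeline would choose its own schema:

-- Illustrative schema for the strawperson flat tables; column types are placeholders.
CREATE TABLE traces (
    TraceID      STRING,     -- one row per trace
    timestamp    TIMESTAMP,
    date         DATE,
    type         STRING,     -- e.g., 'http'
    api_endpoint STRING,
    duration     BIGINT      -- end-to-end duration of the trace
);

CREATE TABLE spans (
    TraceID         STRING,  -- join key back to the traces table
    SpanID          STRING,
    service         STRING,  -- service that emitted the span
    server_hostname STRING,
    timestamp       TIMESTAMP,
    date            DATE,
    type            STRING,
    duration        BIGINT   -- duration of this span
);

Keeping one row per trace and one row per span is what lets us mix whole-trace aggregations with per-span aggregations in the same store.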

Let’s begin by considering how you would use SQL to answer the question from Example 1. This is actually straightforward. As described earlier, you choose 99th-percentile duration as the aggregation function, which you then apply to your collection of traces. Thus the query to learn the exact value of “high” will look something like this:

SELECT PERCENTILE(duration, 0.99) FROM traces

More realistically, you will likely want to filter the traces according to attributes like the date or the type of request, and this extends the query to:

SELECT PERCENTILE(duration, 0.99) FROM traces
WHERE date = `today` AND type = 'http'

Notice that we have assumed a table called traces, with one row per trace and columns for duration, date, and type. You can imagine also surfacing other properties of traces in this same table, such as the trace identifier, timestamp, and other common attributes. Let’s move to the second example question: “What is the average number of services involved in requests to a particular API endpoint?” Here you’re interested in a derived attribute of the trace, specifically the number of services. We mentioned previously that a graph store might be a good choice here. However, for our strawperson we’re going to stick with the flat tables and explore their pros and cons for different types of queries. Let’s assume we have a second table called spans, with one row per span and, at a minimum, the fields TraceID, SpanID, and service. This time, as shown in Example 11-1, you’re going to filter on traces with the API endpoint of interest, and then take the average of the count of unique services:

Example 11-1. Querying the spans table

SELECT AVG(num_services)
FROM (
    SELECT COUNT(DISTINCT spans.service) AS num_services
    FROM spans
    JOIN traces ON traces.TraceID = spans.TraceID  -- tables joined on TraceID
    WHERE traces.api_endpoint = '/get/something'
    GROUP BY spans.TraceID
)  -- service count per trace

If you aren’t familiar with SQL, this query might seem complicated. Let’s step back and consider what’s going on: We have flattened the constituent spans of each trace into a set of rows in the spans table. After filtering in the traces table to just the rows with the api_endpoint of interest, we choose only spans with a matching TraceID

and then count the number of unique services in each group of spans. Finally, we take the average of those counts to get our answer. That’s it! This basic idea of squishing trace graphs into flat tables is very powerful. The general approach is to have different tables for the different kinds of objects that comprise a trace. Thus in this example we have the top-level traces table, holding information about traces, and a second spans table holding information just about spans. In practice, we might also consider tables for certain common tags, or even derived properties of traces like the critical path. The key to it all (pun intended) is in using TraceID and SpanID as join keys, which then allows us to compute aggregate functions across traces, spans, or their attributes. What About the Trade-offs? The astute reader may be wondering why we went to the trouble of using a join in the previous query. We could have avoided it simply by making the number of services a field in the traces table, as in Example 11-2:

Example 11-2. Querying the traces table

SELECT AVG(num_services)
FROM traces
WHERE api_endpoint = '/get/something'

However, the num_services column is not a base property of a raw trace, so this implies a precomputation that walks over every trace in the dataset to count the num‐ ber of services. Indeed, the data in these tables has to come from somewhere, typi‐ cally a regularly scheduled batch-processing job, or perhaps a real-time streaming computation. The trade-off that we’re making in this example is between adding com‐ plexity to the preprocessing and adding complexity to the query. We’ll say more about preprocessing traces for aggregate analysis later, but first let’s think about sampling and how it relates to trace aggregation. Sampling for Aggregate Analysis In an ideal world, when we apply an aggregation function to a population of traces, we get an answer that is true not only for the traces, but also for the real-world sys‐ tem. In other words, if the traces indicate that the 95th percentile for RPC server pro‐ cessing time is one second, then that’s also the case in your actual system. Unfortunately, there are good reasons why it may not be the case that the aggregate set of traces precisely reflects reality. In particular, as we explained in “Biased Sam‐ pling and Trace Comparison” on page 165, it’s not always desirable to sample in a

214 | Chapter 11: Beyond Individual Requests completely representative manner. This is not to say that aggregate analysis of sam‐ pled traces isn’t useful—it is—but use caution when interpreting the results. One way to avoid the problem of sampling bias is simply not to sample! Just apply aggregate functions as soon as possible to all traces, which guarantees the integrity of the result. If you already know which aggregation functions you’re going to be run‐ ning, you can even throw away the raw traces and just keep the result, saving on stor‐ age costs and making future queries very fast and cheap. The downside with this approach is that you can’t change your mind later and apply a different aggregation function. In the first example we used earlier (“Is this latency high?”), this means you would store the precomputed 99th-percentile latency, so the lookup would be fast and cheap, but it also means you couldn’t decide to check the 75th-percentile latency instead (unless you decided in advance to store that too). In summary, there’s a three-way trade-off between accuracy, flexibility, and cost, which you can control by setting the sampling rate and choosing when to apply aggregation functions. The following questions guide this choice:

• How much error in the aggregation results is acceptable? This sounds like a scary question, but in practice, every observability system provides imperfect data and thus has inherent error. Here we encourage you to choose the trace sampling rate intentionally in order to find the right balance between accuracy and cost.
• Do you know in advance what questions you will want to ask of your set of traces? Applying aggregation early and keeping just the answer will be much cheaper, but at the expense of flexibility of future queries (see the sketch after this list).
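To illustrate the second point, applying aggregation early and keeping just the answer might look like a scheduled job that writes one row per day into a small summary table. The daily_latency_summary table and its columns are hypothetical:

-- Hypothetical nightly job: compute the aggregate once, keep only the result.
INSERT INTO daily_latency_summary (date, type, p99_duration)
SELECT date, type, PERCENTILE(duration, 0.99)
FROM traces
WHERE date = `today`
GROUP BY date, type

Lookups against daily_latency_summary are then fast and cheap, but if you later decide you also want the 75th percentile, there is no raw data left to backfill it from; you can only start computing it going forward.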

Let’s turn now to the processing pipeline itself. The Processing Pipeline By now you know that a trace is made up of records produced by multiple machines in a distributed system. The first step in making that data meaningful for inspection is to stitch together related records into a single trace object. This has to be done whether we make traces available for aggregate analysis or simply provide a way to look at them one at a time. In addition, tracing data is often imperfect: we see incomplete and buggy instrumen‐ tation generating invalid records, or loss in the collection system itself resulting in broken traces. As a result, processing to clean up the trace data is often performed at this stage also. Assuming these two necessary steps happen somewhere, our concern now is how and when to prepare the data for on-demand aggregation, as well as which, if any, aggre‐ gation functions to precompute. Referring to Example 2: precomputing the number

The Processing Pipeline | 215 of services in each trace is a function performed in the pipeline, while calculating the average number of services per trace using a SQL query is an aggregation you per‐ form later and on-demand. In Figure 11-3 we illustrate two possible architectures for the processing pipeline. In both diagrams the trace data flows from the top (the production services) to the bot‐ tom, where we depict the output as flat tables, as in the strawperson data representa‐ tion described earlier (of course, other representations are feasible as well). The diagram doesn’t show how the tables are eventually used, but we would typically use them for interactive queries, batch processing, and even visualizations and dashboards.

Figure 11-3. Two possible architectures for the trace processing pipeline.

The left side of Figure 11-3 shows a pipeline in which we store the cleaned-up traces before applying batch processing to produce the tables. If you don’t already have a processing pipeline in place but wish to apply aggregate analysis to your traces, then this arrangement is a good way to get started. You can use your existing store for indi‐ vidual traces, or copy them to a different one, and then run a regularly scheduled batch job over the data without changing anything about your current tracing infra‐ structure. The outputs may be somewhat delayed, but you can start learning insights from trace aggregations without a big investment in additional infrastructure. On the right side of the diagram, we depict a streaming system that processes traces as they arrive and continuously updates the output tables. A semi-real-time system like this gives access to trace aggregations much sooner, and holds the exciting possi‐ bility of being able to perform tasks like live debugging using the outputs and tying alerts to trace aggregates. These benefits come with a price: the operational burden of such a system is higher, particularly because tracing data is inherently unpredictable,

216 | Chapter 11: Beyond Individual Requests often arriving in large bursts of data, and you have to ensure the system can keep up with the rate of incoming records. Regardless of how traces are stored, being able to quickly find traces that match speci‐ fied criteria is extremely useful. One way to achieve this is to build an index as part of the processing pipeline. Because users may want to look for traces that match any of a large number of attributes, we recommend indexing on as many properties as possi‐ ble. Even when aggregate analysis is not feasible in real time, the ability to discover and inspect individual traces with specific characteristics is valuable for debugging. In fact, building a wide index is just one example of functionality you can build into your processing pipeline. Another is to precompute aggregations, such as the number of services in a trace, as we discussed in “What About the Trade-offs?” on page 214. A third important kind of processing is to extract information from heterogeneous data in traces, which we discuss next. Incorporating Heterogeneous Data The processing pipeline extracts properties of, and computes aggregations over, gen‐ eral attributes of traces like the number of services. It is also a place where you can introduce domain-specific processing of tags. We talked about effective tagging in Chapter 4, where we noted that tags enrich a span with more information and let you apply powerful, custom aggregate analysis. Custom Functions Let’s consider a simple example that builds on our strawperson scenario. If you instrument services so that RPC spans contain tags for the client and the server’s datacenter, then you could extract this information (by parsing tags to find the ones referring to the datacenter) into your spans table as two extra columns. Now when you ask “Is this latency high?” you can take into account whether the traces of interest contain inter- or intra-datacenter traffic, with the former likely being across wide- area network links and thus expected to be significantly slower. Taking this idea further, when different teams own different microservices, each team can write its own service-specific tags and then perform custom processing on those tags in the aggregate traces. This is a great way to get more visibility into service behavior. For example, if a service makes outgoing calls to different services depend‐ ing on the arguments of the incoming RPC, the team owning the service might write custom tags into the trace recording the arguments and so have an “explanation” for the outgoing RPCs.
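To make the datacenter example concrete, suppose the pipeline parses those tags into two hypothetical columns on the spans table, client_datacenter and server_datacenter. Asking “Is this latency high?” separately for intra- and cross-datacenter RPCs then becomes a small variation on the earlier percentile query:

-- Assumes client_datacenter and server_datacenter were extracted from span tags
-- by the processing pipeline; both column names are illustrative.
SELECT (client_datacenter = server_datacenter) AS same_datacenter,
       PERCENTILE(duration, 0.99)
FROM spans
WHERE date = `today` AND type = 'http'
GROUP BY same_datacenter

Cross-datacenter RPCs travel over wide-area links, so the two groups will have very different baselines; comparing a suspicious span against the right group keeps you from flagging latency that is perfectly normal for its category.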

Incorporating Heterogeneous Data | 217 Supporting custom functions in your processing pipeline is a bit like user-defined functions (UDFs) for a SQL query tool—they are powerful and flexible, but there are some tricky questions that you need to address:

• How do users express UDFs? How do you make them safe, so that they can’t crash the system or get stuck in an infinite loop?
• How do you prevent UDFs from consuming too many resources, such as CPU or memory, or from taking too long to run?
• How do you ensure that UDFs stay in sync with changes to instrumentation in the source code? If a tag is changed or removed, whose responsibility is it to update the code in the processing pipeline?

As software engineers, we ask these types of questions in other domains also, espe‐ cially when thinking about best practices. Like any important piece of infrastructure, you need to consider how the processing pipeline fits into the larger software ecosys‐ tem and plan for its robustness and longevity. Joining with Other Data Sources Most microservices operators collect vast amounts of metric data at many levels of the software stack. Combining this data with traces extends the reach of our under‐ standing and insights significantly. For instance, you might ask “Is this RPC latency high, given the kernel version installed on the server machine?” In other words, you want to break down the 99th-percentile latencies by kernel versions, because after all, the Linux TCP stack in the kernel is being constantly tweaked and may well play a part in the tail latencies that you observe in production. How do you achieve the ability to answer questions about traces that involve other types of data? In this case you could arrange for a tag to be added to every RPC that records the kernel version at the server, but that would be inefficient and not a gen‐ eral solution. There may be many other pieces of data from external sources that you will want to join against, some of which may be infeasible to generate tags for. A more flexible approach is to ensure that suitable join keys are present in both data sources in a form amenable to querying. Thus, with our strawperson solution, we might maintain a table, populated by a different process, called kernel_versions with fields hostname, version, start_time, and stop_time. Now you can write the query, shown in Example 11-3, to extract 99th-percentile RPC latencies by kernel ver‐ sion by joining hostname from each row in your spans table with hostname in the table that tracks the current kernel versions:

Example 11-3. Extracting 99th-percentile RPC latencies

SELECT kernel_versions.version AS kernel_version,
       PERCENTILE(duration, 0.99)
FROM spans
JOIN kernel_versions ON spans.server_hostname = kernel_versions.hostname
WHERE spans.timestamp > kernel_versions.start_time
  AND (spans.timestamp < kernel_versions.stop_time
       OR kernel_versions.stop_time IS NULL)
  AND spans.date = `today`
  AND spans.type = 'http'
GROUP BY kernel_version

The presence of hostname in both tables makes the query simple. In general, design‐ ing data query systems with the expectation that different sources may be combined in the same query unlocks a great deal of information, and tracing data is no exception. This concludes a lightning tour of the benefits of aggregation. For readers thinking about building and/or deploying a processing pipeline for aggregate analysis of traces, we highlighted some of the trade-offs and implementation issues. To conclude the chapter, we’ll recap the key points by describing how they appear in a case study of a real-world tracing system. Recap and Case Study Canopy is a distributed tracing infrastructure in use at Facebook, described in a paper published in 2017,1 whose authors include Jonathan Mace, a coauthor of this book. In contrast to conventional tracing systems like Zipkin or Dapper, Facebook engineers designed Canopy from the start to support aggregate analysis and to incorporate het‐ erogeneous data, such as logging statements, performance counters, and stack traces. The Value of Traces in Aggregate The number of traces at Facebook is immense, with over a billion captured per day, and so in Canopy aggregation is the norm, rather than the exception. Similar to the basic design of our strawperson tables, Canopy aggregates traces into datasets, in which each column is a feature, the term it uses for a value derived from aggregated traces. When analyzing performance data, Facebook engineers query datasets directly, as well as access them via visualization tools and in dashboards.

1 [Kal17]

Recap and Case Study | 219 Organizing the Data Canopy supports multiple APIs for instrumentation, which means that the raw trace data it consumes has a variety of forms, from records representing the RPCs we are familiar with in other tracing systems, to events capturing the causal relationships between asynchronous function calls, and many others. In Canopy’s processing pipe‐ line all of these lower-level data models are mapped to a standard intermediate repre‐ sentation called a modeled trace, from which the pipeline extracts features and outputs the final datasets. In terms of the trade-off between precomputation and query complexity, the design‐ ers of Canopy have come down firmly on the side of eager precomputation of fea‐ tures in the processing pipeline. This has the advantage of supporting almost real- time interactive analysis on trace aggregates, but it is possible that some of that precomputation is redundant. As well as producing aggregated datasets, Canopy stores individual traces on disk for deeper analysis when required. This is the ideal situation for the engineer debugging a performance problem: The datasets, and higher-level tools written against them, let them quickly identify where to look more closely, and then they can retrieve exactly the right traces for detailed examination. Sampling for Aggregate Analysis Canopy doesn’t attempt to obtain a representative (accurate) set of traces; rather it focuses on providing value for specific use cases, while keeping the volume of tracing manageable. Individual users or teams define policies that set sampling rates on a case-by-case basis according to request characteristics like the endpoint or datacenter. The rates are further constrained by fixed limits set per user and globally. The Canopy paper strongly implies that the main priority for the system is flexibility, rather than accuracy (which is not possible with per-user sampling policies) or cost (the paper does not discuss cost directly). Within this choice there is some nuance: by placing aggregation into the processing pipeline, Canopy designers have potentially limited flexibility. However, this is mitigated by also storing the original traces on disk at some cost, leaving open the option to compute additional aggregations offline at a later time. The paper discusses characteristics of the Canopy system itself. For example, it reports how often individual columns across many datasets were accessed by user queries over six months (it notes that one column—page load latency—is by far the most popular). A potential use of such information that is not mentioned in the paper would be to refine the set of aggregate analysis functions run in the pipeline, thus reducing complexity and the cognitive burden on engineers for commonplace queries.

220 | Chapter 11: Beyond Individual Requests The Processing Pipeline The heart of Canopy is its processing pipeline, which looks somewhat like the streaming system we showed on the right side of Figure 11-3. The pipeline, which runs continuously online, takes incoming events and builds modeled traces, which includes cleaning up the traces by detecting and correcting buggy and incomplete instrumentation, aligning timestamps, inferring missing information, and so on. The pipeline then applies aggregation functions to traces, and finally outputs multiple datasets. The Canopy designers have taken a great deal of care to ensure that the processing pipeline can keep up with the rate of arriving data records (reported as 1.16 GB/s). The system contains mechanisms like isolation queues, offloading work to be pro‐ cessed asynchronously, and ways to shed load when necessary. To deal with the diversity of input data and custom aggregations required by different engineering teams, Canopy has extensive support for user-defined aggregation func‐ tions in the processing pipeline. An interesting discussion in the paper describes how the original system design included a domain-specific language (DSL) for expressing UDFs as simple pipelines of filters and transformations. It turned out that users needed more complex computations than expected and the DSL had to evolve to include more general-purpose features. Moreover, interactive exploration of the anal‐ ysis functions proved to be a requirement, leading to integration with iPython notebooks. Incorporating Heterogeneous Data There are two characteristics of Canopy that make it particularly well suited to han‐ dling heterogeneous data. First, translating all events into a common trace model means that different types of data (say, traces of browser page loads versus the back‐ end RPC call graph) can be fed directly into Canopy’s processing pipeline and unified in traces as appropriate. Similarly, on the output side of the pipeline, the capability to introduce or adapt user-defined aggregation functions supports the customized extraction of information from these various data sources. In contrast to the approach we suggested earlier of keeping the other data sources separate but joinable, Canopy is designed to integrate heterogeneous data tightly. This adds complexity to the processing pipeline, but simplifies access to relevant features of a trace. Data that is not trace-related, such as the kernel version on each server, will still need some kind of join key with trace datasets to be accessed programmatically. In conclusion, Canopy provides a powerful real-time tracing system that supports customized aggregate analysis, while also keeping individual traces for detailed inspection. In practice, companies without the resources of Facebook are unlikely to

Recap and Case Study | 221 choose to build and maintain such a costly system, but the design offers many good lessons that also apply to simpler systems for aggregate analysis of traces.

CHAPTER 12 Beyond Spans

Most distributed tracing systems that run in production today represent requests as a tree of spans. This representation is simple to understand and well-suited to a large number of common workloads, but it isn’t a good fit for all of them. In this chapter we’ll look at how the span came to be the first-class citizen of tracing, and then explore its shortcomings for systems like machine learning models, streaming, pub-sub, and distributed dataflow. Devising new abstractions for tracing is an excit‐ ing area of active research and development and we’ll try to give you a flavor of what’s coming in the near future. Why Spans Have Prevailed In Chapter 10, we described how early tracing systems influenced the design, and even the terminology, of present-day systems. As distributed systems evolved and became more complicated, users had a pressing need to understand request-response slowdowns, especially when requests were interleaved and executed concurrently. This led to a remote procedure call (RPC)-centric approach, tightly integrated with the way that the systems being traced were implemented. These days, distributed sys‐ tems have more diverse execution and communication patterns, and for many popu‐ lar systems the “traditional” request-oriented tracing design is not a good fit. Nevertheless, having the RPC—represented by a span—as the core datatype in dis‐ tributed tracing has served us well over many years. Let’s start by looking at the reasons. Visibility It is self-evident that tracing RPCs shows you which components of your system communicate using RPCs. In microservice applications, RPCs enable the loose cou‐ pling of independent services, a key feature of the modularity, scalablity, and

223 maintainability of those applications. At the same time, this decoupling makes the end-to-end picture of how the microservices communicate to serve an individual request murky, to say the least. We already talked a great deal about how important tracing is for understanding the behavior of your microservice architecture—here we want to emphasize that a trace comprising a tree of spans is the perfect abstraction for visibility into this style of distributed system. Similarly, we use RPCs in many distributed systems that are not microservice archi‐ tectures. RPCs are so pervasive that tracing them invariably tells us something about what’s going on, even when they aren’t the primary mechanism for communication. The upshot is that while we continue to use RPCs in our distributed systems (which is probably forever), we will continue to use spans in our distributed traces! Pragmatism It is hugely convenient for users if the low-level tracing mechanisms, such as context propagation and emitting the trace records themselves, are built into the RPC subsys‐ tem. Integration at this level typically provides a simple API for users to create instru‐ mentation, and takes care of all the messy and sometimes complex implementation details (see Chapter 2). The ubiquity of RPCs means that once tracing support has been added to an RPC subsystem, you get tracing—and spans—for very little extra effort. However, there is a price for this convenience, paid in the form of spans as the one- size-fits-all abstraction. Because RPC systems are the common base layer of many dif‐ ferent applications, adding support for specialized tracing abstractions that might only apply to a handful of applications is not generally practical. Alternatively, you could have a completely separate tracing system, not tied to RPCs and thus more flex‐ ible, but then you’d have to deploy and maintain it separately, adding operational bur‐ den. As a result, to make use of the convenience of the RPC layer to record stuff that doesn’t look like a span, you have to shoehorn your information into something that does. We will describe a few examples of this later. In practice, the operators of distributed tracing systems tend to favor convenience over flexibility, but it’s possible that the scales are tipping—perhaps after reading this chapter you will come to your own conclusions. Portability The RPC layer of a distributed system occupies a unique place in the software stack, existing below applications and above the operating system and system-level services. This means that spans are agnostic to both the application and the system and thus inherently portable to different applications and different platforms.

224 | Chapter 12: Beyond Spans It’s interesting to note that when spans incorporate information from both layers they can provide a unique window onto how application-level actions map onto the physi‐ cal world. A good example of this is when a span reports the IP addresses of the hosts involved in the RPC—a performance engineer might use that information to diag‐ nose why the RPC is slow. Without the information that identifies the servers, the engineer can detect the slowness, but has less data to help find the root cause. Compatibility Spans represent a fairly low-level abstraction, capturing the bare minimum of details of a request-response communication style. With just a handful of conventions for representing spans in common use, compatibility between software systems from multiple sources is straightforward, making it feasible to collect end-to-end traces across diverse components and to reuse existing visualization and analysis tools. The fact that spans have become such a basic building block of tracing is also to our advantage when converting spans from one format (say, OpenTracing) to another (say, Zipkin), because both forms broadly capture the same information. Flexibility Finally, the “killer feature” of spans is that in most popular tracing systems they are defined with a minimum of structure. You may argue that this is a drawback, leading to ambiguity and imprecision, but it also results in tremendous adaptability. If you want to reuse the span data structure to represent some activity that is not an RPC, it may be possible to do so simply by defining custom tags. The case for spans is strong, but let’s take a look at the other side of the argument. Why Spans Aren’t Enough With all these good reasons for the continuing prevalence of spans, what then is the problem? What do we mean when we say that spans aren’t a good fit for all dis‐ tributed systems, and why does this matter? Spans have prevailed because they are well suited to the dominant request-response style of distributed computation, but other communication paradigms are becoming more popular. These new systems are a lot like the old ones in that they can have complex interdependencies between com‐ ponents, and performance debugging is challenging. Distributed tracing seems like it ought to be just as valuable for the new systems, but trying to capture their behavior using spans can be awkward, as we’ll now describe. As a running example, we’ll use a hypothetical social media platform that operates out of several datacenters. It runs a microservice architecture and serves each individ‐ ual request entirely out of one datacenter, with a frontend load balancer that spreads requests evenly across the datacenters. Each request corresponds to an HTTP-based

Why Spans Aren’t Enough | 225 API endpoint, so there are requests for fetching the user’s feed, personalized recom‐ mendations and ads, and so on. On the face of it, this sounds like a perfect fit for the tree-of-spans tracing approach, so let’s look at where it falls short. Graphs, Not Trees The first problem is when an end-to-end trace is a graph rather than a tree. In other words, the flow of control contains joins (where a child span has multiple parents) as well as forks (where a parent has multiple children). As we mentioned earlier, spans don’t have a great deal of predefined structure, but one thing they do require is for each span to have no more than one parent (and a span with no parents is the root of the trace). One common cause of this behavior is when multiple RPCs are batched into a single onward RPC. Batching is often used to improve the efficiency of repeated requests to the same server—instead of sending many small messages, send a single large one. In the social media example there are many places where batching will likely be happen‐ ing in the backend. For instance, a single timeline request might fetch multiple items for the timeline in parallel. If some of those items involve retrieving data from the same servers, then the system might batch those requests into a single RPC. Figure 12-1 shows how batching produces graphs of spans, rather than trees. In the diagram, multiple incoming requests to service A are combined into a batch that results in a single outgoing request to service B. Now, what is the parent of the span representing the call from A to B? In reality, the outgoing RPC depends on all of the incoming RPCs, but we can’t represent this directly in current tracing systems. Instead, we have to find a workaround. One reasonable approach is to pick the first or the last request added to the batch to be the parent. The right side of Figure 12-1 shows the actual trace graph together with the compromise tree resulting from the workaround (the paths from X/Y/Z back to the root are elided in the diagram). Clearly, the compromise tree doesn’t correctly represent the true dependencies. For instance, the RPCs from X and Y that contrib‐ uted to the batch will appear to terminate at A, becoming leaf nodes in the trace tree. Moreover, from the point of view of the performance engineer, the latency introduced by batching may be a significant contribution to overall latency, and so the time spent waiting by each parent request should also be recorded. Once again, you can work around the lack of expressivity by adding timestamped annotations that record when items are inserted into the batch, but you will need custom processing to subse‐ quently extract and process the information.

226 | Chapter 12: Beyond Spans Figure 12-1. RPC batching leads to spans with multiple parents. Inter-Request Dependencies A generalization of the previous problem is when multiple traces (rather than spans within a single trace) are interdependent, as if Service A in Figure 12-1 batched incoming requests that belonged to multiple different traces into the outgoing request to Service B. In our example social media platform, we can image a “RemoteRepli‐ caUpdaterService” that batches unrelated updates in one datacenter in order to send fewer, larger RPCs across the wide area network to the other datacenters. Fortunately, with this batching style of inter-request dependencies, you can take advantage of the flexibility of spans to use existing tracing systems without much dis‐ ruption. Simply add a tag to the nonterminating trace that records the other TraceIDs (and perhaps also information about nontraced requests) that went into the batch. Indeed, FollowsFrom references in OpenTracing are intended to handle this use case. Another, more significant challenge for inter-request dependencies with conventional tree-of-spans tracing arises when the definition of a trace is too restrictive for the type of end-to-end type request that you are interested in. Let’s explore this scenario by assuming that our example social media platform operates a conventional tracing sys‐ tem. Its frontend server makes the decision to sample a request (i.e., to trace it) at the ingress to the set of microservices that comprise the backend serving system. This means that a trace is implicitly defined as the set of actions involved in serving a sin‐ gle HTTP request. So far, so good.

Why Spans Aren’t Enough | 227 However, if the social media company decides in the future that it wants to extend its tracing system out to clients, for example by adding tracing logic into its bespoke mobile app, then this notion of a trace suddenly doesn’t work as well. In client apps, a single user action typically involves many requests to the backend, each of which is sampled independently by the frontend server. Thus, although inside the datacenter each trace corresponds to a single HTTP-based API endpoint, from the perspective of the user, an action like navigating to the home page might involve RPCs to several different endpoints, to fetch content, ads, and profile information. To make a single trace that represents one of these user actions you could combine multiple traces after the fact, provided they are tied together with some common tag. But how do you ensure that your frontend sampler will choose all (or none) of the traces for a given user action? Perhaps you could sample on the client, but now you have devolved the sampling decision from a handful of servers in your datacenter to potentially millions of clients that may fail frequently, not have good connectivity, or be running an old version of the software. Alternatively you might consider extend‐ ing your external API to indicate to the backend that some set of RPCs belongs to a single trace, but this could leave you vulnerable to denial-of-service attacks. There are many other subtle issues at play here, but the takeaway point is that addi‐ tional complexity is inevitable when the conventional notion of a trace is insuffi‐ ciently expressive. Decoupled Dependencies Publish-subscribe (pub-sub) systems play an important role in many microservices architectures. By decoupling the writer of a message from its readers, pub-sub smooths out load spikes, allows publishers and subscribers to scale and evolve inde‐ pendently, and lets subscribers come and go without needing to notify the publishing service. A consequence of this decoupling is a dilemma for tracing, illustrated in Figure 12-2: should the loosely coupled communication via the pub-sub system be treated as a single trace or many?

Figure 12-2. How do you trace a pub-sub system?

228 | Chapter 12: Beyond Spans Let’s consider the single-trace solution, which may indeed be appropriate for some microservice systems. When the writer publishes a message to the pub-sub system, you trace it as a one-way RPC (most tracing systems have support for such messages), and propagate the trace context as normal. Subscribers pick up the context along with the message and continue propagation, giving a single, unbroken end-to-end trace that includes all your dependent services. One of the benefits of pub-sub is its loose coupling, but this can easily have the unfortunate consequence of obscuring which services are interdependent. Tracing across the pub-sub boundary solves this prob‐ lem, and also has the advantage of treating the pub-sub system like any other micro‐ service, giving visibility into how the pub-sub system itself fits into the overall ecosystem of microservices. However, this model isn’t necessarily the right choice. Turning again to our social media platform, let’s assume its “RemoteReplicaUpdaterService” uses a pub-sub sys‐ tem to asynchronously disseminate state updates to storage replicas in its other data‐ centers. Here, the single-trace approach becomes awkward: There may be a large number of subscribers for each publisher, different subscribers to the same publisher may be involved in unrelated activities that don’t belong in the same trace, and there can be a delay between writing and reading a message. You can handle these issues by starting a new trace for each message read by a sub‐ scriber, perhaps retaining a reference to the message writer’s TraceID as a tag in the new trace. But now, even though the tracing mechanisms themselves are straightfor‐ ward, you’ve made it more complicated to track end-to-end dependencies. Returning to our example: maybe the social media platform also runs a second pub- sub system that it configures with much tighter delay bounds in order to propagate user-visible changes in a timely fashion. When a follower is blocked, it’s important that the change appears as quickly as possible in every copy of the follower graph. End-to-end tracing through the pub-sub system may be an important debugging tool for these types of requests, and now someone has to choose which mode of tracing to adopt, or else configure the two pub-sub systems differently. The tracing dilemma for decoupled dependencies has become an operational burden. Distributed Dataflow In the last 15 years or so, data-parallel, distributed execution frameworks like Hadoop and Spark have become tremendously popular. In contrast to request-oriented sys‐ tems, distributed tracing for these frameworks is almost nonexistent, with developers relying on metrics, logs, profilers, and debuggers to monitor performance and solve problems. Why is this? The answer is not clear, especially given that a framework like Spark is more homogeneous and self-contained than most microservice architec‐ tures, and thus potentially less effort to instrument for tracing.

Why Spans Aren’t Enough | 229 However, there are other challenges: distributed dataflow is designed to operate at very large scales and as a result a single job could run for hours (or longer!) and may comprise millions of RPCs. The control plane, which includes services like the cluster scheduler and resource manager, can play a significant part in the end-to-end com‐ pletion time of a job. Should you instrument the control plane for tracing also? Finally, the distributed dataflow paradigm executes programs as directed by acyclic graphs, and as discussed earlier, graphs are not well supported by existing tracing systems. Streaming systems, often built over the top of distributed dataflow systems, offer yet another challenge to the spans-oriented tracing model. How should you define the concept of “request” for processing that runs over a continuously arriving, infinite stream of data? Perhaps you should consider a different core abstraction, like tracing how specific data items are passed between processes, read, and mutated (this is known as data provenance in the academic community)? Alternatively, you could for‐ get about tracking an individual activity or piece of state, and instead just snapshot the execution of the entire system during a fixed time window. New styles of trace will require new tools for analysis and visualization, so you would have to start from scratch to extract value from your traces. Machine Learning Machine learning (ML) systems comprise yet another style of distributed computing, with performance analysis and debugging challenges for which existing distributed tracing systems have no out-of-the-box solutions. Because of this, ML systems tend to have their own custom tracing tools, such as the trace viewer built into the Tensor‐ Flow TensorBoard visualization toolkit. For service owners that run machine learning as just one part of their infrastructure (say, to predict the best ad to serve for a specific web page request), this has the drawback that the ML systems operate outside the reg‐ ular observability framework, which is especially undesirable when the ML is production-critical and resource-intensive. Although there are many types of ML models (logistic regression, support vector machines, deep neural networks, etc.), they have in common (at least) two ways that RPC-based tracing is not a good fit: Steps, not requests Training a machine learning process involves repeated steps, in which operators —schedulable units of work like matrix multiply, or the reading of a file—com‐ pute over the current state of the training data and produce updates. At the end of each step, the operators exchange updated values. Although you can think of each operation within a step as being a span, causally related to its parents and children by its input and output, respectively, there is no conceptual equivalent of an end-to-end request. Instead, because a key metric is the time taken per step,

230 | Chapter 12: Beyond Spans ML tracing tools typically provide the ability to look deeply into the concurrency and scheduling of operators within a single training step. In contrast, ML inference has a more request-oriented flavor. However, to improve efficiency, inference models batch requests (similar to the situation we showed in Figure 12-1), leading to the issue of inter-request dependencies that we discussed earlier. Multilayer performance monitoring Hardware acceleration is often a key component of machine learning deploy‐ ments. Particularly during training, the movement of data between devices over the PCI bus and/or the network can be a major bottleneck, and whether an oper‐ ation is scheduled on CPU or GPU can affect its running time significantly. As a result, to debug performance problems in a distributed ML job, you need visibil‐ ity into how the runtime schedules work on each server individually, as well as across the cluster. In addition, different ML models have diverse characteristics—for example, lan‐ guage models often use large tables, known as embeddings, that can be hundreds of gigabytes and hence have to be shared across multiple devices. However, the model may only access a handful of rows in each step, leading to complicated scatter-gather communication patterns. Similarly, models that rely on AllReduce, an efficient group communication primitive for updating state across many nodes, may be highly sensitive to its implementation specifics, which in turn depend on the available device and network hardware. As a consequence, distributed tracing for ML systems must incorporate data not only at the level of RPCs (or AllReduce operations) between computers, but also at the level of the low-level behavior and causal dependencies within each computer. In fact, the second of these problems is more general—metrics from components underneath the RPC layer are critical to understanding the performance of requests, as we explain next. Low-Level Performance Metrics Performance debugging is often about detecting problems by looking across the entire system, followed by zooming in deep to diagnose the cause. Distributed tracing is great for the first task, especially when using aggregate analysis of traces, as we dis‐ cuss in Chapter 11. On the other hand, when it comes to explaining a slow RPC that might be caused by a low-level issue like a VM hiccup, suboptimal thread scheduling, or network packet loss, span-oriented tracing is not a good fit. Low-level events are usually not caused by the activities on behalf of any one request, but rather the set of

Why Spans Aren’t Enough | 231 all things currently running on the server. So in some sense, attribution to a specific trace isn’t terribly meaningful. Another problem with incorporating fine-grained performance metrics is timescale. RPCs happen quickly—very quickly. Tracing systems use microsecond or even nano‐ second timestamps for good reason. Yet, for reasons of scale and performance, observability systems typically aggregate metrics every minute, or at best every few seconds. These timescales are orders of magnitude different! The trend to move performance-critical functionality from software into hardware is another challenge for span-oriented tracing systems. We already mentioned the prev‐ alence of accelerators for ML. Another increasingly widespread example is the use of remote direct memory access (RDMA) for distributed applications with stringent network performance requirements. Meanwhile, the research community is advanc‐ ing the state of the art exploring how to use programmable network switches for complex tasks like executing consensus protocols.1 How to associate low-level hard‐ ware counters with the high-level concept of a trace is an open problem, as is how to surface this information in a way that is useful for performance debugging of slow requests, or for the aggregate analysis of multiple traces. But what about metrics from within the application, like requests per second? If your service is experiencing high sustained load, could this be a useful indicator as to why an RPC is slow? Well, maybe! Perhaps the high load is occurring now, but won’t start impacting requests until later (e.g., once the queues are full). Perhaps the high load is having an adverse effect, but the trace sampling just so happens not to select any of the suffering requests. This is not to say that system metrics together with traces can’t be useful for perfor‐ mance debugging. Indeed, we talked about the idea of a unified observability plat‐ form in Chapter 6 and pointed out some of its advantages. However, the alliance of spans and metrics is an uneasy one that must be handled with care. New Abstractions We’ve described a selection of the issues that arise when spans are not the right abstraction for the task at hand. In this section we’ll discuss some of the changes afoot to help overcome some of the shortcomings of the tree-of-spans approach to dis‐ tributed tracing. The tracing community has been well aware of the problems with spans for years; there is a great deal of fascinating, and often quite nuanced, discussion in various online forums. Efforts are underway to address some of the more pressing problems

1 [Dan15]

232 | Chapter 12: Beyond Spans by evolving the specifications for OpenTracing and OpenZipkin, for example sup‐ porting multiparent spans by means of the OpenTracing tag FollowsFrom. OpenCensus (now part of OpenTelemetry) is one of the “new wave” of tracing para‐ digms. It supports the propagation of tags (not to be confused with the explicit tags we write into spans) between services, along with the usual trace context information like the TraceID. Tags act like labels, which the system associates with metrics collec‐ ted at each node in the trace tree and subsequently aggregated. This is pretty neat: a tag like originator:photo-app, combined with a measure like vm_cpu_cycles, lets you see the end-to-end CPU costs for requests coming from photo-app, and more‐ over, to compare them with the CPU costs for requests coming from a different origi‐ nator, say, video-app. Essentially, OpenCensus gives you a way to associate metrics with traces. It doesn’t fix the problem we described earlier of precise attribution of low-level metrics to a par‐ ticular request, so you still need to choose your metrics carefully. Some types of meas‐ ures (like counting CPU cycles on a thread while it’s doing work on behalf of exactly one request) are well suited to this use, others (like network link utilization) are not. We will describe OpenCensus in more detail in “Census” on page 241. Production tracing systems must use caution when adopting new abstractions so as not to disrupt existing infrastructure. However, the research community does not have such constraints and has explored some interesting approaches. For example, to mention an idea that is particularly well suited to systems like streaming distributed dataflow, rather than attempting to correlate related events on different machines, collect performance data from every machine in the cluster at the same time. In theory, you could do this by firing off a sampling profiler, or a kernel trace, to run for 30 seconds (say) on every server at the same time. But then you would have to back‐ haul all the profiles and run an expensive computation to merge them after the fact, and in a large cluster the cost of doing this could be prohibitive. An alternative approach is to push the trace analysis into the same cluster that is being traced. This idea has been at the heart of various research systems, for example the Fay system published in 2011.2 In Fay, the user provided a declarative query that the system parsed and turned into dynamic instrumentation, customized to just that query. It used a runtime module on each server to filter and aggregate the outputs from the instrumentation before uploading the results for further processing (such as for aggregation and visualization). Dynamic instrumentation would be a bold undertaking in a production cluster, espe‐ cially when the cluster supports multiple tenants and resource isolation is important. Perhaps inspired by the ideas in Fay, you can imagine a rolling buffer of locally

2 [Erl11]

New Abstractions | 233 collected trace data, sampled, filtered and even joined on the spot with data from other servers on demand. In Chapter 13, we’ll describe another research system, Pivot Tracing, that does exactly this. Could such distributed-tracing-plus-analysis techniques be combined with the famil‐ iar trace context propagation mechanism to enable a different form of distributed tracing? Or will some other approach be sufficiently compelling that it finds wide‐ spread adoption? It’s hard to tell at this point where we are headed, but there’s no doubt that exciting times are ahead for distributed tracing. Seeing Causality For users of distributed tracing, one of the most important questions to answer is “Why was this request slow?” The answer is hopefully contained within a trace, or in how it differs from comparable traces. But sometimes you cannot find the answer just by looking at traces: you can see there was a problem, and you may even see which service was unusually slow, but there are no obvious clues as to the cause. One explanation is that the tracing instrumentation is insufficient to debug the par‐ ticular problem. For instance, if you don’t record when a timeout expires, then you may not be able to tell when that timeout causes an error response. But in other cases, your trace was slow because of external factors. If a server experiences a transient period of CPU overload, for example, and the thread processing your request simply doesn’t get enough CPU time to complete the work, then you will not capture this in a trace. Another tricky scenario is if your request is blocked waiting for a lock—per‐ haps a lock in a third-party library that you don’t even know existed—and unless you instrument the wait and hold times for synchronization primitives, which would add overhead to a performance-critical operation, then you simply don’t have the visibil‐ ity to diagnose such a problem. These examples illustrate the limitations of the visibility you get from traces alone. Fortunately, you may be able to correlate the slowness in your trace with logs and metrics (see Chapter 7). Because tracing shows you where and when to look, it’s feasi‐ ble to use the other observability pillars to diagnose the problem. As long as the full tree of causal relationships is captured in the trace, you can take advantage of other data sources to fill in the visibility gaps when required. But what about when causality is not tracked completely? If you rely on components in your distributed system that don’t support tracing (in other words, that don’t prop‐ agate the trace context and/or record spans) then you may not even be aware that you have a dependence on them. This may sound like poor operational practice, but it’s surprisingly easy for this to happen with complex, distributed systems running a vari‐ ety of in-house and open source software.

234 | Chapter 12: Beyond Spans The research world has proposed ways to infer causality just from the black box behavior of a distributed system. Back in 2003, Project5 explored whether the timing of messages between components could reveal causal relationships, especially for nes‐ ted RPCs.3 More recently, the Mystery Machine takes a different tack, applying hypothesis testing about program behavior to the information in large numbers of system logs in order to deduce a causal model of system behavior.4 These ideas are promising, and perhaps in future we will use such techniques to automatically fill in missing dependencies in traces. Returning to the question “Why was this request slow?,” we have discussed how miss‐ ing instrumentation or dependencies in traces can make the question hard to answer. A different challenge for tracing, that is very common in practice, is when the root cause originates outside the scope of visibility (or perhaps even the lifetime) of the request suffering the slowdown. Traces capture direct dependencies, but sometimes the problem comes from indirect dependencies. Let’s look at the shared queue shown in Figure 12-3, where the expensive request A is at the front of the queue, slowing down requests B and C which are queued behind it. The queue introduces an ordering of requests that can result in head-of-line blocking. Typically you wouldn’t instrument fine-grained operations like enqueue and dequeue because of the overhead, so how do you detect when the queueing delay is adversely impacting some requests?

Figure 12-3. In a shared queue, request A slows down requests B and C.

Once again, the tree-of-spans tracing model doesn’t easily express what’s going on here. There is an implicit dependency between requests that is temporal rather than functional. Requests may interfere if one of them is slow, but if the queue never backed up, there would be no dependency between them. There is some exciting research work in this direction (including, for example, the recent Zeno project that describes this idea using the term temporal provenance),5 but detecting and diagnos‐ ing such interference from sequencing remains an open problem. In this chapter we’ve looked at the kinds of distributed systems for which the tracing model of a tree of spans is not well suited, along with a taste of some new abstractions on the horizon.

3 [Agu03] 4 [Cho14] 5 [Wu19]


CHAPTER 13 Beyond Distributed Tracing

At the very beginning of this book, in the Introduction, we argued that most applications today are distributed in some fashion, whether as simple client-server applications or more general architectures like microservices. Distributed architectures give clear benefits, especially with scalability, reliability, and maintainability. The biggest drawback, however, is that distributed architectures break traditional methods of profiling, debugging, and monitoring. Such methods were designed to capture information in a component- or machine-centric way (because they were designed when applications only ran on a single machine). By contrast, in distributed architectures, we care about end-to-end executions of requests across multiple components and machines. Traditional methods aren't enough for distributed architectures because they lack visibility: they weren't designed to correlate and combine events across multiple components and machines.

This is the point of distributed tracing—to meet the profiling, debugging, and monitoring needs of modern, distributed architectures. Distributed tracing is designed for distributed architectures and addresses the key challenges of incoherence, inconsistency, and decentralization that we described in the Introduction. With distributed tracing, you gain visibility across your entire stack. Distributed tracing gives you a way to profile, debug, and monitor your distributed applications, where previously that was tremendously difficult.

Today, distributed tracing has become a de facto component in modern distributed applications. It is the most popular and best-established tool for profiling, debugging, and monitoring requests. There are multiple open source initiatives and implementations, several of which we have discussed throughout this book. Clearly, distributed tracing has proved its worth.

With all this said, distributed tracing is not the only way to gain visibility into your distributed application. In fact, just as we discussed some of the core distributed tracing

237 history in Chapter 10, there are several interesting projects that explore different (though, not entirely dissimilar) approaches to profiling, debugging, and monitoring. In Chapter 3, we touched upon one: Census (open-sourced as OpenCensus, and today forming the metrics component of OpenTelemetry). The way Census captures metrics is very similar to distributed tracing, but there are also important differences between the two. In this chapter we will examine Census in more detail, along with two other projects: Pivot Tracing, a 2015 research project from Brown University; and Pythia, a 2019 research project from Boston University. These projects tackle similar sorts of prob‐ lems to distributed tracing and they also make use of similar underlying techniques. In particular, they all use context propagation—albeit not always in the same way as distributed tracing. This is not entirely surprising, given that a lack of cross- component context was the main shortcoming of previous approaches. We’ll also make sure to describe some of the important motivations and design choices made by these three tools, which differ from those of distributed tracing. Limitations of Distributed Tracing When you deploy distributed tracing, and instrument your systems, you have to make a number of practical choices. It’s not always obvious what the right choices are when you’re doing instrumentation. Sometimes there might not be an obvious best choice. This is why distributed tracing can be quite difficult to get right, and why we need to resort to best practices when doing instrumentation. In Chapter 1 we introduced three fundamental problems that generally occur when you’re using distributed tracing. Let’s briefly recap these: Generating trace data Choosing where in a program useful data exists, and instrumenting the applica‐ tion to record it Collecting and storing trace data Deciding under what circumstances trace data should be emitted, and how to route it from its origin to the tracing backends Extracting value from data Using traces to profile, monitor, and debug your application in a meaningful way Even if you make all the right choices, you might still encounter unavoidable limita‐ tions. In data generation, you need to be able to predict what kind of problems might occur in the future and what data should be recorded to help problem diagnosis. On the collection and storage front, you need to balance sampling only a fraction of traces, paying high computational costs, and getting enough data to be meaningful for problem diagnosis. Lastly, extracting value from traces is up to you: you are

238 | Chapter 13: Beyond Distributed Tracing responsible for identifying and debugging problems. Distributed tracing just provides you with some data to help—and because of the first two problems, that data might be misleading, redundant, or incomplete. Challenge 1: Anticipating Problems When you deploy distributed tracing, it’s up to you to decide what data to record— that is, what parts of the program to instrument with spans, and what additional tags and annotations to add to those spans. At first glance this doesn’t seem too tricky, because for most applications it’s easy to identify the most important parts of the code. For example, you would almost certainly want to wrap all of your RPCs in spans to measure response latency and status codes. Beyond some of the obviously important high-level parts of your application, there are a wide range of instrumentation choices you can make. Should you break down the high-level RPC span into multiple child spans representing each stage of execu‐ tion? Should you intricately instrument your caches and resource consumption? Should you augment your traces with additional log annotations, to provide more context about what’s happening during a span? Your high-level goal is to instrument the things that are going to be the most useful. But you can’t instrument everything, both because of the time-consuming nature of doing instrumentation and more fun‐ damentally because your application needs to cope with the volume of spans being emitted. Computational costs are a driving challenge of distributed tracing, as we dis‐ cuss in Chapter 6. In the worst case, it might be impossible to predict a priori where to put your instru‐ mentation, because nobody can perfectly predict where and how problems might arise. On the one hand, the point of distributed tracing is to be able to investigate unexpected behaviors and debug problems, so you’d hope that traces contain useful information for diagnosing problems when they do arise. On the other hand, some‐ times you’ll be unlucky, and the information you need to diagnose a problem just might not be present in traces, or even in other data sources like system logs. When you don’t have visibility of a problem’s root causes, then diagnosing the prob‐ lem becomes a tedious and time-consuming task. If you want to add new instrumen‐ tation, in the hopes of shedding more light on the matter, you’d have to go back to your code to do it. This is fundamentally slow, because deploying new instrumenta‐ tion is part of the development path of applications—getting that new instrumenta‐ tion into the production system might take a while. In the worst case this is a repeated process—for example, how often do you get print debugging right the very first time? Of course, it isn’t all doom and gloom. What we’re describing here is an exceptional case for distributed tracing. In the grand scheme of things, the instrumentation you’ll have in place will be enough to solve most problems most of the time. But what if you do want to focus on these unanticipated problems? Two of the tools we’ll be looking at

Limitations of Distributed Tracing | 239 in a moment—Pivot Tracing and Pythia—target this use case in particular. Pivot Tracing and Pythia are a lot like distributed tracing, but their main goal is to home in on problems that existing instrumentation might miss. Challenge 2: Completeness Versus Costs Computational costs are a never-ending battle, as you saw in Chapter 6. Computa‐ tional costs shape your instrumentation choices and arise from multiple places in the tracing pipeline:

• The critical path of requests to generate trace data
• Background threads and processes that receive and buffer local trace data
• Network transmission of trace data to the tracing framework's backends
• Processing and storage by the tracing backends once trace data is received

The main way distributed tracing mitigates computational costs is by sampling. Sam‐ pling is simple but effective: you only have to pay computational costs when you actually trace a request, so sampling is a configuration knob that lets you reduce com‐ putational costs by tracing fewer requests. The most common sampling method is also the simplest: uniform random sampling, decided at the very beginning of a request. If a request isn’t sampled, no trace data is generated at all. Computational costs are one of the most important factors for profiling, debugging, and monitoring—especially for tools that run in production systems. A central tenet of profiling, debugging, and monitoring tools is “do no harm.” Distributed tracing is no exception, and neither are any of the other tools we’ll talk about. Unlike dis‐ tributed tracing, Census, Pivot Tracing, and Pythia take different approaches to deal‐ ing with overheads, and don’t solve the problem using sampling. Challenge 3: Open-Ended Use Cases The value of distributed tracing doesn’t arise from individual spans or annotations in isolation—its value is in combining all of a request’s data coherently from start to fin‐ ish. This is why distributed tracing frameworks sample traces on a per-request basis. They sample either all of the trace data for a request or none of it. This is called coher‐ ent sampling. Coherent sampling is necessary for distributed tracing. Think about its original goal: to help relate information from multiple points in a request’s execution, especially across different machines and components. It would be no good if dis‐ tributed tracing arbitrarily recorded just bits and pieces of a request! When you instrument your systems, you instrument the spans and annotations that you think will be the most useful for the future—for example, latency and metrics of important top-level spans; annotations at critical parts of our program; and as described earlier, information to anticipate future debugging needs. Traces can end

240 | Chapter 13: Beyond Distributed Tracing up containing a lot of data! For example, Facebook described in its Canopy paper how the volume of data in a single trace can become overwhelming, due to multiple different users and use cases all feeding data into the same traces.1 Fortunately, sampling gives you a convenient way to balance computational costs. For example, you can now increase the amount of detail that goes into a single trace, and simply reduce your sampling probability to average out the total cost across all requests. In their Dapper paper,2 the authors from Google commented about how sampling was very useful, as it enabled them to capture very detailed traces. Not all tools for profiling, debugging, and monitoring need to capture so much detailed information. Part of why distributed tracing does record detailed informa‐ tion is because the use case is so open-ended—record all the data now, store it some‐ where, and make it available for whatever future use case you have in mind. By contrast, Census, Pivot Tracing, and Pythia have more specific use cases in mind, which enables them to record less speculative data up front. Other Tools Like Distributed Tracing We’ll now take a look at three tools that are similar to distributed tracing. First, Cen‐ sus is an internal tool from Google whose primary focus is on cross-component met‐ rics; Google released an open-source version called OpenCensus in 2018. Next, we’ll look at Pivot Tracing, a research project published in 2015 by researchers from Brown University. Pivot Tracing focuses on diagnosing recurring cross-component problems quickly, by using dynamic instrumentation. Last, we’ll look at Pythia, a research project from Boston University introduced in 2019. Pythia focuses on auto‐ matically finding and turning on useful instrumentation when problems arise: that is, finding and dynamically enabling the instrumentation needed to explain a problem. Census Census started as an internal project at Google for collecting cross-component metric data from Google services. Google never published anything about Census in detail, but in 2018 it released an open source version called OpenCensus. Soon afterward, OpenCensus was merged with OpenTracing to become OpenTelemetry. You can read a bit more about this history in Chapter 3 as well as about the relationship between OpenCensus and distributed tracing. Our focus here is specifically on the metrics part of Census.

1 [Kal17] 2 [Sig10]

Other Tools Like Distributed Tracing | 241 Just like distributed tracing, the fundamental motivation for Census is to capture and relate information across multiple machines. In this case, the focus is metric data, which we’ll explain with an example. A Motivating Example Suppose you have two frontend APIs, A and B, as well as a number of other inter‐ mediate services that A and B call. Figure 13-1 illustrates these services. Those inter‐ mediate services might themselves make calls to other intermediate services, but eventually, requests end up querying a backend database, DB. Both frontend APIs— that is, both A and B—eventually result in calls to DB, but A and B aren’t the services directly making those calls.

Figure 13-1. Two frontend services, A and B, interact with a database indirectly through several other services.

There are a few useful questions you might want to ask in this scenario. If you’re a developer of A or B, you might want to know things like how many database calls each request makes, or how much of your request’s overall latency is spent at the data‐ base. If you’re a developer or operator of the database, then you might want to know which frontend APIs are using the database the most, so that you can better attribute costs or put together an SLA. There’s just one problem: since A and B don’t call to the database directly, nobody has the means to answer these questions by themselves. In a standalone application, it would be straightforward to drill down into these sorts of metrics. But in a dis‐ tributed application, the information is unavailable by default.

242 | Chapter 13: Beyond Distributed Tracing A Distributed Tracing Solution? Distributed tracing would provide one possible solution to this problem. Figure 13-2 depicts a trace of API A in this setup. Certainly, the trace would contain all the neces‐ sary information:

Figure 13-2. A trace of API A will record the number of database calls and the latency of each.

The top-level span tells you that this request came through API A.

Lurking way down in the child spans will be spans for database calls, telling you both the number of database calls and the latency of each call.

Also included in the trace will be detail from every service that is invoked along the way. Through straightforward postprocessing you can extract aggregate statistics to answer some of the questions mentioned earlier. However, there’s a big drawback with using distributed tracing here. You’ll inevitably be sampling distributed traces rather than tracing every request, because every service invoked adds extra detail to the trace, meaning extra overhead. You wouldn’t be able to capture traces for 100% of requests because of the huge computational overheads that would introduce. When it comes to metrics, you’ll often be interested in outliers—how does your appli‐ cation behave in rare but important edge cases? Time is often spent diagnosing requests with outlier latency, in the 99th percentile and above. By sampling traces, you will miss many of the most important requests for diagnosing problems. It might give you the false impression that there are no high-latency outliers at all.
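To make the sampling concern concrete, here is a quick back-of-envelope calculation in Python. The traffic volume, outlier fraction, and sampling rate are invented numbers, chosen only to illustrate the effect.

    # Back-of-envelope: how many p99 outliers does head-based sampling capture?
    # All numbers are hypothetical, for illustration only.
    requests_per_hour = 10_000_000
    outlier_fraction = 0.01        # requests above the 99th percentile
    sampling_rate = 0.001          # 0.1% uniform head-based sampling

    outliers = requests_per_hour * outlier_fraction
    expected_sampled_outliers = outliers * sampling_rate
    print(f"{outliers:.0f} outliers/hour, ~{expected_sampled_outliers:.0f} of them traced")
    # -> 100000 outliers/hour, ~100 of them traced; 99.9% of outliers are never seen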

Tag Propagation and Local Metric Aggregation

Census addresses this specific use case, and it captures metrics for 100% of requests. Census doesn't record individual traces at all, so it doesn't use sampling to deal with computational costs. The key to Census is context propagation. Like distributed tracing, Census propagates contexts with requests, but those contexts don't include TraceIDs. Instead, contexts contain tags that describe user-selected request properties. At any point during the request, any service can write a tag to the Census context. From the preceding example, API A and API B can write tags of "API A" or "API B" respectively. Those tags then get forwarded with the request, inside the Census context, to any child services that get called.

As well as tags, any service along the way can record a metric. In our example, the backend database might emit a simple count of API calls, and perhaps also the latency of each call. Whenever a component emits a metric, Census will inspect the request's tags in the Census context, then increment counters on a per-tag basis. Counters are maintained locally (in our example, at the database) and only aggregated metrics get reported, periodically, to the Census backends.

Census doesn't capture individual traces at all, because metrics get immediately aggregated within the application (e.g., by the backend database in our example). In the preceding example, the database would record separate counters for APIs A and B. In general, these counters get grouped arbitrarily based on whatever tags are present in the Census context. If you implemented a third frontend API, you could simply start propagating the tag "API C," and the database would automatically group counters for API C.

Doing local aggregation avoids all overheads of generating and reporting individual traces. As a result, Census can propagate tags and record metrics for every request. Census is particularly useful for diagnosing uncommon and outlier requests, which might be missed by low sampling rates. On the other hand, Census's main limitation is that it cannot drill down and inspect individual requests. A second concern about Census is the overhead of propagating tags: the bigger and more complex the system, the more tags you might want to propagate. Census simply limits the tags a request can carry to 1,000 bytes. This is a new source of overheads that distributed tracing doesn't really have (if we exclude baggage for now). In the next chapter, we talk about context overheads in more detail.
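To make the mechanism concrete, here is a minimal, self-contained Python sketch of the idea. It is not the OpenCensus API; the class and method names are invented, and it models only the two pieces described above: a tag map that travels with the request context, and per-tag counters that are aggregated locally and flushed periodically.

    from collections import defaultdict

    class Context:
        """Propagated with every request; carries tags instead of a TraceID."""
        def __init__(self, tags=None):
            self.tags = dict(tags or {})

    class LocalMetrics:
        """Lives inside one service (e.g., the database); aggregates in-process."""
        def __init__(self):
            self.counts = defaultdict(int)
            self.latency_sum_ms = defaultdict(float)

        def record_query(self, ctx, latency_ms):
            key = ctx.tags.get("originator", "unknown")   # group by propagated tag
            self.counts[key] += 1
            self.latency_sum_ms[key] += latency_ms

        def flush(self):
            """Periodically ship only the aggregates to the metrics backend."""
            return {k: (self.counts[k], self.latency_sum_ms[k]) for k in self.counts}

    db_metrics = LocalMetrics()

    # Frontend A tags the request; intermediate services just forward the context.
    ctx = Context(tags={"originator": "API-A"})
    db_metrics.record_query(ctx, latency_ms=4.2)                       # recorded at the database
    db_metrics.record_query(Context({"originator": "API-B"}), latency_ms=9.8)

    print(db_metrics.flush())
    # {'API-A': (1, 4.2), 'API-B': (1, 9.8)} -- every request counted, no sampling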

244 | Chapter 13: Beyond Distributed Tracing Comparison to Distributed Tracing Figure 13-3 illustrates the distributed tracing approach compared to the Census approach. There are a few key differences:

Figure 13-3. A comparison of distributed tracing versus Census.

Distributed tracing emits span data at every component visited by a request, for every request.

Distributed tracing propagates TraceIDs with requests, so that backends can combine spans from the same request.

Distributed tracing emits span data at every component, potentially leading to large traces.

Distributed tracing backends are responsible for extracting metrics from traces, if we want metrics.

Census propagates tags with requests instead of TraceIDs.

Census only emits metric data.

Census locally aggregates metrics.

Census only emits metric aggregates to the Census backend, rather than per-request measurements.

Census is very similar to distributed tracing, in that it propagates contexts with requests so that it can combine data across component and machine boundaries. However, it is not as open-ended as distributed tracing. Census is narrower in scope: it focuses on aggregating metrics, grouped by cross-component tags. Aggregated metrics are an important use case, especially for understanding outliers like 99th-percentile tail latency. Census has directly influenced the inclusion of metrics in OpenTelemetry today.

While Census is not as open-ended as distributed tracing, it does achieve complete visibility of all requests where tracing typically does not. Local aggregation means it can cheaply record metrics for every request. Census doesn't have to sample requests—it records metrics of every request; by contrast, distributed tracing needs sampling to reduce overheads. Like distributed tracing, when you're instrumenting your system, it's up to you to choose which tags to use and which metrics to capture.

Pivot Tracing

Pivot Tracing is a research project from Brown University published in 2015.3 Pivot Tracing is a lot like Census, in that its core goal is to extract aggregated metrics from distributed applications. Pivot Tracing is designed to help diagnose unanticipated problems in distributed applications on the fly, by using a technique called dynamic instrumentation. Like Census, Pivot Tracing aggregates data directly at the source rather than generating individual traces of requests. It correlates data across components by propagating contexts, but goes a step beyond the tags used by Census.

Dynamic Instrumentation

When you use distributed tracing, you hardcode instrumentation into your applications, in much the same way that traditional logs and metrics are hardcoded in standalone applications. Dynamic instrumentation is an alternative approach to hardcoding the instrumentation. In a standalone setting, the best-known examples are DTrace, SystemTap, and eBPF. Rather than hardcoding instrumentation at development time, dynamic instrumentation frameworks let you inject code into running programs, without having to recompile or redeploy the program.
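Real dynamic instrumentation frameworks such as DTrace, SystemTap, and eBPF operate at the kernel or runtime level; as a rough analogy only, the following pure-Python sketch attaches timing instrumentation to an existing method at runtime and detaches it afterward. All names here are invented.

    import time, functools

    class Database:
        def execute_query(self, q):
            time.sleep(0.01)          # stand-in for real work
            return f"results for {q}"

    def attach_timer(cls, method_name, sink):
        """Wrap an existing method at runtime, without editing its source."""
        original = getattr(cls, method_name)

        @functools.wraps(original)
        def wrapper(self, *args, **kwargs):
            start = time.perf_counter()
            try:
                return original(self, *args, **kwargs)
            finally:
                sink.append((method_name, time.perf_counter() - start))

        setattr(cls, method_name, wrapper)
        return lambda: setattr(cls, method_name, original)   # "detach" handle

    durations = []
    detach = attach_timer(Database, "execute_query", durations)
    Database().execute_query("SELECT 1")
    detach()                                   # remove the instrumentation when done
    print(durations)                           # [('execute_query', 0.01...)]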

3 [Mac15]

Recurring Problems

By default, Pivot Tracing records nothing at all. It only injects instrumentation code into your running application when you ask for it to be installed. It thereby targets active debugging, where there is a persistent problem in the system, and you are actively investigating the problem by turning instrumentation on and off.

How Does It Work?

Let's go back to our earlier example depicted in Figure 13-1. In this example, we did two things. First, at the database, we counted database queries and recorded query latency. Second, at the frontends, we were using the API name (A or B) as a tag. Census would then aggregate the query latency, grouped by API name.

You can do the same thing in Pivot Tracing, in roughly the same way. Pivot Tracing provides a more generalized query language for expressing these aggregates. Concretely, to record just the database query latency, you would use the following query:

    FROM q in DB.ExecuteQuery
    SELECT q.duration, COUNT

This query refers to DB.ExecuteQuery, which is the source code database method that executes queries. ExecuteQuery is an example of a tracepoint—a location in the application source code where Pivot Tracing can run instrumentation. The query causes the database to aggregate the ExecuteQuery method duration every time the method is invoked, as well as a counter. Like Census, these aggregations happen locally at the database, rather than reporting data for every request.

This query accounts for measuring database metrics, but our goal was to also group those metrics by the frontend API type. To do this, we would expand our earlier query to refer to tracepoints for API A and API B, which we will collectively refer to as FrontEnd.HandleRequest.

    FROM q in DB.ExecuteQuery
    JOIN r in FrontEnd.HandleRequest ON r -> q
    GROUPBY r.apiName
    SELECT r.apiName, q.duration, COUNT

The -> symbol indicates a happened-before join, a special query operator introduced by Pivot Tracing. This query operator simply means that FrontEnd.HandleRequest has to happen first, then later in the request DB.ExecuteQuery happens. Pivot Tracing will record apiName when the request passes through the HandleRequest method, add it to the Pivot Tracing context, and propagate it along the execution. Then when the request reaches ExecuteQuery at the database, Pivot Tracing will emit the duration of ExecuteQuery, grouped by apiName in the Pivot Tracing context.

The happened-before join is Pivot Tracing's way of formalizing causality and context propagation. In essence, a happened-before join between two tracepoints indicates that information from the first tracepoint should be propagated to the second. In our earlier example, all we're doing is propagating a tag from HandleRequest in the frontends to ExecuteQuery at the database.

Pivot Tracing extends beyond just simple single relations. Queries can refer to multiple tracepoints, with multiple different happened-before joins. Pivot Tracing also supports a range of standard query operators such as unions, selection, projection, aggregation, and groupby. Multiple queries can run side by side without interference, and queries can be nested.

Dynamic Context

For any query that uses a happened-before join, Pivot Tracing needs to propagate data from an initial tracepoint to a later tracepoint. Depending on the query, the exact data that gets propagated varies. In the preceding example, the data was simply a string tag containing the apiName—concretely, it would be either API A or API B. More generally the data could be a set of tuples, partially aggregated data, grouped data, and more.

To support this, Pivot Tracing makes use of a general-purpose, dynamic context that the authors termed baggage. You've already heard the term used earlier in this book; Pivot Tracing introduced it for arbitrary metadata propagated with requests. Today, baggage has been adopted by distributed tracing frameworks to refer to arbitrary key-value pairs.

Like Census, Pivot Tracing avoids substantial overheads by aggregating as much data locally as possible. When a user writes a query, that query is optimized to perform things like filters and aggregations at the earliest possible tracepoint. Nonetheless, baggage size is a concern for Pivot Tracing, and in general, if a tool propagates arbitrary metadata with a request, then it needs to be careful not to propagate too much! We go into this in more detail in the next chapter.

Comparison to Distributed Tracing

Pivot Tracing is much more similar to Census than to distributed tracing, but all three share a few commonalities. All three tools propagate contexts with requests, so that they can combine data across component and machine boundaries. Both Census and Pivot Tracing aggregate data as close to the source as possible, so they can be complete where distributed tracing is not. Unlike Census and distributed tracing, Pivot Tracing is the first tool suitable for unanticipated problems, as dynamic instrumentation lets developers interactively insert and remove new instrumentation to manually get to the root cause of problems. Pivot Tracing is more open-ended than Census, because it supports a broader set of operations than just tags and aggregations. However, it doesn't provide data as rich as distributed tracing does.

Pivot Tracing is a research project, with an open-source implementation, but there are a few unresolved challenges. First and foremost is managing the overheads and security of using dynamic instrumentation. However, with the growing use of eBPF in production systems, we may see more Pivot Tracing-style tools proliferate in the future.

One alternative way to think about Pivot Tracing is to compare it with distributed tracing backends. In Chapter 10, we described a tracing use case where we aggregate high-level metrics across many traces. Facebook's Canopy is centered around this use case. Whereas a distributed tracing system performs these aggregation queries in the tracing backends, Pivot Tracing "optimizes" these queries by pushing their execution all the way to the original data source.

Pythia

Pythia is a research project from Boston University published in 2019.4 It is more closely related to distributed tracing than to Census or Pivot Tracing. Like Pivot Tracing, Pythia is intended to help diagnose unanticipated problems, by dynamically changing the instrumentation in running systems. The overall output from Pythia is, in fact, distributed traces; however, the data contained in the traces is the set of data most able to explain a performance problem.

Performance Regressions

Pythia's use case is to automatically find explanations for differences in request performance. That is, for a collection of similar requests that have different performance, it will try to find instrumentation that best helps distinguish these two classes.

Consider the example illustrated in Figure 13-4. Suppose a storage service has an in-memory cache to maintain a subset of the hot data in memory. Requests to this system can follow one of two paths:

4 [Ate19]

Figure 13-4. A cache hit versus a cache miss.

The fast path—the request looks up data that exists in the in-memory cache, and can immediately return a result.

The slow path—the request looks up data that doesn't exist in the in-memory cache, and has to go to disk to fetch the data.

By default, if you only instrument your RPC framework, then your instrumentation will only capture API calls to the storage service. If you were to plot the latency distribution of the storage service, you'd see a bimodal distribution (see Figure 13-5).

Figure 13-5. The storage service has a bimodal latency distribution.
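A minimal sketch of such a storage service, with invented names, makes the two paths explicit; only cache misses pass through the fetch_from_disk helper, which is exactly the kind of internal tracepoint Pythia tries to discover.

    import random, time

    CACHE = {}

    def fetch_from_disk(key):
        time.sleep(0.05)                  # slow path: simulated disk read
        return f"value-{key}"

    def read(key):
        start = time.perf_counter()
        if key in CACHE:                  # fast path: cache hit
            value = CACHE[key]
        else:                             # slow path: cache miss
            value = fetch_from_disk(key)
            CACHE[key] = value
        return value, time.perf_counter() - start

    latencies = [read(random.randint(0, 20))[1] for _ in range(200)]
    fast = sum(1 for latency in latencies if latency < 0.01)
    print(f"{fast} fast requests, {len(latencies) - fast} slow requests")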

The goal of Pythia is to automatically identify some internal tracepoints that are present in one mode, but not in the other. For example, slow requests might invoke a fetchFromDisk method that is not present in the fast, cached requests. Pythia would identify and automatically instrument this method. The new output from the system

would contain this instrumentation, making it clear to Pythia users why the two request classes are different.

Design

Pythia primarily operates as a distributed tracing backend. It runs as a constant loop, with each iteration performing the following steps (a simplified sketch of one iteration follows the list):

1. Group requests that are expected to perform similarly.
2. Identify groups that exhibit a high coefficient of variation in their response time or another important metric.
3. Search the space of possible instrumentation and identify new instrumentation to enable.
4. Dynamically update the system's instrumentation.
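Here is the promised sketch of one iteration, heavily simplified and using invented data; it shows only the grouping and coefficient-of-variation steps, not the instrumentation search itself.

    from statistics import mean, stdev

    # Requests are grouped (here, by endpoint); groups whose response times show
    # a high coefficient of variation become candidates for extra instrumentation.
    request_latencies_ms = {
        "GET /profile": [12, 11, 13, 12, 12, 11],
        "GET /search":  [15, 16, 240, 14, 255, 15],   # suspiciously bimodal
    }

    def coefficient_of_variation(samples):
        return stdev(samples) / mean(samples)

    CV_THRESHOLD = 0.5
    candidates = [
        group for group, samples in request_latencies_ms.items()
        if coefficient_of_variation(samples) > CV_THRESHOLD
    ]
    print(candidates)   # ['GET /search'] -> search this group's tracepoint space next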

In addition to enabling new instrumentation, Pythia also performs a garbage collection step of disabling instrumentation that isn't useful.

Overheads

Pythia introduces a new dimension for managing and evaluating distributed tracing overheads. Concretely, Pythia can automatically determine when instrumentation is not useful for explaining performance variations, and disable it. Given some desired computational overhead, Pythia can either increase trace detail and reduce the sampling rate, or decrease trace detail and increase the sampling rate. Both of these are useful for Pythia. Receiving lots of samples quickly is useful for rapidly testing new instrumentation hypotheses, while having high trace detail makes it easier to rapidly localize instrumentation.

Comparison to Distributed Tracing

Pythia is built on top of distributed tracing. However, its use case is narrower, with a sole focus on explaining performance variations. Unlike distributed tracing, Census, and Pivot Tracing, Pythia does not rely on end users to manually explore data and identify problems. Instead, it automates some of this process. For example, while Pivot Tracing is also suitable for explaining performance variations, each successive query must be chosen by the user—a time-consuming manual process. By contrast, Pythia automatically navigates the space of possible instrumentation, leading to much faster problem resolution. Of the tools we've discussed in this chapter, Pythia is the most recent, and is an ongoing project of the Massachusetts Open Cloud initiative.

Summary

When we use distributed tracing, we have to make trade-offs about what to record, and how often to record it. Distributed tracing is very much "all or nothing"—if a request is traced, you get a lot of trace detail. However, none of the trade-offs made by distributed tracing are truly fundamental to profiling, debugging, or monitoring distributed architectures. Census, Pivot Tracing, and Pythia present several different approaches to distributed tracing that are also very effective. Each tool has a different use case, which enables different design choices. Overall, we can observe the following:

• It's useful to be able to turn instrumentation on and off dynamically at runtime. Sometimes instrumentation is useful for human-driven "deep dive" analysis, but not needed the rest of the time.
• Not all instrumentation is created equal. Some instrumentation, like request latency, will always be important; the value of other instrumentation might be difficult to gauge.
• Detailed traces are useful for historical analysis, but for recurring problems, we can insert new instrumentation and "try again," rather than go digging through old traces.
• Tracing doesn't need to be "all or nothing." For example, it can be useful to record a small number of detailed traces and a larger number of simple traces.
• If you're using data to do aggregate analysis, some of those aggregations can be performed directly at the data source, rather than by tracing backends. Early aggregations are far more cost-effective.
• Context propagation adds a new source of overheads.

All of these tools have one thing in common: capturing cross-component causality. In distributed architectures, getting causality between events is very challenging. All of these tools use context propagation in one form or another, as the mechanism for observing and recording cross-component relationships. In the next chapter, we will dive more deeply into context propagation.

CHAPTER 14 The Future of Context Propagation

In this chapter, we’re going to home in on context propagation, a powerful mecha‐ nism for many different use cases beyond just profiling, debugging, and monitoring. We’ll refer to this wider class of tools as cross-cutting tools—tools designed for dis‐ tributed architectures. Distributed tracing, Census, Pivot Tracing, and Pythia are all examples of cross-cutting tools. But, cross-cutting tools don’t need to be about tracing specifically, nor about recording trace data. We’ll wrap up by taking a look at the Tracing Plane, a 2018 research project from Brown University.1 The goal of the Tracing Plane is to abstract and generalize the context propagation used by cross-cutting tools. In Chapters 4 and 5, we talked about the importance of abstract instrumentation and interoperability for distributed tracing. Many of these concerns also apply directly to context propagation too. The Tracing Plane project identifies some of the design considerations for context propagation, and proposes a general solution called BaggageContexts. In the future, some of these concepts may materialize as components of our distributed tracing tools. Cross-Cutting Tools The goal of distributed tracing is to correlate and integrate data across different com‐ ponents and machines. Distributed traces are used for offline analysis—to analyze trace data long after it’s been recorded. Distributed tracing has a distinct division between the application-side components of the tool, for instrumentation and gener‐ ating traces, and the ex post facto aggregation and trace analysis components.

1 [Mac18a]

253 Recently, several academic research projects and industrial prototypes have devel‐ oped a wider variety of cross-component tools, many of them for online tasks. Some of these tools still focus on profiling, debugging, and monitoring, but a distinct set of them go a step further and consider enforcement, such as resource management. Instead of just offline analysis, these tools observe and analyze events while requests execute, and potentially make immediate decisions about what actions to take. The Tracing Plane refers to this broader class of tools as cross-cutting tools. Generally speaking, these are tools that deal with end-to-end requests in distributed architec‐ tures. Distributed tracing is certainly one in this class of tools, but cross-cutting tools as a whole encompass a broader range of use cases than just recording traces. Context propagation is a core component of cross-cutting tools. It’s the mechanism that lets tools combine information across components and machines. Different tools propagate different contexts—for example, distributed tracing tools propagate Trace‐ IDs, but metrics tools like Census propagate tags, and Retro, a tool that we have yet to discuss, propagates tenant identifiers. Not all cross-cutting tools use concepts like spans, which are specific to distributed tracing in particular. Likewise, many cross- cutting tools don’t collect and store data in backend databases, but instead have con‐ trol loops that interact directly with the system in real time. Use Cases Let’s take a look at some example use cases for cross-cutting tools. Some of these are real tools that exist today in production systems; some are research or industry proto‐ types; others are proposed use cases that have yet to see the light of day. For each tool, we’ll touch on three pieces:

• What kind of context propagation the tool uses
• How you go about using the tool in your application
• The background or backend components of the tool

Distributed Tracing Distributed tracing is the core focus of this book, and the most obvious example of a cross-cutting tool. While there are different implementations out there in the wild, they all follow roughly the same design: they use context propagation to pass around TraceIDs; you have to instrument your systems to record and annotate spans, which under the covers also updates the TraceIDs; and background components buffer and send recorded data over the network to backend tracing servers, for indexing and storage.
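As a minimal illustration of the propagation step that all of these implementations share, the following sketch injects and extracts a trace identifier using a header in the style of the W3C Trace Context traceparent format; error handling and span recording are omitted, and the helper names are invented.

    import secrets

    def start_trace():
        return secrets.token_hex(16), secrets.token_hex(8)   # trace-id, span-id

    def inject(headers, trace_id, span_id, sampled=True):
        # traceparent layout: version-traceid-parentid-flags
        flags = "01" if sampled else "00"
        headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"

    def extract(headers):
        _, trace_id, parent_id, flags = headers["traceparent"].split("-")
        return trace_id, parent_id, flags == "01"

    outgoing = {}
    trace_id, span_id = start_trace()
    inject(outgoing, trace_id, span_id)
    print(extract(outgoing))   # the downstream service sees the same trace-id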

254 | Chapter 14: The Future of Context Propagation In Chapter 13, we introduced Pythia, which we can view as an extension of dis‐ tributed tracing. Pythia differs in that as a user, you no longer have to exhaustively decide the instrumentation of your system. In the background, Pythia will automati‐ cally analyze traces, predict new instrumentation that might be useful, and send instrumentation commands back to the application. Cross-Component Metrics The two main example cross-cutting tools for metrics are Census and Pivot Tracing, both of which we described in Chapter 13. To summarize: Census Propagates tags with requests. To use Census, your instrumentation will either add tags to the Census context or record a metric. When you record metrics, they are automatically grouped by tags present in the Census context. Census aggre‐ gates metrics locally in the background; it only periodically reports its aggregates to the backend servers. Pivot Tracing Propagates partial query results with requests, which are typically sets of tuples. To use Pivot Tracing you enable a dynamic instrumentation agent in your appli‐ cation. In response to user-supplied queries, the Pivot Tracing agent dynamically inserts the necessary instrumentation, which will add, remove, and transform tuples from Pivot Tracing’s context. In the background, Pivot Tracing agents receive queries from users, and report query results back to users. Cross-Component Resource Management Resource management is a difficult problem in distributed architectures, especially ones where you might have multiple users or multiple tenants. An aggressive tenant might send too many requests, potentially overloading services, to the detriment of all other users. This is also sometimes called “resource isolation” or “performance isola‐ tion.” In large distributed applications, it’s useful to coordinate the resource manage‐ ment decisions across services. In a 2015 project, researchers from Brown University described Retro, a prototype system for end-to-end resource management.2 Retro uses context propagation to help attribute resource consumption to requests and tenants. To do this, Retro propagates a tenant ID with every request, to identify the user or task that owns each request. To use Retro, you add a Retro agent to your applications. Retro’s agent automatically hooks into system calls to measure CPU cycles, disk accesses, network bytes, and a

2 [Mac15]

Use Cases | 255 few other generic resources. Any time the application tries to use one of these resour‐ ces, Retro will consult the active tenant ID, and attribute the measurement to the active tenant. A second step is to identify places in your application where tenants can be rate- limited. Normally these are places like RPC request queues, where you can impose fair queuing or rate-limiting logic. Retro has a backend server that makes resource management decisions. It periodi‐ cally receives resource consumption measurements from all parts of your application, then decides whether any of the tenants should be rate-limited. Retro sends those decisions back to the schedulers in the system. Managing Data Quality Trade-offs Many modern applications can make data quality trade-offs. A data quality trade-off is when you decide to reduce the accuracy or quality of a result, in exchange for a faster response. A simple example is a distributed search query that fans out to one hundred machines. Here, you would have a data quality trade-off, with three options:

• Your application could wait for all one hundred machines to return a result, inheriting the latency of the slowest machine.
• You could choose to use only the results from the first 50 machines to reply.
• You might simply use whichever results you have after some period of time has elapsed, such as 80 ms (sketched after this list).
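Here is the sketch promised for the third option, using Python's standard concurrent.futures module; the shard count, latencies, and the 80 ms budget are illustrative only.

    from concurrent.futures import ThreadPoolExecutor, wait
    import random, time

    # Answer with whatever shard results have arrived within 80 ms. A real
    # service would return immediately and ignore or cancel the stragglers.
    def query_shard(shard_id):
        time.sleep(random.uniform(0.01, 0.2))    # some shards are slow
        return f"results from shard {shard_id}"

    with ThreadPoolExecutor(max_workers=100) as pool:
        futures = [pool.submit(query_shard, i) for i in range(100)]
        done, not_done = wait(futures, timeout=0.08)
        partial = [f.result() for f in done]
        print(f"answering with {len(partial)}/100 shards; {len(not_done)} ignored")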

In large architectures, there might be many services that can make data quality trade- offs. Each request might hit multiple of these services too. If your goal is to achieve a desired end-to-end request latency, you’d have to decide how to apportion your request latency into latency goals for each service. It’s neither easy nor obvious how to do this. A 2016 project from Facebook described DQBarge, a prototype system for end-to- end data quality trade-offs.3 DQBarge is designed to make these data quality trade-off decisions automatically. The authors of DQBarge found certain request-level data to be useful for making data quality trade-offs. DQBarge propagates this information in its context:

3 [Cho16]

• At request ingress, DQBarge predicts which child services will be called and in which order. DQBarge estimates how much slack time is available at each service. DQBarge puts this prediction into its context.
• As the request runs, each component measures time actually spent in the component and whatever data quality trade-off was made by the component.
• Each component adds CPU and memory load metrics to the DQBarge context.
• The preceding three are done automatically by DQBarge; in addition, developers can instrument their services manually to insert key-value tags into the DQBarge context.

Using this information, DQBarge has machine learning models that run in every ser‐ vice. When a request shows up, data from the DQBarge context is fed into the model to predict an appropriate data quality trade-off (such as an appropriate timeout). These models are created during an offline step using prerecorded distributed traces. Failure Testing of Microservices LinkedOut is a 2018 project from LinkedIn that focuses on failure-testing of micro‐ services.4 LinkedOut systematically injects faults into services at the level of requests. Due to the complexity of microservice architectures, LinkedOut needs to be able to target specific points during request flows to inject faults.

• LinkedOut injects dummy requests alongside production workloads.
• LinkedOut propagates fault instructions with each dummy request. A fault instruction is either an error, a delay, or a timeout. Fault instructions can also include filters specifying where and when faults should occur.
• At each service, LinkedOut inspects the request's metadata, checks the filters on the fault instructions, and possibly executes a fault instruction. LinkedOut only performs a fault injection action if the filters match (see the sketch after this list).
• The LinkedOut backend runs fault injection experiments by propagating various fault instructions with dummy requests, and asserting that the system behaves in the correct way when the actions are triggered.
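A minimal sketch of the per-service check described in the list, with invented field names: each service looks for fault instructions in the propagated request metadata, applies the filter, and only then injects the delay or error.

    import time

    def maybe_inject_fault(request_metadata, service_name):
        for instruction in request_metadata.get("fault_instructions", []):
            if instruction.get("target_service") != service_name:
                continue                                  # filter does not match
            if instruction["action"] == "delay":
                time.sleep(instruction["seconds"])
            elif instruction["action"] == "error":
                raise RuntimeError("injected fault for failure testing")

    # A dummy request carrying a fault instruction aimed at one service only.
    metadata = {"fault_instructions": [
        {"target_service": "recommendations", "action": "error"},
    ]}

    maybe_inject_fault(metadata, "profile-service")       # no-op: filter mismatch
    try:
        maybe_inject_fault(metadata, "recommendations")   # the fault fires here
    except RuntimeError as e:
        print("experiment observed:", e)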

4 [Ros18]

Enforcing Cross-System Consistency

Independent evolution is one of the key benefits of microservice architectures: you develop, deploy, and scale each service independently. As your system grows in scale, you need to start thinking about things like replication and consistency. Fortunately, these are well-studied topics and a fundamental part of distributed systems. Microservices usually make use of eventual consistency, where updates made on one replica will eventually make their way to all other replicas (but not usually immediately).

Cross-system consistency becomes a problem when we wire up multiple microservices that use eventual consistency. Consider the following: when a user composes a social media post, the post content is stored in a posts database, then a notification is generated for the user's followers, and stored in a notifications database. Clearly, followers shouldn't actually receive the notification until the post is available to be read from the posts database. However, the posts and notifications databases are separate services that can be replicated at different speeds. It's entirely possible for the follower to be notified of a post that doesn't exist yet in the local replica.

Cross-system consistency problems happen because microservice architectures rapidly grow and evolve in features. Microservices aren't usually designed at the beginning with replication and consistency in mind; instead, they come about later on when the service needs to scale. Researchers at INESC TEC in Portugal proposed a tool to address this problem.5 In the preceding example, the tool would prevent a notification from being delivered to a follower until the post had been replicated. To do this, each time a request writes to an eventually consistent service, the tool generates and propagates a causal timestamp with the request. Using causal timestamps, components can infer any preconditions that must be met before the request may proceed. In the background, components maintain and increment causal timestamps as they execute operations.

Request Duplication

One popular technique to reduce tail latency is request duplication—that is, sending multiple copies of a request to different worker nodes. It's difficult to do this in large distributed architectures without blowing up your system with extra requests, because each child service could potentially add further duplication. Researchers at Tufts University have proposed propagating metadata about request duplication choices alongside requests, and having a global external control loop that tweaks when and where duplication decisions should happen.6
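A minimal sketch of the duplication itself, with invented names and latencies: the same request is sent to two replicas and the first response wins. The proposal above adds propagated metadata and a global control loop to decide when duplication like this is worth the extra load.

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
    import random, time

    def call_replica(name, request):
        time.sleep(random.uniform(0.01, 0.3))    # replica-dependent latency
        return f"{name} answered {request}"

    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(call_replica, name, "req-42")
                   for name in ("replica-1", "replica-2")]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        print(next(iter(done)).result())         # use whichever replica replied first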

5 [Lof17] 6 [Bas19]

258 | Chapter 14: The Future of Context Propagation Record Lineage in Stream Processing Systems In stream processing systems like Spark Streaming or Kafka Streams, the concept of a request is less well defined than in RPC architectures. We might, for example, be interested in the end-to-end flow of a record through the system; however, intermedi‐ ate processing stages might combine multiple input records into a single output record. Distributed tracing would be capable of capturing this by simply combining the input traces with the output traces; however, when sampling comes into play, it would require every input record to have been sampled. Researchers from the Hun‐ garian Academy of Sciences proposed a hybrid approach, whereby instead of sending trace data to the tracing backends, the trace data is instead written directly to the trace context and propagated with the request (or in this case, record).7 This leads very quickly to large contexts as the execution progresses; future work will need to address this. Auditing Security Policies Security challenges are exacerbated in microservice architectures because of the increased decentralization. Even the original Dapper paper mentions distributed tra‐ cing’s useful applications to computer security. The authors used Dapper to verify that services were adhering to authentication and encryption policies, by augmenting traces with security protocol parameters. More generally, distributed tracing is an appealing way to audit security policies. Incorrect authentication repeatedly makes the Open Web Application Security Project’s top 10 most critical security risks; in microservice architectures it occurs because policies aren’t asserted at the time that privileged actions are executed. Distributed tracing would be a mechanism for audit‐ ing security policy enforcement; taken a step further, context propagation would be a good way to enforce security policies at runtime. Testing in Production Integration testing is a big challenge for distributed architectures, because it’s so diffi‐ cult to replicate production environments and workloads in an offline setting. Testing in production is one way to get around this, but has the added danger of colocating test traffic with real traffic. Context propagation is a simple way of propagating addi‐ tional metadata with requests, to communicate testing objectives to downstream services. In general, the baggage mechanism of distributed tracing is already an effec‐ tive way of tagging requests for the benefit of downstream services.

7 [Zva19]

Use Cases | 259 Common Themes Cross-cutting tools all use context propagation, and there are a few common themes: Contexts are dynamic The cross-cutting tools often read and update the context as requests progress. Response propagation Sometimes, information from downstream services is propagated back to the caller, requiring contexts to be passed back to callers in RPC responses. Variable context size Distributed tracing usually has fixed-size contexts, since only TraceIDs and flags need to be propagated (excluding baggage). Most cross-cutting tools have vari‐ able sized contexts. Small context size Although contexts are variable in size, most cross-cutting tools factor context size into their design. In particular they aim to minimize context size, since the larger the context, the bigger the runtime cost to propagate it. If the context gets too large, it will introduce unacceptable overheads. For example, Census has hard- coded limits on the size of the tags it’s willing to propagate, and drops tags above that limit. Need for instrumentation No matter which cross-cutting tool you wish to deploy, you have to go and instrument your entire architecture with context propagation. Chapter 2 outlined how enormous this effort can be. Should You Care? For some of the use cases we’ve described, the tools haven’t yet made it beyond alpha- stage prototypes, so you might wonder why this information is relevant. Beyond giv‐ ing a glimpse of the tools we might have in the future, a more concrete goal is to emphasize reusability of the distributed tracing components we have today. Reusable, interoperable instrumentation is one of the biggest challenges for dis‐ tributed tracing, and this challenge also extends to cross-cutting tools. Cross-cutting tools all propagate contexts, and they all need instrumentation for propagating con‐ texts. They all follow requests, and the instrumentation needed for propagating con‐ texts always happens in the same places for every tool. Do you really want to re- instrument your applications again in the future if you want to deploy any of these tools? In the past couple of years, this observation led to the inclusion of general-purpose baggage in the OpenTracing spec, and most distributed tracing frameworks today let

you propagate arbitrary key-value pairs with requests. This is suitable for several of the use cases mentioned earlier, including metrics, testing in production, and simple resource management. However, baggage is currently underspecified, primarily because it's hard for us to anticipate future use cases.

The final research project we'll talk about is an effort to do just this. Called the Tracing Plane, the project focuses specifically on factoring out context propagation as a general-purpose, reusable component for cross-cutting tools. The project shines a light on some of the subtle challenges to getting this right, and proposes one possible implementation.

The Tracing Plane
The Tracing Plane is a 2018 research project from Brown University, a follow-up to their earlier Pivot Tracing project.8 The idea behind the Tracing Plane is that context propagation is so useful that it should be factored out as a separate component, one that can be reused simultaneously by different cross-cutting tools. The authors proposed an abstraction layering for context propagation, much like OpenTracing is an abstraction layer for distributed tracing. Figure 14-1 illustrates how distributed tracing would interact with the Tracing Plane. Rather than context propagation being an internal component of distributed tracing, tightly integrated with the concept of TraceIDs and spans, it would instead be a general-purpose component to which distributed tracing is but one client.

Figure 14-1. The Tracing Plane proposes factoring context propagation as a standalone component.

8 [Mac18a]

It's important that context propagation can be shared by multiple tools, because the effort required to add context propagation instrumentation is so high. All cross-cutting tools need instrumentation to propagate contexts alongside requests. This instrumentation is the same, regardless of the cross-cutting tool being deployed. If instrumentation can be reused, you only have to instrument your systems once, rather than every time you want to deploy one of these new tools.

Is Baggage Enough?
Distributed tracing already has a notion of baggage, described in Chapter 2 as an array of key-value pairs. Any component can add new key-value pairs to a request's baggage; later components can query values from baggage too. This has already enabled a few interesting use cases. For example, discussions on the Jaeger message board have included proposals for security auditing, traffic labeling, and testing use cases similar to those described earlier. Unfortunately, this simple definition falls flat whenever we have multiparent causality. Multiparent causality occurs in two main ways (see Figures 14-2 and 14-3).

Figure 14-2. Multiparent causality needs to be able to merge contexts.

Multiple parent spans
This is when some span of execution is causally dependent on multiple parent or sibling spans completing. When this happens, the contexts from both parents or siblings need to be passed to the new span, and somehow combined together.

Figure 14-3. Response propagation needs to be able to merge contexts.

Response propagation
Not all distributed tracing implementations need or use response propagation, but many do. In general, it's reasonable that a cross-cutting tool might modify its context, and pass the modified context back to the parent. When this happens, the parent needs to merge the response context back with its original context. Remember, the parent might have also continued to do work in the meantime, so the parent might also have modified its own context.

Put simply, you need to be able to merge two contexts into one. Merging happens anywhere two concurrent branches of a request join. Merging is fundamental to concurrent programs, and distributed architectures are fundamentally concurrent.

Let's return for a moment to the key-value baggage used in distributed tracing. Suppose service B inserted priority:low into the baggage, and service C inserted priority:high. What should you do? The two baggages now have two different values for the same key. Since keys can only have one value, you'll need to pick one. Today this happens in a crude way: usually one is selected randomly. Alternatively you could go to your code and add more instrumentation to resolve this conflict manually. Neither choice is appealing, and distributed tracing users have bumped into this problem already. The typical message board response is that distributed tracing won't help; you must make the decision yourself.

In simple systems, it might be acceptable for you to manually add instrumentation to resolve these merge conflicts. But in large architectures, this might be too much effort—there might be too many places where merge conflicts can happen, and you might not even be aware of them all. It's difficult to even know ahead of time all of the services that might have your application as a dependency.
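To see why this is painful, here is a small sketch of our own (not from the Tracing Plane) showing what merging flat key-value baggage looks like when two branches disagree: with plain maps, whichever baggage happens to be applied second silently wins.

import java.util.HashMap;
import java.util.Map;

public class NaiveBaggageMerge {

    // Naive merge of key-value baggage: on a conflicting key, the second map's
    // value silently overwrites the first. Neither outcome is obviously correct.
    public static Map<String, String> merge(Map<String, String> a, Map<String, String> b) {
        Map<String, String> merged = new HashMap<>(a);
        merged.putAll(b); // priority:low versus priority:high -- last writer wins
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> fromServiceB = Map.of("priority", "low");
        Map<String, String> fromServiceC = Map.of("priority", "high");
        System.out.println(merge(fromServiceB, fromServiceC)); // {priority=high}
        System.out.println(merge(fromServiceC, fromServiceB)); // {priority=low}
    }
}

Reversing the merge order flips the result, which is exactly the kind of arbitrary behavior the rest of this section sets out to eliminate.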

Beyond Key-Value Pairs
We saw how different cross-cutting tools propagate different kinds of data, and that data might also have different merge semantics. For example, when merging priority:low with priority:high, you might want to just keep the highest priority level, in this case priority:high. Other kinds of data have other merge semantics. For example, Census treats its tags like a set. If this were implemented as keys and values, the ideal result of merging tags:A with tags:B might be the set union tags:A,B. A more elaborate example is a counter. A simple tool that counts RPC invocations might want to add values: merging count:5 with count:2 should produce count:7.

Unfortunately, there is no one correct way of merging contexts. It's up to the cross-cutting tool to decide how this should happen. Fortunately, it's usually quite easy to write concisely what data gets propagated and how it should get merged. The Tracing Plane introduced a language for this called BDL, or Baggage Definition Language. For the preceding priority example, we would write:

bag PriorityExampleTool {
    int32 Priority = 0;
}

This definition states that the PriorityExampleTool wants to propagate a field called "Priority" of type int32. By default in BDL, merging two int32 fields will keep the larger one. At this stage, our goal in showing you BDL is to be illustrative rather than exhaustive; for now, it suffices to know that definitions in BDL look and behave a lot like Google's Protocol Buffers (a popular interchange format for structured data). Let's look at another example:

bag Census {
    set<string> Tags = 0;
}

This definition defines Census as propagating a field called Tags that is a set of strings. By default in BDL, merging two sets will perform a set union. As we have mentioned, BDL looks and behaves a lot like Google's Protocol Buffers. It has the typical built-in primitive types that you would expect, along with some special built-in types: sets, maps, flags, and counters. All types in BDL have explicit merge semantics (such as set union for sets). BDL supports nested definitions and has many of the useful properties offered by other interchange formats, such as support for unknown fields when mixing multiple different versions of a bag definition.
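The following sketch, written by hand rather than generated by the BDL compiler, shows what these default merge rules amount to in plain Java: keep the maximum of two priorities, take the union of two tag sets, and add two counters.

import java.util.HashSet;
import java.util.Set;

public class MergeSemantics {

    // Priority: BDL's default for int32 fields is to keep the larger value.
    public static int mergePriority(int a, int b) {
        return Math.max(a, b); // merging low (1) and high (10) keeps high (10)
    }

    // Tags: sets merge by set union, which is how Census treats its tags.
    public static Set<String> mergeTags(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b); // {A} merged with {B} becomes {A, B}
        return union;
    }

    // Counters: a tool counting RPC invocations wants addition, not overwrite.
    public static long mergeCount(long a, long b) {
        return a + b; // count:5 merged with count:2 produces count:7
    }
}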

Compiling BDL
BDL is all well and good for declaring data used by cross-cutting tools, but we haven't said yet what it actually compiles to, or how it's used. The Tracing Plane project also defines an underlying serialization format for BDL called BaggageContext. This is similar to how Protocol Buffers has a well-defined underlying serialization format. From a BDL definition, a command-line compiler will generate source files that you can include in your projects. For example, in Java, compiling the Census declaration would produce a source file Census.java, as shown in Example 14-1.

Example 14-1. The interface generated by the command-line compiler

public interface Census {

    public Set<String> getTags();
    public void setTags(Set<String> tags);

    public byte[] writeTo(byte[] baggageContext);
    public static Census readFrom(byte[] baggageContext);

}

The compiled code defines an interface and an implementation (not shown) repre‐ senting the Census bag. The code includes methods to get and set values for Tags and methods to serialize and deserialize the data. Notice that baggageContext is simply a byte array. To use the compiled code is straightforward (see Example 14-2).

Example 14-2. Using the compiled code

public class Example {

    public static void main(String[] args) {
        byte[] emptyBaggageContext = new byte[0];

        Census census = Census.readFrom(emptyBaggageContext);

        Set<String> tags = new HashSet<>();
        tags.add("API-A");
        census.setTags(tags);

        byte[] baggageContextWithTag = census.writeTo(emptyBaggageContext);
    }
}

Initially our baggageContext contains nothing; this is equivalent to an empty array.

Using readFrom we can extract a Census bag from the empty baggageContext. Since the baggageContext is empty, the Census bag will contain no tags.

We create a tag “API-A” and add it to the Census bag.

We use writeTo to inject the Census bag into baggageContext. We receive back a nonempty baggageContext that contains the serialized Census bag. All of these operations are designed to be quite computationally efficient. BaggageContext In the preceding example, baggageContext represented the generic, serialized form of a bag. You can think of BaggageContext as being like our opaque trace contexts in distributed tracing. BaggageContext needs to be passed around with requests, included in RPC headers, and so on. As in the preceding example, distributed tracing would use BaggageContext to store its TraceIDs; then anywhere you create spans, you would read and write tracing data directly to and from the BaggageContext. Merging BaggageContext provides one additional extremely useful property: it can be merged. The BaggageContext format is carefully designed such that two serialized instances can be merged together easily and efficiently, without needing to either deserialize the objects or even understand the datatypes contained within. The BaggageContext library has the core interface shown in the following example: public class BaggageContext {

public byte[] merge(byte[] baggageContextA, byte[] baggageContextB);

public byte[] trim(byte[] baggageContext, int size);

} We won’t get into the details of the implementation here, but at the heart of the Bag‐ gageContext is a simple byte-wise comparison scheme. The merge API combines two BaggageContext instances at the byte level, without actually interpreting the values. Duplicating an instance is easy—it can simply be copied. Overheads In addition to merging, the BaggageContext library provides a trim API call. This is used for managing overheads. The idea behind trim is to limit the size of the Bagga‐ geContext being propagated. For example, a service might only be willing to

266 | Chapter 14: The Future of Context Propagation propagate 1000 bytes; using trim, the service can enforce this limit. This is a lot like how Census operates today, by discarding tags above a given size threshold. Like merge, trim correctly discards data at the byte level, without interpreting the val‐ ues in the BaggageContext. Trimming has two nice properties. First, when data is dis‐ carded, the trim operation adds a small (1-byte) marker to the BaggageContext to indicate that a trim occurred. This can be used by cross-cutting tools later, as a hint that some data was being propagated but at some point had to be thrown away. Sec‐ ond, trim discards data based on an implicit priority ordering that is baked into the serialization format. Summary BaggageContext and BDL may come across as a heavyweight way of doing context propagation—after all, simple key-value pairs seem so much nicer! However, BDL used many of the lessons offered by existing approaches to data serialization, specifi‐ cally Protocol Buffers. As a result, the serialized overhead of BaggageContext is mini‐ mal, but we get all sorts of nice properties that are difficult or impossible to achieve with key-value pairs:

• All BaggageContext data has explicit merge semantics, making it clear how to resolve merge conflicts.
• Merge conflicts are resolved automatically, correctly, and without the need for custom instrumentation.
• Services can propagate BaggageContext instances without needing to be able to interpret them, as BaggageContexts are just bytes.
• Services only interpret BaggageContext instances when they actually use or manipulate the data contained within.
• BaggageContext instances can be resized without needing to interpret them.
• Cross-cutting tools can update their bag definitions, and use multiple versions concurrently, with correct conflict resolution.
• When you instrument your system, all you need to do is propagate BaggageContexts. The only APIs you need to use are merge and trim (see the sketch following this list). You don't need to know anything about the tools that might later use the BaggageContext.
• You don't need to revisit your instrumentation to deploy new cross-cutting tools.
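To make the last two points concrete, here is a minimal sketch of what instrumentation at a join point might look like, assuming the merge and trim signatures shown earlier in this chapter; the method names, variable names, and the 1,000-byte budget are our own illustrative choices.

public class JoinPointExample {

    // Merge the contexts arriving from two concurrent upstream branches and cap
    // the result before propagating it downstream. BaggageContext is the library
    // class sketched earlier in this chapter; this code never inspects the bytes.
    public static byte[] joinAndForward(BaggageContext library,
                                        byte[] fromParent,
                                        byte[] fromSibling) {
        byte[] joined = library.merge(fromParent, fromSibling); // byte-level merge, no deserialization
        return library.trim(joined, 1000);                      // enforce a propagation size budget
    }
}

Nothing in this code knows which cross-cutting tools contributed data to either context; the service treats both as opaque bytes, which is precisely what makes the instrumentation reusable.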

In the future, a more principled approach to context propagation is quite likely. Whether the approach we described here is the one that gets used is anybody's guess. Certainly, future context propagation needs to solve the problem of key conflicts, and BaggageContext provides an elegant solution to this.

As we said back in Chapter 10, distributed tracing is young, and it's still evolving. The tools we use in the future might not be the same as the tools we use today. The lessons we've learned with projects like OpenTracing and OpenTelemetry are ones we can apply to other areas that are less mature, such as context propagation. For now, knowing this might not change the way you choose to use distributed tracing. But we hope we've helped to broaden your perspective on what future tracing tools might do.

APPENDIX A The State of Distributed Tracing Circa 2020

One of the biggest challenges in writing this book has been figuring out what merits inclusion. We’ve attempted to provide a broad, and somewhat timeless, overview of the fundamental concepts and techniques underpinning the technology as it exists today. For everything we touched on, however, there’s more that we left out. This appendix will attempt to give you a snapshot of popular tools for distributed tracing, both open source and commercial, that you might find useful. Open Source Tracers and Trace Analysis These are some popular open source tracers and trace analyzers: Zipkin One of the most mature open source tracing tools available today. We discuss it in more depth in “Zipkin” on page 54. It provides, in addition to a trace analyzer, a number of instrumentation libraries for popular frameworks such as ASP.NET Core, a variety of Java frameworks (such as Jersey, JAXRS2, Apache HttpClient, and Spring Cloud), and more. Jaeger A newer open source tracing tool and a Cloud Native Computing Foundation (CNCF) project. The more lightweight nature of Jaeger has made it extremely popular for cloud-native developers, and it is the default trace analyzer for the popular service mesh Istio. Jaeger also fully supports the OpenTracing API, mak‐ ing it compatible with hundreds of integrations into existing frameworks and libraries. SkyWalking An Apache Foundation project that has found a great deal of popularity in China. Its goal is to be an all-in-one application performance management tool. It

269 supports not just distributed traces but also metrics, and can ingest data from a variety of tracing formats. Haystack A distributed tracing analysis system built by Expedia. It supports OpenTracing, allowing you to make use of hundreds of existing integrations. It supports trace analysis and allows you to create custom alerts based on performance data. Pinpoint An APM tool that supports monitoring of Java- and PHP-based distributed sys‐ tems. One unique characteristic of Pinpoint is that it exclusively instruments your code through agents and plug-ins, which means you can’t do explicit instru‐ mentation of your source code and are required to run the agent. Appdash An application tracing system for Go that is based on Google’s Dapper. It’s pri‐ marily intended for instrumenting Go-based applications, but offers support for the OpenTracing API and has clients in Python and Ruby as well. In general, Jaeger and Zipkin are the two most popular and mature open source trac‐ ing tools. They have large user and contributor communities, and offer a fairly straightforward setup and installation process. At the time of this writing, OpenTelemetry is still in alpha, but by the time this book is printed it’s anticipated to be in beta or moving toward a 1.0 release. One of the design goals of OpenTelemetry, in addition to providing a standard API and SDK for instrumentation, is to provide high-quality automatic instrumentation for popular frameworks and libraries both through wrappers around that code and also through agents. We anticipate that this movement toward instrumentation as a focus of the open source world will lead to distributed tracing being a built-in feature of RPC frameworks, libraries, and other software products. Commercial Tracers and Trace Analyzers There’s a huge variety of commercial performance monitoring and distributed tracing vendors. In an effort to be impartial, we’ll simply list the most notable ones here. In general, these are all SaaS products that do the heavy lifting in terms of storing, ana‐ lyzing, indexing, and alerting based on your trace data. Some of them offer propriet‐ ary agents and other tools to ease instrumentation. Many, if not most, support open source instrumentation standards such as OpenTracing, OpenCensus, and OpenTelemetry:

• AppDynamics
• Datadog
• Dynatrace
• Elastic
• Epsagon
• Honeycomb
• Instana
• Lightstep
• New Relic
• Splunk
• Wavefront

Language-Specific Tracing Features
Not every language is created equal when it comes to distributed tracing. Some languages offer tools, features, and projects that you may wish to employ in instrumenting your code. We'll look at a few of them here.

Java and C#
As two of the most common languages in use around the world, it's no surprise that Java and C# have wide support for distributed tracing. A variety of proprietary and open source agents and automatic instrumentation plug-ins exist for most popular service and RPC frameworks, including:

• Spring
• Akka
• JAX-RS
• JDBI
• JDBC
• Apache HttpClient
• Netty
• OkHTTP
• Java Flight Recorder (JFR)
• Web Servlet Filter
• gRPC

C# also supports a wide variety of instrumentation plug-ins, such as:

• ASP.NET MVC
• ASP.NET WebAPI
• ASP.NET Core MVC
• Entity Framework Core
• gRPC

In addition, both of these languages feature a variety of automatic instrumentation tools that operate on the bytecode of compiled software, such as the Java Special Agent or C#’s Fody plug-in. It’s also possible to use a variety of aspect-oriented pro‐ gramming frameworks (such as C#’s PostSharp) to write tracing plug-ins that allow you to easily decorate method declarations in order to trace them. Go, Rust, and C++ Go, Rust, C++, and C are somewhat more challenging to instrument because they lack some of the niceties available to us in managed runtimes like the JVM or CLR. However, their popularity as microservice languages means that there are a great many instrumentation tools available for them. Go supports automatic instrumentation primarily by wrapping existing libraries with tracing code. Generally, you use these by modifying your import path to point to a traced version of the library, but there are other implementations that provide func‐ tion wrappers—such as the ones used earlier in this text, in the MicroCalc example code. Rust supports tracing as a first-class citizen through its tracing crate (package, in Rust parlance), which was formerly known as tokio-trace. This crate provides an interface to popular OSS tracing APIs such as OpenTracing and OpenTelemetry. This package is becoming widely supported throughout the Rust ecosystem, and provides helpful methods for tracing at the local function level in addition to distributed requests. C++ is supported by the OpenTracing API, as well as OpenTelemetry. There’s little in the way of automatic instrumentation or tracing agents for C++, but tracers exist for most commercial and OSS tracing systems that support C++ as it is part of service mesh tracing (through its integration into tools like Envoy and Nginx). Python, JavaScript, and Other Dynamic Languages Dynamic languages tend to have excellent support for distributed tracing instrumen‐ tation, through either plug-ins or agents. We’ll focus just on JavaScript and Python, but much of this is generally applicable to other similar languages (for example, most of what we write about JavaScript applies to TypeScript and other languages that run on V8 or other JS runtimes).

JavaScript instrumentation can be thought of in two major groups—browser- and Node.JS-based. Browser instrumentation is primarily interested in instrumenting HTTP libraries, such as the XMLHttpRequest or fetch libraries. These plug-ins tend to be fairly straightforward, just importing the libraries and going from there. Since JavaScript makes it easy to add hooks to existing functions, quite a bit of browser instrumentation falls into this category. There are also packages, like OpenCensus Web, which hook into the underlying browser API to trace page load timings and user interactions in the browser. For Node.JS, instrumentation plug-ins and middleware exist for many popular and widely used libraries and frameworks, such as the following:

• Express
• Connect
• gRPC
• Restify
• Native modules (dns, http, net, etc.)
• MySQL and MySQL2
• pg (Postgres)
• Redis
• MongoDB

This isn’t an exhaustive list, but it’s representative of the breadth of integrations available. Python, similarly, allows for dynamic instrumentation by simply importing a tracing library into your application code, as method signatures can be intercepted and rewritten by imported code. Some of the frameworks and libraries that are instru‐ mented for Python are:

• Django
• Flask
• Tornado
• Cassandra
• Memcached
• MongoDB
• MySQL
• Postgres
• Redis

• gRPC
• Requests
• gevent

The best way to discover these and other integrations for your language is to browse the various tracing registries available at OpenTracing and OpenTelemetry, as they tend to be somewhat exhaustive.

APPENDIX B Context Propagation in OpenTelemetry

In Chapter 14 we discussed context propagation as it applies not only to distributed tracing, but also to other applications in microservice architectures. As of this writ‐ ing, OpenTelemetry has adopted a proposal to separate its context mechanism and model from the distributed tracing and metric models to support a wider variety of use cases. As the exact translation of this proposal into specification has not occurred at this point, we’ll describe the overarching goals of this context propagation model, how it factors into the overall OpenTelemetry project, and its intended use. Why a Separate Context Model? Earlier in the book we discussed the advantages of context propagation as a mecha‐ nism that applies to use cases beyond simply profiling and monitoring microservice architectures, also known as cross-cutting tools. Today, the practical realities of how developers build software has meant that these cross-cutting tools are often tightly coupled to some other particular component or dependency in their software. An example of this is in OpenTracing—the ability to propagate key-value pairs through trace context baggage is quite useful for passing messages around an application for telemetry purposes (for example, using data carried in the baggage from an upstream service to isolate a particular event or metric later in the request), but we have also seen usages of it where developers use it as an out-of-band message channel to send commands or control flow indicators to later services in a request. Another example, outside of an existing telemetry tool, would be gRPC’s context class, which is com‐ monly used as a way to propagate security principles and credentials across API boundaries. OpenTelemetry’s context model primarily seeks to address two main concerns— extensibility of the context propagation mechanism, and a cleaner separation of con‐ cerns between observability and context propagation. Extensibility is fairly

275 straightforward as a design goal. A clean separation between the context propagation mechanism and the observability components allows for end users to consume and utilize just the context propagation mechanism if they wish, without requiring them to depend on the observability tool for nonobservability concerns. This makes it eas‐ ier for developers to build new applications built on context propagation, such as A/B testing, lower-level request routing, and authentication/authorization schemes. In addition to these extensibility goals, separation of concerns in OpenTelemetry itself is a critical goal. By moving context propagation into its own, self-contained compo‐ nent, it becomes easier to understand and reason about without having to understand all of the observability APIs. Furthermore, it allows for the context propagation mechanism to have multiple self-contained types of propagation—for example, you may want to sample your trace context and throw away certain traces, but if you’re passing security data through the propagation mechanism, you’d never want that to be discarded. The OpenTelemetry Context Model OpenTelemetry’s context propagation model defines a “top level” set of cross-cutting concerns which integrate with the business and application logic of a service, and a “bottom level” context propagation layer that can store and manage state across the life span of a request in a distributed system, as illustrated in Figure B-1.

Figure B-1. Overview of the context propagation system.

The top level includes functions such as distributed tracing, metrics, and correlations along with other nonobservability functions. These observability APIs are discussed more generally in “OpenTelemetry” on page 34, however, the correlations function is not. Correlations are a feature of OpenTelemetry that allows you to propagate index values throughout a request to assist in correlating observability events that occur in one service with some piece of data provided by an upstream service in order to

276 | Appendix B: Context Propagation in OpenTelemetry establish a causal link between them. As a simple example, you could link the version of some upstream component or dependency (like an OS or browser version) with a particular failure in a downstream component or service. The API for correlations is based on the W3C CorrelationContext specification, which we’ll take a look at in more detail now. W3C CorrelationContext and the Correlations API Much like the W3C Trace-Context specification’s goal of ensuring a standard format for the propagation of trace context, correlation context ensures a standard format for passing user-defined correlation data through a request. The CorrelationContext header(s) in a request are limited to 180 key-value pairs, with a maximum size of 4,096 bytes for a single pair, and a total length of 8,192 bytes for all pairs in a single header. Here, we show example headers, and keys and values are URL-encoded: // A single correlation context header Correlation-Context: userID=janeDoe,isTrialUser=false,token=entropy9

// Multiple headers are allowed
Correlation-Context: userID=janeDoe
Correlation-Context: isTrialUser=false,token=entropy9

At the time of this writing, the W3C standard is a public working draft, and is not finalized, so these details may change in the future. OpenTelemetry seeks to adopt an API that allows for observability components (tracing, metrics, etc.) to interact with the correlation context through a simple set of setters and getters. The proposed methods are as follows:

GetCorrelation(context, key) -> value
    Returns a value from a given key in the correlation context
SetCorrelation(context, key, value) -> context
    Returns a context with the given key-value pair
RemoveCorrelation(context, key) -> context
    Returns a context with the given key removed
ClearCorrelations(context) -> context
    Returns a context with correlations removed (used if crossing a trust boundary)
GetCorrelationPropagator() -> (HTTP_Extractor, HTTP_Injector)
    Returns an implementation of the injector and extractor methods needed to deserialize a correlation context from an upstream process or serialize them for propagation to a downstream process
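Returning to the header format for a moment, here is a rough sketch of our own (based on the working draft rather than any OpenTelemetry helper) of how a Correlation-Context value is assembled: each key and value is encoded and the pairs are joined with commas.

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CorrelationContextHeader {

    // Serialize key-value pairs into a Correlation-Context header value. A real
    // implementation would also enforce the draft's limits (180 pairs, 4,096 bytes
    // per pair, 8,192 bytes total) and use strict percent-encoding; URLEncoder is
    // only a rough stand-in for that encoding here.
    public static String build(Map<String, String> pairs) {
        return pairs.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining(","));
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("userID", "janeDoe");
        pairs.put("isTrialUser", "false");
        pairs.put("token", "entropy9");
        // Prints: Correlation-Context: userID=janeDoe,isTrialUser=false,token=entropy9
        System.out.println("Correlation-Context: " + build(pairs));
    }
}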

Context Propagation in OpenTelemetry | 277 Distributed and Local Context Outside the general scope of correlation context, OpenTelemetry itself needs to inter‐ act with the underlying context API. Broadly, these interactions can be thought of as interacting with a distributed, or a local, context. Distributed context we’ve already discussed; it’s how you propagate data across API boundaries. Local context interacts with this in two ways. The first is to allow in-process access to the distributed context, and either access or modify it as required by the service. The other is to hold values that are intended to scope a single request inside of a single process, shared by its threads or subprocesses. The intention of the local context API is not only to encap‐ sulate the correlation context, but also to act as a wrapper for local resources (such as process-wide attributes) that would be applied to telemetry sources within the con‐ text’s scope. This local context should be managed either by passing an explicit con‐ text object from function to function, or automatically by registering it in thread- local storage (or language equivalent). Examples and Potential Applications The specifics of the context propagation mechanism in OpenTelemetry have yet to be fully implemented (but should be by the time you read this), but we would be remiss to not at least demonstrate some of the functionality that it should enable in pseudocode. Our example posits a simple routing scenario, where two versions of a client service exist. In this scenario, you might wish to have different backend services handle requests differently from each client version, and capture some performance data on the response times of each request. To put it in more concrete terms, consider a mes‐ saging application where you are adding new features, but wish to ensure that users of all client versions can continue to communicate. You would need to either force client updates for your end users so that they are all using the same version of the back-end service or create logic in your back-end service that could handle multiple different client versions at the same time. Both of these have drawbacks—end users may not wish to update, and the complexity of adding handlers for each client version in a sin‐ gle backend may balloon the complexity and maintainability cost of the code. Using context, however, you could easily route requests to the appropriate service based on some value in the correlation-context header. An example of this in pseudo-code can be seen in the following example: func init() { baggageExtractor, baggageInjector = Correlations.HTTPPropagator() traceExtractor, traceInjector = Tracer.W3CPropagator()

Propagation.SetExtractors(baggageExtractor, traceExtractor)
Propagation.SetInjectors(baggageInjector, traceInjector)
}

func main() {
init()
// Define handlers, process context, etc.
router.Handle("/api/chat", handleRequest(context, this.request))
}

func handleRequest(context, request) -> (context) {
extractors = Propagation.GetExtractors()
context = Propagation.Extract(context, extractors, request.Headers)

context = Tracer.StartSpan(context, ...)

clientVersion = Correlations.GetCorrelation(context, "clientVersion")

switch (clientVersion) {
case "1.0": context, result = fetchDataFromServiceB(context)
case "2.0": context, result = fetchDataFromServiceC(context)
}

context = request.Response(context, result)
Tracer.EndSpan(context)
return context
}

func fetchDataFromServiceB(context) -> (context, data) {
req = MakeRequest(...)

injectors = Propagation.GetInjectors()
req.Headers = Propagation.Inject(context, injectors, req.Headers)

data = req.Do()
return context, data
}

Note that baggage and Trace Context are propagated separately to avoid correla‐ tions being sampled out.

Extractors and injectors are set globally, so that we’re able to properly inject and extract all headers from our outgoing and incoming requests.

The extract function will run all registered extractors, so both our trace context and our correlation context are present in the local context.

This span will be the child of whatever span was extracted from our incoming request, and the new span will be in the resulting context.

Keys need to be known by the receiving service in order to inspect them; there's no way to enumerate all of the keys in a correlation context.

Create a new outgoing HTTP request to our new service. We don’t describe fetchDataFromServiceC here, but it would be similar.

This injects both the correlation context and the trace context into the outgoing request. Notice that the inject and extract functions in the preceding example don’t specifically reference a tracer or any other OpenTelemetry component at all; this is a by-product of the separation of concerns between observability and context. You should also notice that many of the functions are returning a context object rather than manipu‐ lating the state of an existing one. Context objects are intended to be immutable, so you’ll need to be sure that you’re passing the correct context into function calls or assigning it to thread-local storage, as appropriate. Keep in mind that the preceding example is intended not to be exhaustive; it’s more of an idea of how you could use these distributed and local contexts for purposes other than simply propagating trace context. The majority of this work should be handled for you, behind the scenes, by the OpenTelemetry SDK and various helper libraries. We have included it in this text to ensure that you’re aware of it, though, and to get you thinking about the possibilities of the correlation context as a more general way to pass data across API boundaries in a standards-compliant way. It’s possible that in the future, this context layer may migrate to a completely separate project, independ‐ ent of OpenTelemetry as a way for multiple open source projects to benefit from it and standardize on a single distributed context layer. You can read more about the OpenTelemetry context layer at OpenTelemetry’s Git‐ Hub repository.

Bibliography

[Abr13] Abraham, Lior, John Allen, Oleksandr Barykin, Vinayak Borkar, Bhuwan Chopra, Ciprian Gerea, Daniel Merl, Josh Metzler, David Reiss, Subbu Subrama‐ nian, Janet L. Wiener, and Okay Zed. 2013. “Scuba: Diving into Data at Facebook.” Facebook paper. https://research.fb.com/wp-content/uploads/2016/11/scuba-diving- into-data-at-facebook.pdf. [Agu03] Aguilera, Marcos K., Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. 2003. “Performance Debugging for Distributed Sys‐ tems of Black Boxes.” ACM SIGOPS Operating Systems Review 37(5): 74-89. https:// pdos.csail.mit.edu/~athicha/papers/blackboxes:sosp03.pdf. [Aka17] Akamai. 2017. “Akamai Online Retail Performance Report: Milliseconds Are Critical.” Press release. April 19, 2017. https://www.akamai.com/uk/en/about/news/ press/2017-press/akamai-releases-spring-2017-state-of-online-retail-performance- report.jsp. [Ate19] Ates, Emre, Lily Sturmann, Mert Toslali, Orran Krieger, Richard Megginson, Ayse K. Coskun, and Raja R. Sambasivan. 2019. “An Automated, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications.” SoCC ’19: Proceedings of the ACM Symposium on Cloud Computing (November 2019): 165–70. https://doi.org/10.1145/3357223.3362704. [Bar04] Barham, Paul, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. 2004. “Using Magpie for Request Extraction and Workload Modelling.” Proceedings of the 1st ACM SIGOPS European Workshop, Leuven, Belgium, September 19-22, 2004. [Bas19] Bashir, Hafiz Mohsin, Abdullah Bin Faisal, Muhammad Asim Jamshed, Peter Vondras, Ali Musa Iftikhar, Ihsan Ayyub Qazi, and Fahad R. Dogar. 2019. “Reduc‐ ing Tail Latency via Safe and Simple Duplication.” Presented at CoNext 2019: 15th International Conference on Emerging Networking Experiments and Technolo‐ gies, Orlando, FL, December 9-12. https://arxiv.org/pdf/1905.13352.pdf.

281 [Bey16] Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, eds. 2016. Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O’Reilly. [Che02] Chen, Mike Y., Emre Kıcıman, Eugene Fratkin, Armando Fox, and Eric Brewer. 2002. “Pinpoint: Problem Determination in Large, Dynamic Internet Serv‐ ices.” Proceedings of the 2002 International Conference on Dependable Systems and Networks: 595-604. http://roc.cs.berkeley.edu/papers/roc-pinpoint-ipds.pdf. [Cho14] Chow, Michael, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. “The Mystery Machine: End-to-End Performance Analysis of Large-Scale Internet Services.” Presented at 11th USENIX Symposium on Operat‐ ing Systems Design and Implementation, Broomfield, CO, October 2014. https:// www.usenix.org/system/files/conference/osdi14/osdi14-paper-chow.pdf. [Cho16] Chow, Michael, Kaushik Veeraraghavan, Michael J. Cafarella, and Jason Flinn. 2016. “DQBarge: Improving Data-quality Trade-offs in Large-Scale Internet Services.” Presented at 12th USENIX Symposium on Operating Systems Design and Implementation, Savannah, GA, November 2016. https://www.usenix.org/ system/files/conference/osdi16/osdi16-chow.pdf. [Dan15] Dang, Huynh Tu, Daniele Sciascia, Marco Canini, Fernando Pedone, Robert Soulé. 2015. “NetPaxos: Consensus at Network Speed.” SOSR ’15: Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (June 2015) 5: 1-7. https://doi.org/10.1145/2774993.2774999. [Erl11] Erlingsson, Úlfar, Marcus Peinado, Simon Peter, and Mihai Budiu. “Fay: Extensible Distributed Tracing from Kernels to Clusters.” SOSP ’11: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (October 2011): 311-26. https://doi.org/10.1145/2043556.2043585. [Fon07] Fonseca, Rodrigo, George Porter, Randy H. Katz, Scott Shenker, and Ion Sto‐ ica. 2007. “X-Trace: A Pervasive Network Tracing Framework.” Proceedings of the 4th USENIX Symposium on Networked Systems Design & Implementation. https:// people.eecs.berkeley.edu/~istoica/papers/2007/xtr-nsdi07.pdf. [Fow14] Fowler, Martin. 2014. “Microservice Prerequisites.” MartinFowler.com. August 28, 2014. https://martinfowler.com/bliki/MicroservicePrerequisites.html. [Kal17] Kaldor, Jonathan, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuro‐ patwa, Joe O’Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, and Yee Jiun Song. 2017. “Canopy: An End-to-End Performance Tracing and Analysis System.” SOSP ’17: Proceedings of the 26th Symposium on Operating Systems Principles, October 2017. https:// doi.org/10.1145/3132747.3132749. [Lin06] Linden, Greg. “Marissa Mayer at Web 2.0.” Geeking with Greg, November 9, 2006. http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html. [Lof17] Loff, João, Daniel Porto, Carlos Baquero, João Garcia, Nuno Preguiça, and Rodrigo Rodrigues. 2017. “Transparent Cross-System Consistency.” PaPoC ’17:

282 | Bibliography Proceedings of the 3rd International Workshop on Principles and Practice of Consis‐ tency for Distributed Data, April, 2017. https://doi.org/10.1145/3064889.3064898. [Mac15] Mace, Jonathan, Peter Bodik, Rodrigo Fonseca, and Madanlal Musuvathi. 2015. “Retro: Targeted Resource Management in Multi-Tenant Distributed Sys‐ tems.” Presented at USENIX Symposium on Networked Systems Design and Implementation, Oakland, CA, May 2015. https://www.usenix.org/system/files/ conference/nsdi15/nsdi15-paper-mace.pdf [Mac18a] Mace, Jonathan, and Rodrigo Fonseca. 2018. “Universal Context Propaga‐ tion for Distributed System Instrumentation.” EuroSys ’18: Proceedings of the Thir‐ teenth EuroSys Conference (April 2018): 1–18. https://doi.org/ 10.1145/3190508.3190526. [Mac18b] Mace, Jonathan, Ryan Roelke, and Rodrigo Fonseca. 2018. “Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems.” ACM Transactions on Com‐ puter Systems 35(4): 1-28, October 2015. https://www2.cs.uic.edu/~brents/cs494- cdcs/papers/pivot-tracing.pdf. [May10] Mayer, Marissa. 2010. “Google Speed Research.” Filmed at Web 2.0, San Francisco, CA, April 2010. https://www.youtube.com/watch?v=BQwAKsFmK_8. [Med17] Meder, Sam, Vadim Antonov, and Jeff Chang. 2017. “Driving User Growth with Performance Improvements.” Medium. March 3, 2017. https://medium.com/ pinterest-engineering/driving-user-growth-with-performance-improvements- cfc50dafadd7. [Mic13] Mickens, James. 2013. “The Night Watch.” Login: Logout. November 2013. https://www.usenix.org/system/files/1311_05-08_mickens.pdf. [Mil17] Mills, David L. 2017. Computer Network Time Synchronization: The Network Time Protocol on Earth and in Space, 2nd ed. Boca Raton, FL: CRC Press. [Ros18] Rosen, Logan. 2018 “LinkedOut: A Request-Level Failure Injection Frame‐ work.” LinkedIn Engineering Blog, May, 2018. https://engineering.linkedin.com/blog/ 2018/05/linkedout--a-request-level-failure-injection-framework. [Sam16] Sambasivan, Raja R., Ilari Shafer, Jonathan Mace, Benjamin H. Sigelman, Rodrigo Fonseca, and Gregory R. Ganger. 2016. “Principled Workflow-Centric Tracing of Distributed Systems.” SoCC ’16: Proceedings of the Seventh ACM Sympo‐ sium on Cloud Computing, October 2016. http://dx.doi.org/ 10.1145/2987550.2987568. [Sig16] Sigelman, Ben. 2016. “Towards Turnkey Distributed Tracing.” Medium. June 15, 2016. https://medium.com/opentracing/towards-turnkey-distributed- tracing-5f4297d1736. [Sig19] Sigelman, Ben. 2019. “How ‘Deep Systems’ Broke Observability…And What We Can Do About It.” Filmed October 16, 2019, Systems @Scale 2019, San Jose, CA. https://atscaleconference.com/videos/systems-scale-2019-how-deep-systems- broke-observability-and-what-we-can-do-about-it. [Sig10] Sigelman, Benjamin H., Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. “Dapper,

Bibliography | 283 a Large-Scale Distributed Systems Tracing Infrastructure.” Google paper. https:// research.google/pubs/pub36356. [Sou09] Souders, Steve. 2009. “Velocity and the Bottom Line.” O’Reilly Radar, July 1, 2009. http://radar.oreilly.com/2009/07/velocity-making-your-site-fast.html. [Sta19] Stack Overflow. 2019. “Developer Survey Results 2019.” https://insights.stacko verflow.com/survey/2019. [Wu19] Wu, Yang, Ang Chen, and Linh Thi Xuan Phan. 2019. “Zeno: Diagnosing Performance Problems with Temporal Provenance.” Facebook paper presented at 16th USENIX Symposium on Networked Systems Design and Implementation, Boston, February 2019. https://www.cs.rice.edu/~angchen/papers/nsdi-2019.pdf. [Zva19] Zvara, Zoltán, Péter G.N. Szabó, Barnabás Balázs, and András Benczúr. 2019. “Optimizing Distributed Data Stream Processing by Tracing.” Future Generation Computer Systems 90: 578-91. https://www.sciencedirect.com/science/article/abs/pii/ S0167739X17325141.


292 | Index Special Agent, 106, 107, 272 tracer architecture, 104, 105 Spring Framework instrumentation, 26 tracing-friendly architectures, 23 user-managed process context object, 23 version attributes, 37 JavaScript tracing support, 272 link objects, 39 LinkedOut (LinkedIn), 257 K Linkerd, 101 kernel tracing, 120 load balancer tracing, 101 Kibana, 139 logs Kolmogorov-Smirnov (K-S) statistic, 194 best practices, 81 Kubernetes challenges of, 139 DaemonSet, 106 correlation ID, 139, 144 deployment of tracing and, 103 deleting logging statements, 59 Pod and tracing, 106 distributed tracing versus, 5 ELK stacks, 139 events, xxi L (see also events) language-specific tracing features, 271 Go, 138 latency latency of request, 118 aggregate analysis, 197 levels or severity, 138 automated analysis of distribution, 193 monolithic server tracing, 25 correlation analysis examples, 174, 196 as observability pipe, 144 in cost of tracing, 118 as observability tool, 135, 138, 143 critical path, 160 observability tool fatal flaws, 143 data quality trade-offs, 256 open source instrumentation versus, 58 distributions not normal, 157, 166, 169 OpenTracing complex object logging, 46 economic value of lower latency, 154 privacy laws and regulations, 29, 82 fully automated analysis, 176 searching, 143, 190 high-percentile latency, 155, 166, 211 verbosity, 87 histograms, 158 Logstash, 139 improvement approaches, 163-176 measuring performance, 154 median latency improvement, 154 M network congestion and trace aggregations, Mace, Jonathan, 219 211 machine learning (ML) restoring performance after incident (see AllReduce communication, 231 restoring baseline performance) DQBarge, 257 traffic rate growing, 191 embeddings, 231 library-based instrumentation inference, 231 about, 15 steps not requests, 230 agent-based versus, 15 tracing tools, 230 deploying new libraries, 105 Magpie, 204 difficulty of writing, 34 maintainability of distributed software, xvii extent of instrumentation, 94 malicious users and frontend services, 112 framework instrumentation best practices, managed service telemetry, 114 73 manual instrumentation general-purpose library benefits, 34 example basic design, 61 OpenTelemetry API, 37 request detail increased, 78 propagator stack, 55 root span propagation, 78 RPC library instrumentation, 24 mean time to innocence (MTTI), 183 for sustainable instrumentation growth, 97 mean time to recovery (see MTTR)

Index | 293 measuring performance, 154 ML (see machine learning) merging two contexts into one, 263 mobile clients Baggage Definition Language, 266 client app tracing challenges, 227 merge semantics, 264 data outside your control, 110 metadata (see tags) distributed tracing, 27 metrics powering down and span buffering, 120 aggregation of, 136, 143 modality in histograms, 169 application versus system, 13 monitoring associating with traces, 233 application versus system, 13 cardinality, 5, 142 challenges of, xviii challenges of, 138 distributed tracing overview, 5 counters, 137 observability, 135 definition, 136 trace-driven development, 85 distributed tracing versus, 5 monolithic applications, xvi gauges, 137 gentrified applications, 21 history, xi improving baseline performance, 146 labels, 137 tracing in, 25 latency (see latency) MTTR (mean time to recovery) low-level performance, 231 distributed tracing to reduce, 92 as observability pipe, 144 service graph improving, 76 as observability tool, 136, 142 multimodal analysis, 169 (see also observability) bin size, 170 observability tool fatal flaws, 142 correlation analysis sample sets via, 175 OpenCensus associating with traces, 233, multiparent causality, 262 244 The Mystery Machine (Facebook), 205, 235 microservices about, xvi N abstract instrumentation, 33 namespacing, 37 availability, 181 network boundary tracing, 24 best practices example microservice, 62 network congestion and trace aggregations, 211 considerations for adopting, 4 network costs, 122 cross-system consistency, 258 Dapper, 123 distributed dataflow lacking tracing, 229 sampling decisions and, 125 distributed tracing relationship to, 5 Network Time Protocol for clock skew, 111 gentrified applications, 21 network types, 3G versus 4G, 169 history of, 202 NGINX independence of services, xiii, 18, 258 C++ support, 272 interprocess context propagation, 18 tracer support, 101 intraprocess context propagation, 20 NoOp (no operation), 29, 35 publish-subscribe systems, 228 normal performance defined, 191 security audits via traces, 259 aggregations of traces, 210 service graph, 76 Kolmogorov-Smirnov statistic, 194 spans for all, 25 testing with traces, 89, 259 tracing-friendly architectures, 23 O Microsoft IIS (Internet Information Services), obfuscation 32 distributed software challenge, xix migration strategies, 57 distributed tracing mitigating, xxi interoperability avoiding, 58 observability about, 135

294 | Index design goals, 141 new wave tracing paradigm, 233, 244 distributed tracing, 135, 139, 144 as OpenTelemetry metrics, 233, 238, 241 ELK stacks, 139 plug-ins registry URL, 78 goals of, 145 tags, 244, 260 horizons, 150 Web, 273 logs, 135, 138, 144 OpenTelemetry, 34 machine learning outside of, 230 API component, 35, 39, 272, 277 metrics, 136, 142, 144 as instrumentation standard, 53 OpenTelemetry context model, 275-277 associating metrics with traces, 233 performance impact on users, 146 backward-compatibility with OpenTracing, real-time response, 189 58 statistical fidelity of tools, 149 best practices HTTP instrument, 66, 69 three pillars, 135, 145 best practices trace handler, 69 three pillars, fatal flaws, 142 best practices tracer creation, 63, 69 three pipes, 144 collectors, 107 tool scorecard, 148, 199 components of, 35 open source instrumentation context model, 276, 280 abstract instrumentation, 32 context model separation, 275 benefits of, 33, 57, 109 context propagation example, 278 as boilerplates, 59, 69 current status, 35, 270 community of open source, 59 distributed and local context, 278 Dapper-style tracing, 54 example, 41 extent of instrumentation, 94 exporters, 35, 43 general-purpose instrumentation library history, 35, 43 benefits, 34 instrumentation plug-in, 58, 63, 270 interoperability, 35, 58 migration strategies, 57 Jaeger as, 140, 269 named tracers, 37 managed service telemetry and, 115 new wave tracing paradigm, 233 migration strategies, 57 observability, 275, 277 plug-ins registry, 78 OpenCensus as metrics of, 233, 238, 241 portability, 58 plug-ins registry URL, 58, 78 proprietary instrumentation challenges, 31, portability of, 59 57 SDK component, 35 span name required, 79 semantic attributes, 24, 78 open source tracers and trace analyzers, 269 span context, 37 Open Web Application Security Project, 259 as standard for instrumentation, 43 OpenCensus standard format for context propagation, about, 48, 241, 255 277 APIs, 48 Tracer, 37 associating metrics with traces, 233, 244 vendor neutral, 58 challenges, 244 OpenTracing, 43 context propagation, 244 as abstraction layer, 261 as cross-cutting tool, 253-255 API, 44, 47, 272 distributed tracing comparison, 245 carriers, 45 example, 51, 242 complex object logging, 46 exporters, 48 context header data formats, 45 gRPC, 48 example, 46 in-process context propagation, 49 framework instrumentation libraries, 73 migration strategies, 57 general-purpose baggage, 260

Index | 295 HTTP headers, 74 percentiles, 156 inter-request dependencies, 227 portion of user requests that fail, 155 interoperability of, 58 prioritization of problems, 151, 199 migration strategies, 57 request latency, 118 multi-parent spans, 233 restoring performance after incident (see OpenTelemetry backward-compatible with, restoring baseline performance) 58, 241 sampling, 96 plug-ins registry URL, 58, 78 (see also sampling) propagator stack, 55 saturation, 155, 180 span context, 19, 46, 275 traces for testing services, 89, 259 OpenZipkin, 140, 269 traffic rate, 155, 180 multi-parent spans, 233 user impact, 146, 154 operation name, 79 Perl, Sharon, xii orchestration tools and deployment, 103 permission to use code from book, xxiii organizational buy-in, 91, 97 personally identifiable information (PII), 82 challenges of distributed tracing, 109 (see also privacy laws and regulations) deployment, 100, 103 Pinpoint, 204, 205, 270 organizing trace aggregation data, 212 Pivot Tracing out-of-band data collection, 206 about, 255 outlier challenges, 243 baggage as dynamic context, 248 OpenCensus handling, 244 challenges, 249 O’Reilly Media online, xxiv as cross-cutting tool, 253, 255 distributed tracing comparison, 248 P dynamic instrumentation, 246 past performance (see historical data) example, 247 percentiles, 156 goal of, 240, 241, 246 caution on computing with, 158 queries, 248 performance considerations unanticipated problem diagnosis, 246, 248 APM tools for tracing, 140 planning for instrumentation, 91, 103 application throughput, 120 observability, 141 browser tracing, 120 plug-ins by language, 271 chaos engineering, 189 portability of open source instrumentation, 58 critical path, 160 postmortems for incidents, 184 Dapper latency and throughput metrics, 121 prioritization of problems, 151, 199 data quality trade-offs, 256 privacy laws and regulations defining baseline, 180 EU General Data Protection Regulation, 29 (see also improving baseline perfor‐ logging and, 29, 82 mance; restoring baseline perfor‐ noncompliance costs high, 94 mance) span data, 82 defining normal, 191 Project 5 for inferring causality, 235 of distributed tracing, 82, 120 propagating context (see context propagation) fully automated analysis, 176 propagator stack, 55, 95 how much instrumentation, 95 proprietary instrumentation challenges, 31, 57 improvement approaches, 163-176 publish-subscribe (pub-sub) systems, 228 kernel tracing, 120 Pythia latency economic impact, 154 as cross-cutting tool, 253, 255 low-level performance, 231 design, 251 making the case for instrumentation, 91 distributed tracing comparison, 251 measuring performance, 154 dynamic instrumentation, 249, 250

296 | Index example, 249 human responses to incident, 182, 186 goal of, 240, 241, 250 incident hand-off, 184 unanticipated problem diagnosis, 249 incident response, 179 Python tracing support, 273 individual traces, 186 Kolmogorov-Smirnov statistic, 194 Q mitigating performance problems, 180 questions answered, pointer to, 24 as observability goal, 145 questions or problems with book, xxiii postmortems, 184 real-time response, 189 tag durations, 197 R tags with version attributes, 188 README attributes, 24 trace searches, 189 real-time response to incidents, 189 traces instead of blame, 182 meaning of real-time, 190 Retro for resource management, 255 remote direct memory access (RDMA), 232 root cause analysis remote procedure calls (see RPCs) aggregate analysis, 194, 195 request-response systems, 202, 223 biased sampling, 187 streaming systems versus, 230, 259 causality, 234 requests correlation analysis, 196 context via distributed tracing, xx defining normal, 191 request duplication, 258 individual traces, 186 response latency in history, 202 Kolmogorov-Smirnov statistic, 194 as RPCs to be traced, xx, 2 Pinpoint, 204 resource management, 255 real-time response, 189 response-based sampling, 127 root span RESTful API, 19 concurrent subspan calls, 164 restoring baseline performance manual instrumentation, 78 about, 146, 179 OpenCensus, 49 aggregate analysis, 194-197 OpenTelemetry, 38 alert data presented, 185 originating and following traces, 39 alert routing, 186 redundancy in subspans, 164 approaches to, 185-198 RPCs (remote procedure calls) automated analysis of latency distribution, batching, 226 193 best practices example framework, 65 automated root cause analysis, 198 best practices example instrumentation, 66, biased sampling, 187 69 changes that impact service performance, best practices framework instrumentation, 179 72 chaos engineering, 189 deployment leveraging, 102 communication, 180, 182, 185 distributed software inputs as, 11 correlation analysis, 196 history of, 223 critical path changes, 197 HTTP use in book, 19 defining baseline performance, 180 inter-request dependencies, 227 defining normal, 191 interprocess context propagation, 19 distributed tracing for, 199 library instrumentation, 24 error rates reported by and outside of ser‐ open source instrumentation, 59 vice, 181 propagator stack, 55 error sources, 183 relationships, xxi histograms, 192, 193 as requests to be traced, xx, 223 historical data, 191

Index | 297 tags into, 233 service level indicators (SLIs) unique position in software stack, 224 automated alert data presented, 185 Rust tracing features, 272 defining baseline performance, 180 improving baseline performance, 156 S metric labels, 136 safety valves, 84 traces for gathering, 90 sampling, 96 service level objectives (SLOs) aggregate analysis and, 214 business metric–based, 181 aggregate critical path analysis, 171 defining baseline performance, 180, 185 application throughput, 121 failing to meet, 181 bias, accounting for, 131 SLIs compared with, 136 biased sampling, 165, 187 traces for gathering, 90 Canopy, 220 service meshes centralized sampling decisions, 128 abstract instrumentation, 33 coherent sampling, 240 beginning instrumentation at edges, 71 collectors for, 108 definition, 75 costs of tracing, 124, 143, 240 deployment leveraging, 102 Dapper latency and throughput metrics, instrumentation best practices, 75 121, 143 instrumentation challenges, 76 Dapper network and storage costs, 123 migration strategy via, 57 Dapper percentage of transactions, 143 open source instrumentation benefits, 59 data volume reduction, 132 services (see microservices) frequency of, 131 sidecar proxies of service meshes minimum requirements, 124 definition, 75 observability tool volume limits, 150 deployment of tracing, 106 response-based, 127 tracer architecture, 104 sampling rates, 126, 143 tracing, 75 selecting traces, 130 tracing challenges, 76 strategies, 126 site reliability engineering (SRE), 135, 136, 184 time period, 172 SkyWalking (Apache Foundation), 269 TraceID on spans, 129 SLIs (see service level indicators) up-front sampling, 126 SLOs (see service level objectives) saturation, 155, 180 spans scalability of distributed software, xvii about, 38, 223 scope manager for span contexts, 23 active span, 21 Scuba (Facebook), 145 all microservices producing, 25 searches benefits, 223 extremely large datasets, 143 benefits not enough, 225 logs, 143, 190 best practices, 79 trace search queries, 168 best practices example tracing, 70 traces, 167, 189 beyond spans (see beyond spans) security audits via traces, 259 context propagation, 16, 37 gRPC context class for security, 275 cost of tracing and span selection, 117 segments (spans) in X-Ray, 53 (see also cost of tracing) serverless technologies created under OpenCensus, 49 challenges of, 5 created under OpenTelemetry, 38 tracing-friendly architectures, 23 created under OpenTracing, 45 service graph, 76 creation safety valve, 84 service level agreement (SLA), 181 critical path, 160

298 | Index critical path absolute durations, 172 context propagation headers and formats, critical path aggregate analysis, 171 20, 24, 55, 95, 206 debouncing, 29 managed service telemetry, 115 definition, xx migration strategies, 57 history of, 207, 223 OpenTelemetry, 35, 43, 53 independence of, 17 for sustainable instrumentation growth, 97 information encapsulated in, 38 tags, 80 IsRecording API, 39 user-defined CorrelationContext data, 277 link objects, 39 W3C TraceContext, 45, 54, 277 logging with, 81 W3C tracing headers, 69 multiparent causality, 262 statistics naming, 79, 82 correlation analysis, 173, 196 OpenTelemetry APIs, 39 gauge values, 137 OpenTelemetry Tracer, 37 histograms generating, 137 OpenTracing APIs, 45 Kolmogorov-Smirnov statistic, 194 OpenTracing span context, 46 latency distributions not normal, 157, 166, parent-child relationship, 17, 22, 38 169 performance considerations, 82, 95 multimodal analysis, 169, 175 privacy laws and regulations, 82 observability tools, 149 root span of OpenTelemetry, 38 percentile caution, 158 RPC position in software stack, 224 span statistics, 132 safety valves, 84 storage of tracings (see data storage) sampling, 96 streaming systems challenges, 230, 259 (see also sampling) system versus application instrumentation, 13 scope manager for span contexts, 23 as segments in X-Ray, 53 T span context, 19, 37, 46 tags span context as critical, 74 about, xx, 39 span context propagator stack, 55 best practices, 80 span ID, 19 cardinality, 80, 175 statistics about, 132 frontend tracing, 28 status codes of OpenTelemetry API, 40 important information in, 24, 80 tag information, 24, 175 monolithic server traces, 27 (see also tags) OpenCensus unique fields, 50 trace description, xxi OpenTelemetry, 39, 78 TraceID, 129 OpenTracing complex object logging, 46 tree of spans abstraction, 223, 226 privacy laws and regulations, 82 Spark, 229 README attributes, 24 Spring Framework instrumentation, 26 region tag and latency, 165 SQL and organizing trace aggregations, 212 request detail increased, 78 joining with other data sources, 218 searching traces, 167 SRE (see site reliability engineering) semantic attributes, 24, 78 stack traces service-specific for custom processing, 217 debugging ubiquity of, 1 SetAttributes API of OpenTelemetry, 41 performance considerations, 83 span kind field, 38, 40, 50, 78 standards tag durations, 197 B3 HTTP Headers, 54 user actions for SLOs, 181 beginning with deployment, 104 version attributes, 37, 94, 188 tags in new tracing paradigms, 233, 244

Index | 299 tail-based sampling, 127 OpenTelemetry, 37, 63, 69, 69 temporal provenance, 235 OpenTracing, 44, 47 TensorFlow TensorBoard visualization kit, 230 traces throughput in cost of tracing, 120 about, xx timescales of timestamps versus aggregation, aggregate analysis, 171, 209 232 (see also trace aggregations) trace aggregations browser tracing, 120 about, 209 developing application with traces, 85 aggregate analysis, 209 directed acyclic graphs, 89 (see also aggregate analysis) indexing, 167 API endpoint service, 211 individual trace optimization, 163 Canopy, 219 interoperability between tracing systems, 55 case study, 219 kernel tracing, 120 custom functions, 217, 221 named tracers of OpenTelemetry, 37 heterogeneous data, 221 observable inputs for, 11 improving baseline performance, 171 originating and following traces, 39 indexing traces, 217 sampling, 96 joining with other data sources, 218 sampling, selecting for, 130 network congestion, 211 search queries, 168 organizing data, 212 searching, 167, 189 processing pipeline, 215, 220 security audits, 259 processing pipeline, Canopy, 220, 221 service testing with, 89, 259 sampling and aggregate analysis, 214 span context propagation, 16, 37 strawperson table, 212, 217, 219 storage of, 108, 121 trace context TraceID, 19, 129 about, xxi, 18 tracer architecture, 104, 117 application versus system level, 14 verbosity, 87 B3 HTTP Headers de-facto standard, 54 tracing format (see header formats) best practices custom instrumentation, 70 Tracing Plane, 253, 261 best practices example tracing headers, 69 Baggage Definition Language, 264 context propagation, 16 traffic rate (see also context propagation) as golden metric of performance, 155, 180 distributed tracing overview, 5 latency climbing, 191 documentation of, 20 Twitter Zipkin (see Zipkin) interoperability, 55 interprocess propagation, 18 U intraprocess propagation, 20 unanticipated problem diagnosis OpenCensus in-process propagation, 49 Pivot Tracing, 246, 248 OpenTelemetry Tracer, 37 Pythia, 249 span context propagator stack, 55 unbiased sampling, 126 span parent-child relationships, 17, 22 up-front sampling, 126 standard headers and formats, 20, 24 URLs TraceID, 19 best practices example microservice, 62 W3C TraceContext standard, 45, 54 book code examples, xxiii trace-driven development, 85 book web page, xxiv tracers commercial tracers and trace analyzers, 270 commercial, 270 Distributed Diagnostic Context, 58 open source, 269 Jaeger, 269 OpenCensus, 49 open source instrumentation plug-ins, 78

300 | Index open source tracers and trace analyzers, 269 Kibana, 139 OpenCensus, 48 TensorFlow TensorBoard, 230 OpenTelemetry context layer, 280 OpenTelemetry integrations and plug-ins, W 58 W3C OpenTracing integrations and plug-ins, 58 best practices example tracing headers, 69 O’Reilly Media online, xxiv TraceContext, 45, 54, 277 Web APIs for frontend tracing, 28 WAN connectivity loss, 29 X-Ray, 53 Web APIs, URL for frontend tracing, 28 Zipkin, 54, 269 web clients user impact of performance abstract instrumentation necessity, 32 aggregate user behavior, 192 best practices example microservice, 62 automated alerts on symptoms, 186 browser tracing, 120, 273 chaos engineering, 189 client app tracing challenges, 227 critical path, 160 data outside your control, 110 measuring performance, 154 distributed tracing, 27 observability, 146 JavaScript instrumentation, 273 portion of user requests that fail, 155 request-response systems, 202 which users, 154, 166 response latency in history, 202 white box instrumentation, 12 V tracing-friendly architectures, 23 vendor URLs, 270 vendor lock-in, 33, 33, 58 X verbosity of traces, 87 X-Ray (Amazon), 53 version attributes, 37, 94, 188 X-Trace, 206, 207 virtualization, xvi visibility of OpenCensus, 246 visualization tools Z cumulative distribution function, 194 Zeno temporal provenance, 235 dashboards of operations centers, 147 Zipkin (Twitter), 54 flame graph, 140, 144 OpenZipkin, 140, 233, 269 histograms (see histograms) trace searches, 189 historical data, 191, 193 URL, 54

About the Authors

Austin Parker is principal developer advocate at Lightstep, where he works as a core contributor and maintainer to the OpenTracing project. Prior to Lightstep, he was a software architect at Apprenda, building enterprise platforms using Kubernetes.

Daniel Spoonhower is a cofounder of Lightstep, where he's building performance management tools for modern software systems. Previously, Spoons spent almost six years at Google, where he worked on developer tools as part of Google's internal infrastructure and Cloud Platform teams. He has published papers on the performance of parallel programs, garbage collection, and real-time programming. He has a PhD in programming languages from Carnegie Mellon University but still hasn't found one he loves.

Jonathan Mace is a tenure-track faculty member at the Max Planck Institute for Software Systems, where he leads the Cloud Software Systems research group. His research centers on ways to understand, monitor, and debug large distributed systems. His notable work includes Pivot Tracing, which introduced the concept of baggage used by distributed tracing today; Canopy, Facebook's internal performance tracing system; and his PhD work on abstractions for tracing, which received Honorable Mention for the 2018 Dennis M. Ritchie Doctoral Dissertation Award.

Rebecca Isaacs is a software engineer currently focused on the performance tuning and debugging of large-scale datacenter services. She was previously a research scientist, most recently at Google. She first started thinking about tracing for distributed systems over 15 years ago while at Microsoft Research, which she joined after obtaining a PhD from Cambridge University and a BSc from the University of Glasgow.

Colophon

The animal sniffing the subtitle on the cover of Distributed Tracing in Practice is a long-nosed bandicoot (Perameles nasuta), a marsupial found in a narrow range along the coast of eastern Australian rainforest and woodlands. Long-nosed bandicoots have brownish gray fur and a short tail, weigh about two pounds, and stretch about one foot long. The lifespan is typically five to six years.

These bandicoots dig distinctive conical holes when they forage with their clawed toes and characteristic snout. They forage for insects, fungi, and plants at night and nest in shallow dirt holes during the day. The long-nosed bandicoot's pouch faces backward to protect its young from dug-up soil.

When the conservation status of the long-nosed bandicoot was last assessed in 2015, authorities listed the species as of Least Concern, but Australia's Department of Agriculture, Water, and the Environment found it at risk of extinction after the bushfires of 2019–20. Many of the animals on O'Reilly's covers are endangered; all of them are important to the world.

The cover illustration is by Karen Montgomery, based on a black and white engraving from Meyers Kleines Lexicon. The cover fonts are Gilroy Semibold and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.