Prometheus: Up & Running
Infrastructure and Application Performance Monitoring

Brian Brazil


Prometheus: Up & Running by Brian Brazil

Copyright © 2018 Robust Perception Ltd. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Virginia Wilson
Indexer: Ellen Troutman-Zaig
Production Editor: Nicholas Adams
Interior Designer: David Futato
Copyeditor: Christina Edwards
Cover Designer: Karen Montgomery
Proofreader: Sonia Saruba
Illustrator: Rebecca Demarest
Tech Reviewers: Julius Volz, Carl Bergquist, Andrew McMillan, and Greg Stark

July 2018: First Edition

Revision History for the First Edition
2018-07-06: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492034148 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Prometheus: Up & Running, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-03414-8

Table of Contents

Preface

Part I. Introduction

1. What Is Prometheus?
    What Is Monitoring?
    A Brief and Incomplete History of Monitoring
    Categories of Monitoring
    Prometheus Architecture
    Client Libraries
    Exporters
    Service Discovery
    Scraping
    Storage
    Dashboards
    Recording Rules and Alerts
    Alert Management
    Long-Term Storage
    What Prometheus Is Not

2. Getting Started with Prometheus
    Running Prometheus
    Using the Expression Browser
    Running the Node Exporter
    Alerting

Part II. Application Monitoring

3. Instrumentation
    A Simple Program
    The Counter
    Counting Exceptions
    Counting Size
    The Gauge
    Using Gauges
    Callbacks
    The Summary
    The Histogram
    Buckets
    Unit Testing Instrumentation
    Approaching Instrumentation
    What Should I Instrument?
    How Much Should I Instrument?
    What Should I Name My Metrics?

4. Exposition
    Python
    WSGI
    Twisted
    Multiprocess with Gunicorn
    Go
    Java
    HTTPServer
    Servlet
    Pushgateway
    Bridges
    Parsers
    Exposition Format
    Metric Types
    Labels
    Escaping
    Timestamps
    check metrics

5. Labels
    What Are Labels?
    Instrumentation and Target Labels
    Instrumentation
    Metric
    Multiple Labels
    Child
    Aggregating
    Label Patterns
    Enum
    Info
    When to Use Labels
    Cardinality

6. Dashboarding with Grafana
    Installation
    Data Source
    Dashboards and Panels
    Avoiding the Wall of Graphs
    Graph Panel
    Time Controls
    Singlestat Panel
    Table Panel
    Template Variables

Part III. Infrastructure Monitoring

7. Node Exporter
    CPU Collector
    Filesystem Collector
    Diskstats Collector
    Netdev Collector
    Meminfo Collector
    Hwmon Collector
    Stat Collector
    Uname Collector
    Loadavg Collector
    Textfile Collector
    Using the Textfile Collector
    Timestamps

8. Service Discovery
    Service Discovery Mechanisms
    Static
    File
    Consul
    EC2
    Relabelling
    Choosing What to Scrape
    Target Labels
    How to Scrape
    metric_relabel_configs
    Label Clashes and honor_labels

9. Containers and Kubernetes
    cAdvisor
    CPU
    Memory
    Labels
    Kubernetes
    Running in Kubernetes
    Service Discovery
    kube-state-metrics

10. Common Exporters
    Consul
    HAProxy
    Grok Exporter
    Blackbox
    ICMP
    TCP
    HTTP
    DNS
    Prometheus Configuration

11. Working with Other Monitoring Systems
    Other Monitoring Systems
    InfluxDB
    StatsD

12. Writing Exporters
    Consul Telemetry
    Custom Collectors
    Labels
    Guidelines

Part IV. PromQL

13. Introduction to PromQL
    Aggregation Basics
    Gauge
    Counter
    Summary
    Histogram
    Selectors
    Matchers
    Instant Vector
    Range Vector
    Offset
    HTTP API
    query
    query_range

14. Aggregation Operators
    Grouping
    without
    by
    Operators
    sum
    count
    avg
    stddev and stdvar
    min and max
    topk and bottomk
    quantile
    count_values

15. Binary Operators
    Working with Scalars
    Arithmetic Operators
    Comparison Operators
    Vector Matching
    One-to-One
    Many-to-One and group_left
    Many-to-Many and Logical Operators
    Operator Precedence

16. Functions
    Changing Type
    vector
    scalar
    Math
    abs
    ln, log2, and log10
    exp
    sqrt
    ceil and floor
    round
    clamp_max and clamp_min
    Time and Date
    time
    minute, hour, day_of_week, day_of_month, days_in_month, month, and year
    timestamp
    Labels
    label_replace
    label_join
    Missing Series and absent
    Sorting with sort and sort_desc
    Histograms with histogram_quantile
    Counters
    rate
    increase
    irate
    resets
    Changing Gauges
    changes
    deriv
    predict_linear
    delta
    idelta
    holt_winters
    Aggregation Over Time

17. Recording Rules
    Using Recording Rules
    When to Use Recording Rules
    Reducing Cardinality
    Composing Range Vector Functions
    Rules for APIs
    How Not to Use Rules
    Naming of Recording Rules

Part V. Alerting

18. Alerting
    Alerting Rules
    for
    Alert Labels
    Annotations and Templates
    What Are Good Alerts?
    Configuring Alertmanagers
    External Labels

19. Alertmanager
    Notification Pipeline
    Configuration File
    Routing Tree
    Receivers
    Inhibitions
    Alertmanager Web Interface

Part VI. Deployment

20. Putting It All Together
    Planning a Rollout
    Growing Prometheus
    Going Global with Federation
    Long-Term Storage
    Running Prometheus
    Hardware
    Configuration Management
    Networks and Authentication
    Planning for Failure
    Alertmanager Clustering
    Meta- and Cross-Monitoring
    Managing Performance
    Detecting a Problem
    Finding Expensive Metrics and Targets
    Reducing Load
    Horizontal Sharding
    Managing Change
    Getting Help

Index

Preface

This book describes in detail how to use the Prometheus monitoring system to monitor, graph, and alert on the performance of your applications and infrastructure. This book is intended for application developers, system administrators, and everyone in between.

Expanding the Known

When it comes to monitoring, knowing that the systems you care about are turned on is important, but that’s not where the real value is. The big wins are in understanding the performance of your systems.

By performance I don’t only mean the response time of and CPU used by each request, but the broader meaning of performance. How many requests to the database are required for each customer order that is processed? Is it time to purchase higher throughput networking equipment? How many machines are your cache misses costing? Are enough of your users interacting with a complex feature in order to justify its continued existence?

These are the sort of questions that a metrics-based monitoring system can help you answer, and beyond that help you dig into why the answer is what it is. I see monitoring as getting insight from throughout your system, from high-level overviews down to the nitty-gritty details that are useful for debugging. A full set of monitoring tools for debugging and analysis includes not only metrics, but also logs, traces, and profiling; but metrics should be your first port of call when you want to answer systems-level questions.

Prometheus encourages you to have instrumentation liberally spread across your systems, from applications all the way down to the bare metal. With instrumentation you can observe how all your subsystems and components are interacting, and convert unknowns into knowns.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, configuration files, etc.) is available for download at https://github.com/prometheus-up-and-running/examples.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Prometheus: Up & Running by Brian Brazil (O’Reilly). Copyright 2018 Robust Perception Ltd., 978-1-492-03414-8.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].

O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/prometheus-up-and-running. To comment or ask technical questions about this book, send email to [email protected].

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

This book would not have been possible without all the work of the Prometheus team, and the hundreds of contributors to Prometheus and its ecosystem. A special thanks to Julius Volz, Richard Hartmann, Carl Bergquist, Andrew McMillan, and Greg Stark for providing feedback on initial drafts of this book.

Part I. Introduction

This section will introduce you to monitoring in general, and Prometheus more specifically. In Chapter 1 you will learn about the many different meanings of monitoring and approaches to it, the metrics approach that Prometheus takes, and the architecture of Prometheus. In Chapter 2 you will get your hands dirty running a simple Prometheus setup that scrapes machine metrics, evaluates queries, and sends alert notifications.

CHAPTER 1 What Is Prometheus?

Prometheus is an open source, metrics-based monitoring system. Of course, Prometheus is far from the only one of those out there, so what makes it notable?

Prometheus does one thing and it does it well. It has a simple yet powerful data model and a query language that lets you analyse how your applications and infrastructure are performing. It does not try to solve problems outside of the metrics space, leaving those to other more appropriate tools.

Since its beginnings with no more than a handful of developers working in SoundCloud in 2012, a community and ecosystem has grown around Prometheus. Prometheus is primarily written in Go and licensed under the Apache 2.0 license. There are hundreds of people who have contributed to the project itself, which is not controlled by any one company. It is always hard to tell how many users an open source project has, but I estimate that as of 2018, tens of thousands of organisations are using Prometheus in production. In 2016 the Prometheus project became the second member1 of the Cloud Native Computing Foundation (CNCF).

For instrumenting your own code, there are client libraries in all the popular languages and runtimes, including Go, Java/JVM, C#/.Net, Python, Ruby, Node.js, Haskell, Erlang, and Rust. Software like Kubernetes and Docker are already instrumented with Prometheus client libraries. For third-party software that exposes metrics in a non-Prometheus format, there are hundreds of integrations available. These are called exporters, and include HAProxy, MySQL, PostgreSQL, Redis, JMX, SNMP, Consul, and Kafka. A friend of mine even added support for monitoring Minecraft servers, as he cares a lot about his frames per second.

1 Kubernetes was the first member.

A simple text format makes it easy to expose metrics to Prometheus. Other monitoring systems, both open source and commercial, have added support for this format. This allows all of these monitoring systems to focus more on core features, rather than each having to spend time duplicating effort to support every single piece of software a user like you may wish to monitor.

The data model identifies each time series not just with a name, but also with an unordered set of key-value pairs called labels. The PromQL query language allows aggregation across any of these labels, so you can analyse not just per process but also per datacenter and per service or by any other labels that you have defined. These can be graphed in dashboard systems such as Grafana.

Alerts can be defined using the exact same PromQL query language that you use for graphing. If you can graph it, you can alert on it. Labels make maintaining alerts easier, as you can create a single alert covering all possible label values. In some other monitoring systems you would have to individually create an alert per machine/application. Relatedly, service discovery can automatically determine what applications and machines should be scraped from sources such as Kubernetes, Consul, Amazon Elastic Compute Cloud (EC2), Azure, Google Compute Engine (GCE), and OpenStack.

For all these features and benefits, Prometheus is performant and simple to run. A single Prometheus server can ingest millions of samples per second. It is a single statically linked binary with a configuration file. All components of Prometheus can be run in containers, and they avoid doing anything fancy that would get in the way of configuration management tools. It is designed to be integrated into the infrastructure you already have and built on top of, not to be a management platform itself.

Now that you have an overview of what Prometheus is, let’s step back for a minute and look at what is meant by “monitoring” in order to provide some context. Following that I will look at what the main components of Prometheus are, and what Prometheus is not.

What Is Monitoring?

In secondary school one of my teachers told us that if you were to ask ten economists what economics means, you’d get eleven answers. Monitoring has a similar lack of consensus as to what exactly it means.

When I tell others what I do, people think my job entails everything from keeping an eye on temperature in factories, to employee monitoring where I was the one to find out who was accessing Facebook during working hours, and even detecting intruders on networks.

Prometheus wasn’t built to do any of those things.2 It was built to aid software developers and administrators in the operation of production computer systems, such as the applications, tools, databases, and networks backing popular websites.

So what is monitoring in that context? I like to narrow this sort of operational monitoring of computer systems down to four things:

Alerting
    Knowing when things are going wrong is usually the most important thing that you want monitoring for. You want the monitoring system to call in a human to take a look.

Debugging
    Now that you have called in a human, they need to investigate to determine the root cause and ultimately resolve whatever the issue is.

Trending
    Alerting and debugging usually happen on time scales on the order of minutes to hours. While less urgent, the ability to see how your systems are being used and changing over time is also useful. Trending can feed into design decisions and processes such as capacity planning.

Plumbing
    When all you have is a hammer, everything starts to look like a nail. At the end of the day all monitoring systems are data processing pipelines. Sometimes it is more convenient to appropriate part of your monitoring system for another purpose, rather than building a bespoke solution. This is not strictly monitoring, but it is common in practice so I like to include it.

Depending on who you talk to and their background, they may consider only some of these to be monitoring. This leads to many discussions about monitoring going around in circles, leaving everyone frustrated. To help you understand where others are coming from, I’m going to look at a small bit of the history of monitoring.

A Brief and Incomplete History of Monitoring

While monitoring has seen a shift toward tools including Prometheus in the past few years, the dominant solution remains some combination of Nagios and Graphite or their variants.

When I say Nagios I am including any software within the same broad family, such as Icinga, Zmon, and Sensu. They work primarily by regularly executing scripts called

2 Temperature monitoring of machines and datacenters is actually not uncommon. There are even a few users using Prometheus to track the weather for fun.

checks. If a check fails by returning a nonzero exit code, an alert is generated. Nagios was initially started by Ethan Galstad in 1996, as an MS-DOS application used to perform pings. It was first released as NetSaint in 1999, and renamed Nagios in 2002.

To talk about the history of Graphite, I need to go back to 1994. Tobias Oetiker created a Perl script that became Multi Router Traffic Grapher, or MRTG 1.0, in 1995. As the name indicates, it was mainly used for network monitoring via the Simple Network Management Protocol (SNMP). It could also obtain metrics by executing scripts.3 The year 1997 brought big changes with a move of some code to C, and the creation of the Round Robin Database (RRD) which was used to store metric data. This brought notable performance improvements, and RRD was the basis for other tools including Smokeping and Graphite.

Started in 2006, Graphite uses Whisper for metrics storage, which has a similar design to RRD. Graphite does not collect data itself, rather it is sent in by collection tools such as collectd and Statsd, which were created in 2005 and 2010, respectively.

The key takeaway here is that graphing and alerting were once completely separate concerns performed by different tools. You could write a check script to evaluate a query in Graphite and generate alerts on that basis, but most checks tended to be on unexpected states such as a process not running.

Another holdover from this era is the relatively manual approach to administering computer services. Services were deployed on individual machines and lovingly cared for by systems administrators. Alerts that might potentially indicate a problem were jumped upon by devoted engineers. As cloud and cloud native technologies such as EC2, Docker, and Kubernetes have come to prominence, treating individual machines and services like pets with each getting individual attention does not scale. Rather, they should be looked at more as cattle and administered and monitored as a group. In the same way that the industry has moved from doing management by hand, to tools like Chef and Ansible, to now starting to use technologies like Kubernetes, monitoring also needs to make a similar transition from checks on individual processes on individual machines to monitoring based on service health as a whole.

You may have noticed that I didn’t mention logging. Historically logs have been used as something that you use tail, grep, and awk on by hand. You might have had an analysis tool such as AWStats to produce reports once an hour or day. In more recent years they have also been used as a significant part of monitoring, such as with the Elasticsearch, Logstash, and Kibana (ELK) stack.

Now that we have looked a bit at graphing and alerting, let’s look at how metrics and logs fit into things. Are there more categories of monitoring than those two?

3 I have fond memories of setting up MRTG in the early 2000s, writing scripts to report temperature and network usage on my home computers.

Categories of Monitoring

At the end of the day, most monitoring is about the same thing: events. Events can be almost anything, including:

• Receiving a HTTP request
• Sending a HTTP 400 response
• Entering a function
• Reaching the else of an if statement
• Leaving a function
• A user logging in
• Writing data to disk
• Reading data from the network
• Requesting more memory from the kernel

All events also have context. A HTTP request will have the IP address it is coming from and going to, the URL being requested, the cookies that are set, and the user who made the request. A HTTP response will have how long the response took, the HTTP status code, and the length of the response body. Events involving functions have the call stack of the functions above them, and whatever triggered this part of the stack such as a HTTP request.

Having all the context for all the events would be great for debugging and understanding how your systems are performing in both technical and business terms, but that amount of data is not practical to process and store. Thus there are what I would see as roughly four ways to approach reducing that volume of data to something workable, namely profiling, tracing, logging, and metrics.

Profiling

Profiling takes the approach that you can’t have all the context for all of the events all of the time, but you can have some of the context for limited periods of time.

Tcpdump is one example of a profiling tool. It allows you to record network traffic based on a specified filter. It’s an essential debugging tool, but you can’t really turn it on all the time as you will run out of disk space.

Debug builds of binaries that track profiling data are another example. They provide a plethora of useful information, but the performance impact of gathering all that information, such as timings of every function call, means that it is not generally practical to run it in production on an ongoing basis.

In the Linux kernel, extended Berkeley Packet Filters (eBPF) allow detailed profiling of kernel events from filesystem operations to network oddities. These provide access to a level of insight that was not generally available previously, and I’d recommend reading Brendan Gregg’s writings on the subject.

Profiling is largely for tactical debugging. If it is being used on a longer term basis, then the data volume must be cut down in order to fit into one of the other categories of monitoring.

Tracing

Tracing doesn’t look at all events, rather it takes some proportion of events such as one in a hundred that pass through some functions of interest. Tracing will note the functions in the stack trace of the points of interest, and often also how long each of these functions took to execute. From this you can get an idea of where your program is spending time and which code paths are most contributing to latency.

Rather than doing snapshots of stack traces at points of interest, some tracing systems trace and record timings of every function call below the function of interest. For example, one in a hundred user HTTP requests might be sampled, and for those requests you could see how much time was spent talking to backends such as databases and caches. This allows you to see how timings differ based on factors like cache hits versus cache misses.

Distributed tracing takes this a step further. It makes tracing work across processes by attaching unique IDs to requests that are passed from one process to another in remote procedure calls (RPCs) in addition to whether this request is one that should be traced. The traces from different processes and machines can be stitched back together based on the request ID. This is a vital tool for debugging distributed microservices architectures. Technologies in this space include OpenZipkin and Jaeger.

For tracing, it is the sampling that keeps the data volumes and instrumentation performance impact within reason.

Logging

Logging looks at a limited set of events and records some of the context for each of these events. For example, it may look at all incoming HTTP requests, or all outgoing database calls. To avoid consuming too many resources, as a rule of thumb you are limited to somewhere around a hundred fields per log entry. Beyond that, bandwidth and storage space tend to become a concern.

For example, for a server handling a thousand requests per second, a log entry with a hundred fields each taking ten bytes works out as a megabyte per second. That’s a nontrivial proportion of a 100 Mbit network card, and 84 GB of storage per day just for logging.

A big benefit of logging is that there is (usually) no sampling of events, so even though there is a limit on the number of fields, it is practical to determine how slow requests are affecting one particular user talking to one particular API endpoint.

Just as monitoring means different things to different people, logging also means different things depending on who you ask, which can cause confusion. Different types of logging have different uses, durability, and retention requirements. As I see it, there are four general and somewhat overlapping categories:

Transaction logs
    These are the critical business records that you must keep safe at all costs, likely forever. Anything touching on money or that is used for critical user-facing features tends to be in this category.

Request logs
    If you are tracking every HTTP request, or every database call, that’s a request log. They may be processed in order to implement user facing features, or just for internal optimisations. You don’t generally want to lose them, but it’s not the end of the world if some of them go missing.

Application logs
    Not all logs are about requests; some are about the process itself. Startup messages, background maintenance tasks, and other process-level log lines are typical. These logs are often read directly by a human, so you should try to avoid having more than a few per minute in normal operations.

Debug logs
    Debug logs tend to be very detailed and thus expensive to create and store. They are often only used in very narrow debugging situations, and are tending towards profiling due to their data volume. Reliability and retention requirements tend to be low, and debug logs may not even leave the machine they are generated on.

Treating the differing types of logs all in the same way can end you up in the worst of all worlds, where you have the data volume of debug logs combined with the extreme reliability requirements of transaction logs. Thus as your system grows you should plan on splitting out the debug logs so that they can be handled separately.

Examples of logging systems include the ELK stack and Graylog.
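To put the bandwidth and storage numbers from the request log example earlier in this section on firmer footing, here is the same back-of-the-envelope arithmetic written out as a small Python snippet. The request rate, field count, and field size are the illustrative figures from the text, not measurements:

requests_per_second = 1000
fields_per_entry = 100
bytes_per_field = 10

bytes_per_second = requests_per_second * fields_per_entry * bytes_per_field
megabytes_per_second = bytes_per_second / 10**6           # about 1 MB per second
gigabytes_per_day = megabytes_per_second * 86400 / 1024   # roughly 84 GB per day
print(megabytes_per_second, gigabytes_per_day)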

Metrics

Metrics largely ignore context, instead tracking aggregations over time of different types of events. To keep resource usage sane, the amount of different numbers being tracked needs to be limited: ten thousand per process is a reasonable upper bound for you to keep in mind.

Examples of the sort of metrics you might have would be the number of times you received HTTP requests, how much time was spent handling requests, and how many requests are currently in progress. By excluding any information about context, the data volumes and processing required are kept reasonable.

That is not to say, though, that context is always ignored. For a HTTP request you could decide to have a metric for each URL path. But the ten thousand metric guideline has to be kept in mind, as each distinct path now counts as a metric. Using context such as a user’s email address would be unwise, as they have an unbounded cardinality.4

You can use metrics to track the latency and data volumes handled by each of the subsystems in your applications, making it easier to determine what exactly is causing a slowdown. Logs could not record that many fields, but once you know which subsystem is to blame, logs can help you figure out which exact user requests are involved.

This is where the tradeoff between logs and metrics becomes most apparent. Metrics allow you to collect information about events from all over your process, but with generally no more than one or two fields of context with bounded cardinality. Logs allow you to collect information about all of one type of event, but can only track a hundred fields of context with unbounded cardinality. This notion of cardinality and the limits it places on metrics is important to understand, and I will come back to it in later chapters.

As a metrics-based monitoring system, Prometheus is designed to track overall system health, behaviour, and performance rather than individual events. Put another way, Prometheus cares that there were 15 requests in the last minute that took 4 seconds to handle, resulted in 40 database calls, 17 cache hits, and 2 purchases by customers. The cost and code paths of the individual calls would be the concern of profiling or logging.

Now that you have an understanding of where Prometheus fits in the overall monitoring space, let’s look at the various components of Prometheus.

Prometheus Architecture

Figure 1-1 shows the overall architecture of Prometheus. Prometheus discovers targets to scrape from service discovery. These can be your own instrumented applications or third-party applications you can scrape via an exporter. The scraped data is

4 Email addresses also tend to be personally identifiable information (PII), which bring with them compliance and privacy concerns that are best avoided in monitoring.

stored, and you can use it in dashboards using PromQL or send alerts to the Alertmanager, which will convert them into pages, emails, and other notifications.

Figure 1-1. The Prometheus architecture

Client Libraries

Metrics do not typically magically spring forth from applications; someone has to add the instrumentation that produces them. This is where client libraries come in. With usually only two or three lines of code, you can both define a metric and add your desired instrumentation inline in code you control. This is referred to as direct instrumentation.

Client libraries are available for all the major languages and runtimes. The Prometheus project provides official client libraries in Go, Python, Java/JVM, and Ruby. There are also a variety of third-party client libraries, such as for C#/.Net, Node.js, Haskell, Erlang, and Rust.
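To give a feel for what those two or three lines look like, here is a minimal sketch using the Python client library. The metric name and the toy program around it are mine, chosen for illustration; instrumentation is covered properly in Chapter 3:

import time
from prometheus_client import Counter, start_http_server

# One line to define the metric...
REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.')

if __name__ == '__main__':
    start_http_server(8000)  # serve metrics at http://localhost:8000/metrics
    while True:
        REQUESTS.inc()       # ...and one line to record each event
        time.sleep(1)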

Official Versus Unofficial

Don’t be put off by integrations such as client libraries being unofficial or third party. With hundreds of applications and systems that you may wish to integrate with, it is not possible for the Prometheus project team to have the time and expertise to create and maintain them all. Thus the vast majority of integrations in the ecosystem are third party. In order to keep things reasonably consistent and working as you would expect, guidelines are available on how to write integrations.

Client libraries take care of all the nitty-gritty details such as thread-safety, bookkeeping, and producing the Prometheus text exposition format in response to HTTP requests.

As metrics-based monitoring does not track individual events, client library memory usage does not increase the more events you have. Rather, memory is related to the number of metrics you have.

If one of the library dependencies of your application has Prometheus instrumentation, it will automatically be picked up. Thus by instrumenting a key library such as your RPC client, you can get instrumentation for it in all of your applications.

Some metrics are typically provided out of the box by client libraries such as CPU usage and garbage collection statistics, depending on the library and runtime environment.

Client libraries are not restricted to outputting metrics in the Prometheus text format. Prometheus is an open ecosystem, and the same APIs used to generate the text format can be used to produce metrics in other formats or to feed into other instrumentation systems. Similarly, it is possible to take metrics from other instrumentation systems and plumb them into a Prometheus client library, if you haven’t quite converted everything to Prometheus instrumentation yet.

Exporters

Not all code you run is code that you can control or even have access to, and thus adding direct instrumentation isn’t really an option. For example, it is unlikely that operating system kernels will start outputting Prometheus-formatted metrics over HTTP anytime soon.

Such software often has some interface through which you can access metrics. This might be an ad hoc format requiring custom parsing and handling, such as is required for many Linux metrics, or a well-established standard such as SNMP.

An exporter is a piece of software that you deploy right beside the application you want to obtain metrics from. It takes in requests from Prometheus, gathers the required data from the application, transforms it into the correct format, and finally returns it in a response to Prometheus. You can think of an exporter as a small one-to-one proxy, converting data between the metrics interface of an application and the Prometheus exposition format.

Unlike the direct instrumentation you would use for code you control, exporters use a different style of instrumentation known as custom collectors or ConstMetrics.5

5 The term ConstMetric is colloquial, and comes from the Go client library’s MustNewConstMetric function used to produce metrics by exporters written in Go.
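To make the custom collector style a little more concrete, here is a rough Python sketch. Writing exporters is covered in Chapter 12; the metric name and the get_queue_length stub below are invented for illustration, standing in for however you would query the application being exported:

import time
from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

def get_queue_length(queue):
    # Stand-in for asking the real application; replace with your own lookup.
    return 7.0

class QueueCollector(object):
    def collect(self):
        # A fresh metric is built from scratch on every scrape.
        gauge = GaugeMetricFamily(
            'myapp_queue_length', 'Items waiting in the queue.', labels=['queue'])
        gauge.add_metric(['orders'], get_queue_length('orders'))
        yield gauge

if __name__ == '__main__':
    REGISTRY.register(QueueCollector())
    start_http_server(8000)  # expose the collector at http://localhost:8000/metrics
    while True:
        time.sleep(1)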

The good news is that given the size of the Prometheus community, the exporter you need probably already exists and can be used with little effort on your part. If the exporter is missing a metric you are interested in, you can always send a pull request to improve it, making it better for the next person to use it.

Service Discovery

Once you have all your applications instrumented and your exporters running, Prometheus needs to know where they are. This is so Prometheus will know what it is meant to monitor, and be able to notice if something it is meant to be monitoring is not responding. With dynamic environments you cannot simply provide a list of applications and exporters once, as it will get out of date. This is where service discovery comes in.

You probably already have some database of your machines, applications, and what they do. It might be inside Chef’s database, an inventory file for Ansible, based on tags on your EC2 instance, in labels and annotations in Kubernetes, or maybe just sitting in your documentation wiki.

Prometheus has integrations with many common service discovery mechanisms, such as Kubernetes, EC2, and Consul. There is also a generic integration for those whose setup is a little off the beaten path (see “File” on page 130).

This still leaves a problem though. Just because Prometheus has a list of machines and services doesn’t mean we know how they fit into your architecture. For example, you might be using the EC2 Name tag6 to indicate what application runs on a machine, whereas others might use a tag called app. As every organisation does it slightly differently, Prometheus allows you to configure how metadata from service discovery is mapped to monitoring targets and their labels using relabelling.

Scraping

Service discovery and relabelling give us a list of targets to be monitored. Now Prometheus needs to fetch the metrics. Prometheus does this by sending a HTTP request called a scrape. The response to the scrape is parsed and ingested into storage. Several useful metrics are also added in, such as if the scrape succeeded and how long it took. Scrapes happen regularly; usually you would configure it to happen every 10 to 60 seconds for each target.

6 The EC2 Name tag is the display name of an EC2 instance in the EC2 web console.

Pull Versus Push

Prometheus is a pull-based system. It decides when and what to scrape, based on its configuration. There are also push-based systems, where the monitoring target decides if it is going to be monitored and how often.

There is vigorous debate online about the two designs, which often bears similarities to debates around Vim versus EMACS. Suffice to say both have pros and cons, and overall it doesn’t matter much.

As a Prometheus user you should understand that pull is ingrained in the core of Prometheus, and attempting to make it do push instead is at best unwise.

Storage

Prometheus stores data locally in a custom database. Distributed systems are challenging to make reliable, so Prometheus does not attempt to do any form of clustering. In addition to reliability, this makes Prometheus easier to run.

Over the years, storage has gone through a number of redesigns, with the storage system in Prometheus 2.0 being the third iteration. The storage system can handle ingesting millions of samples per second, making it possible to monitor thousands of machines with a single Prometheus server. The compression algorithm used can achieve 1.3 bytes per sample on real-world data. An SSD is recommended, but not strictly required.

Dashboards

Prometheus has a number of HTTP APIs that allow you to both request raw data and evaluate PromQL queries. These can be used to produce graphs and dashboards. Out of the box, Prometheus provides the expression browser. It uses these APIs and is suitable for ad hoc querying and data exploration, but it is not a general dashboard system.

It is recommended that you use Grafana for dashboards. It has a wide variety of features, including official support for Prometheus as a data source. It can produce a wide variety of dashboards, such as the one in Figure 1-2. Grafana supports talking to multiple Prometheus servers, even within a single dashboard panel.

Figure 1-2. A Grafana dashboard

Recording Rules and Alerts

Although PromQL and the storage engine are powerful and efficient, aggregating metrics from thousands of machines on the fly every time you render a graph can get a little laggy. Recording rules allow PromQL expressions to be evaluated on a regular basis and their results ingested into the storage engine.

Alerting rules are another form of recording rules. They also evaluate PromQL expressions regularly, and any results from those expressions become alerts. Alerts are sent to the Alertmanager.

Alert Management

The Alertmanager receives alerts from Prometheus servers and turns them into notifications. Notifications can include email, chat applications such as Slack, and services such as PagerDuty.

The Alertmanager does more than blindly turn alerts into notifications on a one-to-one basis. Related alerts can be aggregated into one notification, throttled to reduce pager storms,7 and different routing and notification outputs can be configured for each of your different teams. Alerts can also be silenced, perhaps to snooze an issue you are already aware of in advance when you know maintenance is scheduled.

7 A page is a notification to an oncall engineer which they are expected to promptly investigate or deal with. While you may receive a page via a traditional radio pager, these days it more likely comes to your mobile phone in the form of an SMS, notification, or phone call. A pager storm is when you receive a string of pages in rapid succession.

The Alertmanager’s role stops at sending notifications; to manage human responses to incidents you should use services such as PagerDuty and ticketing systems.

Alerts and their thresholds are configured in Prometheus, not in the Alertmanager.

Long-Term Storage

Since Prometheus stores data only on the local machine, you are limited by how much disk space you can fit on that machine.8 While you usually care only about the most recent day or so worth of data, for long-term capacity planning a longer retention period is desirable.

Prometheus does not offer a clustered storage solution to store data across multiple machines, but there are remote read and write APIs that allow other systems to hook in and take on this role. These allow PromQL queries to be transparently run against both local and remote data.

What Prometheus Is Not

Now that you have an idea of where Prometheus fits in the broader monitoring landscape and what its major components are, let’s look at some use cases for which Prometheus is not a particularly good choice.

As a metrics-based system, Prometheus is not suitable for storing event logs or individual events. Nor is it the best choice for high cardinality data, such as email addresses or usernames.

Prometheus is designed for operational monitoring, where small inaccuracies and race conditions due to factors like kernel scheduling and failed scrapes are a fact of life. Prometheus makes tradeoffs and prefers giving you data that is 99.9% correct over your monitoring breaking while waiting for perfect data. Thus in applications involving money or billing, Prometheus should be used with caution.

In the next chapter I will show you how to run Prometheus and do some basic monitoring.

8 However, modern machines can hold rather a lot of data locally, so a separate clustered storage system may not be necessary for you.

CHAPTER 2 Getting Started with Prometheus

In this chapter you will set up and run Prometheus, the Node exporter, and the Alertmanager. This simple example will monitor a single machine and give you a small taste of what a full Prometheus deployment looks like. Later chapters will look at each aspect of this setup in detail.

This chapter requires a machine running any reasonable, modern version of Linux. Either bare metal or a virtual machine will do. You will use the command line and access services on the machine using a web browser. For simplicity I will assume that everything is running on localhost; if this is not the case, adjust the URLs as appropriate.

A basic setup similar to the one used in this chapter is publicly available at http://demo.robustperception.io/.

Running Prometheus

Prebuilt versions of Prometheus and other components are available from the Prometheus website at https://prometheus.io/download/. Go to that page and download the latest version of Prometheus for the Linux OS with Arch amd64; the download page will look something like Figure 2-1.

Figure 2-1. Part of the Prometheus download page. The Linux/amd64 version is in the middle.

Here I am using Prometheus 2.2.1, so prometheus-2.2.1.linux-amd64.tar.gz is the filename.

Stability Guarantees

Prometheus upgrades are intended to be safe between minor versions, such as from 2.0.0 to 2.0.1, 2.1.0, or 2.3.1. Even so, as with all software it is wise to read through the changelog. Any 2.x.x version of Prometheus should suffice for this chapter.

Extract the tarball on the command line and change into its directory:1

hostname $ tar -xzf prometheus-*.linux-amd64.tar.gz
hostname $ cd prometheus-*.linux-amd64/

Now change the file called prometheus.yml to contain the following text:

global:
 scrape_interval: 10s
scrape_configs:
 - job_name: prometheus
   static_configs:
    - targets:
       - localhost:9090

1 This uses a glob for the version in case you are using a different version than I am. The star will match any text.

18 | Chapter 2: Getting Started with Prometheus YAML The Prometheus ecosystem uses Yet Another Markup Language (YAML) for its configuration files, as it is both approachable to humans and can be processed by tools. The format is sensitive to whitespace though, so make sure to copy examples exactly and use spaces rather than tabs.2

By default Prometheus runs on TCP port 9090, so this configuration instructs Prometheus to scrape itself every 10 seconds. You can now run the Prometheus binary with ./prometheus.

hostname $ ./prometheus
level=info ... msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
level=info ... build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
level=info ... host_details="(Linux 4.4.0-98-generic #121-Ubuntu..."
level=info ... fd_limits="(soft=1024, hard=1048576)"
level=info ... msg="Start listening for connections" address=0.0.0.0:9090
level=info ... msg="Starting TSDB ..."
level=info ... msg="TSDB started"
level=info ... msg="Loading configuration file" filename=prometheus.yml
level=info ... msg="Server is ready to receive web requests."

As you can see, Prometheus logs various useful information at startup, including its exact version and details of the machine it is running on. Now you can access the Prometheus UI in your browser at http://localhost:9090/, which will look like Figure 2-2.

Figure 2-2. The Prometheus expression browser

2 You may wonder why Prometheus doesn’t use JSON. JSON has its own issues such as being picky about commas, and unlike YAML does not support comments. As JSON is a subset of YAML, you can use JSON instead if you really want to.

This is the expression browser from which you can run PromQL queries. There are also several other pages in the UI to help you understand what Prometheus is doing, such as the Targets page under the Status tab, which looks like Figure 2-3.

Figure 2-3. The target status page

On this page there is only a single Prometheus server in the UP state, meaning that the last scrape was successful. If there had been a problem with the last scrape, there would be a message in the Error field. Another page you should look at is the /metrics of Prometheus itself, as somewhat unsurprisingly Prometheus is itself instrumented with Prometheus metrics. These are metrics available on http://localhost:9090/metrics and are human readable as you can see in Figure 2-4.

Figure 2-4. The first part of Prometheus’s /metrics

Note that there are not just metrics from the Prometheus code itself, but also about the Go runtime and the process.
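Figure 2-4 is a screenshot, so to give a rough idea of what the page contains, /metrics is plain text in the Prometheus exposition format, along these lines. The metric names shown are real ones exposed by the Go client library, but the values here are made up and will differ on your machine:

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 38
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.4351488e+07

The exposition format itself is covered in Chapter 4.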

Using the Expression Browser

The expression browser is useful for running ad hoc queries, developing PromQL expressions, and debugging both PromQL and the data inside Prometheus.

To start, make sure you are in the Console view, enter the expression up, and click Execute. As Figure 2-5 shows, there is a single result with the value 1 and the name up{instance="localhost:9090",job="prometheus"}. up is a special metric added by Prometheus when it performs a scrape; 1 indicates that the scrape was successful. The instance is a label, indicating the target that was scraped. In this case it indicates it is the Prometheus itself.

Figure 2-5. The result of up in the expression browser

The job label here comes from the job_name in the prometheus.yml. Prometheus does not magically know that it is scraping a Prometheus and thus that it should use a job label with the value prometheus. Rather, this is a convention that requires configuration by the user. The job label indicates the type of application.

Next, you should evaluate process_resident_memory_bytes as shown in Figure 2-6.

Figure 2-6. The result of process_resident_memory_bytes in the expression browser

My Prometheus is using about 44 MB of memory. You may wonder why this metric is exposed using bytes rather than megabytes or gigabytes, which may be more readable. The answer is that what is more readable depends a lot on context, and even the same binary in different environments may have values differing by many orders of magnitude. An internal RPC may take microseconds, while polling a long-running process might take hours or even days. Thus the convention in Prometheus is to use base units such as bytes and seconds, and leave pretty printing to frontend tools like Grafana.3

Knowing the current memory usage is great and all, but what would be really nice would be to see how it has changed over time. To do so, click Graph to switch to the graph view as shown in Figure 2-7.

3 This is the same logic behind why dates and times are generally best stored in UTC, and timezone transformations only applied just before they are shown to a human.

Figure 2-7. A graph of process_resident_memory_bytes in the expression browser

Metrics like process_resident_memory_bytes are called gauges. For a gauge, it is its current absolute value that is important to you. There is a second core type of metric called the counter. Counters track how many events have happened, or the total size of all the events. Let’s look at a counter by graphing prometheus_tsdb_head_samples_appended_total, the number of samples Prometheus has ingested, which will look like Figure 2-8.

Figure 2-8. A graph of prometheus_tsdb_head_samples_appended_total in the expression browser

Counters are always increasing. This creates nice up and to the right graphs, but the values of counters are not much use on their own. What you really want to know is how fast the counter is increasing, which is where the rate function comes in. The rate function calculates how fast a counter is increasing per second. Adjust your expression to rate(prometheus_tsdb_head_samples_appended_total[1m]), which will calculate how many samples Prometheus is ingesting per second averaged over one minute and produce a result such as that shown in Figure 2-9.

Figure 2-9. A graph of rate(prometheus_tsdb_head_samples_appended_total[1m]) in the expression browser

You can see now that Prometheus is ingesting 68 or so samples per second on average. The rate function automatically handles counters resetting due to processes restarting and samples not being exactly aligned.4

4 This can lead to rates on integers returning noninteger results, but the results are correct on average. For more information, see “rate” on page 268.

Running the Node Exporter

The Node exporter exposes kernel- and machine-level metrics on Unix systems, such as Linux.5 It provides all the standard metrics such as CPU, memory, disk space, disk I/O, and network bandwidth. In addition it provides a myriad of additional metrics exposed by the kernel, from load average to motherboard temperature.

5 Windows users should use the wmi_exporter rather than the Node exporter.

What the Node exporter does not expose is metrics about individual processes, nor proxy metrics from other exporters or applications. In the Prometheus architecture you monitor applications and services directly, rather than entwining them into the machine metrics.

A prebuilt version of the Node exporter can be downloaded from https://prometheus.io/download/. Go to that page and download the latest version of Node exporter for the Linux OS with Arch amd64. Again, the tarball will need to be extracted, but no configuration file is required so it can be run directly.

hostname $ tar -xzf node_exporter-*.linux-amd64.tar.gz
hostname $ cd node_exporter-*.linux-amd64/
hostname $ ./node_exporter
INFO[0000] Starting node_exporter (version=0.16.0, branch=HEAD, revision=d42bd70f4363dced6b77d8fc311ea57b63387e4f) source="node_exporter.go:82"
INFO[0000] Build context (go=go1.9.6, user=root@a67a9bc13a69, date=20180515-15:52:42) source="node_exporter.go:83"
INFO[0000] Enabled collectors: source="node_exporter.go:90"
INFO[0000]  - arp source="node_exporter.go:97"
INFO[0000]  - bcache source="node_exporter.go:97"
... various other collectors ...
INFO[0000] Listening on :9100 source="node_exporter.go:111"

You can now access the Node exporter in your browser at http://localhost:9100/ and visit its /metrics endpoint.

To get Prometheus to monitor the Node exporter, we need to update the prometheus.yml by adding an additional scrape config:

global:
 scrape_interval: 10s
scrape_configs:
 - job_name: prometheus
   static_configs:
    - targets:
       - localhost:9090
 - job_name: node
   static_configs:
    - targets:
       - localhost:9100

Restart Prometheus to pick up the new configuration by using Ctrl-C to shut it down and then start it again.6 If you look at the Targets page you should now see two targets, both in the UP state as shown in Figure 2-10.

Figure 2-10. The target status page with Node exporter

If you now evaluate up in the Console view of the expression browser you will see two entries as shown in Figure 2-11.

Figure 2-11. There are now two results for up

6 It is possible to get Prometheus to reload the configuration file without restarting by using a SIGHUP.
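If you would like to catch configuration mistakes before restarting or reloading, the promtool utility shipped in the same tarball can check the file for you. A rough example of a run; the exact output differs between Prometheus versions:

hostname $ ./promtool check config prometheus.yml
Checking prometheus.yml
  SUCCESS: 0 rule files found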

Running the Node Exporter | 27 As you add more jobs and scrape configs, it is rare that you will want to look at the same metric from different jobs at the same time. The memory usage of a Prome‐ theus and a Node exporter are very different, for example, and extraneous data make debugging and investigation harder. You can graph the memory usage of just the Node exporters with process_resident_memory_bytes{job="node"}. The job="node" is called a label matcher, and it restricts the metrics that are returned, as you can see in Figure 2-12.

Figure 2-12. A graph of the resident memory of just the Node exporter

The process_resident_memory_bytes here is the memory used by the Node exporter process itself (as is hinted by the process prefix) and not the machine as a whole. Knowing the resource usage of the Node exporter is handy and all, but it is not why you run it.

28 | Chapter 2: Getting Started with Prometheus As a final example evaluate rate(node_network_receive_bytes_total[1m]) in Graph view to produce a graph like the one shown in Figure 2-13.

Figure 2-13. A graph of the network traffic received on several interfaces node_network_receive_bytes_total is a counter for how many bytes have been received by network interfaces. The Node exporter automatically picked up all my network interfaces, and they can be worked with as a group in PromQL. This is use‐ ful for alerting, as labels avoid the need to exhaustively list every single thing you wish to alert on.
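Jumping ahead a little, you can also combine the per-interface series into a single figure per machine with an aggregation; a sketch using PromQL aggregation (covered in detail later in the book) would be sum without(device)(rate(node_network_receive_bytes_total[1m])), which sums the per-second receive rate across all of an instance's network interfaces.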

Running the Node Exporter | 29 Alerting There are two parts to alerting. First, adding alerting rules to Prometheus, defining the logic of what constitutes an alert. Second, the Alertmanager converts firing alerts into notifications, such as emails, pages, and chat messages. Let’s start off by creating a condition that you might want to alert on. Stop the Node exporter with Ctrl-C. After the next scrape, the Targets page will show the Node exporter in the DOWN state as shown in Figure 2-14, with the error connection refused as nothing is listening on the TCP port and the HTTP request is being rejected.7

Prometheus does not include failed scrapes in its application logs, as a failed scrape is an expected occurrence that does not indicate any problems in Prometheus itself. Aside from the Targets page, scrape errors are also available in the debug logs of Prometheus, which you can enable by passing the --log.level debug command-line flag.

Figure 2-14. The target status page showing the Node exporter as down

Manually looking at the Targets page for down instances is not a good use of your time. Luckily, the up metric has your back, and when evaluating up in the Console

7 Another common error is context deadline exceeded. This indicates a timeout, usually due either to the other end being too slow or the network dropping packets.

30 | Chapter 2: Getting Started with Prometheus view of the expression browser you will see that it now has a value of 0 for the Node exporter as shown in Figure 2-15.

Figure 2-15. up is now 0 for the Node exporter

For alerting rules you need a PromQL expression that returns only the results that you wish to alert on. In this case that is easy to do using the == operator. == will filter8 away any time series whose values don’t match. If you evaluate up == 0 in the expression browser, only the down instance is returned, as Figure 2-16 shows.

Figure 2-16. Only up metrics with the value 0 are returned

8 There is also a bool mode that does not filter, covered in the section “bool modifier” on page 244.

Alerting | 31 Next, you need to add this expression in an alerting rule in Prometheus. I’m also going to jump ahead a little, and have you tell Prometheus which Alertmanager it will be talking to. You will need to expand your prometheus.yml to have the content from Example 2-1.

Example 2-1. prometheus.yml scraping two targets, loading a rule file, and talking to an Alertmanager global: scrape_interval: 10s evaluation_interval: 10s rule_files: - rules.yml alerting: alertmanagers: - static_configs: - targets: - localhost:9093 scrape_configs: - job_name: prometheus static_configs: - targets: - localhost:9090 - job_name: node static_configs: - targets: - localhost:9100

Next, create a new rules.yml file with the contents from Example 2-2, and then restart Prometheus.

Example 2-2. rules.yml with a single alerting rule groups: - name: example rules: - alert: InstanceDown expr: up == 0 for: 1m

The InstanceDown alert will be evaluated every 10 seconds in accordance with the evaluation_interval. If a series is continuously returned for at least a minute9 (the for), then the alert will be considered to be firing. Until the required minute is up,

9 Usually a for of at least 5 minutes is recommended to reduce noise and mitigate various races inherent in monitoring. I am only using a minute here, so you don’t have to wait too long when trying this out.

32 | Chapter 2: Getting Started with Prometheus the alert will be in a pending state. On the Alerts page you can click this alert and see more detail, including its labels as seen in Figure 2-17.

Figure 2-17. A firing alert on the Alerts page

Now that you have a firing alert, you need an Alertmanager to do something with it. From https://prometheus.io/download/, download the latest version of the Alertman‐ ager for the Linux OS with Arch amd64. Untar the Alertmanager and cd into its directory. hostname $ tar -xzf alertmanager-*.linux-amd64.tar.gz hostname $ cd alertmanager-*.linux-amd64/ You now need a configuration for the Alertmanager. There are a variety of ways that the Alertmanager can notify you, but most of the ones that work out of the box use commercial providers and have setup instructions that tend to change over time. Thus I am going to presume that you have an open SMTP smarthost available.10 You should base your alertmanager.yml on Example 2-3, adjusting smtp_smarthost, smtp_from, and to to match your setup and email address.

Example 2-3. alertmanager.yml sending all alerts to email

global: smtp_smarthost: 'localhost:25' smtp_from: '[email protected]' route: receiver: example-email receivers:

10 Given how email security has evolved over the past decade this is not a good assumption, but your ISP will probably have one.

Alerting | 33 - name: example-email email_configs: - to: '[email protected]'

You can now start the Alertmanager with ./alertmanager. hostname $ ./alertmanager level=info ... caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.0, branch=HEAD, revision=462c969d85cf1a473587754d55e4a3c4a2abc63c)" level=info ... caller=main.go:175 build_context="(go=go1.10.3, user=root@bec9939eb862, date=20180622-11:58:41)" level=info ... caller=cluster.go:155 component=cluster msg="setting advertise address explicitly" addr=192.168.1.13 port=9094 level=info ... caller=cluster.go:561 component=cluster msg="Waiting for gossip to settle..." interval=2s level=info ... caller=main.go:311 msg="Loading configuration file" file=alertmanager.yml level=info ... caller=main.go:387 msg=Listening address=:9093 level=info ... caller=cluster.go:586 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.00011639s level=info ... caller=cluster.go:578 component=cluster msg="gossip settled; proceeding" elapsed=10.000782554s You can now access the Alertmanager in your browser at http://localhost:9093/ where you will see your firing alert, which should look similar to Figure 2-18.

Figure 2-18. An InstanceDown alert in the Alertmanager

If everything is set up and working correctly, after a minute or two you should receive a notification from the Alertmanager in your email inbox that looks like Figure 2-19.

34 | Chapter 2: Getting Started with Prometheus Figure 2-19. An email notification for an InstanceDown alert

This basic setup has given you a small taste of what Prometheus can do. You could add more targets to the prometheus.yml and your alert would automatically work for them too. In the next chapter I am going to focus on one part of using Prometheus—adding instrumentation to your own applications.


PART II Application Monitoring

You will realise the full benefits of Prometheus when you have easy access to the metrics you added to your own applications. This section covers adding and using this instrumentation. In Chapter 3 you will learn how to add basic instrumentation, and what is beneficial instrumentation to have. In Chapter 4 I cover making the metrics from your application available to Prometheus. In Chapter 5 you will learn about one of the most powerful features of Prometheus and how to use it in instrumentation. After you have your application metrics in Prometheus, Chapter 6 will show you how you can create dashboards that group related graphs together.

CHAPTER 3 Instrumentation

The largest payoffs you will get from Prometheus are through instrumenting your own applications using direct instrumentation and a client library. Client libraries are available in a variety of languages, with official client libraries in Go, Python, Java, and Ruby. I use Python 3 here as an example, but the same general principles apply to other lan‐ guages and runtimes, although the syntax and utility methods will vary. Most modern OSes come with Python 3. In the unlikely event that you don’t already have it, download and install Python 3 from https://www.python.org/downloads/. You will also need to install the latest Python client library. This can be done with pip install prometheus_client. The instrumentation examples can be found on Git‐ Hub. A Simple Program To start things off, I have written a simple HTTP server shown in Example 3-1. If you run it with Python 3 and then visit http://localhost:8001/ in your browser, you will get a Hello World response.

Example 3-1. A simple Hello World program that also exposes Prometheus metrics import http.server from prometheus_client import start_http_server class MyHandler(http.server.BaseHTTPRequestHandler): def do_GET(self): self.send_response(200) self.end_headers()

39 self.wfile.write(b"Hello World") if __name__ == "__main__": start_http_server(8000) server = http.server.HTTPServer(('localhost', 8001), MyHandler) server.serve_forever()

The start_http_server(8000) starts up a HTTP server on port 8000 to serve met‐ rics to Prometheus. You can view these metrics at http://localhost:8000/, which will look like Figure 3-1. Which metrics are returned out of the box varies based on the platform, with Linux platforms tending to have the most metrics.

Figure 3-1. The /metrics page when the simple program runs on Linux with CPython

While occasionally you will look at a /metrics page by hand, getting the metrics into Prometheus is what you really want. Set up a Prometheus with the configuration in Example 3-2 and get it running.

Example 3-2. prometheus.yml to scrape http://localhost:8000/metrics global: scrape_interval: 10s scrape_configs: - job_name: example static_configs: - targets: - localhost:8000

If you enter the PromQL expression python_info in the expression browser at http:// localhost:9090/, you should see something like Figure 3-2.

40 | Chapter 3: Instrumentation Figure 3-2. Evaluating the expression python_info produces one result

In the rest of this chapter I will presume that you have Prometheus running and scraping your example application. You will use the expression browser as you go along to work with the metrics you create. The Counter Counters are the type of metric you will probably use in instrumentation most often. Counters track either the number or size of events. They are mainly used to track how often a particular code path is executed. Extend the above code as shown in Example 3-3 to add a new metric for how many times a Hello World was requested.

Example 3-3. REQUESTS tracks the number of Hello Worlds returned from prometheus_client import Counter

REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.') class MyHandler(http.server.BaseHTTPRequestHandler): def do_GET(self): REQUESTS.inc() self.send_response(200) self.end_headers() self.wfile.write(b"Hello World")

There are three parts here: the import, the metric definition, and the instrumenta‐ tion.

The Counter | 41 Import Python requires that you import functions and classes from other modules in order to use them. Accordingly, you must import the Counter class from the prometheus_client library at the top of the file. Definition Prometheus metrics must be defined before they are used. Here I define a counter called hello_worlds_total. It has a help string of Hello Worlds requested., which will appear on the /metrics page to help you understand what the metric means. Metrics are automatically registered with the client library in the default registry.1 You do not need to pull the metric back to the start_http_server call; in fact, how the code is instrumented is completely decoupled from the exposition. If you have a transient dependency that includes Prometheus instrumentation, it will appear on your /metrics automatically. Metrics must have unique names, and client libraries should report an error if you try to register the same metric twice. To avoid this, define your metrics at file level, not at class, function, or method level. Instrumentation Now that you have the metric object defined, you can use it. The inc method increments the counter’s value by one. Prometheus client libraries take care of all the nitty-gritty details like bookkeep‐ ing and thread-safety for you, so that is all there is to it. When you run the program, the new metric will appear on the /metrics. It will start at zero and increase by one2 every time you view the main URL of the application. You can view this in the expression browser and use the PromQL expression rate(hello_worlds_total[1m]) to see how many Hello World requests are happen‐ ing per second as Figure 3-3 shows.

1 Unfortunately not all client libraries can have this happen automatically for various technical reasons. In the Java library, for example, an extra function call is required, and depending on how you use the Go library you may also need to explicitly register metrics.

2 It may increase by two due to your browser also hitting the /favicon.ico endpoint.

42 | Chapter 3: Instrumentation Figure 3-3. A graph of Hello Worlds per second

With just two lines of code you can add a counter to any library or application. These counters are useful to track how many times errors and unexpected situations occur. While you probably don’t want to alert every single time there is an error, knowing how errors are trending over time is useful for debugging. But this is not restricted to errors. Knowing which are the most popular features and code paths of your applica‐ tion allows you to optimise how you allocate your development efforts. Counting Exceptions Client libraries provide not just core functionality, but also utilities and methods for common use cases. One of these in Python is the ability to count exceptions. You don’t have to write your own instrumentation using a try…except; instead you can take advantage of the count_exceptions context manager and decorator as shown in Example 3-4.

The Counter | 43 Example 3-4. EXCEPTIONS counts the number of exceptions using a context manager import random from prometheus_client import Counter

REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.') EXCEPTIONS = Counter('hello_world_exceptions_total', 'Exceptions serving Hello World.') class MyHandler(http.server.BaseHTTPRequestHandler): def do_GET(self): REQUESTS.inc() with EXCEPTIONS.count_exceptions(): if random.random() < 0.2: raise Exception self.send_response(200) self.end_headers() self.wfile.write(b"Hello World") count_exceptions will take care of passing the exception up by raising it, so it does not interfere with application logic. You can see the rate of exceptions with rate(hello_world_exceptions_total[1m]). The number of exceptions isn’t that useful without knowing how many requests are going through. You can calculate the more useful ratio of exceptions with rate(hello_world_exceptions_total[1m]) / rate(hello_worlds_total[1m]) in the expression browser. This is how to generally expose ratios: expose two coun‐ ters, then rate and divide them in PromQL.

You may notice gaps in the exception ratio graph for periods when there are no requests. This is because you are dividing by zero, which in floating-point math results in a NaN, or Not a Number. Returning a zero would be incorrect as the exception ratio is not zero, it is undefined.

You can also use count_exceptions as a function decorator: EXCEPTIONS = Counter('hello_world_exceptions_total', 'Exceptions serving Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler): @EXCEPTIONS.count_exceptions() def do_GET(self): ...
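For comparison, here is roughly what the context manager and decorator save you from writing by hand. This is a sketch of a manual equivalent, not code from the client library:

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.inc()
        try:
            if random.random() < 0.2:
                raise Exception
        except Exception:
            EXCEPTIONS.inc()
            raise  # re-raise so the application behaves exactly as before
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")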

44 | Chapter 3: Instrumentation Counting Size Prometheus uses 64-bit floating-point numbers for values so you are not limited to incrementing counters by one. You can in fact increment counters by any non- negative number. This allows you to track the number of records processed, bytes served, or sales in Euros as shown in Example 3-5.

Example 3-5. SALES tracks sale value in Euros import random from prometheus_client import Counter

REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.') SALES = Counter('hello_world_sales_euro_total', 'Euros made serving Hello World.') class MyHandler(http.server.BaseHTTPRequestHandler): def do_GET(self): REQUESTS.inc() euros = random.random() SALES.inc(euros) self.send_response(200) self.end_headers() self.wfile.write("Hello World for {} euros.".format(euros).encode())

You can see the rate of sales in Euros per second in the expression browser using the expression rate(hello_world_sales_euro_total[1m]), the same as for integer counters.

Attempting to increase a counter by a negative number is consid‐ ered to be a programming error on your part, and will cause an exception to be raised. It is important for PromQL that counters only ever increase, so that rate and friends don’t misinterpret the decrease as counters resetting to zero when an application restarts. This also means there’s no need to persist counter state across runs of an applica‐ tion, or reset counters on every scrape. This allows multiple Prom‐ etheus servers to scrape the same application without affecting each other. The Gauge Gauges are a snapshot of some current state. While for counters how fast it is increas‐ ing is what you care about, for gauges it is the actual value of the gauge. Accordingly, the values can go both up and down.

The Gauge | 45 Examples of gauges include:

• the number of items in a queue • memory usage of a cache • number of active threads • the last time a record was processed • average requests per second in the last minute3

Using Gauges Gauges have three main methods you can use: inc,4 dec, and set. Similar to the methods on counters, inc and dec default to changing a gauge’s value by one. You can pass an argument with a different value to change by if you want. Example 3-6 shows how gauges can be used to track the number of calls in progress and determine when the last one was completed.

Example 3-6. INPROGRESS and LAST track the number of calls in progress and when the last one was completed import time from prometheus_client import Gauge

INPROGRESS = Gauge('hello_worlds_inprogress', 'Number of Hello Worlds in progress.') LAST = Gauge('hello_world_last_time_seconds', 'The last time a Hello World was served.') class MyHandler(http.server.BaseHTTPRequestHandler): def do_GET(self): INPROGRESS.inc() self.send_response(200) self.end_headers() self.wfile.write(b"Hello World") LAST.set(time.time()) INPROGRESS.dec()

These metrics can be used directly in the expression browser without any additional functions. For example, hello_world_last_time_seconds can be used to determine when the last Hello World was served. The main use case for such a metric is detect‐

3 While this is a gauge, it is best exposed using a counter. You can convert a requests over time counter to a gauge in PromQL with the rate function.

4 Unlike counters, gauges can decrease, so it is fine to pass negative numbers to a gauge’s inc method.

46 | Chapter 3: Instrumentation ing if it has been too long since a request was handled. The PromQL expression time() - hello_world_last_time_seconds will tell you how many seconds it is since the last request. These are both very common use cases, so utility functions are also provided for them as you can see in Example 3-7. track_inprogress has the advantage of being both shorter and taking care of correctly handling exceptions for you. set_to_ current_time is a little less useful in Python, as time.time() returns Unix time in seconds;5 but in other languages’ client libraries, the set_to_current_time equiva‐ lents make usage simpler and clearer.

Example 3-7. The same example as Example 3-6 but using the gauge utilities

from prometheus_client import Gauge

INPROGRESS = Gauge('hello_worlds_inprogress', 'Number of Hello Worlds in progress.') LAST = Gauge('hello_world_last_time_seconds', 'The last time a Hello World was served.')

class MyHandler(http.server.BaseHTTPRequestHandler): @INPROGRESS.track_inprogress() def do_GET(self): self.send_response(200) self.end_headers() self.wfile.write(b"Hello World") LAST.set_to_current_time()

Metric Suffixes You may have noticed that the example counter metrics all ended with _total, while there is no such suffix on gauges. This is a convention within Prometheus that makes it easier to identify what type of metric you are working with. In addition to _total, the _count, _sum, and _bucket suffixes also have other mean‐ ings and should not be used as suffixes in your metric names to avoid confusion. It is also strongly recommended that you include the unit of your metric at the end of its name. For example, a counter for bytes processed might be myapp_requests_ processed_bytes_total.

5 Seconds are the base unit for time, and thus preferred in Prometheus to other time units such as minutes, hours, days, milliseconds, microseconds, and nanoseconds.

The Gauge | 47 Callbacks To track the size or number of items in a cache, you should generally add inc and dec calls in each function where items are added or removed from the cache. With more complex logic this can get a bit tricky to get right and maintain as the code changes. The good news is that client libraries offer a shortcut to implement this, without having to use the interfaces that writing an exporter requires. In Python, gauges have a set_function method, which allows you to specify a function to be called at exposition time. Your function must return a floating point value for the metric when called, as demonstrated in Example 3-8. However, this strays a bit outside of direct instrumentation, so you will need to consider thread safety and may need to use mutexes.
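As a slightly more realistic sketch than Example 3-8 below, here is one way you might track the size of a cache from a callback; the cache, its lock, and the metric name are all hypothetical and not part of the client library:

import threading
from prometheus_client import Gauge

cache = {}
cache_lock = threading.Lock()

CACHE_ITEMS = Gauge('mycache_items', 'Number of items in the cache.')

def cache_size():
    # Called at exposition time; take the lock so a consistent size is read.
    with cache_lock:
        return len(cache)

CACHE_ITEMS.set_function(cache_size)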

Example 3-8. A trivial example of set_function to have a metric return the current time6 import time from prometheus_client import Gauge

TIME = Gauge('time_seconds', 'The current time.') TIME.set_function(lambda: time.time()) The Summary Knowing how long your application took to respond to a request or the latency of a backend are vital metrics when you are trying to understand the performance of your systems. Other instrumentation systems offer some form of Timer metric, but Prom‐ etheus views things more generically. Just as counters can be incremented by values other than one, you may wish to track things about events other than their latency. For example, in addition to backend latency you may also wish to track the size of the responses you get back. The primary method of a summary is observe, to which you pass the size of the event. This must be a nonnegative value. Using time.time() you can track latency as shown in Example 3-9.

6 In practice, there is not much need for such a metric. The timestamp PromQL function will return the timestamp of a sample, and the time PromQL function the query evaluation time.

48 | Chapter 3: Instrumentation Example 3-9. LATENCY tracks how long the Hello World handler takes to run import time from prometheus_client import Summary

LATENCY = Summary('hello_world_latency_seconds', 'Time for a request Hello World.') class MyHandler(http.server.BaseHTTPRequestHandler): def do_GET(self): start = time.time() self.send_response(200) self.end_headers() self.wfile.write(b"Hello World") LATENCY.observe(time.time() - start)

If you look at the /metrics you will see that the hello_world_latency_seconds metric has two time series: hello_world_latency_seconds_count and hello_world_latency_seconds_sum. hello_world_latency_seconds_count is the number of observe calls that have been made, so rate(hello_world_latency_seconds_count[1m]) in the expression browser would return the per-second rate of Hello World requests. hello_world_latency_seconds_sum is the sum of the values passed to observe, so rate(hello_world_latency_seconds_sum[1m]) is the amount of time spent responding to requests per second. If you divide these two expressions you get the average latency over the last minute. The full expression for average latency would be: rate(hello_world_latency_seconds_sum[1m]) / rate(hello_world_latency_seconds_count[1m]) Let’s take an example. Say in the last minute you had three requests that took 2, 4, and 9 seconds. The count would be 3 and the sum would be 15 seconds, so the aver‐ age latency is 5 seconds. rate is per second rather than per minute, so you in princi‐ ple need to divide both sides by 60, but that cancels out.

Even though the hello_world_latency_seconds metric is using seconds as its unit in line with Prometheus conventions, this does not mean it only has second precision. Prometheus uses 64-bit floating-point values that can handle metrics ranging from days to nanoseconds. The preceding example takes about a quarter of a millisecond on my machine, for example.

The Summary | 49 As summarys are usually used to track latency, there is a time context manager and function decorator that makes this simpler, as you can see in Example 3-10. It also handles exceptions and time going backwards for you.7

Example 3-10. LATENCY tracking latency using the time function decorator from prometheus_client import Summary

LATENCY = Summary('hello_world_latency_seconds', 'Time for a request Hello World.') class MyHandler(http.server.BaseHTTPRequestHandler): @LATENCY.time() def do_GET(self): self.send_response(200) self.end_headers() self.wfile.write(b"Hello World")
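The time method can also be used as a context manager when only part of a function should be timed. A minimal sketch, where process_request stands in for your own code:

with LATENCY.time():
    process_request()  # hypothetical work being timed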

Summary metrics may also include quantiles, although the Python client does not currently support these client-side quantiles. These should generally be avoided as you cannot do math such as averages on top of quantiles, preventing you from aggre‐ gating client-side quantiles from across the instances of your service. In addition, client-side quantiles are expensive compared to other instrumentation in terms of CPU usage (a factor of a hundred slower is not unusual). While the benefits of instru‐ mentation generally greatly outweigh their resource costs, this may not be the case for quantiles. The Histogram A summary will provide the average latency, but what if you want a quantile? Quan‐ tiles tell you that a certain proportion of events had a size below a given value. For example, the 0.95 quantile being 300 ms means that 95% of requests took less than 300 ms. Quantiles are useful when reasoning about actual end-user experience. If a user’s browser makes 20 concurrent requests to your application, then it is the slowest of them that determines the user-visible latency. In this case the 95th percentile captures that latency.

7 System time can go backwards if the date is manually set in the kernel, or if a daemon is trying to keep things in sync with the Network Time Protocol (NTP).

50 | Chapter 3: Instrumentation Quantiles and Percentiles The 95th percentile is the 0.95 quantile. As Prometheus prefers base units, it always uses quantiles, in the same way that ratios are pre‐ ferred to percentages.

The instrumentation for histograms is the same as for summarys. The observe method allows you to do manual observations, and the time context manager and function decorator allow for easier timings as shown in Example 3-11.

Example 3-11. LATENCY histogram tracking latency using the time function decorator from prometheus_client import Histogram

LATENCY = Histogram('hello_world_latency_seconds', 'Time for a request Hello World.') class MyHandler(http.server.BaseHTTPRequestHandler): @LATENCY.time() def do_GET(self): self.send_response(200) self.end_headers() self.wfile.write(b"Hello World")

This will produce a set of time series with the name hello_world_latency_sec onds_bucket, which are a set of counters. A histogram has a set of buckets, such as 1 ms, 10 ms, and 25 ms, that track the number of events that fall into each bucket. The histogram_quantile PromQL function can calculate a quantile from the buck‐ ets. For example, the 0.95 quantile (95th percentile) would be: histogram_quantile(0.95, rate(hello_world_latency_seconds_bucket[1m])) The rate is needed as the buckets’ time series are counters. Buckets The default buckets cover a range of latencies from 1 ms to 10 s. This is intended to capture the typical range of latencies for a web application. But you can also override them and provide your own buckets when defining metrics. This might be done if the defaults are not suitable for your use case, or to add an explicit bucket for latency quantiles mentioned in your Service-Level Agreements (SLAs). In order to help you detect typos, the provided buckets must be sorted: LATENCY = Histogram('hello_world_latency_seconds', 'Time for a request Hello World.', buckets=[0.0001, 0.0002, 0.0005, 0.001, 0.01, 0.1])

The Histogram | 51 If you want linear or exponential buckets, you can use Python list comprehensions. Client libraries for languages that do not have an equivalent to list comprehensions may include utility functions for these: buckets=[0.1 * x for x in range(1, 10)] # Linear buckets=[0.1 * 2**x for x in range(1, 10)] # Exponential

Cumulative Histograms If you have looked at a /metrics for a histogram, you probably noticed that the buckets aren't just a count of events that fall into them. The buckets also include a count of events in all the smaller buckets, all the way up to the +Inf bucket, which is the total number of events. This is known as a cumulative histogram, and is why the bucket label is called le, standing for less than or equal to. This is in addition to buckets being counters, so Prometheus histograms are cumulative in two different ways. The reason they're cumulative is that if the number of buckets becomes a performance problem, some extraneous buckets8 can be dropped using metric_relabel_configs (see “metric_relabel_configs” on page 148) in Prometheus while still allowing quantiles to be calculated. There is an example of this in Example 8-24.
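To make the cumulative nature concrete, the bucket time series on the /metrics page might look something like the following; the bucket boundaries and values here are illustrative rather than the client's defaults:

hello_world_latency_seconds_bucket{le="0.025"} 8.0
hello_world_latency_seconds_bucket{le="0.05"} 13.0
hello_world_latency_seconds_bucket{le="0.1"} 17.0
hello_world_latency_seconds_bucket{le="+Inf"} 20.0
hello_world_latency_seconds_count 20.0
hello_world_latency_seconds_sum 1.4

Each bucket includes all the events in the buckets below it, and the +Inf bucket matches the _count.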

You may be wondering how many buckets you should have for sufficient accuracy. I recommend sticking to somewhere around ten. This may seem like a small number, but buckets are not free, as each is an extra time series to be stored.9 Fundamentally, a metrics-based system like Prometheus is not going to provide 100% accurate quan‐ tiles. For that you would need to calculate the quantiles from a log-based system. But what Prometheus provides is good enough for most practical alerting and debugging purposes. The best way to think of buckets (and metrics generally) is that while they may not always be perfect, they generally give you sufficient information to determine the next step when you are debugging. So, for example, if Prometheus indicates that the 0.95 quantile jumped from 300 ms to 350 ms, but it was actually from 305 ms to 355 ms that doesn’t matter that much. You still know that there was a big jump, and the next step in your investigation would be the same either way.

8 The +Inf bucket is required, and should never be dropped. 9 Particularly if the histogram has labels.

52 | Chapter 3: Instrumentation SLAs and Quantiles Latency SLAs will often be expressed as 95th percentile latency is at most 500 ms. There is a nonobvious trap here, in that you may focus on the wrong number. Calculating the 95th percentile accurately is tricky, requiring what may be significant computing resources if you want to get it perfect. Calculating the proportion of requests that took more than 500 ms is easy though: you only need two counters, one for all requests and another for requests that took up to 500 ms. By having a 500 ms bucket in your histogram you can accurately calculate the proportion of requests that take up to 500 ms using my_latency_seconds_bucket{le="0.5"} / ignoring(le) my_latency_seconds_bucket{le="+Inf"} and so determine if you are meeting your SLA. The rest of the buckets will still give you a good estimate of the 95th percentile latency.

Quantiles are limited in that once you calculate them you cannot do any further math on them. It is not statistically correct to add, subtract, or average them, for example. This affects not just what you might attempt in PromQL, but also how you reason about a system while debugging it. A frontend may report a latency increase in the 0.95 quantile, yet the backend that caused it may show no such increase (or even a decrease!). This can be very counterintuitive, especially when you have been woken up in the middle of the night to debug a problem. Averages, on the other hand, do not have this problem, they can be added and subtracted.10 For example, if you see a 20 ms increase in latency in a frontend due to one of its backends, you will see a matching latency increase of around 20 ms in the backend. But there is no such guarantee with quantiles. So while quantiles are good for capturing end-user experience, they are tricky to debug with. I recommend debugging latency issues primarily with averages rather than quantiles. Averages work the way you think they do, and once you have narrowed down the subsystem to blame for a latency increase using averages, you can switch back to quantiles if appropriate. To this end the histogram also includes _sum and _count time series. Just like with a summary, you can calculate average latency with:

10 However, it is not correct to average a set of averages. For example, if you had 3 events with an average of 5 and 4 events with an average of 6, the overall average would not be (5 + 6) / 2 = 5.5, but rather (3 * 5 + 4 * 6) / (3 + 4) = 5.57.

The Histogram | 53 rate(hello_world_latency_seconds_sum[1m]) / rate(hello_world_latency_seconds_count[1m]) Unit Testing Instrumentation Unit tests are a good way to avoid accidentally breaking your code as it changes over time. You should approach unit testing instrumentation the same way you approach unit tests for logs. Just as you would probably not test a debug-level log statement, neither should you test the majority of metrics that you sprinkle across your code base. You would usually only unit test log statements for transaction logs and sometimes request logs.11 Similarly it usually makes sense to unit test metrics where the metric is a key part of your application or library. For example, if you are writing an RPC library, it would make sense to have at least some basic tests to make sure the key requests, latency, and error metrics are working. Without tests, some of the noncritical metrics you might use for degugging may not work, in my experience this will be the case for around 5% of debug metrics. Requir‐ ing all metrics to be unit tested would add friction to instrumentation, so rather than ending up with 20 metrics of which 19 are usable, you might instead end up with only 5 tested metrics. It would no longer be a case of adding two lines of code to add a metric. When it comes to using metrics for debugging and deep perfomance analysis, a wider breadth of metrics is always useful. The Python client offers a get_sample_value function that will effectively scrape the registry and look for a time series. You can use get_sample_value as shown in Example 3-12 to test counter instrumentation. It is the increase of a counter that you care about, so you should compare the value of the counter before and after, rather than the absolute value. This will work even if other tests have also caused the counter to be incremented.

Example 3-12. Unit testing a counter in Python

import unittest from prometheus_client import Counter, REGISTRY

FOOS = Counter('foos_total', 'The number of foo calls.')

def foo(): FOOS.inc()

11 Categories of logs were mentioned in “Logging” on page 8.

54 | Chapter 3: Instrumentation class TestFoo(unittest.TestCase): def test_counter_inc(self): before = REGISTRY.get_sample_value('foos_total') foo() after = REGISTRY.get_sample_value('foos_total') self.assertEqual(1, after - before) Approaching Instrumentation Now that you know how to use instrumentation, it is important to know where and how much you should apply it. What Should I Instrument? When instrumenting, you will usually be looking to either instrument services or libraries.

Service instrumentation Broadly speaking, there are three types of services, each with their own key metrics: online-serving systems, offline-serving systems, and batch jobs. Online-serving systems are those where either a human or another service is waiting on a response. These include web servers and databases. The key metrics to include in service instrumentation are the request rate, latency, and error rate. Having request rate, latency, and error rate metrics is sometimes called the RED method, for Requests, Errors, and Duration. These metrics are not just useful to you from the server side, but also the client side. If you notice that the client is seeing more latency than the server, you might have network issues or an overloaded client.

When instrumenting duration, don’t be tempted to exclude fail‐ ures. If you were to include only successes, then you might not notice high latency caused by many slow but failing requests.
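Pulling the instrumentation from earlier in this chapter together, a minimal RED-style sketch for an online-serving handler might look like the following; the metric names and do_work are illustrative rather than taken from the book:

from prometheus_client import Counter, Histogram

REQUESTS = Counter('myapp_requests_total', 'Requests handled.')
ERRORS = Counter('myapp_errors_total', 'Requests that failed.')
LATENCY = Histogram('myapp_request_latency_seconds', 'Time taken to handle a request.')

@LATENCY.time()  # duration, including failed requests as per the preceding note
def handle_request(request):
    REQUESTS.inc()  # request rate
    try:
        return do_work(request)  # hypothetical business logic
    except Exception:
        ERRORS.inc()  # error rate
        raise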

Offline-serving systems do not have someone waiting on them. They usually batch up work and have multiple stages in a pipeline with queues between them. A log pro‐ cessing system is an example of an offline-serving system. For each stage you should have metrics for the amount of queued work, how much work is in progress, how fast you are processing items, and errors that occur. These metrics are also known as the USE method for Utilisation, Saturation, and Errors. Utilisation is how full your ser‐ vice is, saturation is the amount of queued work, and errors is self-explanatory. If you are using batches, then it is useful to have metrics both for the batches, and the indi‐ vidual items.
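As a sketch of what that might look like for a single pipeline stage, with process standing in for the real work, work_queue behaving like a queue.Queue, and the metric names purely illustrative:

from prometheus_client import Counter, Gauge

QUEUED = Gauge('pipeline_stage_queued_items', 'Items waiting for this stage.')
IN_PROGRESS = Gauge('pipeline_stage_inprogress_items', 'Items currently being processed.')
PROCESSED = Counter('pipeline_stage_processed_items_total', 'Items processed by this stage.')
ERRORS = Counter('pipeline_stage_errors_total', 'Items this stage failed to process.')

def run_stage(work_queue):
    while True:
        QUEUED.set(work_queue.qsize())  # saturation: queued work
        item = work_queue.get()
        with IN_PROGRESS.track_inprogress():  # utilisation: work in progress
            try:
                process(item)  # hypothetical work for this stage
                PROCESSED.inc()
            except Exception:
                ERRORS.inc()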

Approaching Instrumentation | 55 Batch jobs are the third type of service, and they are similar to offline-serving sys‐ tems. However, batch jobs run on a regular schedule, whereas offline-serving systems run continuously. As batch jobs are not always running, scraping them doesn’t work too well, so techniques such as the Pushgateway (discussed in “Pushgateway” on page 71) are used. At the end of a batch job you should record how long it took to run, how long each stage of the job took, and the time at which the job last succeeded. You can add alerts if the job hasn’t succeeded recently enough, allowing you to tolerate indi‐ vidual batch job run failures.
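Jumping ahead to the Pushgateway for a moment, the end of a batch job could record those metrics along the lines of the following sketch; it assumes a Pushgateway listening on localhost:9091, and the metric names and run_the_batch_job are illustrative:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge('mybatch_duration_seconds', 'How long the last batch job run took.', registry=registry)
last_success = Gauge('mybatch_last_success_seconds', 'When the batch job last completed successfully.', registry=registry)

with duration.time():  # sets the gauge to how long the block took
    run_the_batch_job()  # hypothetical work
last_success.set_to_current_time()
push_to_gateway('localhost:9091', job='mybatch', registry=registry)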

Idempotency for Batch Jobs Idempotency is the property that if you do something more than once, it has the same result as if it were only done once. This is a useful property for batch jobs as it means handling a failed job is simply a matter of retrying, so you don’t have to worry as much about individual failures. To achieve this you should avoid passing which items of work (such as the previous day’s data) a batch job should work on. Instead you should have the batch job infer that and continue on from where it left off. This has the additional benefit that you can have your batch jobs retry themselves. For example, you might instead have a daily batch job run a few times per day, so that even if there is a transient failure the next run will take care of it. Alert thresholds can be increased accordingly, as you will need to manually intervene less often.

Library instrumentation Services are what you care about at a high level. Within each of your services there are libraries that you can think of as mini services. The majority will be online- serving subsystems, which is to say synchronous function calls, and benefit from the same metrics of requests, latency, and errors. For a cache you would want these met‐ rics both for the cache overall and the cache misses that then need to calculate the result or request it from a backend.

56 | Chapter 3: Instrumentation Total and Failures, Not Success and Failures With metrics for failures and total it is easy to calculate the failure ratio by division. With success and failures this is trickier,12 as you first need to calculate the total. Similarly for caches it is best to have either hits and total requests, or failures and total requests. All of total, hits, and misses works fine too.
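A sketch of a cache instrumented with total requests and misses, per the advice above; the cache here is a plain dict purely for illustration:

from prometheus_client import Counter

CACHE_REQUESTS = Counter('mycache_requests_total', 'Cache lookups.')
CACHE_MISSES = Counter('mycache_misses_total', 'Cache lookups that missed.')

cache = {}

def cached_get(key, compute):
    CACHE_REQUESTS.inc()
    if key not in cache:
        CACHE_MISSES.inc()
        cache[key] = compute(key)  # fall back to the backend
    return cache[key]

The miss ratio is then rate(mycache_misses_total[1m]) / rate(mycache_requests_total[1m]) in PromQL.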

It is beneficial to add metrics for any errors that occur and anywhere that you have logging. You might only keep your debug logs for a few days due to their volume, but with a metric you can still have a good idea of the frequency of that log line over time. Thread and worker pools should be instrumented similarly to offline-serving systems. You will want to have metrics for the queue size, active threads, any limit on the number of threads, and errors encountered. Background maintenance tasks that run no more than a few times an hour are effectively batch jobs, and you should have similar metrics for these tasks. How Much Should I Instrument? While Prometheus is extremely efficient, there are limits to how many metrics it can handle. At some point the operational and resource costs outweigh the benefits for certain instrumentation strategies. The good news is that most of the time you don’t need to worry about this. Let’s say that you had a Prometheus that could handle ten million metrics13 and a thousand application instances. A single new metric on each of these instances would use 0.01% of your resources, making it effectively free. This means you are free to add individual metrics by hand where it is useful. Where you need to be careful is when things get industrial. If you automatically add a metric for the duration of every function, that can add up fast (it is classic profiling after all). If you have metrics broken out by request type and HTTP URL,14 all the potential combinations can easily take up a significant chunk of your resources. Histogram buckets expand that again. A metric with a cardinality of a hundred on each instance would take up 1% of your Prometheus server’s resources, which is a less clear win and certainly not free. I discuss this further in “Cardinality” on page 93.

12 You should not try dividing the failures by the successes.

13 This was roughly the performance limit of Prometheus 1.x.

14 Chapter 5 looks at labels, which are a powerful feature of Prometheus that make this possible.

Approaching Instrumentation | 57 It is common for the ten biggest metrics in a Prometheus to constitute over half of its resource usage. If you are trying to manage the resource usage of your Prometheus, you will get a better return for your efforts by focusing on the ten biggest metrics. As a rule of thumb, a simple service like a cache might have a hundred metrics in total, while a complex and well-instrumented service might have a thousand. What Should I Name My Metrics? The naming of metrics is more of an art than a science. There are some simple rules you can follow to avoid the more obvious pitfalls, and also general guidelines to con‐ struct your metric names. The overall structure of a metric name is generally library_name_unit_suffix.

Characters Prometheus metric names should start with a letter, and can be followed with any number of letters, numbers, and underscores. While [a-zA-Z_:][a-zA-Z0-9_:]* is a regular expression for valid metric names for Prometheus, you should avoid some of the valid values. You should not use colons in instrumentation as they are reserved for user use in recording rules, as discussed in “Naming of Recording Rules” on page 284. Underscores at the start of metric names are reserved for internal Prometheus use.

snake_case The convention with Prometheus is to use snake case for metric names; that is, each component of the name should be lowercase and separated by an underscore.

Metric suffixes The _total, the _count, _sum, and _bucket suffixes are used by the counter, sum‐ mary, and histogram metrics. Aside from always having a _total suffix on counters, you should avoid putting these suffixes on the end of your metric names to avoid confusion.

Units You should prefer using unprefixed base units such as seconds, bytes, and ratios.15 This is because Prometheus uses seconds in functions such as time, and it avoids ugliness such as kilomicroseconds.

15 As a general rule, ratios typically go from 0…1 and percentages go from 0…100.

58 | Chapter 3: Instrumentation Using only one unit avoids confusion as to whether this particular metric is seconds or milliseconds.16 To avoid this confusion you should always include the unit of your metric in the name. For example, mymetric_seconds_total for a counter with a unit of seconds. There is not always an obvious unit for a metric, so don’t worry if your metric name is missing a unit. You should avoid count as a unit, as aside from clashing with sum‐ marys and histograms, most metrics are counts of something so it doesn’t tell you anything. Similarly with total.

Name The meat of a metric name is, um, the name. The name of a metric should give some‐ one who has no knowledge of the subsystem the metric is from a good idea of what it means. requests is not very insightful, http_requests is better, and http_requests_authenticated is better again. The metric description can expand further, but often the user will only have the metric name to go on. As you can see from the preceding examples, a name may have several underscore- separated components. Try to have the same prefix on related metrics, so that it’s eas‐ ier to understand their relationship. queue_size and queue_limit are more useful than size_queue and limit_queue. You might even have items and items_limit. Names generally go from less to more specific as you go from left to right. Do not put what should be labels (covered in Chapter 5) in metric names. When implementing direct instrumentation you should never procedurally generate metrics or metric names.

You should avoid putting the names of labels that a metric has into a metric’s name because it will be incorrect when that label is aggregated away with PromQL.

Library As metrics names are effectively a global namespace, it is important to both try to avoid collisions between libraries and indicate where a metric is coming from. A met‐ ric name is ultimately pointing you to a specific line of code in a specific file in a spe‐ cific library. A library could be a stereotypical library that you have pulled in as a dependency, a subsystem in your application, or even the main function of the appli‐ cation itself.

16 At one point Prometheus itself was using seconds, milliseconds, microseconds, and nanoseconds for metrics.

Approaching Instrumentation | 59 You should provide sufficient distinction in the library part of the metric name to avoid confusion, but there’s no need to include complete organisation names and paths in source control. There is a balance between succinctness and full qualification. For example, Cassandra is a well-established application so it would be appropriate for it to use just cassandra as the library part of its metric names. On the other hand, using db for my company’s internal database connection pool library would be unwise, as database libraries and database connection pool libraries are both quite common. You might even have several inside the same application. robustperception_db_pool or rp_db_pool would be better choices there. Some library names are already established. The process library exposes process-level metrics such as CPU and memory usage, and is standardised across client libraries. Thus you should not expose additional metrics with this prefix. Client libraries also expose metrics relating to their runtime. Python metrics use python, Java Virtual Machine (JVM) metrics use jvm, and Go uses go. Combining these steps produces metric names like go_memstats_heap_inuse_bytes. This is from the go_memstats library, memory statistics from the Go runtime. heap_inuse indicates the metric is related to the amount of heap being used, and bytes tells us that it is measured in bytes. From just the name you can tell that it is the amount of the heap memory17 that Go is currently using. While the meaning of a metric will not always be this obvious from the name, it is something to strive for.

You should not prefix all metric names coming from an applica‐ tion with the name of the application. process_cpu_seconds_ total is process_cpu_seconds_total no matter which application exposes it. The way to distinguish metrics from different applica‐ tions is with target labels, not metric names. See “Target Labels” on page 139.

Now that you have instrumented your application, let’s look at how you can expose those metrics to Prometheus.

17 The heap is memory of your process that is dynamically allocated. It is used for memory allocation by func‐ tions such as malloc.

60 | Chapter 3: Instrumentation CHAPTER 4 Exposition

In Chapter 3 I mainly focused on adding instrumentation to your code. But all the instrumentation in the world isn’t much use if the metrics produced don’t end up in your monitoring system. The process of making metrics available to Prometheus is known as exposition. Exposition to Prometheus is done over HTTP. Usually you expose metrics under the /metrics path, and the request is handled for you by a client library. Prometheus uses a human-readable text format, so you also have the option of producing the exposition format by hand. You may choose to do this if there is no suitable library for your language, but it is recommended you use a library as it’ll get all the little details like escaping correct. Exposition is typically done either in your main function or another top-level func‐ tion and only needs to be configured once per application. Metrics are usually registered with the default registry when you define them. If one of the libraries you are depending on has Prometheus instrumentation, the metrics will be in the default registry and you will gain the benefit of that additional instru‐ mentation without having to do anything. Some users prefer to explicitly pass a regis‐ try all the way down from the main function, so you’d have to rely on every library between your application’s main function and the Prometheus instrumentation being aware of the instrumentation. This presumes that every library in the dependency chain cares about instrumentation and agrees on the choice of instrumentation libraries. This design allows for instrumentation for Prometheus metrics with no exposition at all. In that case, aside from still paying the (tiny) resource cost of instrumentation, there is no impact on your application. If you are the one writing a library you can add instrumentation for your users using Prometheus without requiring extra effort

61 for your users who don’t monitor. To better support this use case, the instrumenta‐ tion parts of client libraries try to minimise their dependencies. Let’s take a look at exposition in some of the popular client libraries. I am going to presume here that you know how to install the client libraries and any other required dependencies. Python You have already seen start_http_server in Chapter 3. It starts up a background thread with a HTTP server that only serves Prometheus metrics. from prometheus_client import start_http_server

if __name__ == '__main__': start_http_server(8000) # Your code goes here. start_http_server is very convenient to get up and running quickly. But it is likely that you already have a HTTP server in your application that you would like your metrics to be served from. In Python there are various ways this can be done depending on which frameworks you are using.

Example 4-1. Exposition using WSGI in Python from prometheus_client import make_wsgi_app from wsgiref.simple_server import make_server metrics_app = make_wsgi_app() def my_app(environ, start_fn): if environ['PATH_INFO'] == '/metrics': return metrics_app(environ, start_fn) start_fn('200 OK', []) return [b'Hello World'] if __name__ == '__main__':

62 | Chapter 4: Exposition httpd = make_server('', 8000, my_app) httpd.serve_forever()

Does It Have to Be /metrics? /metrics is the HTTP path where Prometheus metrics are served by convention, but it’s just a convention, so you can put the metrics on other paths. For example, if /metrics is already in use in your application or you want to put administrative endpoints under a /admin/ prefix. Even if it is on another path, it is still common to refer to such an endpoint as your /metrics.

Twisted Twisted is a Python event-driven network engine. It supports WSGI so you can plug in make_wsgi_app, as shown in Example 4-2.

Example 4-2. Exposition using Twisted in Python from prometheus_client import make_wsgi_app from twisted.web.server import Site from twisted.web.wsgi import WSGIResource from twisted.web.resource import Resource from twisted.internet import reactor metrics_resource = WSGIResource( reactor, reactor.getThreadPool(), make_wsgi_app()) class HelloWorld(Resource): isLeaf = False def render_GET(self, request): return b"Hello World" root = HelloWorld() root.putChild(b'metrics', metrics_resource) reactor.listenTCP(8000, Site(root)) reactor.run()

Python | 63 Multiprocess with Gunicorn Prometheus assumes that the applications it is monitoring are long-lived and multi‐ threaded. But this can fall apart a little with runtimes such as CPython.1 CPython is effectively limited to one processor core due to the Global Interpreter Lock (GIL). To work around this, some users spread the workload across multiple processes using a tool such as Gunicorn. If you were to use the Python client library in the usual fashion, each worker would track its own metrics. Each time Prometheus went to scrape the application, it would randomly get the metrics from only one of the workers, which would be only a frac‐ tion of the information and would also have issues such as counters appearing to be going backwards. Workers can also be relatively short-lived. The solution to this problem offered by the Python client is to have each worker track its own metrics. At exposition time all the metrics of all the workers are combined in a way that provides the semantics you would get from a multithreaded application. There are some limitations to the approach used, the process_ metrics and custom collectors will not be exposed, and the Pushgateway cannot be used.2 Using Gunicorn, you need to let the client library know when a worker process exits.3 This is done in a config file like the one in Example 4-3.

Example 4-3. Gunicorn config.py to handle worker processes exiting from prometheus_client import multiprocess def child_exit(server, worker): multiprocess.mark_process_dead(worker.pid)

You will also need an application to serve the metrics. Gunicorn uses WSGI, so you can use make_wsgi_app. You must create a custom registry containing only a MultiProcessCollector for exposition, so that it does not include both the multiprocess metrics and metrics from the local default registry (Example 4-4).

Example 4-4. Gunicorn application in app.py from prometheus_client import multiprocess, make_wsgi_app, CollectorRegistry from prometheus_client import Counter, Gauge

1 CPython is the official name of the standard Python implementation. Do not confuse it with Cython, which can be used to write C extensions in Python.

2 The Pushgateway is not suitable for this use case, so this is not a problem in practice.

3 child_exit was added in Gunicorn version 19.7 released in March 2017.

64 | Chapter 4: Exposition REQUESTS = Counter("http_requests_total", "HTTP requests") IN_PROGRESS = Gauge("http_requests_inprogress", "Inprogress HTTP requests", multiprocess_mode='livesum')

@IN_PROGRESS.track_inprogress() def app(environ, start_fn): REQUESTS.inc() if environ['PATH_INFO'] == '/metrics': registry = CollectorRegistry() multiprocess.MultiProcessCollector(registry) metrics_app = make_wsgi_app(registry) return metrics_app(environ, start_fn) start_fn('200 OK', []) return [b'Hello World']

As you can see in Example 4-4, counters work normally, as do summarys and histograms. For gauges there is additional optional configuration using multiprocess_mode. You can configure the gauge based on how you intended to use the gauge, as follows: all The default, it returns a time series from each process, whether it is alive or dead. This allows you to aggregate the series as you wish in PromQL. They will be distinguished by a pid label. liveall Returns a time series from each alive process. livesum Returns a single time series that is the sum of the value from each alive process. You would use this for things like in-progress requests or resource usage across all processes. A process might have aborted with a nonzero value, so dead processes are excluded. max Returns a single time series that is the maximum of the value from each alive or dead process. This is useful if you want to track the last time something happened such as a request being processed, which could have been in a process that is now dead. min Returns a single time series that is the minimum of the value from each alive or dead process. There is a small bit of setup before you can run Gunicorn as shown in Example 4-5. You must set an environment variable called prometheus_multiproc_dir. This points to an empty directory the client library uses for tracking metrics. Before starting the application, you should always wipe this directory to handle any potential changes to your instrumentation.

Example 4-5. Preparing the environment before starting Gunicorn with two workers

hostname $ export prometheus_multiproc_dir=$PWD/multiproc
hostname $ rm -rf $prometheus_multiproc_dir
hostname $ mkdir -p $prometheus_multiproc_dir
hostname $ gunicorn -w 2 -c config.py app:app
[2018-01-07 19:05:30 +0000] [9634] [INFO] Starting gunicorn 19.7.1
[2018-01-07 19:05:30 +0000] [9634] [INFO] Listening at: http://127.0.0.1:8000 (9634)
[2018-01-07 19:05:30 +0000] [9634] [INFO] Using worker: sync
[2018-01-07 19:05:30 +0000] [9639] [INFO] Booting worker with pid: 9639
[2018-01-07 19:05:30 +0000] [9640] [INFO] Booting worker with pid: 9640

When you look at the /metrics page you will see the two defined metrics, but python_info and the process_ metrics will not be there.
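Example 4-4 only uses livesum, but the other modes are declared in the same way. As a rough sketch (the metric names here are hypothetical, not from the preceding examples):

from prometheus_client import Gauge

# With 'all', every worker exposes its own series, distinguished by a pid label,
# which you can then aggregate in PromQL however you wish.
QUEUE_SIZE = Gauge("my_queue_size", "Items queued in this worker",
        multiprocess_mode='all')

# With 'max', only the largest value across live and dead workers is exposed,
# which suits tracking the last time something happened.
LAST_HANDLED = Gauge("my_last_request_seconds",
        "Last time a request was handled", multiprocess_mode='max')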

Multiprocess Mode Under the Covers
Performance is vital for client libraries. This rules out designs where worker processes send UDP packets, or any other use of the network, due to the system call overhead it would involve. What is needed is something that is about as fast as normal instrumentation, which means something that is as fast as local process memory but can be accessed by other processes.
The approach taken is to use mmap. Each process has its own set of mmaped files where it tracks its own metrics. At exposition time all the files are read and the metrics combined. There is no locking between the instrumentation writing to the files and the exposition reading them; to ensure isolation, metric values are aligned in memory and a two-phase write is used when adding a new time series.
Counters (including summarys and histograms) must not go backwards, so files relating to counters are kept after a worker exits. Whether this makes sense for a gauge depends on how it is used. For a metric like in-progress requests, you only want it from live processes, whereas for the last time a request was processed, you want the maximum across both live and dead processes. This can be configured on a per-gauge basis.

Each process creates several files in prometheus_multiproc_dir that must be read at exposition time. If your workers stop and start a lot, this can make exposition slow when you have thousands of files. It is not safe to delete individual files as that could cause counters to incorrectly go backwards, but you can either try to reduce the churn (for example, by increasing or removing the limit on the number of requests workers handle before exiting4), or regularly restart the application and wipe the files.

These steps are for Gunicorn. The same approach also works with other Python multiprocess setups, such as using the multiprocessing module.
Go
In Go, http.Handler is the standard interface for providing HTTP handlers, and promhttp.Handler provides that interface for the Go client library. You should place the code in Example 4-6 in a file called example.go.

Example 4-6. A simple Go program demonstrating instrumentation and exposition

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    requests = promauto.NewCounter(
        prometheus.CounterOpts{
            Name: "hello_worlds_total",
            Help: "Hello Worlds requested.",
        })
)

func handler(w http.ResponseWriter, r *http.Request) {
    requests.Inc()
    w.Write([]byte("Hello World"))
}

4 Gunicorn’s --max-requests flag is one example of such a limit.

func main() {
    http.HandleFunc("/", handler)
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8000", nil))
}

You can fetch dependencies and run this code in the usual way:

hostname $ go get -d -u github.com/prometheus/client_golang/prometheus
hostname $ go run example.go

This example uses promauto, which will automatically register your metric with the default registry. If you do not wish to do so, you can use prometheus.NewCounter instead and then use MustRegister in an init function:

func init() {
    prometheus.MustRegister(requests)
}

This is a bit more fragile, as it is easy for you to create and use the metric but forget the MustRegister call.
Java
The Java client library is also known as the simpleclient. It replaced the original client, which was developed before many of the current practices and guidelines around how to write a client library were established. The Java client should be used for any instrumentation of languages running on a Java Virtual Machine (JVM).
HTTPServer
Similar to start_http_server in Python, the HTTPServer class in the Java client gives you an easy way to get up and running (Example 4-7).

Example 4-7. A simple Java program demonstrating instrumentation and exposition

import io.prometheus.client.Counter;
import io.prometheus.client.hotspot.DefaultExports;
import io.prometheus.client.exporter.HTTPServer;

public class Example {
  private static final Counter myCounter = Counter.build()
      .name("my_counter_total")
      .help("An example counter.").register();

  public static void main(String[] args) throws Exception {
    DefaultExports.initialize();
    HTTPServer server = new HTTPServer(8000);
    while (true) {
      myCounter.inc();
      Thread.sleep(1000);
    }
  }
}

You should generally have Java metrics as class static fields, so that they are only registered once. The call to DefaultExports.initialize is needed for the various process and jvm metrics to work. You should generally call it once in all of your Java applications, such as in the main function. However, DefaultExports.initialize is idempotent and thread-safe, so additional calls are harmless. In order to run the code in Example 4-7 you will need the simpleclient dependencies. If you are using Maven, Example 4-8 is what the dependencies in your pom.xml should look like.

Example 4-8. pom.xml dependencies for Example 4-7

<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient</artifactId>
  <version>0.3.0</version>
</dependency>
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_hotspot</artifactId>
  <version>0.3.0</version>
</dependency>
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_httpserver</artifactId>
  <version>0.3.0</version>
</dependency>

Servlet
Many Java and JVM frameworks support using subclasses of HttpServlet in their HTTP servers and middleware. Jetty is one such server, and you can see how to use the Java client's MetricsServlet in Example 4-9.

Example 4-9. A Java program demonstrating exposition using MetricsServlet and Jetty

import io.prometheus.client.Counter;
import io.prometheus.client.exporter.MetricsServlet;
import io.prometheus.client.hotspot.DefaultExports;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.ServletException;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlet.ServletHolder;
import java.io.IOException;

public class Example {
  static class ExampleServlet extends HttpServlet {
    private static final Counter requests = Counter.build()
        .name("hello_worlds_total")
        .help("Hello Worlds requested.").register();

    @Override
    protected void doGet(final HttpServletRequest req,
        final HttpServletResponse resp) throws ServletException, IOException {
      requests.inc();
      resp.getWriter().println("Hello World");
    }
  }

  public static void main(String[] args) throws Exception {
    DefaultExports.initialize();

    Server server = new Server(8000);
    ServletContextHandler context = new ServletContextHandler();
    context.setContextPath("/");
    server.setHandler(context);
    context.addServlet(new ServletHolder(new ExampleServlet()), "/");
    context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");

    server.start();
    server.join();
  }
}

You will also need to specify the Java client as a dependency. If you are using Maven, this will look like Example 4-10.

Example 4-10. pom.xml dependencies for Example 4-9

<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient</artifactId>
  <version>0.3.0</version>
</dependency>
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_hotspot</artifactId>
  <version>0.3.0</version>
</dependency>
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_servlet</artifactId>
  <version>0.3.0</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-servlet</artifactId>
  <version>8.2.0.v20160908</version>
</dependency>

Pushgateway
Batch jobs are typically run on a regular schedule, such as hourly or daily. They start up, do some work, and then exit. As they are not continuously running, Prometheus can't exactly scrape them.5 This is where the Pushgateway comes in.
The Pushgateway6 is a metrics cache for service-level batch jobs. Its architecture is shown in Figure 4-1. It remembers only the last push that you make to it for each batch job. You use it by having your batch jobs push their metrics just before they exit. Prometheus scrapes these metrics from your Pushgateway and you can then alert and graph them. Usually you run a Pushgateway beside a Prometheus.

Figure 4-1. The Pushgateway architecture

5 Though for batch jobs that take more than a few minutes to run, it may also make sense to scrape them normally over HTTP to help debug performance issues.

6 You may see it referenced as pgw in informal contexts.

A service-level batch job is one where there isn't really an instance label to apply to it. That is to say it applies to all of one of your services, rather than being innately tied to one machine or process instance.7 If you don't particularly care where a batch job runs but do care that it happens (even if it happens to currently be set up to run via cron on one machine), it is a service-level batch job. Examples include a per-datacenter batch job to check for bad machines, or one that performs garbage collection across a whole service.

The Pushgateway is not a way to convert Prometheus from pull to push. If, for example, there are several pushes between one Prometheus scrape and the next, the Pushgateway will only return the last push for that batch job. This is discussed further in "Networks and Authentication" on page 342.

You can download the Pushgateway from the Prometheus download page. It is an exporter that runs by default on port 9091, and Prometheus should be set up to scrape it. However, you should also provide the honor_labels: true setting in the scrape config as shown in Example 4-11. This is because the metrics you push to the Pushgateway should not have an instance label, and you do not want the Pushgateway's own instance target label to end up on the metrics when Prometheus scrapes them.8 honor_labels is discussed in "Label Clashes and honor_labels" on page 151.

Example 4-11. prometheus.yml scrape config for a local Pushgateway

scrape_configs:
 - job_name: pushgateway
   honor_labels: true
   static_configs:
    - targets:
       - localhost:9091

You can use client libraries to push to the Pushgateway. Example 4-12 shows the structure you would use for a Python batch job. A custom registry is created so that only the specific metrics you choose are pushed. The duration of the batch job is always pushed,9 and only if the job is successful is the time it ended at pushed.

7 For batch jobs such as database backups that are tied to a machine's lifecycle, the node exporter textfile collector is a better choice. This is discussed in "Textfile Collector" on page 122.

8 The Pushgateway explicitly exports empty instance labels for metrics without an instance label. Combined with honor_labels: true, this results in Prometheus not applying an instance label to these metrics. Usually, empty labels and missing labels are the same thing in Prometheus, but this is the exception.

9 Just like summarys and histograms, gauges have a time function decorator and context manager. It is intended only for use in batch jobs.

There are three different ways you can write to the Pushgateway. In Python these are the push_to_gateway, pushadd_to_gateway, and delete_from_gateway functions.

push
    Any existing metrics for this job are removed and the pushed metrics added. This uses the PUT HTTP method under the covers.

pushadd
    The pushed metrics override existing metrics with the same metric names for this job. Any metrics that previously existed with different metric names remain unchanged. This uses the POST HTTP method under the covers.

delete
    The metrics for this job are removed. This uses the DELETE HTTP method under the covers.

As Example 4-12 is using pushadd_to_gateway, the value of my_job_duration_seconds will always get replaced. However, my_job_last_success_seconds will only get replaced if there are no exceptions; it is added to the registry and then pushed.

Example 4-12. Instrumenting a batch job and pushing its metrics to a Pushgateway

from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

registry = CollectorRegistry()
duration = Gauge('my_job_duration_seconds',
        'Duration of my batch job in seconds', registry=registry)
try:
    with duration.time():
        # Your code here.
        pass

    # This only runs if there wasn't an exception.
    g = Gauge('my_job_last_success_seconds',
            'Last time my batch job successfully finished', registry=registry)
    g.set_to_current_time()
finally:
    pushadd_to_gateway('localhost:9091', job='batch', registry=registry)

You can see pushed data on the status page, as Figure 4-2 shows. An additional metric push_time_seconds has been added by the Pushgateway because Prometheus will always use the time at which it scrapes as the timestamp of the Pushgateway metrics. push_time_seconds gives you a way to know the actual time the data was last pushed.

Figure 4-2. The Pushgateway status page showing metrics from a push

You might have noticed that the push is referred to as a group. You can provide labels in addition to the job label when pushing, and all of these labels are known as the grouping key. In Python this can be provided with the grouping_key keyword argument. You would use this if a batch job was sharded or split up somehow. For example, if you have 30 database shards and each had its own batch job, you might distinguish them with a shard label.
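For instance, a minimal sketch (building on Example 4-12, with a hypothetical shard label) of pushing with a grouping key and later deleting that group:

from prometheus_client import (CollectorRegistry, pushadd_to_gateway,
        delete_from_gateway)

registry = CollectorRegistry()
# ... create and set your metrics against this registry as in Example 4-12 ...

# Push under the grouping key {job="batch", shard="3"}.
pushadd_to_gateway('localhost:9091', job='batch',
        grouping_key={'shard': '3'}, registry=registry)

# When this shard's batch job is decommissioned, remove its group.
delete_from_gateway('localhost:9091', job='batch',
        grouping_key={'shard': '3'})

The same grouping key must be supplied when deleting as was used when pushing, otherwise a different group is targeted.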

Once pushed, groups stay forever in the Pushgateway. You should avoid using grouping keys that vary from one batch job run to the next, as this will make the metrics difficult to work with and cause performance issues. When decommissioning a batch job, don't forget to delete its metrics from the Pushgateway.
Bridges
Prometheus client libraries are not limited to outputting metrics in the Prometheus format. There is a separation of concerns between instrumentation and exposition so that you can process the metrics in any way you like. For example, the Go, Python, and Java clients each include a Graphite bridge.
A bridge takes metrics output from the client library registry and outputs it to something other than Prometheus. So the Graphite bridge will convert the metrics into a

form that Graphite can understand10 and write them out to Graphite, as shown in Example 4-13.

Example 4-13. Using the Python GraphiteBridge to push to Graphite every 10 seconds

import time
from prometheus_client.bridge.graphite import GraphiteBridge

gb = GraphiteBridge(['graphite.your.org', 2003])
gb.start(10)
while True:
    time.sleep(1)

This works because the registry has a method that allows you to get a snapshot of all the current metrics. This is CollectorRegistry.collect in Python, CollectorRegistry.metricFamilySamples in Java, and Registry.Gather in Go. This is the method that HTTP exposition uses, and you can use it too. For example, you could use this method to feed data into another non-Prometheus instrumentation library.11
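As a rough illustration in Python (this sketch is mine, not part of the Graphite bridge), you can walk the default registry's snapshot and print every sample:

from prometheus_client import REGISTRY, Counter

c = Counter('example_total', 'An example counter')
c.inc()

# Iterate over a snapshot of every metric family currently registered; this is
# the same data that HTTP exposition and the bridges work from.
for family in REGISTRY.collect():
    for sample in family.samples:
        print(sample)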

If you ever want to hook into direct instrumentation you should instead use the metrics output by a registry. Wanting to know every time a counter is incremented does not make sense in terms of a metrics-based monitoring system. However, the count of increments is already provided for you by CollectorRegistry.collect and works for custom collectors.
Parsers
In addition to a client library's registry allowing you to access metric output, the Go12 and Python clients also feature a parser for the Prometheus exposition format. Example 4-14 only prints the samples, but you could feed Prometheus metrics into other monitoring systems or into your local tooling.

10 The labels are flattened into the metric name. Tag (i.e., label) support for Graphite was only recently added in 1.1.0.

11 This works both ways. Other instrumentation libraries with an equivalent feature can have their metrics fed into a Prometheus client library. This is discussed in “Custom Collectors” on page 201.

12 The Go client’s parser is the reference implementation.

Example 4-14. Parsing the Prometheus text format with the Python client

from prometheus_client.parser import text_string_to_metric_families

for family in text_string_to_metric_families(u"counter_total 1.0\n"):
    for sample in family.samples:
        print("Name: {0} Labels: {1} Value: {2}".format(*sample))

DataDog, InfluxDB, Sensu, and Metricbeat13 are some of the monitoring systems that have components that can parse the text format. Using one of these monitoring systems, you could take advantage of the Prometheus ecosystem without ever running the Prometheus server. I personally believe that this is a good thing, as there is currently a lot of duplication of effort between the various monitoring systems. Each of them has to write similar code to support the myriad of custom metric outputs provided by the most commonly used software. A project called OpenMetrics aims to work from the Prometheus exposition format and standardise it. Developers from various monitoring systems, including myself, are involved with the OpenMetrics effort.
Exposition Format
The Prometheus text exposition format is relatively easy to produce and parse. Although you should almost always rely on a client library to handle it for you, there are cases such as with the Node exporter textfile collector (discussed in "Textfile Collector" on page 122) where you may have to produce it yourself.
I will be showing you version 0.0.4 of the text format, which has the content type header:

Content-Type: text/plain; version=0.0.4; charset=utf-8

In the simplest cases, the text format is just the name of the metric followed by a 64-bit floating-point number. Each line is terminated with a line-feed character (\n):

my_counter_total 14
a_small_gauge 8.3e-96
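If you do have to produce the format yourself, for example for the textfile collector, it can be as simple as the following sketch (the filenames and metrics here are hypothetical):

import os

metrics = {"my_counter_total": 14, "a_small_gauge": 8.3e-96}

# Write one metric per line, each terminated by a line feed.
with open("my_job.prom.tmp", "w") as f:
    for name, value in metrics.items():
        f.write("{0} {1}\n".format(name, value))

# Rename into place so a half-written file is never read.
os.rename("my_job.prom.tmp", "my_job.prom")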

In Prometheus 1.0, a protocol buffer format was also supported as it was slightly (2-3%) more efficient. Only a literal handful of exporters ever exposed just the protocol buffer format. The Prometheus 2.0 storage and ingestion performance improvements are tied to the text format, so it is now the only format.

13 Part of the Elasticsearch stack.

Metric Types
More complete output would include the HELP and TYPE of the metrics as shown in Example 4-15. HELP is a description of what the metric is, and should not generally change from scrape to scrape. TYPE is one of counter, gauge, summary, histogram, or untyped. untyped is used when you do not know the type of the metric, and is the default if no type is specified. Prometheus currently throws away HELP and TYPE, but they will be made available to tools like Grafana in the future to aid in writing queries. It is invalid for you to have a duplicate metric, so make sure all the time series that belong to a metric are grouped together.

Example 4-15. Exposition format for a gauge, counter, summary, and histogram

# HELP example_gauge An example gauge
# TYPE example_gauge gauge
example_gauge -0.7
# HELP my_counter_total An example counter
# TYPE my_counter_total counter
my_counter_total 14
# HELP my_summary An example summary
# TYPE my_summary summary
my_summary_sum 0.6
my_summary_count 19
# HELP latency_seconds An example histogram
# TYPE latency_seconds histogram
latency_seconds_bucket{le="0.1"} 7
latency_seconds_bucket{le="0.2"} 18
latency_seconds_bucket{le="0.4"} 24
latency_seconds_bucket{le="0.8"} 28
latency_seconds_bucket{le="+Inf"} 29
latency_seconds_sum 0.6
latency_seconds_count 29

For histograms, the _count must match the +Inf bucket, and the +Inf bucket must always be present. Buckets should not change from scrape to scrape, as this will cause problems for PromQL's histogram_quantile function. The le labels have floating-point values and must be sorted. You should note how the histogram buckets are cumulative, as le stands for less than or equal to.
Labels
The histogram in the preceding example also shows how labels are represented. Multiple labels are separated by commas, and it is okay to have a trailing comma before the closing brace.

The ordering of labels does not matter, but it is a good idea to have the ordering consistent from scrape to scrape. This will make writing your unit tests easier, and consistent ordering ensures the best ingestion performance in Prometheus.

# HELP my_summary An example summary
# TYPE my_summary summary
my_summary_sum{foo="bar",baz="quu"} 1.8
my_summary_count{foo="bar",baz="quu"} 453
my_summary_sum{foo="blaa",baz=""} 0
my_summary_count{foo="blaa",baz="quu"} 0

It is possible to have a metric with no time series, if no children have been initialised, as discussed in "Child" on page 85.

# HELP a_counter_total An example counter
# TYPE a_counter_total counter

Escaping
The format is encoded in UTF-8, and full UTF-814 is permitted in both HELP and label values. Thus you need to use backslashes to escape characters that would otherwise cause issues. For HELP this is line feeds and backslashes. For label values this is line feeds, backslashes, and double quotes.15 The format ignores extra whitespace.

# HELP escaping A newline \\n and backslash \\ escaped
# TYPE escaping gauge
escaping{foo="newline \n backslash \\ double quote \" "} 1

Timestamps
It is possible to specify a timestamp on a time series. It is an integer value in milliseconds since the Unix epoch,16 and it goes after the value. Timestamps in the exposition format should generally be avoided as they are only applicable in certain limited use cases (such as federation) and come with limitations. Timestamps for scrapes are usually applied automatically by Prometheus. It is not defined what happens if you specify multiple lines with the same name and labels but different timestamps.

# HELP foo I'm trapped in a client library
# TYPE foo gauge
foo 1 15100992000000

14 The null byte is a valid UTF-8 character.

15 Yes, there are two different sets of escaping rules within the format.

16 Midnight January 1st 1970 UTC.

check metrics
Prometheus 2.0 uses a custom parser for efficiency. So just because a /metrics page can be scraped doesn't mean that the metrics are compliant with the format.
Promtool is a utility included with Prometheus that among other things can verify that your metric output is valid and perform lint checks:

curl http://localhost:8000/metrics | promtool check-metrics

Common mistakes include forgetting the line feed on the last line, using carriage return and line feed rather than just line feed,17 and invalid metric or label names. As a brief reminder, metric and label names cannot contain hyphens, and cannot start with a number.
You now have a working knowledge of the text format. The full specification can be found in the official Prometheus documentation.
I have mentioned labels a few times now. In the following chapter you'll learn what they are in detail.

17 \r\n is the line ending on Windows, while on Unix, \n is used. Prometheus has a Unix heritage, so it uses \n.


CHAPTER 5 Labels

Labels are a key part of Prometheus, and one of the things that make it powerful. In this chapter you will learn what labels are, where they come from, and how you can add them to your own metrics.
What Are Labels?
Labels are key-value pairs associated with time series that, in addition to the metric name, uniquely identify them. That's a bit of a mouthful, so let's look at an example. If you had a metric for HTTP requests that was broken out by path, you might try putting the path in the metric name, such as is common in Graphite:1

http_requests_login_total
http_requests_logout_total
http_requests_adduser_total
http_requests_comment_total
http_requests_view_total

These would be difficult for you to work with in PromQL. In order to calculate the total requests you would either need to know every possible HTTP path or do some form of potentially expensive matching across all metric names. Accordingly, this is an antipattern you should avoid. Instead, to handle this common use case, Prometheus has labels. In the preceding case you might use a path label:

http_requests_total{path="/login"}
http_requests_total{path="/logout"}
http_requests_total{path="/adduser"}

1 Graphite would use periods rather than underscores.

http_requests_total{path="/comment"}
http_requests_total{path="/view"}

You can then work with the http_requests_total metric with all its path labels as one. With PromQL you could get an overall aggregated request rate, the rate of just one of the paths, or what proportion each request is of the whole.
You can also have metrics with more than one label. There is no ordering on labels, so you can aggregate by any given label while ignoring the others, or even aggregate by several of the labels at once.

2 Or prefixes to metric names.

3 When using the Pushgateway, target labels may come from the application, as each Pushgateway group is in a way a monitoring target. Depending on who you ask, this is either a feature or a limitation of push-based monitoring.

Instrumentation
Let's extend Example 3-3 to use a label. In Example 5-1 you can see labelnames=['path'] in the definition,4 indicating that your metric has a single label called path. When using the metric in instrumentation you must add a call to the labels method with an argument for the label value.5

Example 5-1. A Python application using a label for a counter metric

import http.server
from prometheus_client import start_http_server, Counter

REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.',
        labelnames=['path'])

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.labels(self.path).inc()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()

If you visit http://localhost:8001/ and http://localhost:8001/foo, then on the /metrics page at http://localhost:8000/metrics you will see the time series for each of the paths:

# HELP hello_worlds_total Hello Worlds requested.
# TYPE hello_worlds_total counter
hello_worlds_total{path="/favicon.ico"} 6.0
hello_worlds_total{path="/"} 4.0
hello_worlds_total{path="/foo"} 1.0

Label names are limited in terms of what characters you can use. They should begin with a letter (a-z or A-Z) and be followed with letters, numbers, and underscores. This is the same as for metric names, except without colons.
Unlike metric names, label names are not generally namespaced. However, you should take care when defining instrumentation labels to avoid labels likely to be

4 In Python be careful not to do labelnames='path', which is the same as labelnames=['p', 'a', 't', 'h']. This is one of the more common gotchas in Python.

5 In Java the method is also labels and the Go equivalent is WithLabelValues.

used as target labels, such as env, cluster, service, team, zone, and region. I also recommend avoiding type as a label name, as it is very generic. Snake case is the convention for label names.
Label values can be any UTF-8 characters. You can also have an empty label value, but this can be a little confusing in the Prometheus server as at first glance it looks the same as not having that label.

Reserved Labels and __name__
Labels can start with underscores, but you should avoid such labels. Label names beginning with a double underscore __ are reserved. Internally in Prometheus the metric name is just another label called __name__.6 The expression up is syntactic sugar for {__name__="up"}, and there are also special semantics with PromQL operators as discussed in "Vector Matching" on page 245.

Metric
As you may have noticed by now, the word metric is a bit ambiguous and means different things depending on context. It could refer to a metric family, a child, or a time series:

# HELP latency_seconds Latency in seconds.
# TYPE latency_seconds summary
latency_seconds_sum{path="/foo"} 1.0
latency_seconds_count{path="/foo"} 2.0
latency_seconds_sum{path="/bar"} 3.0
latency_seconds_count{path="/bar"} 4.0

latency_seconds_sum{path="/bar"} is a time series, distinguished by a name and labels. This is what PromQL works with.
latency_seconds{path="/bar"} is a child, and is what the return value of labels() in the Python client represents. For a summary it contains both the _sum and _count time series with those labels.
latency_seconds is a metric family. It is only the metric name and its associated type. This is the metric definition when using a client library.
For a gauge metric with no labels, the metric family, child, and time series are the same.

6 This is different from the __name__ in the Python code examples.

Multiple Labels
You can specify any number of labels when defining a metric, and then the values in the same order in the labels call (Example 5-2).

Example 5-2. hello_worlds_total has path and method labels

REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.',
        labelnames=['path', 'method'])

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.labels(self.path, self.command).inc()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
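As an aside, the Python client also accepts the label values as keyword arguments, which the next paragraph touches on. A minimal sketch, reusing the metric defined in Example 5-2:

# The keyword names must match the labelnames in the metric definition;
# positional and keyword forms cannot be mixed in a single call.
REQUESTS.labels(path='/foo', method='GET').inc()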

Python and Go also allow you to supply a map with both label names and values, though the label names must still match those in the metric definitions. This can make it harder to mix up the order of your arguments, but if that is a real risk, then you may have too many labels.
It is not possible to have varying label names for a metric, and client libraries will prevent it. When working with metrics it is important that you know what labels you have in play, so you must know your label names in advance when doing direct instrumentation. If you don't know your labels, you probably want a logs-based monitoring tool for that specific use case instead.
Child
The value returned to you by the labels method in Python is called a child. You can store this child for later use, which saves you from having to look it up at each instrumentation event, saving time in performance-critical code that is called hundreds of thousands of times a second. In benchmarks with the Java client, I have found that with no contention the child lookup took 30 ns while the actual increment took 12 ns.7 A common pattern is, when an object refers to only one child of a metric, to call labels once and then store that in the object, as shown in Example 5-3.

7 For this reason you should also resist the temptation to write a facade or wrapper around a Prometheus client library that takes the metric name as an argument, as that would also incur this lookup cost. It is cheaper, simpler, and better semantically to have a file-level variable track the address of the metric object rather than having to look it up all the time.

Example 5-3. A simple Python cache that stores the child in each named cache

from prometheus_client import Counter

FETCHES = Counter('cache_fetches_total', 'Fetches from the cache.',
        labelnames=['cache'])

class MyCache(object):
    def __init__(self, name):
        self._fetches = FETCHES.labels(name)
        self._cache = {}

    def fetch(self, item):
        self._fetches.inc()
        return self._cache.get(item)

    def store(self, item, value):
        self._cache[item] = value

Another place where you will run into children is in initialising them. Children only appear on /metrics after you call labels.8 This can cause issues in PromQL as time series that appear and disappear can be very challenging to work with. Accordingly, where possible you should initialise children at startup, such as in Example 5-4, although if you follow the pattern in Example 5-3, you get this for free.

Example 5-4. Initialising children of a metric at application startup

from prometheus_client import Counter

REQUESTS = Counter('http_requests_total', 'HTTP requests.',
        labelnames=['path'])
REQUESTS.labels('/foo')
REQUESTS.labels('/bar')

When using Python decorators, you may also use labels without immediately calling a method on the return value, as shown in Example 5-5.

Example 5-5. Using a decorator with labels in Python

from prometheus_client import Summary

LATENCY = Summary('http_requests_latency_seconds', 'HTTP request latency.',
        labelnames=['path'])

8 This happens automatically for metrics with no labels.

foo = LATENCY.labels('/foo')

@foo.time()
def foo_handler(params):
    pass

Client libraries usually offer methods to remove children from a metric. You should only consider using these for unit tests. From a PromQL semantic standpoint, once a child exists it should continue to exist until the process dies, otherwise functions such as rate may return undesirable results. These methods also invalidate previous values returned from labels.
Aggregating
Now that your instrumentation is bursting with labels, let's actually use them in PromQL. I'll be going into more detail in Chapter 14, but want to give you a taste of the power of labels now.
In Example 5-2, hello_worlds_total has path and method labels. As hello_worlds_total is a counter, you must first use the rate function. Table 5-1 is one possible output, showing results for two application instances with different HTTP paths and methods.

Table 5-1. Output of rate(hello_worlds_total[5m])

{job="myjob",instance="localhost:1234",path="/foo",method="GET"}   1
{job="myjob",instance="localhost:1234",path="/foo",method="POST"}  2
{job="myjob",instance="localhost:1234",path="/bar",method="GET"}   4
{job="myjob",instance="localhost:5678",path="/foo",method="GET"}   8
{job="myjob",instance="localhost:5678",path="/foo",method="POST"}  16
{job="myjob",instance="localhost:5678",path="/bar",method="GET"}   32

This can be a little hard for you to consume, especially if you have far more time series than in this simple example. Let's start by aggregating away the path label. This is done using the sum aggregation as you want to add samples together. The without clause indicates what label you want to remove. This gives you the expression sum without(path)(rate(hello_worlds_total[5m])) that produces the output in Table 5-2.

Table 5-2. Output of sum without(path)(rate(hello_worlds_total[5m]))

{job="myjob",instance="localhost:1234",method="GET"}   5
{job="myjob",instance="localhost:1234",method="POST"}  2
{job="myjob",instance="localhost:5678",method="GET"}   40
{job="myjob",instance="localhost:5678",method="POST"}  16

It is not uncommon for you to have tens or hundreds of instances, and in my experience, looking at individual instances on dashboards breaks down somewhere around three to five. You can expand the without clause to include the instance label, which gives the output shown in Table 5-3. As you would expect from the values in Table 5-1, 1 + 4 + 8 + 32 = 45 requests per second for GET and 2 + 16 = 18 requests per second for POST.

Table 5-3. Output of sum without(path, instance)(rate(hello_worlds_total[5m]))

{job="myjob",method="GET"}   45
{job="myjob",method="POST"}  18

Labels are not ordered in any way, so just as you can remove path you can also remove method as seen in Table 5-4.

Table 5-4. Output of sum without(method, instance)(rate(hello_worlds_total[5m]))

{job="myjob",path="/foo"}  27
{job="myjob",path="/bar"}  36

There is also a by clause that keeps only the labels you specify. without is preferred because if there are additional labels such as env or region across all of a job, they will not be lost. This helps when you are sharing your rules with others.
Label Patterns
Prometheus only supports 64-bit floating-point numbers as time series values, not any other data types such as strings. But label values are strings, and there are certain limited use cases where it is okay to (ab)use them without getting too far into logs-based monitoring.
Enum
The first common case for strings is enums. For example, you may have a resource that could be in exactly one of the states of STARTING, RUNNING, STOPPING, or TERMINATED.

You could expose this as a gauge with STARTING being 0, RUNNING being 1, STOPPING being 2, and TERMINATED being 3.9 But this is a bit tricky to work with in PromQL. The numbers 0–3 are a bit opaque, and there is not a single expression you can write to tell you what proportion of the time your resource spent STARTING.
The solution to this is to add a label for the state to the gauge, with each potential state becoming a child. When exposing a boolean value in Prometheus, you should use 1 for true and 0 for false. Accordingly, one of the children will have the value 1 and all the others 0, which would produce metrics like those in Example 5-6.

Example 5-6. An enum example; the blaa resource is in the RUNNING state

# HELP resource_state The current state of resources.
# TYPE resource_state gauge
resource_state{resource_state="STARTING",resource="blaa"} 0
resource_state{resource_state="RUNNING",resource="blaa"} 1
resource_state{resource_state="STOPPING",resource="blaa"} 0
resource_state{resource_state="TERMINATED",resource="blaa"} 0

As the 0s are always present, the PromQL expression avg_over_time(resource_state[1h]) would give you the proportion of time spent in each state. You could also aggregate by resource_state using sum without(resource)(resource_state) to see how many resources are in each state. To produce such metrics you could use set on a gauge, but that would bring with it race conditions. A scrape might see a 1 on either zero or two of the states, depending on when exactly it happened. You need some isolation so that the gauge isn’t exposed in the middle of an update. The solution to this is to use a custom collector, which will be discussed further in “Custom Collectors” on page 201. To give you an idea of how to go about this, you can find a basic implementation in Example 5-7. In reality you would usually add code like this into an existing class rather than having a standalone class.10

Example 5-7. A custom collector for a gauge used as an enum

from threading import Lock
from prometheus_client.core import GaugeMetricFamily, REGISTRY

class StateMetric(object):
    def __init__(self):

9 Which is how an enum in a language like C works.

10 It is likely that future versions of the client libraries will offer you utilities to make working with enums easier. OpenMetrics for example currently plans on having a state set type, of which enums are a special case.

        self._resource_states = {}
        self._STATES = ["STARTING", "RUNNING", "STOPPING", "TERMINATED",]
        self._mutex = Lock()

    def set_state(self, resource, state):
        with self._mutex:
            self._resource_states[resource] = state

    def collect(self):
        family = GaugeMetricFamily("resource_state",
                "The current state of resources.",
                labels=["resource_state", "resource"])
        with self._mutex:
            for resource, state in self._resource_states.items():
                for s in self._STATES:
                    family.add_metric([s, resource], 1 if s == state else 0)
        yield family

sm = StateMetric()
REGISTRY.register(sm)

# Use the StateMetric.
sm.set_state("blaa", "RUNNING")

Enum gauges are normal gauges that follow all the usual gauge semantics, so no special metric suffix is needed.
Note that there are limits to this technique that you should be aware of. If your number of states combined with the number of other labels gets too high, performance issues due to the volume of samples and time series can result. You could try combining similar states together, but in the worst case you may have to fall back to using a gauge with values such as 0–3 to represent the enum, and deal with the complexity that brings to PromQL. This is discussed further in "Cardinality" on page 93.
Info
The second common case for strings is info metrics, which you may also find called the machine roles approach for historical reasons.11 Info metrics are useful for annotations such as version numbers and other build information that would be useful to query on, but it doesn't make sense to use them as target labels, which apply to every metric from a target (discussed in "Target Labels" on page 139).
The convention that has emerged is to use a gauge with the value 1 and all the strings you'd like to have annotating the target as labels. The gauge should have the suffix

11 My article https://www.robustperception.io/how-to-have-labels-for-machine-roles/ was the first place this technique was documented.

_info. This was shown in Figure 3-2 with the python_info metric, which would look something like Example 5-8 when exposed.

Example 5-8. The python_info metric the Python client exposes by default

# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="5",patchlevel="2",
    version="3.5.2"} 1.0

To produce this in Python you could use either direct instrumentation or a custom collector. Example 5-9 takes the direct instrumentation route, and also takes advantage of the ability to pass in labels as keyword arguments with the Python client.

Example 5-9. An info metric using direct instrumentation

from prometheus_client import Gauge

version_info = {
    "implementation": "CPython",
    "major": "3",
    "minor": "5",
    "patchlevel": "2",
    "version": "3.5.2",
}

INFO = Gauge("my_python_info", "Python platform information",
        labelnames=version_info.keys())
INFO.labels(**version_info).set(1)

An info metric can be joined to any other metric using the multiplication operator and the group_left modifier. Any operator can be used to join the metrics, but as the value of the info metric is 1, multiplication won't change the value of the other metric.12 To add the version label from python_info to all up metrics you would use the PromQL expression:

up * on (instance, job) group_left(version) python_info

12 More formally, 1 is the identity element for multiplication.

The group_left(version) indicates that this is a many-to-one match13 and that the version label should be copied over from python_info into all up metrics that have the same job and instance labels. I'll look at group_left in more detail in "Many-to-One and group_left" on page 248.
You can tell from looking at this expression that the output will have the labels of the up metric, with a version label added. Adding all the labels from python_info is not possible, as you could potentially have unknown labels from both sides of the expression,14 which is not workable semantically. It is important to always know what labels are in play.

Breaking Changes and Labels
If you add or remove a label from instrumentation, it is always a breaking change. Removing a label removes a distinction a user may have been depending on. Adding a label breaks aggregation that uses the without clause.
The one exception to this is for info metrics. For those, the PromQL expressions are constructed such that extra labels aren't a problem, so it's safe for you to add labels to info metrics.

Info metrics also have a value of 1, so it is easy to calculate how many time series have each label value using sum. The number of application instances running each version of Python would be sum by (version)(python_info). If it were a different value such as 0, a mix of sum and count would be required in your aggregation hierarchy, which would be both more complicated and error prone.
When to Use Labels
For a metric to be useful, you need to be able to aggregate it somehow. The rule of thumb is that either summing or averaging across a metric should produce a meaningful result. For a counter of HTTP requests split by path and method, the sum is the total number of requests. For a queue, combining the items in it and its size limit into one metric would not make sense, as neither summing nor averaging it produces anything useful.

13 In this case it is only one-to-one as there is only one up time series per python_info; however, you could use the same expression for metrics with multiple time series per target.

14 Target labels for up and any additional instrumentation labels added to python_info in the future.

One hint that an instrumentation label is not useful is if any time you use the metric you find yourself needing to specify that label in PromQL. In such a case you should probably move the label to be in the metric name instead.
Another thing to avoid is having a time series that is a total of the rest of the metric such as:

some_metric{label="foo"} 7
some_metric{label="bar"} 13
some_metric{label="total"} 20

or:

some_metric{label="foo"} 7
some_metric{label="bar"} 13
some_metric{} 20

Both of these break aggregation with sum in PromQL as you'd be double counting. PromQL already provides you with the ability to calculate this aggregate.

Table Exception
Astute readers probably noticed that summary metric quantiles break the rule about the sum or average being meaningful because you can't do math on quantiles. This is what I call the table exception, where even though you can't do math across a metric, it's better to (ab)use a label than to have to do regexes against a metric name. Regexes on metric names are a very bad smell, and should never be used in graphs or alerts.
For you this exception should only ever come up when writing exporters, never for direct instrumentation. For example, you might have an unknown mix of voltages, fan speeds, and temperatures coming from hardware sensors. As you lack the information needed to split them into different metrics, the only thing you can really do is shove them all into one metric and leave it to the person consuming the metric to interpret it.

The label names used for a metric should not change during the lifetime of an application process. If you feel the need for this, you probably want a logs-based monitoring solution for that use case.

Cardinality Don’t go too far when using labels. Monitoring is a means to an end, so more time series and more monitoring aren’t always better. For a monitoring system, whether you run it yourself on-premises or pay a company to run it for you in the cloud, every

time series and sample has both a resource cost and a human cost in terms of ongoing operations to keep the monitoring system up and running.
In this context I would like to talk about cardinality, which in Prometheus is the number of time series you have. If your Prometheus is provisioned to handle, say, ten million time series,15 how would you best spend those? At what point do you move certain use cases to logs-based monitoring instead?
The way I look at it is to assume that someone running your code has a large setup with a thousand instances of a particular application.16 Adding a simple counter metric to an obscure subsystem will add a thousand time series to your Prometheus, which is 0.01% of its capacity. That is basically free, and it might help you debug a weird problem one day. Across all of the application and its libraries you might have a hundred of these obscure metrics, which total to 1% of your monitoring capacity and is still quite cheap even given the rarity that you'll likely use any one of them.
Now consider a metric that has a label with 10 values and is in addition a histogram, which by default has 12 time series.17 That is 120 series per instance, or 1.2% of your monitoring capacity. That this is a good tradeoff is less clear. It might be okay to have a handful of these, but you might also consider switching to a quantile-less summary metric instead.18
The next stage is where things get a little troublesome. If a label already has a cardinality of 10, there is a good chance that it will only increase over time as new features are introduced to your application. A cardinality of 10 today might be 15 next year, and 200 might change to 300. Increased traffic from users usually means more application instances. If you have more than one of these expanding labels on a metric, the impact is compounded, resulting in a combinatorial explosion of time series. And this is just one of the ever-growing applications that Prometheus is monitoring. In this way cardinality can sneak up on you.
It is usually obvious that email addresses, customers, and IP addresses are poor choices for label values on cardinality grounds. It is less clear that the HTTP path is going to be a problem. Given that the HTTP request metric is regularly used, removing labels from it, switching away from a histogram, or even reducing the number of buckets in the histogram can be challenging politically.
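To make the arithmetic behind those percentages explicit, here is a small worked sketch using the same assumed numbers (a thousand instances and a Prometheus provisioned for ten million time series):

instances = 1000                 # assumed application instances
capacity = 10 * 1000 * 1000      # assumed provisioned time series

# A plain counter is one series per instance.
print(100.0 * 1 * instances / capacity)        # 0.01 (% of capacity)

# A histogram (12 series) with a 10-value label is 120 series per instance.
print(100.0 * 10 * 12 * instances / capacity)  # 1.2 (% of capacity)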

15 This is the approximate practical performance limit of Prometheus 1.x.

16 It is possible to have more, but it’s a reasonably conservative upper bound.

17 Ten buckets, plus the _sum and _count.

18 With only the _sum and _count time series, quantileless summary metrics are a very cheap way to get an idea of latency.

The rule of thumb I use is that the cardinality of an arbitrary metric on one application instance should be kept below 10. It is also okay to have a handful of metrics that have a cardinality around 100, but you should be prepared to reduce metric cardinality and rely on logs as that cardinality grows.

The allowance of a handful of metrics with a cardinality around 100 presumes a thousand instances. If you are 100% certain that you will not reach these numbers, such as with applications you will run exactly one of, you can adjust the rule of thumb accordingly.

There is a common pattern that I have seen when Prometheus is introduced to an organisation. Initially, no matter how hard you try to convince your users to use Prometheus, they won't see the point. Then at some point it clicks, and they start to grasp the power of labels. Performance issues in your Prometheus due to label cardinality usually follow quickly after that. I advise talking about the limitations of cardinality with your users early on, and also considering using sample_limit as an emergency safety valve (see "Reducing Load" on page 350). The ten biggest metrics in a Prometheus commonly constitute over half of its resource usage, and this is almost always due to label cardinality.
There is sometimes confusion: if the issue is the number of label values, wouldn't moving the label values into the metric name fix the problem? As the underlying resource constraint is actually time series cardinality (which manifests due to label values), moving label values to the metric name doesn't change the cardinality, it just makes the metrics harder to use.19
Now that you can add metrics to your applications and know some basic PromQL expressions, in the following chapter I'll show you how you can create dashboards in Grafana.

19 It would also make it harder to pinpoint the metrics responsible for your resource usage.


CHAPTER 6 Dashboarding with Grafana

When you get an alert or want to check on the current performance of your systems, dashboards will be your first port of call. The expression browser that you have seen up to now is fine for ad hoc graphing and when you need to debug your PromQL, but it's not designed to be used as a dashboard.
What do I mean by dashboard? A set of graphs, tables, and other visualisations of your systems. You might have a dashboard for global traffic, which services are getting how much traffic and with what latency. For each of those services you would likely have a dashboard of its latency, errors, request rate, instance count, CPU usage, memory usage, and service-specific metrics. Drilling down, you could have a dashboard for particular subsystems or each service, or a garbage collection dashboard that can be used with any Java application.
Grafana is a popular tool with which you can build such dashboards for many different monitoring and nonmonitoring systems, including Graphite, InfluxDB, Elasticsearch, and PostgreSQL. It is the recommended tool for you to create dashboards when using Prometheus, and is continuously improving its Prometheus support.
In this chapter I introduce using Grafana with Prometheus, extending the Prometheus and Node exporter you set up in Chapter 2.

Promdash and Console Templates
Originally the Prometheus project had its own dashboarding tool called Promdash. Even though Promdash was better at the time for Prometheus use cases, the Prometheus developers decided to rally around Grafana rather than have to continue to work on their own dashboarding solution. These days Prometheus is a first-class plug-in in Grafana, and also one of the most popular.1
There is a feature included with Prometheus called console templates that can be used for dashboards. Unlike Promdash and Grafana, which store dashboards in relational databases, it is built right into Prometheus and is configured from the filesystem. It allows you to render web pages using Go's templating language2 and easily keep your dashboards in source control. Console templates are a very raw feature upon which you could build a dashboard system, and as such it is recommended only for niche use cases and advanced users.

Installation
You can download Grafana from https://grafana.com/grafana/download. The site includes installation instructions, but if you're using Docker, for example, you would use:

docker run -d --name=grafana --net=host grafana/grafana:5.0.0

Note that this doesn't use a volume mount,3 so it will store all state inside the container. I use Grafana 5.0.0 here. You can use a newer version but be aware that what you see will likely differ slightly.
Once Grafana is running you should be able to access it in your browser at http://localhost:3000/, and you will see a login screen like the one in Figure 6-1.

1 Grafana by default reports anonymous usage statistics. This can be disabled with the reporting_enabled set‐ ting in its configuration file.

2 This is the same templating language that is used for alert templating, with some minor differences in available functionality.

3 A way to have filesystems shared across containers over time, as by default a Docker container’s storage is specific to that container. Volume mounts can be specified with the -v flag to docker run.

Figure 6-1. Grafana login screen

Log in with the default username of admin and the default password, which is also admin. You should see the Home Dashboard as shown in Figure 6-2. I have switched to the Light theme in the Org Settings in order to make things easier to see in my screenshots.

Figure 6-2. Grafana Home Dashboard on a fresh install
Data Source
Grafana uses data sources to fetch information used for graphs. There are a variety of types of data sources supported out of the box, including OpenTSDB, PostgreSQL,

Data Source | 99 and of course, Prometheus. You can have many data sources of the same type, and usually you would have one per Prometheus running. A Grafana dashboard can have graphs from variety of sources, and you can even mix sources in a graph panel. More recent versions of Grafana make it easy to add your first data source. Click on Add data source and add a data source with a Name of Prometheus, a Type of Prome‐ theus, and a URL of http://localhost:9090 (or whatever other URL your Prometheus from Chapter 2 is listening on). The form should look like Figure 6-3. Leave all other settings at their defaults, and finally click Save&Test. If it works you will get a mes‐ sage that the data source is working. If you don’t, check that the Prometheus is indeed running and that it is accessible from Grafana.4

Figure 6-3. Adding a Prometheus data source to Grafana

4 The Access proxy setting has Grafana make the requests to your Prometheus. By contrast, the direct setting has your browser make the request.
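If you would rather not click through the web interface, Grafana 5.0 onwards can also provision data sources from files on disk. This is only a sketch and not something this chapter relies on; it assumes Grafana's default provisioning directory of /etc/grafana/provisioning/datasources/, so check the Grafana documentation for the details of your installation:

# A file such as /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
 - name: Prometheus
   type: prometheus
   url: http://localhost:9090
   access: proxy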

Dashboards and Panels

Go again to http://localhost:3000/ in your browser, and this time click New dashboard, which will bring you to a page that looks like Figure 6-4.

Figure 6-4. A new Grafana dashboard

From here you can select the first panel you’d like to add. Panels are rectangular areas containing a graph, table, or other visual information. You can add new panels beyond the first with the Add panel button, which is the button on the top row with the orange plus sign. As of Grafana 5.0.0, panels are organised within a grid system,5 and can be rearranged using drag and drop.

After making any changes to a dashboard or panels, if you want them to be remembered you must explicitly save them. You can do this with the save button at the top of the page or using the Ctrl-S keyboard shortcut.

You can access the dashboard settings, such as its name, using the gear icon up top. From the settings menu you can also duplicate dashboards using Save As, which is handy when you want to experiment with a dashboard.

5 Previously, Grafana panels were contained within rows.

Avoiding the Wall of Graphs

It is not unusual to end up with multiple dashboards per service you run. It is easy for dashboards to gradually get bloated with too many graphs, making it challenging for you to interpret what is actually going on. To mitigate this you should try to avoid dashboards that serve more than one team or purpose, and instead give them a dashboard each.

The more high-level a dashboard is, the fewer rows and panels it should have. A global overview should fit on one screen and be understandable at a distance. Dashboards commonly used for oncall might have a row or two more than that, whereas a dashboard for in-depth performance tuning by experts might run to several screens.

Why do I recommend that you limit the number of graphs on each of your dashboards? The answer is that every graph, line, and number on a dashboard makes it harder to understand. This is particularly relevant when you are oncall and handling alerts. When you are stressed, need to act quickly, and are possibly only half-awake, having to remember the subtler points of what each graph on your dashboard means is not going to aid you in terms of either response time or taking an appropriate action.

To give an extreme example, one service I worked on had a dashboard (singular) with over 600 graphs.6 This was hailed as superb monitoring, due to the vast wealth of data on display. The sheer volume of data meant I was never able to get my head around that dashboard, plus it took rather a long time to load. I like to call this style of dashboarding the Wall of Graphs antipattern.

You should not confuse having lots of graphs with having good monitoring. Monitoring is ultimately about outcomes, such as faster incident resolution and better engineering decisions, not pretty graphs.

Graph Panel

The Graph panel is the main panel you will be using. As the name indicates, it displays a graph. As seen in Figure 6-4, click the Graph button to add a graph panel. You now have a blank graph. To configure it, click Panel Title and then Edit as shown in Figure 6-5.7

6 The worst case of this I have heard of weighed in at over 1,000 graphs.

7 You can use the e keyboard shortcut to open the editor while hovering over the panel. You can press ? to view a full list of keyboard shortcuts.

Figure 6-5. Opening the editor for a graph panel

The graph editor will open on the Metrics tab. Enter process_resident_memory_bytes for the query expression, in the text box beside A,8 as shown in Figure 6-6, and then click outside of the text box. You will see a graph of memory usage similar to what Figure 2-7 showed when the same expression was used in the expression browser.

Figure 6-6. The expression process_resident_memory_bytes in the graph editor

8 The A indicates that it is the first query.

Grafana offers more than the expression browser. You can configure the legend to display something other than the full time series name. Put {{job}} in the Legend Format text box. On the Axes tab, change the Left Y Unit to data/bytes. On the General tab, change the Title to Memory Usage. The graph will now look something like Figure 6-7, with a more useful legend, appropriate units on the axis, and a title.

Figure 6-7. Memory Usage graph with custom legend, title, and axis units configured

These are the settings you will want to configure on virtually all of your graphs, but this is only a small taste of what you can do with graphs in Grafana. You can configure colours, draw style, tool tips, stacking, filling, and even include metrics from multiple data sources.

Don't forget to save the dashboard before continuing! New dashboard is a special dashboard name for Grafana, so you should choose something more memorable.

Time Controls

You may have noticed Grafana's time controls on the top right of the page. By default, it should say "Last 6 hours." Clicking on the time controls will show Figure 6-8 from where you can choose a time range and how often to refresh. The time controls apply to an entire dashboard at once, though you can also configure some overrides on a per-panel basis.

Figure 6-8. Grafana's time control menu

Aliasing

As your graphs refresh you may notice that the shape can change, even though the underlying data hasn't changed. This is a signal processing effect called aliasing. You may already be familiar with aliasing from the graphics in first-person video games, where the rendering artifacts of a distant object change and may seem to flicker as you walk toward it.

The same thing is happening here. Each rendering of the data is at a slightly different time, so functions such as rate will calculate slightly different results. None of these results are wrong, they're just different approximations of what is going on. This is a fundamental limitation of metrics-based monitoring, and any other system that takes samples, and is related to the Nyquist-Shannon sampling theorem.

You can mitigate aliasing by increasing the frequency of your scrapes and evaluations, but ultimately to get a 100% accurate view of what is going on you need logs as logs have an exact record of every single event.

Singlestat Panel

The Singlestat panel displays the value of a single time series. More recent versions of Grafana can also show a Prometheus label value.

I will start this example by adding a time series value. Click on Back to dashboard (the back arrow in the top right) to return from the graph panel to the dashboard view. Click on the Add panel button and add a Singlestat panel. As you did for the previous panel, click on Panel Title and then Edit. For the query expression on the Metrics tab, use prometheus_tsdb_head_series, which is (roughly speaking) the number of different time series Prometheus is ingesting. By default the Singlestat panel will calculate the average of the time series over the dashboard's time range. This is often not what you want, so on the Options tab, change the Stat to Current.

The default text can be a bit small, so change the Font Size to 200%. On the General tab, change the Title to Prometheus Time Series. Finally, click Back to dashboard and you should see something like Figure 6-9.

Figure 6-9. Dashboard with a graph and Singlestat panel

Displaying label values is handy for software versions on your graphs. Add another Singlestat panel; this time you will use the query expression node_uname_info, which contains the same information as the uname -a command. Set the Format as to Table, and on the Options tab set the Column to release. Leave the Font size as-is, as kernel versions can be relatively long. Under the General tab, the Title should be Kernel version. After clicking Back to dashboard and rearranging the panels using drag and drop, you should see something like Figure 6-10.

Figure 6-10. Dashboard with a graph and two Singlestat panels, one numeric and one text

The Singlestat panel has further features including different colours depending on the time series value, and displaying sparklines behind the value.

Table Panel

While the Singlestat panel can only display one time series at a time, the Table panel allows you to display multiple time series. Table panels tend to require more configuration than other panels, and all the text can look cluttered on your dashboards.

Add a new panel, this time a Table panel. As before, click Panel Title and then Edit. Use the query expression rate(node_network_receive_bytes_total[1m]) on the Metrics tab, and tick the Instant checkbox. There are more columns here than you need. On the Column styles tab, change the existing Time rule to have a Type of Hidden. Click +Add to add a new rule with Apply to columns named job with Type of

Hidden, and then add another rule hiding the instance. To set the unit, +Add a rule for the Value column and set its Unit to bytes/sec under data rate. Finally, on the General tab, set the title to Network Traffic Received. After all that, if you go Back to dashboard and rearrange the panels, you should see a dashboard like the one in Figure 6-11.

Figure 6-11. Dashboard with several panels, including a table for per-device network traffic

Template Variables

All the dashboard examples I have shown you so far have applied to a single Prometheus and a single Node exporter. This is fine for demonstration of the basics, but not great when you have tens or even hundreds of machines to monitor. The good news is that you don't have to create a dashboard for every individual machine. You can use Grafana's templating feature.

You only have monitoring for one machine set up, so for this example I will template based on network devices, as you should have at least two of those.9 To start with, create a new dashboard by clicking on the dashboard name and then +New dashboard at the bottom of the screen, as you can see in Figure 6-12.

Figure 6-12. Dashboard list, including a button to create new dashboards

Click on the gear icon up top and then Variables.10 Click +Add variable to add a template variable. The Name should be Device, and the Data source is Prometheus with a Refresh of On Time Range Change. The Query you will use is node_network_receive_bytes_total with a Regex of .*device="(.*?)".*, which will extract out the value of the device labels. The page should look like Figure 6-13. You can finally click Add to add the variable.

9 Loopback and your wired and/or WiFi device.

10 This was called templating in previous Grafana versions.

Figure 6-13. Adding a Device template variable to a Grafana dashboard

When you click Back to dashboard, a dropdown for the variable will now be available on the dashboard as seen in Figure 6-14.

Figure 6-14. The dropdown for the Device template variable is visible

You now need to use the variable. Click the X to close the Templating section, then click the three dots, and add a new Graph panel. As before, click on Panel Title and then Edit. Configure the query expression to be rate(node_network_receive_bytes_total{device="$Device"}[1m]), and $Device will be substituted with the value of the template variable.11 Set the Legend Format to {{device}}, the Title to Bytes Received, and the Unit to bytes/sec under data rate.

Go Back to the dashboard and click on the panel title, and this time click More and then Duplicate. This will create a copy of the existing panel. Alter the settings on this new panel to use the expression rate(node_network_transmit_bytes_total{device="$Device"}[1m]) and the Title Bytes Transmitted. The dashboard will now have panels for bytes sent in both directions as shown in Figure 6-15, and you can look at each network device by selecting it in the dropdown.

11 If using the Multi-value option, you would use device=~"$Device" as the variable would be a regular expression in that case.

Figure 6-15. A basic network traffic dashboard using a template variable

In the real world you would probably template based on the instance label and display all the network related metrics for one machine at once. You might even have multiple variables for one dashboard. This is how a generic dashboard for Java garbage collection might work: one variable for the job, one for the instance, and one to select which Prometheus data source to use.

You may have noticed that as you change the value of the variable, the URL parameters change, and similarly if you use the time controls. This allows you to share dashboard links, or have your alerts link to a dashboard with just the right variable values as shown in "Notification templates" on page 317. There is a Share dashboard icon at the top of the page you can use to create the URLs and take snapshots of the data in the dashboard. Snapshots are perfect for postmortems and outage reports, where you want to preserve how the dashboard looked.

In the next chapter I will go into more detail on the Node exporter and some of the metrics it offers.

PART III Infrastructure Monitoring

The entire world does not (yet) revolve around Prometheus, nor provide Prometheus metrics out of the box. Exporters are tools that let you translate metrics from other systems into a format that Prometheus understands. In Chapter 7 one of the first exporters you will probably use, the Node exporter, is covered in detail. In Chapter 8 you will learn how Prometheus knows what to pull metrics from and how to do so. Chapter 9 dives into monitoring of container technologies such as Docker and Kubernetes. There are literally hundreds of exporters in the Prometheus ecosystem. Chapter 10 shows you how to use various typical exporters. As you may already have another metric-based monitoring system, Chapter 11 looks at how you can integrate those into Prometheus. Exporters don’t appear from thin air. If the exporter you want doesn’t exist, you can use Chapter 12 to create one.

CHAPTER 7 Node Exporter

The Node exporter1 is likely one of the first exporters you will use, as already seen in Chapter 2. It exposes machine-level metrics, largely from your operating system's kernel, such as CPU, memory, disk space, disk I/O, network bandwidth, and motherboard temperature. The Node exporter is used with Unix systems; Windows users should use the wmi_exporter instead.

The Node exporter is intended only to monitor the machine itself, not individual processes or services on it. Other monitoring systems often have what I like to call an uberagent; that is, a single process that monitors everything on the machine. In the Prometheus architecture each of your services will expose its own metrics, using an exporter if needed, which is then directly scraped by Prometheus. This avoids you ending up with an uberagent as either an operational or performance bottleneck, and enables you to think in terms more of dynamic services rather than machines.

The guidelines to use when you are creating metrics with direct instrumentation, such as those discussed in "What Should I Name My Metrics?" on page 58, are relatively black and white. This is not the case with exporters, where by definition the data is coming from a source not designed with the Prometheus guidelines in mind. Depending on the volume and quality of metrics, tradeoffs have to be made by the exporter developers between engineering effort and getting perfect metrics.

In the case of Linux, there are thousands of metrics on offer. Some are well documented and understood, such as CPU usage; others, like memory usage, have varied from kernel version to kernel version as the implementation has changed. You will even find metrics that are completely undocumented, where you would have to read the kernel source code to try and figure out what they do.

1 The Node exporter has nothing to do with Node.js, it’s node in the sense of compute node.

The Node exporter is designed to be run as a nonroot user, and should be run directly on the machine in the same way you run a system daemon like sshd or cron.
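For example, on a machine using systemd you might run it under a dedicated user with a unit file along the following lines. This is a minimal sketch rather than an official unit file; it assumes the binary lives in /usr/local/bin and that a node_exporter user has been created:

[Unit]
Description=Node exporter

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target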

Docker and the Node Exporter While all components of Prometheus can be run in containers, it is not recommended to run the Node exporter in Docker. Docker attempts to isolate a container from the inner workings of the machine, which doesn’t work well with the Node exporter trying to get to those inner workings.

Unlike most other exporters, due to the broad variety of metrics available from operating systems, the Node exporter allows you to configure which categories of metrics it fetches. You can do this with command-line flags such as --collector.wifi, which would enable the WiFi collector, and --no-collector.wifi, which would disable it. There are reasonable defaults set, so this is not something you should worry about when starting out.

Different kernels expose different metrics, as for example Linux and FreeBSD do things in different ways. Metrics may move between collectors over time as the Node exporter is refactored. If you are using a different Unix system, you will find that the metrics and collectors on offer vary.

In this chapter I explain some of the key metrics Node exporter version 0.16.0 exposes with a 4.4.0 Linux kernel. This is not intended to be an exhaustive list of available metrics. As with most exporters and applications, you will want to look through the /metrics to see what is available. You can try out the example PromQL expressions using your setup from Chapter 2.
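For example, you might enable the WiFi collector and turn off one of the default collectors by starting the Node exporter with flags along these lines; the hwmon collector is picked here purely for illustration:

./node_exporter --collector.wifi --no-collector.hwmon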

Changes in 0.16.0

As part of improving the metrics from the Node exporter, version 0.16.0 (which is used in this chapter) contained changes to the names of many commonly used metrics, such as node_cpu becoming node_cpu_seconds_total. If you come across dashboards and tutorials from before this change, you will need to adjust accordingly.

CPU Collector

The main metric from the cpu collector is node_cpu_seconds_total, which is a counter indicating how much time each CPU spent in each mode. The labels are cpu and mode.

# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 48649.88
node_cpu_seconds_total{cpu="0",mode="iowait"} 169.99
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 57.5
node_cpu_seconds_total{cpu="0",mode="softirq"} 8.05
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 1058.32
node_cpu_seconds_total{cpu="0",mode="user"} 4234.94
node_cpu_seconds_total{cpu="1",mode="idle"} 9413.55
node_cpu_seconds_total{cpu="1",mode="iowait"} 57.41
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 46.55
node_cpu_seconds_total{cpu="1",mode="softirq"} 7.58
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 1034.82
node_cpu_seconds_total{cpu="1",mode="user"} 4285.06

For each CPU, the modes will in aggregate increase by one second per second. This allows you to calculate the proportion of idle time across all CPUs using the PromQL expression:

avg without(cpu, mode)(rate(node_cpu_seconds_total{mode="idle"}[1m]))

This works as it calculates the idle time per second per CPU and then averages that across all the CPUs in the machine.

You could generalise this to calculate the proportion of time spent in each mode for a machine using:

avg without(cpu)(rate(node_cpu_seconds_total[1m]))

CPU usage by guests (i.e., virtual machines running under the kernel) is already included in the user and nice modes. You can see guest time separately in the node_cpu_guest_seconds_total metric.

Filesystem Collector

The filesystem collector unsurprisingly collects metrics about your mounted filesystems, just as you would obtain from the df command. The --collector.filesystem.ignored-mount-points and --collector.filesystem.ignored-fs-types flags allow restricting which filesystems are included (the defaults exclude various pseudo filesystems). As you will not have Node exporter running as root, you will need to ensure that file permissions allow it to use the statfs system call on mountpoints of interest to you.

All metrics from this collector are prefixed with node_filesystem_ and have device, fstype, and mountpoint labels.

# HELP node_filesystem_size_bytes Filesystem size in bytes.
# TYPE node_filesystem_size_bytes gauge
node_filesystem_size_bytes{device="/dev/sda5",fstype="ext4",mountpoint="/"} 9e+10

The filesystem metrics are largely self-evident. The one subtlety you should be aware of is the difference between node_filesystem_avail_bytes and node_filesystem_free_bytes. On Unix filesystems some space is reserved for the root user, so that they can still do things when users fill up all available space. node_filesystem_avail_bytes is the space available to users, and when trying to calculate used disk space you should accordingly use:

node_filesystem_avail_bytes / node_filesystem_size_bytes

node_filesystem_files and node_filesystem_files_free indicate the number of inodes and how many of them are free, which are roughly speaking the number of files your filesystem has. You can also see this with df -i.

Diskstats Collector

The diskstats collector exposes disk I/O metrics from /proc/diskstats. By default, the --collector.diskstats.ignored-devices flag attempts to exclude things that are not real disks, such as partitions and loopback devices:

# HELP node_disk_io_now The number of I/Os currently in progress.
# TYPE node_disk_io_now gauge
node_disk_io_now{device="sda"} 0

All metrics have a device label, and almost all are counters, as follows:

node_disk_io_now
    The number of I/Os in progress.
node_disk_io_time_seconds_total
    Incremented when I/O is in progress.
node_disk_read_bytes_total
    Bytes read by I/Os.
node_disk_read_time_seconds_total
    The time taken by read I/Os.
node_disk_reads_completed_total
    The number of complete read I/Os.
node_disk_written_bytes_total
    Bytes written by I/Os.

node_disk_write_time_seconds_total
    The time taken by write I/Os.
node_disk_writes_completed_total
    The number of complete write I/Os.

These mostly mean what you think, but take a look at the kernel documentation2 for more details.

You can use node_disk_io_time_seconds_total to calculate disk I/O utilisation, as would be shown by iostat -x:

rate(node_disk_io_time_seconds_total[1m])

You can calculate the average time for a read I/O with:

rate(node_disk_read_time_seconds_total[1m])
/
rate(node_disk_reads_completed_total[1m])

Netdev Collector

The netdev collector exposes metrics about your network devices with the prefix node_network_ and a device label.

# HELP node_network_receive_bytes_total Network device statistic receive_bytes.
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="lo"} 8.3213967e+07
node_network_receive_bytes_total{device="wlan0"} 7.0854462e+07

node_network_receive_bytes_total and node_network_transmit_bytes_total are the main metrics you will care about as you can calculate network bandwidth in and out with them:

rate(node_network_receive_bytes_total[1m])

You may also be interested in node_network_receive_packets_total and node_network_transmit_packets_total, which track packets in and out, respectively.

Meminfo Collector

The meminfo collector has all your standard memory-related metrics with a node_memory_ prefix. These all come from your /proc/meminfo, and this is the first collector where semantics get a bit muddy. The collector does convert kilobytes to

2 A sector is always 512 bytes in /proc/diskstats; you do not need to worry if your disks are using larger sector sizes. This is an example of something that is only apparent from reading the Linux source code.

preferred bytes, but beyond that it's up to you to know enough from the documentation and experience with Linux internals to understand what these metrics mean:

# HELP node_memory_MemTotal_bytes Memory information field MemTotal.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.269582336e+09

For example, node_memory_MemTotal_bytes is the total3 amount of physical memory in the machine—nice and obvious. But note that there is no used memory metric, so you have to somehow calculate it and thus how much memory is not used from other metrics.

node_memory_MemFree_bytes is the amount of memory that isn't used by anything, but that doesn't mean it is all the memory you have to spare. In theory your page cache (node_memory_Cached_bytes) can be reclaimed, as can your write buffers (node_memory_Buffers_bytes), but that could adversely affect performance for some applications.4 In addition, there are various other kernel structures using memory such as slab and page tables.

node_memory_MemAvailable_bytes is a heuristic from the kernel for how much memory is really available, but was only added in version 3.14 of Linux. If you are running a new enough kernel, this is a metric you could use to detect memory exhaustion.

Hwmon Collector

When on bare metal, the hwmon collector provides metrics such as temperature and fan speeds with a node_hwmon_ prefix. This is the same information you can obtain with the sensors command:

# HELP node_hwmon_sensor_label Label for given chip and sensor
# TYPE node_hwmon_sensor_label gauge
node_hwmon_sensor_label{chip="platform_coretemp_0",label="core_0",sensor="temp2"} 1
node_hwmon_sensor_label{chip="platform_coretemp_0",label="core_1",sensor="temp3"} 1
# HELP node_hwmon_temp_celsius Hardware monitor for temperature (input)
# TYPE node_hwmon_temp_celsius gauge
node_hwmon_temp_celsius{chip="platform_coretemp_0",sensor="temp1"} 42
node_hwmon_temp_celsius{chip="platform_coretemp_0",sensor="temp2"} 42
node_hwmon_temp_celsius{chip="platform_coretemp_0",sensor="temp3"} 41

3 Almost.

4 Prometheus 2.0, for example, relies on page cache.

node_hwmon_temp_celsius is the temperature of various of your components, which may also have sensor labels5 exposed in node_hwmon_sensor_label. While it is not the case for all hardware, for some6 you will need the sensor label to understand what the sensor is. In the preceding metrics, temp3 represents CPU core number 1.

You can join the label label from node_hwmon_sensor_label to node_hwmon_temp_celsius using group_left, which is further discussed in "Many-to-One and group_left" on page 248:

node_hwmon_temp_celsius * ignoring(label) group_left(label) node_hwmon_sensor_label

Stat Collector

The stat collector is a bit of a mix, as it provides metrics from /proc/stat.7

node_boot_time_seconds is when the kernel started, from which you can calculate how long the kernel has been up:

time() - node_boot_time_seconds

node_intr_total indicates the number of hardware interrupts you have had. It isn't called node_interrupts_total, as that is used by the interrupts collector, which is disabled by default due to high cardinality.

The other metrics relate to processes. node_forks_total is a counter for the number of fork syscalls, node_context_switches_total is the number of context switches, while node_procs_blocked and node_procs_running indicate the number of processes that are blocked or running.

Uname Collector

The uname collector exposes a single metric node_uname_info, which you already saw in "Singlestat Panel" on page 105:

# HELP node_uname_info Labeled system information as provided by the uname system call.
# TYPE node_uname_info gauge

5 Labels here does not mean Prometheus labels; they are sensor labels and come from files such as /sys/devices/platform/coretemp.0/hwmon/hwmon1/temp3_label.

6 Such as my laptop, which the above metric output is from.

7 It used to also provide CPU metrics, which have now been refactored into the cpu collector.

node_uname_info{domainname="(none)",machine="x86_64",nodename="kozo",release="4.4.0-101-generic",sysname="Linux",version="#124-Ubuntu SMP Fri Nov 10 18:29:59 UTC 2017"} 1

The nodename label is the hostname of the machine, which may differ from the instance target label (see "Target Labels" on page 139) or any other names, such as in DNS, that you may have for it.

To count how many machines run which kernel version you could use:

count by(release)(node_uname_info)

Loadavg Collector

The loadavg collector provides the 1-, 5-, and 15-minute load averages as node_load1, node_load5, and node_load15, respectively.

The meaning of this metric varies across platforms, and may not mean what you think it does. For example, on Linux it is not just the number of processes waiting in the run queue, but also uninterruptible processes such as those waiting for I/O.

Load averages can be useful for a quick and dirty idea if a machine has gotten busier (for some definition of busier) recently, but they are not a good choice to alert on. For a more detailed look I recommend "Linux Load Averages: Solving the Mystery".

    Its a silly number but people think its important.
        —A comment in the Linux loadavg.c

Textfile Collector

The textfile collector is a bit different from the collectors I have already shown you. It doesn't obtain metrics from the kernel, but rather from files that you produce.

The Node exporter is not meant to run as root, so metrics such as those from SMART8 require root privileges to run the smartctl command. In addition to metrics that require root, you can only obtain some information by running a command such as iptables. For reliability the Node exporter does not start processes.

To use the textfile collector you would create a cronjob that regularly runs commands such as smartctl or iptables, converts its output into the Prometheus text exposition format, and atomically writes it to a file in a specific directory. On every scrape,

8 Self-Monitoring, Analysis, and Reporting Technology, metrics from hard drives that can be useful to predict and detect failure.

the Node exporter will read the files in that directory and include their metrics in its output.

You can use this collector to add in your own metrics via cronjobs, or you could have more static information that comes from files written out by your machine configuration management system to provide some info metrics (discussed in "Info" on page 90), such as which Chef roles it has, about the machine.

As with the Node exporter generally, the textfile collector is intended for metrics about a machine. For example, there might be some kernel metric that the Node exporter does not yet expose, or that requires root to access. You might want to track more operating system–level metrics such as if there are pending package updates or a reboot due.

While it is technically a service rather than an operating system metric, recording when batch jobs such as backups last completed for the Cassandra9 node running on the machine would also be a good use of the textfile collector, as your interest in whether the backups worked on that machine goes away when the machine does. That is to say the Cassandra node has the same lifecycle as the machine.10

The textfile collector should not be used to try and convert Prometheus to push. Nor should you use the textfile collector as a way to take metrics from other exporters and applications running on the machine and expose them all on the Node exporter's /metrics, but rather have Prometheus scrape each exporter and application individually.

Using the Textfile Collector

The textfile collector is enabled by default, but you must provide the --collector.textfile.directory command-line flag to the Node exporter for it to work. This should point to a directory that you use solely for this purpose to avoid mixups. To try this out you should create a directory, write out a simple file in the exposition format (as discussed in "Exposition Format" on page 76), and start the Node exporter with it configured to use this directory, as seen in Example 7-1. The textfile collector only looks at files with the .prom extension.

9 A distributed database.

10 If a metric about a batch job has a different lifecycle than the machine, it is likely a service-level batch job and you may wish to use the Pushgateway as discussed in “Pushgateway” on page 71.

Example 7-1. Using the textfile collector with a simple example

hostname $ mkdir textfile
hostname $ echo example_metric 1 > textfile/example.prom
hostname $ ./node_exporter --collector.textfile.directory=$PWD/textfile

If you look at the Node exporter's /metrics you will now see your metric:

# HELP example_metric Metric read from /some/path/textfile/example.prom
# TYPE example_metric untyped
example_metric 1

If no HELP is provided the textfile collector will provide one for you. If you are putting the same metric in multiple files (with different labels of course) you need to provide the same HELP for each, as otherwise the mismatched HELP will cause an error.
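For example, a .prom file that provides its own HELP and TYPE might look like the following; the metric name here is made up purely for illustration:

# HELP example_metric A metric provided via the textfile collector.
# TYPE example_metric gauge
example_metric 1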

Usually you will create and update the .prom files with a cronjob. As a scrape can happen at any time, it is important that the Node exporter does not see partially written files. To this end you should write first to a temporary file in the same directory and then move the complete file to the final filename.11

Example 7-2 shows a cronjob that outputs to the textfile collector. It creates the metrics in a temporary file,12 and renames them to the final filename. This is a trivial example that uses short commands, but in most real-world use cases you will want to create a script to keep things readable.

Example 7-2. /etc/crontab that exposes the number of lines in /etc/shadow as the shadow_entries metric using the textfile collector

TEXTFILE=/path/to/textfile/directory

# This must all be on one line
*/5 * * * * root (echo -n 'shadow_entries '; grep -c . /etc/shadow) > $TEXTFILE/shadow.prom.$$ && mv $TEXTFILE/shadow.prom.$$ $TEXTFILE/shadow.prom

A number of example scripts for use with the textfile collector are available in the node exporter Github repository.
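As a rough idea of what such a script might look like, here is a sketch that produces the same shadow_entries metric as Example 7-2 with the write-to-a-temporary-file-and-rename pattern made explicit; the directory, metric, and commands are only illustrative, so adjust them to your own needs:

#!/bin/sh
set -e
TEXTFILE=/path/to/textfile/directory
TMPFILE=$TEXTFILE/shadow.prom.$$

# Write the metrics to a temporary file in the same directory first...
{
  echo '# HELP shadow_entries Number of lines in /etc/shadow.'
  echo '# TYPE shadow_entries gauge'
  printf 'shadow_entries '
  grep -c . /etc/shadow
} > "$TMPFILE"

# ...and then atomically rename it so a scrape never sees a partial file.
mv "$TMPFILE" "$TEXTFILE/shadow.prom"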

11 The rename system call is atomic, but can only be used on the same filesystem.

12 $$ in shell expands to the current process id (pid).

Timestamps

While the exposition format supports timestamps, you cannot use them with the textfile collector. This is because it doesn't make sense semantically, as your metrics would not appear with the same timestamp as other metrics from the scrape.

Instead, the mtime13 of the file is available to you in the node_textfile_mtime_seconds metric. You can use this to alert on your cronjobs not working, as if this value is from too long ago it can indicate a problem:

# HELP node_textfile_mtime_seconds Unixtime mtime of textfiles successfully read.
# TYPE node_textfile_mtime_seconds gauge
node_textfile_mtime_seconds{file="example.prom"} 1.516205651e+09

Now that you have the Node exporter running, let's look at how you can tell Prometheus about all the machines you have it running on.

13 The mtime is the last time the file was modified.


CHAPTER 8 Service Discovery

Thus far you've had Prometheus find what to scrape using static configuration via static_configs. This is fine for simple use cases,1 but having to manually keep your prometheus.yml up to date as machines are added and removed would get annoying, particularly if you were in a dynamic environment where new instances might be brought up every minute. This chapter will show you how you can let Prometheus know what to scrape.

You already know where all of your machines and services are, and how they are laid out. Service discovery (SD) enables you to provide that information to Prometheus from whichever database you store it in. Prometheus supports many common sources of service information, such as Consul, Amazon's EC2, and Kubernetes out of the box. If your particular source isn't already supported, you can use the file-based service discovery mechanism to hook it in. This could be by having your configuration management system, such as Ansible or Chef, write the list of machines and services they know about in the right format, or a script running regularly to pull it from whatever data source you use.

Knowing what your monitoring targets are, and thus what should be scraped, is only the first step. Labels are a key part of Prometheus (see Chapter 5), and assigning target labels to targets allows them to be grouped and organised in ways that make sense to you. Target labels allow you to aggregate targets performing the same role, that are in the same environment, or are run by the same team.

As target labels are configured in Prometheus rather than in the applications and exporters themselves, this allows your different teams to have label hierarchies that

1 My home Prometheus uses a hardcoded static configuration, for example, as I only have a handful of machines.

make sense to them. Your infrastructure team might care only about which rack and PDU2 a machine is on, while your database team would care that it is the PostgreSQL master for their production environment. If you had a kernel developer who was investigating a rarely occurring problem, they might just care which kernel version was in use. Service discovery and the pull model allow all these views of the world to coexist, as each of your teams can run their own Prometheus with the target labels that make sense to them.

Service Discovery Mechanisms

Service discovery is designed to integrate with the machine and service databases that you already have. Out of the box, Prometheus 2.2.1 has support for Azure, Consul, DNS, EC2, OpenStack, File, Kubernetes, Marathon, Nerve, Serverset, and Triton service discovery in addition to the static discovery you have already seen.

Service discovery isn't just about you providing a list of machines to Prometheus, or monitoring. It is a more general concern that you will see across your systems; applications need to find their dependencies to talk to, and hardware technicians need to know which machines are safe to turn off and repair. Accordingly, you should not only have a raw list of machines and services, but also conventions around how they are organised and their lifecycles.

A good service discovery mechanism will provide you with metadata. This may be the name of a service, its description, which team owns it, structured tags about it, or anything else that you may find useful. Metadata is what you will convert into target labels, and generally the more metadata you have, the better.

A full discussion of service discovery is beyond the scope of this book. If you haven't gotten around to formalising your configuration management and service databases yet, Consul tends to be a good place to start.

2 The Power Distribution Unit, part of the electrical system in a datacenter. PDUs usually feed a group of racks with electricity, and knowing the CPU load on each machine could be useful to ensure each PDU can provide the power required.

Top-down Versus Bottom-up

There are two broad categories of service discovery mechanisms you will come across. Those where the service instances register with service discovery, such as Consul, are bottom-up. Those where instead the service discovery knows what should be there, such as EC2, are top-down.

Both approaches are common. Top-down makes it easy for you to detect if something is meant to be running but isn't. However, for bottom-up you would need a separate reconciliation process to ensure things are in sync, so that cases such as an application instance that stalls before it can register are caught.

Static

You have already seen static configuration in Chapter 2, where targets are provided directly in the prometheus.yml. It is useful if you have a small and simple setup that rarely changes. This might be your home network, a scrape config that is only for a local Pushgateway, or even Prometheus scraping itself as in Example 8-1.

Example 8-1. Using static service discovery to have Prometheus scrape itself

scrape_configs:
 - job_name: prometheus
   static_configs:
    - targets:
       - localhost:9090

If you are using a configuration management tool such as Ansible, you could have its templating system write out a list of all the machines it knows about to have their Node exporters scraped, such as in Example 8-2.

Example 8-2. Using Ansible's templating to create targets for the Node exporter on all machines

scrape_configs:
 - job_name: node
   static_configs:
    - targets:
{% for host in groups["all"] %}
       - {{ host }}:9100
{% endfor %}

In addition to providing a list of targets, a static config can also provide labels for those targets in the labels field. If you find yourself needing this, then file SD, covered in "File" on page 130, tends to be a better approach.
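For reference, a static config with target labels might look something like the following sketch, which attaches a team label to both targets:

scrape_configs:
 - job_name: node
   static_configs:
    - targets:
       - host1:9100
       - host2:9100
      labels:
        team: infra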

The plural in static_configs indicates that it is a list, and you can specify multiple static configs in one scrape config, as shown in Example 8-3. While there is not much point to doing this for static configs, it can be useful with other service discovery mechanisms if you want to talk to multiple data sources. You can even mix and match service discovery mechanisms within a scrape config, though that is unlikely to result in a particularly understandable configuration.

Example 8-3. Two monitoring targets are provided, each in its own static config

scrape_configs:
 - job_name: node
   static_configs:
    - targets:
       - host1:9100
    - targets:
       - host2:9100

The same applies to scrape_configs, a list of scrape configs in which you can specify as many as you like. The only restriction is that the job_name must be unique.

File

File service discovery, usually referred to as file SD, does not use the network. Instead, it reads monitoring targets from files you provide on the local filesystem. This allows you to integrate with service discovery systems Prometheus doesn't support out of the box, or when Prometheus can't quite do the things you need with the metadata available.

You can provide files in either JSON or YAML formats. The file extension must be .json for JSON, and either .yml or .yaml for YAML. You can see a JSON example in Example 8-4, which you would put in a file called filesd.json. You can have as many or as few targets as you like in a single file.

Example 8-4. filesd.json with three targets

[
  {
    "targets": [ "host1:9100", "host2:9100" ],
    "labels": {
      "team": "infra",
      "job": "node"
    }
  },
  {
    "targets": [ "host1:9090" ],
    "labels": {
      "team": "monitoring",
      "job": "prometheus"
    }
  }
]

JSON

The JSON format is not perfect. One issue you will likely encounter here is that the last item in a list or hash cannot have a trailing comma. I would recommend using a JSON library to generate JSON files rather than trying to do it by hand.
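If you prefer, the same targets as Example 8-4 could instead be written as YAML, which avoids the trailing comma problem; a sketch of such a file, say filesd.yml, might be:

- targets:
   - host1:9100
   - host2:9100
  labels:
    team: infra
    job: node
- targets:
   - host1:9090
  labels:
    team: monitoring
    job: prometheus

Remember that the files list in Example 8-5 only globs '*.json', so you would need to add '*.yml' to it for such a file to be picked up.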

Configuration in Prometheus uses file_sd_configs in your scrape config as shown in Example 8-5. Each file SD config takes a list of filepaths, and you can use globs in the filename.3 Paths are relative to Prometheus’s working directory, which is to say the directory you start Prometheus in.

Example 8-5. prometheus.yml using file service discovery

scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'

Usually you would not provide metadata for use with relabelling when using file SD, but rather the ultimate target labels you would like to have. If you visit http://localhost:9090/service-discovery in your browser4 and click on show more, you will see Figure 8-1, with both job and team labels from filesd.json.5 As these are made up targets, the scrapes will fail, unless you actually happen to have a host1 and host2 on your network.

3 You cannot, however, put globs in the directory, so a/b/*.json is fine, a/*/file.json is not.

4 This endpoint was added in Prometheus 2.1.0. On older versions you can hover over the Labels on the Targets page to see the metadata.

5 job_name is only a default, which I’ll look at further in “Duplicate Jobs” on page 148. The other __ labels are special and will be covered in “How to Scrape” on page 146.

Figure 8-1. Service discovery status page showing three discovered targets from file SD

Providing the targets with a file means it could come from templating in a configuration management system, a daemon that writes it out regularly, or even from a web service via a cronjob using wget. Changes are picked up automatically using inotify, so it would be wise to ensure file changes are made atomically using rename, similarly to how you did in "Textfile Collector" on page 122.

Consul

Consul service discovery is a service discovery mechanism that uses the network, as almost all mechanisms do. If you do not already have a service discovery system within your organisation, Consul is one of the easier ones to get up and running.

Consul has an agent that runs on each of your machines, and these gossip amongst themselves. Applications talk only to the local agent on a machine. Some number of agents are also servers, providing persistence and consistency.

To try it out, you can set up a development Consul agent by following Example 8-6. If you wish to use Consul in production, you should follow the official Getting Started guide.

Example 8-6. Setting up a Consul agent in development mode

hostname $ wget https://releases.hashicorp.com/consul/1.0.2/consul_1.0.2_linux_amd64.zip
hostname $ unzip consul_1.0.2_linux_amd64.zip
hostname $ ./consul agent -dev

The Consul UI should now be available in your browser on http://localhost:8500/. Consul has a notion of services, and in the development setup has a single service, which is Consul itself. Next, run a Prometheus with the configuration in Example 8-7.

Example 8-7. prometheus.yml using Consul service discovery

scrape_configs:
 - job_name: consul
   consul_sd_configs:
    - server: 'localhost:8500'

Go to http://localhost:9090/service-discovery in your browser and you will see Figure 8-2, showing that the Consul service discovery has discovered a single target with some metadata, which became a target with instance and job labels. If you had more agents and services, they would also show up here.

Figure 8-2. Service discovery status page showing one discovered target, its metadata, and target labels from Consul

Consul does not expose a /metrics, so the scrapes from your Prometheus will fail. But it does still provide enough to find all your machines running a Consul agent, and thus should be running a Node exporter that you can scrape. I will look at how in "Relabelling" on page 135.

If you want to monitor Consul itself, there is a Consul exporter.

EC2

Amazon's Elastic Compute Cloud, more commonly known as EC2, is a popular provider of virtual machines. It is one of several cloud providers that Prometheus allows you to use out of the box for service discovery.

To use it you must provide Prometheus with credentials to use the EC2 API. One way you can do this is by setting up an IAM user with the AmazonEC2ReadOnlyAccess policy6 and providing the access key and secret key in the configuration file, as shown in Example 8-8.

Example 8-8. prometheus.yml using EC2 service discovery

scrape_configs:
 - job_name: ec2
   ec2_sd_configs:
    - region:
      access_key:
      secret_key:

If you aren't already running some, start at least one EC2 instance in the EC2 region you have configured Prometheus to look at. If you go to http://localhost:9090/service-discovery in your browser, you can see the discovered targets and the metadata extracted from EC2. __meta_ec2_tag_Name="My Display Name", for example, is the Name tag on this instance, which is the name you will see in the EC2 Console (Figure 8-3).

You may notice that the instance label is using the private IP. This is a sensible default as it is presumed that Prometheus will be running beside what it is monitoring. Not all EC2 instances have public IPs, and there are network charges for talking to an EC2 instance's public IP.

6 Only the EC2:DescribeInstances permission is needed, but policies are generally easier for you to set up initially.

Figure 8-3. Service discovery status page showing one discovered target, its metadata, and target labels from EC2

You will find that service discovery for other cloud providers is broadly similar, but the configuration required and metadata returned vary.

Relabelling

As seen in the preceding examples of service discovery mechanisms, the targets and their metadata can be a little raw. You could integrate with file SD and provide Prometheus with exactly the targets and labels you want, but in most cases you won't need to. Instead, you can tell Prometheus how to map from metadata to targets using relabelling.

Many characters, such as periods and asterisks, are not valid in Prometheus label names, so will be sanitised to underscore in service discovery metadata.

In an ideal world you will have service discovery and relabelling configured so that new machines and applications are picked up and monitored automatically. In the real world it is not unlikely that as your setup matures it will get sufficiently intricate that you have to regularly update the Prometheus configuration file, but by then you will likely also have the infrastructure where that is only a minor hurdle.

Choosing What to Scrape

The first thing you will want to configure is which targets you actually want to scrape. If you are part of one team running one service, you don't want your Prometheus to be scraping every target in the same EC2 region.

Continuing on from Example 8-5, what if you just wanted to monitor the infrastructure team's machines? You can do this with the keep relabel action, as shown in Example 8-9. The regex is applied to the values of the labels listed in source_labels (joined by a semicolon), and if the regex matches, the target is kept. As there is only one action here, this results in all targets with team="infra" being kept. But for a target with a team="monitoring" label, the regex will not match, and the target will be dropped.

Regular expressions in relabelling are fully anchored, meaning that the pattern infra will not match fooinfra or infrabar.

Example 8-9. Using a keep relabel action to only monitor targets with a team="infra" label

scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [team]
      regex: infra
      action: keep

You can have multiple relabel actions in a relabel_configs; all of them will be processed in order unless either a keep or drop action drops the target. For example, Example 8-10 will drop all targets, as a label cannot have both infra and monitoring as a value.

Example 8-10. Two relabel actions requiring contradictory values for the team label

scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [team]
      regex: infra
      action: keep
    - source_labels: [team]
      regex: monitoring
      action: keep

To allow multiple values for a label you would use | (the pipe symbol) for the alternation operator, which is a fancy way of saying one or the other. Example 8-11 shows the right way to keep only targets for either the infrastructure or monitoring teams.

Example 8-11. Using | to allow one label value or another

scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [team]
      regex: infra|monitoring
      action: keep

In addition to the keep action that drops targets that do not match, you can also use the drop action to drop targets that do match. You can also provide multiple labels in source_labels; their values will be joined with a semicolon.7 If you don’t want to scrape the Prometheus jobs of the monitoring team, you can combine these as in Example 8-12.

Example 8-12. Using multiple source labels

scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [job, team]
      regex: prometheus;monitoring
      action: drop

How you use relabelling is up to you. You should define some conventions. For example, EC2 instances should have a team tag with the name of the team that owns it, or all production services should have a production tag in Consul.

7 You can override the character used to join with the separator field.

Without conventions every new service will require special handling for monitoring, which is probably not the best use of your time.

If your service discovery mechanism includes health checking of some form, do not use this to drop unhealthy instances. Even when an instance is reporting as unhealthy it could be producing useful metrics, particularly around startup and shutdown.

Prometheus needs to have a target for each of your individual application instances. Scraping through load balancers will not work, as you can hit a different instance on each scrape, which could, for example, make counters appear to go backwards.

Regular Expressions

Prometheus uses the RE2 engine for regular expressions that comes with Go. RE2 is designed to be linear-time but does not support back references, lookahead assertions, and some other advanced features.

If you are not familiar with regular expressions, they let you provide a rule (called a pattern) that is then tested against text. The following table is a quick primer on regular expressions.

Pattern   Matches
a         The character a
.         Any single character
\.        A single period
.*        Any number of characters
.+        At least one character
a+        One or more a characters
[0-9]     Any single digit, 0-9
\d        Any single digit, 0-9
\d*       Any number of digits
[^0-9]    A single character that is not a digit
ab        The character a followed by the character b
a(b|c*)   An a, followed by a single b, or any number of c characters

In addition, parentheses create a capture group. So if you had the pattern (.)(\d+) and the text a123, then the first capture group would contain a and the second 123. Capture groups are useful to extract parts of a string for later use.
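To make that concrete, here is how the capture groups from that example line up; the ${1} syntax for referring to them is the one you will see used in relabelling later in this chapter:

pattern:                (.)(\d+)
text:                   a123
capture group 1:        a
capture group 2:        123
replacement ${2}-${1}:  123-a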

Target Labels

Target labels are labels that are added to the labels of every time series returned from a scrape. They are the identity of your targets,8 and accordingly they should not generally vary over time as might be the case with version numbers or machine owners.

Every time your target labels change the labels of the scraped time series, their identities also change. This will cause discontinuities in your graphs, and can cause issues with rules and alerts.

So what does make a good target label? You have already seen job and instance, target labels all targets have. It is also common to add target labels for the broader scope of the application, such as whether it is in development or production, their region, datacenter, and which team manages them. Labels for structure within your application can also make sense, for example, if there is sharding.

Target labels ultimately allow you to select, group, and aggregate targets in PromQL. For example, you might want alerts for development to be handled differently to production, to know which shard of your application is the most loaded, or which team is using the most CPU time.

But target labels come with a cost. While it is quite cheap to add one more label in terms of resources, the real cost comes when you are writing PromQL. Every additional label is one more you need to keep in mind for every single PromQL expression you write. For example, if you were to add a host label which was unique per target, that would violate the expectation that only instance is unique per target, which could break all of your aggregation that used without(instance). This is discussed further in Chapter 14.

As a rule of thumb your target labels should be a hierarchy, with each one adding additional distinctiveness. For example, you might have a hierarchy where regions contain datacenters that contain environments that contain services that contain jobs that contain instances. This isn't a hard and fast rule; you might plan ahead a little and have a datacenter label even if you only have one datacenter today.9

For labels the application knows about but don't make sense to have as target labels, such as version numbers, you can expose them using info metrics as discussed in "Info" on page 90.

8 It is possible for two of your targets to have the same target labels, with other settings different, but this should be avoided because metrics such as up will clash.

9 On the other hand, don’t try to plan too far in advance. It’s not unusual that, as your architecture changes over the years, your target label hierarchy will need to change with it. Predicting exactly how it will change is usually impossible. Consider, for example, if you were moving from a traditional datacenter setup to a pro‐ vider like EC2, which has availability zones.
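For concreteness, here is a minimal sketch of attaching target labels at discovery time using static_configs; the label names and values are illustrative assumptions, not a recommendation for your hierarchy:

scrape_configs:
 - job_name: node
   static_configs:
    - targets:
       - 'host1:9100'
      labels:
        region: eu-west-1
        env: prod
        team: monitor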

If you find that you want every target in a Prometheus to share some labels such as region, you should instead use external_labels for them as discussed in "External Labels" on page 303.
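A minimal sketch of external_labels, assuming a region label with an illustrative value:

global:
  external_labels:
    region: dublin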

Replace
So how do you use relabelling to specify your target labels? The answer is the replace action. The replace action allows you to copy labels around, while also applying regular expressions.

Continuing on from Example 8-5, let's say that the monitoring team was renamed to the monitor team and you can't change the file SD input yet, so you want to use relabelling instead. Example 8-13 looks for a team label that matches the regular expression monitoring (which is to say, the exact string monitoring), and if it finds it, puts the replacement value monitor in the team label.

Example 8-13. Using a replace relabel action to replace team="monitoring" with team="monitor"

scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [team]
      regex: monitoring
      replacement: monitor
      target_label: team
      action: replace

That’s fairly simple, but in practice having to specify replacement label values one by one would be a lot of work for you. Let’s say it turns out that the problem was the ing in monitoring, and you wanted relabelling to strip any trailing “ings” in team names. Example 8-14 does this by applying the regular expression (.*)ing, which matches all strings that end with ing and puts the start of the label value in the first capture group. The replacement value consists of that first capture group, which will be placed in the team label.

Example 8-14. Using a replace relabel action to remove a trailing ing from the team label

scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [team]
      regex: '(.*)ing'
      replacement: '${1}'
      target_label: team
      action: replace

If one of your targets does not have a label value that matches, such as team="infra", then the replace action has no effect on that target, as you can see in Figure 8-4.

Figure 8-4. The ing is removed from monitoring, while the infra targets are unaffected

A label with an empty value is the same as not having that label, so if you wanted to you could remove the team label using Example 8-15.

Example 8-15. Using a replace relabel action to remove the team label

scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: []
      regex: '(.*)'
      replacement: '${1}'
      target_label: team
      action: replace

All labels beginning with __ are discarded at the end of relabelling for target labels, so you don’t need to do this yourself.

Since performing a regular expression against the whole string, capturing it, and using it as the replacement is common, these are all defaults. Thus you can omit them,10 and Example 8-16 will have the same effect as Example 8-15.

Example 8-16. Using the defaults to remove the team label succinctly

scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: []
      target_label: team

Now that you have more of a sense of how the replace action works, let’s look at a more realistic example. Example 8-7 produced a target with port 80, but it’d be useful if you could change that to port 9100 where the Node exporter is running. In Example 8-17 I take the address from Consul and append :9100 to it, placing it in the __address__ label.

10 You could also omit source_labels: []. I left it in here to make it clearer that the label was being removed.

Example 8-17. Using the IP from Consul with port 9100 for the Node exporter

scrape_configs:
 - job_name: node
   consul_sd_configs:
    - server: 'localhost:8500'
   relabel_configs:
    - source_labels: [__meta_consul_address]
      regex: '(.*)'
      replacement: '${1}:9100'
      target_label: __address__

If relabelling produces two identical targets from one of your scrape configs, they will be deduplicated automatically. So if you have many Consul services running on each machine, only one target per machine would result from Example 8-17.

job, instance, and __address__
In the preceding examples you may have noticed that there was an instance target label, but no matching instance label in the metadata. So where did it come from? The answer is that if your target has no instance label, it is defaulted to the value of the __address__ label.

instance along with job are two labels your targets will always have, job being defaulted from the job_name configuration option. The job label indicates a set of instances that serve the same purpose, and will generally all be running with the same binary and configuration.11 The instance label identifies one instance within a job.

The __address__ is the host and port your Prometheus will connect to when scraping. While it provides a default for the instance label, it is separate so you can have a different value for it. For example, you may wish to use the Consul node name in the instance label, while leaving the address pointing to the IP address, as in Example 8-18. This is a better approach than adding an additional host, node, or alias label with a nicer name, as it avoids adding a second label unique to each target, which would cause complications in your PromQL.

11 A job could potentially be further divided into shards with another label.

Example 8-18. Using the IP from Consul with port 9100 as the address, with the node name in the instance label

scrape_configs:
 - job_name: consul
   consul_sd_configs:
    - server: 'localhost:8500'
   relabel_configs:
    - source_labels: [__meta_consul_address]
      regex: '(.*)'
      replacement: '${1}:9100'
      target_label: __address__
    - source_labels: [__meta_consul_node]
      regex: '(.*)'
      replacement: '${1}:9100'
      target_label: instance

Prometheus will perform DNS resolution on the __address__, so one way you can have more readable instance labels is by providing host:port rather than ip:port.

Labelmap
The labelmap action is different from the drop, keep, and replace actions you have already seen in that it applies to label names rather than label values.

Where you might find this useful is if the service discovery you are using already has a form of key-value labels, and you would like to use some of those as target labels. This might be to allow configuration of arbitrary target labels, without having to change your Prometheus configuration every time there is a new label.

EC2's tags, for example, are key-value pairs. You might have an existing convention to have the name of the service go in the service tag, whose semantics align with what the job label means in Prometheus. You might also declare a convention that any tags prefixed with monitor_ will become target labels. For example, an EC2 tag of monitor_foo=bar would become a Prometheus target label of foo="bar". Example 8-19 shows this setup, using a replace action for the job label and a labelmap action for the monitor_ prefix.

Example 8-19. Use the EC2 service tag as the job label, with all tags prefixed with monitor_ as additional target labels

scrape_configs:
 - job_name: ec2
   ec2_sd_configs:
    - region:
      access_key:
      secret_key:
   relabel_configs:
    - source_labels: [__meta_ec2_tag_service]
      target_label: job
    - regex: __meta_ec2_tag_monitor_(.*)
      replacement: '${1}'
      action: labelmap

But you should be wary of blindly copying all labels in a scenario like this, as it is unlikely that Prometheus is the only consumer of metadata such as this within your overall architecture. For example, a new cost center tag might be added to all of your EC2 instances for internal billing reasons. If that tag automatically became a target label due to a labelmap action, that would change all of your target labels and likely break graphing and alerting. Thus, using either well-known names (such as the service tag here) or clearly namespaced names (such as monitor_) is wise.

Lists
Not all service discovery mechanisms have key-value labels or tags; some just have a list of tags, with the canonical example being Consul's tags. While Consul is the most likely place that you will run into this, there are various other places where a service discovery mechanism must somehow convert a list into key-value metadata, such as the EC2 subnet ID.12

This is done by joining the items in the list with a comma and using the now-joined items as a label value. A comma is also put at the start and the end of the value, to make writing correct regular expressions easier.

As an example, say a Consul service had dublin and prod tags. The __meta_consul_tags label could have the value ,dublin,prod, or ,prod,dublin, as tags are unordered. If you wanted to only scrape production targets you would use a keep action as shown in Example 8-20.

12 An EC2 instance can have multiple network interfaces, each of which could be in different subnets.

Example 8-20. Keeping only Consul services with the prod tag

scrape_configs:
 - job_name: node
   consul_sd_configs:
    - server: 'localhost:8500'
   relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: '.*,prod,.*'
      action: keep

Sometimes you will have tags which are only the value of a key-value pair. You can convert such values to labels, but you need to know the potential values. Example 8-21 shows how a tag indicating the environment of a target can be converted into an env label.

Example 8-21. Using prod, staging, and dev tags to fill an env label

scrape_configs:
 - job_name: node
   consul_sd_configs:
    - server: 'localhost:8500'
   relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: '.*,(prod|staging|dev),.*'
      target_label: env

With sophisticated relabelling rules you may find yourself needing a temporary label to put a value in. The __tmp prefix is reserved for this purpose.
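As a hedged sketch of that idea (the Consul tag values are assumptions), one rule can stash a value in a __tmp label and a later rule can act on it; labels beginning with __ are discarded at the end of target relabelling, so __tmp_env will not end up on your targets:

relabel_configs:
 - source_labels: [__meta_consul_tags]
   regex: '.*,(prod|staging|dev),.*'
   target_label: __tmp_env
 - source_labels: [__tmp_env]
   regex: prod
   action: keep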

How to Scrape
You now have targets with their target labels and the __address__ to connect to. There are some additional things you may wish to configure, such as a path other than /metrics or client authentication. Example 8-22 shows some of the more common options you can use. As these change over time, check the documentation for the most up-to-date settings.

Example 8-22. A scrape config showing several of the available options

scrape_configs:
 - job_name: example
   consul_sd_configs:
    - server: 'localhost:8500'
   scrape_timeout: 5s
   metrics_path: /admin/metrics
   params:
     foo: [bar]
   scheme: https
   tls_config:
     insecure_skip_verify: true
   basic_auth:
     username: brian
     password: hunter2

metrics_path is only the path of the URL, and if you tried to put /metrics?foo=bar, for example, it would get escaped to /metrics%3Ffoo=bar. Instead, any URL parameters should be placed in params, though you usually only need this for federation and the classes of exporters that include the SNMP and Blackbox exporters. It is not possible to add arbitrary headers, as that would make debugging more difficult. If you need flexibility beyond what is offered, you can always use a proxy server with proxy_url to tweak your scrape requests.

scheme can be http or https, and with https you can provide additional options including the key_file and cert_file if you wish to use TLS client authentication. insecure_skip_verify allows you to disable validation of a scrape target's TLS cert, which is not advisable security-wise.

Aside from TLS client authentication, HTTP Basic Authentication and HTTP Bearer Token Authentication are offered via basic_auth and bearer_token. The bearer token can also be read from a file, rather than from the configuration, using bearer_token_file. As the bearer tokens and basic auth passwords are expected to contain secrets, they will be masked on the status pages of Prometheus so that you don't accidentally leak them.

In addition to overriding the scrape_timeout in a scrape config, you can also override the scrape_interval, but in general you should aim for a single scrape interval in a Prometheus for sanity.

Of these scrape config settings, the scheme, path, and URL parameters are available to you and can be overridden via relabelling, with the label names __scheme__, __metrics_path__, and __param_<name>. If there are multiple URL parameters with the same name, only the first is available. It is not possible to relabel other settings, for reasons varying from sanity to security.
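As a hedged sketch of setting a URL parameter per target via relabelling, in the style commonly used with Blackbox-style exporters (the module name, probe target, and exporter address are assumptions here; the Blackbox exporter itself is covered in Chapter 10):

scrape_configs:
 - job_name: blackbox
   metrics_path: /probe
   params:
     module: [http_2xx]   # assumed module name in the exporter's own configuration
   static_configs:
    - targets:
       - http://example.com   # what you want probed
   relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: localhost:9115   # where the Blackbox exporter itself is assumed to run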

Service discovery metadata is not considered security sensitive13 and will be accessible to anyone with access to the Prometheus UI. As secrets can only be specified per scrape config, it is recommended that any credentials you use are made standard across your services.

Duplicate Jobs
While job_name must be unique, as it is only a default, you are not prevented from having different scrape configs producing targets with the same job label.

For example, if you had some jobs that required a different secret, which were indicated by a Consul tag, you could segregate them using keep and drop actions, and then use a replace to set the job label:

 - job_name: my_job
   consul_sd_configs:
    - server: 'localhost:8500'
   relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: '.*,specialsecret,.*'
      action: drop
   basic_auth:
     username: brian
     password: normalSecret

 - job_name: my_job_special_secret
   consul_sd_configs:
    - server: 'localhost:8500'
   relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: '.*,specialsecret,.*'
      action: keep
    - replacement: my_job
      target_label: job
   basic_auth:
     username: brian
     password: specialSecret

metric_relabel_configs
In addition to relabelling being used for its original purpose of mapping service discovery metadata to target labels, relabelling has also been applied to other areas of Prometheus. One of those is metric relabelling: relabelling applied to the time series scraped from a target.

13 Nor are the service discovery systems typically designed to hold secrets.

The keep, drop, replace, and labelmap actions you have already seen can all be used in metric_relabel_configs as there are no restrictions on which relabel actions can be used where.14

To help you remember which is which, relabel_configs occurs when figuring out what to scrape, and metric_relabel_configs happens after the scrape has occurred.

There are two cases where you might use metric relabelling: when dropping expensive metrics and when fixing bad metrics. While it is better to fix such problems at the source, it is always good to know that you have tactical options while the fix is in progress.

Metric relabelling gives you access to the time series after it is scraped but before it is written to storage. The keep and drop actions can be applied to the __name__ label (discussed in "Reserved Labels and __name__" on page 84) to select which time series you actually want to ingest. If, for example, you discovered that the http_request_size_bytes15 metric of Prometheus had excessive cardinality and was causing performance issues, you could drop it as shown in Example 8-23. It is still being transferred over the network and parsed, but this approach can still offer you some breathing room.

Example 8-23. Dropping an expensive metric using metric_relabel_configs

scrape_configs:
 - job_name: prometheus
   static_configs:
    - targets:
       - localhost:9090
   metric_relabel_configs:
    - source_labels: [__name__]
      regex: http_request_size_bytes
      action: drop

The labels are also available to metric relabelling, so as mentioned in "Cumulative Histograms" on page 52, you can drop certain buckets (but not +Inf) of histograms and still be able to calculate quantiles. Example 8-24 shows this with the prometheus_tsdb_compaction_duration_seconds histogram in Prometheus.

14 Which is not to say that all relabel actions make sense in all relabel contexts.

15 In Prometheus 2.3.0 this metric was changed to a histogram and renamed to prometheus_http_response_size_bytes.

Example 8-24. Dropping histogram buckets to reduce cardinality

scrape_configs:
 - job_name: prometheus
   static_configs:
    - targets:
       - localhost:9090
   metric_relabel_configs:
    - source_labels: [__name__, le]
      regex: 'prometheus_tsdb_compaction_duration_seconds_bucket;(4|32|256)'
      action: drop

metric_relabel_configs only applies to metrics that you scrape from the target. It does not apply to metrics like up, which are about the scrape itself, and which will have only the target labels.

You could also use metric_relabel_configs to rename metrics, rename labels, or even extract labels from metric names.

labeldrop and labelkeep
There are two further relabel actions that are unlikely to ever be required for target relabelling, but that can come up in metric relabelling. Sometimes exporters can be overly enthusiastic in the labels they apply, or confuse instrumentation labels with target labels and return what they think should be the target labels in a scrape. The replace action can only deal with label names you know the name of in advance, which sometimes isn't the case.

This is where labeldrop and labelkeep come in. Similar to labelmap, they apply to label names rather than to label values. Instead of copying labels, labeldrop and labelkeep remove labels. Example 8-25 uses labeldrop to drop all labels with a given prefix.

Example 8-25. Dropping all scraped labels that begin with node_

scrape_configs:
 - job_name: misbehaving
   static_configs:
    - targets:
       - localhost:1234
   metric_relabel_configs:
    - regex: 'node_.*'
      action: labeldrop

When you have to use these actions, prefer using labeldrop where practical. With labelkeep you need to list every single label you want to keep, including __name__, le, and quantile.

Label Clashes and honor_labels
While labeldrop can be used when an exporter incorrectly presumes it knows what labels you want, there is a small set of exporters where the exporter does know the labels you want. For example, metrics in the Pushgateway should not have an instance label, as was mentioned in "Pushgateway" on page 71, so you need some way of not having the Pushgateway's instance target label apply.

But first let's look at what happens when there is a target label with the same name as an instrumentation label from a scrape. To avoid misbehaving applications interfering with your target label setup, it is the target label that wins. If you had a clash on the job label, for example, the instrumentation label would be renamed to exported_job.

If instead you want the instrumentation label to win and override the target label, you can set honor_labels: true in your scrape config. This is the one place in Prometheus where an empty label is not the same thing as a missing label. If a scraped metric explicitly has an instance="" label, and honor_labels: true is configured, the resultant time series will have no instance label. This technique is used by the Pushgateway.

Aside from the Pushgateway, honor_labels can also come up when ingesting metrics from other monitoring systems if you do not follow the recommendation in Chapter 11 to run one exporter per application instance.
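As a minimal sketch (assuming a Pushgateway running on its default port 9091), honor_labels is set per scrape config:

scrape_configs:
 - job_name: pushgateway
   honor_labels: true
   static_configs:
    - targets:
       - localhost:9091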

If you want more fine-grained control for handling clashing target and instrumentation labels, you can use metric_relabel_configs to adjust the labels before the metrics are added to the storage. Handling of label clashes and honor_labels is performed before metric_relabel_configs.

Now that you understand service discovery, you're ready to look at monitoring containers and how service discovery can be used with Kubernetes.


CHAPTER 9 Containers and Kubernetes

Container deployments are becoming more common with technologies such as Docker and Kubernetes—you may even already be using them. In this chapter I will cover exporters that you can use with containers, and explain how to use Prometheus with Kubernetes. All Prometheus components run happily in containers, with the sole exception of the Node exporter as noted in Chapter 7.

cAdvisor
In the same way the Node exporter provides metrics about the machine, cAdvisor is an exporter that provides metrics about cgroups. Cgroups are a Linux kernel isolation feature that are usually used to implement containers on Linux, and are also used by runtime environments such as systemd.

You can run cAdvisor with Docker:

docker run \
    --volume=/:/rootfs:ro \
    --volume=/var/run:/var/run:rw \
    --volume=/sys:/sys:ro \
    --volume=/var/lib/docker/:/var/lib/docker:ro \
    --volume=/dev/disk/:/dev/disk:ro \
    --publish=8080:8080 \
    --detach=true \
    --name=cadvisor \
    google/cadvisor:v0.28.3

Due to issues with incorrect usage of the Prometheus Go client library, you should avoid versions of cAdvisor before 0.28.3.

If you visit http://localhost:8080/metrics you will see a long list of metrics, as Figure 9-1 shows.

The container metrics are prefixed with container_, and you will notice that they all have an id label. The id labels starting with /docker/ are from Docker and its containers, and if you have /user.slice/ and /system.slice/, they come from systemd running on your machine. If you have other software using cgroups, its cgroups will also be listed.

Figure 9-1. The start of a /metrics page from cAdvisor

These metrics can be scraped with a prometheus.yml such as:

scrape_configs:
 - job_name: cadvisor
   static_configs:
    - targets:
       - localhost:8080

CPU
You will find three metrics for container CPU: container_cpu_usage_seconds_total, container_cpu_system_seconds_total, and container_cpu_user_seconds_total.

container_cpu_usage_seconds_total is split out by CPU, but not by mode. container_cpu_system_seconds_total and container_cpu_user_seconds_total are the system and user modes, respectively, similar to the Node exporter's "CPU Collector" on page 116. These are all counters with which you can use the rate function.

With many containers and CPUs in one machine, you may find that the aggregate cardinality of metrics from cAdvisor becomes a performance issue. You can use a drop relabel action, as discussed in "metric_relabel_configs" on page 148, to drop less-valuable metrics at scrape time.

Memory
Similar to the Node exporter, the memory usage metrics are less than perfectly clear and require digging through code and documentation to understand them.

container_memory_cache is the page cache used by the container, in bytes. container_memory_rss is the resident set size (RSS), in bytes. This is not the same RSS or physical memory used as a process would have, as it excludes the sizes of mapped files.1 container_memory_usage_bytes is the RSS and the page cache, and is what is limited by container_spec_memory_limit_bytes if the limit is nonzero. container_memory_working_set_bytes is calculated by subtracting the inactive file-backed memory (total_inactive_file from the kernel) from container_memory_usage_bytes.

In practice, container_memory_working_set_bytes is the closest to RSS that is exposed, and you may also wish to keep an eye on container_memory_usage_bytes as it includes page cache.

In general, I would recommend relying on metrics such as process_resident_memory_bytes from the process itself rather than metrics from the cgroups. If your applications do not expose Prometheus metrics, then cAdvisor is a good stopgap, and cAdvisor metrics are still important for debugging and profiling.

1 Mapped files include both mmap and any libraries used by a process. This is exposed by the kernel as file_mapped but is not used by cAdvisor, thus the standard RSS is not available from cAdvisor.
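Following the earlier note about cAdvisor cardinality, here is a minimal sketch of dropping some of its metrics at scrape time with metric_relabel_configs; which metric families are worth dropping is a judgment call, and the prefixes named here are only assumptions:

scrape_configs:
 - job_name: cadvisor
   static_configs:
    - targets:
       - localhost:8080
   metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'container_network_.*|container_fs_.*'   # assumed to be less valuable in this setup
      action: drop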

Labels
Cgroups are organised in a hierarchy, with the / cgroup at the root of the hierarchy. The metrics for each of your cgroups include the usage of the cgroups below it. This goes against the usual rule that within a metric the sum or average is meaningful, and is thus an example of the "Table Exception" on page 93.

In addition to the id label, cAdvisor adds in more labels about containers if it has them. For Docker containers there will always be the image and name labels, for the specific Docker image being run and Docker's name for the container.

Any metadata labels Docker has for a container will also be included with a container_label_ prefix. Arbitrary labels like these from a scrape can break your monitoring, so you may wish to remove them with a labeldrop as shown in Example 9-1, and as I talked about in "labeldrop and labelkeep" on page 150.2

Example 9-1. Using labeldrop to drop container_label_ labels from cAdvisor

scrape_configs:
 - job_name: cadvisor
   static_configs:
    - targets:
       - localhost:8080
   metric_relabel_configs:
    - regex: 'container_label_.*'
      action: labeldrop

Kubernetes
Kubernetes is a popular platform for orchestrating containers. Like Prometheus, the Kubernetes project is part of the Cloud Native Computing Foundation (CNCF). Here I am going to cover running Prometheus on Kubernetes and working with its service discovery.

As Kubernetes is a large and fast-moving target, I am not going to cover it in exhaustive detail. If you would like more depth, I would suggest the book Kubernetes: Up and Running by Joe Beda, Brendan Burns, and Kelsey Hightower (O'Reilly).

Running in Kubernetes
To demonstrate using Prometheus with Kubernetes, I will use Minikube, a tool used to run a single-node Kubernetes cluster inside a virtual machine.

2 The behaviour of cAdvisor is the main reason the labeldrop and labelkeep relabel actions were added.

Follow the steps in Example 9-2. I'm using a Linux amd64 machine with VirtualBox already installed. If you are running in a different environment, follow the Minikube installation documentation. Here I am using Minikube 0.24.1 and Kubernetes 1.8.0.

Example 9-2. Downloading and running Minikube

hostname $ wget \
    https://storage.googleapis.com/minikube/releases/v0.24.1/minikube-linux-amd64
hostname $ mv minikube-linux-amd64 minikube
hostname $ chmod +x minikube
hostname $ ./minikube start
Starting local Kubernetes v1.8.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.

minikube dashboard --url will provide you with a URL for the Kubernetes Dashboard, from which you can inspect your Kubernetes cluster.

You will also need to install kubectl, which is a command-line tool used to interact with Kubernetes clusters. Example 9-3 shows how to install it and confirm that it can talk to your Kubernetes cluster.

Example 9-3. Downloading and testing kubectl

hostname $ wget \
    https://storage.googleapis.com/kubernetes-release/release/v1.9.2/bin/linux/amd64/kubectl
hostname $ chmod +x kubectl
hostname $ ./kubectl get services
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   44s

Example 9-4 shows how to get an example Prometheus running on Minikube. prometheus-deployment.yml contains permissions so that your Prometheus can access resources such as pods and nodes in the cluster, a configMap is created to hold the Prometheus configuration file, a deployment to run Prometheus, and a service to

make it easier for you to access the Prometheus UI. The final command, ./minikube service, will provide you with a URL where you can access the Prometheus UI.

Example 9-4. Setting up permissions and running Prometheus on Kubernetes

hostname $ ./kubectl apply -f prometheus-deployment.yml
hostname $ ./minikube service prometheus --url
http://192.168.99.100:30114

The target status page should look like Figure 9-2. You can find prometheus-deployment.yml on GitHub.

Figure 9-2. Targets of the example Prometheus running on Kubernetes

This is a basic Kubernetes setup to demonstrate the core ideas behind monitoring on Kubernetes, and it is not intended for you to use directly in production; for example, all data is lost every time Prometheus restarts.

Service Discovery
There are currently five different types of Kubernetes service discoveries you can use with Prometheus, namely node, endpoints, service, pod, and ingress. Prometheus uses the Kubernetes API to discover targets.

Node
Node service discovery is used to discover the nodes comprising the Kubernetes cluster, and you will use it to monitor the infrastructure around Kubernetes. The Kubelet is the name of the agent that runs on each node, and you should scrape it as part of monitoring the health of the Kubernetes cluster (Example 9-5).

Example 9-5. prometheus.yml fragment to scrape the Kubelet

scrape_configs:
 - job_name: 'kubelet'
   kubernetes_sd_configs:
    - role: node
   scheme: https
   tls_config:
     ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     insecure_skip_verify: true

Example 9-5 shows the configuration being used by Prometheus to scrape the Kubelet. I'm going to break down the scrape config:

   job_name: 'kubelet'

A default job label is provided, and as there are no relabel_configs, kubelet will be the job label:3

   kubernetes_sd_configs:
    - role: node

A single Kubernetes service discovery is provided with the node role. The node role discovers one target for each of your Kubelets. As Prometheus is running inside the cluster, the defaults for the Kubernetes service discovery are already set up to authenticate with the Kubernetes API.

3 I don’t use node as the job label, as that’s typically used for the Node exporter.

   scheme: https
   tls_config:
     ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     insecure_skip_verify: true

The Kubelet serves its /metrics over HTTPS, so we must specify the scheme. Kubernetes clusters usually have their own certificate authority that is used to sign their TLS certs, and the ca_file provides that for the scrape. Unfortunately Minikube doesn't get this quite right, so insecure_skip_verify is required to bypass security checks.

The returned target points at the Kubelet, and authentication/authorisation is turned off in this Minikube setup, so no further configuration is needed.

The tls_config in the scrape config contains TLS settings for the scrape. kubernetes_sd_configs also has a tls_config that contains TLS settings for when service discovery talks to the Kubernetes API.

The metadata available includes node annotations and labels, as you can see in Figure 9-3. You could use this metadata with relabel_configs to add labels to distinguish interesting subsets of nodes, such as those with different hardware.
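For example, a hedged sketch that copies a Kubernetes node label into a target label; the hardware node label here is hypothetical:

relabel_configs:
 - source_labels: [__meta_kubernetes_node_label_hardware]   # hypothetical node label
   regex: '(.+)'
   target_label: hardware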

Figure 9-3. The Kubelet on the service discovery status page of Prometheus

The Kubelet’s own /metrics only contain metrics about the Kubelet itself, not container-level information. The Kubelet has an embedded cAdvisor on its /metrics/cadvisor endpoint. Scraping the embedded cAdvisor only requires adding

a metrics_path to the scrape config used with the Kubelet, as shown in Example 9-6. The embedded cAdvisor includes labels for the Kubernetes namespace and pod_name.

Example 9-6. prometheus.yml fragment to scrape the Kubelet's embedded cAdvisor

scrape_configs:
 - job_name: 'cadvisor'
   kubernetes_sd_configs:
    - role: node
   scheme: https
   tls_config:
     ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     insecure_skip_verify: true
   metrics_path: /metrics/cadvisor

Node service discovery can be used for anything you want to monitor that runs on each machine in a Kubernetes cluster. If the Node exporter was running on your Minikube node you could scrape it by relabelling the port, for example.
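As a hedged sketch of that idea (assuming the Node exporter is listening on its default port 9100 on each node), you could rewrite the discovered address:

scrape_configs:
 - job_name: node
   kubernetes_sd_configs:
    - role: node
   relabel_configs:
    - source_labels: [__address__]
      regex: '([^:]+)(:\d+)?'
      replacement: '${1}:9100'
      target_label: __address__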

Service
Node service discovery is useful for monitoring the infrastructure of and under Kubernetes, but not much use for monitoring your applications running on Kubernetes.

There are several ways that you can organise your applications on Kubernetes, and no single clear standard has emerged yet. But you are likely using services, which is how applications on Kubernetes find each other.

While there is a service role, it is not what you usually want. The service role returns a single target for each port4 of your services. Services are basically load balancers, and scraping targets through load balancers is not wise as Prometheus can scrape a different application instance each time. However, the service role can be useful for blackbox monitoring, to check if the service is responding at all.

Endpoints
Prometheus should be configured to have a target for each application instance, and the endpoints role provides just that. Services are backed by pods. Pods are a group of tightly coupled containers that share network and storage. For each Kubernetes service port, the endpoints service discovery role returns a target for each pod backing that service. Additionally, any other ports of the pods will be returned as targets.

4 A service can have multiple ports.

That's a bit of a mouthful, so let's look at an example. One of the services that is running in your Minikube is the kubernetes service, which is backed by the Kubernetes API servers. Example 9-7 is a scrape config that will discover and scrape the API servers.

Example 9-7. prometheus.yml fragment used to scrape the Kubernetes API servers

scrape_configs:
 - job_name: 'k8apiserver'
   kubernetes_sd_configs:
    - role: endpoints
   scheme: https
   tls_config:
     ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     insecure_skip_verify: true
   bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
   relabel_configs:
    - source_labels:
       - __meta_kubernetes_namespace
       - __meta_kubernetes_service_name
       - __meta_kubernetes_endpoint_port_name
      action: keep
      regex: default;kubernetes;https

Breaking down this scrape config:

   job_name: 'k8apiserver'

The job label is going to be k8apiserver, as there's no target relabelling to change it:

   kubernetes_sd_configs:
    - role: endpoints

There is a single Kubernetes service discovery using the endpoints role, which will return a target for every port of every pod backing each of your services:

   scheme: https
   tls_config:
     ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     insecure_skip_verify: true
   bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

As with the Kubelet, the API servers are served over HTTPS. In addition, authentication is required, which is provided by the bearer_token_file:

   relabel_configs:
    - source_labels:
       - __meta_kubernetes_namespace
       - __meta_kubernetes_service_name
       - __meta_kubernetes_endpoint_port_name
      action: keep
      regex: default;kubernetes;https

This relabel configuration will only return targets that are in the default namespace, and are part of a service called kubernetes with a port called https. You can see the resulting target in Figure 9-4. The API server is special, so there isn't much metadata. All the other potential targets were dropped.

Figure 9-4. The API server on the service discovery status page of Prometheus

While you will want to scrape the API servers, most of the time you’ll be focused on your applications. Example 9-8 shows how you can automatically scrape the pods for all of your services.

Example 9-8. prometheus.yml fragment to scrape pods backing all Kubernetes services, except the API servers

scrape_configs:
 - job_name: 'k8services'
   kubernetes_sd_configs:
    - role: endpoints
   relabel_configs:
    - source_labels:
       - __meta_kubernetes_namespace
       - __meta_kubernetes_service_name
      regex: default;kubernetes
      action: drop
    - source_labels:
       - __meta_kubernetes_namespace
      regex: default
      action: keep
    - source_labels: [__meta_kubernetes_service_name]
      target_label: job

Once again I'll break it down:

   job_name: 'k8services'
   kubernetes_sd_configs:
    - role: endpoints

As with the previous example, this is providing a job name and the Kubernetes endpoints role, but this does not end up as the job label due to later relabelling. There are no HTTPS settings, as I know the targets are all plain HTTP. There is no bearer_token_file, as no authentication is required, and sending a bearer token to all of your services could allow them to impersonate you:5

   relabel_configs:
    - source_labels:
       - __meta_kubernetes_namespace
       - __meta_kubernetes_service_name
      regex: default;kubernetes
      action: drop
    - source_labels:
       - __meta_kubernetes_namespace
      regex: default
      action: keep

I excluded the API server, as there is already another scrape config handling it. I also only looked at the default namespace, which is where I am launching applications:6

    - source_labels: [__meta_kubernetes_service_name]
      target_label: job

This relabel action takes the Kubernetes service name and uses it as the job label. The job_name I provided for the scrape config is only a default, and does not apply.

In this way you can have your Prometheus automatically pick up new services and start scraping them with a useful job label. In this case that's just Prometheus itself, as shown in Figure 9-5.

5 This is also the case with basic auth, but not for a challenge-response mechanism like TLS client certificate authentication.

6 And to not cause confusion with Example 9-10, as kube-dns is in the kube-system namespace.

Figure 9-5. Prometheus has automatically discovered itself using endpoint service discovery

You could go a step further and use relabelling to add additional labels from service or pod metadata, or even set __scheme__ or __metrics_path__ based on a Kubernetes annotation, as shown in Example 9-9. These would look for prometheus.io/scheme, prometheus.io/path, and prometheus.io/port service annotations,7 and use them if present.

Example 9-9. Relabelling using Kubernetes service annotations to optionally configure the scheme, path, and port of targets

relabel_configs:
 - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
   regex: (.+)
   target_label: __scheme__
 - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
   regex: (.+)
   target_label: __metrics_path__
 - source_labels:
    - __address__
    - __meta_kubernetes_service_annotation_prometheus_io_port
   regex: ([^:]+)(:\d+)?;(\d+)
   replacement: ${1}:${3}
   target_label: __address__

7 Forward slashes are not valid in label names, so they are sanitised to underscores.

This is limited to monitoring only one port per service. You could have another scrape config using the prometheus.io/port2 annotation, and so on for however many ports you need.

Pod
Discovering endpoints is great for monitoring the primary processes backing your services, but it won't discover pods that are not part of services.

The pod role discovers pods. It will return a target for each port of every one of your pods. As it works off pods, service metadata such as labels and annotations are not available, as pods do not know which services they are members of. But you will have access to all pod metadata. How you use this boils down to a question of what conventions you want to use.

The Kubernetes ecosystem is rapidly evolving, and there is no one standard yet. You could create a convention that all pods must be part of a service, and then use the endpoints role in service discovery. You could have a convention that all pods have a label indicating the (single) Kubernetes service they are a part of, and use the pod role for service discovery. As all ports have names, you could base a convention off that and have ports named with a prefix of prom-http be scraped with HTTP, and prom-https be scraped with HTTPS.

One of the components that comes with Minikube is kube-dns, which provides DNS services. Its pod has multiple ports, including a port named metrics that serves Prometheus metrics. Example 9-10 shows how you could discover this port and use the name of the container as the job label, as Figure 9-6 shows.

Example 9-10. prometheus.yml to discover all pod ports with the name metrics and to use the container name as the job label

scrape_configs:
 - job_name: 'k8pods'
   kubernetes_sd_configs:
    - role: pod
   relabel_configs:
    - source_labels: [__meta_kubernetes_pod_container_port_name]
      regex: metrics
      action: keep
    - source_labels: [__meta_kubernetes_pod_container_name]
      target_label: job

Figure 9-6. Two targets discovered using pod service discovery

Another approach to managing what to monitor is the Prometheus Operator, which uses the custom resource definition feature of Kubernetes. The Operator also manages running Prometheus and the Alertmanager for you.

Ingress
An ingress is a way for a Kubernetes service to be exposed outside the cluster. As it is a layer on top of services, similar to the service role, the ingress role is also basically a load balancer. If multiple pods backed the service and thus the ingress, this would cause you problems when scraping with Prometheus. Accordingly, you should only use this role for blackbox monitoring.

kube-state-metrics
Using Kubernetes service discovery you can have Prometheus scrape your applications and Kubernetes infrastructure, but this will not include metrics about what Kubernetes knows about your services, pods, deployments, and other resources. This is because applications such as the Kubelet and Kubernetes API servers should expose information about their own performance, not dump their internal data structures.8 Instead, you would obtain such metrics from another endpoint,9 or if that doesn't exist, have an exporter that extracts the relevant information. For Kubernetes, kube-state-metrics is that exporter.

To run kube-state-metrics you should follow the steps in Example 9-11 and then visit the /metrics on the returned URL in your browser. You can find kube-state-metrics.yml on GitHub.

Example 9-11. Running kube-state-metrics

hostname $ ./kubectl apply -f kube-state-metrics.yml
hostname $ ./minikube service kube-state-metrics --url
http://192.168.99.100:31774

Some useful metrics include kube_deployment_spec_replicas for the intended number of replicas in a deployment, kube_node_status_condition for node problems, and kube_pod_container_status_restarts_total for pod restarts.
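As a hedged sketch of how one of these metrics might be used in an alerting rule (the alert name, threshold, and durations are arbitrary assumptions):

groups:
 - name: kube-state-metrics
   rules:
    - alert: PodRestartingTooOften
      expr: rate(kube_pod_container_status_restarts_total[1h]) * 3600 > 5
      for: 10m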

This kube-state-metrics will be automatically scraped by Prometheus due to the scrape config in Example 9-8.

kube-state-metrics features several examples of “Enum” on page 88 and “Info” on page 90 metrics, such as kube_node_status_condition and kube_pod_info, respectively. Now that you have an idea about how to use Prometheus in container environments, let’s look at some of the common exporters you will run into.

8 Put another way, a database exporter does not dump the contents of the database as metrics.

9 Such as the Kubelet exposing cAdvisor's metrics on another endpoint.

CHAPTER 10 Common Exporters

You already saw the Node exporter in Chapter 7, and while it will likely be the first exporter you use, there are literally hundreds of other exporters you can use.

I'm not going to go through all of the ever-growing number of exporters out there; instead, I will show you some examples of the types of things you will come across when using exporters. This will prepare you to use exporters in your own environment. At the simplest, exporters Just Work, with no configuration required on your part, as you already saw for the Node exporter. Usually you will need to do minimal configuration to tell the exporter which application instance to scrape. At the far end, some exporters require extensive configuration as the data they are working with is very general.

You will generally have one exporter for every application instance that needs one. This is because the intended way to use Prometheus is for every application to have direct instrumentation and have Prometheus discover it and scrape it directly. When that isn't possible, exporters are used, and you want to keep to that architecture as much as possible. Having the exporter live right beside the application instance it is exporting from is easier to manage as you grow, and keeps your failure domains aligned. You will find that some exporters violate this guideline and offer the ability to scrape multiple instances, but you can still deploy them in the intended fashion and use the techniques shown in "metric_relabel_configs" on page 148 to remove any extraneous labels.

Consul
You already installed and ran Consul in "Consul" on page 132. Assuming it is still running, you can download and run the Consul exporter with the commands in

Example 10-1. Because Consul usually runs on port 8500, you don't need to do any extra configuration as the Consul exporter uses that port by default.

Example 10-1. Downloading and running the Consul exporter

hostname $ wget https://github.com/prometheus/consul_exporter/releases/download/v0.3.0/consul_exporter-0.3.0.linux-amd64.tar.gz
hostname $ tar -xzf consul_exporter-0.3.0.linux-amd64.tar.gz
hostname $ cd consul_exporter-0.3.0.linux-amd64/
hostname $ ./consul_exporter
INFO[0000] Starting consul_exporter (version=0.3.0, branch=master,
    revision=5f439584f4c4369186fec234d18eb071ec76fdde) source="consul_exporter.go:319"
INFO[0000] Build context (go=go1.7.5, user=root@4100b077eec6,
    date=20170302-02:05:48) source="consul_exporter.go:320"
INFO[0000] Listening on :9107 source="consul_exporter.go:339"

If you open http://localhost:9107/metrics in your browser you will see the metrics available. The first metric you should make note of here is consul_up. Some exporters will return a HTTP error to Prometheus when fetching data fails, which results in up being set to 0 in Prometheus. But many exporters will still be successfully scraped in this scenario and use a metric such as consul_up to indicate if there was a problem. Accordingly, when alerting on Consul being down, you should check both up and consul_up. If you stop Consul and then check the /metrics you will see the value changes to 0, and back to 1 again when Consul is started again.

consul_catalog_service_node_healthy tells you about the health of the various services in the Consul node, similar to how kube-state-metrics (discussed in "kube-state-metrics" on page 168) tells you about the health of nodes and containers but across an entire Kubernetes cluster.

consul_serf_lan_members is the number of Consul agents in the cluster. You may wonder if this could come just from the leader of the Consul cluster, but remember that each agent might have a different view of how many members the cluster has if there is an issue such as a network partition. In general, you should expose metrics like this from every member of a cluster, and synthesise the value you want using aggregation in PromQL.

There are also metrics about your Consul exporter. consul_exporter_build_info is its build information, and there are a variety of process_ and go_ metrics about the process and the Go runtime. These are useful for debugging issues with the Consul exporter itself.
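As a hedged sketch of alerting on both up and consul_up (assuming the job label consul used in Example 10-2; the alert name and wait duration are arbitrary):

groups:
 - name: consul
   rules:
    - alert: ConsulDown
      expr: up{job="consul"} == 0 or consul_up{job="consul"} == 0
      for: 5m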

You can configure Prometheus to scrape the Consul exporter as shown in Example 10-2. Even though the scrape is going via an exporter, I used the job label of consul, as it is really Consul I am scraping.

Exporters can be considered as a form of proxy. They take in a scrape request from Prometheus, fetch metrics from a process, munge them into a format that Prometheus can understand, and respond with them to Prometheus.

Example 10-2. prometheus.yml to scrape a local Consul exporter

global:
  scrape_interval: 10s
scrape_configs:
 - job_name: consul
   static_configs:
    - targets:
       - localhost:9107

HAProxy
The HAProxy exporter is a typical exporter. To demonstrate it you will first need to create a configuration file like the one shown in Example 10-3 called haproxy.cfg.

Example 10-3. haproxy.cfg proxying the Node exporter on port 1234, with port 1235 as a status port

defaults
    mode http
    timeout server 5s
    timeout connect 5s
    timeout client 5s

frontend frontend
    bind *:1234
    use_backend backend

backend backend
    server node_exporter 127.0.0.1:9100

frontend monitoring
    bind *:1235
    no log
    stats uri /
    stats enable

You can then run HAProxy:

docker run -v $PWD:/usr/local/etc/haproxy:ro haproxy:1.7

This configuration will proxy http://localhost:1234 to your Node exporter (if it is still running). The important part of this configuration is:

frontend monitoring
    bind *:1235
    no log
    stats uri /
    stats enable

This opens a HAProxy frontend on http://localhost:1235 with statistics reporting, and in particular, the comma-separated value (CSV) output on http://localhost:1235/;csv, which the HAProxy exporter will use.

Next, you should download and run the HAProxy exporter as shown in Example 10-4.

Example 10-4. Downloading and running the HAProxy exporter

hostname $ wget https://github.com/prometheus/haproxy_exporter/releases/download/v0.9.0/haproxy_exporter-0.9.0.linux-amd64.tar.gz
hostname $ tar -xzf haproxy_exporter-0.9.0.linux-amd64.tar.gz
hostname $ cd haproxy_exporter-0.9.0.linux-amd64/
hostname $ ./haproxy_exporter --haproxy.scrape-uri 'http://localhost:1235/;csv'
INFO[0000] Starting haproxy_exporter (version=0.9.0, branch=HEAD,
    revision=865ad4922f9ab35372b8a6d02ab6ef96805560fe) source=haproxy_exporter.go:495
INFO[0000] Build context (go=go1.9.2, user=root@4b9b5f43f4a2,
    date=20180123-18:27:27) source=haproxy_exporter.go:496
INFO[0000] Listening on :9101 source=haproxy_exporter.go:521

If you go to http://localhost:9101/metrics you will see the metrics being produced. Similar to the Consul exporter's consul_up, there is a haproxy_up metric, indicating if talking to HAProxy succeeded.

HAProxy has frontends, backends, and servers. They each have metrics with the prefixes haproxy_frontend_, haproxy_backend_, and haproxy_server_, respectively. For example, haproxy_server_bytes_out_total is a counter1 with the number of bytes each server has returned. Because a backend can have many servers, you may find the cardinality of the haproxy_server_ metrics causes issues. The --haproxy.server-metric-fields command-line flag allows you to limit which metrics are returned.

1 In this version of the HAProxy exporter, the TYPE is incorrectly gauge. This sort of mismatch of types is not uncommon in exporters; it should be using counter or untyped.

This is an example of where a metric could have high cardinality and the exporter offers you a way to optionally tune it down. Other exporters, such as the MySQLd exporter, take the opposite approach, where most metrics are disabled by default.

Exporter Default Ports
You may have noticed that Prometheus, the Node exporter, Alertmanager, and other exporters in this chapter have similar port numbers. Back when there were only a handful of exporters, many had the same default port number. Both the Node and HAProxy exporters used port 8080 by default, for example. This was annoying when trying out or deploying Prometheus, so a wiki page was started at https://github.com/prometheus/prometheus/wiki/Default-port-allocations to keep the official exporters on different ports. This organically grew to being a comprehensive list of exporters, and aside from some users skipping over numbers, it now serves a purpose beyond its initial one.

Prior to 1.8.0, HAProxy could only use one CPU core, so to use multiple cores on older versions you have to run multiple HAProxy processes. Accordingly, you must also run one HAProxy exporter per HAProxy process.

The HAProxy exporter has one other notable feature that thus far few other exporters have implemented. In Unix it is common for daemons to have a pid file containing their process ID, which is used by the init system to control the process. If you have such a file and pass it in the --haproxy.pid-file command-line flag, then the HAProxy exporter will include haproxy_process_ metrics about the HAProxy process, even if scraping HAProxy itself fails. These metrics are the same as the process_ metrics, except they refer to the HAProxy rather than the HAProxy exporter.

You can configure the HAProxy exporter to be scraped by Prometheus in the same way as any other exporter, as you can see in Example 10-5.

Example 10-5. prometheus.yml to scrape a local HAProxy exporter

global:
  scrape_interval: 10s
scrape_configs:
 - job_name: haproxy
   static_configs:
    - targets:
       - localhost:9101

Grok Exporter
Not all applications produce metrics in a form that can be converted into something that Prometheus understands using an exporter. But such applications may produce logs, and the Grok exporter can be used to convert those into metrics.2

Grok is a way to parse unstructured logs that is commonly used with Logstash.3 The Grok exporter reuses the same pattern language, allowing you to reuse patterns that you already have.

Say that you had a simple log that looks like:

GET /foo 1.23
GET /bar 3.2
POST /foo 4.6

which was in a file called example.log. You could convert these logs into metrics by using the Grok exporter.

First, download the 0.2.3 Grok exporter Linux amd64 release and unzip it. Next, create a file called grok.yml with the content in Example 10-6.

Example 10-6. grok.yml to parse a simple log file and produce metrics

global:
    config_version: 2
input:
    type: file
    path: example.log
    readall: true  # Use false in production
grok:
    additional_patterns:
      - 'METHOD [A-Z]+'
      - 'PATH [^ ]+'
      - 'NUMBER [0-9.]+'
metrics:
  - type: counter
    name: log_http_requests_total
    help: HTTP requests
    match: '%{METHOD} %{PATH:path} %{NUMBER:latency}'
    labels:
        path: '{{.path}}'
  - type: histogram
    name: log_http_request_latency_seconds_total
    help: HTTP request latency
    match: '%{METHOD} %{PATH:path} %{NUMBER:latency}'
    value: '{{.latency}}'

2 There is also https://github.com/google/mtail in this space.

3 The L in the ELK stack.

server:
    port: 9144

Finally, run the Grok exporter:

./grok_exporter -config grok.yml

I'll break this down. First there is some boilerplate:

global:
    config_version: 2

Next, you need to define the file to be read. Here I'm using readall: true so you will see the same results as in this example. In production you would leave it to the default of false so that the file is tailed:

input:
    type: file
    path: example.log
    readall: true  # Use false in production

Grok works with patterns based on regular expressions. I've defined all of my patterns here manually so you can better understand what's going on, but you can also reuse ones you already have:

grok:
    additional_patterns:
      - 'METHOD [A-Z]+'
      - 'PATH [^ ]+'
      - 'NUMBER [0-9.]+'

I have two metrics. The first is a counter called log_http_requests_total, which has a label path:

metrics:
  - type: counter
    name: log_http_requests_total
    help: HTTP requests
    match: '%{METHOD} %{PATH:path} %{NUMBER:latency}'
    labels:
        path: '{{.path}}'

My second is a histogram called log_http_request_latency_seconds_total, which is observing the latency value, and has no labels:

  - type: histogram
    name: log_http_request_latency_seconds_total
    help: HTTP request latency
    match: '%{METHOD} %{PATH:path} %{NUMBER:latency}'
    value: '{{.latency}}'

Finally, I define where I want the exporter to expose its metrics:

server:
    port: 9144

When you visit http://localhost:9144, amongst its output you will find the following metrics:

# HELP log_http_request_latency_seconds_total HTTP request latency
# TYPE log_http_request_latency_seconds_total histogram
log_http_request_latency_seconds_total_bucket{le="0.005"} 0
log_http_request_latency_seconds_total_bucket{le="0.01"} 0
log_http_request_latency_seconds_total_bucket{le="0.025"} 0
log_http_request_latency_seconds_total_bucket{le="0.05"} 0
log_http_request_latency_seconds_total_bucket{le="0.1"} 1
log_http_request_latency_seconds_total_bucket{le="0.25"} 2
log_http_request_latency_seconds_total_bucket{le="0.5"} 3
log_http_request_latency_seconds_total_bucket{le="1"} 3
log_http_request_latency_seconds_total_bucket{le="2.5"} 3
log_http_request_latency_seconds_total_bucket{le="5"} 3
log_http_request_latency_seconds_total_bucket{le="10"} 3
log_http_request_latency_seconds_total_bucket{le="+Inf"} 3
log_http_request_latency_seconds_total_sum 0.57
log_http_request_latency_seconds_total_count 3
# HELP log_http_requests_total HTTP requests
# TYPE log_http_requests_total counter
log_http_requests_total{path="/bar"} 1
log_http_requests_total{path="/foo"} 2

As you can see, the Grok exporter is more involved to configure than your typical exporter; it's closer to direct instrumentation in terms of effort, as you must individually define each metric you want to expose. You would generally run one per application instance that needs to be monitored, and scrape it with Prometheus in the usual way, as shown in Example 10-7.

Example 10-7. prometheus.yml to scrape a local Grok exporter

global:
  scrape_interval: 10s
scrape_configs:
  - job_name: grok
    static_configs:
      - targets:
          - localhost:9144

Blackbox

While the recommended way to deploy exporters is to run one right beside each application instance, there are cases where this is not possible for technical reasons.4 This is usually the case with blackbox monitoring—monitoring the system from the outside with no special knowledge of the internals. I like to think of blackbox monitoring as similar to smoke tests when unit testing; their purpose is primarily to quickly tell you when things have gone hilariously wrong.

If you are monitoring whether a web service works from the standpoint of a user, you usually want to monitor that through the same load balancers and virtual IP (VIP) addresses the user is hitting. You can’t exactly run an exporter on a VIP as it is, well, virtual. A different architecture is needed.

In Prometheus there is a class of exporters usually referred to as Blackbox-style or SNMP-style, after the two primary examples of exporters that cannot run beside an application instance. The Blackbox exporter by necessity usually needs to run somewhere else on the network, and there is no application instance to run on. For the SNMP5 exporter, it’s rare for you to be able to run your own code on a network device—and if you could, you would use the Node exporter instead.

So how are Blackbox-style or SNMP-style exporters different? Instead of you configuring them to talk to only one target, they take in the target as a URL parameter. Any other configuration is provided by you on the exporter side as usual. This keeps the responsibilities of service discovery and scrape scheduling with Prometheus, and the responsibility of translating metrics into a form understandable by Prometheus with your exporter.

The Blackbox exporter allows you to perform ICMP, TCP, HTTP, and DNS probing. I’ll show you each in turn, but first you should get the Blackbox exporter running as shown in Example 10-8.

Example 10-8. Downloading and running the Blackbox exporter

hostname $ wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.12.0/blackbox_exporter-0.12.0.linux-amd64.tar.gz
hostname $ tar -xzf blackbox_exporter-0.12.0.linux-amd64.tar.gz
hostname $ cd blackbox_exporter-0.12.0.linux-amd64/
hostname $ sudo ./blackbox_exporter
level=info ... msg="Starting blackbox_exporter" version="(version=0.12.0, branch=HEAD, revision=4a22506cf0cf139d9b2f9cde099f0012d9fcabde)"
level=info ... msg="Loaded config file"
level=info ... msg="Listening on address" address=:9115

4 As distinct from cases where it’s not possible for political reasons.

5 Simple Network Management Protocol, a standard for (among other things) exposing metrics on network devices. It can also sometimes be found on other hardware.


If you visit http://localhost:9115/ in your browser you should see a status page like the one in Figure 10-1.

Figure 10-1. The Blackbox exporter’s status page

ICMP

The Internet Control Message Protocol is a part of the Internet Protocol (IP). In the context of the Blackbox exporter it is the echo reply and echo request messages that are of interest to you, more commonly known as ping.6

ICMP uses raw sockets so it requires more privileges than a typical exporter, which is why Example 10-8 uses sudo. On Linux you could instead give the Blackbox exporter the CAP_NET_RAW capability.
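A minimal sketch of doing that, assuming the setcap utility from libcap is available and you are in the directory containing the binary:

sudo setcap cap_net_raw+ep ./blackbox_exporter
./blackbox_exporter   # no sudo needed once the capability is set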

To start you should ask the Blackbox exporter to ping localhost by visiting http://localhost:9115/probe?module=icmp&target=localhost in your browser, which should produce something like:

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.000164439
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.000670403
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

6 Some pings can also work via UDP or TCP instead, but those are relatively rare.

The key metric here is probe_success, which is 1 if your probe succeeded and 0 otherwise. This is similar to consul_up, and you should check that neither up nor probe_success is 0 when alerting. There is an example of this in “for” on page 294.
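As a minimal sketch of what such an alerting rule could look like (the group and alert names here are made up, and the job label assumes the blackbox scrape config shown later in Example 10-9):

groups:
  - name: blackbox
    rules:
      - alert: ProbeFailing
        expr: 'up{job="blackbox"} == 0 or probe_success{job="blackbox"} == 0'
        for: 10m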

The /metrics of the Blackbox exporter provides metrics about the Blackbox exporter itself, such as how much CPU it has used. To perform blackbox probes, you use /probe.

There are also other useful metrics that all types of probes produce. probe_ip_protocol indicates the IP protocol used, IPv4 in this case, and probe_duration_seconds is how long the entire probe took, including DNS resolution.

The name resolution used by Prometheus and the Blackbox exporter is DNS resolution, not the system resolver’s gethostbyname call. Other potential sources of name resolution, such as /etc/hosts and nsswitch.conf, are not considered by the Blackbox exporter. This can lead to the ping command working but the Blackbox exporter failing, because it cannot resolve its target via DNS.
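A quick way to check for this, assuming the getent and dig utilities are installed (the hostname here is a placeholder): getent uses the system resolver, including /etc/hosts, while dig only asks DNS, so if the first command resolves a name that the second does not, the Blackbox exporter will not be able to resolve it either.

getent hosts some-target.example.com
dig some-target.example.com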

If you look inside blackbox.yml you will find the icmp module:

icmp:
  prober: icmp

This says that there is a module called icmp, which you requested with the ?module=icmp in the URL. This module uses the icmp prober, with no additional options specified. ICMP is quite simple, so only in niche use cases might you need to specify dont_fragment or payload_size.

You can also try other targets. For example, to probe google.com you can visit http://localhost:9115/probe?module=icmp&target=www.google.com in your browser.

For the icmp probe, the target URL parameter is an IP address or hostname.

You may find that this probe fails, with output like:

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.001169908
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.001397181
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 6
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0

probe_success is 0 here, indicating the failure. Notice that probe_ip_protocol is also 6, indicating IPv6. In this case the machine I am using doesn’t have a working IPv6 setup. Why is the Blackbox exporter using IPv6? When resolving the target, the Blackbox exporter will prefer a returned IPv6 address if there is one; otherwise, it will use an IPv4 address. google.com has both, so IPv6 is chosen and fails on my machine.

You can see this in more detail if you add &debug=true on to the end of the URL, giving http://localhost:9115/probe?module=icmp&target=www.google.com&debug=true, which will produce output like:

Logs for the probe:
... module=icmp target=www.google.com level=info msg="Beginning probe" probe=icmp timeout_seconds=9.5
... module=icmp target=www.google.com level=info msg="Resolving target address" preferred_ip_protocol=ip6
... module=icmp target=www.google.com level=info msg="Resolved target address" ip=2a00:1450:400b:c03::63
... module=icmp target=www.google.com level=info msg="Creating socket"
... module=icmp target=www.google.com level=info msg="Creating ICMP packet" seq=10 id=3483
... module=icmp target=www.google.com level=info msg="Writing out packet"
... module=icmp target=www.google.com level=warn msg="Error writing to socket" err="write ip6 ::->2a00:1450:400b:c03::63: sendto: cannot assign requested address"
... module=icmp target=www.google.com level=error msg="Probe failed" duration_seconds=0.008982345

Metrics that would have been returned:

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.008717006
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.008982345
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 6
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0

Module configuration:
prober: icmp

The debug output is extensive, and by carefully reading through it you can understand exactly what the probe is doing. The error you see here is from the sendto syscall, which cannot assign an IPv6 address. To prefer IPv4 instead, you can add a new module with the preferred_ip_protocol: ip4 option to blackbox.yml:

icmp_ipv4:
  prober: icmp
  icmp:
    preferred_ip_protocol: ip4

After restarting the Blackbox exporter,7 if you use this module via http://localhost:9115/probe?module=icmp_ipv4&target=www.google.com, it will now work via IPv4.

7 Similar to Prometheus, you can also send a SIGHUP to the Blackbox exporter to have it reload its configuration.

TCP

The Transmission Control Protocol is the TCP in TCP/IP. Many standard protocols use it, including websites (HTTP), email (SMTP), remote login (Telnet and SSH), and chat (IRC). The tcp probe of the Blackbox exporter allows you to check TCP services, and perform simple conversations for those that use line-based text protocols.

To start you can check if your local SSH server is listening on port 22 with http://localhost:9115/probe?module=tcp_connect&target=localhost:22:

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.000202381
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.000881654
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

This is quite similar to the metrics produced by the ICMP probe, and you can see that this probe succeeded as probe_success is 1. The definition of the tcp_connect module in blackbox.yml is:

tcp_connect:
  prober: tcp

This will try to connect to your target, and once it is connected, it will immediately close the connection. The ssh_banner module goes further, checking for a particular response from the remote server:

ssh_banner:
  prober: tcp
  tcp:
    query_response:
      - expect: "^SSH-2.0-"

As the very start of an SSH session is in plain text, you can check for this part of the protocol with the tcp probe. This is better than tcp_connect as you are not only checking that the TCP port is open, but that the server on the other end is responding with an SSH banner. If your server returned something different, the expect regex will not match, and probe_success will be 0. In addition, probe_failed_due_to_regex would be 1. Since Prometheus is a metrics-based system, the full debug output cannot be saved, as that would be event logging.8 However, the Blackbox exporter can provide a small number of metrics to help you to piece together what went wrong after the fact.
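The query_response list can also send data, which lets you sketch slightly richer conversations for other line-based protocols. For example, something along these lines could check an SMTP server’s greeting and then disconnect cleanly (the module name is made up, and you should check the expect/send options against the documentation for your Blackbox exporter version):

smtp_banner:
  prober: tcp
  tcp:
    query_response:
      - expect: "^220 "
      - send: "QUIT"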

If you find that every service needs a different module, consider standardising what your health checks look like across services. If a service exposes a /metrics, then there is not much need for basic connectivity checks with the Blackbox exporter, as Prometheus’s scrapes will already provide that.

The tcp probe can also connect via TLS. Add a tcp_connect_tls to your blackbox.yml file with the following configuration:

8 However, the debug information for the most recent probes is available from the Blackbox exporter’s status page.

tcp_connect_tls:
  prober: tcp
  tcp:
    tls: true

After restarting the Blackbox exporter, if you now visit http://localhost:9115/probe?module=tcp_connect_tls&target=www.robustperception.io:443 you can check if my company website is contactable with HTTPS.9

For the tcp prober, the target URL parameter is an IP address or hostname, followed by a colon, and then the port number.

You may notice among the metrics output:

# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry date
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.522039491e+09

probe_ssl_earliest_cert_expiry10 is produced as a side effect of probing, indicating when your TLS/SSL certificate11 will expire. You can use this to catch expiring certificates before they become outages. While HTTP is a line-oriented text protocol12 that you could check with the tcp probe, there is an http probe that is more suitable for this purpose.

HTTP

The HyperText Transfer Protocol is the basis for the modern web, and likely what most of the services you provide use. While most monitoring of web applications is best done by Prometheus scraping metrics over HTTP, sometimes you will want to perform blackbox monitoring of your HTTP services.

The http prober takes a URL13 for the target URL parameter. If you visit http://localhost:9115/probe?module=http_2xx&target=https://www.robustperception.io you can check my company’s website over HTTPS using the http_2xx module,14 producing output similar to:

9 443 is the standard port for HTTPS.

10 This metric will likely be renamed at some point as it should have a unit of seconds.

11 More exactly, the first certificate that will expire in your certificate chain.

12 At least for HTTP versions prior to 2.0.

13 You can even include URL parameters, if they are appropriately encoded.

14 The http_2xx module is incidentally the default module name if you don’t provide one as a URL parameter.

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.00169128
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.191706498
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length -1
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.018464759
probe_http_duration_seconds{phase="processing"} 0.132312499
probe_http_duration_seconds{phase="resolve"} 0.00169128
probe_http_duration_seconds{phase="tls"} 0.057145526
probe_http_duration_seconds{phase="transfer"} 6.0805e-05
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 1
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.522039491e+09
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

You can see probe_success, but also a number of other useful metrics for debugging, such as the status code, HTTP version, and timings for different phases of the request.

The http probe has many options to both affect how the request is made, and whether the response is considered successful. You can specify HTTP authentication, headers, POST body, and then in the response check that the status code, HTTP version, and body are acceptable.
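As a rough sketch of the request side, something along these lines could send an authenticated POST (the module name, credentials, and body are invented for illustration, and option names can differ between Blackbox exporter versions, so check the documentation for yours):

http_post_auth:
  prober: http
  http:
    method: POST
    headers:
      Content-Type: application/json
    body: '{"ping": true}'
    basic_auth:
      username: myuser
      password: mypassword

Checking the response works in the same way, as the next example shows.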

For example, I may want to test that users of http://www.robustperception.io end up redirected to an HTTPS website, with a 200 status code, and that the word “Prometheus” is in the body. To do so you could create a module like:

http_200_ssl_prometheus:
  prober: http
  http:
    valid_status_codes: [200]
    fail_if_not_ssl: true
    fail_if_not_matches_regexp:
      - Prometheus

Visiting http://localhost:9115/probe?module=http_200_ssl_prometheus&target=http://www.robustperception.io in your browser, you should see that this works as probe_success is 1. You could also use the same request against http://prometheus.io if you visit http://localhost:9115/probe?module=http_200_ssl_prometheus&target=http://prometheus.io in your browser.15

While the Blackbox exporter will follow HTTP redirects,16 not all features work perfectly across redirects.

This example is a little contrived, but each module of the Blackbox exporter is a specific test that you can run against different targets by providing different target URL parameters as you did here with http://www.robustperception.io and http://prometheus.io. For example, you might check that each frontend application instance serving your website is returning the right result. If different services need different tests, then you can create modules for each of them. It is not possible to override modules via URL parameters, as that would lead to the Blackbox exporter being an open proxy17 and would confuse the division of responsibilities between Prometheus and exporters.

The http probe is the most configurable of the Blackbox exporter’s probes (the documentation lists all of the options). While flexible, the Blackbox exporter cannot handle all possible use cases, as it is a relatively simple HTTP probe at the end of the day. If you need something more sophisticated, you may need to write your own exporter, or take advantage of existing exporters such as the WebDriver exporter, which simulates a browser visiting a URL.

15 Presuming you have a working IPv6 setup; if not, add preferred_ip_protocol: ip4.

16 Unless no_follow_redirects is configured.

17 Which would be unwise from a security standpoint.

DNS

The dns probe is primarily for testing DNS servers. For example, checking that all of your DNS replicas are returning results.

If you wanted to test that your DNS servers were responding over TCP,18 you could create a module in your blackbox.yml like this:

dns_tcp:
  prober: dns
  dns:
    transport_protocol: "tcp"
    query_name: "www.prometheus.io"

After restarting the Blackbox exporter, you can visit http://localhost:9115/probe?module=dns_tcp&target=8.8.8.8 to check if Google’s Public DNS service19 works via TCP. Note that the target URL parameter is the DNS server that is talked to, and the query_name is the DNS request sent to that DNS server. This is the same as if you ran the command dig +tcp @8.8.8.8 www.prometheus.io.

For the dns prober, the target URL parameter is an IP address or hostname, followed by a colon, and then the port number. You can also provide just the IP address or hostname, in which case the standard DNS port of 53 will be used.

Aside from testing DNS servers, you could also use a dns probe to confirm that specific results are being returned by DNS resolution. But usually you want to go further and communicate with the returned service via HTTP, TCP, or ICMP, in which case one of those probes makes more sense as you get the DNS check for free.

An example of using the dns probe to check for specific results would be to check that your MX records20 have not disappeared. You could create a module in your blackbox.yml like this:

dns_mx_present_rp_io:
  prober: dns
  dns:
    query_name: "robustperception.io"
    query_type: "MX"
    validate_answer_rrs:
      fail_if_not_matches_regexp:
        - ".+"

18 While DNS usually uses UDP, it can also use TCP in cases such as for large responses. Unfortunately, many site operators are not aware of this and block TCP on port 53, which is the DNS port.

19 Which is offered on the IPs 8.8.8.8, 8.8.4.4, 2001:4860:4860::8888, and 2001:4860:4860::8844.

20 Used for email, MX stands for Mail eXchanger.

After restarting the Blackbox exporter, you can visit http://localhost:9115/probe?module=dns_mx_present_rp_io&target=8.8.8.8 to check that robustperception.io has MX records. Note that as the query_name is specified per module, you will need a module for every domain that you want to check. I am using 8.8.8.8 here as Google’s Public DNS is a public DNS resolver, but you could also use a local resolver.

The dns probe has more features intended to help check for aspects of DNS responses, such as authority and additional records, which you can find out more about in the documentation. For a better understanding of DNS I recommend RFCs 1034 and 1035,21 or a book such as DNS and BIND by Paul Albitz and Cricket Liu (O’Reilly).

Prometheus Configuration

As you have seen, the Blackbox exporter takes a module and target URL parameter on the /probe endpoint. Using the params and metrics_path as discussed in “How to Scrape” on page 146, you can provide these in a scrape config, but that would mean having a scrape config per target, which would be unwieldy as you could not take advantage of Prometheus’s ability to do service discovery.

The good news is that you can take advantage of service discovery, as the __param_<name> labels can be used to provide URL parameters in relabelling. In addition, the instance and __address__ labels are distinct as discussed in “job, instance, and __address__” on page 143, so you can have Prometheus talk to the Blackbox exporter while having an instance label of your actual target. Example 10-9 shows an example of this in practice.

Example 10-9. prometheus.yml to check if several websites work

scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://www.prometheus.io
          - http://www.robustperception.io
          - http://demo.robustperception.io
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115

21 I learned DNS from these RFCs; they’re a little outdated but still give a good sense of how DNS operates.


To break it down:

- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]

A default job label, custom path, and one URL parameter are specified:

static_configs:
  - targets:
      - http://www.prometheus.io
      - http://www.robustperception.io
      - http://demo.robustperception.io

There are three websites that you will be probing:

relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115

The relabel_configs is where the magic happens. First, the __address__ label becomes the target URL parameter, and second it also becomes the instance label. At this point the instance label and target URL parameter have the value you want, but the __address__ is still a URL rather than the Blackbox exporter. The final relabelling action sets the __address__ to the host and port of the local Blackbox exporter.

If you run Prometheus with this configuration and look at the Target status page you will see something like Figure 10-2. The endpoint has the desired URL parameters, and the instance label is the URL.

Figure 10-2. The Blackbox exporter’s status page

That the State is UP for the Blackbox exporter does not mean that the probe was successful, merely that the Blackbox exporter was scraped successfully.22 You need to check that probe_success is 1.

This approach is not limited to static_configs. You can use any other service discovery mechanism (as discussed in Chapter 8). For example, building on Example 8-17, which scraped the Node exporter for all nodes registered in Consul, Example 10-10 will check that SSH is responding for all nodes registered in Consul.

22 Indeed, in Figure 10-2 the probe of http://www.prometheus.io is failing, as my machine has a broken IPv6 setup.

Example 10-10. Checking SSH on all nodes registered in Consul

scrape_configs:
  - job_name: node
    metrics_path: /probe
    params:
      module: [ssh_banner]
    consul_sd_configs:
      - server: 'localhost:8500'
    relabel_configs:
      - source_labels: [__meta_consul_address]
        regex: '(.*)'
        replacement: '${1}:22'
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115

The power of this approach is that you can reuse service discovery not just for scraping /metrics, but also for blackbox monitoring of your applications.

Blackbox Timeouts

You may be wondering how to configure timeouts for your probes. The good news is that the Blackbox exporter determines the timeout automatically based on the scrape_timeout in Prometheus.

Prometheus sends an HTTP header called X-Prometheus-Scrape-Timeout-Seconds with every scrape. The Blackbox exporter uses this for its timeouts, less a buffer.23 The end result is that the Blackbox exporter will usually return with some metrics that are useful for debugging when the target is slow, rather than the scrape as a whole failing. You can reduce the timeout further using the timeout field in blackbox.yml.
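When testing by hand you can emulate this by sending the header yourself; a rough sketch with curl (the 5-second value is arbitrary):

curl -H 'X-Prometheus-Scrape-Timeout-Seconds: 5' \
  'http://localhost:9115/probe?module=icmp&target=localhost'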

Now that you have an idea of the sorts of exporters you will run into, you’re ready to learn how to pull metrics from your existing monitoring systems.

23 Specified by the --timeout-offset command-line flag.

CHAPTER 11
Working with Other Monitoring Systems

In an ideal world all of your applications would be directly exposing Prometheus metrics, but this is unlikely to be the world you inhabit. You may have other moni‐ toring systems already in use, and doing a big switchover one day to Prometheus is not practical. The good news is that amongst the hundreds of exporters for Prometheus there are several that convert data from other monitoring systems into the Prometheus format. While your ideal end goal would be to move completely to Prometheus, exporters like the ones you’ll learn about in this chapter are very helpful when you are still transitioning. Other Monitoring Systems Monitoring systems vary in how compatible they are with Prometheus; some require notable effort while others require close to none. For example, InfluxDB has a data model fairly similar to Prometheus, so you can have your application push the InfluxDB line protocol to the InfluxDB exporter, which can then be scraped by Prometheus. Other systems like collectd do not have labels, but it is possible to automatically con‐ vert the metrics it outputs into an okay Prometheus metric with no additional config‐ uration using the collectd exporter. As of version 5.7, collectd even includes this natively with the Write Prometheus plug-in. But not all monitoring systems have data models that can be automatically converted into reasonable Prometheus metrics. Graphite does not traditionally support key-

191 value labels,1 and some configuration labels can be extracted from the dotted strings it uses using the Graphite exporter. StatsD has basically the same data model as Graphite; StatsD uses events rather than metrics so the StatsD exporter aggregates the events into metrics, and can also extract labels. In the Java/JVM space, JMX (Java Management eXtensions) is a standard often used for exposing metrics, but how it is used varies quite a bit from application to applica‐ tion. The JMX exporter has okay defaults, but given the lack of standardisation of the mBean structure, the only sane way to configure it is via regular expressions. The good news is that there are a variety of example configurations provided, and that the JMX exporter is intended to run as a Java agent so you don’t have to manage a sepa‐ rate exporter process. SNMP actually has a data model that is quite close to Prometheus’s, and by using MIBs,2 SNMP metrics can be automatically produced by the SNMP exporter. The bad news is twofold. First, MIBs from vendors are often not freely available, so you need to acquire the MIBs yourself and use the generator included with the SNMP exporter to convert the MIBs into a form the SNMP exporter can understand. Second, many vendors follow the letter of the SNMP specification but not the spirit so additional configuration and/or munging with PromQL is sometimes required. The SNMP exporter is a Blackbox/SNMP-style exporter as was discussed in “Blackbox” on page 177, so unlike almost all other exporters, you typically run one per Prometheus rather than one per application instance.

SNMP is a very chatty network protocol. It is advisable to have SNMP exporters as close as you can on the network to the network devices they are monitoring to mitigate this. Furthermore, many SNMP devices can speak the SNMP protocol but not return met‐ rics in anything resembling a reasonable timeframe. You may need to be judicious in what metrics you request and generous in your scrape_interval.
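As a rough sketch of what that can look like in a scrape config, reusing the Blackbox-style relabelling from Example 10-9 (the module name, device names, and the SNMP exporter’s default port of 9116 and /snmp path are assumptions to check against your setup):

scrape_configs:
  - job_name: snmp
    scrape_interval: 60s
    scrape_timeout: 50s
    metrics_path: /snmp
    params:
      module: [if_mib]
    static_configs:
      - targets:
          - switch1.example.com
          - switch2.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116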

There are also exporters you can use to extract metrics from a variety of Software as a Service (SaaS) monitoring systems, including the CloudWatch exporter, NewRelic exporter, and Pingdom exporter. One thing to watch with such exporters is that there may be rate limits and financial costs for using the APIs they access. The NRPE exporter is an SNMP/Blackbox-style exporter that allows you to run NRPE checks. NRPE stands for Nagios Remote Program Execution, a way to run Nagios checks on remote machines. While many existing checks in a Nagios-style

1 Though version 1.1.0 recently added tags, which are key-value labels.

2 Management Information Base, basically a schema for SNMP objects.

192 | Chapter 11: Working with Other Monitoring Systems monitoring setup can be replaced by metrics from the Node and other exporters, you may have some custom checks that are a bit harder to migrate. The NRPE exporter gives you a transition option here, allowing you to later convert these checks to another solution such as the textfile collector as discussed in “Textfile Collector” on page 122. Integration with other monitoring systems isn’t limited to running separate export‐ ers; there are also integrations with popular instrumentation systems such as Drop‐ wizard metrics.3 The Java client has an integration that can pull metrics from Dropwizard metrics using its reporting feature that will then appear alongside any direct instrumentation you have on /metrics.

Dropwizard can also expose its metrics via JMX. If possible (i.e., you control the codebase) you should prefer using the Java client’s Dropwizard integration over JMX, since going via JMX has higher overhead and requires more configuration.

InfluxDB

The InfluxDB exporter accepts the InfluxDB line protocol that was added in version 0.9.0 of InfluxDB. The protocol works over HTTP, so the same TCP port can be used both to accept writes and serve /metrics. To run the InfluxDB exporter you should follow the steps in Example 11-1.

Example 11-1. Downloading and running the InfluxDB exporter

hostname $ wget https://github.com/prometheus/influxdb_exporter/releases/download/v0.1.0/influxdb_exporter-0.1.0.linux-amd64.tar.gz
hostname $ tar -xzf influxdb_exporter-0.1.0.linux-amd64.tar.gz
hostname $ cd influxdb_exporter-0.1.0.linux-amd64/
hostname $ ./influxdb_exporter
INFO[0000] Starting influxdb_exporter (version=0.1.0, branch=HEAD, revision=4d30f926a4d82f9db52604b0e4d10396a2994360) source="main.go:258"
INFO[0000] Build context (go=go1.8.3, user=root@906a0f6cc645, date=20170726-15:10:21) source="main.go:259"
INFO[0000] Listening on :9122 source="main.go:297"

3 Previously known as Yammer metrics.

InfluxDB | 193 You can then direct your existing applications that speak the InfluxDB line protocol to use the InfluxDB exporter. To send a metric by hand with labels you can do: curl -XPOST 'http://localhost:9122/write' --data-binary \ 'example_metric,foo=bar value=43 1517339868000000000' If you then visit http://localhost:9122/metrics in your browser, amongst the output you will see: # HELP example_metric InfluxDB Metric # TYPE example_metric untyped example_metric{foo="bar"} 43 You may notice that the timestamp that you sent to the exporter is not exposed. There are very few valid use cases for /metrics to expose timestamps, as scrapes are meant to synchronously gather metrics representing the application state at scrape time. When working with other monitoring systems this is often not the case and using timestamps would be valid. At the time of writing only the Java client library supports timestamps for custom collectors. When metrics are exported without time‐ stamps, Prometheus will use the time at which the scrape happens. The InfluxDB exporter will garbage collect the point after a few minutes and stop exposing it. These are the challenges you face when you convert from push to pull. On the other hand, converting from pull to push is quite simple, as seen in Example 4-13. You can scrape the InfluxDB exporter like any other exporter, as shown in Example 11-2.

Example 11-2. prometheus.yml to scrape a local InfluxDB exporter

global:
  scrape_interval: 10s
scrape_configs:
  - job_name: application_name
    static_configs:
      - targets:
          - localhost:9122

StatsD

StatsD takes in events and aggregates them over time into metrics. You can think of sending an event to StatsD as being like calling inc on a counter or observe on a summary. The StatsD exporter does just that, converting your StatsD events into Prometheus client library metrics and instrumentation calls. You can run the StatsD exporter by following the steps in Example 11-3.

Example 11-3. Downloading and running the StatsD exporter

hostname $ wget https://github.com/prometheus/statsd_exporter/releases/download/v0.6.0/statsd_exporter-0.6.0.linux-amd64.tar.gz
hostname $ tar -xzf statsd_exporter-0.6.0.linux-amd64.tar.gz
hostname $ cd statsd_exporter-0.6.0.linux-amd64/
hostname $ ./statsd_exporter
INFO[0000] Starting StatsD -> Prometheus Exporter (version=0.6.0, branch=HEAD, revision=3fd85c92fc0d91b3c77bcb1a8b2c7aa2e2a99d04) source="main.go:149"
INFO[0000] Build context (go=go1.9.2, user=root@29b80e16fc07, date=20180117-17:45:48) source="main.go:150"
INFO[0000] Accepting StatsD Traffic: UDP :9125, TCP :9125 source="main.go:151"
INFO[0000] Accepting Prometheus Requests on :9102 source="main.go:152"

As StatsD uses a custom TCP and UDP protocol, you need different ports for sending events than for scraping /metrics. You can send a gauge by hand with:4 echo 'example_gauge:123|g' | nc localhost 9125 Which will appear on http://localhost:9102/metrics as: # HELP example_gauge Metric autogenerated by statsd_exporter. # TYPE example_gauge gauge example_gauge 123 You can also send counter increments and summary/histogram observations: echo 'example_counter_total:1|c' | nc localhost 9125 echo 'example_latency_total:20|ms' | nc localhost 9125 The StatsD protocol isn’t fully specified; many implementations only support integer values. While the StatsD exporter does not have this limitation, note that many met‐ rics will not be in the base units you are used to with Prometheus. You can also extract labels, as StatsD is often used with the Graphite dotted string notation, where position indicates meaning. app.http.requests.eu-west-1./foo might, for example, mean what would be app_http_requests_total{region="eu- west-1",path="/foo"} in Prometheus. To be able to map from such a string you need to provide a mapping file in mapping.yml, such as: mappings: - match: app.http.requests.*.* name: app_http_requests_total labels: region: "${1}" path: "${2}"

4 nc is a handy networking utility whose full name is netcat. You may need to install it if you don’t have it already.

StatsD | 195 and then run the StatsD exporter using it: ./statsd_exporter -statsd.mapping-config mapping.yml If you now send requests following that pattern to the StatsD exporter, they will be appropriately named and labeled: echo 'app.http.requests.eu-west-1./foo:1|c' | nc localhost 9125 echo 'app.http.requests.eu-west-1./bar:1|c' | nc localhost 9125 If you visit http://localhost:9102/metrics it will now contain: # HELP app_http_requests_total Metric autogenerated by statsd_exporter. # TYPE app_http_requests_total counter app_http_requests_total{path="/bar",region="eu-west-1"} 1 app_http_requests_total{path="/foo",region="eu-west-1"} 1 The Graphite exporter has a similar mechanism to convert dotted strings into labels. You may end up running the StatsD exporter even after you have completed your transition to Prometheus if you are using languages such as PHP and Perl for web applications. As mentioned in “Multiprocess with Gunicorn” on page 64, Prome‐ theus presumes a multithreaded model with long-lived processes. You typically use languages like PHP in a way that is not only multiprocess, but also often with pro‐ cesses that only live for a single HTTP request. While an approach such as the Python client uses for multiprocess deployments is theoretically possible for typical PHP deployments, you may find that the StatsD exporter is more practical. There is also the prom-aggregation-gateway in this space. I would recommend that for exporters like the InfluxDB, Graphite, StatsD, and col‐ lectd exporters that convert from push to pull that you have one exporter per applica‐ tion instance and the same lifecycle as the application. You should start, stop, and restart the exporter at the same time as you start, stop, and restart the application instance. That way is easier to manage, avoids issues with labels changing, and keeps the exporter from becoming a bottleneck.5 While there are hundreds of exporters on offer, you may find yourself needing to write or extend one yourself. The next chapter will show you how to write exporters.

5 One of the reasons that Prometheus exists is due to scaling issues that SoundCloud had with many many applications sending to one StatsD.

CHAPTER 12
Writing Exporters

Sometimes you will not be able to either add direct instrumentation to an application, nor find an existing exporter that covers it. This leaves you with having to write an exporter yourself. The good news is that exporters are relatively easy to write. The hard part is figuring out what the metrics exposed by applications mean. Units are often unknown, and documentation, if it exists at all, can be vague. In this chapter you will learn how to write exporters. Consul Telemetry I’m going to write a small exporter for Consul to demonstrate the process. While we already saw version 0.6.0 of the Consul exporter in “Consul” on page 169, that ver‐ sion is missing metrics from the newly added telemetry API.1 While you can write exporters in any programming language, the majority are writ‐ ten in Go, and that is the language I will use here. However, you will find a small number of exporters written in Python, and an even smaller number in Java. If your Consul is not running, start it again following the instructions in Example 8-6. If you visit http://localhost:8500/v1/agent/metrics you will see the JSON output that you will be working with, which is similar to Example 12-1. Conven‐ iently, Consul provides a Go library that you can use, so you don’t have to worry about parsing the JSON yourself.

1 These metrics will likely be in the 0.7.0 version of the Consul exporter.

Example 12-1. An abbreviated example output from a Consul agent’s metrics output

{ "Timestamp": "2018-01-31 14:42:10 +0000 UTC", "Gauges": [ { "Name": "consul.autopilot.failure_tolerance", "Value": 0, "Labels": {} } ], "Points": [], "Counters": [ { "Name": "consul.raft.apply", "Count": 1, "Sum": 1, "Min": 1, "Max": 1, "Mean": 1, "Stddev": 0, "Labels": {} } ], "Samples": [ { "Name": "consul.fsm.coordinate.batch-update", "Count": 1, "Sum": 0.13156799972057343, "Min": 0.13156799972057343, "Max": 0.13156799972057343, "Mean": 0.13156799972057343, "Stddev": 0, "Labels": {} } ] }

You are in luck that Consul has split out the counters and gauges for you.2 The Samples also look like you can use the Count and Sum in a summary metric. Looking at all the Samples again, I have a suspicion that they are tracking latency. Digging through the documentation confirms that they are timers, which means a Prome‐ theus Summary (see “The Summary” on page 48). The timers are also all in milli‐ seconds, so we can convert them to seconds.3 While the JSON has a field for labels, none are used, so you can ignore that. Aside from that, the only other thing you need to do is ensure any invalid characters in the metric names are sanitised.

2 Just because something is called a counter does not mean it is a counter. For example, Dropwizard has coun‐ ters that can go down, so depending on how the counter is used in practice it may be a counter, gauge, or untyped in Prometheus terms.

3 If only some of the Samples were timers, you would have to choose between exposing them as-is or maintain‐ ing a list of which metrics are latencies and which weren’t.

You now know the logic you need to apply to the metrics that Consul exposes, so you can write your exporter as in Example 12-2.

Example 12-2. consul_metrics.go, an exporter for Consul metrics written in Go package main import ( "log" "net/http" "regexp"

"github.com/hashicorp/consul/api" "github.com/prometheus/client_golang/prometheus" "github.com/prometheus/client_golang/prometheus/promhttp" ) var ( up = prometheus.NewDesc( "consul_up", "Was talking to Consul successful.", nil, nil, ) invalidChars = regexp.MustCompile("[^a-zA-Z0-9:_]") ) type ConsulCollector struct { }

// Implements prometheus.Collector. func (c ConsulCollector) Describe(ch chan<- *prometheus.Desc) { ch <- up }

// Implements prometheus.Collector. func (c ConsulCollector) Collect(ch chan<- prometheus.Metric) { consul, err := api.NewClient(api.DefaultConfig()) if err != nil { ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 0) return }

metrics, err := consul.Agent().Metrics() if err != nil { ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 0) return } ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 1)

for _, g := range metrics.Gauges { name := invalidChars.ReplaceAllLiteralString(g.Name, "_")

Consul Telemetry | 199 desc := prometheus.NewDesc(name, "Consul metric "+g.Name, nil, nil) ch <- prometheus.MustNewConstMetric( desc, prometheus.GaugeValue, float64(g.Value)) }

for _, c := range metrics.Counters { name := invalidChars.ReplaceAllLiteralString(c.Name, "_") desc := prometheus.NewDesc(name+"_total", "Consul metric "+c.Name, nil, nil) ch <- prometheus.MustNewConstMetric( desc, prometheus.CounterValue, float64(c.Count)) }

for _, s := range metrics.Samples { // All samples are times in milliseconds, we convert them to seconds below. name := invalidChars.ReplaceAllLiteralString(s.Name, "_") + "_seconds" countDesc := prometheus.NewDesc( name+"_count", "Consul metric "+s.Name, nil, nil) ch <- prometheus.MustNewConstMetric( countDesc, prometheus.CounterValue, float64(s.Count)) sumDesc := prometheus.NewDesc( name+"_sum", "Consul metric "+s.Name, nil, nil) ch <- prometheus.MustNewConstMetric( sumDesc, prometheus.CounterValue, s.Sum/1000) } } func main() { c := ConsulCollector{} prometheus.MustRegister(c) http.Handle("/metrics", promhttp.Handler()) log.Fatal(http.ListenAndServe(":8000", nil)) }

If you have a working Go development environment you can run the exporter with: go get -d -u github.com/hashicorp/consul/api go get -d -u github.com/prometheus/client_golang/prometheus go run consul_metrics.go If you visit http://localhost:8000/metrics you will see metrics like: # HELP consul_autopilot_failure_tolerance Consul metric consul.autopilot.failure_tolerance # TYPE consul_autopilot_failure_tolerance gauge consul_autopilot_failure_tolerance 0 # HELP consul_raft_apply_total Consul metric consul.raft.apply # TYPE consul_raft_apply_total counter consul_raft_apply_total 1 # HELP consul_fsm_coordinate_batch_update_seconds_count Consul metric consul.fsm.coordinate.batch-update # TYPE consul_fsm_coordinate_batch_update_seconds_count counter consul_fsm_coordinate_batch_update_seconds_count 1 # HELP consul_fsm_coordinate_batch_update_seconds_sum Consul metric

200 | Chapter 12: Writing Exporters consul.fsm.coordinate.batch-update # TYPE consul_fsm_coordinate_batch_update_seconds_sum counter consul_fsm_coordinate_batch_update_seconds_sum 1.3156799972057343e-01 That’s all well and good, but how does the code work? In the next section I’ll show you how. Custom Collectors With direct instrumentation the client library takes in instrumentation events and tracks the values of the metrics over time. Client libraries provide the counter, gauge, summary, and histogram metrics for this, which are all examples of collectors. At scrape time each collector in a registry is collected, which is to say, asked for its met‐ rics. These metrics will then be returned by the scrape of /metrics. Counters and the other three standard metric types only ever return one metric family. If rather than using direct instrumentation you want to provide from some other source, you use a custom collector, which is any collector that is not one of the stan‐ dard four. Custom collectors can return any number of metric families. Collection happens on every single scrape of a /metrics, where each collection is a consistent snapshot of the metrics from a collector. In Go your collectors must implement the prometheus.Collector interface. That is to say the collectors must be objects with Describe and Collect methods with a spe‐ cific signature. The Describe method returns a description of the metrics it will produce, in particu‐ lar the metric name, label names, and help string. The Describe method is called at registration time, and is used to avoid duplicate metric registration. There are two types of metrics an exporter can have, ones where it knows the names and labels in advance, and ones where they are only determined at scrape time. In this example, consul_up is known in advance so you can create its Desc once with NewDesc and provide it via Describe. All the other metrics are generated dynamically at scrape time, so cannot be included: var ( up = prometheus.NewDesc( "consul_up", "Was talking to Consul successful.", nil, nil, ) ) // Implements prometheus.Collector. func (c ConsulCollector) Describe(ch chan<- *prometheus.Desc) { ch <- up }

Custom Collectors | 201 The Go client requires that at least one Desc is provided by Describe. If all your metrics are dynamic, you can provide a dummy Desc to work around this.

At the core of a custom collector is the Collect method. In this method you fetch all the data you need from the application instance you are working with, munge it as needed, and then send the metrics back to the client library. Here you need to con‐ nect to Consul and then fetch its metrics. If an error occurs, consul_up is returned as 0; otherwise, once we know that the collection is going to be successful, it is returned as 1. Only returning a metric sometimes is difficult4 to deal with in PromQL; having consul_up allows you to alert on issues talking to Consul so you’ll know that some‐ thing is awry. To return consul_up, prometheus.MustNewConstMetric is used to provide a sample for just this scrape. It takes its Desc, type, and value: // Implements prometheus.Collector. func (c ConsulCollector) Collect(ch chan<- prometheus.Metric) { consul, err := api.NewClient(api.DefaultConfig()) if err != nil { ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 0) return }

metrics, err := consul.Agent().Metrics() if err != nil { ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 0) return } ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 1) There are three possible values: GaugeValue, CounterValue, and UntypedValue. Gauge and Counter you already know, and Untyped is for cases where you are not sure whether a metric is a counter or a gauge. This is not possible with direct instru‐ mentation, but it is not unusual for the type of metrics from other monitoring and instrumentation systems to be unclear and impractical to determine. Now that you have the metrics from Consul, you can process the gauges. Invalid characters in the metric name, such as dots and hyphens, are converted to under‐ scores. A Desc is created on the fly, and immediately used in a MustNewConstMetric:

4 See “or operator” on page 251.

202 | Chapter 12: Writing Exporters for _, g := range metrics.Gauges { name := invalidChars.ReplaceAllLiteralString(g.Name, "_") desc := prometheus.NewDesc(name, "Consul metric "+g.Name, nil, nil) ch <- prometheus.MustNewConstMetric( desc, prometheus.GaugeValue, float64(g.Value)) } Processing of counters is similar, except that a _total suffix is added to the metric name: for _, c := range metrics.Counters { name := invalidChars.ReplaceAllLiteralString(c.Name, "_") desc := prometheus.NewDesc(name+"_total", "Consul metric "+c.Name, nil, nil) ch <- prometheus.MustNewConstMetric( desc, prometheus.CounterValue, float64(s.Count)) } The contents of metrics.Samples are more complicated. While the samples are a Prometheus summary, the Go client does not currently support those for MustNew ConstMetric. Instead, you can emulate it using two counters. _seconds is appended to the metric name, and the sum is divided by a thousand to convert from milli‐ seconds to seconds: for _, s := range metrics.Samples { // All samples are times in milliseconds, we convert them to seconds below. name := invalidChars.ReplaceAllLiteralString(s.Name, "_") + "_seconds" countDesc := prometheus.NewDesc( name+"_count", "Consul metric "+s.Name, nil, nil) ch <- prometheus.MustNewConstMetric( countDesc, prometheus.CounterValue, float64(s.Count)) sumDesc := prometheus.NewDesc( name+"_sum", "Consul metric "+s.Name, nil, nil) ch <- prometheus.MustNewConstMetric( sumDesc, prometheus.CounterValue, s.Sum/1000) }

s.Sum here is a float64, but you must be careful when doing division with integers to ensure you don’t unnecessarily lose precision. If sum were an integer, float64(sum)/1000 would convert to floating point first and then divide, which is what you want. On the other hand, float64(sum/1000) will first divide the integer value by 1000, losing three digits of precision.
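A quick standalone illustration of the difference (this is just a sketch, not part of the exporter):

package main

import "fmt"

func main() {
	sum := int64(131567) // an integer value in milliseconds, for example
	fmt.Println(float64(sum) / 1000) // 131.567 — convert first, then divide
	fmt.Println(float64(sum / 1000)) // 131 — integer division already discarded the precision
}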

Finally, the custom collector object is instantiated and registered with the default reg‐ istry, in the same way you would one of the direct instrumentation metrics: c := ConsulCollector{} prometheus.MustRegister(c)

Custom Collectors | 203 Exposition is performed in the usual way, which you already saw in “Go” on page 67: http.Handle("/metrics", promhttp.Handler()) log.Fatal(http.ListenAndServe(":8000", nil)) This is, of course, a simplified example. In reality you would have some way to con‐ figure the Consul server to talk to, such as a command-line flag, rather than depend‐ ing on the client’s default. You would also reuse the client between scrapes, and allow the various authentication options of the client to be specified.

The min, max, mean, and stddev were discarded as they are not very useful. You can calculate a mean using the sum and count. min, max, and stddev, on the other hand, are not aggregatable and you don’t know over what time period they were measured.
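For example, a sketch of calculating the mean batch-update duration in PromQL from the _sum and _count this exporter exposes (the 5-minute range is arbitrary):

  rate(consul_fsm_coordinate_batch_update_seconds_sum[5m])
/ rate(consul_fsm_coordinate_batch_update_seconds_count[5m])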

As the default registry is being used, go_ and process_ metrics are included in the result. These provide you with information about the performance of the exporter itself, and are useful to detect issues such as file descriptor leaks using the process_open_fds. This saves you from having to scrape the exporter separately for these metrics. The only time you might not use the default registry for an exporter is when writing a Blackbox/SNMP style exporter, where some interpretation of URL parameters needs to be performed as collectors have no access to URL parameters for a scrape. In that case, you would also scrape the /metrics of the exporter in order to monitor the exporter itself. For comparison, the equivalent exporter written using Python 3 is shown in Example 12-3. This is largely the same as the one written in Go, the only notable dif‐ ference is that a SummaryMetricFamily is available to represent a summary, instead of emulating it with two separate counters. The Python client does not have as many sanity checks as the Go client, so you need to be a little more careful with it.

Example 12-3. consul_metrics.py, an exporter for Consul metrics written in Python 3 import json import re import time from urllib.request import urlopen from prometheus_client.core import GaugeMetricFamily, CounterMetricFamily from prometheus_client.core import SummaryMetricFamily, REGISTRY from prometheus_client import start_http_server def sanitise_name(s): return re.sub(r"[^a-zA-Z0-9:_]", "_", s)

204 | Chapter 12: Writing Exporters class ConsulCollector(object): def collect(self): out = urlopen("http://localhost:8500/v1/agent/metrics").read() metrics = json.loads(out.decode("utf-8"))

for g in metrics["Gauges"]: yield GaugeMetricFamily(sanitise_name(g["Name"]), "Consul metric " + g["Name"], g["Value"])

for c in metrics["Counters"]: yield CounterMetricFamily(sanitise_name(c["Name"]) + "_total", "Consul metric " + c["Name"], c["Count"])

for s in metrics["Samples"]: yield SummaryMetricFamily(sanitise_name(s["Name"]) + "_seconds", "Consul metric " + s["Name"], count_value=c["Count"], sum_value=s["Sum"] / 1000) if __name__ == '__main__': REGISTRY.register(ConsulCollector()) start_http_server(8000) while True: time.sleep(1) Labels In the preceding example you only saw metrics without labels. To provide labels you need to specify the label names in Desc and then the values in MustNewConstMetric. To expose a metric with the time series example_gauge{foo="bar", baz="small"} and example_gauge{foo="quu", baz="far"} you could do: func (c MyCollector) Collect(ch chan<- prometheus.Metric) { desc := prometheus.NewDesc( "example_gauge", "A help string.", []string{"foo", "baz"}, nil, ) ch <- prometheus.MustNewConstMetric( desc, prometheus.GaugeValue, 1, "bar", "small") ch <- prometheus.MustNewConstMetric( desc, prometheus.GaugeValue, 2, "quu", "far") } with the Go Prometheus client library. First, you can provide each time series indi‐ vidually. The registry will take care of combining all the time series belonging to the same metric family together in the /metrics output.

Custom Collectors | 205 The help strings of all metrics with the same name must be identi‐ cal. Providing differing Descs will cause the scrape to fail.

The Python client works a little differently; you assemble the metric family and then return it. While that may sound like more effort, it usually works out to be the same level of effort in practice: class MyCollector(object): def collect(self): mf = GaugeMetricFamily("example_gauge", "A help string.", labels=["foo", "baz"]) mf.add_metric(["bar", "small"], 1) mf.add_metric(["quu", "far"], 2) yield mf Guidelines While direct instrumentation tends to be reasonably black and white, writing export‐ ers tends to be murky and involve engineering tradeoffs. Do you want to spend a lot of ongoing effort to produce perfect metrics, or do something that’s good enough and requires no maintenance? Writing exporters is more of an art than a science. You should try to follow the metric naming practices, in particular, avoiding the _count, _sum, _total, _bucket, and _info suffixes unless the time series is part of a metric that is meant to contain such a time series. It is often not possible or practical to determine whether a bunch of metrics are gauges, counters, or a mix of the two. In cases where there is a mix you should mark them as untyped rather than using gauge or counter, which would be incorrect. If a metric is a counter, don’t forget to add the _total suffix. Where practical you should try to provide units for your metrics, and at the very least try to ensure that the units are in the metric name. Having to determine what the units are from metrics as in Example 12-1 is not fun for anyone, so you should try to remove this burden from the users of your exporter. Seconds and bytes are always preferred. In terms of using labels in exporters there are a few gotchas to look out for. As with direct instrumentation, cardinality is also a concern for exporters for the same rea‐ sons that were discussed in “Cardinality” on page 93. Metrics with high churn in their labels should be avoided. Labels should create a partition across a metric, and if you take a sum or average across a metric it should be meaningful, as discussed in “When to Use Labels” on

In particular, you should look out for any time series that are just totals of all the other values in a metric, and remove them. If you are ever unsure as to whether a label makes sense when writing an exporter then it is safest not to use one, though keep in mind "Table Exception" on page 93. As with direct instrumentation, you should not apply a label such as env="prod" to all metrics coming from your exporter, as that is what target labels are for, as discussed in "Target Labels" on page 139.
It is best to expose raw metrics to Prometheus, rather than doing calculations on the application side. For example, there is no need to expose a 5-minute rate when you have a counter, as you can use the rate function to calculate a rate over any period you like. Similarly with ratios, drop them in favour of the numerator and denominator. If you have a percentage without its constituent numerator and denominator, at the least convert it to a ratio.5
Beyond multiplication and division to standardise units, you should avoid math in exporters, as processing raw data in PromQL is preferred. Race conditions between instrumentation events can lead to artifacts, particularly when you subtract one metric from another. Addition of metrics for the purposes of reducing cardinality can be okay, but if they're counters, make sure there will not be spurious resets due to some of them disappearing.
Some metrics are not particularly useful given how Prometheus is intended to be used. Many applications expose metrics such as machine RAM, CPU, and disk. You should not expose machine-level metrics in your exporter, as that is the responsibility of the Node exporter.6 Minimums, maximums, and standard deviations cannot be sanely aggregated so should also be dropped.
You should plan on running one exporter per application instance,7 and fetch metrics synchronously for each scrape without any caching. This keeps the responsibilities of service discovery and scrape scheduling with Prometheus. Be aware that concurrent scrapes can happen.
Just as Prometheus adds a scrape_duration_seconds metric when performing a scrape, you may also add a myexporter_scrape_duration_seconds metric for how long it takes your exporter to pull the data from its application. This helps in performance debugging, as you can see whether it's the application or your exporter that is getting slow. Additional metrics such as the number of metrics processed can also be helpful.
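To make that last suggestion concrete, here is a minimal sketch of a collector that times its own work, using the Python client from earlier chapters; the metric name myexporter_scrape_duration_seconds and the fetch_from_application() helper are hypothetical, not part of any real exporter:

import time

from prometheus_client.core import GaugeMetricFamily

class MyExporterCollector(object):
    def collect(self):
        start = time.time()
        # Hypothetical helper that synchronously pulls raw data from the
        # application on every scrape; no caching is involved.
        data = fetch_from_application()
        for name, value in data.items():
            yield GaugeMetricFamily(name, "Metric from the application", value)
        # Mirror Prometheus's own scrape_duration_seconds so you can tell
        # whether the exporter or the application is slow.
        yield GaugeMetricFamily(
            "myexporter_scrape_duration_seconds",
            "Time taken to pull data from the application.",
            time.time() - start)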

5 And check that it is actually a ratio/percentage; it’s not unknown for metrics to confuse the two.

6 Or WMI exporter for Windows users.

7 Unless writing a Blackbox/SNMP-style exporter, which is rare.

It can make sense for you to add direct instrumentation to exporters, in addition to the custom collectors that provide their core functionality. For example, the CloudWatch exporter has a cloudwatch_requests_total counter tracking the number of API calls it makes, as each API call costs money. But this is usually only something that you will see with Blackbox/SNMP-style exporters.
Now that you know how to get metrics out of both your applications and third-party code, in the next chapter I'll start covering PromQL, which allows you to work with these metrics.

PART IV PromQL

The Prometheus Query Language offers you the ability to do all sorts of aggregations, analysis, and arithmetic, allowing you to better understand the performance of your systems from your metrics. In this part you will be reusing the Prometheus and Node exporter setup you created in Chapter 2, and using the expression browser to execute queries.
Chapter 13 covers the basics of PromQL, and how you can use the HTTP API to evaluate expressions. Chapter 14 looks in depth into how aggregation works. Chapter 15 covers operators such as addition and comparisons, and how you can join different metrics together. Chapter 16 goes into the wide variety of functions that PromQL offers you, from knowing the time of day to predicting when your hard disk will fill up. Chapter 17 covers the recording rule feature of Prometheus, which allows you to precompute metrics for faster and more sophisticated querying with PromQL.

CHAPTER 13 Introduction to PromQL

PromQL is the Prometheus Query Language. While it ends in QL, you will find that it is not an SQL-like language, as SQL languages tend to lack expressive power when it comes to the sort of calculations you would like to perform on time series. Labels are a key part of PromQL, and you can use them not only to do arbitrary aggregations but also to join different metrics together for arithmetic operations against them. There are a wide variety of functions available to you, from prediction to date and math functions.
This chapter will introduce you to the basic concepts of PromQL, including aggregation, basic types, and the HTTP API.
Aggregation Basics
Let's get started with some simple aggregation queries. These queries will likely cover most of your potential uses for PromQL. While PromQL is as powerful as it is possible to be,1 most of the time your needs will be reasonably simple.
Gauge
Gauges are a snapshot of state, and usually when aggregating them you want to take a sum, average, minimum, or maximum. Consider the metric node_filesystem_size_bytes from your Node exporter, which reports the size of each of your mounted filesystems, and has device, fstype, and mountpoint labels. You can calculate total filesystem size on each machine with:

1 I have demonstrated PromQL to be Turing Complete in two different ways. Don’t try this in production.

sum without(device, fstype, mountpoint)(node_filesystem_size_bytes)
This works because without tells the sum aggregator to sum up everything that has the same labels, ignoring those three. So if you had the time series:
node_filesystem_size_bytes{device="/dev/sda1",fstype="vfat",
    instance="localhost:9100",job="node",mountpoint="/boot/efi"} 70300672
node_filesystem_size_bytes{device="/dev/sda5",fstype="ext4",
    instance="localhost:9100",job="node",mountpoint="/"} 30791843840
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run"} 817094656
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run/lock"} 5238784
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run/user/1000"} 826912768
The result would be:
{instance="localhost:9100",job="node"} 32511390720
You will notice that the device, fstype, and mountpoint labels are now gone. The metric name is also no longer present, as this is no longer node_filesystem_size_bytes once math has been performed on it. Since there is only one Node exporter being scraped by Prometheus there is only one result, but if you were scraping more then you would have a result for each of the Node exporters.
You could go a step further and remove the instance label with:
sum without(device, fstype, mountpoint, instance)(node_filesystem_size_bytes)
This, as expected, removes the instance label, but the value remains the same as in the previous expression as there is only one Node exporter to aggregate metrics from:
{job="node"} 32511390720
You can use the same approach with other aggregations. max would tell you the size of the biggest mounted filesystem on each machine:
max without(device, fstype, mountpoint)(node_filesystem_size_bytes)
The outputted labels are exactly the same as when you aggregated using sum:
{instance="localhost:9100",job="node"} 30791843840
This predictability in what labels are returned is important for vector matching with operators, as will be discussed in Chapter 15.
You are not limited to aggregating metrics about one type of job. For example, to find the average number of file descriptors open across all your jobs you could use:
avg without(instance, job)(process_open_fds)

Counter
Counters track the number or size of events, and the value your applications expose on their /metrics is the total since the application started. That total is of little use to you on its own; what you really want to know is how quickly the counter is increasing over time. This is usually done using the rate function, though the increase and irate functions also operate on counter values. For example, to calculate the amount of network traffic received per second you could use:
rate(node_network_receive_bytes_total[5m])
The [5m] says to provide rate with 5 minutes of data, so the returned value will be an average over the last 5 minutes:
{device="lo",instance="localhost:9100",job="node"} 1859.389655172414
{device="wlan0",instance="localhost:9100",job="node"} 1314.5034482758622
The values here are not integers, as the 5-minute window rate is looking at does not perfectly align with the samples that Prometheus has scraped. Some estimation is used to fill in the gaps between the data points you have and the boundaries of the range.
The output of rate is a gauge, so the same aggregations apply as for gauges. The node_network_receive_bytes_total metric has a device label, so if you aggregate it away you will get the total bytes received per machine per second:
sum without(device)(rate(node_network_receive_bytes_total[5m]))
Running this query will give you a result like:
{instance="localhost:9100",job="node"} 3173.8931034482762
You can filter down which time series to request, so you could only look at eth0 and then aggregate it across all machines by aggregating away the instance label:
sum without(instance)(rate(node_network_receive_bytes_total{device="eth0"}[5m]))
When you run this query the instance label is gone, but the device label remains as you did not ask for it to be removed:
{device="eth0",job="node"} 3173.8931034482762
There is no ordering or hierarchy within labels, allowing you to aggregate by as many or as few labels as you like.
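As a brief aside on the increase function mentioned at the start of this section: it returns how much a counter grew over the range rather than a per-second rate. As one illustrative expression that is not from the original examples:
increase(node_network_receive_bytes_total{device="eth0"}[1h])
would give you approximately the number of bytes received on eth0 over the past hour, which is roughly the equivalent rate over that range multiplied by 3,600.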

Summary
A summary metric will usually contain both a _sum and _count, and sometimes a time series with no suffix with a quantile label. The _sum and _count are both counters. Your Prometheus exposes a http_response_size_bytes summary for the amount of data some of its HTTP APIs return.2 http_response_size_bytes_count tracks the number of requests, and as it is a counter you must use rate before aggregating away its handler label:
sum without(handler)(rate(http_response_size_bytes_count[5m]))
This gives you the total per-second HTTP request rate, and as the Node exporter also returns this metric you will see both jobs in the result:
{instance="localhost:9090",job="prometheus"} 0.26868836781609196
{instance="localhost:9100",job="node"} 0.1
Similarly, http_response_size_bytes_sum is a counter with the number of bytes each handler has returned, so the same pattern applies:
sum without(handler)(rate(http_response_size_bytes_sum[5m]))
This will return results with the same labels as the previous query, but the values are larger as responses tend to return many bytes:
{instance="localhost:9090",job="prometheus"} 796.0015958275862
{instance="localhost:9100",job="node"} 1581.6103448275862
The power of a summary is that it allows you to calculate the average size of an event, in this case the average number of bytes being returned in each response. If you had three responses of size 1, 4, and 7, then the average would be their sum divided by their count, which is to say 12 divided by 3. The same applies to the summary: you divide the _sum by the _count (after taking a rate) to get an average over a time period:
sum without(handler)(rate(http_response_size_bytes_sum[5m]))
/
sum without(handler)(rate(http_response_size_bytes_count[5m]))
The division operator matches the time series with the same labels and divides, giving you the same two time series out but with the average response size over the past 5 minutes as a value:
{instance="localhost:9090",job="prometheus"} 2962.54580091246150133317
{instance="localhost:9100",job="node"} 15816.10344827586200000000

2 In Prometheus 2.3.0 this was renamed to prometheus_http_response_size_bytes_count.

When calculating an average, it is important that you first aggregate up the sum and count, and only as the last step perform the division. Otherwise, you could end up averaging averages, which is not statistically valid. For example, if you wanted to get the average response size across all instances of a job, you could do:3
sum without(instance)(
    sum without(handler)(rate(http_response_size_bytes_sum[5m]))
)
/
sum without(instance)(
    sum without(handler)(rate(http_response_size_bytes_count[5m]))
)
However, it'd be incorrect to do:
avg without(instance)(
    sum without(handler)(rate(http_response_size_bytes_sum[5m]))
  /
    sum without(handler)(rate(http_response_size_bytes_count[5m]))
)
It is incorrect to average an average, and both the division and avg would be calculating averages.

It is not possible for you to aggregate the quantiles of a summary (the time series with the quantile label) from a statistical standpoint.

Histogram
Histogram metrics allow you to track the distribution of the size of events, allowing you to calculate quantiles from them. For example, you can use histograms to calculate the 0.9 quantile (which is also known as the 90th percentile) latency.
Prometheus 2.2.1 exposes a histogram metric called prometheus_tsdb_compaction_duration_seconds that tracks how many seconds compaction takes for the time series database. This histogram metric has time series with a _bucket suffix called prometheus_tsdb_compaction_duration_seconds_bucket. Each bucket has a le label, which is a counter of how many events have a size less than or equal to the bucket boundary. This is an implementation detail you largely need not worry about

3 This can of course be more simply calculated as sum without(instance, handler)(…), but with recording rules covered in Chapter 17, such an expression could end up split into several expressions.

as the histogram_quantile function takes care of this when calculating quantiles. For example, the 0.90 quantile would be:
histogram_quantile(
    0.90,
    rate(prometheus_tsdb_compaction_duration_seconds_bucket[1d]))
As prometheus_tsdb_compaction_duration_seconds_bucket is a counter you must first take a rate. Compaction usually only happens every two hours, so a one-day time range is used here, and you will see a result in the expression browser such as:
{instance="localhost:9090",job="prometheus"} 7.720000000000001
This indicates that the 90th percentile latency of compactions is around 7.72 seconds. As there will usually only be 12 compactions in a day, the 90th percentile says that 10% of compactions take longer than this, which is to say one or two compactions. This is something to be aware of when using quantiles. For example, if you want to calculate a 0.999 quantile you should have several thousand data points to work with in order to produce a reasonably accurate answer. If you have fewer than that, single outliers could greatly affect the result, and you should consider using lower quantiles to avoid making statements about your system that you have insufficient data to back up.

Usually you would use a 5- or 10-minute rate with histograms. All the bucket time series combined with any labels, and a long range on the rate, can make for a lot of samples that need to be processed. Be wary of PromQL expressions using ranges that are hours or days, as they can be relatively expensive to calculate.4

Similar to when taking averages, using histogram_quantile should be the last step in a query expression. Quantiles cannot be aggregated, or have arithmetic performed upon them, from a statistical standpoint. Accordingly, when you want to take a histogram of an aggregate, first aggregate up with sum and then use histogram_quantile:
histogram_quantile(
    0.90,
    sum without(instance)(rate(prometheus_tsdb_compaction_duration_seconds_bucket[1d])))
This calculates the 0.9 quantile compaction duration across all of your Prometheus servers, and will produce a result without an instance label:
{job="prometheus"} 7.720000000000001

4 The day-long range is only being used here due to the limited number of histograms that Prometheus and the Node exporter offer for me to use as examples.

Histogram metrics also include _sum and _count metrics, which work exactly the same as for the summary metric. You can use these to calculate average event sizes, such as the average compaction duration:
sum without(instance)(rate(prometheus_tsdb_compaction_duration_seconds_sum[1d]))
/
sum without(instance)(rate(prometheus_tsdb_compaction_duration_seconds_count[1d]))
This would produce a result like:
{job="prometheus"} 3.1766430400714287
Selectors
Working with all the different time series with different label values for a metric can be a bit overwhelming, and potentially confusing if a metric is coming from multiple different types of server.5 Usually you will want to narrow down which time series you are working on. You will almost always want to limit by the job label, and depending on what you are up to, you might want to only look at one instance or one handler, for example.
This limiting by labels is done using selectors. You have seen selectors in every example thus far, and now I'm going to explain them to you in detail. For example:
process_resident_memory_bytes{job="node"}
is a selector that will return all time series with the name process_resident_memory_bytes and a job label of node. This particular selector is most properly called an instant vector selector, as it returns the values of the given time series at a given instant. Vector here basically means a one-dimensional list, as a selector can return zero or more time series, and each time series will have one sample. The job="node" is called a matcher, and you can have many matchers in one selector; they are ANDed together.
Matchers
There are four matchers (you have already seen the equality matcher, which is also the most commonly used).
=
= is the equality matcher; for example, job="node". With this you can specify that the returned time series has a label name with exactly the given label value. As an

5 Such as process_cpu_seconds_total, which most exporters and client libraries will expose.

empty label value is the same as not having that label, you could use foo="" to specify that the foo label not be present.
!=
!= is the negative equality matcher; for example, job!="node". With this you can specify that the returned time series do not have a label name with exactly the given label value.
=~
=~ is the regular expression matcher; for example, job=~"n.*". With this you specify that for the returned time series the given label's value will be matched by the regular expression. The regular expression is fully anchored, which is to say that the regular expression a will only match the string a, and not xa or ax. You can prepend or suffix your regular expression with .* if you do not want this behaviour.6 As with relabelling, the RE2 regular expression engine is used, as covered in "Regular Expressions" on page 138.
!~
!~ is the negative regular expression matcher. RE2 does not support negative lookahead expressions, so this provides you with an alternative way to exclude label values based on a regular expression.
You can have multiple matchers with the same label name in a selector, which can be a substitute for negative lookahead expressions. For example, to find the size of all filesystems mounted under /run but not /run/user, you could use:7
node_filesystem_size_bytes{job="node",mountpoint=~"/run/.*",
    mountpoint!~"/run/user/.*"}
Internally the metric name is stored in a label called __name__ (as discussed in "Reserved Labels and __name__" on page 84), so process_resident_memory_bytes{job="node"} is syntactic sugar for {__name__="process_resident_memory_bytes",job="node"}. You can even do regular expressions on the metric name, but this is unwise outside of when you are debugging the performance of the Prometheus server.

6 It works this way to avoid accidentally overmatching. This way you usually get immediate feedback if your regular expression is under matching, while an unanchored expression might cause subtle issues down the line.

7 The Node exporter has a --collector.filesystem.ignored-mount-points flag you could use if you didn’t want these filesystems exported in the first place.

Having to use regular expression matchers is a little bit of a smell. If you find yourself using them a lot on a given label, consider whether you should instead combine the matched label values into one. For example, for HTTP status codes, instead of doing code=~"4.." to catch 401s, 404s, 405s, etc., you might instead combine them into a label value 4xx and use the equality matcher code="4xx".

The selector {} returns an error, which is a safety measure to avoid accidentally returning all the time series inside the Prometheus server, as that could be expensive. To be more precise, at least one of the matchers in a selector must not match the empty string. So {foo=""}, {foo!=""}, and {foo=~".*"} will return an error, while {foo="",bar="x"} or {foo=~".+"} are permitted.8
Instant Vector
An instant vector selector returns an instant vector of the most recent samples before the query evaluation time, which is to say a list of zero or more time series. Each of these time series will have one sample, and a sample contains both a value and a timestamp. While the instant vector returned by an instant vector selector has the timestamp of the original data,9 any instant vectors returned by other operations or functions will have the timestamp of the query evaluation time for all of their values.
When you ask for current memory usage you do not want samples from an instance that was turned down days ago to be included, a concept known as staleness. In Prometheus 1.x this was handled by returning time series that had a sample no more than 5 minutes before the query evaluation time. This largely worked but had downsides such as double counting if an instance restarted with a new instance label within that 5-minute window. Prometheus 2.x has a more sophisticated approach. If a time series disappears from one scrape to the next, or if a target is no longer returned from service discovery, a special type of sample called a stale marker10 is appended to the time series. When evaluating an instant vector selector, all time series satisfying all the matchers are first found, and the most recent sample in the 5 minutes before the query evaluation time is still considered. If the sample is a normal sample, then it is returned in the instant

8 If you do want to return all time series you can use {__name__=~".+"}, but beware of the expense of this expression.

9 You can extract the samples’ timestamps using the timestamp function.

10 Internally stale markers are a special type of NaN value. They are an implementation detail, and you cannot access them directly via any of the query APIs that use PromQL. But you could see them if you looked at the Prometheus server’s storage directly, such as via Prometheus’s remote read endpoint.

vector, but if it is a stale marker, then that time series will not be included in that instant vector. The outcome of all of this is that when you use an instant vector selector, time series that have gone stale are not returned.

If you have an exporter exposing timestamps, as described in “Timestamps” on page 78, then stale markers and the Prometheus 2.x staleness logic will not apply. The affected time series will work instead with the older logic that looks back 5 minutes.

Range Vector
There is a second type of selector you have already seen, called the range vector selector. Unlike an instant vector selector, which returns one sample per time series, a range vector selector can return many samples for each time series.11 Range vectors are always used with the rate function; for example:
rate(process_cpu_seconds_total[1m])
The [1m] turns the instant vector selector into a range vector selector, and instructs PromQL to return, for all time series matching the selector, all samples for the minute up to the query evaluation time. If you execute just process_cpu_seconds_total[1m] in the Console tab of the expression browser you will see something like Figure 13-1. In this case each time series happens to have six samples in the past minute. You will notice that while the samples for each time series happen to be perfectly 10 seconds apart,12 in line with the scrape interval you configured, the two time series' timestamps are not aligned with each other. One time series has a sample with a timestamp of 1517925155.087 and the other 1517925156.245.

11 You may also see it referred to as a matrix in places, as it is a two-dimensional data structure.

12 This is a very lightly loaded Prometheus, so there is no jitter.

Figure 13-1. A range vector in the Console tab of the expression browser

This is because range vectors preserve the actual timestamps of the samples, and the scrapes for different targets are distributed in order to spread load more evenly. While you can control the frequency of scrapes and rule evaluations, you cannot control their phase or alignment. If you have a 10-second scrape interval and hundreds of targets, then all those targets will be scraped at different points in a given 10-second window. Put another way, your time series all have slightly different ages. This generally won't matter to you in practice, but can lead to artifacts, as fundamentally metrics-based monitoring systems like Prometheus produce (quite good) estimates rather than exact answers.
You will very rarely look at range vectors directly. It only comes up when you need to see raw samples when debugging. Almost always you will use a range vector with a function such as rate or avg_over_time that takes a range vector as an argument. Staleness and stale markers have no impact on range vectors; you will get all the normal samples in a given range. Any stale markers also in that range are not returned by a range vector selector.
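For example, as a small illustration that is not from the original examples, avg_over_time can smooth a gauge over a time range:
avg_over_time(process_resident_memory_bytes[1h])
returns, for each matching time series, the average of its samples over the past hour.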

Durations
Durations in Prometheus, as used in PromQL and the configuration file, support several units. You have already seen m for minute.

Suffix  Meaning
ms      Milliseconds
s       Seconds, which have 1,000 milliseconds
m       Minutes, which have 60 seconds
h       Hours, which have 60 minutes
d       Days, which have 24 hours
w       Weeks, which have 7 days
y       Years, which have 365 days

You can use one unit with integers, so 90m is valid but 1h30m and 1.5h are not. Leap years and leap seconds are ignored; 1y is always 60*60*24*365 seconds.
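For instance, the 90m duration mentioned above can be used directly in a range vector selector (shown purely as a syntax illustration; a shorter range would usually be cheaper to evaluate):
rate(node_network_receive_bytes_total[90m])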

Offset
There is a modifier you can use with either type of vector selector called offset. Offset allows you to take the evaluation time for a query, and on a per-selector basis put it further back in time. For example:
process_resident_memory_bytes{job="node"} offset 1h
would get memory usage an hour before the query evaluation time. Offset is not used much in simple queries like this, as it would be easier to change the evaluation time for the whole query instead. Where this can be useful is when you only want to adjust one selector in a query expression. For example:
process_resident_memory_bytes{job="node"}
- process_resident_memory_bytes{job="node"} offset 1h
would give the change in memory usage in the Node exporter over the past hour.13

13 This is susceptible to outliers as it is using only two data points; the deriv function discussed in “deriv” on page 272 is more robust.

The same approach works with range vectors:
rate(process_cpu_seconds_total{job="node"}[5m])
- rate(process_cpu_seconds_total{job="node"}[5m] offset 1h)
offset only allows you to look further back into the past. This is because having "historical" graphs change as new data comes in would be counterintuitive. If you wish to have a negative offset, you can instead move the query evaluation time forward and add offsets to all the other selectors in an expression.

Grafana has a feature to time shift a panel to a different time range than the rest of the dashboard it is a part of. In Grafana 5.0.0 you can find this in the Time range tab of the panel editor.

HTTP API
Prometheus offers a number of HTTP APIs. The ones you will mostly interact with are query and query_range, which give you access to PromQL and can be used by dashboarding tools or custom reporting scripts.
All the endpoints of interest are under /api/v1/, and beyond executing PromQL you can also look up time series metadata and perform administrative actions such as taking snapshots and deleting time series. These other APIs are mainly of interest to dashboarding tools such as Grafana, which can use metadata to enhance its UI, and to those administering Prometheus, but are not relevant to PromQL execution.
query
The query endpoint, or more formally /api/v1/query, executes a PromQL expression at a given time and returns the result. For example, http://localhost:9090/api/v1/query?query=process_resident_memory_bytes will return results like:14
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "process_resident_memory_bytes",
          "instance": "localhost:9090",
          "job": "prometheus"
        },
        "value": [1517929228.782, "91656192"]
      },
      {
        "metric": {
          "__name__": "process_resident_memory_bytes",
          "instance": "localhost:9100",
          "job": "node"
        },
        "value": [1517929228.782, "15507456"]
      }
    ]
  }
}
The status is success, meaning that the query worked. If it had failed, the status would be error, and an error field would provide more details.
This particular result is an instant vector, which you can tell from "resultType": "vector". For each of the samples in the result, the labels are in the metric map and the sample value is in the value list. The first number in the value list is the timestamp of the sample, in seconds, and the second is the actual value of the sample. The value is inside a string as JSON cannot represent nonreal values such as NaN and +Inf.
The time of all the samples will be the query evaluation time, even if the expression consisted of only an instant vector selector. Here the query evaluation time defaulted to the current time, but you can specify a time with the time URL parameter, which can be a UNIX time, in seconds, or an RFC 3339 time. For example, http://localhost:9090/api/v1/query?query=process_resident_memory_bytes&time=1514764800 would evaluate the query at midnight of January 1st, 2018.15
You can also use range vectors with the query endpoint; for example, http://localhost:9090/api/v1/query?query=process_resident_memory_bytes[1m] will return results like:
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "process_resident_memory_bytes",
          "instance": "localhost:9090",
          "job": "prometheus"
        },
        "values": [
          [1518008453.662, "87318528"],
          [1518008463.662, "87318528"],
          [1518008473.662, "87318528"]
        ]
      },
      {
        "metric": {
          "__name__": "process_resident_memory_bytes",
          "instance": "localhost:9100",
          "job": "node"
        },
        "values": [
          [1518008444.819, "17043456"],
          [1518008454.819, "17043456"],
          [1518008464.819, "17043456"]
        ]
      }
    ]
  }
}
This is different than the previous instant vector result as resultType is now matrix, and each time series has multiple values. When used with a range vector, the query endpoint returns the raw samples,16 but be wary of asking for too much data at once because one end or the other may run out of memory.
There is one other type of result called a scalar. Scalars don't have labels; they are just numbers.17 http://localhost:9090/api/v1/query?query=42 would produce:
{
  "status": "success",
  "data": {
    "resultType": "scalar",
    "result": [1518008879.023, "42"]
  }
}
query_range
The query range endpoint at /api/v1/query_range is the main HTTP endpoint of Prometheus you will use, as it is the endpoint to use for graphing. Under the covers, query_range is syntactic sugar (plus some performance optimisations) for multiple calls to the query endpoint.

14 I have pretty printed these JSON results for readability.

15 Unless your Prometheus has been running since then, this will produce an empty result.

16 Excluding stale markers.

17 This is different from {}, which is the identity of a time series with no labels.

In addition to a query URL parameter, you provide query_range with a start time, end time, and a step. The query is first executed at the start time; then it is executed step seconds after the start time; then it is executed two steps after the start time, and so on, stopping when the query evaluation time would exceed the end time. All the instant vector18 results from the different executions are combined into a range vector and returned.
For example, if you wanted to query the number of samples Prometheus ingested in the first 15 minutes of 2018, you could run http://localhost:9090/api/v1/query_range?query=rate(prometheus_tsdb_head_samples_appended_total[5m])&start=1514764800&end=1514765700&step=60, which would produce a result like:
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "instance": "localhost:9090",
          "job": "prometheus"
        },
        "values": [
          [1514764800, "85.07241379310345"],
          [1514764860, "102.6793103448276"],
          [1514764920, "120.30344827586208"],
          [1514764980, "137.93103448275863"],
          [1514765040, "146.7586206896552"],
          [1514765100, "146.7793103448276"],
          [1514765160, "146.8"],
          [1514765220, "146.8"],
          [1514765280, "146.8"],
          [1514765340, "146.8"],
          [1514765400, "146.8"],
          [1514765460, "146.8"],
          [1514765520, "146.8"],
          [1514765580, "146.8"],
          [1514765640, "146.8"],
          [1514765700, "146.8"]
        ]
      }
    ]
  }
}

18 A scalar result is converted into an instant vector with a single time series with no labels with the same value, as if the vector function was used. Range vector results are not supported.

There are a few aspects of this that you should take note of. The first is that the sample timestamps align with the start time and step, as each result comes from a different instant query evaluation, and instant query results always use their evaluation time as the timestamp of results. The second is that the last sample here is at the end time, which is to say that the range is inclusive and the last point will be the end time if it happens to line up with the step.
The third is that I selected a range of 5 minutes for the rate function, which is larger than the step. Since query_range is doing repeated instant query evaluations, there is no state being passed between the evaluations. If the range was smaller than the step, then I would have been skipping over data. For example, a 1-minute range with a 5-minute step would have ignored 80% of the samples. To prevent this you should use ranges that are at least one or two scrape intervals larger than the step you are using.

When using range vectors with query_range, you should usually use a range that is longer than your step in order to not skip data.

The fourth is that some of the samples are not particularly round, and the fact that any numbers are round at all is due to this being a simple test setup. When working with metrics your data is rarely perfectly clean; different targets are scraped at different times and scrapes can be delayed. When performing queries that are not perfectly aligned with the underlying data, or aggregating across multiple hosts, you will rarely get round results. In addition, the nature of floating-point calculations can lead to numbers that are almost round.
Here there is a sample for each step. If it happened that there was no result for a given time series for a step, then that sample would simply be missing in the end result.

If there would be more than 11,000 steps for a query_range, Prometheus will reject the query with an error. This is to prevent accidentally sending extremely large queries to Prometheus, such as a 1-second step for a week. As monitors with a horizontal resolution of over 11,000 pixels are rare, you are unlikely to run into this when graphing. If you are writing reporting scripts, you can split up query_range requests that would hit this limit. This limit allows for a minute of resolution for a week or an hour of resolution for a year, so most of the time it should not apply.

Aligned data
When using tools like Grafana it's common for the alignment of query_range to be based on the current time, and so your results will not align perfectly with minutes, hours, or days. While this is fine when you are looking at dashboards, it is rarely what you want with reporting scripts. query_range does not have an option to specify alignment; instead it is up to you to specify a start parameter with the right alignment.
For example, if you wanted to have samples every hour on the hour in Python, the expression (time.time() // 3600) * 3600 will return the start of the current hour,19 which you can adjust in steps of 3,600 and use as the start and end URL parameters, and then use a step parameter of 3600.
Now that you know the basics of how to use PromQL and execute queries via the HTTP APIs, I will go into more detail on aggregation.
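Before moving on, here is a minimal sketch of such an aligned reporting query, assuming the third-party requests library is installed and that Prometheus is listening on localhost:9090 as in the earlier examples:

import time

import requests

# Align the end of the range to the start of the current hour, then go
# back 24 hours; adjust to taste.
end = (int(time.time()) // 3600) * 3600
start = end - 24 * 3600

resp = requests.get("http://localhost:9090/api/v1/query_range", params={
    "query": "rate(prometheus_tsdb_head_samples_appended_total[5m])",
    "start": start,
    "end": end,
    "step": 3600,
})
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["values"])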

19 // performs integer division in Python.

CHAPTER 14 Aggregation Operators

You already learned about aggregation in "Aggregation Basics" on page 211; however, this is only a small taste of what is possible. Aggregation is important. With applications with thousands or even just tens of instances, it's not practical for you to sift through each instance's metrics individually. Aggregation allows you to summarise metrics not just within one application, but across applications too. There are 11 aggregation operators in all, with 2 optional clauses, without and by. In this chapter you'll learn about the different ways you can use aggregation.
Grouping
Before talking about the aggregation operators themselves, you need to know about how time series are grouped. Aggregation operators work only on instant vectors, and they also output instant vectors. Let's say you have the following time series in Prometheus:
node_filesystem_size_bytes{device="/dev/sda1",fstype="vfat",
    instance="localhost:9100",job="node",mountpoint="/boot/efi"} 100663296
node_filesystem_size_bytes{device="/dev/sda5",fstype="ext4",
    instance="localhost:9100",job="node",mountpoint="/"} 90131324928
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run"} 826961920
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run/lock"} 5242880
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run/user/1000"} 826961920
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run/user/119"} 826961920
There are three instrumentation labels: device, fstype, and mountpoint. There are also two target labels: job and instance. Target and instrumentation labels are a

notion that you and I have, but which PromQL knows nothing about. All labels are the same when it comes to PromQL, no matter where they originated from.
without
Generally you will always know the instrumentation labels, as they rarely change. But you do not always know the target labels in play, as an expression you write might be used by someone else on metrics originating from different scrape configs or Prometheus servers that might also have added in other target labels across a job, such as an env or cluster label. You might even add in such target labels yourself at some point, and it'd be nice not to have to update all your expressions. When aggregating metrics you should usually try to preserve such target labels, and thus you should use the without clause when aggregating to specify the specific labels you want to remove. For example, the query:
sum without(fstype, mountpoint)(node_filesystem_size_bytes)
will group the time series, ignoring the fstype and mountpoint labels, into three groups:
# Group {device="/dev/sda1",instance="localhost:9100",job="node"}
node_filesystem_size_bytes{device="/dev/sda1",fstype="vfat",
    instance="localhost:9100",job="node",mountpoint="/boot/efi"} 100663296

# Group {device="/dev/sda5",instance="localhost:9100",job="node"}
node_filesystem_size_bytes{device="/dev/sda5",fstype="ext4",
    instance="localhost:9100",job="node",mountpoint="/"} 90131324928

# Group {device="tmpfs",instance="localhost:9100",job="node"}
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run"} 826961920
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run/lock"} 5242880
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run/user/1000"} 826961920
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run/user/119"} 826961920
and the sum aggregator will apply within each of these groups, adding up the values of the time series and returning one sample per group:
{device="/dev/sda1",instance="localhost:9100",job="node"} 100663296
{device="/dev/sda5",instance="localhost:9100",job="node"} 90131324928
{device="tmpfs",instance="localhost:9100",job="node"} 2486128640
Notice that the instance and job labels are preserved, as would be any other labels that had been present. This is useful because any alerts you created that included this expression somehow would have additional target labels like env or cluster

preserved. This provides context for your alerts and makes them more useful (also useful when graphing).
The metric name has also been removed, as this is an aggregation of the node_filesystem_size_bytes metric rather than the original metric. When a PromQL operator or function could change the value or meaning of a time series, the metric name is removed.
It is valid to provide no labels to the without. For example:
sum without()(node_filesystem_size_bytes)
will give you the same result as:
node_filesystem_size_bytes
with the only difference being the metric name being removed.
by
In addition to without there is also the by clause. Where without specifies the labels to remove, by specifies the labels to keep. Accordingly, some care is required when using by to ensure you don't remove target labels that you would like to propagate in your alerts or use in your dashboards. You cannot use both by and without in the same aggregation. The query:
sum by(job, instance, device)(node_filesystem_size_bytes)
will produce the same result as the query in the preceding section using without:
{device="/dev/sda1",instance="localhost:9100",job="node"} 100663296
{device="/dev/sda5",instance="localhost:9100",job="node"} 90131324928
{device="tmpfs",instance="localhost:9100",job="node"} 2486128640
However, if instance or job had not been specified, then they wouldn't have defined the group and would not be in the output. Generally, you should prefer to use without rather than by for this reason. There are two cases where you might find by more useful. The first is that unlike without, by does not automatically drop the __name__ label. This allows you to use expressions like:
sort_desc(count by(__name__)({__name__=~".+"}))
to investigate how many time series have the same metric names.1

1 This is potentially an expensive query as it touches every active time series; use it carefully.

The second is cases where you do want to remove any labels you do not know about. For example, info metrics, as discussed in "Info" on page 90, are expected to add more labels over time. To count how many machines were running each kernel version you could use:
count by(release)(node_uname_info)
which on my single machine test setup returns:
{release="4.4.0-101-generic"} 1
You can use sum with an empty by, and can even omit the by. That is to say that:
sum by()(node_filesystem_size_bytes)
and:
sum(node_filesystem_size_bytes)
are exactly equivalent and will give a result like:
{} 92718116864
This is a single time series, and that time series has no labels. If you executed the expression:
sum(non_existent_metric)
the result would be an instant vector with no time series, which will show up in the expression browser's Console tab as "no data."

If the input to an aggregation operator is an empty instant vector, it will output an empty instant vector. Thus count by(foo)(non_existent_metric) will be empty rather than 0, as count and other aggregators don't have any labels to work with. count(non_existent_metric) is consistent with this, and also returns an empty instant vector.
Operators
All 11 aggregation operators use the same grouping logic. You can control this with one of without or by. What differs between aggregation operators is what they do with the grouped data.
sum
sum is the most common aggregator; it adds up all the values in a group and returns that as the value for the group. For example:
sum without(fstype, mountpoint, device)(node_filesystem_size_bytes)

would return the total size of the filesystems of each of your machines. When dealing with counters,2 it is important that you take a rate before aggregating with sum:
sum without(device)(rate(node_disk_read_bytes_total[5m]))
If you were to take a sum across counters directly, the result would be meaningless, as different counters could have been initialised at different times depending on when the exporter started, restarted, or any particular children were first used.
count
The count aggregator counts the number of time series in a group, and returns it as the value for the group. For example:
count without(device)(node_disk_read_bytes_total)
would return the number of disk devices a machine has. My machine only has one disk, so I get:
{instance="localhost:9100",job="node"} 1
Here it is okay not to use rate with a counter, as you care about the existence of the time series rather than its value.

Unique label values
You can also use count to count how many unique values a label has. For example, to count the number of CPUs in each of your machines you could use:
count without(cpu)(count without(mode)(node_cpu_seconds_total))
The inner count3 removes the other instrumentation label, mode, returning one time series per CPU per instance:
{cpu="0",instance="localhost:9100",job="node"} 8
{cpu="1",instance="localhost:9100",job="node"} 8
{cpu="2",instance="localhost:9100",job="node"} 8
{cpu="3",instance="localhost:9100",job="node"} 8
The outer count then returns the number of CPUs that each instance has:
{instance="localhost:9100",job="node"} 4

2 Including the _sum, _count, and _bucket of histograms and summary metrics.

3 The inner aggregation does not have to be count; anything that returns the same set of time series such as sum would also work. This is because the outer count ignores the values of these time series.

If you didn't want a per-machine breakdown, such as if you were investigating if certain labels had high cardinality, you could use the by modifier to look at only one label:
count(count by(cpu)(node_cpu_seconds_total))
which would produce a single sample with no labels, such as:
{} 4
avg
The avg aggregator returns the average of the values4 of the time series in the group as the value for the group. For example:
avg without(cpu)(rate(node_cpu_seconds_total[5m]))
would give you the average usage of each CPU mode for each Node exporter instance, with a result such as:
{instance="localhost:9100",job="node",mode="idle"} 0.9095948275861836
{instance="localhost:9100",job="node",mode="iowait"} 0.005543103448275879
{instance="localhost:9100",job="node",mode="irq"} 0
{instance="localhost:9100",job="node",mode="nice"} 0.0013620689655172522
{instance="localhost:9100",job="node",mode="softirq"} 0.0001465517241379329
{instance="localhost:9100",job="node",mode="steal"} 0
{instance="localhost:9100",job="node",mode="system"} 0.015836206896552414
{instance="localhost:9100",job="node",mode="user"} 0.06054310344827549
This gives you the exact same result as:
sum without(cpu)(rate(node_cpu_seconds_total[5m]))
/
count without(cpu)(rate(node_cpu_seconds_total[5m]))
but it is both more succinct and more efficient to use avg.
When using avg, sometimes you may find that a NaN in the input is causing the entire result to become NaN. This is because any floating-point arithmetic that involves NaN will have NaN as a result. You may wonder how to filter out these NaNs in the input, but that is the wrong question to ask. Usually this is due to attempting to average averages, and one of the denominators of the first averages was 0.5 Averaging averages is not statistically valid, so what you should do instead is aggregate using sum and then finally divide, as shown in "Summary" on page 214.

4 Technically it is called an arithmetic mean. In the unlikely event you need a geometric mean, the ln and exp functions combined with the avg aggregator can be used to calculate that.

5 This is as 0 / 0 = NaN.

stddev and stdvar
The standard deviation is a statistical measure of how spread out a set of numbers is. For example, if you had the numbers [2,4,6] then the standard deviation would be 1.633.6 The numbers [3,4,5] have the same average of 4, but a standard deviation of 0.816.
The main use of the standard deviation in monitoring is to detect outliers. In normally distributed data you would expect that about 68% of samples would be within one standard deviation of the mean, and 95% within two standard deviations.7 If one instance in a job has a metric several standard deviations away from the average, that's a good indication that something is wrong with it. For example, you could find all instances that were at least two standard deviations above the average using an expression such as:
some_gauge
  > ignoring (instance) group_left()
    (
        avg without(instance)(some_gauge)
      + 2 * stddev without(instance)(some_gauge)
    )
This uses one-to-many vector matching, which will be discussed in "Many-to-One and group_left" on page 248. If your values are all tightly bunched then this may return some time series that are more than two standard deviations away, but still operating normally and close to the average. You could add an additional filter that the value has to be at least, say, 20% higher than the average to protect against this. This is also a rare case where it is okay to take an average of an average, such as if you applied this to average latency.
The standard variance is the standard deviation squared8 and has statistical uses.
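One possible way to write that extra 20% filter, sketched here with the same placeholder some_gauge metric rather than a real one, is to combine both conditions with the and operator:
(
    some_gauge
  > ignoring(instance) group_left()
    (
        avg without(instance)(some_gauge)
      + 2 * stddev without(instance)(some_gauge)
    )
)
and
(
    some_gauge
  > ignoring(instance) group_left()
    (1.2 * avg without(instance)(some_gauge))
)
This returns only the time series that are both more than two standard deviations above the average and at least 20% above it.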

6 Prometheus uses the population standard deviation rather than the sample standard deviation, as you will usually be looking at all the values you are interested in rather than a random subset.

7 For nonnormally distributed data, Chebyshev’s inequality provides a weaker bound.

8 If the exponentiation operator had existed at the time I was adding stdvar and stddev, then stdvar would probably not have been added.

min and max
The min and max aggregators return the minimum or maximum value within a group as the value of the group, respectively. The same grouping rules apply as elsewhere, so the output time series will have the labels of the group.9 For example:
max without(device, fstype, mountpoint)(node_filesystem_size_bytes)
will return the size of the biggest filesystem on each instance, which for me returns:
{instance="localhost:9100",job="node"} 90131324928
The max and min aggregators will only return NaN if all values in a group are NaN.10
topk and bottomk
topk and bottomk are different from the other aggregators discussed so far in three ways. First, the labels of the time series they return for a group are not the labels of the group; second, they can return more than one time series per group; and third, they take an additional parameter.
topk returns the k time series with the biggest values, so for example:
topk without(device, fstype, mountpoint)(2, node_filesystem_size_bytes)
would return up to two11 time series per group, such as:
node_filesystem_size_bytes{device="/dev/sda5",fstype="ext4",
    instance="localhost:9100",job="node",mountpoint="/"} 90131324928
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",
    instance="localhost:9100",job="node",mountpoint="/run"} 826961920
As you can see, topk returns input time series with all their labels, including the __name__ label, which holds the metric name. The result is also sorted. bottomk is the same as topk, except that it returns the k time series with the smallest values rather than the k biggest values. Both aggregators will, where possible, avoid returning time series with NaN values.
There is a gotcha when using these aggregators with the query_range HTTP API endpoint. As was discussed in "query_range" on page 225, the evaluation of each step is independent. If you use topk, it is possible that the top time series will change from

9 If you want the input time series returned, use topk or bottomk.

10 In floating-point math, any comparison with NaN always returns false. Aside from causing oddities such as NaN != NaN returning false, a naive implementation of min and max would (and once did) get stuck on a NaN if it was the first value examined.

11 The k is 2 in this case.

step to step. So a topk(5, some_gauge) for a query_range with 1,000 steps could in the worst case return 5,000 different time series. The way to handle this is to first determine which time series you care about across the whole time range by using the query API endpoint with an expression such as:
topk(5, avg_over_time(some_gauge[1h]))
that calculates the biggest time series on average over the past hour (presuming the graph's time range was an hour). You can then use the label values in the returned time series in a matcher for your query_range expression, such as:
some_gauge{instance=~"a|b|c|d|e"}
Grafana does not currently support this fully, but you can use Grafana's templating feature with a PromQL query with a fixed range to select the time series you want.
quantile
The quantile aggregator returns the specified quantile of the values of the group as the group's return value. As with topk, quantile takes a parameter. So, for example, if I wanted to know across the different CPUs in each of my machines what the 90th percentile of the system mode CPU usage is, I could use:
quantile without(cpu)(0.9, rate(node_cpu_seconds_total{mode="system"}[5m]))
which produces a result like:
{instance="localhost:9100",job="node",mode="system"} 0.024558620689654007
This means that 90% of my CPUs are spending no more than about 0.025 seconds per second in the system mode. This would be a more useful query if I had tens of CPUs in my machine, rather than the four it actually has.
In addition to the mean, you could use quantile to show the median, 25th, and 75th percentiles12 on your graphs. For example, for process CPU usage the expressions would be:
# average, arithmetic mean
avg without(instance)(rate(process_cpu_seconds_total[5m]))

# 0.25 quantile, 25th percentile, 1st or lower quartile
quantile without(instance)(0.25, rate(process_cpu_seconds_total[5m]))

# 0.5 quantile, 50th percentile, 2nd quartile, median
quantile without(instance)(0.5, rate(process_cpu_seconds_total[5m]))

12 Also known as the 1st and 3rd quartiles.

# 0.75 quantile, 75th percentile, 3rd or upper quartile
quantile without(instance)(0.75, rate(process_cpu_seconds_total[5m]))
This would give you a sense of how your different instances for a job are behaving, without having to graph each instance individually. This allows you to keep your dashboards readable as the number of underlying instances grows. Personally I find that per-instance graphs break down somewhere around three to five instances.

quantile, histogram_quantile, and quantile_over_time
As you may have noticed by now, there is more than one PromQL function or operator with quantile in the name. The quantile aggregator works across an instant vector in an aggregation group. The quantile_over_time function works across a single time series at a time in a range vector. The histogram_quantile function works across the buckets of one histogram metric child at a time in an instant vector.
count_values
The final aggregation operator is count_values. Like topk it takes a parameter and can return more than one time series from a group. What it does is build a frequency histogram of the values of the time series in the group, with the count of each value as the value of the output time series and the original value as a new label. That's a bit of a mouthful, so I'll show you an example. Say you had a time series called software_version with the following values:
software_version{instance="a",job="j"} 7
software_version{instance="b",job="j"} 4
software_version{instance="c",job="j"} 8
software_version{instance="d",job="j"} 4
software_version{instance="e",job="j"} 7
software_version{instance="f",job="j"} 4
If you evaluated the query:
count_values without(instance)("version", software_version)
on these time series you would get the result:
{job="j",version="7"} 2
{job="j",version="8"} 1
{job="j",version="4"} 3

238 | Chapter 14: Aggregation Operators There were two time series in the group with a value of 7, so a time series with a version="7" plus the group labels was returned with the value 2. The result is similar for the other time series. There is no bucketing involved when the frequency histogram is created; the exact values of the time series are used. Thus this is only really useful with integer values and where there will not be too many unique values. This is most useful with version numbers,13 or with the number of objects of some type that each instance of your application sees. If you have too many versions deployed at once, or different applications are continuing to see different numbers of objects, something might be stuck somewhere. count_values can be combined with count to calculate the number of unique values for a given aggregation group. For example, the number of versions of software that are deployed can be calculated with: count without(version)( count_values without(instance)("version", software_version) ) which in this case would return: {job="j"} 3 You could also combine count_values with count in the other direction; for exam‐ ple, to see how many of your machines had how many disk devices: count_values without(instance)( "devices", count without(device) (node_disk_io_now) ) In my case I have one machine with five disk devices: {devices="5",job="node"} 1 Now that you understand aggregators, we will look at binary operators, like addition and subtraction, and how vector matching works.

13 For versions that cannot be represented as floating-point values, you can use an info metric as discussed in “Info” on page 90.


CHAPTER 15 Binary Operators

You will want to do more with your metrics than simply aggregate them, which is where the binary operators come in. Binary operators are operators that take two operands,1 such as the addition and equality operators. Binary operators in Prometheus allow for more than simple arithmetic on instant vectors; you can also apply a binary operator to two instant vectors with grouping based on labels. This is where the real power of PromQL comes out, allowing classes of analysis that few other metrics systems offer. PromQL has three sets of binary operators: arithmetic operators, comparison opera‐ tions, and logical operators. This chapter will show you how to use them. Working with Scalars In addition to instant vectors and range vectors, there is another type of value known as a scalar.2 Scalars are single numbers with no dimensionality. For example, 0 is a scalar with the value zero, while {} 0 is an instant vector containing a single sample with no labels and the value zero.3

1 In contrast to unary operators, which only take one operand. PromQL has + and - unary operators.

2 Internally, PromQL also has a string type, but this is only used as an argument to count_values, label_replace, and label_join.

3 You may also see the convention {}: 0 to represent a single sample.

241 Arithmetic Operators You can use scalars in arithmetic with an instant vector to change the values in the instant vector. For example: process_resident_memory_bytes / 1024 would return: {instance="localhost:9090",job="prometheus"} 21376 {instance="localhost:9100",job="node"} 13316 which is the process memory usage, in kilobytes.4 You will note that the division operator was applied to all time series in the instant vector returned by the process_resident_memory_bytes selector and that the metric name was removed as it is no longer the metric process_resident_memory_bytes.
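Similarly, a minimal sketch of the same idea for mebibytes would divide by a larger scalar:

process_resident_memory_bytes / (1024 * 1024)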

Even when you are using arithmetic operators in a way that doesn’t change the value, the metric name will still be removed for consis‐ tency. For example, the result of some_gauge + 0 will not have a metric name.

All six arithmetic operations work similarly, with the semantics you’d expect from other programming languages. They are:

• + addition • - subtraction • * multiplication • / division • % modulo • ^ exponentiation

The modulo operator is a floating-point modulo and can return noninteger results if you provide it with noninteger input. For example: 5 % 1.5 will return: 0.5

4 If you are using a dashboarding tool like Grafana, it’s generally best to let it handle creating human-readable units for metrics that are already in base units such as bytes.

242 | Chapter 15: Binary Operators As this example demonstrates, you can also use binary arithmetic operators when both operands are scalars. The result will be a scalar. This is mostly useful for read‐ ability, as it is much easier to understand the intent of (1024 * 1024 * 1024) than it is 1073741824. In addition, you can put the scalar operand on the left side of the operator and an instant vector on the right, so: 1e9 - process_resident_memory_bytes for example would subtract the process memory from a billion. You can also use arithmetic operators with instant vectors on both sides, which is covered in “Vector Matching” on page 245. Comparison Operators The comparison operators are as follows, with the usual meanings:

• == equals • != not equals • > greater than • < less than • >= greater than or equal to • <= less than or equal to

What is a little different is that the comparison operators in PromQL are filtering. That is to say that if you had the samples: process_open_fds{instance="localhost:9090",job="prometheus"} 14 process_open_fds{instance="localhost:9100",job="node"} 7 and used an instant vector in a comparison with a scalar such as in the expression: process_open_fds > 10 then you would get the result: process_open_fds{instance="localhost:9090",job="prometheus"} 14 As the value can’t change, the metric name has been preserved. When comparing a scalar and an instant vector it doesn’t matter which side each is on, it is always ele‐ ments of the instant vector that are returned.

As PromQL deals with floating-point numbers, some care is required when using == and !=. Floating-point calculations can produce results that are very slightly different depending on exactly what the values are and in what order the operations are performed. If you want to do equality on noninteger values, it is better to instead check that their difference is less than some small number, called an epsilon. For example, you could do:

(some_gauge - 1) < 1e-6 > -1e-6

to check if a gauge has a value of one allowing for inaccuracy of one in a million.
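An equivalent way to write this check, using the abs function covered in Chapter 16, is:

abs(some_gauge - 1) < 1e-6

The comparison still filters, so only samples whose value is within one in a million of one are returned.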

You cannot do a filtering comparison between two scalars, as to be consistent with arithmetic operations between two scalars it'd have to return a scalar. This doesn't allow for filtering, as there's no way to have an empty scalar like you can have an empty instant vector.

bool modifier

Filtering comparisons are primarily useful in alerting rules, as discussed in Chapter 18, and generally to be avoided elsewhere.5 I will show you why. Continuing on from the preceding example, say you wanted to see how many of your processes for each job had more than ten open file descriptors. The obvious way to do this would be:

count without(instance)(process_open_fds > 10)

which would return the result:

{job="prometheus"} 1

This correctly indicates that there is one Prometheus process with more than 10 open file descriptors. It does not report that the Node exporter has zero such processes. This can be a subtle gotcha, because as long as one time series is not filtered away everything seems to be okay. What you need is some way to do the comparison but not have it filter. This is what the bool modifier does; for each comparison it returns a 0 for false or a 1 for true. For example:

process_open_fds > bool 10

5 It is possible to use filtering correctly with careful application of the or operator, but it’s more complicated and error prone.

will return:

{instance="localhost:9090",job="prometheus"} 1
{instance="localhost:9100",job="node"} 0

which as expected has one output sample per sample in the input instant vector. From there you can sum up to get the number of processes for each job that have more than 10 open file descriptors:

sum without(instance)(process_open_fds > bool 10)

which produces the result you originally wanted:

{job="prometheus"} 1
{job="node"} 0

You could use a similar approach to find the proportion of machines with more than four disk devices:

avg without(instance)(
    count without(device)(node_disk_io_now) > bool 4
)

This works by first using a count aggregation to find the number of disks reported by each Node exporter, then seeing how many have more than four, and finally averaging across machines to get the proportion. The trick here is that the values returned by the bool modifier are all 0 and 1, so the count is the total number of machines, and the sum is the number of machines meeting the criteria. The avg is the sum divided by the count, giving you a ratio or proportion. The bool modifier is the only way you can compare scalars, as:

42 <= bool 13

will return:

0

where the 0 indicates false.

Vector Matching

Using operators between scalars and instant vectors will cover many of your needs, but using operators between two instant vectors is where PromQL's power really starts to shine. When you have a scalar and an instant vector, it is obvious that the scalar can be applied to each sample in the vector. With two instant vectors, which samples should apply to which other samples? This matching of the instant vectors is known as vector matching.

Vector Matching | 245 One-to-One In the simplest cases there will be a one-to-one mapping between your two vectors. Say that you had the following samples: process_open_fds{instance="localhost:9090",job="prometheus"} 14 process_open_fds{instance="localhost:9100",job="node"} 7 process_max_fds{instance="localhost:9090",job="prometheus"} 1024 process_max_fds{instance="localhost:9100",job="node"} 1024 Then when you evaluated the expression: process_open_fds / process_max_fds you would get the result: {instance="localhost:9090",job="prometheus"} 0.013671875 {instance="localhost:9100",job="node"} 0.0068359375 What has happened here is that samples with exactly the same labels, except for the metric name in the label __name__, were matched together. That is to say that the two samples with the labels {instance="localhost:9090",job="prometheus"} got matched together, and the two samples with the labels {instance="localhost: 9100",job="node"} got matched together. In this case there was a perfect match, with each sample on both sides of the operator being matched. If a sample on one side had no match on the other side, then it would not be present in the result, as binary operators need two operands.
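One-to-one matching is also how you would calculate something like an error ratio. As a sketch, assuming hypothetical requests_failed_total and requests_total counters that share the same instrumentation labels:

rate(requests_failed_total[5m]) / rate(requests_total[5m])

Each failed-request rate is divided by the total-request rate of the sample with exactly the same labels.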

If a binary operator returns an empty instant vector when you were expecting a result, it is probably because the labels of the samples in the operands don’t match. This is often due to a label that is present on one side of the operator but not the other.

Sometimes you will want to match two instant vectors whose labels do not quite match. Similar to how aggregation allows you to specify which labels matter, as discussed in "Grouping" on page 229, vector matching also has clauses controlling which labels are considered. You can use the ignoring clause to ignore certain labels when matching, similar to how without works for aggregation. Say you were working with node_cpu_seconds_total, which has cpu and mode as instrumentation labels, and wanted to know what proportion of time was being spent in the idle mode for each instance. You could use the expression:

246 | Chapter 15: Binary Operators sum without(cpu)(rate(node_cpu_seconds_total{mode="idle"}[5m])) / ignoring(mode) sum without(mode, cpu)(rate(node_cpu_seconds_total[5m])) This will give you a result such as: {instance="localhost:9100",job="node"} 0.8423353718871361 Here the first sum produces an instant vector with a mode="idle" label, whereas the second sum produces an instant vector with no mode label. Usually vector matching will fail to match the samples, but with ignoring(mode) the mode label is discarded when the vectors are being grouped, and matching succeeds. As the mode label was not in the match group it is not in the output.6

You can tell the preceding expression is correct in terms of vector matching by inspection, without having to know anything about the underlying time series. The removal of cpu is balanced on both sides, and ignoring(mode) handles one side having a mode and the other not. This can be trickier when there are different time series with differ‐ ent labels in play, but looking at expressions in terms of how the labels flow is a handy way for you to spot errors.

The on clause allows you to consider only the labels you provide, similar to how by works for aggregation. The expression:

sum by(instance, job)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
/ on(instance, job)
sum by(instance, job)(rate(node_cpu_seconds_total[5m]))

will produce the same result as the preceding expression,7 but as with by, the on clause has the disadvantage that you need to know all labels that are currently on the time series or that may be present in the future in other contexts. The value that is returned for the arithmetic operators is the result of the calculation, but you may be wondering what happens for the comparison operators when there are two instant vectors. The answer is that the value from the left-hand side is returned. For example, the expression:

process_open_fds > (process_max_fds * .5)

6 The cpu label was aggregated away by both sums, so is not present in the output either.

7 You could exclude the on(instance, job) here as the left- and right-hand side both have only instance and job labels.

Vector Matching | 247 will return for you the value of process_open_fds for all instances whose open file descriptors are more than halfway to the maximum.8 If you had instead used: (process_max_fds * .5) < process_open_fds you would get half the maximum file descriptors as the return value. While the result will have the same labels, this value is less useful when alerting9 or when used in a dashboard! In general, a current value is more informative than the limit, so you should try to structure your math so that the most interesting number is on the left- hand side of a comparison. Many-to-One and group_left If you were to remove the matcher on mode from the preceding section and try to evaluate: sum without(cpu)(rate(node_cpu_seconds_total[5m])) / ignoring(mode) sum without(mode, cpu)(rate(node_cpu_seconds_total[5m])) you would get the error: multiple matches for labels: many-to-one matching must be explicit (group_left/group_right) This is because the samples no longer match one-to-one, as there are multiple sam‐ ples with different mode labels on the left-hand side for each sample on the right- hand side. This can be a subtle failure mode, as a time series may appear later on that breaks your expression. You can see that this is a potential issue, as looking at the label flow there’s nothing restricting the mode label to one potential value10 on the left- hand side. Errors like this are usually due to incorrectly written expressions, so PromQL does not attempt to do anything smart by default. Instead, you must specifically request that you want to do many-to-one matching using the group_left modifier.

8 Running out of file descriptors can break applications in fun ways, and you should usually try to ensure that your applications always have enough.

9 Alert templates have ready access to the value of an alert’s PromQL expression. This is discussed in “Annota‐ tions and Templates” on page 298.

10 A missing mode label due to aggregating it away would count as a single label value of the empty string.

248 | Chapter 15: Binary Operators group_left lets you specify that there can be multiple matching samples in the group of the left-hand operand.11 For example: sum without(cpu)(rate(node_cpu_seconds_total[5m])) / ignoring(mode) group_left sum without(mode, cpu)(rate(node_cpu_seconds_total[5m])) will produce one output sample for each different mode label within each group on the left-hand side: {instance="localhost:9100",job="node",mode="irq"} 0 {instance="localhost:9100",job="node",mode="nice"} 0 {instance="localhost:9100",job="node",mode="softirq"} 0.00005226389784152013 {instance="localhost:9100",job="node",mode="steal"} 0 {instance="localhost:9100",job="node",mode="system"} 0.01720353303949279 {instance="localhost:9100",job="node",mode="user"} 0.10345203045243238 {instance="localhost:9100",job="node",mode="idle"} 0.8608691486211044 {instance="localhost:9100",job="node",mode="iowait"} 0.01842302398912871 group_left always takes all of its labels from samples of your operand on the left- hand side. This ensures that the extra labels that are on the left side that require this to be many-to-one vector matching are preserved.12 This is much easier than having to run a one-to-one expression with a matcher for each potential mode label: group_left does it all for you in one expression. You can use this approach to determine the proportion each label value within metric repre‐ sents of the whole, as shown in the preceding example, or to compare a metric from a leader of a cluster against the replicas. There is another use for group_left—adding labels from info metrics to other met‐ rics from a target. Instrumentation with info metrics was covered in “Info” on page 90. The role of info metrics is to allow you to provide labels that would be useful for a target or metric to have but that would clutter up the metric if you were to use it as a normal label. The prometheus_build_info metric, for example, provides you with build informa‐ tion from Prometheus: prometheus_build_info{branch="HEAD",goversion="go1.10", instance="localhost:9090",job="prometheus", revision="bc6058c81272a8d938c05e75607371284236aadc",version="2.2.1"}

11 There can still only be one sample per group on the right-hand side of the operand, as group_left only ena‐ bles many-to-one matching, not many-to-many matching. 12 If the labels from the right-hand side were used you would get the same labels for each sample from the groups on the left, which would clash.

Vector Matching | 249 You can join this with metrics such as up: up * on(instance) group_left(version) prometheus_build_info which will produce a result like: {instance="localhost:9090",job="prometheus",version="2.2.1"} 1 You can see that the version label has been copied over from the right-hand operand to the left-hand operand as was requested by group_left(version), in addition to returning all the labels from the left-hand operand as group_left usually does. You can specify as many labels as you like to group_left, but usually it’s only one or two.13 This approach works no matter how many instrumentation labels the left-hand side has, as the vector matching is many-to-one. The preceding expression used on(instance), which relies on each instance label only being used for one target within your Prometheus. While this is often the case, it isn’t always, so you may also need to add other labels such as job to the on clause. prometheus_build_info applies to a whole target. There are also info-style14 metrics such as node_hwmon_sensor_label mentioned in “Hwmon Collector” on page 120 that apply to children of a different metric: node_hwmon_sensor_label{chip="platform_coretemp_0",instance="localhost:9100", job="node",label="core_0",sensor="temp2"} 1 node_hwmon_sensor_label{chip="platform_coretemp_0",instance="localhost:9100", job="node",label="core_1",sensor="temp3"} 1

node_hwmon_temp_celsius{chip="platform_coretemp_0",instance="localhost:9100", job="node",sensor="temp1"} 42 node_hwmon_temp_celsius{chip="platform_coretemp_0",instance="localhost:9100", job="node",sensor="temp2"} 42 node_hwmon_temp_celsius{chip="platform_coretemp_0",instance="localhost:9100", job="node",sensor="temp3"} 41 The node_hwmon_sensor_label metric has children that match with some (but not all) of the time series in node_hwmon_temp_celsius. In this case you know that there is only one additional label (which is called label), so you can use ignoring with group_left to add this label to the node_hwmon_temp_celsius samples:

13 There’s no way for you to request all the labels to be copied over, as then you would no longer know what labels the output metric had.

14 The convention for whether a metric that has a single info-style label should have an _info suffix is not fully resolved yet.

250 | Chapter 15: Binary Operators node_hwmon_temp_celsius * ignoring(label) group_left(label) node_hwmon_sensor_label which will produce results such as: {chip="platform_coretemp_0",instance="localhost:9100", job="node",label="core_0",sensor="temp2"} 42 {chip="platform_coretemp_0",instance="localhost:9100", job="node",label="core_1",sensor="temp3"} 41 Notice that there is no sample with sensor="temp1" as there was no such sample in node_hwmon_sensor_label (how to match sparse instant vectors will be covered in “or operator” on page 251). There is also a group_right modifier that works in the same way as group_left except that the one and the many side are switched, with the many side now being your operand on the right-hand side. Any labels you specify in the group_right modifier are copied from the left to the right. For the sake of consistency, you should prefer group_left. Many-to-Many and Logical Operators There are three logical or set operators you can use:

• or union • and intersection • unless set subtraction

There is no not operator, but the absent function discussed in “Missing Series and absent” on page 266 serves a similar role. All the logical operators work in a many-to-many fashion, and they are the only operators that work many-to-many. They are different from the arithmetic and com‐ parison operators you have already seen in that no math is performed; all that mat‐ ters is whether a group contains samples. or operator In the preceding section, node_hwmon_sensor_label did not have a sample to go with every node_hwmon_temp_celsius, so results were only returned for samples that were present in both instant vectors. Metrics with inconsistent children, or whose children are not always present, are tricky to work with, but you can deal with them using the or operator. How the or operator works is that for each group where the group on the left-hand side has samples, then they are returned; otherwise, the samples in the group on the

Vector Matching | 251 right-hand side are returned. If you are familiar with SQL this operator can be used in a similar way as the SQL COALESCE function, but with labels. Continuing the example from the preceding section, or can be used to substitute the missing time series from node_hwmon_sensor_label. All you need is some other time series that has the labels you need, which in this case is node_hwmon_temp_celsius. node_hwmon_temp_celsius does not have the label label, but all the other labels match up so you can ignore this using ignoring: node_hwmon_sensor_label or ignoring(label) (node_hwmon_temp_celsius * 0 + 1) The vector matching produced three groups. The first two groups had a sample from node_hwmon_sensor_label so that was what was returned, including the metric name as there was nothing to change it. For the third group, however, which included sensor="temp1", there was no sample in the group for the left-hand side, so the val‐ ues in the group from the right-hand side were used. Because arithmetic operators were used on the value, the metric name was removed.

x * 0 + 1 will change all15 the values of the x instant vector to 1. This is also useful when you want to use group_left to copy labels, as 1 is the identity element for multiplication, which is to say it does not change the value you are multiplying.

This expression can now be used in the place of node_hwmon_sensor_label: node_hwmon_temp_celsius * ignoring(label) group_left(label) ( node_hwmon_sensor_label or ignoring(label) (node_hwmon_temp_celsius * 0 + 1) ) which will produce: {chip="platform_coretemp_0",instance="localhost:9100", job="node",sensor="temp1"} 42 {chip="platform_coretemp_0",instance="localhost:9100", job="node",label="core_0",sensor="temp2"} 42 {chip="platform_coretemp_0",instance="localhost:9100", job="node",label="core_1",sensor="temp3"} 41

15 NaN will stay as NaN, but in practice there will be another time series with the same labels and no NaN values you could use instead.

252 | Chapter 15: Binary Operators The sample with sensor="temp1" is now present in your result. It has no label called label, which is the same as saying that that label label has the empty string as a value. In simpler cases you will be working with metrics without any instrumentation labels. For example, you might be using the textfile collector, as covered in “Textfile Collec‐ tor” on page 122, and expecting it to expose a metric called node_custom_metric. In the event that metric doesn’t exist you would like to return 0 instead. In cases like this, you can use the up metric that is associated with every target: node_custom_metric or up * 0 This has a small problem in that it will return a value even for a failed scrape, which is not how scraped metrics work.16 It will also return results for other jobs. You can fix this with a matcher and some filtering: node_custom_metric or (up{job="node"} == 1) * 0 Another way you can use the or operator is to return the larger of two series: (a >= b) or b If a is larger it will be returned by the comparison, and then the or operator since the group on the left-hand side was not empty. If on the other hand b is larger, then the comparison will return nothing, and or will return b as the group on the left-hand side was empty.
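By the same reasoning, you could take the smaller of two series with:

(a <= b) or b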

unless operator The unless operator does vector matching in the same way as the or operator, work‐ ing based on whether groups from the right and left operands are empty or have sam‐ ples. The unless operator returns the left-hand group, unless the right-hand group has members, in which case it returns no samples for that group. You can use unless to restrict what time series are returned based on an expression. For example, if you wanted to know the average CPU usage of processes except those using less than 100 MB of resident memory you could use the expression: rate(process_cpu_seconds_total[5m]) unless process_resident_memory_bytes < 100 * 1024 * 1024

16 up is not a scraped metric; Prometheus adds it after every scrape whether the scrape succeeds or fails.

Vector Matching | 253 unless can also be used to spot when a metric is missing from a target. For example: up{job="node"} == 1 unless node_custom_metric would return a sample for every instance that was missing the node_custom_metric metric, which you could use in alerting. By default, as with all binary operators, unless looks at all labels when grouping. If node_custom_metric had instrumentation labels you could use on or ignoring to check that at least one relevant time series existed without having to know the values of the other labels: up == 1 unless on (job, instance) node_custom_metric Even if there are multiple samples from the right operand in a group, this is okay as unless uses many-to-many matching.
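A related sketch (assuming a hypothetical maintenance_mode gauge that is set to 1 on instances undergoing maintenance) is using unless to suppress results you do not currently care about:

up == 0 unless on(instance) maintenance_mode == 1

This would return down instances, except for those flagged as being in maintenance.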

and operator The and operator is the opposite of the unless operator. It returns a group from the left-hand operand only if the matching right-hand group has samples; otherwise, it returns no samples for that match group. You can think of it as an if operator.17 You will use the and operator most commonly in alerting to specify more than one condition. For example, you might want to return when both latency is high and there is more than a trickle of user requests. To do this for Prometheus for handlers that were taking over a second on average and had at least one request per second18 you could use: ( rate(http_request_duration_microseconds_sum{job="prometheus"}[5m]) / rate(http_request_duration_microseconds_count{job="prometheus"}[5m]) ) > 1000000 and rate(http_request_duration_microseconds_count{job="prometheus"}[5m]) > 1

17 Prior to Prometheus 2.x, PromQL had an IF keyword that was used in alerting, so while I had wondered if renaming the and operator to if would have been a good idea, it was not possible.

18 This was the one remaining latency metric in the Prometheus server that was not using seconds, but from 2.3.0 was replaced by a metric called prometheus_http_request_duration_seconds.

254 | Chapter 15: Binary Operators This will return a sample for every individual handler on every prometheus job, so it could get a little spammy even with the one request per second restriction. Usually you would want to aggregate across a job when alerting. You can use on and ignoring with the and operator, as you can with the other binary operators. In particular, on() can be used to have a condition that has no common labels at all between the two operands. You can use this, for example, to limit the time of day an expression will return results for: ( rate(http_request_duration_microseconds_sum{job="prometheus"}[5m]) / rate(http_request_duration_microseconds_count{job="prometheus"}[5m]) ) > 1000000 and rate(http_request_duration_microseconds_count{job="prometheus"}[5m]) > 1 and on() hour() > 9 < 17 The hour function is covered in “minute, hour, day_of_week, day_of_month, days_in_month, month, and year” on page 263; it returns an instant vector with one sample with no labels and the hour of the UTC day of the query evaluation time as the value. Operator Precedence When evaluating an expression with multiple binary operators, PromQL does not simply go from left to right. Instead, there is an order of operators that is largely the same as the order used in other languages:

1. ^ 2. * / % 3. + - 4. == != > < >= <= 5. unless and 6. or

For example, a or b * c + d is the same as a or ((b * c) + d). All operators except ^ are left-associative. That means that a / b * c is the same as (a / b) * c, but a ^ b ^ c is a ^ (b ^ c). You can use parentheses to change the order of evaluation. I also recommend adding parentheses where the evaluation order may not be immediately clear for an expres‐ sion, as not everyone will have memorised the operator precedence.
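As a more concrete sketch (with hypothetical counters a_total and b_total), the expression:

rate(a_total[5m]) + rate(b_total[5m]) > 10 and on() hour() > 9

is evaluated as:

((rate(a_total[5m]) + rate(b_total[5m])) > 10) and on() (hour() > 9)

since arithmetic binds more tightly than comparisons, which in turn bind more tightly than and.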

Now that you understand both aggregators and operators, let's look at the final part of PromQL: functions.

CHAPTER 16 Functions

PromQL has 46 functions as of 2.2.1, and offers you a wide variety of functionality, from common math to functions specifically for dealing with counter and histogram metrics. In this chapter you will learn about how all the functions work and how they can be used. Almost all PromQL functions return instant vectors, and the two that don’t (time and scalar) return scalars. No functions return range vectors, though multiple func‐ tions, including rate and avg_over_time that you have already seen, take a range vector as input. Put another way, functions generally work either across the samples of a single time series at a time or across the samples of an instant vector. There is no single function or feature of PromQL that you can use to process an entire range vector at once. PromQL is statically typed, functions do not change their return value based on the input types. In fact, the input types for each function are also fixed. Where a function needs to work with two different types, two different names are used. For example, you use the avg aggregator on instant vectors and the avg_over_time function on range vectors. There are no official categories for the functions, but I have grouped related functions together. Changing Type At times you will have a vector but need a scalar, or vice versa. There are two func‐ tions that allow you to do so.

vector

The vector function takes a scalar value, and converts it into an instant vector with one labelless sample and the given value. For example, the expression:

vector(1)

will produce:

{} 1

This is useful if you need to ensure an expression returns a result, but can't depend on any particular time series to exist. For example:

sum(some_gauge) or vector(0)

will always return one sample, even if some_gauge has no samples. Depending on the use case, "bool modifier" on page 244 may be a better choice than the or operator (see "or operator" on page 251).

scalar

The scalar function takes an instant vector with a single sample and converts it to a scalar with the value the input sample had. If there is not exactly one sample, then NaN will be returned to you. This is mostly useful when you are working with scalar constants but need math functions that only work on instant vectors. For example, if you wanted the natural logarithm of two as a scalar, rather than typing out 0.6931471805599453 and hoping anyone reading it recognised the significance of the number, you could use:

scalar(ln(vector(2)))

This can also make certain expressions simpler to write. For example, if you wanted to see which servers were started in the current year, you could do:

year(process_start_time_seconds) == scalar(year())

rather than:

year(process_start_time_seconds) == on() group_left year()

as scalar comparisons are a little easier to understand than vector matching with group_left, and this is okay as you know that year here will only ever return one sample.

But use of the scalar function should be limited because using scalar loses all of your labels and with it your ability to do vector matching. For example:

sum(rate(node_cpu_seconds_total{mode!="idle",instance="localhost:9100"}[5m]))
/
scalar(count(node_cpu_seconds_total{mode="idle",instance="localhost:9100"}))

will give you the proportion of time a machine's CPU is not idle, but you would then have to alter and reevaluate this expression for every single instance. Taking advantage of the full power of PromQL, you can do:

sum without (cpu, mode)(
    rate(node_cpu_seconds_total{mode!="idle"}[5m])
)
/
count without(cpu, mode)(node_cpu_seconds_total{mode="idle"})

and calculate the proportion of nonidle CPU for all your machines at once.

Math

The math functions perform standard mathematical operations on instant vectors, such as calculating absolute values or taking a logarithm. Each sample in the instant vector is handled independently, and the metric name is removed in the return value.

abs

abs takes an instant vector and returns the absolute value for each of its values, which is to say any negative numbers are changed to positive numbers. The expression:

abs(process_open_fds - 15)

will return how far away each process's open file descriptors count is from 15. Counts of 5 and 25 would both return 10.

ln, log2, and log10

The functions ln, log2, and log10 take an instant vector and return the logarithm of the values, using different bases for the logarithm: Euler's number e, 2, and 10, respectively. ln is also known as the natural logarithm.

These functions can be used to get an idea of the different orders of magnitude of numbers. For example, to calculate the number of 9s1 of successes an API endpoint had over the past hour, you could do:

log10(
    sum without(instance)(rate(requests_failed_total[1h]))
  /
    sum without(instance)(rate(requests_total[1h]))
) * -1
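To make the arithmetic concrete: if 1 request in every 1,000 failed over the hour, the failure ratio is 0.001, log10(0.001) is -3, and multiplying by -1 gives 3, that is, three 9s of success.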

If you want a logarithm to a different base, you can use the change of base formula. For example, for a logarithm base three on the instant vector x, you would use: ln(x) / ln(3)

These can also be useful for graphing in certain circumstances where normal linear graphs can’t suitably represent a large variance in values. However, it is usually best to rely on the in-built logarithm graphing options in tools such as Grafana rather than using these functions, as they tend to gracefully handle edge cases such as nega‐ tive logarithms returning NaN. exp The exp function provides the natural exponent, and is the inverse to the ln function. For example: exp(vector(1)) returns: {} 2.718281828459045 which is Euler’s number, e. sqrt

The sqrt function returns a square root of the values in an instant vector. For example: sqrt(vector(9))

1 A 99% success rate is two 9s.

260 | Chapter 16: Functions will return: {} 3 sqrt predates the exponent operator ^, so this is equivalent to: vector(9) ^ 0.5

If you need other roots you can use the same approach. For exam‐ ple, the cube or third root can be calculated with: vector(9) ^ (1/3) ceil and floor ceil and floor allow you to round the values in an instant vector. ceil always rounds up to the nearest integer, and floor always rounds down. For example: ceil(vector(0.1)) will return: {} 1 round round rounds the values in an instant vector to the nearest integer. If you provide a value that is exactly halfway between two integers, it is rounded up. That is to say that: round(vector(5.5)) will return: {} 6 round is also one of the functions that you can optionally provide with an additional argument. The additional argument is a scalar, and the values will be rounded to the nearest multiple of this number: round(vector(2446), 1000) will return: {} 2000 for example. This is equivalent to: round(vector(2446) / 1000) * 1000 but easier for you to use and understand.
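The to_nearest argument does not have to be a whole number. As a small sketch, reusing the file descriptor metrics from Chapter 15, you could round a usage ratio to two decimal places for a report:

round(process_open_fds / process_max_fds, 0.01)

As with any floating-point arithmetic, the result may differ from the exact multiple by a tiny amount.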

clamp_max and clamp_min

Sometimes you will find that a metric returns spurious values well outside the normal range, such as a gauge that you expect to be positive occasionally being massively negative. clamp_max and clamp_min allow you to put upper and lower bounds, respectively, on the values in an instant vector. For example, if you didn't believe that your processes could have fewer than 10 open file descriptors you could use:

clamp_min(process_open_fds, 10)

which would produce a result like:

{instance="localhost:9090",job="prometheus"} 46
{instance="localhost:9100",job="node"} 10

Time and Date

Prometheus offers you several functions dealing with time, most of which are convenience functions around time to save you from having to implement date-related logic yourself. Prometheus works entirely in UTC, and has no notion of time zones.

time

The time function is the most basic time-related function. It returns the evaluation time of the query as seconds since the Unix epoch2 as a scalar. For example:

time()

might return:

1518618359.529

If you were to use time with the query_range endpoint, then every result would be different, as each step has a different evaluation time. The Prometheus best practice is to expose the Unix time in seconds at which something of interest happened, and not how long it has been since it happened. This is more reliable, as it's not susceptible to failure to update the metric. The time function then lets you convert these to durations. For example, if you wanted to see how long your processes have been running you would use:

time() - process_start_time_seconds

2 Midnight January 1st, 1970 UTC.

which will return a result such as:

{instance="localhost:9090",job="prometheus"} 313.5699999332428
{instance="localhost:9100",job="node"} 322.25999999046326

Here both Node exporter and Prometheus have been running for a bit over 5 minutes. If you had a batch job pushing the last time it succeeded to the Pushgateway, as discussed in "Pushgateway" on page 71, you could find jobs that hadn't succeeded in the past hour with:

time() - my_job_last_success_seconds > 3600

minute, hour, day_of_week, day_of_month, days_in_month, month, and year

time covers most use cases, but sometimes you will want to have logic based on the clock or calendar. Converting to minutes and hours from time isn't too difficult,3 but beyond that you have to consider issues like leap days. All of these functions return the given value for the query evaluation time as an instant vector with one sample and no labels. As I write this it is currently 13:39 on Wednesday, February 14, 2018, in the UTC timezone. The outputs of these functions when evaluated at this time are:

Expression         Result
minute()           {} 39
hour()             {} 13
day_of_week()      {} 3
day_of_month()     {} 14
days_in_month()    {} 28
month()            {} 2
year()             {} 2018

day_of_week starts with 0 for Sunday, so the 3 here is Wednesday. If you wanted to check if today was the last day of the month you could compare the output of day_of_month to days_in_month. You may be wondering why these functions don't return scalars, as that'd seem more convenient to work with. The answer is that these functions all take an optional argu‐

3 Minutes are floor(vector(time() / 60 % 60)), for example.

Time and Date | 263 ment4 so that you can pass in instant vectors. For example, to see what year your pro‐ cesses started in you could use: year(process_start_time_seconds) which would produce a result such as: {instance="localhost:9090",job="prometheus"} 2018 {instance="localhost:9100",job="node"} 2018 This could also be used to count how many processes were started this month: sum( (year(process_start_time_seconds) == bool scalar(year())) * (month(process_start_time_seconds) == bool scalar(month())) ) Here I am taking advantage of the fact that the multiplication operator acts like an and operator when used on booleans with the value 1 for true and 0 for false. timestamp The timestamp function is different from the other time functions in that it looks at the timestamp of the samples in an instant vector rather than the values. As was men‐ tioned in “Instant Vector” on page 219 and “query” on page 223, the timestamps for samples returned from all operators, functions, the query_range HTTP API, and query HTTP API when it returns an instant vector will be the query evaluation time. However, the timestamp of samples in an instant vector from an instant vector selec‐ tor will be the actual timestamps.5 The timestamp function allows you to access these. For example, you can see when the last scrape started for each target with: timestamp(up) This is because the default timestamp for data from a scrape is the time that the scrape started. Similarly the timestamp for samples from recording rules, as covered in Chapter 17, is the rule group execution time. If you want to see raw data with samples for debugging, using a range vector selector with the query HTTP API is best, but timestamp does have some uses. For example: node_time_seconds - timestamp(node_time_seconds) would return the difference between when the scrape of the Node exporter was started by Prometheus and what time the Node exporter thought was the current

4 The default value of this argument is vector(time()).

5 As will the timestamps of samples if you provide a range vector selector to the query HTTP API.

264 | Chapter 16: Functions time. While this isn’t 100% accurate (it will vary with machine load), it will allow you to know if time is out of sync by a few seconds without needing a 1-second scrape interval. Labels In an ideal world the label names and label values used by different parts of your sys‐ tem would be consistent, so you wouldn’t have customer in one place and cust in another. While it is best to resolve such inconsistencies in the source code, or failing that with metric_relabel_configs as discussed in “metric_relabel_configs” on page 148, this is not always possible. Thus the two label functions allow you to change labels. label_replace label_replace allows you to do regular expression substitution on label values. For example, if you needed the device label on node_disk_read_bytes_total to be dev instead for vector matching to work as you needed you could do:6 label_replace(node_disk_read_bytes_total, "dev", "${1}", "device", "(.*)") which would return a result like: node_disk_read_bytes_total{dev="sda",device="sda",instance="localhost:9100", job="node"} 4766305792 Unlike most functions, label_replace does not remove the metric name, as it is pre‐ sumed that you are doing something unusual if you have to resort to label_replace, and removing the metric name could make that harder for you. The arguments to label_replace are the instant vector input, the name of the output label, the replacement, the name of the source label, and the regular expression. label_replace is similar to the replace relabelling action, but you can only use one label as a source label. If the regular expression does not match for a given sample, then that sample is returned unchanged. label_join label_join allows you to join label values together, similarly to how source_labels is handled in relabelling. For example, if you wanted to join the job and instance labels together into a new label you could do: label_join(node_disk_read_bytes_total, "combined", "-", "instance", "job")

6 In reality, as node_disk_read_bytes_total is a counter, you would use rate first and then label_replace.

Labels | 265 which would return a result such as: node_disk_read_bytes_total{combined="localhost:9100-node",device="sda", instance="localhost:9100",job="node"} 4766359040 As with label_replace, label_join does not remove the metric name. The argu‐ ments are the instant vector input, the name of the output label, the separator, and then zero or more label names. You could combine label_join with label_replace to provide the full functionality of the replace relabel action, but at that point you should seriously consider metric_relabel_configs or fixing the source metrics instead. Missing Series and absent As mentioned in “Many-to-Many and Logical Operators” on page 251, the absent function plays the role of a not operator. If you pass a nonempty instant vector, it returns an empty instant vector. If you pass an empty instant vector, it returns an instant vector with one sample and a value of 1. You might expect that this sample has no labels, since there are no labels to work with. However, absent is a little smarter than that, and if the argument is an instant vector selector, it uses the labels from any equality matchers present.

Expression                                       Result
absent(up)                                       empty instant vector
absent(up{job="prometheus"})                     empty instant vector
absent(up{job="missing"})                        {job="missing"} 1
absent(up{job=~"missing"})                       {} 1
absent(non_existent)                             {} 1
absent(non_existent{job="foo",env="dev"})        {job="foo",env="dev"} 1
absent(non_existent{job="foo",env="dev"} * 0)    {} 1

absent is useful for detecting if an entire job has gone missing from service discovery. Alerting on up == 0 doesn't work too well when you have no targets to produce up metrics! Even when using static_configs it can be wise to have such an alert in case generation of your prometheus.yml goes awry. If you want instead to alert on specific metrics that are missing from a target, you can use unless, which was covered in "unless operator" on page 253.
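For example, a sketch of an alerting expression to catch the Node exporter job disappearing entirely would be:

absent(up{job="node"})

which returns a sample, and so can fire an alert, only when there are no up samples at all for that job.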

Sorting with sort and sort_desc

PromQL generally does not specify the order of elements within an instant vector, so it can change from evaluation to evaluation. But if you use sort or sort_desc as the last thing that is evaluated in a PromQL expression, then the instant vector will be sorted by value. For example:

sort(node_filesystem_free_bytes)

might return:

node_filesystem_free_bytes{device="tmpfs",fstype="tmpfs",instance="localhost:9100",job="node",mountpoint="/run/lock"} 5238784
node_filesystem_free_bytes{device="/dev/sda1",fstype="vfat",instance="localhost:9100",job="node",mountpoint="/boot/efi"} 70300672
node_filesystem_free_bytes{device="tmpfs",fstype="tmpfs",instance="localhost:9100",job="node",mountpoint="/run"} 817094656
node_filesystem_free_bytes{device="tmpfs",fstype="tmpfs",instance="localhost:9100",job="node",mountpoint="/run/user/1000"} 826912768
node_filesystem_free_bytes{device="/dev/sda5",fstype="ext4",instance="localhost:9100",job="node",mountpoint="/"} 30791843840

The effect of these functions is cosmetic, but may save you some effort in reporting scripts. NaNs are always sorted to the end, so sort and sort_desc are not quite the reverse of each other.

The instant vectors returned from the topk and bottomk aggrega‐ tors already come with their samples sorted within the aggregation groups.

Histograms with histogram_quantile The histogram_quantile function was already touched on in “Histogram” on page 215. It is internally a bit like an aggregator, since it groups samples together like a without(le) clause would and then calculates a quantile from their values. For example: histogram_quantile( 0.90, rate(prometheus_tsdb_compaction_duration_seconds_bucket[1d])) would calculate the 0.9 quantile (also known as the 90th percentile) latency of Prome‐ theus’s compaction latency over the past day. Values outside of the range from zero to one do not make sense for quantiles, and will result in infinities. As discussed in “Cumulative Histograms” on page 52, the values in the buckets must be cumulative and there must be a +Inf bucket.

Sorting with sort and sort_desc | 267 You must always use rate first for buckets exposed by Prometheus’s histogram met‐ ric type, as shown in “The Histogram” on page 50, as histogram_quantile needs gauges to work on. But there are a very small number of exporters that expose histogram-like time series where the buckets are gauges rather than counters. If you come across one of these it is okay to use histogram_quantile on them directly. Counters Counters include not just the counter metric, but also the _sum, _count, and _bucket time series from summary and histogram metrics. Counters can only go up. When an application starts or restarts, counters will initialise to 0, and the counter functions take this into account automatically. The values of counters are not particularly useful on their own; you will almost always want to convert them to gauges using one of the counter-related functions. Functions working on counters all take a range vector as an argument and return an instant vector. Each of the time series in the range vector is processed individually, and returns at most one sample. If there is only one sample for one of your time ser‐ ies within the range you provide, you will get no output for it when using these functions. rate The rate function is the primary function you will use with counters, and indeed likely the main function you will use from PromQL. rate returns how fast a counter is increasing per second for each time series in the range vector passed to it. You have already seen many examples of rate, such as: rate(process_cpu_seconds_total[1m]) which returns a result like: {instance="localhost:9090",job="prometheus"} 0.0018000000000000683 {instance="localhost:9100",job="node"} 0.005 rate automatically handles counter resets, and any decrease in a counter is consid‐ ered to be a counter reset. So, for example, if you had a time series that had values [5,10,4,6], it would be treated as though it was [5,10,14,16]. rate presumes that the targets it is monitoring are relatively long-lived compared to a scrape interval, as it cannot detect multiple resets in a short period of time. If you have targets that are expected to regularly live for less than a handful of scrape intervals you may wish to consider a log-based monitoring solution instead. rate has to handle scenarios like time series appearing and disappearing, such as if one of your instances started up and then later crashed. For example, if one of your

268 | Chapter 16: Functions instances had a counter that was incrementing at a rate of around 10 per second, but was only running for half an hour, then a rate(x_total[1h]) would return a result of around 5 per second. Values are rarely exact. Since scrapes for different targets happen at different times, there can be jitter over time, the steps of a query_range call will rarely align perfectly with scrapes, and scrapes are expected to fail every now and then. In the face of such challenges, rate is designed to be robust, and the result of rate is intended to be cor‐ rect when looked at on average over time. rate is not intended to catch every single increment, as it is expected that increments will be lost, such as if an instance dies between scrapes. This may cause artifacts if you have very slow moving counters, such as if they’re only incremented a few times an hour. rate can also only deal with changes in counters, because if a counter time ser‐ ies appears with a value of 100, rate has no idea if those increments were just now or if the target has been running for years and has only just started being returned by service discovery to be scraped. It is recommended to use a range for your range vector that is at least four times your scrape interval. This will ensure that you always have two samples to work with even if scrapes are slow, ingestion is slow, and there has been a single scrape failure. Such issues are a fact of life in real-world systems, so it is important to be resilient. For example, for a 1-minute scrape interval you might use a 4-minute rate, but usually that is rounded up to a 5-minute rate.7 Generally you should aim to have the same range used on all your rate functions within a Prometheus for the sake of sanity, since outputs from rates over different ranges are not comparable and tend to be hard to keep track of. You may wonder with all these implementation details and caveats if rate could be changed to be simpler. There are several ways you can approach this problem, but at the end of the day they all have both advantages and disadvantages. If you fix one apparent problem, you will cause a different problem to pop up. The rate function is a good balance across all of these concerns, and provides a robust solution suitable for operational monitoring. If you run into a situation where any rate-like function isn’t giving you quite what you need, I would suggest continuing your debugging based on logs data, which does not have these particular concerns and can produce exact answers.

7 Five-minute rate is a colloquial way to say a rate function on a range vector with a 5-minute range, such as rate(x_total[5m]).

Counters | 269 increase increase is merely syntactic sugar on top of rate. increase(x_total[5m]) is exactly equivalent to rate(x_total[5m]) * 300, which is to say the result of rate multiplied by the range of the range vector. The logic is otherwise identical. Seconds are the base unit for Prometheus, so you should use increase only when displaying values to humans. Within your recording rules and alerts it is best to stick to rate for consistency. One of the outcomes of the robustness of rate and increase is that they can return noninteger results when given integer inputs. Consider that you had the following data points for a time series: 21@2 22@7 24@12 And you were to calculate increase(x_total[15s]) with a query time of 15 seconds. The increase here is 3 over a period of 10 seconds, so you might expect a result of 3. However, the rate was taken over a 15-second period, so to avoid underestimating the correct answer, the 10 seconds of data you have is extrapolated out to 15 seconds, producing a result of 4.5 for the increase. rate and increase presume that a time series continues beyond the bound of the range if the first/last samples is within 110% of the average interval of the data. If this is not the case, it is presumed the time series exists for 50% of an interval beyond the samples you have, but not with the value going below zero. irate irate is like rate in that it returns the per-second rate at which a counter is increas‐ ing. The algorithm it uses is much simpler though; it only looks at the last two sam‐ ples of the range vector it is passed. This has the advantage that it is much more responsive to changes and you don’t have to care so much about the relationship between the vector’s range and the scrape interval, but comes with the corresponding disadvantage that as it is only looking at two samples, it can only be safely used in graphs that are fully zoomed in.8 Figure 16-1 shows a comparison of a 5-minute rate against an irate.

8 If the step for a query_range is greater than the scrape interval, you would skip data when using irate.

Due to the lack of averaging that irate brings, the graphs can be more volatile9 and harder to read. It is not advisable to use irate in alerts due to it being sensitive to brief spikes and dips; use rate instead.

Figure 16-1. CPU usage of a Node exporter viewed with rate and irate

resets

You may sometimes suspect that a counter is resetting more often than it should be. The resets function returns how many times each time series in a range vector has reset. For example, the expression:

resets(process_cpu_seconds_total[1h])

will indicate how many times the CPU time of the process has reset in the past hour. This should be the number of times the process has restarted,10 but if you had a bug that was causing it to go backwards, the value would be higher. resets is intended as a debugging tool for counters, since counters might reset too often and nonmonotonic counters will cause artifacts in the form of large spikes in your graphs. However, some users have found occasional uses for it when they want to know how many times a gauge has been seen to decrease.

9 irate is short for instant rate, though the fact that the function is called irate still brings me minor amusement.

10 changes(process_start_time_seconds[1h]) is a better way to count restarts.

Counters | 271 Changing Gauges Unlike counters, the values of gauges are useful on their own and you can use binary operators and aggregators directly on them. But sometimes you will want to analyse the history of a gauge, and there are several functions for this purpose. As with the counter functions, these functions also take a range vector and return an instant vector with at most one sample for each time series in your input. changes Some gauges are expected to change very rarely. For example, the start time of a pro‐ cess does not change in the lifetime of a process.11 The changes function allows you to count how many times a gauge has changed value, so changes(process_start_time_seconds[1h]) will tell you how many times your process has restarted in the past hour. If you aggregated this across entire applications it would allow you to spot if your applica‐ tions were in a slow crash loop. Due to the fundamental nature of metrics sampling, Prometheus may not scrape often enough to see every possible change. However, if a process is restarting that fre‐ quently, you will still detect it either via this method or by up being 0. You can use changes beyond process_start_time_seconds for other situations where the fact that a gauge has changed is interesting to you. deriv Often you will want to know how quickly a gauge is changing; for example, how quickly a backlog is increasing if it is increasing at all. This would allow you to alert on not only the backlog being higher than you’d like but also that it has not already started to go down. You could do x - x offset 1h, but this only uses two samples, and thus lacks robustness because it is susceptible to individual outlier values. The deriv function uses least-squares regression12 to estimate the slope of each of the time series in a range vector. For example: deriv(process_resident_memory_bytes[1h])

11 Although there have been cases, such as https://github.com/prometheus/client_golang/issues/289, where a cloud provider’s kernel was providing bad metrics.

12 Also known as simple linear regression.

would calculate how fast resident memory is changing per second based on samples from the past hour. predict_linear predict_linear goes a step further than deriv and predicts what the value of a gauge will be in the future based on data in the provided range. For example: predict_linear(node_filesystem_free_bytes{job="node"}[1h], 4 * 3600) would predict how much free space would be left on each filesystem in four hours based on the past hour of samples. This expression is roughly equivalent to: deriv(node_filesystem_free_bytes{job="node"}[1h]) * 4 * 3600 + node_filesystem_free_bytes{job="node"} but predict_linear is slightly more accurate because it uses the intercept from the regression. predict_linear is useful for resource limit alerts, where static thresholds such as 1 GB free or percentage thresholds such as 10% free tend to have false positives and false negatives depending on whether you are working with relatively large or small filesystems. A 1 GB threshold on a 1 TB filesystem would alert you too late, but would also alert you too early on a 2 GB filesystem. predict_linear works better across all sizes. It can take some tweaking to choose good values for the range and to determine how far to predict forward. If there was a regular sawtooth pattern in the data you would want to ensure that the range was long enough not to extrapolate the upward part of the cycle out indefinitely. delta delta is similar to increase, but without the counter reset handling. This function should be avoided as it can be overly affected by single outlier values. You should use deriv instead, or x - x offset 1h if you really want to compare with the value a given time ago. idelta idelta takes the last two samples in a range and returns their difference. idelta is intended for advanced use cases. For example, the way rate and irate work is not to everyone's personal tastes, so using idelta and recording rules allows users to implement what they'd like without polluting PromQL with various subtle variations of the rate function.
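Returning to predict_linear for a moment, a common way to use it in an alerting context is to check whether the prediction crosses zero. This is a sketch rather than a recommendation for your particular thresholds:

    # Fire if the filesystem is predicted to be full within four hours,
    # based on the past hour of samples
    predict_linear(node_filesystem_free_bytes{job="node"}[1h], 4 * 3600) < 0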

Changing Gauges | 273 holt_winters The holt_winters function13 implements Holt-Winters double exponential smooth‐ ing. Gauges can at times be very spiky and hard to read so some smoothing is often good. At the simplest you could use avg_over_time, but you might want something more sophisticated. This function works through the samples for a time series and tracks the smoothed value so far and provides an estimate of the trend in the data. Each new sample is taken into account based on the smoothing factor, which indicates how much old data is important relative to new data, and the trend factor, which controls how important the trend is. For example: holt_winters(process_resident_memory_bytes[1h], 0.1, 0.5) would smooth memory usage with a smoothing factor of 0.1 and a trend factor of 0.5. Both factors must be between 0 and 1. Aggregation Over Time Aggregators such as avg work across the samples in an instant vector. There is also a set of functions such as avg_over_time that apply the same logic, but across the val‐ ues of a time series in a range vector. These functions are:

• sum_over_time • count_over_time • avg_over_time • stddev_over_time • stdvar_over_time • min_over_time • max_over_time • quantile_over_time
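Of these, quantile_over_time is the only one that takes an extra argument; the desired quantile comes first, as in this sketch using a metric you have already seen:

    # The 0.9 quantile of open file descriptors over the past hour
    quantile_over_time(0.9, process_open_fds[1h])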

For example, to see the peak memory usage that Prometheus saw for a process you could use: max_over_time(process_resident_memory_bytes[1h]) and even go a step further and calculate that across the application: max without(instance)(max_over_time(process_resident_memory_bytes[1h]))

13 It is possible this function is misnamed; see https://github.com/prometheus/prometheus/issues/2458.

274 | Chapter 16: Functions These functions only work from the values of the samples; there is no weighting based on the length of time between samples or any other logic relating to time‐ stamps. This means that if you change the scrape interval, for example, there will be a bias toward the time period with the more frequent scrapes for functions such as avg_over_time and quantile_over_time. Similarly if there are failed scrapes for a period of time, that period will be less represented in your result. These functions are used with gauges.14 If you want to take an avg_over_time of a rate this isn’t possible as that function returns instant rather than range vectors. However, rate already calculates an average over time, so you can increase the range on the rate. For example, instead of trying to do: avg_over_time(rate(x_total[5m])[1h]) which will produce a parse error, you can instead do: rate(x_total[1h]) How to use the instant vector output of functions as the input of functions that require range vectors is covered in the next chapter on recording rules.

14 Though as count_over_time ignores values, it can be useful for debugging any type of metric.


CHAPTER 17 Recording Rules

The HTTP API is not the only way in which you can access PromQL. You can also use recording rules to have Prometheus evaluate PromQL expressions regularly and ingest their results. This is useful to speed up your dashboards, provide aggregated results for use elsewhere, and to compose range vector functions. Other monitoring systems might call their equivalent feature standing queries or continuous queries. Alerting rules (covered in Chapter 18) are also a variant of recording rules. This chapter will show you how and when to use recording rules. Using Recording Rules Recording rules go in separate files from your prometheus.yml, which are known as rule files. As with prometheus.yml, rule files also use the YAML format. You can spec‐ ify where your rule files are located using the rule_files top-level field in your prometheus.yml. For example, Example 17-1 loads a rule file called rules.yml, in addi‐ tion to scraping two targets.

Example 17-1. prometheus.yml scraping two targets and loading a rule file

global:
  scrape_interval: 10s
  evaluation_interval: 10s
rule_files:
 - rules.yml
scrape_configs:
 - job_name: prometheus
   static_configs:
    - targets:
       - localhost:9090
 - job_name: node
   static_configs:
    - targets:
       - localhost:9100

Similar to the files field of file_sd_configs, as covered in "File" on page 130, rule_files takes a list of paths, and you can use globs in the filename. Unlike file service discovery, rule_files does not use inotify, nor does it automatically pick up changes you make to rule files. Instead, you must either restart Prometheus or reload its configuration. To ask Prometheus to reload its configuration you can send it the SIGHUP signal using a command like kill -HUP <pid>, where pid is the process ID of Prometheus. You can also send an HTTP POST to the /-/reload endpoint of Prometheus, but for security reasons this requires that the --web.enable-lifecycle flag is specified. If the reload fails, Prometheus will log this, and you will see the prometheus_config_last_reload_successful metric change to 0. To detect bad configuration files or rules in advance, you can use the promtool check config command to check your prometheus.yml. This will also check all the rule files referenced by the prometheus.yml. You might have this as a pre-submit check or unit test that is applied before the configuration file is rolled out. If you want to check the syntax of individual rule files you can use promtool check rules. Rule files themselves consist of zero1 or more groups of rules. Example 17-2 shows a rule file.

Example 17-2. rules.yml with one group containing two rules

groups:
 - name: example
   rules:
    - record: job:process_cpu_seconds:rate5m
      expr: sum without(instance)(rate(process_cpu_seconds_total[5m]))
    - record: job:process_open_fds:max
      expr: max without(instance)(process_open_fds)

You will notice that the group has a name. This must be unique within a rule file, and is used in the Prometheus UI and metrics. expr is the PromQL expression to be eval‐ uated and output into the metric name specified by record.
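Before loading a rule file like this you might check it, and then reload the running Prometheus, as described above. This is a sketch; the file paths and process ID are whatever applies to your setup:

    # Check the rule file on its own, then the full configuration
    promtool check rules rules.yml
    promtool check config prometheus.yml
    # Then reload, either via SIGHUP or, if --web.enable-lifecycle
    # is set, via the reload endpoint
    kill -HUP <pid>
    curl -X POST http://localhost:9090/-/reload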

1 I’m not sure why you would want an empty rule file.

278 | Chapter 17: Recording Rules It is possible to specify an evaluation_interval for a group, but as with scrape_interval you should aim for only one interval in a Prometheus for sanity. You can also specify a set of labels in the labels field to be added to the output, but this is rarely appropriate for recording rules.2 Each rule in a group is evaluated in turn, and the output of your first rule is ingested into the time series database before your second rule is run. While rules within a group are executed sequentially, different groups will be run at different times just as different targets are scraped at different times. This is to spread out the load on your Prometheus. Once your rules are loaded and running you can view them on the Rules status page at http://localhost:9090/rules, as shown in Figure 17-1.

Figure 17-1. Rules status page of Prometheus

In addition to listing your rules, how long each group as a whole took to last evaluate and how long each rule took to execute are also displayed. You can use this to find expensive rules that may need adjustment or reconsideration. The prometheus_rule_group_last_duration_seconds metric will also tell you how long the last evaluation of each group took, which you can use to determine if there have been recent changes in the cost of your rules. There is no metric with the duration of individual rules as that could cause cardinality issues. In this case, the rules are taking less than a millisecond, which is well under the evaluation interval, so there is noth‐ ing to worry about.

2 However, labels is used in virtually all alerting rules.
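If you want to keep an eye on evaluation cost over time rather than checking the status page, you can query the metric mentioned above. The one-second threshold here is arbitrary; pick something comfortably below your evaluation interval:

    # Rule groups whose last evaluation took more than a second
    prometheus_rule_group_last_duration_seconds > 1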

There is no API to upload or change rules. As with Prometheus configuration generally, files are intended to be a base upon which you could build such a system if you so wish.

When to Use Recording Rules There are several cases when you might want to use recording rules. Recording rules are mainly used to reduce cardinality in order to make your queries more efficient. This is common for dashboards, federation, and before storing the metrics in long- term storage. You might also use recording rules to compose range vector functions and on occasion offering APIs of metrics to other teams. Reducing Cardinality If you have an expression such as: sum without(instance)(rate(process_cpu_seconds_total{job="node"}[5m])) in a dashboard you will find you get a prompt response from Prometheus if you have a few targets. As the number of targets grows to the hundreds and thousands you will find that the response time for a query_range is not as snappy. Rather than asking PromQL to access and process thousands of time series for the entire range of each graph on your dashboard, you can precompute this value using a rule group using something like:3 groups: - name: node rules: - record: job:process_cpu_seconds:rate5m expr: > sum without(instance)( rate(process_cpu_seconds_total{job="node"}[5m]) ) which will output to a metric called job:process_cpu_seconds:rate5m. Now you only need to fetch that one time series when your dashboard is being ren‐ dered. The same applies even if you have instrumentation labels in play, as you are reducing the number of time series to process by a factor of how many instances you have. Effectively you are trading an ongoing resource cost against much lower latency and resource cost for your queries. Due to this tradeoff it is not generally wise to have

3 The > here is one of the ways to have multiline strings in YAML.

280 | Chapter 17: Recording Rules rules that use long vector ranges, as such queries tend to be expensive, and running them regularly can cause performance problems. You should try to put all rules for one job in one group. That way they will have the same timestamp and avoid artifacts when you do further math on them. All record‐ ing rules in a group have the same query evaluation time for an execution, and all output samples will also have that timestamp. You will find aggregation rules like these are useful beyond making your dashboards faster. When using federation, as discussed in “Going Global with Federation” on page 334, you will always want to pull aggregated metrics, as otherwise you would be pulling in large swathes of instance-level metrics. At that point, the Prometheus using federation would be better off scraping the targets directly itself from a performance standpoint.4 Similar logic applies if you want to save some metrics on a long-term basis. When doing capacity planning over months or years of data, details of individual instances are not relevant. By keeping primarily aggregated metrics long term you can save a lot of resources with little loss in useful information. You will often have aggregation rules based off the same metric but with different sets of labels. Rather than calculating each aggregation individually, you can be efficient by having one rule use the output of another. For example: groups: - name: node rules: - record: job_device:node_disk_read_bytes:rate5m expr: > sum without(instance)( rate(node_disk_read_bytes_total{job="node"}[5m]) ) - record: job:node_disk_read_bytes:rate5m expr: > sum without(device)( job_device:node_disk_read_bytes:rate5m{job="node"} ) For this to work properly, the rules in a given hierarchy must be in order within a single rule group.5 It is generally best to explicitly specify the job that your rules apply to in your selectors, so that your groups don’t step on each others’ toes.

4 Performance-wise, many small scrapes staggered over time is better than the samples from all those scrapes being combined into one massive scrape.

5 Prior to Prometheus 2.0 this approach was not practical. There was no notion of rule groups, so you couldn’t guarantee that one rule would only run after another rule had completed.

When to Use Recording Rules | 281 Composing Range Vector Functions As mentioned in “Aggregation Over Time” on page 274, you cannot use range vector functions on the output of functions that produce instant vectors. For example, max_over_time(sum without(instance)(rate(x_total[5m]))[1h]) is not possible, and will produce a parse error. PromQL does not feature any form of subquery sup‐ port, but you can use recording rules to the same effect: groups: - name: j_job_rules rules: - record: job:x:rate5m expr: > sum without(instance)( rate(x_total{job="j"}[5m]) ) - record: job:x:max_over_time1h_rate5m expr: max_over_time(job:x:rate5m{job="j"}[1h]) This approach can be used with any range vector function, including not only the _over_time functions but also predict_linear, deriv, and holt_winters. However, this technique should not be used with rate, irate, or increase, as an effective expression of rate(sum(x_total)[5m]) would have massive spikes every time one of its constituent counters reset or disappeared.

Always rate and then sum, never sum and then rate.

You are not required to have the outer function in a recording rule. With the preced‐ ing example it might make more sense to have the max_over_time performed as you need it. For example, the primary use for this particular example would be capacity planning, as you need to plan for peak rather than average traffic. Since capacity planning is often performed once a month or once a quarter, there is not much point in you evaluating the max_over_time at least once a minute rather than running the query just when you need it. Functions over longer time ranges can also get expensive due to the amount of data they have to process. Be careful with ranges over an hour and particularly across many many time series. Rules for APIs Usually the Prometheus servers you run are going to be used entirely by you and your team. But you may run into situations where other teams wish to pull metrics from your Prometheus. If their usage is just informational or depends on metrics that

282 | Chapter 17: Recording Rules are unlikely to change, that’s generally okay, because if you break things on them it’s not the end of the world. But if the metrics are being used as part of automated sys‐ tems or processes outside of your control, it may be a good idea to create metrics just for other teams to consume as a form of public API. Then if you need to change the labels or rules inside your Prometheus you can do so, while still ensuring that the metrics the other team depends on keep the same semantics. The naming of such metrics doesn’t tend to follow the normal naming conventions, and you will typically put the name of the consuming team either in the metric name or a label. Such uses of rules are quite rare. If another team’s use of your Prometheus is getting to the stage where it is placing a nontrivial maintenance burden on you, you might want to ask them to run their own Prometheus for the metrics they need. How Not to Use Rules I have noticed a few common antipatterns with recording rules that I would like to help you avoid. The first of these is rules that undo the benefits of labels. For example: - record: job_device:node_disk_read_bytes_sda:rate5m expr: > sum without(instance)( rate(node_disk_read_bytes_total{job="node",device="sda"}[5m]) ) - record: job_device:node_disk_read_bytes_sdb:rate5m expr: > sum without(instance)( rate(node_disk_read_bytes_total{job="node",device="sdb"}[5m]) ) This would require you to have a rule per potential device label, and you cannot easily aggregate across these metrics. This basically defeats the entire purpose of labels, one of the most powerful features of Prometheus. You should avoid moving label values into metric names, and if you want to limit what time series are returned based on a label value, use a matcher at query time. Similarly do not move the job label into the metric name. Another antipattern is preaggregating every metric an application exposes. While it is true that aggregation is a good idea to reduce cardinality for performance, it is coun‐ terproductive to overdo it. In a metrics-based monitoring system it is not uncommon to never use over 90% of your metrics,6 so aggregating everything by default is a waste of resources and would require unnecessary maintenance as metrics are added and

6 I have heard numbers around this mark from multiple monitoring systems.

When to Use Recording Rules | 283 removed over time. Instead, you should add aggregation as you need it. Those other 90% of metrics are still accessible for when you end up debugging some weird issue in the bowels of your system, and the only cost of not aggregating them is that your queries on them will take slightly longer. The primary purpose of recording rules is to reduce cardinality, so there is often not much point in having recording rules that still have an instance label in their output. Querying ten time series at query time isn’t notably more expensive than querying one. If you have metrics with high cardinality within a target, recording rules with instance labels can make sense, though you should also consider if those instrumen‐ tation labels should be removed on cardinality grounds. With rules such as: - record: job:x:max_over_time1h_rate5m expr: max_over_time(job:x:rate5m{job="j"}[1h]) from the preceding section, you might be tempted to change their evaluation_inter val to an hour in order to save resources. This is not a good idea for three reasons. First, as the input metric came from a recording rule that already reduced cardinality, any resource savings will likely be tiny in the grand scheme of things. Second, Prome‐ theus only guarantees that the rule will be executed once an hour, not when in the hour it will be executed. As you likely want results around the start of the hour, this, combined with staleness handling, will not work out. Third, for the sake of your san‐ ity, you should aim for one interval inside your Prometheus servers. The final pattern I would advise you to avoid is using recording rules to fix poor met‐ ric names and labels. This pattern loses the original timestamps of the data, and makes it harder to figure out where a metric came from and what it means. First, you should try improving the metrics at their source, and if that is not possible for techni‐ cal or political reasons, consider whether using “metric_relabel_configs” on page 148 to improve them is worth the downsides of them differing from what everyone else expects them to be named. Unfortunately, there will always be cases where systems expose metrics that are too far outside the Prometheus way of doing things, and you have no choice but to fix them up however you can. Naming of Recording Rules By using a good convention for naming recording rules, you can not only tell at a glance what a given recording rule metric name means, but it will also be easier to share your rules with others due to a shared vocabulary. As mentioned in “What Should I Name My Metrics?” on page 58, colons are valid characters to have in metric names but are to be avoided in instrumentation. The rea‐

284 | Chapter 17: Recording Rules son for this is so you the user can take advantage of them to add your own structure in recording rules. The convention I use here balances precision and succinctness and comes from years of experience. The way this convention works is to have your metric names contain the labels that are in play, followed by the metric name, followed by the operations that have been performed on the metric. These three sections are separated by colons, so you will always have either zero or two colons in a metric name. For example, given the met‐ ric name: job_device:node_disk_read_bytes:rate5m I can tell that it has job and device labels, the metric it is based off is node_disk_read_bytes, and that it is a counter that rate(node_disk_read_bytes _total[5m]) was applied to. These parts are the level, metric, and operations. level The level indicates the aggregation level of the metric by the labels it has. This will always include the instrumentation labels (if they have not been aggregated away yet), the job label that should be present, and any other target labels that are relevant. Which target labels to include depends on context. If you have an env label across all your targets that doesn’t affect your rules, then there’s no need to bloat your metric names with it. But if a job was broken up by a shard label you should probably include it. metric The metric is just that—the metric or time series name. It’s normal to remove the _total on counters to make things more succinct, but otherwise this should be the exact metric name. The benefits of keeping the metric name is that it is then easy to search your code base for that metric name, and vice versa if you are looking at code to find if the metric has been aggregated. For ratios you would use foo_per_bar, but there’s a special rule for dealing with _sum and _count ratios. operations The operations are a list of functions and aggregators that have been applied to the metric, the most recent first. If you have two sum or max operations you only need to list one, as a sum of a sum is still a sum. Since sum is the default aggrega‐ tion, you generally don’t need to list it. But if you have no other operation to use, or haven’t applied any operations yet, sum is a good default. Depending on what operations you plan on applying at other levels, min and max can make sense for a base metric name. The operation you should use for division is ratio. To take some examples, if you had a foo_total counter with a bar instrumentation label, then aggregating away the instance label would look like:

Naming of Recording Rules | 285 - record: job_bar:foo:rate5m expr: sum without(instance)(rate(foo_total{job="j"}[5m])) Going from there to aggregate away the bar label would look like: - record: job:foo:rate5m expr: sum without(bar)(job_bar:foo:rate5m{job="j"}) You can start to see some of the advantages of this approach. It is clear from inspec‐ tion that the label handling is as expected here, as the input time series had job_bar as the level, bar was removed using a without clause, and the output had job as the level. In more complex rules and hierarchies this can be helpful to spot mistakes. For example, the rule: - record: job:foo_per_bar:ratio_rate5m expr: > ( job:foo:rate5m{job="j"} / job:bar:rate10m{job="j"} ) seems to be following the naming scheme for ratios, but there is a mismatch between the rate5m and the rate10m, which you should notice and realise that this expression and the resulting recording rule don’t make sense. A correct ratio might look like: - record: job_mountpoint:node_filesystem_avail_bytes_per_ node_filesystem_size_bytes:ratio expr: > ( job_mountpoint:node_filesystem_avail_bytes:sum{job="node"} / job_mountpoint:node_filesystem_size_bytes:sum{job="node"} ) Here you can see that the numerator and denominator have the same level and oper‐ ations, which are propagated to the output metric name.7 Here the sum is removed, as it doesn’t tell you anything. This would not be the case if there was a rate5m opera‐ tion in the input metrics. Using the preceding notation for average event sizes would be a bit wordy, so instead the metric name is preserved and mean5m is used as the output operation as it is based on a rate5m and is thus a mean over 5 minutes:

7 Arguably you could remove the _bytes here as it cancels out, but that might make it harder to find the origi‐ nal metrics in the source code.

286 | Chapter 17: Recording Rules - record: job_instance:go_gc_duration_seconds:mean5m expr: > ( job_instance:go_gc_duration_seconds_sum:rate5m{job="prometheus"} / job_instance:go_gc_duration_seconds_count:rate5m{job="prometheus"} ) If you later saw the rule: - record: job:go_gc_duration_seconds:mean5m expr: avg without(instance)( job_instance:go_gc_duration_seconds:mean5m{job="prometheus"} ) it would be immediately obvious that this is attempting to take an average of an aver‐ age, which doesn’t make sense. The correct aggregation would be: - record: job:go_gc_duration_seconds:mean5m expr: ( sum without(instance)( job_instance:go_gc_duration_seconds_sum:rate5m{job="prometheus"}) ) / sum without(instance)( job_instance:go_gc_duration_seconds_count:rate5m{job="prometheus"}) ) ) You should sum to aggregate, and only perform division for averaging at the last step of your calculation. While the preceding cases are straightforward, like metric naming in general, once you get off the beaten track, recording rule naming can be more of an art than a sci‐ ence. You should endeavour to ensure that your recording rule names are clear in what their semantics and labels are, while also attempting to make it easy to tie back recording rule names to the code that produced the original metrics. Aside from the very rare exception (see “Rules for APIs” on page 282), metric names should indicate the identity of a metric name so that you can know what it is. Metric names should not be used as a way to store annotations for policy. For example, you should not feel tempted to add :federate or :longterm or similar to metric names to indicate that you want such and such a metric transferred to another system. This bloats metric names, and will cause problems when your policy changes. Instead, define and implement your policy via matchers when extracting the data, such as, say, pulling all metric names matching job:.*, rather than trying to micro-optimise which exact metrics will and won’t be fetched. By the time a metric has been through a recording rule, it has likely been aggregated sufficiently that its

Naming of Recording Rules | 287 cardinality is negligible, and thus it is probably not worth your time to worry about the resource costs downstream. Now that you know how to use recording rules, the next chapter will look at alerting rules. Alerting rules also live in rule groups, and have a similar syntax.

PART V Alerting

If you want to be woken up at 3 a.m. by your monitoring system,1 these are the chapters for you. Building on the previous chapter, Chapter 18 covers alerting rules in Prometheus, which offer you the ability to alert on far more than simple thresholds. Once you have alerts firing in Prometheus, the Alertmanager converts those into notifications while attempting to group and throttle notifications to increase the value of each notification, as explained in Chapter 19.

1 Hopefully when there’s a true emergency.

CHAPTER 18 Alerting

Back in “What Is Monitoring?” on page 4 I stated that alerting was one of the compo‐ nents of monitoring, allowing you to notify a human when there is a problem. Prom‐ etheus allows you to define conditions in the form of PromQL expressions that are continuously evaluated, and any resulting time series become alerts. This chapter will show you how to configure alerts in Prometheus. As you saw from the example in “Alerting” on page 30, Prometheus is not responsi‐ ble for sending out notifications such as emails, chat messages, or pages. That role is handled by the Alertmanager. Prometheus is where your logic to determine what is or isn’t alerting is defined. Once an alert is firing in Prometheus, it is sent to an Alertmanager, which can take in alerts from many Prometheus servers. The Alertmanager then groups alerts together and sends you throttled notifications (Figure 18-1).

Figure 18-1. Prometheus and Alertmanager architecture

This architecture shown in Figure 18-1 allows you not only flexibility, but also the ability to have a single notification based on alerts from multiple different Prome‐

291 theus servers. For example, if you had an issue propagating serving data to all of your datacenters, you could configure your alert grouping so that you got only a single notification rather than being spammed by a notification for each datacenter you have. Alerting Rules Alerting rules are similar to recording rules, which were covered in Chapter 17. You place alerting rules in the same rule groups as recording rules, and can mix and match as you see fit. For example, it is normal to have all the rules and alerts for a job in one group:1 groups: - name: node_rules rules: - record: job:up:avg expr: avg without(instance)(up{job="node"}) - alert: ManyInstancesDown expr: job:up:avg{job="node"} < 0.5 This defines an alert with the name ManyInstancesDown that will fire if more than half of your Node exporters are down. You can tell that it is an alerting rule because it has an alert field rather than a record field. In this example I am careful to use without rather than by so that any other labels the time series have are preserved and will be passed on to the Alertmanager. Knowing details such as the job, environment, and cluster of your alert is rather useful when you get the eventual notification. For recording rules, you should avoid filtering in your expressions, as time series appearing and disappearing are challenging to deal with. For alerting rules, filtering is essential. If evaluating your alert expression results in an empty instant vector, then no alerts will fire, but if there are any samples returned, each of them will become an alert. Due to this, a single alerting rule like: - alert: InstanceDown expr: up{job="node"} == 0 automatically applies to every instance in the node job that service discovery returns, and if you had a hundred down instances you would get a hundred firing alerts. If on the next evaluation cycle some of those instances are back up, those alerts are consid‐ ered resolved.

1 If a group gets too large to be calculated in one interval, you may have to split it up if trimming it down is not an option.
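To make the earlier point about without versus by concrete, compare these two aggregations; the env label is hypothetical, standing in for whatever target labels you have:

    # Preserves job and any other target labels, such as env or cluster,
    # so they reach the Alertmanager
    avg without(instance)(up{job="node"})
    # Keeps only the job label, discarding context that would be useful
    # in the eventual notification
    avg by(job)(up{job="node"})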

An alert is identified across evaluation cycles by its labels, which do not include the metric name label __name__, but do include an alertname label with the name of the alert. In addition to sending alerts to the Alertmanager, your alerting rules will also populate a metric called ALERTS. Along with all the labels of your alert, an alertstate label is also added. The alertstate label will have a value of firing for firing alerts and pending for pending alerts as discussed in "for" on page 294. Resolved alerts do not have samples added to ALERTS. While you can use ALERTS in your alerting rules as you would any other metric, I would advise caution as it may indicate that you are overcomplicating your setup.

Correct staleness handling for resolved alerts in ALERTS depends on alerts always firing from the same alerting rule. If you have multi‐ ple alerts with the same name in a rule group, and a given alert can come from more than one of those alerting rules, then you may see odd behaviour from ALERTS.2
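As an example of using ALERTS outside of alerting rules, you could count the currently firing alerts by name in the expression browser or on a dashboard (a sketch):

    sum by(alertname)(ALERTS{alertstate="firing"})

Each firing alert contributes a sample with the value 1, so the sum is a count.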

If you want notifications for an alert to be sent only at certain times of the day, the Alertmanager does not support routing based on time. But you can use the date func‐ tions “minute, hour, day_of_week, day_of_month, days_in_month, month, and year” on page 263. For example: - alert: ManyInstancesDown expr: > ( avg without(instance)(up{job="node"}) < 0.5 and on() hour() > 9 < 17 ) This alert will only fire from 9 a.m. to 5 p.m. UTC. It is common to use the “and operator” on page 254 to combine alerting conditions together. Here I used on() as there were no shared labels between the two sides of the and, which is not usually the case. For batch jobs, you will want to alert on the job not having succeeded recently: - alert: BatchJobNoRecentSuccess expr: > time() - my_batch_job_last_success_time_seconds{job="batch"} > 86400*2

2 This also applies to recording rules, but it is quite rare to have multiple recording rules with the same metric name in a group.

Alerting Rules | 293 As discussed in “Idempotency for Batch Jobs” on page 56, with idempotent batch jobs you can avoid having to care about or be notified by a single failure of a batch job. for Metrics-based monitoring involves many race conditions—a scrape may timeout due to a lost network packet, a rule evaluation could be a little delayed due to process scheduling, and the systems you are monitoring could have a brief blip. You don’t want to be woken up in the middle of the night for every artifact or oddity in your systems; you want to save your energy for real problems that affect users. Accordingly, firing alerts based on the result of a single rule evaluation is rarely a good idea. This is where the for field of alerting rules comes in: groups: - name: node_rules rules: - record: job:up:avg expr: avg without(instance)(up{job="node"}) - alert: ManyInstancesDown expr: avg without(instance)(up{job="node"}) < 0.5 for: 5m The for field says that a given alert must be returned for at least this long before it starts firing. Until the for condition is met, an alert is considered to be pending. An alert in the pending state but that has not yet fired is not sent to the Alertmanager. You can view the current pending and firing alerts at http://localhost:9090/alerts, which will look like Figure 18-2 after you click on an alert name.

Figure 18-2. The Alert status page displays firing and pending alerts

294 | Chapter 18: Alerting Prometheus has no notion of hysteresis or flapping detection for alerting. You should choose your alert thresholds so that the problem is sufficiently bad that it is worth calling in a human, even if the problem subsequently subsides. I generally recommend using a for of at least 5 minutes for all of your alerts. This will eliminate false positives from the majority of artifacts, including from brief flaps. You may worry that this will prevent you from jumping immediately on an issue, but keep in mind that it will likely take you the guts of 5 minutes to wake up, boot up your laptop, login, connect to the corporate network, and start debugging. Even if you are sitting in front of your computer all ready to go, it is my experience that once your system is well developed, the alerts you will handle will be nontrivial and it will take you at least 20–30 minutes just to get an idea of what is going on. While wanting to immediately jump on every problem is commendable, a high rate of alerts will burn you and your team out and greatly reduce your effectiveness. If you have an alert that requires a human to take an action in less than 5 minutes, then you should work toward automating that action as such a response time comes at a high human cost if you can even reliably react in less than 5 minutes. You may have some alerts that are less critical or a bit more noisy, with which you would use a longer duration in the for field. As with other durations and intervals, try to keep things simple. For example, across all of your alerts a 5m, 10m, 30m, and 1h for are probably sufficient in practice and there’s not much point in micro- optimising by adding a 12m or 20m on top of that.

The state of for is not currently persisted across restarts of Prome‐ theus, so I advise avoiding for durations of over an hour. If you need a longer duration, you must currently handle that via the expression.
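One way to handle a longer duration via the expression, as suggested above, is to require that the condition has held across the whole window using max_over_time. This sketch builds on the job:up:avg rule from earlier; the 6-hour window is illustrative:

    - alert: ManyInstancesDown
      expr: max_over_time(job:up:avg{job="node"}[6h]) < 0.5
      for: 5m

Because the maximum of the averaged value over 6 hours is below the threshold, the condition must have been true for the entire window, while the short for still guards against a single odd evaluation.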

Because for requires that your alerting rule return the same time series for a period of time, your for state can be reset if a single rule evaluation does not contain a given time series. For example, if you are using a gauge metric that comes directly from a target, if one of the scrapes fails, then the for state will be reset if you had an alerting rule such as: - alert: FDsNearLimit expr: > process_open_fds > process_max_fds * .8 for: 5m To protect against this gotcha you can use the _over_time functions discussed in “Aggregation Over Time” on page 274. Usually you will want to use either avg_over_time or max_over_time:

Alerting Rules | 295 - alert: FDsNearLimit expr: ( max_over_time(process_open_fds[5m]) > max_over_time(process_max_fds[5m]) * 0.9 ) for: 5m The up metric is special in that it is always present even if a scrape fails, so you do not need to use an _over_time function. So if you were running the Blackbox exporter, as covered in “Blackbox” on page 177, and wanted to catch both failed scrapes or failed probes3 you could use: - alert: ProbeFailing expr: up{job="blackbox"} == 0 or probe_success{job="blackbox"} == 0 for: 5m Alert Labels Just like with recording rules, you can specify labels for an alerting rule. Using labels with recording rules is quite rare, but it is standard practice with alerting rules. When routing alerts in the Alertmanager, as covered in “Routing Tree” on page 307, you do not want to have to mention the name of every single alert you have individu‐ ally in the Alertmanager’s configuration file. Instead, you should take advantage of labels to indicate intent. It is usual for you to have a severity label indicating whether an alert is intended to page someone, and potentially wake them up, or that it is a ticket that can be handled less urgently. For example, a single machine being down should not be an emergency, but half your machines going down requires urgent investigation: - alert: InstanceDown expr: up{job="node"} == 0 for: 1h labels: severity: ticket - alert: ManyInstancesDown expr: job:up:avg{job="node"} < 0.5 for: 5m labels: severity: page

3 While the Blackbox exporter should return a response before it times out, things can always go wrong, such as the network being slow or the Blackbox exporter being down.

296 | Chapter 18: Alerting The severity label here does not have any special semantic meaning; it’s merely a label added to the alert that will be available for your use when you configure the Alertmanger. As you add alerts in Prometheus you should set things up so you only need to add a severity label to get the alert routed appropriately, and rarely have to adjust your Alertmanager configuration. In addition to the severity label, if a Prometheus can send alerts to different teams it’s not unusual to have a team or service label. If an entire Prometheus was only sending alerts to one team, you would use external labels (as discussed in “External Labels” on page 303). There should be no need to mention labels like env or region in alerting rules; they should already either be on the alert due to being target labels that end up in the output of the alerting expression, or will be added subsequently by external_labels. Because all the labels of an alert, from both the expression and the labels, define the identity of an alert, it is important that they do not vary from evaluation cycle to eval‐ uation cycle. Aside from such alerts never satisfying the for field, they will spam the time series database within Prometheus, the Alertmanager, and you. Prometheus does not permit an alert to have multiple thresholds, but you can define multiple alerts with different thresholds and labels: - alert: FDsNearLimit expr: > process_open_fds > process_max_fds * .95 for: 5m labels: severity: page - alert: FDsNearLimit expr: > process_open_fds > process_max_fds * .8 for: 5m labels: severity: ticket Note that if you are over 95% of the file descriptor limit then both of these alerts will fire. Attempting to make only one of them fire would be dangerous, as if the value was oscillating around 95% then neither alert would ever fire. In addition, an alert firing should be a situation where you have already decided it is worth demanding a human take a look at an issue. If you feel this may be spammy then you should try and adjust the alerts themselves and consider if they are worth having in the first place, rather than trying to put the genie back in the bottle when the alert is already firing.

Alerting Rules | 297 Alerts Need Owners I purposefully did not include a severity of email or chat in my examples. To explain why, let me tell you a story. I was once on a team that had to create a team mailing list every few months. There was a mailing list for email alerts, but alerts sent there didn’t always get the attention that was desired as there were just too many of them and responsibility was diffuse, which is to say it wasn’t actually anyone’s job to take care of them. There were some alerts considered important, but not important enough to page the oncall engineer. So these alerts were sent to the main team mailing list, in the hope that someone would take a look. Fast forward a bit and the exact same thing happened to the team mailing list, which now had regular automated alerts coming in. At some point it got bad enough that a new team mailing list was created, and this story repeated itself, at which point this team had three email alert lists. Based on this experience and that of others, I strongly discourage email alerts and alerts that go to a team.4 Instead, I advocate having alert notifications going to a tick‐ eting system of some form, where they will be assigned to a specific person whose job it is to handle them. I have also seen it work out to have a daily email to the oncall that lists all currently firing alerts. After an outage it is everyone’s fault for not looking at the email alerts,5 but still not anyone’s responsibility. The key point is that there needs to be ownership and not merely using email as logging. The same applies to chat messages for alerts, with messaging systems such as IRC, Slack, and Hipchat. Having your pages duplicated to your messaging system is handy, and pages are rare. Having nonpages duplicated has the same issues as email alerts, and is worse as it tends to be more distracting. You can’t filter chat messages away to a folder you ignore like you do with emails.

Annotations and Templates Alert labels define the identity of the alert, so you can’t use them to provide addi‐ tional information about the alert such as its current value as that can vary from eval‐ uation cycle to evaluation cycle. Instead, you can use alert annotations, which are similar to labels and can be used in notifications. However, annotations are not part

4 I am also strongly against any form of email that was not written by hand by a human going to team mailing lists, including from alerts, pull requests, and bug/issue trackers.

5 Invariably among the thousands of spam alerts that everyone ignored there was one alert that foreshadowed the outage. Hindsight is 20/20, but to spot that email you would have had to also investigate the thousands of irrelevant notifications.

298 | Chapter 18: Alerting of an alert’s identity, so they cannot be used for grouping and routing in the Alert‐ manager. The annotations field allows you to provide additional information about an alert, such as a brief description of what is going wrong. In addition, the values of the anno tations field are templated using Go’s templating system. This allows you to format the value of the query to be more readable, or even perform additional PromQL queries to add additional context to alerts. Prometheus does not send the value of your alerts to the Alertmanager. Because Prometheus allows you to use the full power of PromQL in alerting rules, there is no guarantee that the value of an alert is in any way useful or even meaningful. Labels define an alert rather than a value, and alerts can be more than a simple threshold on a single time series. For example, you may wish to present the number of instances that are up as a per‐ centage in an annotation. It’s not easy to do math in Go’s templating system, but you can prepare the value in the alert expression:6 groups: - name: node_rules rules: - alert: ManyInstancesDown for: 5m expr: avg without(instance)(up{job="node"}) * 100 < 50 labels: severity: page annotations: summary: 'Only {{printf "%.2f" $value}}% of instances are up.' Here $value is the value of your alert. It is being passed to the printf function,7 which formats it nicely. Curly braces indicate template expressions. In addition to $value, there is $labels with the labels of the alert. For example, $labels.job would return the value of the job label. You can evaluate queries in annotation templates by using the query function. Usu‐ ally you will want to then range over the result of the query, which is a for loop: - alert: ManyInstancesDown for: 5m expr: avg without(instance)(up{job="node"}) < 0.5 labels:

6 For more advanced cases than this, you can consider using the and operator with the value for templating usage on the left-hand side and the alerting expression on the right-hand side.

7 Despite the name, this is actually a sprintf as it returns the output rather than writing it out. This allows you to build up a query that is passed to the query function using printf.

Alerting Rules | 299 severity: page annotations: summary: 'More than half of instances are down.' description: > Down instances: {{ range query "up{job=\"node\"} == 0" }} {{ .Labels.instance }} {{ end }} The value of the element will be in ., which is a single period or full stop character. So .Labels is the labels of the current sample from the instant vector, and .Labels.instance is the instance label of that sample. .Value contains the value of the sample within the range loop.
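If you would rather not hand-craft printf formats, Prometheus's template functions such as humanize can also be used in these annotations. A sketch, reusing the earlier summary:

    annotations:
      summary: 'Only {{ humanize $value }}% of instances are up.'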

Every alert that results from an alerting rule has its templates eval‐ uated independently on every evaluation cycle. If you had an expensive template for a rule producing hundreds of alerts, it could cause you performance issues.

You can also use annotations with static values, such as links to useful dashboards or documentation: - alert: InstanceDown for: 5m expr: up{job="prometheus"} == 0 labels: severity: page annotations: summary: 'Instance {{$labels.instance}} of {{$labels.job}} is down.' dashboard: http://some.grafana:3000/dashboard/db/prometheus In a mature system, attempting to provide all possible debug information in an alert would not only be slow and confuse the oncall, but would likely also be of minimal use for anything but the simplest of issues. You should consider alert annotations and notifications primarily as a signpost to point you in the right direction for initial debugging. You can gain far more detailed and up-to-date information in a dash‐ board than you can in a few lines of an alert notification. Notification templating (covered in “Notification templates” on page 317) is another layer of templating performed in the Alertmanager. In terms of what to put where, think of notification templating as being an email with several blanks that need to be filled in. Alert templates in Prometheus provide values for those blanks. For example, you may wish to have a playbook for each of your alerts linked from the notification, and you will probably name the wiki pages after the alerts. You could add a wiki annotation to every alert, but any time you find yourself adding the same annotation to every alerting rule, you should probably be using notification templat‐ ing in the Alertmanager instead. The Alertmanager already knows the alert’s name so

it can default to wiki.mycompany/Alertname, saving you from having to repeat yourself in alerting rules. As with many things in configuration management and monitoring, having consistent conventions across your team and company makes life easier.

Alerting rule labels are also templated in the same fashion as anno tations, but this is only useful in advanced use cases, and you will almost always have simple static values for labels. If you do use templating on labels, it is important that the label values do not vary from evaluation cycle to evaluation cycle. What Are Good Alerts? In Nagios-style monitoring, it would be typical to alert on potential issues such as high load average, high CPU usage, or a process not running. These are all potential causes of problems, but they do not necessarily indicate a problem that requires the urgent intervention by a human that paging the oncall implies. As systems grow ever more complex and dynamic, having alerts on every possible thing that can go wrong is not tractable. Even if you could manage to do so, the vol‐ ume of false positives would be so high that you and your team would get burnt out and end up missing real problems buried among the noise. A better approach is to instead alert on symptoms. Your users do not care whether your load average is high; they care if their cat videos aren’t loading quickly enough. By having alerts on metrics such as latency and failures experienced by users,8 you will spot problems that really matter, rather than things that maybe might possibly indicate an issue. For example, nightly cronjobs may cause CPU usage to spike, but with few users at that time of day you probably will have no problems serving them. Conversely, inter‐ mittent packet loss can be tricky to alert on directly, but will be fairly clearly exposed by latency metrics. If you have Service-Level Agreements (SLAs) with your users, then those provide good metrics to alert on and good starting points for your thresh‐ olds. You should also have alerts to catch resource utilisation issues, such as running out of quota or disk space, and alerts to ensure that your monitoring is working. The ideal to aim for is that every page to the oncall, and every alert ticket filed, requires intelligent human action. If an alert doesn’t require intelligence to resolve, then it is a prime candidate for you to automate. As a nontrivial oncall incident can take a few hours to resolve, you should aim for less than two incidents per day. For

8 Users don’t have to be customers of your company, such as if you are running an internal service within a company.

nonurgent alerts going to your ticketing system you don't have to be as strict, but you wouldn't want too many more than you have pages. If you find yourself responding to pages with "it went away," that is an indication that the alert should not have fired in the first place. You should consider bumping the threshold of the alert to make it less sensitive, or potentially deleting the alert. For further discussion of how to approach alerting on and managing systems I would recommend reading "My Philosophy on Alerting" by Rob Ewaschuk. Rob also wrote Chapter 6 of Site Reliability Engineering (O'Reilly), which also has more general advice on how to manage systems. Configuring Alertmanagers You configure Prometheus with a list of Alertmanagers to talk to using the same service discovery configuration covered in Chapter 8. For example, to configure a single local Alertmanager you might have a prometheus.yml that looks like: global: scrape_interval: 10s evaluation_interval: 10s alerting: alertmanagers: - static_configs: - targets: ['localhost:9093'] rule_files: - rules.yml scrape_configs: - job_name: node static_configs: - targets: - localhost:9100 - job_name: prometheus static_configs: - targets: - localhost:9090 Here the alertmanagers field works similarly to a scrape config, but there is no job_name, and labels output from relabelling have no impact since there is no notion of target labels when discovering the Alertmanagers to send alerts to. Accordingly, any relabelling will typically only involve drop and keep actions. You can have more than one Alertmanager, which will be further covered in "Alertmanager Clustering" on page 346. Prometheus will send all alerts to all the configured alertmanagers. The alerting field also has alert_relabel_configs, which is relabelling as covered in "Relabelling" on page 135 but applied to alert labels. You can adjust alert labels, or

302 | Chapter 18: Alerting even drop alerts. For example, you may wish to have informational alerts that never make it outside your Prometheus: alerting: alertmanagers: - static_configs: - targets: ['localhost:9093'] alert_relabel_configs: - source_labels: [severity] regex: info action: drop You could use this to add env and region labels to all your alerts, saving you hassle elsewhere, but there is a better way to do this using external_labels. External Labels External labels are labels applied as defaults when your Prometheus talks to other sys‐ tems, such as the Alertmanager, federation, remote read, and remote write,9 but not the HTTP query APIs. External labels are the identity of Prometheus, and every sin‐ gle Prometheus in your organisation should have unique external labels. external_labels is part of the global section of prometheus.yml: global: scrape_interval: 10s evaluation_interval: 10s external_labels: region: eu-west-1 env: prod team: frontend alerting: alertmanagers: - static_configs: - targets: ['localhost:9093'] It is easiest to have labels such as region in your external_labels as you don’t have to apply them to every single target that is scraped, keep them in mind when writing PromQL, or add them to every single alerting rule within a Prometheus. This saves you time and effort, and also makes it easier to share recording and alerting rules across different Prometheus servers as they aren’t tied to one environment or even to one organisation. If a potential external label varies within a Prometheus, then it should probably be a target label instead.

9 Covered in “Going Global with Federation” on page 334 and “Long-Term Storage” on page 337.

Configuring Alertmanagers | 303 Since external labels are applied after alerting rules are evaluated,10 they are not avail‐ able in alert templating. Alerts should not care which of your Prometheus servers they are being evaluated in, so this is okay. The Alertmanager will have access to the external labels just like any other label in its notification templates, and that is the appropriate place to work with them. External labels are only defaults; if one of your time series already has a label with the same name then that external label will not apply. Accordingly, I advise not having targets whose label names overlap with your external labels. Now that you know how to have Prometheus evaluate and fire useful alerts, the next step is to configure the Alertmanager to convert them into notifications, the topic of the next chapter.

10 alert_relabel_configs happens after external_labels.

304 | Chapter 18: Alerting CHAPTER 19 Alertmanager

In Chapter 18 you saw how to define alerting rules in Prometheus, which result in alerts being sent to the Alertmanager. It is the responsibility of your Alertmanager to take in all the alerts from all of your Prometheus servers and convert them to notifications such as emails, chat messages, and pages. Chapter 2 gave you a brief introduction to using the Alertmanager, but in this chapter you will learn how to configure and use the full power of it.

Notification Pipeline

The Alertmanager does more for you than blindly convert alerts into notifications on a one-to-one basis. In an ideal world you would receive exactly one notification for each production incident. While this is a stretch, the Alertmanager tries to get you there by providing you with a controllable pipeline for how your alerts are processed as they become notifications. Just as labels are at the core of Prometheus itself, labels are also key to the Alertmanager:

Inhibition
    On occasion, even when using symptom-based alerting, you will want to prevent notifications for some alerts if another more severe alert is firing, such as preventing alerts for your service if a datacenter it is in is failing but is also receiving no traffic. This is the role of inhibition.

Silencing
    If you already know about a problem or are taking a service down for maintenance, there’s no point in paging the oncall about it. Silences allow you to ignore certain alerts for a while, and are added via the Alertmanager’s web interface.

Routing
    It is intended that you would run one Alertmanager per organisation, but it wouldn’t do for all of your notifications to go to one place. Different teams will want their notifications delivered to different places; and even within a team you might want alerts for production and development environments handled differently. The Alertmanager allows you to configure this with a routing tree.

Grouping
    You now have the production alerts for your team going to a route. Getting an individual notification for each of the machines in a rack1 that failed would be spammy, so you could have the Alertmanager group alerts and only get one notification per rack, one notification per datacenter, or even one notification globally about the uncontactable machines.

Throttling and repetition
    You have your group of alerts that are firing due to the rack of machines being down, and the alert for one of the machines on the rack comes in after you have already sent out the notification. If you sent a new notification every time a new alert came in from a group, that would defeat the purpose of grouping. Instead, the Alertmanager will throttle notifications for a given group so you don’t get spammed.
    In an ideal world all notifications would be handled promptly, but in reality the oncall or other system might let an issue slip through the cracks. The Alertmanager will repeat notifications so that they don’t get lost for too long.

Notification
    Now that your alerts have been inhibited, silenced, routed, grouped, and throttled, they finally get to the stage of being sent out as notifications through a receiver. Notifications are templated, allowing you to customise their content and emphasise the details that matter to you.

Configuration File

As with all the other configurations you have seen, the Alertmanager is configured via a YAML file often called alertmanager.yml. As with Prometheus, the configuration file can be reloaded at runtime by sending a SIGHUP or sending a HTTP POST to

1 In datacenters, machines are typically organised in vertical racks, with each rack usually having its own power setup and a network switch. It is thus not uncommon for an entire rack to disappear at once due to a power or switch issue.

306 | Chapter 19: Alertmanager the /-/reload endpoint. To detect bad configuration files in advance, you can use the amtool check-config command to check your alertmanager.yml.2 For example, a minimal configuration that sends everything to an email address using a local SMTP server would look like: global: smtp_smarthost: 'localhost:25' smtp_from: '[email protected]'

route: receiver: example-email

receivers: - name: example-email email_configs: - to: '[email protected]' You must always have at least one route and one receiver. There are various global settings, which are almost all defaults for the various types of receivers. I’ll now cover the various other parts of the configuration file. You can find a full alertmanager.yml combining the examples in this chapter on GitHub. Routing Tree The route field specifies the top-level, fallback, or default route. Routes form a tree, so you can and usually will have multiple routes below that. For example, you could have: route: receiver: fallback-pager routes: - match: severity: page receiver: team-pager - match: severity: ticket receiver: team-ticket When an alert arrives it starts at the default route and tries to match against its first child route, which is defined in the (possibly empty) routes field. If your alert has a label that is exactly severity="page", it matches this route and matching halts, as this route has no children to consider. If your alert does not have a severity="page" label, then the next child route of the default route is checked; in this case, for a severity="ticket" label. If this matches

2 amtool can also be used to query alerts, and work with silences.

your alert, then matching will also halt. Otherwise, since all the child routes have failed to match, matching goes back up the tree and matches the default route. This is known as a post-order tree traversal, which is to say that children are checked before their parent, and the first match wins. There is also a match_re field that requires that the given label match the given regular expression. As with almost3 all other places, regular expressions are fully anchored. For a refresher on regular expressions, see “Regular Expressions” on page 138. You could use match_re if there were variants in what label values were used for a given purpose, such as if some teams used ticket, others used issue, and others had yet to be convinced that email was possibly not the best place to send notifications:

route:
  receiver: fallback-pager
  routes:
   - match:
       severity: page
     receiver: team-pager
   - match_re:
       severity: (ticket|issue|email)
     receiver: team-ticket

Both match and match_re can be used in the same route, and alerts must satisfy all of the match conditions.
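As a quick sketch of combining the two (the team name and severity values here are illustrative rather than taken from the examples in this chapter), a route could require an exact team match while accepting several spellings of the ticket severity:

route:
  receiver: fallback-pager
  routes:
   - match:
       team: frontend
     match_re:
       severity: (ticket|issue|email)
     receiver: frontend-ticket

An alert would need both a team="frontend" label and a severity matching the regular expression for this route to apply.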

All alerts must match some route, and the top-level route is the last route checked, so it acts as a fallback that all alerts must match. Thus it is an error for you to use match or match_re on the default route.

Rarely will it just be one team using an Alertmanager, and different teams will want alerts routed differently. You should have a standard label such as team or service across your organisation that distinguishes who owns what alerts. This label will usu‐ ally but not always come from external_labels, as discussed in “External Labels” on page 303. Using this team-like label you would have a route per team, and then the teams would have their own routing configuration below that:

3 The reReplaceAll function in alert and notification templates is not anchored, as that would defeat its purpose.

308 | Chapter 19: Alertmanager route: receiver: fallback-pager routes: # Frontend team. - match: team: frontend receiver: frontend-pager routes: - match: severity: page receiver: frontend-pager - match: severity: ticket receiver: frontend-ticket # Backend team. - match: team: backend receiver: backend-pager routes: - match: severity: page env: dev receiver: backend-ticket - match: severity: page receiver: backend-pager - match: severity: ticket receiver: backend-ticket The frontend team has a simple setup, with pages going to the pager, tickets going to the ticketing system, and any pages with unexpected severity labels going to the pager. The backend team has customised things a little. Any pages from the development environment will be sent to the backend-ticket receiver, which is to say that they will be downgraded to just tickets rather than pages.4 In this way you can have alerts from different environments routed differently in the Alertmanager, saving you from having to customise alerting rules per environment. This approach allows you to only have to vary the external_labels in most cases.

It can be a little challenging to come to grips with an existing rout‐ ing tree, particularly if it doesn’t follow a standard structure. There is a visual routing tree editor on the Prometheus website that can show you the tree and what routes alerts will follow on it.

4 Receiver naming is just a convention, but if your configuration does not result in the backend-ticket receiver creating a ticket, it would be quite misleading.

Configuration File | 309 As such a configuration grows as you gain more teams, you may want to write a util‐ ity to combine routing tree fragments together from smaller files. YAML is a stan‐ dard format with readily available unmarshallers and marshallers, so this is not a difficult task. There is one other setting I should mention in the context of routing—continue. Usually the first matching route wins, but if continue: true is specified then a match will not halt the process of finding a matching route. Instead, a matching continue route will be matched and the process of finding a matching route will con‐ tinue. In this way an alert can be part of multiple routes. continue is primarily used to log all alerts to another system: route: receiver: fallback-pager routes: # Log all alerts. - receiver: log-alerts continue: true # Frontend team. - match: team: frontend receiver: frontend-pager Once your alert has a route, the grouping, throttling, repetition, and receiver for that route will apply to that alert and all the other alerts that match that route. All settings for child routes are inherited as defaults from their parent route, with the exception of continue.
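Returning to the idea above of combining routing tree fragments from smaller files, a minimal sketch of such a utility is shown here. It assumes the third-party PyYAML library, a hypothetical base.yml containing the default route and receivers, and one fragment per team under a hypothetical teams/ directory; a real version would want validation and error handling:

import glob
import yaml

# Start from a base configuration containing the default route and receivers.
with open('base.yml') as f:
    config = yaml.safe_load(f)

config['route'].setdefault('routes', [])

# Each fragment is expected to hold a single route subtree, for example:
#   match:
#     team: frontend
#   receiver: frontend-pager
for path in sorted(glob.glob('teams/*.yml')):
    with open(path) as f:
        config['route']['routes'].append(yaml.safe_load(f))

with open('alertmanager.yml', 'w') as f:
    yaml.safe_dump(config, f, default_flow_style=False)

Sorting the fragment filenames keeps the order of child routes, and therefore which route matches first, deterministic between runs.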

Grouping Your alerts have now arrived at their route. By default, the Alertmanager will put all alerts for a route into a single group, meaning you will get one big notification. While this may be okay in some cases, usually you will want your notifications a bit more bite-sized than that. The group_by field allows you to specify a list of labels to group alerts by; this works in the same way as the by clause that you can use with aggregation operators (dis‐ cussed in “by” on page 231). Typically you will want to split out your alerts by one or more of alertname, environment, and/or location. An issue in production is unlikely to be related to an issue in development, and simi‐ larly with issues in different datacenters depending on the exact alert. When alerting

310 | Chapter 19: Alertmanager on symptoms rather than causes, as encouraged by “What Are Good Alerts?” on page 301, it is likely that different alerts indicate different incidents.5 To use this in practice you might end up with a configuration such as: route: receiver: fallback-pager group_by: [team] routes: # Frontend team. - match: team: frontend group_by: [region, env] receiver: frontend-pager routes: - match: severity: page receiver: frontend-pager - match: severity: ticket group_by: [region, env, alertname] receiver: frontend-ticket Here the default route has its alerts grouped by the team label, so that any team miss‐ ing a route can be dealt with individually. The frontend team has chosen to group alerts based on the region and env labels. This group_by will be inherited by their child routes, so all their tickets and pages will also be grouped by region and env. Generally it is not a good idea to group by the instance label, since that can get very spammy when there is an issue affecting an entire application. However, if you were alerting on machines being down in order to create tickets to have a human physi‐ cally inspect them, grouping by instance may make sense depending on the inspec‐ tion workflow.

5 On the other hand, if you are following the RED method, a high failure ratio and high latency can occur together. In practice, one usually happens a good bit before the other, leaving you plenty of time to mitigate the issue or put in a silence.

Configuration File | 311 There is no way to disable grouping in the Alertmanager, other than listing every possible label in group_by. Grouping is a good thing, because it reduces notification spam and allows you to per‐ form more focused incident response. It is far harder to miss a notification about a new incident among a few pages than a hun‐ dred pages.6 If you want to disable grouping due to your organisation already having something that fills the Alertmanager’s role, you may be better off not using the Alertmanager and working from the alerts sent by Prometheus instead.

Throttling and repetition

When sending notifications for a group, you don’t want to get a new notification every time the set of firing alerts changes as that would be too spammy. On the other hand, you don’t want to only learn about additional alerts many hours after they started firing. There are two settings you can adjust to control how the Alertmanager throttles notifications for a group: group_wait and group_interval. If you have a group with no alerts and then a new set of alerts starts firing, it is likely that all these new alerts will not all start firing at exactly the same time. For example, as scrapes are spread across the scrape interval, if a rack of machines fails you will usually spot some machines as down one interval before the others. It’d be good if you could delay the initial notification for the group a little to see if more alerts are going to come in. This is exactly what group_wait does. By default, the Alertmanager will wait 30 seconds before sending the first notification. You may worry this will delay response to incidents, but keep in mind that if 30 seconds matter, you should be aiming for an automated rather than a human response. Now that the first notification has been sent for the group, some additional alerts might start firing for your group. When should the Alertmanager send you another notification for the group, now including these new alerts? This is controlled by group_interval, which defaults to 5 minutes. Every group interval after the first notification, a new notification will be sent if there are new firing alerts. If there are no new alerts for a group then you will not receive an additional notification. Once all alerts stop firing for your group and an interval has passed, the state is reset and group_wait will apply once again. The throttling for each group is independent, so if you were grouping by region, then alerts firing for one region wouldn’t make new alerts in another region wait for a group_interval, just a group_wait.

6 A hundred pages would be a good-sized pagerstorm.

Let’s take an example, where there are four alerts firing at different times:

t=  0   Alert firing   {x="foo"}
t= 25   Alert firing   {x="bar"}
t= 30   Notification for {x="foo"} and {x="bar"}
t=120   Alert firing   {x="baz"}
t=330   Notification for {x="foo"}, {x="bar"} and {x="baz"}
t=400   Alert resolved {x="foo"}
t=700   Alert firing   {x="quu"}
t=930   Notification for {x="bar"}, {x="baz"}, {x="quu"}

After the first alert the group_wait countdown starts, and a second alert comes in while you are waiting. Both these foo and bar alerts will be in a notification sent 30 seconds in. Now the group_interval timer kicks in. In the first interval there is a new baz alert, so 300 seconds (one group interval) after the first notification there is a second notification containing all three alerts that are currently firing. At the next interval one alert has been resolved, but there are no new alerts so there is no notification at t=630. A fourth alert for quu fires, and at the next interval there is a third notification containing all three alerts currently firing.

If an alert fires, resolves, and fires again within a group interval, then it is treated in the same way as if the alert never stopped firing. Similarly if an alert resolves, fires, and resolves again within a group interval, it is the same as if the alert never fired in that inter‐ val. This is not something to worry about in practice.

Neither humans nor machines are fully reliable; even if a page got through to the oncall and they acknowledged it, they might forget about the alert if more pressing incidents occur. For ticketing systems, you may have closed off an issue as resolved, but you will want it reopened if the alert is still firing. For this you can take advantage of the repeat_interval, which defaults to 4 hours. If it has been a repeat interval since a notification was sent for a group with firing alerts, a new notification will be sent. That is to say that a notification sent due to the group interval will reset the timer for the repeat interval. A repeat_interval shorter than the group_interval does not make sense.

If you are getting notifications too often, you probably want to tweak group_interval rather than repeat_interval because the issue is more likely alerts flapping rather than hitting the (usually rather long) repeat interval.

The defaults for these settings are all generally sane, although you may wish to tweak them a little. For example, even a complex outage tends to be under control within 4 hours, so if an alert is still firing after that long it is a good bet that either the oncall

Configuration File | 313 forgot to put in a silence or forgot about the issue and the repeated notification is unlikely to be spammy. For a ticketing system, once a day is generally frequent enough to create and poke tickets, so you could set group_interval and repeat_interval to a day. The Alertmanager will retry failed attempts at notification a few times so there’s no need to reduce repeat_interval for that reason alone. Depending on your setup you might increase group_wait and group_interval to reduce the number of pages you receive. All these settings can be provided on a per-route basis, and are inherited as defaults by child routes. An example configuration using these might look like: route: receiver: fallback-pager group_by: [team] routes: # Frontend team. - match: team: frontend group_by: [region, env] group_interval: 10m receiver: frontend-pager routes: - match: severity: page receiver: frontend-pager group_wait: 1m - match: severity: ticket receiver: frontend-ticket group_by: [region, env, alertname] group_interval: 1d repeat_interval: 1d Receivers Receivers are what take your grouped alerts and produce notifications. A receiver contains notifiers, which do the actual notifications. As of Alertmanager 0.15.0, the supported notifiers are email, HipChat, PagerDuty, Pushover, Slack, OpsGenie, Vic‐ torOps, WeChat, and the webhook. Just as file SD is a generic mechanism for service discovery, the webhook is the generic notifier that allows you to hook in systems that are not supported out of the box. The layout of receivers is similar to service discovery within a scrape config. All receivers must have a unique name, and then may contain any number of notifiers. In the simplest cases you will have a single notifier in a receiver:

314 | Chapter 19: Alertmanager receivers: - name: fallback-pager pagerduty_configs: - service_key: XXXXXXXX PagerDuty is one of the simpler notifiers to get going with, since it only requires a service key to work. All notifiers need to be told where to send the notification, whether that’s the name of a chat channel, an email address, or whatever other identi‐ fiers a system may use. Most notifiers are for commercial Software as a Service (SaaS) offerings, and you will need to use their UI and documentation to obtain the various keys, identifiers, URLs, and tokens that are specific to you and where exactly you want the notification sent to. I’m not going to attempt to give full instructions here, because the notifiers and SaaS UIs are constantly changing. You might also have one receiver going to multiple notifiers, such as having the frontend-pager receiver sending notifications both to your PagerDuty service and your Slack channel:7 receivers: - name: frontend-pager pagerduty_configs: - service_key: XXXXXXXX slack_configs: - api_url: https://hooks.slack.com/services/XXXXXXXX channel: '#pages' Some of the notifiers have settings that you will want to be the same across all your uses of that notifier, such as the VictorOps API key. You could specify that in each receiver, but the Alertmanager also has a globals section for these so you only need to specify in the case of VictorOps a routing key in the notifier itself: global: victorops_api_key: XXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

receivers: - name: backend-pager victorops_configs: - routing_key: a_route_name Since each field like victorops_configs is a list, you can send notifications to multi‐ ple different notifiers of one type at once, such as sending to multiple HipChat rooms:8

7 PagerDuty also has a Slack integration, which permits acknowledging alerts directly from Slack. This sort of integration is quite handy, and can also cover pages coming from sources other than the Alertmanager that are going to PagerDuty.

8 This is preferable to using continue as it is less fragile, and you don’t have to keep multiple routes in sync.

global:
  opsgenie_api_key: XXXXXXXX
  hipchat_auth_token: XXXXXXXX

receivers:
 - name: backend-pager
   opsgenie_configs:
    - teams: backendTeam  # This is a comma separated list.
   hipchat_configs:
    - room_id: XXX
    - room_id: YYY

It is also possible for you to have a receiver with no notifiers at all, which will not result in any notifications:

receivers:
 - name: null

It’d be better where possible for you not to send alerts to the Alertmanager in the first place, rather than spending Alertmanager resources on processing alerts just to throw them away. The webhook notifier is unique in that it doesn’t directly notify an existing paging or messaging system that you might already have in place. Instead, it sends all the information the Alertmanager has about a group of alerts as a JSON HTTP message and allows you to do what you like with it. You could use this to log your alerts, to perform an automated action of some form, or to send a notification via some system that the Alertmanager doesn’t support directly. A HTTP endpoint that accepts a HTTP POST from a webhook notification is known as a webhook receiver.

While it may be tempting to use webhooks liberally to execute code, it would be wise to keep your control loops as small as possi‐ ble. For example, rather than going from an exporter to Prome‐ theus to the Alertmanager to a webhook receiver to restart a stuck process, keeping it all on one machine with a supervisor such as Supervisord or Monit is a better idea. This will provide a faster response time, and generally be more robust due to fewer moving parts.

The webhook notifier is similar to the others; it takes a URL to which notifications are sent. If you were logging all alerts you would use continue on the first route, which would go to a webhook: route: receiver: fallback-pager routes: - receiver: log-alerts continue: true # Rest of routing config goes here.

316 | Chapter 19: Alertmanager receivers: - name: log-alerts webhook_configs: - url: http://localhost:1234/log You could use a Python 3 script such as Example 19-1 to take in these notifications and process the alerts within.

Example 19-1. A simple webhook receiver written in Python 3

import json
from http.server import BaseHTTPRequestHandler
from http.server import HTTPServer

class LogHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        self.send_response(200)
        self.end_headers()
        length = int(self.headers['Content-Length'])
        data = json.loads(self.rfile.read(length).decode('utf-8'))
        for alert in data["alerts"]:
            print(alert)

if __name__ == '__main__':
    httpd = HTTPServer(('', 1234), LogHandler)
    httpd.serve_forever()
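The JSON body that this script parses looks roughly like the following sketch; the label values are illustrative and the exact set of fields can vary between Alertmanager versions:

{
  "version": "4",
  "groupKey": "{}:{alertname=\"InstanceDown\"}",
  "status": "firing",
  "receiver": "log-alerts",
  "groupLabels": {"alertname": "InstanceDown"},
  "commonLabels": {"alertname": "InstanceDown", "job": "node"},
  "commonAnnotations": {},
  "externalURL": "http://localhost:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "InstanceDown", "job": "node", "instance": "localhost:9100"},
      "annotations": {"summary": "Instance localhost:9100 is down"},
      "startsAt": "2018-07-06T10:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://localhost:9090/graph?g0.expr=up%20%3D%3D%200"
    }
  ]
}

Each entry in the alerts list is what Example 19-1 prints, one line per alert.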

All HTTP-based receivers have a field called http_config which, similar to the set‐ tings in a scrape config as discussed in “How to Scrape” on page 146, allows setting a proxy_url, HTTP Basic Authentication, TLS settings, and other HTTP-related con‐ figuration.
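As a sketch of what that might look like for a webhook receiver sitting behind a proxy with basic authentication, with all of the URLs, credentials, and file paths being placeholders:

receivers:
 - name: log-alerts
   webhook_configs:
    - url: https://logger.example.com/alerts
      http_config:
        proxy_url: http://proxy.example.com:3128
        basic_auth:
          username: alertmanager
          password: some-password
        tls_config:
          ca_file: /etc/alertmanager/logger-ca.pem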

Notification templates

The layout of messages from the various notifiers is fine to use when starting out, but you will probably want to customise them as your setup matures. All notifiers except the webhook9 permit templating using the same Go templating system as you used for alerting rules in “Annotations and Templates” on page 298. However, the data and functions you have access to are slightly different, as you are dealing with a group of alerts rather than a single alert.

9 For the webhook it is expected that the webhook receiver was specifically designed to work with the JSON message that is sent, so no templating of the webhook message sent is required. In fact, the JSON message is the exact same data structure that notification templates use under the covers.

Configuration File | 317 As an example, you might always want the region and env labels in your Slack notification: receivers: - name: frontend-pager slack_configs: - api_url: https://hooks.slack.com/services/XXXXXXXX channel: '#pages' title: 'Alerts in {{ .GroupLabels.region }} {{ .GroupLabels.env }}!' This will produce a notification like the one you see in Figure 19-1.

Figure 19-1. A message in Slack with the region and environment

GroupLabels is one of the top-level fields you can access in templating, but there are several others: GroupLabels GroupLabels contains the group labels of the notification, so will be all the labels listed in the group_by for the route that this group came from. CommonLabels CommonLabels is all the labels that are common across all the alerts in your notification. This will always include all the labels in GroupLabels, and also any other labels that happen to be common. This is useful for opportunistically listing similarities in alerts. For example, if you were grouping by region and a rack of machines failed, the alerts for all the down instances might all have a common rack label that you could access in CommonLabels. However, if a single other machine in another rack failed, the rack label would no longer be in your CommonLabels. CommonAnnotations CommonAnnotations is like CommonLabels, but for annotations. This is of very limited use. As your annotations tend to be templated, it is unlikely that there will be any common values. However, if you had a simple string as an annotation, it might show up here. ExternalURL ExternalURL will contain the external URL of this Alertmanager, which can make it easier to get to the Alertmanager to create a silence. You can also use it to figure out which of your Alertmanagers sent a notification in a clustered setup.

318 | Chapter 19: Alertmanager There is more discussion of external URLs in “Networks and Authentication” on page 342. Status Status will be firing if at least one alert in the notification is firing; if all alerts are resolved, it will be resolved. Resolved notifications are covered in “Resolved notifications” on page 323. Receiver The name of the receiver, which is frontend-pager in the preceding example. GroupKey An opaque string with a unique identifier for the group. This is of no use to humans, but it helps ticketing and paging systems tie notifications from a group to previous notifications. This could be useful to prevent opening a new ticket in your ticketing system if there was already one open from the same group. Alerts Alerts is the actual meat of the notification, a list of all the alerts in your notifi‐ cation. Within each alert in the Alerts list there are also several fields: Labels As you would expect, this contains the labels of your alert. Annotations No prizes for guessing that this contains the annotations of your alert. Status firing if the alert is firing; otherwise, it’ll be resolved. StartsAt This is the time the alert started firing as a Go time.Time object. Due to how Prometheus and the alerting protocol work, this is not necessarily when the alert condition was first satisfied. This is of little use in practice. EndsAt This is when the alert will stop or has stopped firing. This is of no use for firing alerts, but will tell you when a resolved alert resolved. GeneratorURL For alerts from Prometheus,10 this is a link to the alerting rule on Prometheus’s web interface, which can be handy for debugging. To me the real reason this field

10 For other systems it should be a link to whatever is generating the alert.

exists is for a future Alertmanager feature that will allow you to drop alerts coming from a particular source, such as if there’s a broken Prometheus that you can’t shut down sending bad alerts to the Alertmanager. You can use these fields as you see fit in your templates. For example, you may wish to include all the labels, a link to your wiki, and a link to a dashboard in all of your notifications:

receivers:
 - name: frontend-pager
   slack_configs:
    - api_url: https://hooks.slack.com/services/XXXXXXXX
      channel: '#pages'
      title: 'Alerts in {{ .GroupLabels.region }} {{ .GroupLabels.env }}!'
      text: >
        {{ .Alerts | len }} alerts:
        {{ range .Alerts }}
        {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }}
        {{ end }}
        {{ if eq .Annotations.wiki "" -}}
        Wiki: http://wiki.mycompany/{{ .Labels.alertname }}
        {{- else -}}
        Wiki: http://wiki.mycompany/{{ .Annotations.wiki }}
        {{- end }}
        {{ if ne .Annotations.dashboard "" -}}
        Dashboard: {{ .Annotations.dashboard }}&region={{ .Labels.region }}
        {{- end }}

        {{ end }}

I’ll break this down:

{{ .Alerts | len }} alerts:

.Alerts is a list, and the in-built len function of Go templates counts how many alerts you have in the list. This is about the most math you can do in Go templates as there are no math operators, so you should use alerting templates in Prometheus as discussed in “Annotations and Templates” on page 298 to calculate any numbers and render them nicely:

{{ range .Alerts }}
{{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }}
{{ end }}

This iterates over the alerts and then the sorted labels of each alert. range in Go templates reuses . as the iterator, so the original . is shadowed or hidden while you are inside the iteration.11 While you could iterate over the label key value pairs in the usual Go fashion, they will not be in a consistent order. The SortedPairs

11 To work around this you can set a variable such as {{ $dot := . }} and then access $dot.

method of the various label and annotation fields sorts the label names and provides a list that you can iterate over.

{{ if eq .Annotations.wiki "" -}}
Wiki: http://wiki.mycompany/{{ .Labels.alertname }}
{{- else -}}
Wiki: http://wiki.mycompany/{{ .Annotations.wiki }}
{{- end }}

Empty labels are the same as no labels, so this checks if the wiki annotation exists. If it does it is used as the name of the wiki page to link; otherwise, the name of the alert is used. In this way you can have a sensible default that avoids you having to add a wiki annotation to every single alerting rule, while still allowing customisation if you want to override it for one or two alerts. The {{- and -}} tell Go templates to ignore whitespace before or after the curly braces, allowing you to spread templates across multiple lines for readability without introducing extraneous whitespace in the output.

{{ if ne .Annotations.dashboard "" -}}
Dashboard: {{ .Annotations.dashboard }}&region={{ .Labels.region }}
{{- end }}

If a dashboard annotation is present, it will be added to your notification, and in addition the region will be added as a URL parameter. If you have a Grafana template variable with this name, you will have it set to point to the right value. As discussed in “External Labels” on page 303, alerting rules do not have access to the external labels that usually contain things such as region, so this is how you can add architectural details to your notifications without your alerting rules having to be aware of how your applications are deployed. The end result of this is a notification like the one shown in Figure 19-2. When using chat-like notifiers and paging systems, it is wise for you to keep notifications brief. This reduces the chances of your computer or mobile phone screen being overcome with lengthy alert details, making it hard to get a basic idea of what is going on. Notifications such as these should get you going on debugging by pointing to a potentially useful dashboard and playbook that have further information, not try to info dump everything that might be useful in the notification itself.

Configuration File | 321 Figure 19-2. A customised Slack message

In addition to templating text fields, the destination of notifications can also be tem‐ plated. Usually each of your teams has their own part of the routing tree and associ‐ ated receivers. If another team wanted to send your team alerts they would set labels accordingly to use your team’s routing tree. For cases where you are offering a ser‐ vice, particularly to external customers, having to define a receiver for every potential destination could be a little tedious.12 Combining the power of PromQL, labels, and notification templating for alert desti‐ nations, you can go so far as to define a per-customer threshold and notification des‐ tination in a metric and have the Alertmanager deliver to that destination. The first step is to have alerts that include their destination as a label: groups: - name: example rules: - record: latency_too_high_threshold expr: 0.5 labels: email_to: [email protected] owner: foo - record: latency_too_high_threshold expr: 0.7 labels: email_to: [email protected] owner: bar - alert: LatencyTooHigh expr: | # Alert based on per-owner thresholds. owner:latency:mean5m > on (owner) group_left(email_to) latency_too_high_threshold

12 Alertmanager configuration is expected to change relatively rarely, as your label structure shouldn’t change that often. Alerting rules, on the other hand, tend to have ongoing churn and tweaks.

322 | Chapter 19: Alertmanager Here the different owners have different thresholds coming from a metric, which also provides an email_to label. This is fine for internal customers who can add their own latency_too_high_threshold to your rule file; for external customers you may have an exporter exposing these thresholds and destinations from a database. Then in the Alertmanager you can set the destination of the notifications based on this email_to label: global: smtp_smarthost: 'localhost:25' smtp_from: '[email protected]'

route: group_by: [email_to, alertname] receiver: customer_email

receivers: - name: customer_email email_configs: - to: '{{ .GroupLabels.email_to }}' headers: subject: 'Alert: {{ .GroupLabels.alertname }}' The group_by must include the email_to label that you are using to specify the desti‐ nation, because each destination needs its own alert group. The same approach can be used with other notifiers. Note that anyone with access to Prometheus or the Alertmanager will be able to see the destinations since labels are visible to everyone. This may be a concern if some destination fields are potentially sensitive.

Resolved notifications

All notifiers have a send_resolved field, with varying defaults. If it is set to true then in addition to receiving notifications about when alerts fire, your notifications will also include alerts that are no longer firing and are now resolved. The practical effect of this is that when Prometheus informs the Alertmanager that an alert is now resolved,13 a notifier with send_resolved enabled will include this alert in the next notification and will even send a notification with only resolved alerts if no other alerts are firing. While it may seem handy to know that an alert is now resolved, I advise quite a bit of caution with this feature as an alert no longer firing does not mean that the original issue is handled. In “What Are Good Alerts?” on page 301 I mentioned that responding to alerts with “it went away” was a sign that the alert should probably not have fired in the first place. Getting a resolved notification may be an indication that a

13 Resolved alerts will have the annotations from the last firing evaluation of that alert.

situation is improving, but you as the oncall still need to dig into the issue and verify that it is fixed, and not likely to come back. Halting your handling of an incident because the alert stopped firing is essentially the same as saying “it went away.” Because the Alertmanager works with alerts rather than incidents, it is inappropriate to consider an incident resolved just because the alerts stopped firing. For example, machine down alerts being resolved might only mean that the machine running Prometheus has now also gone down. So while your outage is getting worse, you are no longer getting alerts about it.14 Another issue with resolved notifications is that they can be a bit spammy. If they were enabled for a notifier such as email or Slack, you could be looking at doubling the message volume, thus halving your signal to noise ratio. As discussed in “Alerts Need Owners” on page 298, using email for notifications is often problematic and more noise will not help with that. If you have a notifier with send_resolved enabled, then in notification templating, .Alerts can contain a mix of firing and resolved alerts. While you could filter the alerts yourself using the Status field of an alert, .Alerts.Firing will give you a list of just the firing alerts, and .Alerts.Resolved the resolved alerts.

Inhibitions

Inhibitions are a feature that allows you to treat some alerts as not firing if other alerts are firing. For example, if an entire datacenter was having issues but user traffic had been diverted elsewhere, there’s not much point in sending alerts for that datacenter. Inhibitions currently15 live at the top level of alertmanager.yml. You must specify what alerts to look for, what alerts they will suppress, and which labels must match between the two:

inhibit_rules:
 - source_match:
     severity: 'page-regionfail'
   target_match:
     severity: 'page'
   equal: ['region']

14 Alerting approaches to detect this are covered in “Meta- and Cross-Monitoring” on page 347, but the salient point here is that you should be in a place where once an alert starts firing, it will get investigated.

15 They may move to per-route at some point (having them as a global setting increases the chances for an inhibition to accidentally suppress more than was intended).

Here, if an alert with a severity label of page-regionfail is firing, it will suppress all your alerts with the same region label that have a severity label of page.16 Overlap between the source_match and target_match should be avoided since it can be tricky to understand and maintain otherwise. Having different severity labels is one way to avoid an overlap. If there is overlap, any alerts matching the source_match will not be suppressed. I recommend using this feature sparingly. With symptom-based alerting (as discussed in “What Are Good Alerts?” on page 301) there should be little need for dependency chains between your alerts. Reserve inhibition rules for large-scale issues such as datacenter outages.

Alertmanager Web Interface

As you saw in “Alerting” on page 30, the Alertmanager allows you to view what alerts are currently firing and to group and filter them. Figure 19-3 shows several alerts in an Alertmanager grouped by alertname; you can also see all of the alerts’ other labels.

Figure 19-3. Several alerts showing on the Alertmanager status page

16 Using match_re in your routes makes it easier to have more specific severity labels like these, while still han‐ dling all pages in one route. If the source alerts are not meant to result in notifications, that would be a good use of a null receiver, as shown in “Receivers” on page 314.

Alertmanager Web Interface | 325 From the status page you can click on New Silence to create a silence from scratch, or click on the Silence link to prepopulate the silence form with the labels of that alert. From there you can tweak the labels you want your silence to have. When working with an existing alert you will usually want to remove some labels to cover more than just that one alert. To help track silences you must also enter your name and a com‐ ment for the silence. Finally, you should preview the silence to ensure it is not too broad, as you can see in Figure 19-4, before creating the silence.

Figure 19-4. Previewing a silence before creating it

If you visit the Silences page you can see all silences that are currently active, the ones that have yet to apply, and the silences that have expired and no longer apply (as shown in Figure 19-5). From here you can also expire silences that no longer apply and recreate silences that have expired.

326 | Chapter 19: Alertmanager Figure 19-5. The Alertmanager Silences page showing the active silences
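Silences can also be created, listed, and expired from the command line with amtool. A rough sketch follows; the matcher values, author, and URL are illustrative, and the exact flag names may vary between amtool versions:

amtool --alertmanager.url=http://localhost:9093 silence add \
    --author="alice" --comment="Planned maintenance" --duration="2h" \
    alertname=InstanceDown instance=localhost:9100

amtool --alertmanager.url=http://localhost:9093 silence query
amtool --alertmanager.url=http://localhost:9093 silence expire $SILENCE_ID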

Silences stop alerts with the given labels from being considered as alerting for the purposes of notification. Silences can be created in advance, if you know that mainte‐ nance is going to happen and don’t want to pointlessly page the oncall, for example. As the oncall you will also use silences to suppress alerts that you’ve already known about for a while, so you are not disturbed while investigating. You can think of a silence like the snooze button on an alarm clock. If you want to stop alerts at a set time every day you should not do so with silences, rather add a condition to your alerts that the hour function returns the desired value, as shown in “Alerting Rules” on page 292. Now that you have seen all the key components of Prometheus, it is time to consider how they all fit together at a higher level. In the next chapter you will learn how to plan a deployment of Prometheus.

Alertmanager Web Interface | 327

PART VI Deployment

Playing around with Prometheus on your own machine is one thing, deploying it on a real production system is a different kettle of fish. Chapter 20 looks at the practicali‐ ties of running Prometheus in production and how to approach rolling it out.

CHAPTER 20 Putting It All Together

In the preceding chapters you learned about all the components in a Prometheus setup: instrumentation, dashboards, service discovery, exporters, PromQL, alerts, and the Alertmanager. In this final chapter you will learn how to bring all of these together and plan a Prometheus deployment and maintain it in the future. Planning a Rollout When you are considering a new technology it’s best to start the rollout with some‐ thing small that doesn’t take too much effort, nor prematurely commit you to doing a complete rollout. When starting with Prometheus in an existing system, I recom‐ mend starting by running the Node exporter1 and Prometheus. You already ran both of these in Chapter 2. The Node exporter covers all the machine-level metrics that might be used from other monitoring systems, and then quite a few more as was covered in Chapter 7. At this stage you will have a wide variety of metrics for little effort, and you should get comfortable with Prometheus, set up some dashboards, and maybe even do some alerting. Next, I’d suggest looking at what third-party systems you are using and which exporters exist for them and start deploying those. For example, if you have network devices you can run the SNMP exporter, if you have JVM-based applications such as Kafka or Cassandra you would use the JMX exporter, and if you want blackbox mon‐ itoring you might use the Blackbox exporter, as covered in Chapter 10. The goal at

1 If you are on Windows, use the WMI exporter instead of the Node exporter.

331 this stage is to gain metrics about as many different parts of your system as you can with as little effort as possible. By now you will be comfortable with Prometheus, and will have figured out your approach to aspects such as service discovery, as discussed in Chapter 8. You could have done all the previous steps of the rollout alone. The next step is to start instru‐ menting your organisation’s own applications, as covered in Chapter 3, which will likely involve asking other people to also get involved and commit time to monitor‐ ing. Being able to demonstrate all of the monitoring and dashboards2 you have set up so far (which are backed by exporters) will make it quite a bit easier to sell others on using Prometheus, extensively instrumenting all your code as step one would be unlikely to get buy-in. As before, when adding instrumentation you want to start with metrics that give you the biggest gains. Look for chokepoints in your applications that significant propor‐ tions of traffic go through. For example, if you have common HTTP libraries that all of your applications use to communicate with each other and you instrument them with the basic RED metrics, as covered in “Service instrumentation” on page 55, you will get the key performance metrics for large swathes of your online serving systems from just one instrumentation change. If you have existing instrumentation from another monitoring system you can deploy integrations such as the StatsD and Graphite exporters discussed in Chapter 11 to take advantage of what you already have. Over time you should look to not only tran‐ sition entirely to Prometheus instrumentation, as covered in Chapter 3, but also to further instrument your applications. As your usage of Prometheus grows to cover more and more of your monitoring and metrics-monitoring needs, you should start turning down other monitoring systems that are no longer needed. It’s not unusual for a company to end up with 10+ differ‐ ent monitoring systems over time, so consolidating where practical is always beneficial. This plan is a general guideline, which you can and should adapt to your circumstan‐ ces. For example, if you are a developer you might jump straight to instrumenting your applications. You might even add a client library to your application with no instrumentation yet, in order to take advantage of the out-of-the-box metrics such as CPU usage and garbage collection.
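To make the chokepoint idea concrete, here is a minimal sketch of RED-style instrumentation wrapped around a shared HTTP client function using the Python client library; the function and metric names are illustrative rather than taken from earlier chapters:

import time
import requests
from prometheus_client import Counter, Histogram

# RED-style metrics for all outgoing HTTP calls made through this wrapper.
requests_total = Counter('client_http_requests_total',
        'Outgoing HTTP requests.', ['method'])
failures_total = Counter('client_http_request_failures_total',
        'Outgoing HTTP requests that raised an exception.', ['method'])
latency_seconds = Histogram('client_http_request_duration_seconds',
        'Outgoing HTTP request latency.', ['method'])

def http_get(url, **kwargs):
    requests_total.labels('get').inc()
    start = time.time()
    try:
        return requests.get(url, **kwargs)
    except Exception:
        failures_total.labels('get').inc()
        raise
    finally:
        latency_seconds.labels('get').observe(time.time() - start)

If every application already calls a wrapper like this, one change gives you request rate, errors, and duration for all of them at once.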

2 I continue to be amazed by the seductive power of a pretty dashboard, especially over other factors such as if the metrics in the dashboard are in any way useful. Do not underestimate this when trying to convince others to use Prometheus.

332 | Chapter 20: Putting It All Together Growing Prometheus Usually you start out with one Prometheus server per datacenter. Prometheus is intended to be run on the same network as what it is monitoring, because this reduces the ways in which things can fail, aligns failure domains, and provides low latency, high bandwidth network access to the targets that Prometheus is scraping.3 A single Prometheus is quite efficient, so you can likely get away with one Prome‐ theus for an entire datacenter’s monitoring needs for longer than you’d think. At some point though, operational overhead, performance, or just social considerations will lead you to start splitting out parts of the Prometheus to separate Prometheus servers. For example, it is common to have separate Prometheus servers for network, infrastructure, and application monitoring. This is known as vertical sharding and it is the best way to scale Prometheus. Longer term you may have every team run their own Prometheus servers, empower‐ ing them to choose what target labels and scrape intervals make sense for them (as discussed in Chapter 8). You could also run the servers for teams as a shared service, but you must be prepared for teams getting overenthusiastic with labels. A pattern I have seen play out many times is that starting out it is a struggle to con‐ vince teams that they should instrument their own code or deploy exporters. At some point though, it’ll click, and they will understand the power of labels. Within a short period of time you will likely find that your Prometheus server has performance issues due to metrics with a cardinality that is far beyond what it is reasonable to use in a metrics-based monitoring system (as discussed in “Cardinality” on page 93). If you are running Prometheus as a shared service and the team consuming these met‐ rics is not the one getting paged, it can be difficult to convince them that they need to cut back on cardinality. But if they run their own Prometheus and are the ones get‐ ting woken up at 3 a.m., they are likely going to be more realistic about what belongs in a metrics-based monitoring system and what belongs in a logs-based system. If your team has particularly large systems they may end up with multiple Prome‐ theus servers per datacenter. An infrastructure team may end up with one Prome‐ theus for Node exporters, one for reverse proxies, and one for everything else. For ease of management, it is normal to run Prometheus servers inside each of your Kubernetes clusters rather than trying to monitor them from outside.

3 Monitoring across failure domain boundaries, such as across datacenters, is possible but messy as you intro‐ duce a whole slew of network-related failure modes. If you have hundreds of tiny datacenters with only a handful of machines each, one Prometheus per region/continent can be an acceptable tradeoff.

Planning a Rollout | 333 Where you start and end up on this spectrum of setups will depend on your scale and the culture within your organisation. It is my experience that social factors4 usually result in Prometheus servers being split out before any performance concerns arise. Going Global with Federation With a Prometheus per datacenter, how do you perform global aggregations? Reliability is a key property of a good monitoring system, and a core value of Prome‐ theus. When it comes to graphing and alerting, you want as few moving parts as pos‐ sible because a simple system is a reliable system. When you want a graph of application latency in a datacenter you have Grafana talk to the Prometheus in that datacenter that is scraping that application, and similarly for alerting on per- datacenter application latency. This doesn’t quite work for global latency, since each of your datacenter Prometheus servers only has a part of the data. This is where federation comes in. Federation allows you to have a global Prometheus that pulls aggregated metrics from your data‐ center Prometheus servers, as shown in Figure 20-1.

Figure 20-1. Global federation architecture

For example, to pull in all metrics aggregated to the job level you could have a prome‐ theus.yml like:

4 For example, it is sane only to have one target label hierarchy within a Prometheus. If a team has a different idea of what a region is than everyone else, they should run their own Prometheus.

334 | Chapter 20: Putting It All Together scrape_configs: - job_name: 'federate' honor_labels: true metrics_path: '/federate' params: 'match[]': - '{__name__=~"job:.*"}' static_configs: - targets: - 'prometheus-dublin:9090' - 'prometheus-berlin:9090' - 'prometheus-new-york:9090' The /federate HTTP endpoint on Prometheus takes a list of selectors (covered in “Selectors” on page 217) in match[] URL parameters. It will return all matching time series following instant vector selector semantics, including staleness, as discussed in “Instant Vector” on page 219. If you supply multiple match[] parameters, a sample will be returned if it matches any of them. To avoid the aggregated metrics having the instance label of the Prometheus target, honor_labels (which was discussed in “Label Clashes and honor_labels” on page 151) is used here.5 The external labels of the Prometheus (as discussed in “External Labels” on page 303) are also added to the federated metrics, so you can tell where each time series came from. Unfortunately some users use federation for purposes other than pulling in aggrega‐ ted metrics. To avoid falling into this trap, you should understand the following:

Federation is not for copying the content of entire Prometheus servers. Federation is not a way to have one Prometheus proxy another Prometheus. You should not use federation to pull metrics with an instance label.

Let me explain why you should not use federation beyond its intended use case. First, for reliability you want to have as few moving parts as is practical, pulling all your metrics over the internet to a global Prometheus from where you can then graph and alert on them means that internet connectivity to another datacenter is now a hard dependency on your per-datacenter monitoring working. In general, you want to align your failure domains, so that graphing and alerting for a datacenter does not depend on another datacenter being operational. That is, as far as is practical you

5 The /federate endpoint automatically includes an empty instance label in its output for any metrics lacking an instance label, in the same way the Pushgateway does as mentioned in “Pushgateway” on page 71.

Going Global with Federation | 335 want the Prometheus that is scraping a set of targets to also be the one sending alerts for that target. This is particularly important if there is a network outage or partition. The second issue is scaling. For reliability, each Prometheus is standalone and run‐ ning on one machine and thus limited by machine size in terms of how much it can handle. Prometheus is quite efficient, so even limited to a single machine, it is quite plausible for you to have a single Prometheus server monitor an entire datacenter. As you add datacenters you just need to turn up a Prometheus in each of them. A global Prometheus pulling in only aggregated metrics will have greatly reduced cardinality data to deal with compared with the datacenter Prometheus servers,6 and thus will prevent bottlenecks. Conversely, if the global Prometheus was pulling in all metrics from each datacenter Prometheus, the global Prometheus would become the bottle‐ neck and greatly limit your ability to scale. Put another way, for federation to scale you need to use the same approach discussed in “Reducing Cardinality” on page 280 for dashboards. Thirdly, Prometheus is designed to scrape many thousands of small to medium size targets.7 By spreading the scrapes over the scrape interval, Prometheus can keep up with the data volumes with even load. If you instead have it scrape a handful of tar‐ gets with massive numbers of time series, such as massive federation endpoints, this can cause load spikes and it may not even be possible for Prometheus to complete processing of one massive scrape worth of data in time to start the next scrape. The fourth issue is semantics. By passing all the data through an extra Prometheus, additional race conditions will be introduced. You would see increased artifacts in your graphs, and you would not get the benefit of the staleness handling the semantics. One objection to this architecture is if all your metrics don’t end up in one Prome‐ theus, how will you know which Prometheus contains a given metric? This turns out not to be an issue in practice. As your Prometheus servers will tend to follow your general architecture, it is usually quite obvious which Prometheus monitors which targets and thus which has a specific metric. For example, Node exporter metrics for Dublin are going to be in the Dublin infrastructure Prometheus. Grafana supports both data source templating and having graphs with metrics from different data sour‐ ces on them, so this is not an issue for dashboards either.
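Tying this back to the federation configuration shown earlier, the {__name__=~"job:.*"} selector only picks up metrics that already exist at the job level, so each datacenter Prometheus needs recording rules producing such aggregates. A minimal sketch, with an illustrative metric name:

groups:
 - name: job_level_aggregation
   rules:
    - record: job:http_requests_total:rate5m
      expr: sum without(instance)(rate(http_requests_total[5m]))

The global Prometheus then only ever sees the low-cardinality job: series, which is what keeps this architecture scalable.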

6 Let's say that you were aggregating up every metric from an application with 100 instances and a global Prometheus was pulling these aggregated metrics. For the same resources that a datacenter Prometheus uses, the global Prometheus could federate metrics from 100 datacenters. In reality the global Prometheus can handle far more, as not all metrics would be aggregated.

7 There are no exact numbers, but I would consider 10,000 time series as starting to get large.
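To make the aggregated-metrics-only pattern concrete, here is a minimal sketch of how the pieces could fit together; the rule name, label matchers, and target hostnames are illustrative rather than taken from any particular setup. A datacenter Prometheus would aggregate away the instance label with a recording rule:

groups:
  - name: aggregation
    rules:
      - record: job:http_requests:rate5m
        expr: sum without(instance)(rate(http_requests_total[5m]))

The global Prometheus would then pull only those aggregated series from each datacenter Prometheus via the /federate endpoint:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-dublin:9090'
          - 'prometheus-berlin:9090'

honor_labels is needed here so that the job and instance labels in the federated data are not overwritten by the global Prometheus's own target labels.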

Usually you will only have a two-level federation hierarchy with datacenter Prometheus servers and globals. The global Prometheus will perform calculations with PromQL that you cannot do in a lower-level Prometheus, such as how much traffic you are receiving globally.

It is also possible that you will end up with an additional level. For example, it's normal to run a Prometheus inside each Kubernetes cluster you have. If you had multiple Kubernetes clusters in a datacenter you might federate their aggregated metrics to a per-datacenter Prometheus before then federating them from there to your global Prometheus.

Another use for federation is to pull limited aggregated metrics from another team's Prometheus. It is polite to ask first, and if this becomes a common or more formal thing, the considerations in "Rules for APIs" on page 282 may apply. There is no need to do this just for dashboards though, as Grafana supports using multiple data sources in a dashboard and in a panel.

Long-Term Storage

In "What Is Monitoring?" on page 4 I mentioned that monitoring was alerting, debugging, trending, and plumbing. For most alerting, debugging, and plumbing, days to weeks of data is usually more than enough.8 But when it comes to trending, such as capacity planning, it's usual for you to want years of data.

One approach to long-term storage is to treat Prometheus like a traditional database and take regular backups that you can restore from in the event of failure. A Prometheus ingesting 10,000 samples per second with a conservative 2 bytes per sample would use a bit under 600 GB of disk space per year, which would fit on a modern machine.

Backups can be taken by sending an HTTP POST to the /api/v1/admin/tsdb/snapshot endpoint, which will return the name of the snapshot created under Prometheus's storage directory. This uses hard links, so it doesn't consume much additional disk space, as the data is stored only once between the snapshot and Prometheus's own database. After you are done with a snapshot it is best to delete it to avoid using more disk space than is needed. To restore from a snapshot, replace the Prometheus storage directory with the snapshot.

Only a tiny proportion of your metrics will be interesting to you for long-term trending, usually the aggregated metrics. It's usually not worth keeping everything forever, so you can save a lot of storage space by only keeping metrics from a global

8 Indeed, I have heard various different monitoring systems report that around 90% of metrics data is not used after the first 24 hours. The problem, of course, is knowing in advance which 90% you’ll never need again.

Prometheus long term9 or deleting nonaggregated metrics before a certain time. The /api/v1/admin/tsdb/delete_series HTTP endpoint takes selectors in its match[] URL parameter10 and has start and end parameters to restrict the time range. Data will be deleted from disk at the next compaction. It would be reasonable to delete old data, say, once a month.

For security reasons both the snapshot and delete APIs require the --web.enable-admin-api flag to be passed to Prometheus for them to be enabled.

Another approach is to send your samples from Prometheus to some form of clustered storage system that can use the resources of many machines. Remote write sends samples as they are ingested to another system. Remote read allows PromQL to transparently use samples from another system, as if they were stored locally within the Prometheus. These are both configured at the top level of prometheus.yml:

remote_write:
  - url: http://localhost:1234/write
remote_read:
  - url: http://localhost:1234/read

Remote write supports relabelling through write_relabel_configs, which works similarly to what you saw in "metric_relabel_configs" on page 148. Your main use of this would be to restrict what metrics are sent to the remote write endpoint, as you may find yourself limited by cost. From a bandwidth and memory standpoint, you should take care when pulling in large numbers of time series over long time periods via remote read. When using remote write it is important that each Prometheus has unique external labels so that metrics from different Prometheus servers don't clash.

One way to use remote read and write would be to consider Prometheus as a largely ephemeral cache, and the remote storage as the main storage.11 If Prometheus is restarted with an empty data store you would rely on remote read for historical graphs. You would also design your alerts to be resilient under such a restart, which is a good idea in any case.

Long-term storage (LTS) for Prometheus is a relatively new and rapidly evolving space. There are several companies and projects that can integrate with Prometheus's remote read and remote write support, but there is not yet enough operational experience to make specific recommendations here.

9 As discussed in “Going Global with Federation” on page 334, the global Prometheus will only have aggregated metrics.

10 This works in the same way as the match[] URL parameter for federation.

11 Usually a multiweek cache.
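As a concrete sketch of the snapshot and deletion APIs described above, assuming Prometheus is listening on localhost:9090 and was started with --web.enable-admin-api (the selector and timestamp are only illustrative):

# Take a snapshot; the response names a directory under the snapshots/
# subdirectory of Prometheus's storage directory.
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Delete the nonaggregated series of one application before a given time;
# -g stops curl from interpreting the brackets in match[].
curl -XPOST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="some_application"}&end=2018-01-01T00:00:00Z'

Similarly, if cost means you only want to remotely write aggregated metrics, a write_relabel_configs sketch might keep only series whose names contain a colon, which is to say the outputs of recording rules:

remote_write:
  - url: http://localhost:1234/write
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '.+:.+'
        action: keep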

When evaluating your options, keep in mind that a load that would be considered light for a single Prometheus server may exceed what another system running across many machines can handle. It is always wise to load test systems based on your own use case rather than relying on headline numbers, as different systems are designed with different data models and access patterns in mind. Simpler solutions can turn out to be both more performant and easier to operate. Clustered does not automatically mean better.

You should expect clustered storage systems to cost at least five times what the equivalent Prometheus would cost for the same load. This is because most systems will replicate the data three times, plus have to take it in and process all the data. Thus you should be judicious about what metrics you keep only locally and which ones are sent to clustered storage.

Running Prometheus

When it comes to actually running the Prometheus server, you will have to consider hardware, configuration management, and how your network is set up.

Hardware

The first question you will probably ask when it comes to running Prometheus is what hardware Prometheus needs. Prometheus is best run on SSDs, though they are not strictly necessary on smaller setups.

Storage space is one of the main resources you need to care about. To estimate how much you'll need you have to know how much data you will be ingesting. For an existing Prometheus12 you can run a PromQL query to report the samples ingested per second:

rate(prometheus_tsdb_head_samples_appended_total[5m])

While Prometheus can achieve compression of 1.3 bytes per sample in production, when estimating I tend to use 2 bytes per sample to be conservative. The default retention for Prometheus is 15 days, so 100,000 samples per second would be around 240 GB over 15 days. You can increase the retention with the --storage.tsdb.retention flag, and control where Prometheus stores data with the --storage.tsdb.path flag. There is no particular filesystem recommended or required for Prometheus, and many users have had success using network block devices such as Amazon's EBS. However NFS, including Amazon's EFS, is explicitly not supported by Prometheus because Prometheus expects a POSIX filesystem and NFS implementations have never really had a reputation for offering exact POSIX

12 For Prometheus 1.x, use the prometheus_local_storage_ingested_samples_total metric instead.

semantics. Each Prometheus needs its own storage directory; you cannot share one storage directory across the network.

The next question is how much RAM you will need. The storage in Prometheus 2.x works in blocks that are written out every two hours and subsequently compacted into larger time ranges. The storage engine does no internal caching; rather, it uses your kernel's page cache. So you will need enough RAM to hold a block, plus overheads, plus the RAM used during queries. 12 hours' worth of sample ingestion is a good starting point, so for 100,000 samples per second that would be around 8 GB.

Prometheus is relatively light on CPU. A quick benchmark on my machine (which has an i7-3770k CPU) shows only 0.25 CPUs being used to ingest 100,000 samples per second. But that is just ingestion—you will want additional CPU power to cover querying and recording rules. Due to CPU spikes from Go's garbage collection, you should always have at least one core more than you think you need.

Network bandwidth is another consideration. Prometheus 2.x can handle ingesting millions of samples per second, which is similar to the one-machine limit of many other similar systems. Prometheus usually uses compression when scraping, so it uses somewhere around 20 bytes of network traffic to transfer a sample. With a million samples per second, that's 160 Mbps of network traffic. That is a good chunk of a gigabit network card, which may be all you have for an entire rack of machines.

Another resource to keep in mind is file descriptors. I could give you the equation and factors, but these days file descriptors are not a scarce resource so I'd say set your file ulimit to a million and not worry about it.

Ulimit changes for file descriptors have an annoying habit of not applying, depending on how exactly you start a service. Prometheus logs the file ulimit at startup, and you can also check the value of process_max_fds on /metrics.
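To make the sizing arithmetic in this section easy to repeat for your own numbers, here is a small back-of-the-envelope sketch; it simply encodes the rules of thumb above, with the ingestion rate being whatever the rate() query reports for your server:

# Rough Prometheus capacity estimates based on the rules of thumb in this chapter.
samples_per_second = 100_000  # from rate(prometheus_tsdb_head_samples_appended_total[5m])
bytes_per_sample = 2          # conservative; ~1.3 bytes is achievable in production
retention_days = 15           # the default retention

disk_bytes = samples_per_second * bytes_per_sample * retention_days * 24 * 60 * 60
ram_bytes = samples_per_second * bytes_per_sample * 12 * 60 * 60  # ~12 hours of ingestion

print(f"Disk over retention: ~{disk_bytes / 2**30:.0f} GiB")
print(f"RAM starting point:  ~{ram_bytes / 2**30:.0f} GiB")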

These numbers are just starting points. You should benchmark and verify these against your setup. I would generally recommend leaving room for your Prometheus to double in terms of resource usage to give you time to get new hardware as you grow; it also gives you a buffer to deal with sudden cardinality increases.

Configuration Management

Prometheus does one thing and does it well—that being metrics-based monitoring. Prometheus does not try to fulfill the role of configuration management, secret management, or service database. To that extent Prometheus aims to get out of your way and allow you to use standard configuration management approaches, rather than

forcing you to learn and work around some Prometheus-specific configuration management contrivance. If you do not yet have a configuration management tool, I would recommend Ansible for more traditional environments. For Kubernetes, ksonnet looks promising, but there are literally tens of tools in this space.

Just because Prometheus allows for standard approaches does not mean it will automatically work perfectly in your environment. Being generic means avoiding the temptation to cater to platform-specific nuances. It means that if you have a mature setup, Prometheus should be quite easy to deploy. You could view Prometheus as a maturity test for your configuration management, because Prometheus is a standard Unix binary that works in the ways you'd expect. It accepts SIGTERM, SIGHUP, logs to standard error, and uses simple text files for configuration.13

For example, Prometheus rule files (discussed in Chapter 17) can only come from files on disk. If you want to have an API where you can submit rules, there is nothing stopping you from building such a system, and having it output the rule files in standard YAML format. Prometheus does not offer such an API itself, as how, for example, would you ensure Prometheus had rules immediately after a reboot? By only offering files on disk you will find debugging is simpler since you know exactly what input Prometheus is working from. Those with simpler setups don't have to worry about more intricate configuration management concepts, and those who wish to do something fancier have an interface that permits them to do whatever they like. Put another way, the cost of more complex and nonstandard setups is borne by those with such setups, not by everyone else.

In simpler setups you can get away with having a static prometheus.yml. But as you expand you will need to template it using your configuration management system, at a minimum to specify a different external_labels per Prometheus, as Prometheus itself has no templating abilities for configuration files. If you haven't fully progressed to having a configuration management system yet,14 some runtime environments can provide environment variables to the applications running under them. You could use tools like sed or envsubst15 to do rudimentary templating. On the far end of sophistication you have tools like the Prometheus Operator from CoreOS (briefly mentioned in Chapter 9), which will completely manage not only your configuration file but also your Prometheus server running on Kubernetes.

13 Windows users can use HTTP instead of SIGTERM and SIGHUP, which requires the --web.enable-lifecycle flag to be specified.

14 To avoid confusion, systems like Docker, Docker Compose, and Kubernetes are not configuration management systems; they are potential outputs for a configuration management system.

15 Part of the gettext library.
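As an illustration of that sort of rudimentary templating, the following sketch renders a hypothetical prometheus.yml.tpl with envsubst and then validates the result with promtool; the REGION variable and file names are made up for the example:

# prometheus.yml.tpl contains, among other things:
#   global:
#     external_labels:
#       region: ${REGION}

# Restricting substitution to ${REGION} avoids mangling other $ signs,
# such as $1 in relabelling replacements.
REGION=dublin1 envsubst '${REGION}' < prometheus.yml.tpl > prometheus.yml
promtool check config prometheus.yml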

In Chapter 10 I mentioned that exporters should live right beside the application they are exporting metrics from. You should take the same approach with any daemons that provide configuration data for Prometheus, such as if you are using File service discovery (discussed in "File" on page 130). By having such daemons run beside each Prometheus you will only be affected by the machine running Prometheus having issues, and not other machines that you are relying on to provide key functionality.

If you want to test changes to your Prometheus configuration you can easily spin up a test Prometheus with the new configuration. Since Prometheus is pull based, your targets don't have to know or care about what is monitoring them. When doing this it would be wise to remove any Alertmanagers or remote write endpoints from the configuration file.

Networks and Authentication

Prometheus is designed with the idea that it is on the same network as the targets it is monitoring, and can contact them directly over HTTP and request their metrics. This is known as pull-based monitoring, and comes with advantages such as up indicating if a scrape worked, being able to run a test Prometheus without having to configure all your targets to push to it, and more tactical options for handling sudden load increases, as covered in "Managing Performance" on page 348.

If you have a network setup where there is NAT or a firewall in the way, you should try to run a Prometheus server behind it so that it can directly access the targets. There are also options like PushProx, SSH tunnels, or having Prometheus use a proxy via the proxy_url configuration field.
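For the last of those options, a proxy_url sketch might look like the following; the proxy and target addresses are hypothetical:

scrape_configs:
  - job_name: behind_firewall
    proxy_url: http://proxy.example.com:3128
    static_configs:
      - targets:
          - target.example.com:9100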

Do not try to use the Pushgateway to work around network architecture, or more generally to try to convert Prometheus to a push-based system.

As was already covered in "Pushgateway" on page 71, the Pushgateway is for service-level batch jobs to push metrics to once just before they exit. It is not designed for application instances to regularly push metrics to, and you should never be pushing metrics that end up with an instance label to the Pushgateway. Trying to use the Pushgateway in this fashion will create a bottleneck,16 the timestamps of the samples will not be correct (which will lead to graph artifacts), and you lose the up metric so it's harder to distinguish whether a process has died on purpose or due to a failure.

16 For the same reasons that you want to run a StatsD exporter per application instance, rather than one per datacenter.

The Pushgateway also has no logic to expire old data, because for service-level batch jobs the fact that the last run of a cronjob was a month ago doesn't change the validity of the last success time metric that cronjob pushed. Pull is at the very core of Prometheus; work with it rather than against it.

Prometheus components do not currently offer any server-side security support, which is to say that all serving is performed under plain HTTP with no authentication, authorisation, or TLS encryption. This is because, as with configuration management, there are so many ways to do things that Prometheus chooses to offer you a basic way of doing things and lets you build on top of it. In the case of server-side security, that would usually be using a reverse proxy such as nginx or Apache, which both offer a wide range of security-related features. You may also want the reverse proxy to block access to the admin and lifecycle endpoints to protect against Cross-Site Request Forgery (XSRF), and use HTTP headers to protect against Cross-Site Scripting (XSS).

When running Prometheus behind a reverse proxy you should pass Prometheus the URL under which it is available via the --web.external-url flag, so that the Prometheus UI and the generator URL in alerts work correctly. If your reverse proxy changes the HTTP path before sending it on to Prometheus, set the --web.route-prefix flag to the prefix of the new paths.

Like Prometheus, the Alertmanager also has --web.external-url and --web.route-prefix flags.
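As a minimal sketch of the reverse proxy approach, assuming nginx, an htpasswd file you have already created, and a Prometheus you want served under a /prometheus prefix (none of this configuration is prescribed by Prometheus itself):

location /prometheus/ {
    auth_basic           "Prometheus";
    auth_basic_user_file /etc/nginx/prometheus.htpasswd;
    proxy_pass           http://localhost:9090/prometheus/;
}

with Prometheus itself started along the lines of:

prometheus --web.external-url=https://monitoring.example.com/prometheus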

While Prometheus and the Alertmanager don't support authentication for serving, they do support it for talking to other systems, including alerting, notification, most service discovery mechanisms, remote read, remote write, and scraping, as was covered in "How to Scrape" on page 146.

Planning for Failure

In distributed systems, failure is a fact of life. Prometheus does not take the path of attempting a clustered design to handle machine failures, since such designs are very tricky to get right and turn out to be less reliable than nonclustered solutions more often than you'd expect. Nor does Prometheus attempt to backfill data if a scrape failed. If the scrape failure was due to overload, backfilling when load went back down a bit could only cause the overload to happen again. It's better when monitoring systems have predictable load and don't exacerbate outages.

Due to the design above, if a scrape fails, up will be 0 for that scrape, and you will have a gap in your time series. But this is not something you should worry about. You will not care about the vast majority of your samples, gaps included, a week after they are collected (if not sooner). For monitoring, Prometheus takes the stance that it's more important that your monitoring is generally reliable and available, rather than 100% accurate. For metrics-based monitoring, 99.9% accuracy is fine for most purposes. It is more useful for you to know that latency increased by a millisecond than whether that increase was to 101.2 or 101.3 ms. rate is resilient to the occasional failed scrape, as long as your range is at least four times the scrape interval, as discussed in "rate" on page 268.

When discussing reliability, the first question you should ask is how reliable you need your monitoring to be. If you are monitoring a system that has a 99.9% SLA then there's no point spending your time and effort designing and maintaining a monitoring system that will be 99.9999% available. Even if you could build such a system, neither the internet connections that your users use nor the response of the humans who are oncall are that reliable.

Taking an example, here in Ireland it is common to use SMS for paging as it is generally fast, cheap, and reliable. However, for a few hours every year it grinds to a halt when the entire country wishes each other Happy New Year, which makes it at most 99.95% reliable over a year. You can have contingencies in place to handle things like this, but as you try backup paging devices and escalate to the secondary oncall, the minutes are ticking away. As mentioned in "for" on page 294, if you have an issue that requires a resolution in under 5 minutes, you should automate it rather than hope your oncall engineers will be able to handle it in time.

In this context I'd like to talk about reliable alerting. If a Prometheus dies for some reason, you should have it automatically restart, and disruption should be minimal beyond for state resetting (as discussed in "for" on page 294).17 But if the machine Prometheus is on dies and Prometheus cannot restart, you won't have alerts until you replace it. If you are using a cluster scheduler such as Kubernetes you can expect this to happen promptly, which may well suffice.18 If replacement is a more manual process, this probably won't be acceptable.

The good news is that you can easily make alerting more reliable by eliminating the single point of failure (SPOF). If you run two identical Prometheus servers then as

17 For this reason I recommend you design your critical alerts to be up and running in a fresh Prometheus within an hour, if not sooner.

18 If using network storage such as Amazon’s EBS, the Prometheus may even continue on with the data of the previous run.

long as one of them is running you will have alerts, and the Alertmanager will automatically deduplicate the alerts because they will have identical labels. As mentioned in "External Labels" on page 303, every Prometheus should have unique external labels, so to maintain that constraint you can use alert_relabel_configs (as discussed in "Configuring Alertmanagers" on page 302):

global:
  external_labels:
    region: dublin1
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
  alert_relabel_configs:
    - source_labels: [region]
      regex: (.+)\d+
      target_label: region

This will remove the 1 from dublin1 before sending the alert to the Alertmanager. The second Prometheus would have a region label of dublin2 as an external label.

I've mentioned now a few times that external labels should be unique across all your Prometheus servers. This is so that if you have multiple Prometheus servers in a setup like the preceding one and you are either using remote write or federation from them, the metrics from the different Prometheus servers won't clash. Even in perfect conditions, different Prometheus servers will see slightly different data, which could be misinterpreted as a counter reset, for example. In less optimal conditions, such as a network partition, each of your redundant Prometheus servers could see wildly different information.

This brings me to the question of reliability for dashboards, federation, and remote write. There is no general way you can automatically synthesise the "correct" data from the different Prometheus servers, and going via a load balancer for Grafana or federation would lead to artifacts. I suggest taking the easy way out and only dashboarding/federating/writing from one of the Prometheus servers, and if it is down, live with the gap. In the rare event that the gap covers a period you care about, you can always look at the data in the other Prometheus by hand.

For global Prometheus servers, as discussed in "Going Global with Federation" on page 334, the tradeoffs are a bit different. As global Prometheus servers are monitoring across failure zones, it is plausible that the global server could be down for hours or days if there was, for example, a major power outage in the datacenter. This is fine for the datacenter Prometheus servers since they aren't running, but neither is anything they were going to be monitoring. I recommend that you always run at least two global Prometheus servers in different datacenters and in dashboards making

graphs available from all of the global servers. Similarly for remote write.19 It is the responsibility of the person using the dashboards to interpret the data from the differing sources.

Alertmanager Clustering

You will want to run one centralised Alertmanager setup for your entire organisation, so that everyone has one place to look at alerts and silences and you get the maximum benefits from alert grouping. Unless you have a small setup, you will take advantage of the Alertmanager's clustering feature, whose architecture is shown in Figure 20-2.

Figure 20-2. Clustering architecture for the Alertmanager
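For reference, the Prometheus side of the architecture in Figure 20-2 is just a matter of listing every Alertmanager under the alerting configuration; a minimal sketch with hypothetical hostnames:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['foo:9093', 'bar:9093']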

The Alertmanager uses HashiCorp's memberlist20 to gossip information about notifications and silences.21 This is not a consensus-based design, so there is no need to have an odd number of Alertmanagers. This is what is known as an AP, or Availability and Partition-tolerant, design, so as long as your Prometheus can talk to at least one Alertmanager that can successfully send notifications, your notifications will get through. When there are rare issues such as network partitions you may get duplicate notifications, but that's better than not getting notifications at all.

For the clustering to work, every Prometheus must send its alerts to every Alertmanager. How it works is that the Alertmanagers order themselves. The first Alertmanager sends notifications normally, and if successful, gossips that the notification was sent. The second Alertmanager has a small delay before sending notifications. If it doesn't get the gossip that the first Alertmanager sent the notification, then it will

19 Global Prometheus servers are at the top of the federation hierarchy, so nothing generally federates from them.

20 Prior to 0.15.0, the Alertmanager used the Weaveworks Mesh library.

21 Aside from gossiping, the Alertmanager also stores data on local disk, so even in a nonclustered setup you won’t lose state by restarting the Alertmanager.

send the notification. The third Alertmanager will have a slightly longer delay, and so on. The Alertmanagers should all have identical alertmanager.yml files, but the worst that should happen if they don't is that duplicate notifications will be sent.

To get it running with Alertmanager version 0.15.0 on two machines called foo and bar you would start the Alertmanager as follows:

# On the machine foo
alertmanager --cluster.peer bar:9094

# On the machine bar
alertmanager --cluster.peer foo:9094

The easiest way for you to test if clustering is working is to create a silence on one Alertmanager and see if it appears on the other Alertmanager. There will also be a list of all members of the cluster on the Alertmanager's Status page.

Meta- and Cross-Monitoring

Thus far I have covered monitoring many different types of systems, but among those I have not covered monitoring your monitoring system. It is fairly standard to have each of your Prometheus servers scrape itself, but that doesn't help you when that Prometheus is having issues. How you monitor your monitoring is known as metamonitoring.

The general approach to metamonitoring for you to take is to have one Prometheus per datacenter which monitors all of the other Prometheus servers in that datacenter. This doesn't have to be a Prometheus server dedicated to this purpose as Prometheus is pretty cheap to monitor, and even if you have a setup where each team is entirely responsible for running their own Prometheus servers, it is still wise to offer metamonitoring as a central shared service. A global Prometheus can then scrape all of your per-datacenter metamonitoring Prometheus servers, likely both for /metrics and federating aggregated metrics about all of the Prometheus servers in your organisation.

This still leaves the question of how you should monitor the global Prometheus servers. Cross-monitoring is metamonitoring where, rather than following the usual metamonitoring hierarchy, Prometheus servers at the same "level" monitor each other. For example, you will usually have two global Prometheus servers scrape each other's /metrics and alert if the other Prometheus is

down. You could also have the datacenter Prometheus servers alerting on the global Prometheus servers.22

Even with all this meta- and cross-monitoring, you are still depending on Prometheus to monitor Prometheus. In the absolute worst case, a bug could take out all of your Prometheus servers at the same time, so it would be wise to have alerting that can catch that. One approach would be an end-to-end alerting test. An always firing alert would continuously fire a notification via your paging provider, which feeds into a dead man's switch. The dead man's switch would then page you23 if it doesn't receive a notification for too long a period. This would test your Prometheus, Alertmanager, and paging provider.

When designing your metamonitoring don't forget to scrape other monitoring-related components, such as the Alertmanager and the /metrics of Blackbox/SNMP-style exporters.

Managing Performance

Unless you have a particularly small and unchanging setup, running into performance issues is more of a when than an if. As discussed in "Cardinality" on page 93 and elsewhere, high cardinality metrics are likely to be the primary cause of the performance problems you encounter.

You may also encounter recording rules and dashboards using overly expensive queries, such as those with range vectors over long durations, as mentioned in "Histogram" on page 215. You can use the Rules status page, as you saw in Figure 17-1, to find expensive recording rules.

Detecting a Problem

Prometheus exposes a variety of metrics about its own performance, so you don't just have to rely on noticing that your dashboards have gotten sluggish. While metrics can and do change names and meanings from version to version, it is unusual for a metric to go away completely.

prometheus_rule_group_iterations_missed_total can indicate that some rule groups are taking too long to evaluate. Comparing

22 With all these alerts ready to fire when a global Prometheus goes down, you should ensure that they all have the same labels and get automatically deduplicated at the Alertmanager. An explicit alert label of datacenter: global (or whatever you use as a datacenter label) to prevent the datacenter Prometheus's datacenter external label applying is one approach you could take.

23 Preferably not solely via your usual paging provider, since that could be what has failed.

prometheus_rule_group_last_duration_seconds against prometheus_rule_group_interval_seconds can tell you which group is at fault and if it is a recent change in behaviour.

prometheus_notifications_dropped_total indicates issues talking to the Alertmanager, and if prometheus_notifications_queue_length is approaching prometheus_notifications_queue_capacity, you may start losing alerts.

Each service discovery mechanism tends to have a metric such as prometheus_sd_file_read_errors_total and prometheus_sd_ec2_refresh_failures_total indicating problems. You should keep an eye on the counters for the SD mechanisms you use.

prometheus_rule_evaluation_failures_total, prometheus_tsdb_compactions_failed_total, and prometheus_tsdb_wal_corruptions_total indicate that something has gone wrong in the storage layer. In the worst case you can always stop Prometheus, delete24 the storage directory, and start it back up again.

Finding Expensive Metrics and Targets

As was mentioned in "by" on page 231 you can use queries such as:

topk(10, count by(__name__)({__name__=~".+"}))

to find metrics with high cardinality. You could also aggregate by job to find which applications are responsible for the most time series. But these are potentially very expensive queries as they touch every time series, and accordingly should be used with caution.

In addition to up, Prometheus adds three other samples for every target scrape. scrape_samples_scraped is the number of samples that were on the /metrics. As this is a single time series per target, it is much cheaper to work with than the previous PromQL expression. scrape_samples_post_metric_relabeling is similar, but it excludes samples that were dropped by metric_relabel_configs. The final special sample added is scrape_duration_seconds, which is how long that scrape took. This can be useful for checking whether timeouts are occurring, if it is approaching the timeout value, or as an indication that a target is getting overloaded.
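Building on the metrics just mentioned, here are a few illustrative expressions you might graph or alert on; the thresholds are arbitrary examples rather than recommendations:

# Rule groups whose last evaluation took longer than their interval.
prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds

# Notifications to the Alertmanager being dropped.
rate(prometheus_notifications_dropped_total[5m]) > 0

# The ten targets exposing the most samples after metric_relabel_configs.
topk(10, scrape_samples_post_metric_relabeling)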

Hashmod

If your Prometheus is so overloaded by data from scrapes that you cannot run queries, there is a way to scrape a subset of your targets. There is another relabel action called hashmod that calculates the hash of a label and takes its modulus. Combined

24 Or rename.

with the keep relabel action you could use this to scrape an arbitrary 10% of your targets:

scrape_configs:
  - job_name: my_job
    # Service discovery etc. goes here.
    relabel_configs:
      - source_labels: [__address__]
        modulus: 10
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: 0
        action: keep

With only 10% of the targets to scrape, if you spin up a test Prometheus with this configuration you should now be able to find out which metric is to blame. If it is only some targets that are causing the problem, you can change which 10% of targets to scrape by changing the regex to 1, 2, and so on up to 9.

Reducing Load

Once you have identified expensive metrics you have a few options. The first thing to do is try to fix the metric in the source code to reduce its cardinality. While you're waiting for that to happen, you have several tactical options. The first is to drop the metric at ingestion time using metric_relabel_configs:

scrape_configs:
  - job_name: some_application
    static_configs:
      - targets:
          - localhost:1234
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: expensive_metric_name
        action: drop

This still transfers the metric over the network and parses it, but it's still cheaper than ingesting it into the storage layer.25 If particular applications are being problematic you can also drop those targets with relabelling.

25 The Java and Python clients support fetching specific time series using URL parameters such as /metrics?metric[]=process_cpu_seconds_total. This may not always work for custom collectors, but it can save a lot of resources on both sides of the scrape if there are only a small number of specific metrics you want.
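To drop a problematic target entirely rather than individual metrics, a relabelling sketch along these lines could be used; the address is hypothetical:

scrape_configs:
  - job_name: my_job
    # Service discovery etc. goes here.
    relabel_configs:
      - source_labels: [__address__]
        regex: problem-host:1234
        action: drop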

The final option is to increase the scrape_interval and evaluation_interval for the Prometheus. This can buy you some breathing room, but keep in mind that it's not practical to increase these beyond 2 minutes. Changing the scrape interval may also break some PromQL expressions that depend on it having a specific value.

There is one other option in the scrape config that can be of use to you called sample_limit. If the number of samples after metric_relabel_configs26 is higher than sample_limit, then the scrape will fail and the samples will not be ingested. This is disabled by default but can act as an emergency relief valve in the event that one of your targets blows up in cardinality, such as by adding a metric with a customer identifier as a label, for example. This is not a setting to micromanage or to attempt to build some form of quota system on top of; if you are going to use it, choose a single generous value that will rarely need bumping. I advise having enough buffer room in your Prometheus to be able to handle a moderate spurt in cardinality and targets.

Horizontal Sharding

If you are running into scaling challenges due to instance cardinality rather than instrumentation label cardinality, there is a way to horizontally shard Prometheus using the hashmod relabel action you saw in "Hashmod" on page 349. This is an approach that is typically only needed if you have many thousands of targets of a single type of application, as vertical sharding is a far simpler way to scale Prometheus (as discussed in "Growing Prometheus" on page 333).

The approach to horizontal sharding is to have a master Prometheus and several scraping Prometheus servers. Your scraping Prometheus servers each scrape a subset of the targets:

global:
  external_labels:
    env: prod
    scraper: 2
scrape_configs:
  - job_name: my_job
    # Service discovery etc. goes here.
    relabel_configs:
      - source_labels: [__address__]
        modulus: 4
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]

26 Which is to say the value of scrape_samples_post_metric_relabeling.

        regex: 2  # This is the 3rd scraper.
        action: keep

Here you can see there are four scrapers from the modulus setting. Each scraper should have a unique external label, plus the external labels of the master Prometheus. The master Prometheus can then use the remote read endpoint of Prometheus itself to transparently pull in data from the scrapers:

global:
  external_labels:
    env: prod
remote_read:
  - url: http://scraper0:9090/api/v1/read
    read_recent: true
  - url: http://scraper1:9090/api/v1/read
    read_recent: true
  - url: http://scraper2:9090/api/v1/read
    read_recent: true
  - url: http://scraper3:9090/api/v1/read
    read_recent: true

Remote read has an optimisation where it will try not to read in data it should already have locally, which makes sense if it is being used with remote write to work with a long-term storage system. read_recent: true disables this. Due to the external labels, the metrics from each scraper will have a scraper label matching where they came from.

All the same caveats as with federation, covered in "Going Global with Federation" on page 334, apply here. This is not a way to have one Prometheus that can let you transparently access all of your Prometheus servers. In fact, it would actually be a great way to take out all of your monitoring simultaneously through a single expensive query. When using this it is best to aggregate what you can inside the scrapers (following "Reducing Cardinality" on page 280), to reduce the amount of data that the master needs to pull in from the scrapers. You should be generous with the number of scrapers and aim to only have to increase every few years. When you do increase it, you should at least double the number of scrapers to avoid having to increase the number again soon.

Managing Change

Over time you will find that you need to change the structure of your target labels due to changes in the architecture of your systems. Which applications will host the metrics used for capacity planning will change over time as your applications split and merge as a natural part of development. Metrics will appear and disappear from release to release.

You have the option of using metric_relabel_configs to rename metrics and cram the new hierarchy into your existing target labels. But over time you would find that these tweaks and hacks accumulate and ultimately cause more confusion than you may have been trying to prevent by trying to keep things the same. I would advise accepting that changes like this are a natural part of the evolution of your system, and as with gaps due to failed scrapes, you usually find that you don't care much about the old names after the fact.

Long-term processes such as capacity planning, on the other hand, do care about history. At the least you should note the names of the metrics over time, and possibly consider using the approach in "Rules for APIs" on page 282 in your global Prometheus if the changes are a bit too frequent to manage by hand.

In this chapter you learned how to approach a Prometheus deployment: in what order to add Prometheus monitoring to your system, how to architect and run Prometheus, and how to handle performance problems when they arise.

Getting Help

Even after reading everything up to this point, you may have questions that are not covered here. There are a number of places you can ask questions. IRC is the primary communication method of the Prometheus project, and the #prometheus channel on irc.freenode.net is a good place to ask usage questions. The prometheus-users mailing list is also available for user questions. There are also unofficial venues for questions, including the Prometheus tag on StackOverflow and the PrometheusMonitoring subreddit. Finally, there are several companies and individuals offering commercial support listed on the Prometheus Community page, including my company, Robust Perception.

I hope you have found this and all of the preceding chapters useful, and that Prometheus will help to make your life easier through metrics-based monitoring.



364 | Index resets, 271 using GraphiteBridge to push to Graphite, round, 261 74 scalar, 258 webhook receiver written in Python 3, 317 sort, 267 writing to pushgateway, 73 sort_desc, 267 python_info expression, 40 sqrt, 260 stddev_over_time, 274 Q stdvar_over_time, 274 quantiles, 50-53, 215 sum_over_time, 274 and percentiles, 51 time, 262 in Summary metrics, 50 timestamp, 264 limitations of, 53 vector, 258 quantile aggregator, 237 year, 264 SLAs and, 53 HTTP API, 223-228 quantile_over_time function, 238, 274 query, 223-225 quartiles, 237 query_range, 225-228 query, 223-225 recording rules, 277-288 query_range, 225-228 how not to use, 283 aligned data, 228 naming, 284-288 using time function with, 262 use cases, 280-283 using topk and bottomk with, gotcha, 236 using, 277-280 selectors, 217-223 instant vector, 219 R matchers, 217 race conditions, 294 offset modifier, 222 RAM, 340 range vector, 220-222 range vector selectors, 220-222 promtool range vectors, 257 check config, 278 composing functions for, 282 check metrics, 79 using with query endpoint, 224 check rules, 278 using with query_range, 227 proxy, 343 rate function, 24, 44, 213, 268 proxy_url, 147 increase function and, 270 pull, 14, 72, 342 irate function and, 270 push, 14, 72, 123, 342 use with histograms and histogram_quan‐ pushadd_to_gateway, 73 tile, 216 pushgateway, 71-74, 342 using to track program latency, 49 push_time_seconds, 73 ratio, calculating for exceptions, 44 push_to_gateway, 73 RE2, 138 Python, 206 (see also regular expressions) application using a label for a counter met‐ read_recent, 352 ric, 83 Receiver field, 319 client libraries in Python 3, 39 receivers, 314 exporter for Consul metrics written in Receiver field, 319 Python 3, 204 webhook, 316 exposition in client libraries, 62-67 recording rules, 15, 277-288 multiprocess with Gunicorn, 64-67 how not to use, 283 Twisted, 63 naming, 284-288 WSGI, 62 level, metric, and operations, 285 parser for Prometheus exposition format, 75 use cases for, 280-283 composing range vector functions, 282

Index | 365 reducing cardinality, 280 RRD (Round Robin Database), 6 rules for APIs, 282 Ruby, 39 using, 277, 280 rule groups RED (requests, errors, and duration) metrics, for alerting rules, 292 55, 332 for recording rules, 278 registry, 74 rules, alerting (see alerting rules) regression, 272 rules, recording (see recording rules) regular expression matcher (=~), 218 rule_files, 32, 277 regular expressions, 138 running Prometheus, 17-20, 339-343 in relabelling, 136 configuration, 19 patterns based on, use by Grok, 175 configuration management, 340 relabelling, 135-146, 187 expression browser, 20 drop action, 137 hardware, 339 hashmod action, 349 networks and authentication, 342 in remote writes, 338 runtimes, client libraries for Prometheus, 3 keep action, 136 Kubernetes service name as job label, 164 S Kubernetes services, 165 sample_limit, 351 labeldrop and labelkeep actions, 150 sampling, 105 labelmap action, 144 saturation, metrics on, 55 lists, 145 scalar function, 258 metric, 148 scalars, 225 replace action, 140 converting to vectors, 257 relabel_configs, 136 working with, 241-245 vs. metric_relabel_configs, 149 using arithmetic operators, 242 reliability, 14, 334, 343 using comparison operators, 243-245 reloading configuration, 278 scheme, 147 remote read endpoint, 352 scrape errors, 30 remote_read, 338 scrape_configs, 130 remote_write, 338 Prometheus monitoring of Node exporter, reliability for, 345 26 repeat_interval, 313 scrape_duration_seconds, 349 repetition, 306, 312 scrape_interval, 147, 351 replace action, 140 scrape_limit, 95 request logs, 9 scrape_samples_post_metric_relabeling, 349 requests, metrics on, 41, 55, 332 scrape_samples_scraped, 349 reserved labels, 84 scrape_timeout, 147 resets function, 271 scraping, 13, 146-151 resolved notifications, 323 SD (see service discovery) resources, 339 seconds, 47 restore, 337 secrets, 148 retention, 339 security, 148 round function, 261 selectors, 217-223 Round Robin Database (RRD), 6 instant vector, 219 route field, 307 matchers, 217 route prefix, 343 offset modifier, 222 routes field, 307 range vector, 220-222 routing, 306 send_resolved, 323 configuring for Alertmanager, 307 sensors command, 120

366 | Index service discovery, 13, 127-151 sort_desc function, 267 Kubernetes, using with Prometheus, source labels, multiple, 137 159-167 source_match, 324 endpoints role, 161 SPOF, 344 ingress role, 167 sqrt function, 260 node role, 159 square root, 260 pod role, 166 SSL (see TLS) service role, 161 stability guarantees, 18 mechanisms, 128-135 stale marker, 219 Consul, 132 staleness, 219 EC2, 134 for resolved alerts, 293 file SD, 130 standard deviation, 235 static_configs, 129 standard variance, 235 top-down vs. bottom-up, 129 StartsAt field, 319 metrics indicating problems, 349 start_http_server, 40, 62 relabelling, 135-146 stat collector, 121 choosing what to scrape, 136-138 state set, 89 target labels, 139-146 static_configs, 129 using for blackbox monitoring, 190 StatsD, 6, 191, 194-196 service label, 297, 308 Status field, 319 service role, 161 stddev, 235 services stddev_over_time function, 274 instrumentation, 55 stdvar, 235 monitoring based on service health, 6 stdvar_over_time function, 274 service_discovery, 159 storage, 14, 16, 337-339 servlet, 69 hardware, 339 set method, 46 storage layer, problems in, 349 set operators (see logical operators) strings, 88 set_function, 48 subqueries, 282 set_to_current_time, 47 suffixes (on metric names), 47, 58 severity label, 296 sum, 87, 211, 232 sharded batch jobs, 74 using with by clause, 231 sharding, 333, 351 summary, 48, 214 SIGHUP, 278, 341 exposition format, 77 SIGTERM, 341 sum_over_time function, 274 silences, 305, 326 symptoms, alerting on, 301, 311 silencing alerts, 15 systemd, 154 simple linear regression, 272 simpleclient, 68 T Singletest panel (Grafana), 105 table exception, 93 Slack, 315 Table panel (Grafana), 107 smartctl command, 122 target labels, 82, 127, 139-146, 230 smoothing factor, 274 using relabelling to specify snake case, 58 job, instance and __address__, 143 SNMP, 192 labelmap action, 144 SNMP-style exporters, 177 lists, 145 blackbox-style exporters vs., 177 replace action, 140 sort function, 267 Targets page, 20 SortedPairs, 320 showing Prometheus and Node exporter, 27

Index | 367 target_match, 324 units, 22 TCP probes, 181-183 for exporter metrics, 206 tcpdump, 7 in metric names, 47, 58 team label, 297, 308 quantiles, 51 templating seconds as base unit for time, 47 alerts, 298-301 supported by durations, 222 Grafana, 108-112 unless operator, 253 notifications, 300, 317 untyped, 77, 206 textfile collector, 122-125 UntypedValue, 202 text_string_to_metric_families, 75 up, 21-31, 91, 253, 266, 342 thread pools, 57 alerting on, 296, 299 throttling, 306, 312 consul_up, 170 time, 46 haproxy_up, 172 context manager and function decorator, 50 uptime, 121 in latency histogram, 51 URL parameters, 147 functions for time and date, 262-265 USE (utilisation, saturation, and errors) met‐ query evaluation time, 224 rics, 55 time function, 262 time controls (Grafana), 104 V time series, 84 Value, 320 time zones, 262 vector function, 258 timeouts for blackbox monitoring probes, 190 vector matching, 245-255 timer (see summary) many-to-many and logical operators, timestamp function, 264 251-255 timestamps, 125 and operator, 254 exposition format, 78 or operator, 251 TLS (Transport Layer Security), 147, 343 unless operator, 253 TCP probe connecting via, 182 many-to-one and group_left, 248-251 tls_config, 160 one-to-one, 246 top-down vs. bottom-up service discovery vectors mechanisms, 129 instant, 219 topk, 236 range, 220-222 total and failures, metrics for, 57 vertical sharding, 333 tracing, 8 VictorOps, 315 track_inprogress, 47 vim, 14 transaction logs, 9 virtual machines, 117 trend factor, 274 trending, 5 twisted, 63 W types Wall of Graphs, 102 changing, functions for, 257 Web Server Gateway Interface (WSGI), 62 function input types and return values, 257 WebDriver exporter, 185 TYPE of metrics, 77 webhooks, 316-317 Windows systems, WMI exporter, 115 without clause, 88, 212 U using with aggregations to specify labels to uberagent, 115 remove, 230 uname collector, 121 worker pools, 57 unique label values, 233 wrapper, 85 unit tests for instrumentation, 54 WSGI, 62

368 | Index Y Yammer metrics, 193 YAML, 19 year function, 263

About the Author

Brian Brazil is the founder of Robust Perception and a Prometheus developer. He works on monitoring issues with companies ranging from early-stage startups to Fortune 500 corporations. He is well known in the Prometheus community, has given countless presentations at conferences, and covers many aspects of Prometheus and monitoring on his blog on the Robust Perception website.

Colophon

The animal on the cover of Prometheus: Up & Running is the tawny eagle (Aquila rapax), a bird of prey native to Africa, the Middle East, and India. Measuring 60–75 centimeters in length with a wingspan of 63–75 inches, the tawny eagle is slightly smaller than other members of the Aquila genus. They are brown in color, with their eponymous tawny coloration most prevalent in the upper body, giving way to darker feathers on the tail.

Tawny eagles tend to make their nests atop tall trees, where monogamous breeding pairs lay one to three eggs annually. They favor dry habitats such as deserts, steppes, and savannas, in which they feed on carrion, reptiles, and small mammals. Due to their widespread habitat range, tawny eagles are not believed to be threatened. However, the tawny eagle population is thought to be declining in West Africa due to the encroachment of cultivated land into their habitat.

Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from British Birds. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.