The Art of Monitoring

James Turnbull

September 16, 2016

Version: v1.0.3 (2c4a7d0)

Website: The Art of Monitoring

Some rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical or photocopying, recording, or otherwise, for commercial purposes without the prior permission of the publisher.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit here.

© Copyright 2016 - James Turnbull

Contents


Foreword
  Who is this book for?
  Credits and Acknowledgments
  Technical Reviewers
  Caitie McCaffrey
  Paul Stack
  Jamie Wilkinson
  Editor
  Author
  Conventions in the book
  Code and Examples
  Colophon
  Errata
  Disclaimer
  Copyright
  Version

Chapter 1: Introduction
  Welcome to the Art of Monitoring
  What is monitoring?
  The business as a customer
  Information Technology as a customer
  What does monitoring actually look like?
  Manual, user-initiated, or no monitoring
  Reactive
  Proactive
  Model distribution
  Becoming Proactive
  What’s in the book?
  Tool choices

Chapter 2: A Monitoring and Measurement Framework
  Blackbox versus Whitebox, or Pull versus Push
  Event, log, and metric-centered
  More about metrics
  So what’s a metric?
  Types of metrics
  Metric summaries
  Metric aggregation
  Contextual and useful notifications
  Visualization
  So why this architecture? What’s wrong with traditional monitoring?
  Static configuration
  Inflexible logic and thresholds
  Object-centric
  An interlude into pets and cattle
  So what do we do differently?
  Smarter threshold inputs
  Collecting data for our monitoring framework
  Overhead and the observer effect
  Summary

Chapter 3: Managing events and metrics with Riemann
  Introducing Riemann
  Riemann architecture and implementation
  Installing Riemann
  Configuring Riemann
  Learning some Clojure
  Riemann’s base configuration
  Events, streams, and the index
  Configuring events, streams, and the index
  Sending our first event to Riemann
  Creating our first Riemann monitoring check
  An interlude into Riemann filtering
  Connecting Riemann servers
  Configuring the upstream Riemann servers
  Configuring the downstream Riemann server
  Enabling the send of our Riemann events downstream
  Alerting on the upstream Riemann servers
  Throttling Riemann events
  Rolling up Riemann events
  Alternatives to email notifications
  Testing your Riemann configuration
  Validating Riemann configuration
  Performance, scaling, and making Riemann highly available
  Alternatives to Riemann
  Summary

Chapter 4: Introducing Graphite and Grafana
  Introducing Graphite
  Carbon
  Whisper
  Graphite Web, Graphite-API, and Grafana
  Graphite architecture
  Installing Graphite
  Installing Graphite on Ubuntu
  Installing Graphite on Red Hat
  Installing Graphite-API
  Installing Grafana
  Installing Graphite and Grafana via configuration management
  Configuring Graphite and Carbon
  Configuring Carbon’s metric retention
  Estimating Graphite storage
  Carbon and Graphite service management
  Configuring Graphite-API
  Service management for Graphite-API
  Testing the Graphite-API
  Configuring Grafana
  Configuring Riemann for Graphite
  A brief introduction to Grafana
  Graphite and Carbon Redundancy
  Time and time zones
  Managing time manually
  Managing Time via configuration management
  Checking the time status
  Alternatives to Graphite and Grafana
  Commercial tools
  Open-source tools
  Whisper alternatives
  InfluxDB
  Cyanite
  Summary

Chapter 5: Host monitoring
  Introducing collectd
  What host components should we monitor?
  Installing collectd
  Installing collectd on Ubuntu
  Installing collectd on Red Hat
  Installing collectd via configuration management
  Configuring collectd
  Loading and configuring collectd plugins for monitoring
  Finishing up
  Enabling and running collectd
  The collectd events
  Sending our collectd events to Graphite
  Refactoring the collectd metric names
  Summary

Chapter 6: Using collectd events in Riemann
  Checking processes are running
  Other actions and enhancements
  Replicating some classic monitoring
  Better monitoring through smarter data
  Building a median-based check
  Using percentiles for host-based checks
  Creating check abstractions
  Organizing our checks
  Graphing collectd metrics with Grafana
  Creating the Hosts dashboard
  Creating our first host graph
  Creating a memory graph
  Single host graphs
  Additional graphs
  Network, device, and Microsoft Windows monitoring
  Alternatives to collectd
  Commercial tools
  Open source
  Summary

Chapter 7: Containers: another kind of host
  Challenges with container monitoring
  Monitoring Docker containers
  Docker collectd plugin
  Installing the Docker collectd plugin
  Configuring the Docker collectd plugin
  Processing Docker collectd statistics with Riemann
  Adding metadata to our Docker events
  Specifying different resolution for Docker metrics
  Cleaning up old Graphite Docker metrics
  Using Docker metrics for monitoring
  Other container monitoring tools
  Summary

Chapter 8: Logs and logging
  Introducing Elasticsearch, Logstash, and Kibana
  Logstash architecture
  Installing Logstash
  On Debian & Ubuntu
  On Red Hat
  Testing Java is installed
  Installing the Logstash package
  Testing Logstash is installed
  Configuring Logstash
  Installing Elasticsearch
  On Debian and Ubuntu
  On Red Hat
  Installing Elasticsearch via configuration management
  Testing Elasticsearch is installed
  Determining Elasticsearch is running
  Configuring our Elasticsearch cluster and nodes
  Adding a cluster management plugin
  Time and time zone
  Integrating Logstash and Elasticsearch
  What happens inside Logstash?
  What happens inside Elasticsearch?
  Installing Kibana
  Configuring Kibana
  Running Kibana
  Using Kibana
  Connecting our hosts to Logstash via Syslog
  Configuring Logstash
  A quick introduction to Syslog
  Configuring Syslog
  Logging from Docker
  Configuring the Docker Daemon for logging
  Sending data from Logstash to Riemann
  Sending data from Riemann to Logstash
  Scaling Elasticsearch and Logstash
  Scaling Logstash
  Scaling Elasticsearch
  Monitoring our components
  Monitoring RSyslog
  Monitoring Logstash
  Monitoring Elasticsearch
  Alternatives to Logstash
  Splunk
  Heka
  Graylog
  mtail
  Summary


Chapter 9: Building Monitored Applications
  An application monitoring primer
  Where should I instrument?
  Instrument schemas
  Time and the observer effect
  Metrics
  Application metrics
  Business metrics
  Monitoring patterns, or where to put your metrics
  The utility pattern
  The external pattern
  Building metrics into a sample application
  Logging
  Adding our own structured log entries
  Adding structured logging to our sample application
  Working with your existing logs
  Health checks, endpoints, and external monitoring
  Checking an internal endpoint
  Deployments
  Adding deployment notifications to our sample application
  Working with our deployment events
  Tracing
  Summary

Chapter 10: Notifications
  Our current notifications
  Updating expired event configuration
  Upgrading our email notifications
  Formatting the email subject
  Formatting the email body
  Adding graphs to notifications
  Defining our data source
  Defining our query parameters
  Defining our graph panels and rows
  Rendering the dashboard
  Adding our dashboard to the Riemann notification
  Some sample scripted dashboards
  Other context
  Adding Slack as a destination
  Adding PagerDuty as a destination
  Maintenance and downtime
  Learning from your notifications
  Other alerting tools
  Summary

Chapter 11: Monitoring Tornado: a capstone
  The Tornado application
  Application architecture
  Monitoring strategy
  Tagging our Tornado events
  Monitoring Tornado — Web tier
  Monitoring HAProxy
  Monitoring Nginx
  Addressing the Web tier monitoring concerns
  Setting up the Tornado checks in Riemann
  The webtier function
  Adding Tornado checks to Riemann
  Summary

Chapter 12: Monitoring Tornado: Application Tier
  Monitoring the Application tier JVM
  Configuring collectd for JMX
  Collecting our Application tier JVM logs
  Monitoring the Tornado API application
  Addressing the Tornado Application tier monitoring concerns
  Summary

Chapter 13: Monitoring Tornado: Data tier
  Monitoring the Data tier MySQL server
  Using MySQL data for metrics
  Query timing
  Monitoring the Data tier’s Redis server
  Addressing the Tornado Data tier monitoring concerns
  The Tornado dashboard
  Expanding monitoring beyond Tornado
  Summary

Appendix A: An Introduction to Clojure and Functional Programming
  A brief introduction to Clojure
  Installing Leiningen
  Clojure syntax and types
  Clojure functions
  Lists
  Vectors
  Sets
  Maps
  Strings
  Creating our own functions
  Creating variables
  Creating named functions
  Learning more Clojure

List of Figures

List of Listings


Index

Foreword

Who is this book for?

This book is for engineers, developers, sysadmins, operations staff, and those with an interest in monitoring and DevOps. It provides a simple, hands-on introduction to the art of modern application and infrastructure monitoring.

There is an expectation that the reader has basic Unix/Linux skills and is familiar with the command line, editing files, installing packages, managing services, and basic networking.

Credits and Acknowledgments

• Ruth Brown, who continues to be the most amazing person in my life.
• Kyle Kingsbury, for writing Riemann and being an excellent resource when I had dumb questions.
• Pierre-Yves Ritschard, for providing Riemann and Clojure help.
• Ben Linsay, who provided feedback on the introductory Clojure material.
• Baron Schwartz, Dean Wilson, Brice Figureau and Marc Fournier, who provided valuable feedback on the book.
• Jeff Danzinger, for kindly letting me use his cartoon about averages.
• Simone Bottecchia, Katherine Daniels, Laurie Denness, Ryan Frantz, Kelvin Jasperson, Marc Fournier, Pierre-Yves Ritschard, Javier Uruen Val, Avleen Vig, and John Vincent, for answering monitoring questions.
• The folks at PagerDuty, for giving me free access to their platform.
• Bimlendu Mishra, for his scripted Grafana dashboard example.
• Michael Jakl, for his RESTful Clojure example application.

Technical Reviewers

Caitie McCaffrey

Caitie McCaffrey is a Backend Brat and Distributed Systems Diva at Twitter, where she is the Tech Lead of the Observability Team. Prior to that she spent the majority of her career building large-scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO. Caitie has a degree in Computer Science from Cornell University, and has worked on several video games, including Gears of War 2, Gears of War 3, Halo 4, and Halo 5. She maintains a blog at CaitieM.com and frequently discusses technology on Twitter @Caitie.

Paul Stack

Paul Stack is an infrastructure coder and is passionate about continuous integration, continuous delivery, and good operational procedures—and how they should be part of what developers and system administrators do on a day-to-day basis. He believes that reliably delivering software is as important as its development.

Jamie Wilkinson

Jamie is a Site Reliability Engineer in Google’s Storage Infrastructure team. He began in Linux systems administration in 1999, while earning a Bachelors in Computer Science, so knows just enough theory of computation to be dangerous in his field. He contributed a chapter on monitoring to the Google SRE Book. He lives with his family in Sydney, Australia.

Editor

Sid Orlando is a writer and editor (among other things), currently word-nerding out as Managing Editor at Kickstarter. Since starting work on more tech-focused projects, she may or may not be having recurring dreams about organizing her closet with dreamscape Docker containers.

Author

James is an author and open-source geek. His most recent books were The Docker Book about the open-source container virtualization technology and The LogStash Book about the popular open-source logging tool. James also authored two books about Puppet (Pro Puppet and the earlier book about Puppet). He is the author of three other books, including Pro Linux System Administration, Pro Nagios 2.0, and Hardening Linux. For a real job, James is CTO at Kickstarter. He was formerly at Docker as VP of Services and Support, Venmo as VP of Engineering, and Puppet as VP of Technical Operations. He likes food, wine, books, photography, and cats. He is not overly keen on long walks on the beach or holding hands.

Conventions in the book

This is an inline code statement. This is a code block:


Listing 1: Sample code block

This is a code block

Long code strings are broken.

Code and Examples

The code and example configurations contained in the book are available on GitHub at: https://github.com/jamtur01/aom-code

Colophon

This book was written in Markdown with a large dollop of LaTeX. It was then converted to PDF and other formats using PanDoc (with some help from scripts written by the excellent folks who wrote Backbone.js on Rails).

Errata

Please email any errata you find to [email protected].

Disclaimer

This book is presented solely for educational purposes. The author is not offering it as legal, accounting, or other professional services advice. While best efforts have been used in preparing this book, the author makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents, and specifically disclaims any implied warranties of merchantability or fitness of use for a particular purpose. The author shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. Every company is different and the advice and strategies contained herein may not be suitable for your situation. You should seek the services of a competent professional before beginning any monitoring program.

Copyright

Some rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical or photocopying, recording, or otherwise, for commercial purposes without the prior permission of the publisher.

Figure 1: License

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit here.

© Copyright 2016 - James Turnbull


Figure 2: ISBN

Version

This is version v1.0.3 (2c4a7d0) of The Art of Monitoring.

Chapter 1

Introduction

Let’s begin with an origin story for a company called Example.com. Once upon a time(-series), Example.com had a sysadmin. She managed infrastructure that lived in data centers. Every time a new host was added to that environment she installed a monitoring agent and set up some monitoring checks. Every now and again one of those hosts would break and a check would trigger. A notification would be sent, and she would wake up and run rm -fr /var/log/*.log to fix it.

For many years this approach worked just fine. Of course, there was some drama. Occasionally something would go wrong for which there wasn’t a check, or there just wasn’t time to act on a notification, or some applications and services on top of the hosts weren’t monitored. But largely, monitoring was fine.

Then the Information Technology (IT) industry started to change. Virtualization and Cloud computing were introduced, and the number of hosts that needed to be monitored increased by one or more orders of magnitude. Some of those hosts were run by people who weren’t sysadmins, or the hosts were outsourced to third parties. Some of the hosts in her data center were moved into the Cloud, or even replaced with Software-as-a-Service applications.

Most importantly, IT became a core channel for businesses to communicate with and sell to their customers. Applications and services that had previously been seen as just technology now became critical to customer satisfaction and providing high-quality customer service. IT was no longer a cost center, but something a company’s revenue relied on.

As a result, aspects of monitoring began to break down. It became hard to keep track of hosts (there were a lot more of them), applications and infrastructure became more complex, and expectations around availability and quality became more aggressive. It got harder and harder to check for all the possible things that could go wrong using the current system. Notifications piled up. More hosts and services meant more demand on monitoring systems—most of which were only able to scale by adding bigger, more powerful hosts, and could not be easily distributed. Under these loads, detecting and locating faults and outages grew ever slower and more challenging.

The organization began demanding more data to both demonstrate the quality of the service they were delivering to customers and to justify the increased spending on IT services. Many of these demands were made for data that existing monitoring simply wasn’t measuring or couldn’t generate. Her monitoring system became a tangled mess.

For many in the industry, this is the state of monitoring right now, and it’s not a happy place. It doesn’t have to be like this—you can build a better solution that addresses the change in how IT works and scales for the future.

Welcome to the Art of Monitoring

This is a hands-on guide to building a modern, scalable monitoring framework using up-to-date tools and techniques. We’re going to build that framework from the ground up. We’ll include best practices for both sysadmins and developers. We’ll show developers how they can better enable monitoring and metrics, and we’ll show sysadmins how to take advantage of metrics to do better fault detection and gain insights into performance. We’ll address the change in IT environments as a result of the dynamic infrastructure introduced by virtualization, containerization, and the Cloud. The goal of this book is to provide a monitoring framework that helps you and your customers better manage IT.

Before we launch into the guide, it’s important to talk about what monitoring is, why it exists, and some of the challenges that exist in each monitoring domain.

We’ll then talk about what’s in the book, what you’ll learn, and how you can change the way you perceive and implement monitoring.

What is monitoring?

From a technology perspective, monitoring is the tools and processes by which you measure and manage your IT systems. But monitoring is much more than that. Monitoring provides the translation between business value and the metrics generated by your systems and applications. Your monitoring system translates those metrics into a measurable user experience. That measurable user experience provides feedback to the business to help ensure it’s delivering what customers want. The user experience also provides feedback to IT to indicate what isn’t working and what’s delivering insufficient quality of service. Your monitoring system has two customers:

• The business
• Information Technology

The business as a customer

The first customer of your monitoring system is the business. Your monitoring exists to support the business—and to make sure it continues to do business. Monitoring provides the user experience data that allows the business to make good product and technology investments. Monitoring also helps the business measure the value technology delivers.

Information Technology as a customer

IT is the second customer. That’s you, your team, and the other folks who manage and maintain your technology environment. You rely on monitoring to let you know the state of your technology environment. You also use monitoring quite heavily to detect, diagnose, and help resolve faults and other issues in your technology environment. Monitoring contributes much of the data that informs your critical product and technology decisions, and measures the success of those projects. It’s a key part of your product management life cycle, your relationship with your internal customers, and it helps demonstrate that the business’s money is being well spent. Without monitoring you are not doing your job.

What does monitoring actually look like?

So, does this vision of monitoring mesh with the real-world implementation of most monitoring systems? That depends. The evolution of monitoring in organizations varies dramatically, or as William Gibson put it: The future is not evenly distributed. To explore this we’ve created a three-level maturity model that reflects the various stages of monitoring evolution organizations tend to experience. The stages are:

• Manual, user-initiated, or no monitoring
• Reactive
• Proactive

We don’t believe or claim this model is perfect. The stages identified are broad. Organizations may find they’re at any number of points on the broad spectrums inside those stages. Additionally, what makes measuring this maturity difficult is that not all organizations experience this evolution in linear or holistic ways. This can be the consequence of having employees with varying levels of skill and experience over different periods. It can be due to different segments, business units, or divisions of an organization having very different levels of maturity. Or it can be both.

Now on to the stages.

Manual, user-initiated, or no monitoring

Monitoring is largely manual, user initiated, or not done at all. If monitoring is performed, it’s commonly managed via checklists, simple scripts, and other non-automated processes. Often monitoring becomes cargo cult behavior, with only the components that have broken in the past being monitored. Faults in these components are remediated by repeatedly following rote steps that have also “worked in the past.”

The focus here is entirely on minimizing downtime and managing assets. Monitoring in this way provides little or no value in measuring quality or service, and provides little or no data that helps IT justify budgets, costs, or new projects.

This is typically found in small organizations with limited IT staffing, no dedicated IT staff, or where the IT function is run or managed by non-IT staff, such as a finance team.

Reactive

Reactive monitoring is mostly automatic with some remnants of manual or unmonitored components. Tooling of varying sophistication has been deployed to perform the monitoring. You will commonly see tools like Nagios with stock checks of basic concerns like disk, CPU, and memory. Some performance data may be collected. Most alerting will be based on simple thresholds, and sent via email or messaging services. There may be one or more centralized consoles displaying monitoring status.

There is a broad focus on measuring availability and managing IT assets. There may be some movement towards using monitoring data to measure customer experience. Monitoring provides some data that measures quality or service and provides some data that helps IT justify budgets, costs, or new projects. Most of this data needs to be manipulated or transformed before it can be used. A small number of operationally focused dashboards exist.

This is typical in small to medium enterprises and common in divisional IT organizations inside larger enterprises. Typically reactive monitoring is built and deployed by an operations team. You’ll often find large backlogs of notifications, and stale check configuration and architecture. Updates to monitoring systems tend to be reactive in response to incidents and outages. New monitoring checks are usually the last step in application or infrastructure deployments.

Proactive

Monitoring is considered core to managing infrastructure and the business. Monitoring is automatic and generated by configuration management. You’ll see tools like Nagios, Sensu, and Graphite with widespread use of metrics. Checks will tend to be more application-centric, with many applications instrumented as part of development. Checks will also focus on measuring application performance and business outcomes rather than just stock concerns like disk and CPU. Performance data will be collected and used frequently for analysis and fault resolution. Alerting will be annotated with context and will likely include escalations and automatic responses.

There is a focus on measuring quality of service and customer experience. Monitoring provides data that measures quality or service and provides data that helps IT justify budgets, costs, or new projects. Much of this data is provided directly to business units, application teams, and other relevant parties via dashboards and reports.

This is typical in web-centric organizations and many mature startups. This type of approach is also commonly espoused by organizations that have adopted a DevOps culture/methodology. Monitoring will still largely be managed by an operations team, but responsibility for ensuring new applications and services are monitored may be delegated to application developers. Products will not be considered feature complete or ready for deployment without monitoring.

Model distribution

Broadly based on some of our monitoring research, we’ve created a distribution for our monitoring maturity model.


Figure 1.1: Monitoring Maturity Model Distribution.

As you can see, the vast majority of environments fall into the Reactive level, which may not come as a surprise to most engineers. The Reactive level of maturity is relatively simple to achieve and appears to satisfy most of the basic needs for monitoring.

As stated, neither the model nor the proposed distribution is perfect. But, even given the broadness of the potential distribution, we can make some architectural predictions about the implementation of monitoring inside a Reactive-level organization.


Figure 1.2: Traditional Monitoring

This image represents the classic monitoring configuration we’ve seen repeated in a wide cross-section of Reactive-level organizations.

We have a Nagios instance that runs host and service checks, sends SMS or email notifications when something is wrong, and serves as the primary dashboard for interacting with notifications. There are numerous variants of this base setup with both open source and commercial tools but this remains the basic configuration you’re likely to see in Reactive-level organizations.

It’s also a basic configuration that is fundamentally flawed. We talked earlier about the two customers of monitoring: the business and technology. Our Reactive environment doesn’t serve the former at all and barely serves the latter.

Becoming Proactive

A Reactive environment generates infrastructure-centric monitoring outputs: a host is down, a service is broken. There are no business- or application-centric outputs. Without those outputs the business can’t rely on monitoring to provide inputs to business decisions. You certainly can’t use the data to justify budget for improving or updating the infrastructure. Or, often more importantly, for investing in your team.

As the Reactive environment is infrastructure-centric, it also only serves a segment of our technology customer—generally only operational teams—and doesn’t provide useful, application-centric data to developers. As a result, non-operations staff are disconnected from the reality of the performance and availability of the infrastructure and applications being monitored. Developers usually receive outputs secondhand, discouraging accountability for issues and faults.


NOTE It’s important to mention here that this critique of the Reactive model of monitoring does not (yet) touch on choices of tools and technology. This is not about picking on one tool or another or wars between toolchains. It’s purely about the ability to deliver customers what they need, and to make it easier for you to do your job.

So how do we take our typical Reactive environment and turn it into a much more palatable Proactive environment? Measurement. We’re going to update our Reactive environment to focus on events, metrics, and logs. We’ll replace a lot of our existing monitoring infrastructure—for example, service- and host-centric checks—with event- and metric-driven checks. In our monitoring framework, events, metrics, and logs are going to be at the core of our solution. The data points that make up our events, metrics, and logs will provide the source of truth for:

• The state of our environment.
• The performance of our environment.

So, rather than infrastructure-centric checks like pinging a host to return its availability or monitoring a process to confirm if a service is running, we’re going to replace most of those fault detection checks with metrics. If a metric is measuring then the service is available. If it stops measuring then it’s likely the service is not available.

Visualization of those events, metrics, and logs will also allow for the ready expression and interpretation of complex ideas that would otherwise take thousands of words or hours of explanation.

In Chapter 2, we’ll walk through our proposed framework in detail including how we’ve chosen to design it, and why we’ve chosen certain types of tools and techniques.


To help articulate this framework in the book we’ve used a make-believe company called Example.com so you can see what a real-world build might look like. Let’s take a quick look at the world of Example.com. Example has three main sites:

• Production A
• Production B (DRP)
• Mission Control

Each site is geographically separated. We’re going to focus on applications in our production site, Production A, but we’re going to show you how you can build as resiliently as possible across multiple sites. Example also has a DRP site, Production B, and a Mission Control site that contains management infrastructure including consoles and dashboards. Where relevant, we’ll demonstrate how to connect these sites into your monitoring framework as well.

Example also has test environments. In the real world we’d replicate much, if not all, of our new monitoring in these environments. This helps catch regressions and performance issues, and helps ensure monitoring is a first-class requirement when building applications and services.

Example is primarily a Linux environment, running recent versions of Red Hat Enterprise Linux and Ubuntu, and operates a number of customer-facing internal and external applications. Almost all of its applications are web based, with the stack including:

• Java and JVM-based applications
• Ruby on Rails
• LAMP-stack applications

Their database stack is a mix of MySQL/MariaDB, PostgreSQL, and Redis.

Much of the environment is managed with configuration management tools, and each environment has a Nagios server for monitoring.


Lastly, Example is beginning to explore the use of tools like Docker, and SaaS products like GitHub, PagerDuty, and others.

This environment provides a representative sample of technologies you’re likely to manage yourself that can be adapted to a wide variety of other environments and stacks.

What’s in the book?

In this book, you’ll learn how to build a monitoring framework. We’ll describe our proposed framework in Chapter 2 and build it, piece by piece, in subsequent chapters, then finally make use of the framework to monitor infrastructure, services, and applications.

It’s really important to understand that this isn’t a monitoring bible for every technology stack. We do use a lot of example applications covering a wide range of technologies to show you how to monitor different components. We don’t, however, provide detailed lists of exactly what you should monitor for every technology stack. This is because every environment and application is developed, built, and coded differently. Every organization also has different architecture and monitoring objectives, thresholds, and concerns. We’ll explore much of what you might need to monitor, identify critical checks, and introduce a series of patterns you can adopt or adapt. You should be able to build the framework into a solution for your organization that meets your specific needs.

Let’s look at what’s in each chapter.

• Chapter 1: This introduction.
• Chapter 2: Our monitoring framework: monitoring, metrics, and measurement. This chapter provides background on the decisions and architecture of our monitoring framework.
• Chapter 3: Managing events and metrics with an event router called Riemann.
• Chapter 4: Storing and visualizing metrics with Graphite and Grafana.
• Chapter 5: Host monitoring with collectd.
• Chapter 6: Using collectd events in Riemann and Graphite.
• Chapter 7: Monitoring containers. In this chapter we look at monitoring containers, primarily Docker.
• Chapter 8: Collecting logs for diagnosis and status. Covers the Elasticsearch, Logstash, and Kibana (ELK) stack.
• Chapter 9: Building monitored applications: How to add instrumentation, metrics, logging, and events to your applications.
• Chapter 10: Notifications: Building contextual and human-friendly notifications.
• Chapters 11 to 13: Monitoring a stack. We’ll put all our components together to monitor an example host, service, and application stack. These chapters will present a full picture of how our framework will work.
• Appendix A: An introduction to Clojure, which Riemann uses as a configuration language. (We recommend you read this prior to Chapter 3.)

Finally, one topic we’re not covering directly in the book is the monitoring of non-host devices: networking equipment, storage devices, data center equipment. However, many of the techniques we’re exploring in the book can be replicated on these kinds of devices. Modern devices allow you to push metrics, provide metric and status endpoints, and generate appropriate events and logs.

Tool choices

In this book we look at mostly free and open-source monitoring tools and solutions. There are a number of commercial tools and online services that provide monitoring services but we won’t cover them in much detail.


We recognize that there are a lot of moving pieces here. You might look at the list of tools we’re introducing and say “that’s a lot of software I have to learn and manage.” To help with this, the book is arranged so that you can potentially implement pieces of the framework rather than the whole. Most chapters have a stand-alone component that you could use on its own, in addition to integrating it with the other components.

We’ve also chosen tools we think are best of breed in their domains. These choices are based on research, experience, and consultation with colleagues in the industry. Where possible, in each chapter we’ve listed alternative tools you could explore if you find the tools introduced don’t suit you or don’t meet your needs.

Perhaps a better way of looking at these tool choices is that they are merely ways to articulate the change in monitoring approach that is proposed in this book. They are the trees in the woods. If you find other tools that work better for you and achieve the same results then we’d love to hear from you. Write a blog post, give a talk, or share your configuration.

Chapter 2

A Monitoring and Measurement Framework

In forthcoming chapters, we’ll build our monitoring framework. But first, in these initial chapters, we’re going to look at data collection, metrics, aggregation, and visualization. Then we’ll expand the framework to collect application and business metrics, culminating in a capstone chapter where we’ll put everything together. We’ll build a framework that focuses on events and metrics and collects data in a scalable and robust way.

In our new monitoring paradigm, events and metrics are going to be at the core of our solution. This data will provide the source of truth for:

• The state of our environment
• The performance of our environment

Visualization of this data will also allow for the ready expression and interpretation of complex ideas that would otherwise take thousands of words or hours of explanation.


In this chapter we’re going to step through our proposed monitoring framework. We’ll introduce the basic concepts and lay the groundwork that will help you understand the choice of tools and techniques we’ve made later in the book. To implement our monitoring framework we’re proposing a new architecture.

Figure 2.1: Monitoring framework


Our new architecture is going to:

• Allow us to easily visualize the state of our environment.
• Be event, log, and metrics-centric.
• Focus on “whitebox” or push-based monitoring instead of “blackbox” or pull-based monitoring.
• Provide contextual and useful notifications.

These objectives will allow us to take our Reactive Example.com environment closer to the Proactive model, and to ensure we monitor the right components in the right way.

Let’s examine our new architecture.

Blackbox versus Whitebox, or Pull versus Push

We’re going to fundamentally change the architecture of how we perform monitoring. Most monitoring systems are “blackbox” or pull/polling-based systems. An excellent example is Nagios. With Nagios, your monitoring system generally queries the components being monitored; a classic check might be an ICMP-based ping of a host. This means that the more hosts and services you manage in your environment, the more checks your Nagios host needs to execute and process. We then need to scale our monitoring vertically or via partition to address growth.

We’re going to, wherever possible, avoid “blackbox” monitoring in favor of “whitebox” or push-based monitoring. With a “whitebox” (hereafter push-based) architecture, hosts, services, and applications are emitters, sending data to a central collector. The collection is fully distributed on the hosts, services, and applications that emit data, resulting in linear scalability. This means monitoring is no longer a monolithic central function, and we don’t need to vertically scale or partition that monolith as more checks are added.


Emitters report when they are available. Generally emitters are stateless, sending data as soon as it is generated. They can use transports and mechanisms local and appropriate to themselves, rather than being forced into a choice by your monitoring tools. This enables us to build modular, functionally separated, compartmentalized monitoring solutions with selected best-of-breed tools rather than monolithic silos.
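The emitters and collectors used in this book (collectd, Riemann, and friends) are introduced in later chapters. Purely to illustrate the push model, here is a minimal sketch in Python of a stateless emitter pushing an observation to a hypothetical collector over UDP; the collector address and the event format are assumptions for this example, not the wire format of any tool used in the book.

```python
import json
import socket
import time

# Hypothetical collector endpoint -- an assumption for illustration only.
COLLECTOR = ("collector.example.com", 5555)


def emit(service, metric, tags=None):
    """Push a single observation to the collector and forget about it.

    The emitter holds no state and never listens for inbound connections;
    it simply sends data as soon as that data is generated.
    """
    event = {
        "host": socket.gethostname(),
        "service": service,
        "metric": metric,
        "time": int(time.time()),
        "tags": tags or [],
    }
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(json.dumps(event).encode("utf-8"), COLLECTOR)
    finally:
        sock.close()


# Example: report the number of worker processes currently running.
emit("workers.running", 12, tags=["appserver"])
```

Note that the collector never needs to know this emitter exists ahead of time: a short-lived host or container can send a handful of events and disappear without any central configuration.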

“Blackbox” (hereafter pull-based) approaches also require your monitoring targets to be centrally configured on what and where to monitor. With push, your emitters, hosts, services, and applications send data when they start, and push metrics at destinations you’ve configured. This is especially important in dynamic environments, where a short-lived activity might not have sufficient time to be discovered or converged into configuration by a pull-based monitor. With a push-based architecture this isn’t an issue because the emitter controls when and where the data is sent.

We also get a broad security dividend from a push-based architecture: emitters are inherently more secure against remote attacks since they do not listen for network connections. This decreases the attack surface of our hosts, services, and applications. Additionally, this reduces the operational complexity of any security model, as networks and firewalls only need to be configured for unidirectional communication from emitters to collector.

Polling-based systems also generally emphasize monitoring availability—“Is it up?”—and the minimization of downtime.[1] Where we do use polling systems we’ll limit their focus to this sort of coarse-grained availability monitoring. Polling-based systems also provide a strong focus on small, atomic actions—for example, telling you that an Nginx daemon has stopped working. This can be hugely attractive, because fixing those atomic actions is often much easier and simpler than addressing more systemic issues, such as a 10% site-wide increase in HTTP 500 errors.

[1] With some notable exceptions like Prometheus and Google’s Borgmon.


You may be thinking, “Hey, what’s wrong with that?” Well, there’s nothing fundamentally wrong with it except that it reinforces the view that IT is a cost center. Orienting your focus toward availability, rather than quality and service, treats IT assets as pure capital and operational expenditure. They aren’t assets that deliver value, they are just assets that need to be managed. Organizations that view IT as a cost center tend to be happy to limit or cut budgets, outsource services, and not invest in new programs because they only see cost and not value.

Thankfully, IT organizations have started to be viewed in a more positive light. Organizations have recognized that it’s not only impossible to do business without high-quality IT services but that they are actually market differentiators. If you do IT better than your competitors then this is a marketable asset. Adding to this is the popularity and flexibility of virtualization, elastic computing like Cloud, and the introduction of Software-as-a-Service (SaaS). Now the perception has started to move IT from a cost center to, if not an actual revenue center, then at least a lever for increasing revenue. This change has consequences though, with the most important being that we now need to measure the quality and performance of IT, not just its availability. This data is crucial to both the business and technology making good decisions.

Push-based models also tend to be more focused on measurement. You still get availability measurement, but as a side effect of measuring components and services. As collection is distributed and generally low overhead, you can also push a lot of data and store it at high precision. This increased precision of data can then be used to more quickly answer questions about quality of service, performance, and availability, and to power decisions around spending, headcount, and new programs. This changes the focus inside your IT organization towards measuring value, throughput, and performance—all levers that are about revenue rather than cost.


Event, log, and metric-centered

Our new push-centric architecture is going to be centered around collecting event and metric data. We’ll use that data to monitor our environment and detect when things go wrong.

• Events — We’ll generally use events to let us know about changes and occurrences in our environment.
• Logs — Logs are a subset of events. While they’re helpful for letting us know what’s happening, they’re often most useful for fault diagnosis and investigation.
• Metrics — Of all these data sources, we’ll rely most heavily on metrics to help us understand what’s going on in our environment.

Let’s take a deeper look at metrics.

More about metrics

Metrics always appear to be the most straightforward part of any monitoring architecture. As a result we sometimes don’t invest quite enough time in understanding what we’re collecting, why we’re collecting it, and what we’re doing with our metrics.

Indeed, in a lot of monitoring frameworks the focus is on fault detection: detecting if a specific system event or state has occurred (more on this below). When we receive a notification about a specific system event, usually we go look at whatever metrics we’re collecting, if any, to find out what exactly has happened and why. In this world, metrics are seen as a by-product of or a supplement to our fault detection.

TIP See the discussion later in this chapter about notification design for further reasons why this is a challenging problem.

We’re going to change this idea of metrics-as-supplement. Metrics are going to be the most important part of your monitoring workflow. We’re going to turn the fault-detection-centric model on its head. Metrics will provide the state and availability of your environment and its performance.

Our framework avoids duplicating Boolean status checks when a metric can provide information on both state and performance. Harnessed correctly, metrics provide a dynamic, real-time picture of the state of your infrastructure that will help you manage and make good decisions about your environment.

Additionally, through anomaly detection and pattern analysis, metrics have the potential to identify faults or issues before they occur or before the specific system event that indicates an outage is generated.

So what’s a metric?

As metrics and measurement are so critical to our monitoring framework, we’re going to help you understand what metrics are and how to work with them. This is intended to be a simplified background that will allow you to understand what different types of metrics, data, and visualizations will contribute to our monitoring framework.

Metrics are measures of properties in pieces of software or hardware. To make a metric useful we keep track of its state, generally recording data points or observations over time. An observation is a value, a timestamp, and sometimes a series of properties that describe the observation, such as a source or tags. The combination of these data point observations is called a time series.

A classic example of a metric we might collect as a time series is website visits, or hits. We periodically collect observations about our website hits, recording the number of hits and the times of the observations. We might also collect properties such as the source of a hit, which server was hit, or a variety of other information.
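As a rough illustration of the shape of this data, the sketch below models an observation and a time series in Python. The field names are assumptions chosen for the example rather than the schema of any particular monitoring tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    value: float      # the measured value, e.g., number of hits
    timestamp: int    # when it was observed (Unix epoch seconds)
    source: str = ""  # optional property, e.g., which server was hit
    tags: List[str] = field(default_factory=list)


# A time series is a chronologically ordered list of observations.
website_hits = [
    Observation(value=312, timestamp=1471305600, source="web1"),
    Observation(value=298, timestamp=1471305660, source="web1"),
    Observation(value=340, timestamp=1471305720, source="web1"),
]
```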

We generally collect observations at a fixed time interval—we call this the granularity or resolution. This could range from one second to five minutes to 60 minutes or more. Choosing the right granularity at which to record a metric is critical. Choose too coarse a granularity and you can easily miss the detail. For example, sampling CPU or memory usage at five-minute intervals is highly unlikely to identify anomalies in your data. Alternatively, choosing too fine a granularity can result in the need to store and interpret large amounts of data. We’ll talk more in Chapter 4 about this.

So, a time-series metric is generally a chronologically ordered list of these observations. Time-series metrics are often visualized, sometimes with a mathematical function applied, as a two-dimensional plot with data values on the y-axis and time on the x-axis. Often you’ll see multiple data values plotted on the y-axis—for example, the CPU usage values from multiple hosts or successful and unsuccessful transactions.

Figure 2.2: A sample plot

These plots can be incredibly useful to us. They provide us with a visual representation of critical data that is (relatively) easy to interpret, certainly with more facility than perusing the same data in the form of a list of values. They also present us with a historical view of whatever we’re monitoring—they show us what has changed and when. We can use both of these capabilities to understand what’s happening in our environment and when it happened.

Types of metrics

There are a variety of different types of metrics we’ll see in our environment.

Gauges

The first, and most common, type of metric we’ll see is a gauge. Gauges are numbers that are expected to change over time. A gauge is essentially a snapshot of a specific measurement. The classic metrics of CPU, memory, and disk usage are usually articulated as gauges. For business metrics, a gauge might be the number of customers present on a site.

Figure 2.3: A sample gauge
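As an aside, here is roughly what taking a gauge-style snapshot might look like in Python, assuming the third-party psutil library is available; the metric name and output format are assumptions for the example.

```python
import time

import psutil  # third-party library, assumed installed for this sketch


def memory_gauge():
    """Return a point-in-time snapshot of memory usage as a gauge."""
    return {
        "name": "memory.used_percent",
        "value": psutil.virtual_memory().percent,
        "time": int(time.time()),
    }


print(memory_gauge())  # e.g., {'name': 'memory.used_percent', 'value': 62.3, ...}
```

Sampled repeatedly, these snapshots form the time series a gauge is usually plotted from.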

Counters

The second type of metric we’ll see frequently is a counter. Counters are numbers that increase over time and never decrease. Although they never decrease, counters can sometimes reset to zero and start incrementing again. Good examples of application and infrastructure counters are system uptime, the number of bytes sent and received by a device, or the number of logins. Examples of business counters might be the number of sales in a month or cost of sales for a time period.

Figure 2.4: A sample counter

In this figure we have a counter incrementing over a period of time.

A useful thing about counters is that they let you calculate rates of change. Each observed value is a moment in time: t. You can subtract the value at t from the value at t+1 to get the rate of change between the two values. A lot of useful information can be gleaned from the rate of change between two values. For example, the number of logins is marginally interesting, but create a rate from it and you can see the number of logins per second, which should help identify periods of site popularity.
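Here is a minimal worked example of deriving a rate from a counter, using invented login counts:

```python
# Two observations of a monotonically increasing login counter.
value_t0, time_t0 = 10_000, 1471305600  # total logins at time t
value_t1, time_t1 = 10_600, 1471305660  # total logins at t+1, 60 seconds later

# Rate of change: the difference in values divided by the elapsed time.
rate = (value_t1 - value_t0) / (time_t1 - time_t0)
print(f"{rate:.1f} logins/second")  # 10.0 logins/second
```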

Timers

We’ll also see a small selection of timers. Timers track how long something took. They are commonly used for application monitoring—for example, you might embed a timer at the start of a specific method and stop it at the end of the method. Each invocation of the method would result in the measurement of the time the method took to execute.


Figure 2.5: A sample timer

Here we have a timer measuring the average time in milliseconds of the execution of a payment method.
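A sketch of what embedding such a timer might look like in Python follows. The payment method is invented, and in a real application the measurement would be pushed to your metrics collector rather than printed.

```python
import time
from functools import wraps


def timed(func):
    """Measure how long each invocation of func takes, in milliseconds."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"timer {func.__name__}: {elapsed_ms:.2f} ms")
    return wrapper


@timed
def process_payment(amount):
    time.sleep(0.05)  # stand-in for real payment processing
    return f"charged {amount}"


process_payment(42.50)
```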

Metric summaries

Often the value of a single metric isn’t useful to us. Instead, visualization of a metric requires applying mathematical transformations to it. For example, we might apply statistical functions to our metric or to groups of metrics. Some common functions we might apply include:

• Count or n — We count the number of observations in a specific time interval.

• Sum — We sum (add together) values from all observations in a specific time interval.

• Average — The mean of all values in a specific time interval.

• Median — The median is the dead center of our values: exactly 50% of values are below it, and 50% are above it.

• Percentiles — Measure the values below which a given percentage of observations in a group of observations fall.

• Standard deviation — Standard deviation from the mean in the distribution of our metrics. This measures the variation in a data set. A standard deviation of 0 means all values are equal to the mean. Higher deviations mean the data is spread out over a range of values.

• Rates of change — Rates of change representations show the degree of change between data in a time series.


• Frequency distribution and histograms — This is a frequency distribution of a data set. You group data together—a process called “binning”—and present the groups in such a way that their relative sizes are visualized. The most common visualization of a frequency distribution is a histogram.

Figure 2.6: A histogram example

Here we see a sample histogram for the frequency distribution of heights. On the y-axis we have the frequency, and on the x-axis we have the distribution of heights. We see that for heights between 160 and 165 cm there is a frequency of two.
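To make these summaries concrete, here is a small sketch that computes several of them over a window of invented response-time observations, using only Python’s standard library. The percentile calculation uses a simple nearest-rank approach for brevity.

```python
import statistics
from collections import Counter

# A window of response-time observations in milliseconds (invented values).
window = [12, 15, 11, 14, 220, 13, 16, 12, 18, 15]

count = len(window)                 # 10
total = sum(window)                 # 346
mean = statistics.mean(window)      # 34.6 -- dragged up by the 220 ms outlier
median = statistics.median(window)  # 14.5 -- barely affected by the outlier
stdev = statistics.stdev(window)    # spread of the data around the mean
p95 = sorted(window)[int(0.95 * (count - 1))]  # most observations fall below this

# A simple frequency distribution ("binning") into 50 ms buckets,
# the raw material for a histogram.
bins = Counter((value // 50) * 50 for value in window)
print(mean, median, round(stdev, 1), p95, dict(bins))
```

Note how the single outlier drags the mean well above the median, which is one reason monitoring checks often prefer medians and percentiles to averages.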

TIP This is a brief introduction to these summary methods. We’ll talk about some of them in more detail later in the book.

Metric aggregation

In addition to summaries of specific metrics, you often want to show aggregated views of metrics from multiple sources, such as disk space usage of all your application servers. The most typical example of this results in multiple metrics being displayed on a single plot. This is useful in identifying broad trends over your environment. For example, an intermittent fault in a load balancer might result in web traffic dropping off for multiple servers. This is often easier to see in aggregate than by reviewing each individual metric.

Figure 2.7: An aggregated collection of metrics

In this plot we see disk usage from numerous hosts over a 30-day period. It gives us a quick way to ascertain the current state (and rate of change) of a group of hosts.

Ultimately you’ll find a combination of single and aggregate metrics—the former to drill down into specific issues, the latter to see the high-level state—provide the most representative view of the health of your environment.

Contextual and useful notifications

Notifications are the primary output from our monitoring architecture. They can consist of emails, instant messages, SMS messages, pop-ups, or anything else used to let you know about things in your environment that you need to be aware of. This seems like it should be a really simple domain but it contains a lot of complexity and is frequently poorly implemented and managed. To build a good notification system you need to consider the basics of:

• Who to tell about a problem.
• How to tell them.
• How often to tell them.
• When to stop telling them, do something else, or escalate to someone else.

If you get it wrong and generate too many notifications then people will be unable to take action on them all and will generally mute them. We all have war stories of mailbox folders full of thousands of notification emails from monitoring systems. Sometimes so many notifications are generated that you suffer from “alert fatigue” and ignore them (or worse, conduct notification management via Select All -> Delete). Consequently you’re likely to miss actual critical notifications when they are sent.

Most importantly, you need to work out WHAT to tell whoever is receiving the notifications. Notifications are usually the sole signal that you receive to tell you that something is amiss or requires your attention. They need to be concise, articulate, accurate, digestible, and actionable. Designing your notifications to actually be useful is critical. Let's make a brief digression and see why this matters. We'll look at a typical Nagios notification for disk space.


Listing 2.1: Sample Nagios notification

PROBLEM Host: server.example.com
Service: Disk Space
State is now: WARNING for 0d 0h 2m 4s (was: WARNING) after 3/3 checks
Notification sent at: Thu Aug 7th 03:36:42 UTC 2015 (notification number 1)
Additional info:
DISK WARNING - free space: /data 678912 MB (9% inode=99%)

Now imagine you've just received this notification at 3:36 a.m. What does it tell you? That we have a host with a disk space warning. That the /data volume is 91% full. This seems useful at first glance but in reality it's really not that practical. Firstly, is this a sudden increase? Or has this grown gradually? What's the rate of expansion? For example, 9% disk space free on a 1GB partition is different from 9% disk free on a 1TB disk. Can I ignore or mute this notification or do I need to act now? Without this additional context my ability to take action on this notification is limited and I need to invest considerably more time to gather context. In our framework we're going to focus on:

• Making notifications actionable, clear, and articulate. Just the use of notifications written by humans rather than computers can make a significant difference in the clarity and utility of those notifications.
• Adding context to notifications. We're going to send notifications that contain additional information about the component we're notifying on.
• Aligning our notifications with the business needs of the service being monitored so we only notify on what's useful to the business.

TIP The simplest advice we can give here is to remember notifications are read by humans, not computers. Design them accordingly.

In Chapter 10 we’ll build notifications with greater context and add a notification system to the monitoring framework we’re building.

Visualization

Visualizing data is both an incredibly powerful analytic and interpretive technique and an amazing learning tool. Throughout the book we'll look at ways to visualize the data and metrics we've collected. However, metrics and their visualizations are often tricky to interpret. Humans tend towards apophenia—the perception of meaningful patterns within random data—when viewing visualizations. This often leads to sudden leaps from correlation to causation. This can be further exacerbated by the granularity and resolution of our available data, how we choose to represent it, and the scale on which we represent it.

Our ideal visualizations will clearly show the data, with an emphasis on highlighting substance over visuals. In this book we've tried to build visuals that subscribe to these broad rules:

• Clearly show the data.
• Induce the viewer to think about the substance, not the visuals.
• Avoid distorting the data.
• Make large data sets coherent.
• Allow changing perspectives of granularity, without impacting comprehension.

We've drawn most of our ideas from Edward Tufte's The Visual Display of Quantitative Information and thoroughly recommend reading it to help you build good visualizations.

There’s also a great post from the Datadog team on visualizing time-series data that is worth reading.

So why this architecture? What's wrong with traditional monitoring?

In Chapter 1 we talked broadly about the problem space, how IT has changed, and why traditional monitoring fails to address that change. Let's look more deeply into what's broken and why this new architecture addresses those gaps.

When we describe “traditional monitoring,” especially in Reactive environments, what we're usually talking about is fault detection. It is best articulated as watching an object so we know it's working. Traditional monitoring is heavily focused on this active polling of objects to return their state—for example, an ICMP ping-based host availability check. Historically fault detection checks have relied on Boolean decisions that indicate whether something responds or if a value falls within a range. Check selection and implementation is also simplistic and may be:

• Experience or learning-based. You implement the same checks you've used in the past, or cargo-culted checks acquired from sources like documentation, example configurations, or blog posts.


• Reactive. You implement a check or checks in response to an incident or outage that has occurred in the past.

Boolean check design and experience-based and Reactive checks have some major design flaws. Let’s examine why they are issues.

Static configuration

The checks generally have static configuration. Your check probably needs to be updated every time your system grows, evolves, or changes. In virtual and cloud environments, a host or service being monitored may be highly ephemeral: appearing, disappearing, or migrating locations or hosts multiple times during its lifespan. Statically defined checks just don't handle this changing landscape, resulting in checks (and faults) on resources that do not exist or that have changed.

Further, many monitoring systems require you to duplicate configuration on both a server and the object being monitored. This lack of a single source of truth leads to increased risk of inconsistency and difficulty in managing checks. It also generally means that the monitoring server needs to know about resources being monitored before they can be monitored. This is clearly problematic in dynamic or changing landscapes.

Additionally, updates to monitoring are often considered secondary to scaling or evolving the systems themselves. Many faults are thus the result of incorrect configuration or orphaned checks. These false positives take time and effort to diagnose and resolve. They clutter your monitoring environment, hiding actual issues and concerns. Many teams do not realize they can change or delete the existing checks to remove these false positives—they take the monitoring as gospel.

Inflexible logic and thresholds

The checks are often inflexible Boolean logic or arbitrary, static, point-in-time thresholds. They generally rely on a specific result or range being matched. The checks again don't consider the dynamism of most complex systems. A match or a breach in a threshold may be important or could have been triggered by an exceptional event—or it could even be a natural consequence of growth.

It's our view that arbitrary static thresholds are always wrong. Database performance analysis vendor VividCortex's CEO Baron Schwartz put it well:

They're worse than a broken clock, which is at least right twice a day. A threshold is wrong for any given system, because all systems are slightly different, and it's wrong for any given moment during the day, because systems experience constantly changing load and other circumstances.

Arbitrary static thresholds set up a point-in-time boundary. During that period everything beneath that boundary is judged normal and everything above is abnormal. That boundary is not only inflexible but it's entirely artificial. One system's abnormality may be another's normal operation. That means notifications wired to arbitrary thresholds will frequently fire off false positives, unrelated to any actual problems.

Boolean checks suffer similar issues to arbitrary thresholds. They are usually singletons, and often can't take advantage of trends or prior event history. Is this really a failure? Is it a critical failure? Is it merely flapping? Could one or more failures of this check (or even across a series of checks) actually be survivable, especially in the context of a resilient and well-architected application?

Object-centric

The checks are object-centric, usually centric to single hosts or services. They require you to define a check on an object or objects. This breaks down quickly because each of those objects is usually part of a much larger, often much more complex system. These single object checks frequently lack any context and limit your ability to understand what the check's output means to the broader system. As a consequence it is often hard to determine the criticality of the object's failure.

Of course, some monitoring systems do attempt to provide contextual layers above object checks, usually via grouping, but rarely manage to model beyond basic constructs. They also lack the ability to process the dynamism in most modern environments.

An interlude into pets and cattle

By the end of this book a lot of folks are probably going to be surprised by how few fault detection checks we actually build. Traditional monitoring environments are often marked by thousands of checks. So why aren't we going to replicate those environments? Well, as we've discovered prior, those sorts of environments aren't easy to manage, scale, or massively duplicate, and often aren't actually helpful in doing fault diagnosis. There's another factor, though, related to how the process of fault resolution is changing.

Bill Baker, a former Distinguished Engineer at Microsoft, once quipped that hosts are either pets or cattle. Pets have sweet names like Fido and Boots. They are lovingly raised and looked after. If something goes wrong with them you take them to the vet and nurse them back to health. Cattle have numbers. They are raised in herds and are basically identical. If something goes wrong with one of them, you put it down and replace it with another.

In the past hosts were pets. If they broke you fixed them, often nursing a host—named for a Simpsons character—back to life multiple times, tweaking configuration, fiddling with settings, and generally investing time in resolving the issue.

In modern environments, hosts are cattle. They should be configured automatically and rebuilt automatically. If a server fails then you kill it and restart another, automatically building it back to a functioning state. Or if you need more capacity you can add additional hosts. In these environments you don't need hundreds or thousands of checks on individual components because the default fix for significant numbers of those components is to rebuild the host or scale the service.

So what do we do differently?

We've identified issues with traditional fault detection checks, and we're advocating replacing these traditional status checks with events and metrics, but what does that mean? Rather than infrastructure-centric checks like pinging a host to return its availability or monitoring a process to confirm if a service is running, we configure our hosts, services, and applications to emit events and metrics. We get two benefits from events and metrics. Firstly:

If a metric is measuring, an event is reporting, or a log is spooling, then the service is available. If it stops measuring or reporting then it’s likely the service is not available.

NOTE What do we mean by available? The definition, for the purposes of this book, is that a host, service, or application is operable and functioning in line with expectations.

How will this work? The event router in our monitoring framework is responsible for tracking our events and metrics. It can potentially do a lot of useful things with those events and metrics including storing them, sending them to visualization tools, or using them and their values to notify us of performance issues. But most importantly it knows about the existence of those events and metrics. Let’s look at an example. We configure a web server to emit metrics showing the current workload. We then configure our event router to detect:

• If the metric stops being reported.
• If the value of a metric matches some criteria we've developed.


In the former case, if the metric disappears from our event router, we can be fairly certain that something has gone wrong. Either the web server has stopped working or something has happened between us and the server to prevent data from reaching our event router. In either case we’ve identified a fault that we may wish to investigate. In the latter case, we get useful data from the payload of the event or metric. Not only is this data useful for long-term analysis of trends, performance, and capacity but it presents an opportunity to build a new paradigm for checking state. In our traditional monitoring model we rely on arbitrary thresholds to determine if we have an issue—for example, polling our CPU usage and reporting a warning if it is above a certain percentage. Now, instead of checking those arbitrary thresholds, we use a smarter approach. We can’t totally eliminate the need to set thresholds, but we can make our analysis a lot smarter by making the inputs to our thresholds more intelligent.

Smarter threshold inputs

In our new model we still use thresholds but the data we feed into those thresholds is considerably more sophisticated. We will generate better data and analysis and get a better understanding of the experience of our users from our collected metrics. All of this leads to the identification of valid issues and problems. In our new monitoring framework we will:

• Collect frequent and high-resolution data.
• Look at windows of data, not static points in time.
• Calculate smarter input data.

Using this methodology we're more likely to identify if a state is an actual issue instead of an anomalous spike or transitory state.

We'll look at collection of high-frequency data and techniques for viewing windows of data in the forthcoming chapters. But calculating smarter input data for our thresholds and checks requires some explanation of some of the possible techniques we could choose and some we shouldn't use. Let's take a look at why, why not, and how we might use averages, the median, standard deviation, percentiles, and other statistical choices.

NOTE This is a high-level overview of some statistical techniques rather than a deep dive into the topic. As a result, exploration of some topics may appear overly simplistic to folks with strong statistical or mathematical backgrounds.

Average

Averages are the de facto metric analysis method. Indeed, pretty much everyone who has ever monitored or analyzed a website or application has used averages. In the web operations world, for example, many companies live and die by the average response time of their site or API. Averages are attractive because they are easy to calculate. Let’s say we have a list of seven time-series values: 12, 22, 15, 3, 7, 94, and 39. To calculate the average we sum the list of values and divide the total by the number of values in the list.

(12 + 22 + 15 + 3 + 7 + 94 + 39)/7 = 27.428571428571
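As a minimal sketch, the same calculation in Clojure looks like this:

;; The average (mean) of our sample values.
(def values [12 22 15 3 7 94 39])

(double (/ (reduce + values) (count values)))
; => approximately 27.43, matching the calculation above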

We first sum the seven values to get the total of 192. We then divide the sum by the number of values, here 7, to return the average: 27.428571428571. Seems pretty simple huh? The devil, as they say, is in the details.

Averages assume there is a normal event or that your data is a normal distribution—for example, in our average response time it's assumed all events run at equal speed or that response time distribution is roughly bell curved. But this is rarely the case with applications. There's an old statistics joke about a statistician who jumps in a lake with an average depth of only 10 inches and nearly drowns.

Figure 2.8: The flaw of averages - Copyright Jeff Danzinger

So why did he nearly drown? The lake contained large areas of shallow water and some areas of deep water. Because there were larger areas of shallow water the average was lower overall. In the monitoring world the same principle applies: lots of low values in our average distort or hide high values and vice versa. These hidden outliers can mean that while we think most of our users are experiencing a quality service, there are potentially a significant number who are not.

Let’s look at an example, using response times and requests for a website.


Figure 2.9: Response time average

Here we have a plot showing response time for a number of requests. Calculating the average response time would give us 4.1 seconds. The vast majority of our users would experience a (potentially) healthy 4.1 second response time. But many of our users are experiencing response times of up to 12 seconds, perhaps considerably less acceptable.

Let’s look at another example with a wider distribution of values.


Figure 2.10: Response time average Mk II

Here our average would be a less stellar 6.8 seconds. But worse, this average is considerably better than the response time experienced by the majority of our users, who see a heavy distribution of request times around 9, 10, and 11 seconds. If we were relying on the average alone we'd probably think our application was performing a lot better than our users are experiencing it.

Median

At this point you might be wondering about using the median. The median is the dead center of our values: exactly 50% of values are below it, and 50% are above it. If there are an odd number of values then the median will be the value in the middle. For the first data set we looked at—12, 22, 15, 3, 7, 94, and 39—the median is 15. If there were an even number of values the median would be the mean of the two values in the middle. So, if we were to remove 39 from our data set to make it even, the median would become 13.5.
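Here's a minimal sketch of the median calculation in Clojure, using the same sample values:

(defn median
  "The median of a collection of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n (count sorted)
        mid (quot n 2)]
    (if (odd? n)
      (nth sorted mid)
      (/ (+ (nth sorted (dec mid)) (nth sorted mid)) 2))))

(median [12 22 15 3 7 94 39])   ; => 15
(median [12 22 15 3 7 94])      ; => 27/2, i.e., 13.5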

Let’s apply this to our two plots.

Figure 2.11: Response time average and median

We see in our first example figure that the median is 3, which provides an even rosier picture of our data.

In the second example the median is 8, a bit better but close enough to the average to render it ineffective.


Figure 2.12: Response time average and median Mk II

You can probably already see that the problem again here is that, like the mean, the median works best when the data is on a bell curve... And in the real world that’s not realistic. Another commonly used technique to identify performance issues is to calculate the standard deviation of a metric from the mean.

Standard deviation

As we learned earlier in the chapter, standard deviation measures the variation or spread in a data set. A standard deviation of 0 means all of the data is equal to the mean. Higher deviations mean the data is more spread out. Standard deviations are represented by positive or negative numbers suffixed with the σ or sigma symbol—for example, 1σ is one standard deviation from the mean.
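As a minimal sketch, assuming we want the population standard deviation of our earlier sample values, the calculation in Clojure might look like:

(def values [12 22 15 3 7 94 39])

(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn std-dev
  "Population standard deviation: the square root of the average squared
  distance of each value from the mean."
  [xs]
  (let [m (mean xs)
        squared-diffs (map #(Math/pow (- % m) 2) xs)]
    (Math/sqrt (/ (reduce + squared-diffs) (count xs)))))

(std-dev values)
; => approximately 29.3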

Version: v1.0.3 (2c4a7d0) 48 Chapter 2: A Monitoring and Measurement Framework

Like the mean and the median, however, standard deviation works best when the data is a normal distribution. Indeed, in a normal distribution there's a simple way of articulating the distribution: the empirical rule. Within the rule, one standard deviation (1σ to -1σ) will represent 68.27% of all transactions on either side of the mean, two standard deviations (2σ to -2σ) would be 95.45%, and three standard deviations will represent 99.73% of all transactions.

Figure 2.13: The empirical rule

Many monitoring approaches take advantage of the empirical rule and trigger on transactions or events that are more than two standard deviations from the mean, potentially catching performance outliers. In instances like our two previous examples, however, the standard deviation isn't overly helpful either. And without a normal distribution of data, the resulting standard deviation can be highly misleading.

Thus far our methods for identifying anomalous data in our metrics haven’t been overly promising. But all is not lost! Our next method, percentiles, will change that.

Percentiles

Percentiles measure the values below which a given percentage of observations in a group of observations fall. Essentially they look at the distribution of values across your data set. For example, the median we looked at above is the 50th percentile (or p50). In the median, 50% of values fall below and 50% above. For metrics, percentiles make a lot of sense because they make the distribution of values easy to grasp. For example, the 99th-percentile value of 10 milliseconds for a transaction is easy to interpret: 99% of transactions were completed in 10 milliseconds or less, and 1% of transactions took more than 10 milliseconds. Percentiles are ideal for identifying outliers. If a great experience on your site is a response time of less than 10 milliseconds then 99% of your users are having a great experience—but 1% of them are not. You can then focus on addressing the performance issue that’s causing a problem for that 1%. Let’s apply this to our previous request and response time graphs and see what appears. We’ll apply two percentiles, the 75th and 99th percentiles, to our first example data set.


Figure 2.14: Response time average, median, and percentiles

We see that the 75th percentile is 5.5 seconds. That indicates that 75% of requests completed in 5.5 seconds or less, and 25% of them were slower. Still pretty much in line with the earlier analysis we've examined for the data set. The 99th percentile, on the other hand, shows 10.74 seconds. This means 99% of users had request times of 10.74 seconds or less, and 1% had request times of more than 10.74 seconds. This gives us a real picture of how our application is performing. We can also use the distribution between p75 and p99. If we're comfortable with 99% of users getting 10.74 second response times or better and 1% being slower then we don't need to consider any further tuning. Alternatively, if we want a uniform response, or if we want to lower that 10.74 seconds across our distribution, we've now identified a pool of transactions we can trace, profile, and improve. As we adjust the performance we'll also be able to see the p99 response time improve.

The second data set is even more clear.


Figure 2.15: Response time average, median, and percentiles Mk II

The 75th percentile is 10 seconds and the 99th percentile is 12 seconds. Here the 99th percentile provides a clear picture of the broader distribution of our transactions. This is a far more accurate reflection of the outlying transactions from our site. We now know that—as opposed to what the mean response times would imply—not all users are enjoying an adequate experience. We can use this data to identify elements of our application we can potentially improve. Percentiles, however, aren’t perfect all the time. Our recommendation is to graph several combinations of metrics to get a clear picture of the data. For example, when measuring latency it’s often best to display a graph that shows:

• The 50th percentile (or median).
• The 95th and 99th percentiles.
• The max value.


The addition of the max value helps visualize the upward bounds of the metric you are measuring. It’s again not perfect though—a high max value can dwarf other values in a graph.
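To make the mechanics concrete, here's a minimal sketch of a nearest-rank percentile calculation in Clojure; the response time values are hypothetical, and other percentile methods (such as interpolation) are equally valid:

(defn percentile
  "Returns the pth percentile of xs using the nearest-rank method."
  [p xs]
  (let [sorted (vec (sort xs))
        rank (long (Math/ceil (* (/ p 100.0) (count sorted))))]
    (nth sorted (dec rank))))

(def response-times [2 3 3 4 4 5 5 6 7 8 9 12])

(percentile 50 response-times)   ; => 5
(percentile 75 response-times)   ; => 7
(percentile 99 response-times)   ; => 12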

We’re going to apply percentiles and other calculations later in the book as we start to build checks and collect metrics.

Collecting data for our monitoring framework

In our framework we're going to focus on agent-based collection of data. We're going to prefer the running of local agents on hosts, and focus on the instrumentation of applications and services. Wherever possible each host or service will be self-contained and responsible for emitting its own monitoring data. We'll locally configure collection and the destination of our data. In keeping with our push-based architecture we'll try to avoid remote checks of hosts and services. With a few exceptions—like external monitoring of hosts and applications—that we'll discuss in Chapter 9, we'll rarely poll hosts, services, and applications from remote pollers or monitoring stations. Our data collection will include a mix of data:

• Resource information, like consumption of CPU or memory
• Performance information, like latency and application throughput
• Business and user-experience metrics, like volumes or the amounts of transactions or numbers of failed logins
• Log data from hosts, services, and applications

We’ll use much of the data and observations we collect directly as metrics. In some cases we’ll also convert observations in the form of events into metrics.


Overhead and the observer effect

One thing to consider when thinking about data collection is that the process of collecting the data can also impact the values being collected. In normal operation many of the methods we use to collect data will consume some of the resources we're monitoring. Sometimes this overhead becomes excessive and can actually influence the state of what you're monitoring, or worse, trigger notifications and outages. This is often called the observer effect, derived from the related physics concept. The methods we're going to use will focus on making that overhead as small as possible, but you should remain conscious of the effect. Granular collections—for example, hitting an HTTP site or probing an API endpoint aggressively—could result in your monitoring check being a measurable percentage of the service's capacity.

Summary

In this chapter we've articulated the framework we're going to build to monitor our environment. We've talked about the push versus pull architecture and the focus on events and metrics. We've also discussed why we've chosen that architecture and what's wrong with some of the monitoring alternatives out there. We then walked through an introduction to some of the monitoring and metrics principles we're going to use throughout the book.

In the next chapter we launch our monitoring framework with the introduction of our event routing engine.

Chapter 3

Managing events and metrics with Riemann

In Chapter 2 we talked about events, metrics, and logs, and how we’re going to use them. In this chapter we’re going to build the base of our monitoring framework: the routing engine we described that will input and process those events, metrics, and logs. Our design for an event router is:

Figure 3.1: Event Routing

With this design we want our routing engine to:


• Receive our events and metrics (we'll talk more about logs in Chapter 8) including scaling as our environment grows.
• Maintain sufficient state to allow us to do event matching; provide context for notifications and for checks based on trending.
• Munge events including extracting metrics from events.
• Categorize and route data to be stored, graphed, alerted on, or sent to any other potential destinations.

To achieve these objectives we’re going to look at a tool called Riemann. Riemann is an event-based tool for monitoring distributed systems. It works in a push model, receiving events rather than polling for them. We’re going to use Riemann as our routing engine. Our hosts, services, and applications will send their events into Riemann, and Riemann will make the necessary decisions about those events.

Introducing Riemann

If only I had the theorems! Then I should find the proofs easily enough.
— Bernhard Riemann

So why Riemann? Riemann is a monitoring tool that aggregates events from hosts and applications and can feed them into a stream processing language to be manipulated, summarized, or actioned. The idea behind Riemann is to make monitoring and measuring events an easy default.

Riemann can also track the state of incoming events and allows us to build checks that take advantage of sequences or combinations of events. It provides notifications, the ability to send events onto other services and into storage, and a variety of other integrations. Overall, Riemann has functionality that addresses all of our objectives. It is fast and highly configurable. Throughput depends on what you do with each event, but stock Riemann on commodity x86 hardware can handle millions of events per second at sub-millisecond latencies.

Version: v1.0.3 (2c4a7d0) 56 Chapter 3: Managing events and metrics with Riemann

Riemann is open source and licensed with the Eclipse Public license. It is primarily authored by Kyle Kingsbury aka Aphyr. Riemann is written in Clojure and runs on top of the JVM.

Riemann architecture and implementation

In the Introduction we talked a little about the topology of our Example.com environment with Production A, Production B, and Mission Control environments. We're going to deploy Riemann servers in each environment. We're also going to deploy downstream servers in the Mission Control environment to allow us to roll up events and “monitor the monitors.” To do this we'll send Riemann's own status events downstream, and we'll set up notifications if these events stop flowing.

In Chapter 4 we're going to pair Riemann with Graphite, a real-time time-series data graphing engine that can store and graph metrics generated from events. We'll also introduce Grafana, a tool to visualize the data we're collecting. In summary we will:

• Install Riemann servers in the Production A and Production B environments.
• Install Graphite and Grafana servers in the Production A and Production B environments (coming up in Chapter 4).
• Configure the downstream Riemann, Graphite, and Grafana servers in our Mission Control environment.


Figure 3.2: Metrics Architecture

Let’s take a look at installing Riemann now.

Installing Riemann

We’re going to install Riemann onto three hosts:

• An Ubuntu 14.04 host in Production A with a hostname of riemanna.example.com and an IP address of 10.0.0.110.
• A Red Hat Enterprise Linux (RHEL) 7.0 host in Production B with a hostname of riemannb.example.com and an IP address of 10.0.0.120.
• An Ubuntu 14.04 host in Mission Control with a hostname of riemannmc.example.com and an IP address of 10.0.0.100.

NOTE Here, and throughout the book, we’re going to assume you’ve configured DNS and local host resolution to find and resolve these hosts.

We’re going to conduct a manual installation of the required prerequisites and packages to give you an understanding of how Riemann works, but in the real world we’d use a configuration management tool.

Installing Riemann on Ubuntu

We’re going to do our first Riemann installation on the riemanna.example.com host which is an Ubuntu 14.04 host.

Prerequisites for Ubuntu

First, we’ll need Java to run Riemann itself. For Java we’re going to use the default OpenJDK available on Ubuntu.

Listing 3.1: Installing Java on Ubuntu

$ sudo apt-get -y install default-jre

Then let’s check Java is installed correctly.


Listing 3.2: Checking Java is installed on Ubuntu

$ java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Installing the Riemann package on Ubuntu

Now we’re going to install Riemann itself. We’re going to use the Riemann project’s own DEB packages. Also available are RPM packages and tarballs. I am going to do a manual install so you can see the steps involved, but you could just as easily use the configuration management content above. Let’s grab the DEB package of the current release. You should check the Riemann site for the latest version and update this command to get that package.

Listing 3.3: Fetching the Riemann DEB package

$ wget https://aphyr.com/riemann/riemann_0.2.11_all.deb

And then install it via the dpkg command.

Listing 3.4: Installing the Riemann package on Ubuntu

$ sudo dpkg -i riemann_0.2.11_all.deb


The Riemann DEB package installs the riemann binary and supporting files, service management, and a default configuration file.

We'd then repeat this installation for the second Ubuntu host, riemannmc.example.com.

Installing Riemann on Red Hat

We’re going to do our second installation on the riemannb.example.com host, which is a Red Hat Enterprise Linux (RHEL) 7 host.

Prerequisites for Red Hat

Again we need to have Java installed. Let’s install the package we require.

Listing 3.5: Installing Java and prerequisites on RHEL

$ sudo yum install -y java-1.7.0-openjdk

TIP On newer Red Hat and family versions the yum command has been replaced with the dnf command. The syntax is otherwise unchanged.

Here we’ve installed Java. Let’s now test Java on this host.


Listing 3.6: Checking Java is installed on Red Hat

$ java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

Installing the Riemann package on Red Hat

Now we’re going to install Riemann itself. We’re going to use the Riemann project’s own RPM packages. Again we’re going to do a manual install but we could easily use configuration management. Let’s grab the RPM package of the current release. Check the Riemann site for the latest version and update this command to get that package.

Listing 3.7: Fetching the Riemann RPM package

$ wget https://aphyr.com/riemann/riemann-0.2.11-1.noarch.rpm

Then install it via the rpm command.

Listing 3.8: Installing the Riemann package on RHEL

$ sudo rpm -Uvh riemann-0.2.11-1.noarch.rpm

The Riemann RPM package installs the riemann binary and supporting files, service management, and a default configuration file.


Installing Riemann via configuration management

You could also install Riemann via a variety of configuration management tools like Puppet or Chef or via Docker or Vagrant.

You can find a Chef cookbook for Riemann at:

• https://github.com/hudl/riemann-cookbook

You can find a Puppet module for Riemann at:

• https://forge.puppetlabs.com/garethr/riemann

You can find an Ansible role for Riemann at:

• https://github.com/dhruvbansal/riemann-server-ansible-role

You can find Docker images for Riemann at:

• https://hub.docker.com/search/?q=riemann

You can find a Vagrant configuration for Riemann at:

• https://github.com/garethr/riemann-vagrant

Running Riemann

Now that we’ve installed Riemann on our hosts, we’ll run it. Riemann can run interactively via the command line or as a daemon. If we’re running it as a daemon we use the service management commands:


Listing 3.9: Starting and stopping Riemann

$ sudo service riemann start
$ sudo service riemann stop
...

We can also run Riemann interactively using the riemann binary. To do this we need to specify a configuration file. Conveniently the installation process has added one at /etc/riemann/riemann.config.

Listing 3.10: Running Riemann interactively

$ sudo riemann /etc/riemann/riemann.config
loading bin
INFO [2014-12-21 18:13:21,841] main - riemann.bin - PID 18754
INFO [2014-12-21 18:13:22,056] clojure-agent-send-off-pool-2 - riemann.transport.websockets - Websockets server 127.0.0.1 5556 online
INFO [2014-12-21 18:13:22,091] clojure-agent-send-off-pool-4 - riemann.transport.tcp - TCP server 127.0.0.1 5555 online
INFO [2014-12-21 18:13:22,099] clojure-agent-send-off-pool-3 - riemann.transport.udp - UDP server 127.0.0.1 5555 16384 online
INFO [2014-12-21 18:13:22,102] main - riemann.core - Hyperspace core online

We see that Riemann has been started and a couple of servers have also been started: a WebSockets server on port 5556, and TCP and UDP servers on port 5555. By default Riemann binds to localhost.


NOTE Don’t use UDP to send events to Riemann. UDP has no guarantee of delivery, ordering, or duplicate protection. You will lose events and data.

The default configuration logs to /var/log/riemann/riemann.log and you can also follow the daemon’s activity there.

You can stop the interactive Riemann server with Ctrl-C on the command line.

NOTE The Riemann packages also add a riemann user and group that Riemann runs as by default.

Installing Riemann’s supporting tools

Lastly, let’s install a final supporting piece on all our Riemann hosts: the Riemann tools.

The Riemann tools are a collection of small programs that can be used to submit events to Riemann. They include tools for monitoring web services, local hosts, applications, and databases. We're going to use these tools for some local testing. You can see the repository for the Riemann tools on GitHub here.

NOTE In Chapter 5 we’ll explore host monitoring using collectd.

To install the tools we need to install Ruby and a compiler on our host. We can remove the compiler afterward if we're concerned it's a security risk on the host. On Ubuntu we would install:

Listing 3.11: Installing supporting tools prerequisites on Ubuntu

$ sudo apt-get -y install ruby ruby-dev build-essential zlib1g-dev

Or on Red Hat distributions we would install:

Listing 3.12: Installing supporting tools prerequisites on RHEL

$ sudo yum install -y ruby ruby-devel gcc libxml2-devel

We’re going to install the supporting tools via Ruby Gems.

Listing 3.13: Installing Riemann’s supporting tools

$ sudo gem install --no-ri --no-rdoc riemann-tools

TIP After installation you can remove the build tools we installed if required. Don’t remove Ruby—you’ll need that to run the supporting tools.

There's also a Riemann dashboard available that is provided by the riemann-dash gem. It's basic and you can find its code on GitHub. We're not going to use it in the book, but it's useful for creating graphs and viewing events locally on the Riemann host—for example, when quickly doing diagnostics or checking out state and status. You can find out more in the Riemann Dashboard documentation.

Configuring Riemann

Riemann is configured using a Clojure-based domain-specific language, or DSL, configuration file. This means your configuration file is actually processed as a Clojure program. To process events and send notifications and metrics you'll be writing Clojure. Don't panic—you won't need to become a full-fledged Clojure developer to use Riemann. We'll teach you what you need to know. Riemann also comes with a lot of helpers and shortcuts that make it easier to write Clojure to do what we need to process our events.

Learning some Clojure

Your first step in learning how to configure Riemann is learning a little bit of Clojure, just enough to get started and build our first few monitoring checks. Later in the book we'll introduce you to further concepts and syntax that you'll find useful. To start learning, please flip over to Appendix A - An Introduction to Clojure and Functional Programming at the back of the book. Alternatively, Kyle Kingsbury, the author of Riemann, has written an excellent series called Clojure from the ground up that should greatly help you understand Clojure.

TIP We strongly recommend you read the appendix before continuing. It’ll help you understand how Riemann’s configuration DSL works and help you get started using it.


Riemann’s base configuration

Now that we've installed Riemann let's look at how to configure it. The package installation installs a default configuration file at /etc/riemann/riemann.config. We're going to replace that file with a new initial configuration. To do this edit the /etc/riemann/riemann.config file and add the following content.

Listing 3.14: New /etc/riemann/riemann.config configuration file

(logging/init {:file "/var/log/riemann/riemann.log"})

(let [host "127.0.0.1"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server {:host host}))

(periodically-expire 5)

(let [index (index)]
  (streams
    (default :ttl 60
      index

      #(info %))))

TIP Any line or string prefixed with a ; is a comment.


We see the file is broken into a few stanzas. The first stanza sets up Riemann's logging to a file: /var/log/riemann/riemann.log.

Listing 3.15: Riemann logging stanza

(logging/init {:file "/var/log/riemann/riemann.log"})

TIP I strongly recommend you manage your Riemann configuration file with a configuration management tool and/or version control. Also useful is if your editor supports syntax highlighting and validation. Clojure uses a lot of braces, brackets, and parentheses and ensuring they are in order and matched can be tricky.

In this case we're calling a function, logging/init. We've specified the namespace of the function, logging, and the name of the function, init, and then any subsequent arguments. Namespaces are a way of organizing code in Clojure and we'll talk more about them later in this chapter. In this case our argument is a map. Our map contains any options we want to pass to our logging/init function.

Inside our map we've specified a single option, :file, with a value of /var/log/riemann/riemann.log. The :file option is a Clojure keyword. A keyword is a label, much like a Ruby symbol, and it's commonly used inside collections like maps to mark the key in a key/value pair.

In summary we’re calling the logging/init function and passing it a map, in this case containing only one option: the name of the file in which to write our logs.

The second stanza controls Riemann's interfaces. Riemann generally listens on TCP, UDP, and a WebSockets interface. By default, the TCP, UDP, and WebSockets interfaces are bound to 127.0.0.1, or localhost.


• TCP is on port 5555.
• UDP is on port 5555.
• WebSockets is on port 5556.

We see that the definition of our interface configuration is inside a stanza starting with let. We're going to see let quite a bit in our configuration. The let expression creates lexically scoped immutable aliases for values. Or, in more simple terms, it defines a meaning for a symbol or symbols within a specific expression. What does this mean? Let's look at our interface configuration.

Listing 3.16: The let form

(let [host "127.0.0.1"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server {:host host}))

The let expression takes a vector of one or more bindings. Bindings are pairs of symbols and values that are bound for that expression. Symbols are pointers to values—for example, the symbol host has a value of 127.0.0.1. The binding is only valid locally to our let expression. This is useful because it allows you to do things like override existing values of symbols inside the current expression. We use these bindings in the subsequent expressions.

In our interface example we're saying: “Let the symbol host be 127.0.0.1 and then call the tcp-server, udp-server, and ws-server functions with that symbol as the value of the :host option.” This sets the host interface of the TCP, UDP, and WebSockets servers to 127.0.0.1.

A let binding is lexically scoped, i.e., limited in scope to the expression itself. Outside of this expression the host symbol would be undefined. The host symbol is also immutable inside the expression in which it is defined. You cannot change the value of host inside this expression. This is an excellent example of why functional programming is useful. Nothing inside the expression can change the value (state) of host, which ensures that every time the expression is evaluated the same result will be achieved. The let expression is useful for configuring Riemann because it is simple, readable, and—because of the clear scope and immutable state—reloads cleanly when we want to change configuration.

For our purposes having Riemann bound to only the localhost isn't overly useful. Let's make a quick change here to bind these servers to all available interfaces.

Listing 3.17: Exposing Riemann on all interfaces

(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server {:host host}))

We've updated the value of the host symbol from 127.0.0.1 to 0.0.0.0. This means if one of your interfaces is on the Internet then your Riemann server is now on the Internet and accessible to everyone. If you need more security you can also configure Riemann with TLS. We could also adjust the ports being used by Riemann by adding the :port argument to our map.


Listing 3.18: Changing the Riemann port

(let [host "0.0.0.0"
      tcp-port 5555]
  (tcp-server {:host host :port tcp-port}))

In addition to the other servers, Riemann also has a built-in REPL server you can use to test your Riemann configuration. You can add it to the servers you've enabled by adding (repl-server {:host host}) to your server configuration. You may want to bind it to the localhost as (repl-server {:host "127.0.0.1"}) to prevent inappropriate access. You can connect to it using the lein binary from inside a checkout of the Riemann source code.

TIP We talk about REPL servers in Appendix A. They are useful for testing with Riemann and Clojure.

Listing 3.19: Connecting to the Riemann REPL server

$ git clone git://github.com/riemann/riemann.git
$ cd riemann
$ lein repl :connect 127.0.0.1:5557

To make these changes we need to reload or restart Riemann. If we're reloading Riemann it will respond to a SIGHUP, sent using kill -HUP or from inside the REPL server.


Listing 3.20: SIGHUP from the Riemann REPL server

user=> (riemann.bin/reload!)

We strongly recommend using SIGHUP to reload the configuration. Riemann has hot loading of configuration and in most cases this will allow your event flow to be less impacted. You'll then see a message about reloading the config in the Riemann log file. You could also use the service management tools on your host.

Listing 3.21: Restarting Riemann

$ sudo service riemann reload

TIP If you make a configuration mistake or a syntax error then Riemann will continue running with the old config and won't apply the new one. It'll also log an error message detailing the issue so you can fix it up.

The next stanza configures indexing and streams. Both of these topics need special attention because they are at the heart of what makes Riemann so powerful. Let’s look at each of these concepts now.

Events, streams, and the index

Riemann is an event processing engine. There are three concepts we need to understand if we’re going to make use of Riemann: events, streams, and the index. Let’s start by looking at events.


Events

The event is the base construct of Riemann. Events flow into Riemann and can be processed, counted, collected, manipulated, or exported to other systems. A Riemann event is a struct that Riemann treats as an immutable map.

Here’s an example of a Riemann event.

Listing 3.22: Example Riemann event

{:host riemanna, :service riemann streams rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 355740372471/250, :ttl 20}

Each event generally contains the following fields.

• host — A hostname, e.g. riemanna.
• service — The service, e.g. riemann streams rate.
• state — A string describing state, e.g. ok, warning, critical.
• time — The time of the event in Unix epoch seconds.
• description — Freeform description of the event.
• tags — Freeform list of tags.
• metric — A number associated with this event, e.g. the number of reqs/sec.
• ttl — A floating-point time in seconds, for which this event is valid.

A Riemann event can also be supplemented with optional custom fields. You can configure additional fields when you create the event or you can add additional fields to the event as it is being processed—for example, you could add a field containing a summary or derived metrics to an event.
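As a hedged illustration (the host, values, and the :environment field are all invented for this example), an event with a custom field might look like:

{:host "www1.example.com"
 :service "apache connections"
 :state "ok"
 :metric 187.0
 :tags ["www" "apache"]
 :ttl 20
 :environment "production a"}   ; an optional custom field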


Inside our Riemann configuration we'll generally refer to an event field using keywords. Remember that keywords are often used to identify the key in a key/value pair in a map and that our event is an immutable map. We identify keywords by their : prefix. So, the host field would be referenced as :host. The next layer above events are streams.

Streams

Each arriving event is added to one or more streams. You define streams in the (streams ...) section of your Riemann configuration. Streams are functions you can pass events to for aggregation, modification, or escalation. Streams can also have child streams that they can pass events to. This allows for filtering or partitioning of the event stream, such as by only selecting events from specific hosts or services.

Listing 3.23: Child streams example

(streams (childstream (childstream)))
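As a hedged sketch of what a child stream can do (using Riemann's where stream, which we'll cover properly later in the book, and a hypothetical host called www), a stream that selects only events from a single host and passes them to a child stream that logs them might look like:

(streams
  ; Select events whose :host field is "www" and pass them to a child stream.
  (where (host "www")
    #(info "event from www:" %)))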

You can think of streams like plumbing in the real world. Events enter the plumbing system, flow through pipes and tunnels, collect in tanks and dams, and are filtered by grates and drains. You can have as many streams as you like and Riemann provides a powerful stream processing language that allows you to select the events relevant to a specific stream. For example, you could select events from a specific host or service that meets some other criteria.

Like your plumbing though, streams are designed for events to flow through them and for limited or no state to be retained. For many purposes, however, we do need to retain some state. To manage this state Riemann has the index.


The Riemann index

The index is a table of the current state of all services being tracked by Riemann. You tell Riemann to specifically index events that you wish to track. Riemann creates a new service for each indexed event by mapping its :host and :service fields. The index then retains the most recent event for that service. You can think about the index as Riemann's worldview and source of truth for state. You can query the index from streams or even from external services.

We saw in our event definition above that each event can contain a TTL or Time-to-Live field. This field measures the amount of time for which an event is valid. Events in the index longer than their TTL are expired and deleted. For each expiration a new event is created for the indexed service with its :state field set to expired. The new event is then injected back into the stream. Let's take a closer look at this. Here's an example event:

Listing 3.24: Example Apache Riemann event

{:host www, :service apache connections, :state nil, :description nil, :metric 100.0, :tags [www], :time 466741572492, :ttl 20}

It’s from a host called www and is for a service called apache connections. It has a TTL of 20 seconds. If we index this event then Riemann will create a service by mapping www and apache connections. If events keep coming into Riemann then the index will track the latest event from this service. If the events stop flowing then sometime after 20 seconds have passed the event will be expired in the index. A new event will be generated for this service with a :state of expired, like so:


Listing 3.25: Example expired Apache Riemann event

{:host www, :service apache connections, :state expired, :description nil, :metric 100.0, :time 466741573456, :ttl 20}

This event will then be injected back into streams where we can make use of it. This behavior is going to be pretty useful to us as we use Riemann for monitoring our applications and services. Instead of polling or checking for failed services, we’ll monitor for services whose events have expired.
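As a hedged sketch (not the configuration we'll build in this chapter), reacting to those expired events in a stream could look something like this, using Riemann's expired stream to select only expired events:

(streams
  ; The expired stream passes only expired events on to its children.
  (expired
    #(info "Expired event detected:" %)))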

Configuring events, streams, and the index

Now that we know a bit more about Riemann let's take another look at the second half of our default configuration which contains our streams and an index configuration.

Listing 3.26: More of our default riemann.config configuration file

(periodically-expire 5)

(let [index (index)]
  (streams
    (default :ttl 60
      index

      #(info %))))

The first function in our configuration, (periodically-expire 5), removes any events that have expired from the index. It's an event reaper that runs every five seconds and acts on any events with expired TTLs by deleting them from the index. For every event reaped a new event is created for that indexed host and service. That event is then put onto the stream with a :state of expired.

By default, Riemann copies the :host and :service fields to the expired event. You can control what other fields from events are also copied onto expired events by passing the :keep-keys option to the periodically-expire function. For example, we'd like to add the :tags field to expired events.

Listing 3.27: Copying more keys into expired events

(periodically-expire 5 {:keep-keys [:host :service :tags]})

This will copy the :host, :service, and :tags fields from the event being expired into the new event being injected into the stream.

The let expression that follows our (periodically-expire) function defines a new symbol called index. This symbol has a value of index, which is the function that sends events to Riemann's index. We're going to use this symbol to tell Riemann when to index specific events. The let expression also wraps our streams.

Next inside our let expression (note the bracket is not closed yet) we've specified that what follows are streams. We've done this using the streams function. Each stream is a Clojure function that takes an event. The streams function means “here is a list of functions that you should call when new events arrive”.

The first thing we’ve done inside our streams is to set a default TTL for ourevents of 60 seconds. We’ve done this using the default function. The default function takes a field from an event and allows you to specify a default value for thatfield.


Listing 3.28: Using the Riemann default function

(default :field default_value)

This TTL will determine how long an event will be valid within the index. In this case, any event that does not already have a TTL is given one of 60 seconds and will be expired after that time. Next the configuration calls our index symbol. This means all incoming events will be automatically added to Riemann's index. The last item in our configuration prints any events to our log file.

Listing 3.29: Logging to the Riemann log file

#(info %)

The info function writes our event and some logging data to the /var/log/riemann/riemann.log log file and to STDOUT. You'll see events like the following in the log file when Riemann is running:

Listing 3.30: A Riemann log event

INFO [2015-03-22 21:40:37,287] Thread-5 - riemann.config - #riemann.codec.Event{:host riemanna, :service riemann streams rate, :state nil, :description nil, :metric 7.739079374131467, :tags [riemann], :time 1427060437213/1000, :ttl 20}

Also available is the #(warn %) function that emits events with a level of WARN rather than INFO.


You can adjust this logging output to log additional information, such as by adding a prefix for debugging.

Listing 3.31: Adding a prefix to Riemann logs entries

#(info "prefix" %)

This will prefix any log entries with the word prefix. You can also limit the log output to specific fields in an event, for example:

Listing 3.32: Limiting Riemann log entries

#(info (:host %) (:service %))

This will only send the contents of the :host and :service fields to the log file.

Listing 3.33: A filtered Riemann log event

INFO [2015-03-22 21:55:35,172] Thread-6 - riemann.config - riemanna riemann streams rate

If you just want to print to STDOUT because, for example, you’re running Riemann interactively to test something, you can use the prn function.


Listing 3.34: The Riemann prn function

; Print event to stdout
prn

; Print "Output: ", then the event
#(prn "Output: " %)

Now we need to reload Riemann to enable our new configuration.

Listing 3.35: Reloading Riemann to enable our new configuration

$ sudo service riemann reload

You should start to see events in your /var/log/riemann/riemann.log file. These events are Riemann’s own internal status reporting.


Listing 3.36: Riemann internal events

INFO [2015-02-03 06:04:50,031] Thread-7 - riemann.config - #riemann.codec.Event{:host riemanna, :service riemann streams rate, :state nil, :description nil, :metric 0.0, :tags [riemann], :time 355740372471/250, :ttl 20}
INFO [2015-02-03 06:04:50,034] Thread-7 - riemann.config - #riemann.codec.Event{:host riemanna, :service riemann streams latency 0.0, :state nil, :description nil, :metric nil, :tags [riemann], :time 355740372471/250, :ttl 20}
INFO [2015-02-03 06:04:50,035] Thread-7 - riemann.config - #riemann.codec.Event{:host riemanna, :service riemann streams latency 0.5, :state nil, :description nil, :metric nil, :tags [riemann], :time 355740372471/250, :ttl 20}
...

They include the rates and latency of your streams and TCP, UDP, and WebSockets servers. We can use this data to report on the state of Riemann.
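As a small, hedged sketch of what that might look like (using the where stream we'll meet a little later in this chapter), you could single out Riemann's own throughput metric and log it:

; A sketch only: select Riemann's own "riemann streams rate" metric and log it.
(where (service "riemann streams rate")
  #(info "riemann throughput" %))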

Sending our first event to Riemann

Let’s test that Riemann is receiving events from external sources too. You can send data to Riemann in a number of ways, including via its own set of tools and a variety of client native language bindings.

NOTE You can find a full list of the clients on the Riemann website.


The set of tools is written in Ruby and available via the riemann-tools gem we installed earlier. Each tool ships as a separate binary, and you can see a list of the available tools in the riemann-tools repository on GitHub. They include basic health checks, web services like Apache and Nginx, cloud services like AWS, and more.

TIP We can also query and send events to Riemann using the Riemann C client. You can install it on Ubuntu and Red Hat via the riemann-c-client package and use it via the riemann-client binary.

The easiest of these tools to test with is riemann-health. It sends CPU, memory, and load statistics to Riemann. Open up a new terminal and launch it now on a Riemann server.

Listing 3.37: The riemann-health command

$ riemann-health

You can either run it locally on the same host where you’re running Riemann, or you can run it on a remote server and point it at a Riemann server using the --host flag.

Listing 3.38: The riemann-health --host option

$ riemann-health --host riemanna.example.com

The riemann-health command will now start emitting events into Riemann's TCP server on port 5555, or you can specify an alternate port with the --port flag. In our current configuration we're sending all incoming events into Riemann's log file via the #(info %) function. Let's look at our incoming data in the Riemann log file: /var/log/riemann/riemann.log.

Listing 3.39: Our incoming Riemann data

$ tail -f /var/log/riemann/riemann.log
INFO [2015-12-23 17:23:47,050] pool-1-thread-16 - riemann.config - #riemann.codec.Event{:host riemanna.example.com, :service disk /, :state ok, :description 11% used, :metric 0.11, :tags nil, :time 1419373427, :ttl 60.0}
INFO [2015-12-23 17:23:47,055] pool-1-thread-18 - riemann.config - #riemann.codec.Event{:host riemanna.example.com, :service load, :state ok, :description 1-minute load average/core is 0.11, :metric 0.11, :tags nil, :time 1419373427, :ttl 60.0}
...

Here we see two events, one for disk space and another for load. Let’s look at the event itself a bit more closely.

Listing 3.40: A Riemann-health disk event

{:host riemanna.example.com, :service disk /, :state ok, :description 11% used, :metric 0.11, :tags nil, :time 1419373427, :ttl 10.0}

Here we have an event from the host riemanna.example.com from the service disk /. This measures the disk space used on the root or / filesystem. It has a state of ok, a description that tells us the percentage of disk space consumed, and an associated metric with that percentage as a float, as well as the time the event was recorded. Lastly, the event has a Time-to-Live or TTL that controls for how long the event is valid. Now we know our Riemann server is working and can receive events.

NOTE We’re going to collect and monitor a lot more of these host-level events and metrics in Chapters 5 and 6.

Creating our first Riemann monitoring check

Now we’ll get to the core of why we’re here. We’re going to build a monitoring check using one of our riemann-health events. Let’s open up our /etc/riemann/ riemann.config configuration file and add our first check.

Listing 3.41: Our first monitoring check

(let [index (index)]
  (streams
    (default :ttl 60
      index

      ;#(info %)

      (where (and (service "disk /")
                  (> metric 0.10))
        #(info "Disk space on / is over 10%!" %)))))

Here we’ve added some new Clojure code for our first check. You’ll note we’ve

Version: v1.0.3 (2c4a7d0) 85 Chapter 3: Managing events and metrics with Riemann used a ; to comment out the #(info %) function. This stops Riemann from emit- ting every event to the log file. Next we’ve specified a new stream called where. The where stream selects events based on criteria then passes them to a child stream where you can do something else with the event. In our where stream we are matching on two criteria, combined with a Boolean and statement:

• The :service field of the event is disk /.
• The :metric field of the event is greater than 0.10 or 10%.

If these two criteria match then the matched event is sent to the #(info %) function, so it can then be sent to the log file. We've also added a prefix to our log message detailing why the event has been matched and outputted. Let's look at what a matched and outputted event would look like in our /var/log/riemann/riemann.log log file.

Listing 3.42: Our prefixed warning event

Disk space on / is over 10%! #riemann.codec.Event{:host riemanna, :service disk /, :state ok, :description 24% used, :metric 0.24, :tags nil, :time 1449184188, :ttl 60.0}

We’ve stripped off some initial boilerplate from the event but you can see thatour event has disk space usage of 24% and hence been filtered by our where stream. It’s been prefixed with our helpful explanatory message and then printed tothe log file. Congratulations—you’ve just built a Riemann monitoring check! Yes, this check is basic, doesn’t go anywhere terribly useful, and isn’t telling us anything critical about our environment. But it starts to show us what we can do with Riemann. We’ll see many more complex checks in later chapters as we build out our moni- toring environment. Now let’s explore some more ways we can filter events.


An interlude into Riemann filtering

Before we go on, and because filtering is going to be key to how we manage events inside Riemann, let’s look at some more examples of how to use filtering streams.

In our first example we’re going to match events using regular expressions. It’sa common use case and a good starting point.

Listing 3.43: Using the where stream with a regular expression

(where (service #"^nginx"))

In Clojure, regular expressions are prefixed with # and wrapped in double quotes. Here the where stream matches all events where the :service field starts with nginx. We can also use Boolean operators to match events.

Listing 3.44: Using the where stream with booleans

(where (and (tagged "www") (state "ok")))

In this case our where stream matches events that are tagged with www and have a :state of ok. This example also combines the where stream and a new stream called tagged.

The tagged stream selects all events where the :tags field contains www. The tagged stream is shorthand for the tagged-all function which matches all tags specified. There’s another function called tagged-any that allows you to match any one of a series of tags.


Listing 3.45: The tagged-any stream

(tagged-any ["www" "app1"] #(info %))

Here we’ve used the tagged-any stream to match any event which has either the www or the app1 tag, and then send the event to be logged.

We can also combine tags, booleans, and regular expressions to do complex matching.

Listing 3.46: Using the where stream for complex matches

(where (and (tagged "www") (state "ok") (service #"^apache*")))

Here we’ve matched events that are tagged with www, have a state of ok, and are from services starting with apache.

Up until now we’ve matched events using “standard” fields like :service and : state. You’ll note we’ve referred to these fields by their names, minus the :. This is some useful syntactic sugar provided by the where filtering stream. Any refer- ences to the “standard” fields like :service, :host, :tags, :metric, :description, :time, and :ttl can use these name shortcuts. If, however, you want to refer to an optional field then you need to refer to it like so:

Listing 3.47: Referring to an optional example field in Riemann

(:field_name event)


We prefix the field name with a colon and tell Riemann it belongs to event, which is Riemann’s shorthand for the event being processed. For example, with the field named type it would look like:

Listing 3.48: Referring to an optional field in Riemann

(:type event)

Let’s see this in action.

Listing 3.49: The optional type field in Riemann

(where (and (tagged "www") (= (:type event) "load")))

Here the where stream matches all events that are tagged with www and where the value of the :type field is load. You can see that the match on the :type field uses a combination of an operator, the name of the field, and a value, constructed like so:

Listing 3.50: Referring to an optional field in Riemann

(operator (:field_name event) value)

Using this operator-field-value syntax we can also do math operations to match events.


Listing 3.51: Using the where stream with math

(where (and (tagged "www") (>= (* metric 10) 5)))

Here we’ve matched all events with the www tag and those in which the value of the :metric field multiplied by 10 is greater than or equal to 5. You can use the normal collection of operators: greater than, equal to, less than, and so on in these statements.

We also use a similar syntax to do a range query.

Listing 3.52: Using the where stream for a range query

(where (and (tagged "www") (< 5 metric 10)))

Here we’ve matched all events tagged with www and those where the :metric field has a value between 5 and 10.

TIP You can learn more about event filtering on the Riemann website.

Connecting Riemann servers

Now that we know our individual Riemann servers are working, let's hook them together. In our architecture we have two upstream Riemann servers: riemanna.example.com and riemannb.example.com. We also have a downstream Riemann server riemannmc.example.com. We're assuming you've used the steps above to install Riemann onto each of these hosts.

Figure 3.3: Connecting Riemann servers

In our architecture we want to send escalations and status updates for our production environments downstream to our Mission Control environment, riemannmc.example.com. But initially we're just going to send information about Riemann itself. This will allow us to detect if our upstream Riemann servers are operational. We're going to connect the upstream servers to the downstream server and then test our connection.


Configuring the upstream Riemann servers

First, we need to define our downstream Riemann server to our upstream servers. We do this by first specifying a new stream called async-queue!. The async-queue! stream creates an asynchronous thread pool queue. That queue accepts events and passes those events to child streams via that thread pool asynchronously.

So why pass the events asynchronously? Riemann is fast, but any time a stream connects to an external service there is a risk that your processing will be blocked waiting for that external service. With an asynchronous queue we tell Riemann to queue events and return to the mainline without blocking. In this case we're then going to use Riemann's own TCP client as one of those asynchronous streams to send events to our downstream server.

WARNING Asynchronous streams look like an easy fix for connecting to potentially blocking services but they come with some caveats. You should carefully read the documentation before using them wildly.

Now let’s look at an updated configuration file.


Listing 3.53: Updated Riemann configuration

(logging/init {:file "/var/log/riemann/riemann.log"})

(require 'riemann.client)

...

(let [index (index)
      downstream (batch 100 1/10
                    (async-queue! :agg {:queue-size 1e3
                                        :core-pool-size 4
                                        :max-pool-size 32}
                      (forward
                        (riemann.client/tcp-client :host "riemannmc"))))]

  (streams
    (default :ttl 60
      index

      #(info %)

      (where (service #"^riemann.*")
        downstream))))

You can see we’ve added some new configuration that adds the Riemann client. The require function is much like Ruby’s require method. It loads additional code we might need to do certain actions.


Listing 3.54: Requiring the Riemann client

(require 'riemann.client)

Here our require loads the Riemann TCP client. We’re going to use this client to send events downstream.

NOTE We’re going to talk about require in a lot more detail a little later in this chapter.

Next, we’ve added another binding to our let expression (remember, a binding is a symbol-value pair). This configures the destination and process for sending events.

Listing 3.55: Added downstream binding to Riemann

(let [index (index)
      downstream (batch 100 1/10
                    (async-queue! :agg {:queue-size 1e3
                                        :core-pool-size 4
                                        :max-pool-size 32}
                      (forward
                        (riemann.client/tcp-client :host "riemannmc"))))]

... )


Our binding defines a symbol called downstream. The value of this symbol is a series of streams. Our first stream is called batch. This batches up events to send. Each batch is sent when 100 events have accumulated or 1/10th of a second has passed. The batch stream passes the events into our async-queue!, which we've called :agg (for aggregation). We've defined a queue-size for our async-queue! in exponential notation, 1e3 or 1000. We set core-pool-size (the number of threads in the pool) to 4 and max-pool-size (the maximum number of threads in the pool) to 32. This should generally work for most scenarios.

NOTE You can read a bit more about Java-based ThreadPooling here. This provides some useful information about the interaction of queue and pool sizing.

Our queue takes the incoming event batches and passes them to another child stream, forward. The forward stream sends events through a Riemann client— here the TCP client—to our riemannmc host.

Listing 3.56: Riemann client forwarding configuration

(riemann.client/tcp-client :host "riemannmc")

NOTE This assumes you’ve configured DNS, added the various Riemann servers to /etc/hosts, or provided some other way for Riemann to resolve the riemannmc hostname.


This is a complex configuration so let's walk through the whole process to make sure it's clear.

1. We define the downstream symbol. When that symbol is referenced events are passed into it.

2. Events first go into the batch stream. Every 100 events or 1/10th of a second, events are batched and sent on.

3. The batched events are passed to the async-queue! stream.

4. The async-queue! stream passes the events to the forward stream which sends them to the riemannmc server.

Lastly, we’ve added a where stream to select the events we want to send.

Listing 3.57: The where filtering stream for forwards

(where (service #"^riemann.*") downstream)

As we discovered earlier, the where filtering stream selects events based on specific criteria—for example, from a particular host or service—via a regular expression or from the result of executing a function of some kind. Here our where stream selects any events that match the regular expression ^riemann.*. This is any event whose :service field starts with riemann.. These events are then passed to the downstream symbol which forwards them to the riemannmc server. We then add the configuration we created to both the riemanna and riemannb servers.


Configuring the downstream Riemann server

Now let’s look at the configuration on our downstream riemannmc server.

Listing 3.58: The downstream riemannmc server

...

(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      #(info %))))

You can see it’s basically the Riemann configuration we’ve just created except that it lacks our downstream sending configuration.

NOTE We’ve included all example configuration and code in the book on GitHub.

Enabling the send of our Riemann events downstream

We now restart Riemann on all our servers to enable the sending of our events downstream.


Listing 3.59: Restarting Riemann to enable forwarding

riemanna$ sudo service riemann restart
riemannb$ sudo service riemann restart
riemannmc$ sudo service riemann restart

We should now see some new events for our queue in the /var/log/riemann/riemann.log log file on our upstream servers.

Listing 3.60: Riemann agg events on riemanna or riemannb

INFO [2015-02-03 15:29:10,911] Thread-7 - riemann.config - #riemann.codec.Event{:host riemanna, :service riemann executor agg accepted rate, :state ok, :description nil, :metric 250/2507, :tags nil, :time 711497675449/500, :ttl 20}
INFO [2015-02-03 15:29:10,911] Thread-7 - riemann.config - #riemann.codec.Event{:host riemanna, :service riemann executor agg completed rate, :state ok, :description nil, :metric 250/2507, :tags nil, :time 711497675449/500, :ttl 20}
INFO [2015-02-03 15:29:10,911] Thread-7 - riemann.config - #riemann.codec.Event{:host riemanna, :service riemann executor agg rejected rate, :state ok, :description nil, :metric 0N, :tags nil, :time 711497675449/500, :ttl 20}
...

These events allow us to track the state, performance, and status of our forwarding.
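As a hedged example of using them, a check like the following on the upstream servers would warn if the :agg queue ever starts rejecting events, which would mean forwards to riemannmc are being dropped:

; A sketch only: warn when the :agg async queue rejects events.
(where (and (service "riemann executor agg rejected rate")
            (> metric 0))
  #(warn "the :agg forwarding queue is rejecting events" %))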

We’ll also start to see events on our downstream server: riemannmc.


You’ll note we don’t have to change anything on the riemannmc server. It’s already set up to receive events. If we look in the /var/log/riemann/riemann.log log file we’ll find events from riemanna and riemannb as well as the local events from riemannmc.

Listing 3.61: Combined events from upstream and downstream

INFO [2015-02-03 08:35:58,507] Thread-6 - riemann.config - #riemann.codec.Event{:host riemannMC, :service riemann server ws 0.0.0.0:5556 in latency 0.999, :state ok, :description nil, :metric nil, :tags nil, :time 1422970558489/1000, :ttl 20}
...
INFO [2015-02-03 08:36:01,495] defaultEventExecutorGroup-2-1 - riemann.config - #riemann.codec.Event{:host riemannb.lovedthanlost.net, :service riemann streams rate, :state nil, :description nil, :metric 3.9884721385215447, :tags [riemann], :time 1422970561, :ttl 20.0}
...
INFO [2015-02-03 08:36:14,314] defaultEventExecutorGroup-2-1 - riemann.config - #riemann.codec.Event{:host riemanna, :service riemann streams latency 0.5, :state nil, :description nil, :metric 0.222681, :tags [riemann], :time 1422970574, :ttl 20.0}
...

Here we see events from all three servers indicating that they are successfully connected.

NOTE We’ve included this example configuration with all the required files in the book’s code on GitHub.


Alerting on the upstream Riemann servers

So now the downstream riemannmc Riemann server knows about the upstream riemanna and riemannb servers. But we also want to know when something goes wrong with those upstream servers. To do that we’re going to take advantage of Riemann’s index. Remember the index is a table of the current state of all services being tracked by Riemann. Each event you tell Riemann to index is added as a service mapped by its :host and :service fields. The index retains the most recent event for that service. Each indexed event has a Time-to-Live or TTL. The TTL can be set by the event’s :ttl field, or if no TTL is present then a default can be specified. Our default TTL is 60 seconds set using the default function in our Riemann configuration.

If a service fails it stops submitting events to Riemann. In our Riemann configuration we have a periodically-expire function that runs every 5 seconds. This is the event reaper for the index. It checks the TTL of events in the index and expires and deletes events from the index if their TTL is expired. When an indexed event is expired a new event is created for the indexed service with a :state of expired and sent back to the stream.

We then monitor the stream for events with a :state of expired, which means the service isn't reporting, and notify on those. Let's create some configuration on the riemannmc server to catch expired events from the riemanna and riemannb servers. This means that if these servers stop sending events, we'll get notified. Take a look at our riemannmc Riemann configuration now. Our /etc/riemann/riemann.config file looks like:


Listing 3.62: The downstream riemannmc server configuration

(logging/init {:file "/var/log/riemann/riemann.log"})

(let [host "0.0.0.0"]
  (repl-server {:host "127.0.0.1"})
  (tcp-server  {:host host})
  (udp-server  {:host host})
  (ws-server   {:host host}))

(periodically-expire 10 {:keep-keys [:host :service :tags :state :description :metric]})

(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      #(info %))))

It’s similar to our upstream Riemann servers, minus the configuration that sends events downstream. We’re going to add some configuration to:

1. Identify expired events.
2. Select only the Riemann-specific events.
3. Email a notification on these Riemann-specific expired events.


Let’s look at this additional configuration. First we’re going to configure anoti- fication mechanism. We’re going to start with something simple: email. Ifwe detect an expired Riemann event, we’ll send an email notification.

To do this we need to configure the mailer plugin. The mailer plugin allows us to send email from Riemann. We're going to configure the mailer plugin in a namespace. Namespaces are a way of organizing code and functions. You can consider namespaces a more advanced method of organizing and including streams into our Riemann configuration. Riemann uses Clojure's built-in namespacing to do this. In Clojure it's generally recommended you use a carefully defined namespace that won't overlap with anything else. In most cases this is a format like:

Listing 3.63: Clojure namespace format

[organization].[library|app].[group-of-functions]

In our case we’re going to use examplecom as our organization. You might use mycorpname or a department name or something similar. We’re then going to call our library etc because that’s broadly its function: useful functions we’re going to regularly use in our Riemann configuration. Finally, we’re going to call the group of function: email. Literally examplecom.etc.email or Example.com Etc Email.

Our namespace also shapes where we store our code on our Riemann server. Riemann expects to find the code in a directory structure matching the namespace, underneath the /etc/riemann directory. So let's start by creating that directory structure.


Listing 3.64: Creating the examplecom.etc namespace path

$ sudo mkdir -p /etc/riemann/examplecom/etc

Let’s then create a file to hold our Riemann code and functions.

Listing 3.65: Creating the email.clj file

$ sudo touch /etc/riemann/examplecom/etc/email.clj

We can see our directory structure and file follow the namespace: examplecom/etc/email

Our Riemann code and functions will go inside the email.clj file.

NOTE .clj is the extension for a Clojure code file.

Let’s look at the code in this file to get started.

Listing 3.66: Requiring the Riemann functions

(ns examplecom.etc.email
  (:require [riemann.email :refer :all]))

(def email (mailer {:from "[email protected]"}))


Firstly, we declare a namespace using the ns function. The ns function creates a new namespace. We've called our namespace examplecom.etc.email. This name is how we'll reference our namespace in our Riemann configuration.

TIP Namespaces are the standard Clojure way of organizing code, libraries, and functions. You can read more about namespacing in the Riemann HOWTO.

Next we’ve specified an argument, :require, to be passed into our namespace. The :require statement is closely related to the require function we used earlier to include Riemann’s TCP client. It performs the same function inside a namespace. The :require argument here includes the riemann.email library of functions. The :require ensures that the namespace to be required exists, is ready to be used and evaluates the corresponding namespace. By evaluating the namespace, it also defines and makes available any functions in that namespace. Inside our examplecom.etc.email namespace we can now refer to functions inside the riemann.email namespace, like so: riemann.email/mailer

Referring to functions with the fully qualified namespace is a little unwieldy though. We've added an argument to our :require directive called :refer. The :refer option means you don't have to use fully qualified names to reference the functions inside a namespace. You can refer one, many, or all functions in a namespace. Here we've used the :all option to refer all functions inside riemann.email.

Alternatively, if you only want to use one or more functions from that namespace you can specify only those.


Listing 3.67: Referring functions

(ns examplecom.etc.email
  (:require [riemann.email :refer [mailer]]))

...

This would only refer the mailer function from the riemann.email namespace.

We can now reference a function inside our namespace without needing to prefix it with riemann.email.
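As a quick, hedged illustration, both of the following calls resolve to the same function now; the unqualified form is the one we'll use:

; Fully qualified reference, available once riemann.email is required.
(riemann.email/mailer {:from "[email protected]"})

; Unqualified reference, possible because of :refer :all.
(mailer {:from "[email protected]"})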

NOTE Be careful with this. If you define a symbol with the same name inside two namespaces and refer both of them then you will get a conflict and Riemann will fail to start. You cannot have examplecom.etc.email/foo and examplecom.etc.fish/foo defined and referred.

Now let’s look at our mailer configuration in this file.

Listing 3.68: Configuring email notifications in Riemann

(def email (mailer {:from "[email protected]"}))

You can see we’ve used the def statement. As we learned in Appendix A, the def statement declares a symbol and a var.

In our case we’re giving the email symbol a value of mailer. This tells Rie- mann that anywhere we specify email it should call the mailer function from the

Version: v1.0.3 (2c4a7d0) 105 Chapter 3: Managing events and metrics with Riemann riemann.email namespace. As we referred to :all functions in this namespace we don’t have to prefix mailer with riemann.email.

The mailer function sends email. We’ve also specified one option for the mailer function: :from, which controls the source email address for any emails from Riemann.

The mailer function uses a standard Clojure email library called Postal under the covers to send email. By default it uses the local sendmail binary but it can also be configured to use an SMTP server.

TIP A quick way to add the sendmail binary and set up local mail sending is to install the mailutils package on Ubuntu, or the mailx package on Red Hat and related distributions.

Listing 3.69: Configuring an SMTP server for Postal

(def email (mailer {:host "smtp.example.com"
                    :user "james"
                    :pass "password"
                    :from "[email protected]"}))

TIP We should add this email.clj file and its configuration to all our Riemann hosts. It’s going to be useful in future chapters to send notifications.

Now we’ve got a way to notify on our event, let’s add that capability to our Rie-

Version: v1.0.3 (2c4a7d0) 106 Chapter 3: Managing events and metrics with Riemann mann configuration. To use our new email function we need to tell Riemann about it in our riemann.config file. To do this we again use the require function but slightly differently this time.

Listing 3.70: Adding our email function to Riemann

(logging/init {:file "/var/log/riemann/riemann.log"})

(require 'riemann.client) (require '[examplecom.etc.email :refer :all])

...

We’ve added a new require to our riemann.config file below our riemann.client . It includes the examplecom.etc.email namespace and uses the :refer :all directive to refer all functions in that namespace.

On startup, Riemann automatically searches for our namespace underneath the /etc/riemann directory and evaluates the namespace and the functions in it. In our case this will cause Riemann to search for a directory:

/etc/riemann/examplecom/etc/

And then load the file at:

/etc/riemann/examplecom/etc/email.clj

The :refer then allows us to reference the email function inside our Riemann configuration, without needing to prefix it with the examplecom.etc.email namespace.

We can now specify a stream to grab the relevant events.


Listing 3.71: Our expired Riemann event filter stream

(expired
  (where (service #"^riemann.*")
    (email "[email protected]")))

Here we’ve used the expired filtering stream. The expired stream matches events expired from the index. You can think about it like a where stream with the match on the :state field of expired preconfigured. We’ve used a where stream to do a regular expression match for any services that start with riemann.. Any events matched by the where stream will be passed to the email var and mailed via the mailer function to [email protected]. To activate this new email function we now need to reload or restart Riemann. This will load, evaluate and refer the examplecom.etc.email namespace. So if we were to stop Riemann on the riemanna host now, after the TTL had expired, then the Riemann index on the riemannmc host would generate some events like this:

Listing 3.72: The riemanna expired streams event

{:ttl 60, :time 713456271631/500, :state expired, :service riemann streams latency 0.999, :host riemanna, :tags [riemann]}

We see the event has a :state of expired and contains the :service, :host, and :tags fields, all of which were copied to the new expired event when the event reaper ran over the index. The event also contains the default TTL of 60 seconds, and the time the event was generated.


This matched event would then be passed to the email var, which would trigger an email to be sent. That email will look something like this:

Figure 3.4: Email notification

It’s great we’re going to be getting emails when the Riemann server fails... But there’s a problem. We’ll get an email notification for every Riemann metric that is being collected—yes, one for each event. This could mean a lot of emails letting us know that one of our Riemann servers is down.

Throttling Riemann events

To prevent this flood of emails we’re going to add a throttle to our notifications. Let’s add some throttling to our configuration.

Listing 3.73: Our throttled expired Riemann event filter

(expired
  (where (service #"^riemann.*")
    (throttle 1 600
      (email "[email protected]"))))

You’ll see we’ve added a new stream called throttle. The throttle allows a number of events through and then ignores any others for a period of time.


Listing 3.74: The throttle stream

(throttle 1 600

This throttle works by allowing one event through and then dropping and ignoring all other events for 600 seconds or 10 minutes. It’s a pretty crude mechanism but works for services where we only care about a limited number of notifications.
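Both arguments are tunable. As a hedged sketch, this variant would instead allow three notifications through and then drop the rest for an hour:

; A sketch only: allow three events through, then drop everything for 3,600 seconds.
(throttle 3 3600
  (email "[email protected]"))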

Rolling up Riemann events

An alternative to throttle is the rollup stream. The rollup stream will allow a few events to pass through, then it will collect and hold the others. It holds those events for a period of time and then sends a summary of them.

Listing 3.75: A rollup of our expired Riemann events

(expired
  (where (service #"^riemann.*")
    (rollup 5 3600
      (email "[email protected]"))))

The rollup stream here will only send five emails per 3,600 seconds (one hour). You’ll get four emails immediately and then, after an hour, you’ll get a final email with a summary of any other events received during that hour period.

WARNING The rollup stream accumulates events in memory. The more events there are, the more memory is consumed. You should be careful to ensure it doesn't exhaust memory on the host holding the rolled-up events.

Alternatives to email notifications

Obviously email is not always an ideal notification mechanism—we all already have folders full of emails—so Riemann provides a wide variety of other mechanisms including PagerDuty and a variety of chat applications like Slack. We'll talk a lot more about these mechanisms and about notifications in Chapter 10.
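As a hedged taste of what one of those looks like (the account, token, and channel below are placeholders, and we'll configure this properly in Chapter 10), Riemann ships with a riemann.slack integration that can be wired up much like our email function:

; A sketch only: a Slack notifier with placeholder credentials.
(ns examplecom.etc.slack
  (:require [riemann.slack :refer [slack]]))

(def credentials {:account "examplecom" :token "your-slack-token"})

(def slacker (slack credentials {:username "riemann"
                                 :channel "#monitoring"}))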

NOTE We’ve included the riemannmc configuration in the book’s code here.

Testing your Riemann configuration

One of the advantages of Riemann’s configuration being a Clojure program is that you can write tests to confirm that your configuration is working correctly.

To test Riemann events, we tap into points where we'd like to observe events and then write tests that confirm the right behavior is occurring and the right events are being seen or generated. In production Riemann will ignore these taps so you don't experience a performance hit from adding them.

Let's look at an example. In our current riemannmc configuration we accept incoming events and index them immediately.


Listing 3.76: Revisiting our riemannmc configuration

(logging/init {:file "/var/log/riemann/riemann.log"})

(let [host "0.0.0.0"]
  (repl-server {:host "127.0.0.1"})
  (tcp-server  {:host host})
  (udp-server  {:host host})
  (ws-server   {:host host}))

(periodically-expire 10 {:keep-keys [:host :service :tags :state :description :metric]})

(let [index (index)]

  (streams
    (default :ttl 60
      ; Index all events immediately.
      index)))

So our test is going to confirm that events coming into the stream are being indexed. To support this test we need to wrap our index stream in a riemann.test/tap function. This will allow us to watch the events being indexed.


Listing 3.77: Adding a tap to our riemannmc index

(logging/init {:file "/var/log/riemann/riemann.log"})

(let [host "0.0.0.0"]
  (repl-server {:host "127.0.0.1"})
  (tcp-server  {:host host})
  (udp-server  {:host host})
  (ws-server   {:host host}))

(periodically-expire 10 {:keep-keys [:host :service :tags :state :description :metric]})

(let [index (tap :index (index))]

  (streams
    (default :ttl 60
      ; Index all events immediately.
      index)))

You see that we’ve added a tap around our index. We’ve called that tap :index.

We then create one or more tests that sample incoming events to the index. We do this by defining a tests stream and using the standard Clojure test function: deftest.


Listing 3.78: Adding tests to our riemannmc configuration

(logging/init {:file "/var/log/riemann/riemann.log"})

(let [host "0.0.0.0"]
  (repl-server {:host "127.0.0.1"})
  (tcp-server  {:host host})
  (udp-server  {:host host})
  (ws-server   {:host host}))

(periodically-expire 10 {:keep-keys [:host :service :tags :state :description :metric]})

(let [index (tap :index (index))]

  (streams
    (default :ttl 60
      ; Index all events immediately.
      index)))

(tests
  (deftest index-test
    (is (= (inject! [{:service "test"
                      :time 1}])
           {:index [{:service "test"
                     :time 1
                     :ttl 60}]}))))

We see a new set of tests streams in our configuration now. We've defined one test, index-test. The test is a simple process:

• Inject an event into the stream.
• Monitor the :index tap and see if that event arrives.

The inject! function injects an event into the stream. Here we're injecting a single event with a :service field set to test and a :time field of 1. We're then asking if the :index tap sees a matching event. We've added the :ttl field of 60 because the default function sets that when we index an event. Without it our test will fail because the event we're watching for will not match the event we've injected.

We then run our tests using the riemann binary. Riemann must be stopped when we run the tests because our whole configuration, including components like interface bindings, is run when the tests are run.

Listing 3.79: Running the Riemann tests

$ sudo riemann test riemann.config
INFO [2015-07-15 17:40:01,236] main - riemann.repl - REPL server {:port 5557, :host 127.0.0.1} online

Testing riemann.config-test

Ran 1 tests containing 1 assertions.
0 failures, 0 errors.

We see that our single test, containing a single assertion, has been run and has passed successfully. This means we’ve injected an event into the stream, watched the tap, and then seen the corresponding event appear in the tap.


TIP You can read more about the testing of your configuration in the Riemann documentation.

Validating Riemann configuration

Riemann configuration looks pretty scary at first. But to make it easier to build your Riemann configuration there is a syntax checker available. To use it you need to download and build it. This assumes you have Java, Git, and Leiningen installed.

Listing 3.80: Download the Riemann syntax checker

$ git clone https://github.com/samn/riemann-syntax-check.git

Then build the syntax checker as a Jar file.

Listing 3.81: Build the Riemann syntax checker

$ cd riemann-syntax-check
$ lein uberjar
...

You should see some dependencies downloaded and some Jar files created. You can then use one of these Jar files to syntax check a Riemann configuration.


Listing 3.82: Syntax check a Riemann configuration

$ cd /etc/riemann/
$ java -jar riemann-syntax-check-0.2.0-standalone.jar riemann.config

Here we’re syntax checking the riemann.config configuration file. Any errors or issues will be reported so you can fix them before reloading or restarting Riemann.

Performance, scaling, and making Riemann highly available

The performance of Riemann is largely memory-bound. The only disk IO it uses is logging. You should ensure any hosts running Riemann have generous allocations of memory.

You can configure how much additional memory goes to the Riemann process by configuring the Java heap size options that are passed when Riemann is launched. To do this we add any required options to the /etc/default/riemann file on Ubuntu and the /etc/sysconfig/riemann file on Red Hat and similar distributions. In both files, uncomment the line starting with EXTRA_JAVA_OPTS and add the -Xms and -Xmx flags with an appropriate amount of additional memory. These flags control the initial and maximum Java heap size. Like so:

Listing 3.83: Configuring additional RAM

EXTRA_JAVA_OPTS="-Xms4096m -Xmx4096m"


Oracle recommends that you set the initial and maximum heap size to the same value to minimize garbage collections.

NOTE You can read a bit more about the available -X options in the Java documentation.

You can also view the JVM tuning options in the Debian and RPM packages in the AGGRESSIVE_OPTS environment variable for other ideas on how to tune Riemann.

At this time there isn’t an out-of-the-box solution for high availability with Rie- mann. Riemann servers don’t come with a highly available, fail-over solution or configuration. This isn’t all that dire though. Riemann doesn’t maintain much state—it’s mostly being used as a routing engine for events and metrics. Losing that capability isn’t great, but because of the statelessness of Riemann it’s easy to rebuild or add a new Riemann host to replace any unavailable hosts.

Alternatively you could run one or more Riemann hosts in a warm standby and replace missing hosts with new ones relatively quickly. You could also supplement this with a shared proxy like HAProxy and add multiple Riemann servers to your back-end host pool. Then if a failure occurred on one host another would be available.

It's also not yet possible to automatically scale and distribute Riemann workloads in any sort of cluster or distributed manner. Scaling Riemann is largely a matter of horizontal scaling by adding Riemann servers, perhaps distributed by application, rack, stack, data center, or site. Again, using a solution like a shared proxy to distribute events and metrics to a series of back-end Riemann hosts depending on their source is an option.

Or you could create a sharded configuration. This configuration would run two or more Riemann servers in parallel. Every event would be assigned a shard ID from the Riemann server that received it. They would then be consolidated downstream, filtering out the shard ID and using a merge algorithm for each set of events.

Figure 3.5: Sharded Riemann

This book does not currently cover either high availability or scaling in any depth, but will be updated if better native solutions become available.


TIP There’s also a plugin to enable JMX-based monitoring of Riemann. You can see read more about using JMX to monitor the JVM in Chapters 8 and 12.

Alternatives to Riemann

There are a few event-based or message queue systems that work like Riemann. This is not a definitive list but it's a sampling of the more interesting tools you could use if the choices in the book aren't suitable or to your taste.

• Apache Samza and Apache Kafka — A distributed real-time computation system and a publish-subscribe messaging system with a cluster-centric design.
• Prometheus — An open-source systems monitoring and alerting toolkit originally built at SoundCloud.
• Heron — A realtime analytics platform developed by Twitter.
• Bosun — An open-source, MIT-licensed, monitoring and alerting system built by Stack Exchange.
• Anthracite — An event and change logging and management application.
• The ELK Stack — Elasticsearch, Logstash, Kibana. A powerful logging tool that can also be adapted to handle events and metrics. We'll look at it more closely in Chapter 8.
• Heka — Another logging tool, this one released by the Mozilla team. Sadly, no longer maintained. There is a potential successor called Hindsight.
• Godot — An event streamer modeled on Riemann but rewritten in Node.js. Instead of Clojure, it's written in JavaScript. It's somewhat slower than Riemann but shares many concepts.
• Ganglia — A monitoring tool with a focus on clusters and grids.
• — A popular metric and monitoring tool that uses RRDTool.
• Snap — An open-source telemetry collection and processing tool from Intel. It also supports publishing data to Riemann.


Summary

In this chapter we’ve learned how to install Riemann. We installed Riemann on three hosts: riemanna and riemannb in our production data centers, and riemannmc in our Mission Control environment. We’ve also had an introduction to configur- ing and running it.

We’ve also connected our upstream Riemann production servers with the down- stream Riemann server in Mission Control. We then set up notifications in the downstream Riemann server to detect if there are any issues with the upstream servers.

These Riemann servers will form the basis of our monitoring framework: event- centric routing engines that will allow us to collect, process, and send events and metrics. We’ll track the state of our hosts, services, and applications centrally in the Riemann index, notify on any critical or important events (or the absence of those events), and then graph any related metrics.

In the next chapter we’re going to continue our monitoring build by installing and configuring the graphing engine Graphite in our environment. We’ll then send some initial metrics from Riemann to Graphite.

Chapter 4

Introducing Graphite and Grafana

In the last chapter we installed Riemann and configured Riemann servers in each of our environments. Riemann provides the central destination and routing engine for our events and metrics. In the next layer of our monitoring framework we’re going to provide a destination to store our time-series metrics after they’ve been received by Riemann. We’re also going to look at tools to visualize those metrics so they provide us with insight into how our hosts, services, and applications are functioning.

The design for a metrics storage and visualization solution is:

Figure 4.1: Metrics processing and visualization

With this design we want our graph storage and visualization to be able to:


• Receive and process our metric data efficiently.
• Scale as we grow our environment.
• Provide scalable, efficient storage of metrics data.
• Provide a mechanism to elegantly degrade the granularity of metric data, storing at a high granularity for recent metrics and degrading for older metrics.

To provide the storage and visualization capabilities we’re going to install and configure a tool called Graphite.

Introducing Graphite

Graphite is an engine that stores time-series data and then can render graphs from that data using an API. Graphite is licensed under the Apache 2.0 license and is written in Python.

Graphite is simple and fast. It may not be the most modern time-series database—it relies on flat files, for example, rather than more modern database-style implementations—but it is well tested and reliable. It comes with a wide collection of functions for manipulating, summarizing, and munging data for visualization. It also has a large and active community of users and a solid group of developers.

To get metrics into Graphite we're going to configure Riemann to send metrics to it. We'll also begin to look at Graphite's graphing and dashboard capabilities using a dashboard called Grafana so we can display that metric data.
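As a hedged preview of that wiring (we'll build it out properly when we connect Riemann and Graphite), Riemann ships with a riemann.graphite plugin that can forward events to Carbon. A minimal sketch, assuming the graphitea.example.com host we introduce below, might look like this:

; A sketch only: forward events from Riemann to a Carbon listener.
(require '[riemann.graphite :refer [graphite]])

(def graph (graphite {:host "graphitea.example.com"}))

(streams
  graph)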

TIP This book provides an introduction to Graphite but it is a hugely powerful tool with lots of capabilities. If you’re interested in becoming a Graphite expert we strongly recommend Jason Dixon’s amazing “Monitoring with Graphite” book.


Graphite is made up of three components:

Carbon

Carbon is a collection of event-driven daemons that listen on network ports. There are two major daemons we'll be using in the book: carbon-cache and carbon-relay. The carbon-cache daemon listens, receives, and writes time-series data to storage. The carbon-relay daemon listens, receives, and forwards time-series data to carbon-cache servers.

The two Carbon daemons, carbon-relay and carbon-cache, allow us to scale and cluster Carbon across individual and multiple hosts. For example, you might have a host behind a load balancer running multiple carbon-relay daemons that receive events and then forward them to another host running multiple carbon-cache daemons that ingest and process our data. In our initial Graphite configuration we're going to run everything on a single host but we'll learn more about scaling Carbon later on.

The Carbon daemons expect a stream of time-series data: essentially a metric, a value, and a timestamp. Metrics can be anything that comes in the form of a measurable output: systems metrics, applications metrics, business metrics. After Carbon receives metrics, it periodically flushes them to a storage database called Whisper.

Whisper

Whisper is a lightweight, flat-file database format for storing time-series data. It is not a daemon or a service; it is a database file format. Carbon writes metric data to the disk in the Whisper format. Each unique metric type is stored in a fixed-size file with a .wsp suffix, for example:


Listing 4.1: The Whisper file layout

.../carbon/whisper/HostA/memory-used.wsp
.../carbon/whisper/HostB/memory-free.wsp
.../carbon/whisper/HostB/memory-used.wsp

The size of each database file is determined by the number of data points stored.

Because this data is written to disk, disk I/O is the usual bottleneck when scaling Graphite servers. To counteract this you want fast local disks. For example, many people use hosts with SSD disks to store metrics data. You can also cluster Graphite servers, which we'll talk more about later in this chapter.

Graphite Web, Graphite-API, and Grafana

The last component of Graphite is Graphite Web, a Django-based web UI that can query the Carbon daemon and read Whisper data to return metrics data. It can be used to compose graphs directly, or it can provide a RESTful API that can be used by third-party tools to extract metrics to compose graphs. The API can return data or rendered output like .png images.

We’re not, however, going to use Graphite Web in our deployment. This is be- cause Graphite Web can be problematic to install and configure. Additionally, the interface is somewhat dated and hard to use.

Instead we’re going to use a more modern console called Grafana. Grafana is an open-source metrics dashboard that supports Graphite, InfluxDB, and OpenTSDB. It’s Apache 2.0 licensed. When using Graphite, Grafana runs on top of the Graphite Web API. So rather than install the full Graphite Web component we’re going to install an API integration and then the Grafana dashboard on top of that API.


Graphite architecture

Here’s the overall architecture: Carbon receives incoming metrics and writes them into the Whisper format on the disk, then the Graphite API makes those metrics available to Grafana. We’re going to install Graphite, the Graphite-API, and Grafana. Then we’ll config- ure and run each component.

Installing Graphite

We’re going to install Graphite onto three hosts:

• An Ubuntu 14.04 host in Production A with a hostname of graphitea.example.com and an IP address of 10.0.0.210.
• A Red Hat Enterprise Linux (RHEL) 7.0 host in Production B with a hostname of graphiteb.example.com and an IP address of 10.0.0.220.
• An Ubuntu 14.04 host in Mission Control with a hostname of graphitemc.example.com and an IP address of 10.0.0.200.

This is in line with the architecture we established in Chapter 3.


Figure 4.2: Metrics Architecture

We’re going to install Graphite on our servers with one carbon-relay daemon to receive all our incoming events and two carbon-cache daemons to process the events.


Figure 4.3: Graphite Architecture

This will set us up to be able to scale our Graphite installation by adding carbon-cache daemons on our Graphite host, if needed. It'll also make it easier to distribute our Graphite installation onto multiple hosts if we need to scale that far. To do this we're going to conduct a manual installation of the required prerequisites and packages to give you an understanding of how Graphite works, but in the real world we'd use a configuration management tool.

Installing Graphite on Ubuntu

On an Ubuntu 14.04 or later host Graphite is available from APT packages. Let’s install the packages we need.

Listing 4.2: Installing the Graphite package on Ubuntu

$ sudo apt-get update
$ sudo apt-get -y install graphite-carbon

First we’ve updated our APT package cache, and then we’ve installed the graphite -carbon package. The graphite-carbon package contains the Carbon storage en- gine.

During installation you’ll be asked whether your graph database should be re- moved if you uninstall Graphite. Answer No to ensure your graph data is pre- served.


Figure 4.4: Graphite on Ubuntu installation prompt

And we’re done.

Installing Graphite on Red Hat

Installing Graphite on Red Hat (and related distributions like CentOS and Fedora) is a bit trickier since not all the packages we need are available from the core package repositories. We'll need to get them from the "Extra Packages for Enterprise Linux" or EPEL repository.

Installing prerequisites on Red Hat

Let’s add the EPEL repository now.

Listing 4.3: Adding the EPEL repository for Graphite

$ sudo yum install -y epel-release

Now let’s install the prerequisite packages we require.


Listing 4.4: Installing Graphite prerequisite packages on Red Hat

$ sudo yum install -y python-setuptools

Here we’ve installed Python’s setup tools, which provide some Python packaging support tools.

Installing Graphite on Red Hat

Now that we have these prerequisite packages, it’s time to install Graphite itself.

Listing 4.5: Installing Graphite packages on Red Hat

$ sudo yum install -y python-whisper python-carbon

TIP On newer Red Hat and family versions, the yum command has been replaced with the dnf command. The syntax is otherwise unchanged.

Here we’ve installed python-whisper and python-carbon to provide Whisper and Carbon respectively.

Final installation steps on Red Hat

On Red Hat we’re going to make some small changes to the default package installation to provide some consistency across our hosts and simplify our configuration. If all your hosts running Graphite are running Red Hat you can probably skip this section and just note the changes in the configuration options in the Configuring Graphite and Carbon section.

Here are our steps:

• Create a new group and user called _graphite.
• Move the default path of /var/lib/carbon/ to /var/lib/graphite.
• Change the ownership of the /var/log/carbon directory.
• Remove the now unused carbon user.

Let’s start with creating the new group and user.

Listing 4.6: Creating new Graphite user and group on Red Hat

$ sudo groupadd _graphite
$ sudo useradd -c "Carbon daemons" -g _graphite -d /var/lib/graphite -M -s /sbin/nologin _graphite

Next let’s move and change the ownership of the /var/lib/carbon directory.

Listing 4.7: Moving the /var/lib/carbon directory

$ sudo mv /var/lib/carbon /var/lib/graphite
$ sudo chown -R _graphite:_graphite /var/lib/graphite

Next, let’s change the ownership of the /var/log/carbon directory.


Listing 4.8: Changing the ownership of /var/log/carbon

$ sudo chown -R _graphite:_graphite /var/log/carbon

Lastly, let’s remove the carbon user.

Listing 4.9: Removing the carbon user

$ sudo userdel carbon

TIP This would all be better managed via a configuration management tool or in a Docker container. This walkthrough should just give you some insight into the changes we’re making as we try to make Graphite consistent across hosts.

Installing Graphite-API

Now that we’ve installed Graphite we’ll add the API layer that allows Grafana to connect to Graphite. The Graphite API is an open-source (Apache 2.0 licensed) add-on for Graphite.

Installing Graphite-API on Ubuntu

On Ubuntu (and Debian) Graphite-API is available as a community package. You can install it from the exoscale community package repository on the PackageCloud site. First let’s add the repository’s APT key.


NOTE If you’re not comfortable installing this from PackageCloud then you can make use of the pip-based installation instructions in the Red Hat installation section.

Listing 4.10: Adding the Graphite-API PackageCloud key

$ curl https://packagecloud.io/gpg.key | sudo apt-key add -

Then add a new APT repository listing.

Listing 4.11: Adding the PackageCloud exoscale repository listing

$ sudo sh -c "echo deb https://packagecloud.io/exoscale/community /ubuntu/ trusty main > /etc/apt/sources.list.d/ exoscale_community.list"

We can now update APT. We may need to add the apt-transport-https package as well.

Listing 4.12: Updating APT for Graphite-API

$ sudo apt-get install apt-transport-https
$ sudo apt-get update

Then we install the graphite-api package.


Listing 4.13: Installing the graphite-api package on Ubuntu

$ sudo apt-get install graphite-api

Installing Graphite-API on Red Hat

Unfortunately there are not yet any native packages available for Red Hat, so we have to install Graphite-API via Python pip. This means we’ll have to build the Graphite-API package and some of its dependencies ourselves.

First, we need to install some prerequisite packages.

Listing 4.14: Install Graphite-API prerequisite packages on Red Hat

$ sudo yum install -y python-pip gcc libffi-devel cairo-devel libtool libyaml-devel python-devel

This installs a compiler, some development libraries, and the Python pip binary command. We’ll then use the pip command to update some packages and install Graphite-API. We’re going to upgrade some local Python libraries and then install some new packages.

Listing 4.15: Installing Graphite-API via pip

$ sudo pip install -U six pyparsing websocket urllib3
$ sudo pip install graphite-api gunicorn


This will upgrade some Graphite-API prerequisites and then install the graphite-api and gunicorn pip packages. We’ll use the gunicorn package to deploy Graphite-API later in this chapter.

TIP There’s also a Puppet module to install Graphite-API on Red Hat.

Installing Grafana

Lastly, we’re going to install Grafana. There are native packages available for both Ubuntu and Red Hat.

Installing Grafana on Ubuntu

We can use Grafana’s package repository to install Grafana on Ubuntu. To do so, we’ll need to add another repository listing.

Listing 4.16: Adding the Grafana repository listing

$ sudo sh -c "echo deb https://packagecloud.io/grafana/stable/ debian/ wheezy main > /etc/apt/sources.list.d/ packagecloud_grafana.list"

Then we’ll add the PackageCloud key. This allows us to install their signed packages.


Listing 4.17: Adding the PackageCloud key

$ curl https://packagecloud.io/gpg.key | sudo apt-key add -

Finally, we’ll update our APT repositories and install the grafana package. (We may also need the apt-transport-https package, so we’re going to install that too.)

Listing 4.18: Installing the Grafana package

$ sudo apt-get update
$ sudo apt-get install -y apt-transport-https grafana

Installing Grafana on Red Hat

To install on Red Hat we can use Grafana’s package repository. To do this we’ll add another Yum repository listing. Create a new Yum repository definition at /etc/yum.repos.d/grafana.repo.

Listing 4.19: Creating the Grafana Yum repository

$ sudo touch /etc/yum.repos.d/grafana.repo

Then populate the /etc/yum.repos.d/grafana.repo file with the following:


Listing 4.20: Yum repository definition for Grafana

[grafana]
name=grafana
baseurl=https://packagecloud.io/grafana/stable/el/6/$basearch
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packagecloud.io/gpg.key https://grafanarel.s3.amazonaws.com/RPM-GPG-KEY-grafana
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

NOTE Replace the 6 above with your Red Hat version, for example 7 for RHEL 7.

Now install Grafana via the yum command.

Listing 4.21: Installing Grafana via Yum

$ sudo yum install grafana

TIP You may be prompted to accept GPG keys from PackageCloud during the process.


Installing Graphite and Grafana via configuration management

There are a variety of options for installing Graphite, Graphite-API, and Grafana via configuration management. You can find Chef cookbooks for:

• Graphite at https://github.com/hw-cookbooks/graphite.
• Graphite-API at https://supermarket.chef.io/cookbooks/graphite-api.
• Grafana at https://supermarket.chef.io/cookbooks/grafana.

You can find Puppet modules for:

• Graphite at https://forge.puppetlabs.com/garethr/graphite.
• Graphite-API at https://forge.puppetlabs.com/stevenmerrill/graphiteapi.
• Grafana at https://forge.puppetlabs.com/modules?utf-8=%E2%9C%93&sort=rank&q=grafana.

You can find Ansible roles for:

• Graphite at https://galaxy.ansible.com/list#/roles/391.
• Graphite-API at https://galaxy.ansible.com/list#/roles/1944.
• Grafana at https://galaxy.ansible.com/list#/roles/3563.

You can find Docker images for:

• Graphite at https://hub.docker.com/search/?q=graphite.
• Graphite-API at https://hub.docker.com/r/brutasse/graphite-api/.
• Grafana at https://hub.docker.com/search/?q=grafana.

You can find a Vagrant-based configuration tool for Graphite called Synthesize at https://github.com/obfuscurity/synthesize.


Configuring Graphite and Carbon

Next we need to configure Graphite and Carbon. Both Ubuntu and Red Hat install a default configuration file for Carbon. That file is /etc/carbon/carbon.conf. That example file is highly useful as it documents each option available to Carbon. We’re going to replace it, however, with our own file to configure Carbon.

Listing 4.22: Editing the Carbon settings

$ sudo vi /etc/carbon/carbon.conf

The Carbon configuration file is broken into “ini” file style sections managing each daemon: [cache] for the carbon-cache daemon and [relay] for the carbon-relay daemon.

NOTE There’s another section, [aggregator], for the carbon-aggregator daemon. It’s a buffering and metric manipulation daemon we’re not going to use in this book.

Let’s look at our carbon.conf configuration file now. It’s a bit big so we’re going to break it into pieces to review. You can find the full file in the book’s source code on GitHub.

Let’s look at each section in turn. The first section, [cache], configures our carbon-cache daemon. The first collection of configuration options controls where Carbon expects to find various components.


Listing 4.23: Directory configuration for Carbon

STORAGE_DIR = /var/lib/graphite/
CONF_DIR = /etc/carbon/
LOG_DIR = /var/log/carbon/
PID_DIR = /var/run/
LOCAL_DATA_DIR = /var/lib/graphite/whisper/

You can see we’ve configured several directory locations. Of particular note is the /var/log/carbon directory where you’ll find all of Carbon’s log files.

NOTE If you installed on Red Hat we’ve potentially made some consistency changes during installation, such as changing the Carbon user to _graphite. This is where you would update Carbon if you’ve chosen to change anything.

Next, we’ve set the Carbon user and log rotation settings.

Listing 4.24: Setting the Carbon user and log rotation

USER = _graphite
ENABLE_LOGROTATION = True
LOG_UPDATES = False
LOG_CACHE_HITS = False

We’ve set the USER to _graphite for the carbon-cache daemon. You’ll see we’ve set the user again in the [relay] section. You need to set the user for each Carbon daemon you are running.

We’ve also enabled log rotation for our carbon-cache daemon and disabled logging of updates and cache hits. We don’t need to know every time a metric is updated—that can even cause contention if we’re writing logs to the same filesystem as our metrics.

Now let’s look at the carbon-cache daemon’s networking and port configuration.

Listing 4.25: Configuring the Carbon Cache daemon

LINE_RECEIVER_INTERFACE = 127.0.0.1
PICKLE_RECEIVER_INTERFACE = 127.0.0.1
CACHE_QUERY_INTERFACE = 127.0.0.1

[cache:1]
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
CACHE_QUERY_PORT = 7012

[cache:2]
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_PORT = 2024
CACHE_QUERY_PORT = 7022

Note that we’ve first defined three interfaces.

• LINE_RECEIVER_INTERFACE
• PICKLE_RECEIVER_INTERFACE
• CACHE_QUERY_INTERFACE

We’ve set each interface to bind to 127.0.0.1 or localhost. These daemons will be fed by the carbon-relay daemon, so they only need to be bound locally and not exposed. Each interface performs a different function.

The first two interfaces, LINE_RECEIVER_INTERFACE and PICKLE_RECEIVER_INTERFACE, handle Carbon’s two protocols: line or plaintext and pickle. The plaintext protocol is basic and takes single metrics in the following form:

Listing 4.26: The Carbon plaintext protocol

metric_path metric_value metric_timestamp

For example:

Listing 4.27: The Carbon plaintext protocol metric example

riemanna.cpu.usage 88 1423507423

Carbon will translate this into a form that Whisper can then write to disk. The protocol is simple enough that you can use Netcat, nc, to send a sample metric. For example:

Listing 4.28: Sending a plaintext metric via Netcat to Graphite

PORT=2003
SERVER=graphitea.example.com
echo "riemanna.someservice.somemetric 12 `date +%s`" | nc -c ${SERVER} ${PORT}

The pickle protocol is a much more efficient variant of the plaintext protocol.


Instead of receiving single metrics, it supports batches of metrics. The pickled data takes the form of a list of multi-level tuples.

Listing 4.29: The Graphite pickle protocol

[(path, (timestamp, value)), ...]

Each tuple contains the measurement for a single metric name. The measurement is nested as a second tuple containing the timestamp and value. Your client or application then accumulates an appropriately sized list of tuples and sends them to the pickle receiver as a packet with a header attached.

The last interface, the CACHE_QUERY_INTERFACE, is the interface on which carbon-cache will listen for incoming queries from the Graphite Web interface. Sometimes the web interface wants data that is still in the cache rather than in Whisper files. This interface allows the Graphite Web interface to query that data.

Next we see our two carbon-cache instances: [cache:1] and [cache:2]. A carbon-cache instance is configured by suffixing the name of the daemon, cache, with an instance name.

Listing 4.30: Carbon Cache instance definitions

[cache:name_of_instance]

We’re naming our instances numerically but you could also use letters or other names. In this case we’re using numerical naming because it makes starting and managing them easier from service management, as we’ll see later in the chapter.

We’ve already defined interface bindings for each carbon-cache daemon. Inside each instance we define the ports each instance will be bound on. So for instance 1, the plaintext protocol is bound on port 2013, the pickle protocol on port 2014, and the cache query on port 7012. For instance 2, the plaintext protocol is bound on port 2023, the pickle protocol on port 2024, and the cache query on port 7022.

Figure 4.5: Carbon daemon ports and architecture


You can see how we’ve distributed the ports. If we wanted to add another instance we’d specify:

Listing 4.31: A third carbon-cache instance

[cache:3]
LINE_RECEIVER_PORT = 2033
PICKLE_RECEIVER_PORT = 2034
CACHE_QUERY_PORT = 7032

And so on.

This configuration ties into our carbon-relay configuration:

Listing 4.32: The carbon-relay daemon configuration

[relay]
USER = _graphite
LINE_RECEIVER_INTERFACE = 10.0.0.210
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 10.0.0.210
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2014:1, 127.0.0.1:2024:2

We’ve now created a [relay] configuration block for our carbon-relay daemon. We’ve specified the _graphite user, and we’ve defined the plaintext and pickle protocol interfaces to bind to our external network interface—here we’re using graphitea’s IP address of 10.0.0.210. We’ve further bound the plaintext protocol to port 2003 and the pickle protocol to 2004. Carbon will receive metrics on this interface and these ports.

The carbon-relay daemon then uses the RELAY_METHOD and REPLICATION_FACTOR to determine how it will distribute those metrics. The RELAY_METHOD specifies to which carbon-cache instances specific metrics will be distributed. This can be a rules-based configuration where we, for example, specify the distribution by name or group of metrics. Or it can be a consistent hash, where metrics are evenly distributed between carbon-cache daemons. We’ve chosen the latter: the consistent-hashing method.

The REPLICATION_FACTOR is used to control how many replicas of metrics are sent to each destination. In this case we’re sending one version of each metric. We could set this to 2 or higher to send replicas of each metric to multiple destinations to provide metric redundancy. We’ll see an example of this later in this chapter. The last configuration option, DESTINATIONS, tells the carbon-relay daemon where to send metrics. In this case we’ve specified the pickle protocol ports of each of our carbon-cache instances: 127.0.0.1:2014:1 for instance 1, and 127.0.0.1:2024:2 for instance 2. If we added further instances here we’d add more entries. For example, if we were to add instance 3 above, the line would become:

Listing 4.33: Adding a third instance destination to carbon-relay

DESTINATIONS = 127.0.0.1:2014:1, 127.0.0.1:2024:2, 127.0.0.1:2034:3

TIP A (highly) simplistic rule for scaling the carbon-cache and carbon-relay daemons is one daemon per CPU core.


In our current configuration a metric would be received on our Graphite host on either port 2003 (plaintext) or port 2004 (pickle) by the carbon-relay daemon. That daemon would pass the metric to one of the carbon-cache daemons, determined by Carbon’s consistent hashing, on the localhost on either port 2014 or port 2024. The carbon-cache daemon will in turn write the metric in Whisper format to the disk.

Configuring Carbon’s metric retention

Next we’re going to configure the density of Carbon’s metric collection—essentially how long metrics should be stored and how detailed those metrics should be. In the metrics world we call that resolution.

Metric resolution is critical to good monitoring. For effective monitoring it’s important to collect the right information at the right frequency. Many traditional monitoring and metrics systems collect metrics at low resolution—for example, one data point every minute or even five minutes. This presents some big problems if we want to use this data to work out what’s going on. A single data point collected over a long period, especially for a volatile or variable service like CPU, memory, or metrics like transaction counts, isn’t likely to capture the scope of what’s going on with that component. This can result in a false sense of the state of that component.

In our case, we’re going to collect data at a high resolution, generally a minimum of one data point every two to three seconds. This should give us a clear picture of the state of most of the components we’re collecting metrics on.

Naturally there is a downside to collecting at high resolutions: performance and storage. The more metrics you collect, the more frequently you collect data points, the more time this all takes, and the more storage this consumes. Shortly we’ll talk more about metrics resolution and configuring resolutions for specific types of metrics, and we’ll provide you with some calculations to help you understand the performance and storage costs of higher resolutions.


In the meantime, though, we’re going to configure Carbon to provide a higher resolution by default.

Remember that your resolution decision is fairly permanent for your metrics. If you change the resolution, you’ll be unable to use the previously collected data for comparisons. This is because it’s generally fairly hard to compare metrics collected at different resolutions. Additionally, you’ll either need to resize the Whisper files or delete and recreate them. This is a non-trivial exercise—choose your resolution with care!

In Carbon, metric resolution is configured in a file called /etc/carbon/storage-schemas.conf. Let’s look at this file now.


Listing 4.34: Configuring Carbon storage schemas

# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...

# Carbon's internal metrics. This entry should match what is specified in
# CARBON_METRIC_PREFIX and CARBON_METRIC_INTERVAL settings
[carbon]
pattern = ^carbon\.
retentions = 60:90d

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d

A storage schema specifies how long Carbon should keep metrics and at what resolution they should be kept. A schema entry lists metrics by name and then specifies one or more retention periods for metrics that match.


Listing 4.35: Carbon storage schema

[name]
pattern = regex_pattern
retentions = periods

A storage schema entry consists of a name wrapped in square brackets, which uniquely identifies the specific schema. It has a regular expression pattern that matches specific metrics by their names. Also, a schema needs retentions. These specify the resolutions at which Carbon will store each individual metric and how long it will store data at that resolution. You can specify one or more retentions in the form of:

Listing 4.36: Sample retention periods for Graphite

sample_time:retention_period

Let’s look at the entries in our sample configuration file. The first entry, [carbon], manages Carbon’s own metrics. A regular expression pattern is matched to find these: any metric starting with carbon. The retention is then set for the Carbon metrics with the retentions entry. For the Carbon metrics, a data point is created every 60 seconds and kept for 90 days: 60:90d. This means each data point represents 60 seconds, and we want to keep enough data points for 90 days worth of data.

All other metrics in the default configuration use the [default_1min_for_1day] schema; the pattern matches .* or all events. In this schema, Graphite creates data points every 60 seconds and keeps enough data to represent one day. That’s a pretty low resolution by most standards, and Riemann processes events much more quickly. So we’re going to create a new schema and remove the [default_1min_for_1day] schema.

Listing 4.37: Creating a new default Graphite schema

[default]
pattern = .*
retentions = 1s:24h, 10s:7d, 1m:30d, 10m:2y

We’ve called our new schema [default]. Its pattern again matches .* or all events. This new schema, however, includes multiple retentions. Multiple retentions allow graceful downsampling of historical data, saving you disk space and performance. Our first retention, 1s:24h, creates data points every one second and keeps enough data for 24 hours. Our next retention, 10s:7d, retains 10-second data points for seven days and so on, right down to keeping 10-minute data points for two years.

To downsample from 1s:24h to 10s:7d, Graphite gathers all of the data from the past 10 seconds (this should be 10 data points, one generated every second). It then averages the data points to aggregate them and retains this new data point for seven days. By default, each retention averages the total as it downsamples, so you should generally be able to determine metric totals by reversing the average.

You can also configure Graphite to use methods other than averaging to aggregate the data points, including: min, max, sum, and last. This is done by configuring a /etc/carbon/storage-aggregation.conf file. You can use this sample file as a reference:

/usr/share/doc/graphite-carbon/examples/storage-aggregation.conf.example
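To give you a sense of the format, here’s a hedged example entry (not one we’re using in this book) that would sum, rather than average, any metric whose name ends in .count when downsampling:

[sum_counts]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum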

We’re not going to use these aggregations right now, but there is an annoyingly frequent log message that appears in our Carbon logs, /var/log/carbon/console.log:

Listing 4.38: Annoying Graphite log message

/etc/carbon/storage-aggregation.conf not found, ignoring.

Creating an empty /etc/carbon/storage-aggregation.conf file stops the message, so let’s do that now.

Listing 4.39: Creating an empty storage aggregation file

$ sudo touch /etc/carbon/storage-aggregation.conf

Estimating Graphite storage

Whisper database files are fixed size—they don’t grow like a normal relational database. Each Whisper file is created when a new metric is received. The size is calculated by the resolution and retention period of the metric being stored. A Whisper database file never becomes smaller or larger on its own.

As Whisper files are fixed size, we can easily predict our storage requirements. Let’s look at an example. Firstly, each Graphite metric data point is 12 bytes in length. Our resolution and retention configuration will tell us how many data points will be stored. Let’s quickly look back at that.


Listing 4.40: Creating a new default Graphite schema

[default]
pattern = .*
retentions = 1s:24h, 10s:7d, 1m:30d, 10m:2y

So if we’re storing MetricA at one-second intervals for 24 hours, we can easily calculate this as 86,400 data points (there are usually 86,400 seconds in a day). At 12 bytes a data point, this is 1,036,800 bytes or ~1.0368Mb. That’s only one of the pieces of our retention though. We’d have to calculate each to get the full size of the Whisper database file. J. Javier Maestro has written a Whisper calculator that makes life much easier. The whisper-calculator.py script is a Python application that you can feed a retention period, and it’ll return the resulting Whisper database file size.

Let’s grab the calculator now.

Listing 4.41: Downloading the calculator

$ cd ~
$ wget https://gist.githubusercontent.com/jjmaestro/5774063/raw/9b615fa9a4666529c264af738ffa34ecc0298bd6/whisper-calculator.py

Now we try it out.


Listing 4.42: The whisper-calculator.py

$ python ./whisper-calculator.py 1s:24h,10s:7d,1m:30d,10m:2y
1s:24h,10s:7d,1m:30d,10m:2y >> 3542464 bytes

Here we provided our retention period, and the calculator returned a size of 3,542,464 bytes—so for each metric we’re collecting we’ll need ~3.5Mb of space. We can then work out a total capacity requirement, factoring in the number of hosts and metrics. For example, for 100 hosts, each with 100 metrics, we’d need 35,424Mb or 35.4Gb.
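We can also roughly sanity check that figure ourselves. Summing the data points in each archive (86,400 + 60,480 + 43,200 + 105,120 = 295,200) and multiplying by 12 bytes accounts for almost all of it; the remaining 64 bytes are Whisper’s file and archive header metadata.

$ echo $(( (86400 + 60480 + 43200 + 105120) * 12 ))
3542400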

NOTE There’s also a useful command shipped with Graphite called whisper-info that you can use on an existing Whisper file to return details of what data points are being stored and their sizes.
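For example, once a metric exists you might run something like the following (the path here is a placeholder, and the exact command name and output format can vary slightly between packages):

$ whisper-info /var/lib/graphite/whisper/path/to/some/metric.wsp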

Carbon and Graphite service management

Now we need to set up Carbon and Graphite so that they’ll run via service management.

Service management on Ubuntu

We need to replace the standard Ubuntu init scripts installed by the Graphite package for carbon-cache and add a new init script for the carbon-relay daemon. This will allow us to run the multiple carbon-cache daemons we configured above, as well as a carbon-relay daemon.


We’ve prepared a carbon-cache init script you can download from GitHub. It’s based on the existing carbon-cache init script with modifications to run multiple daemons. The number of daemons to run is controlled by an environment variable set in the /etc/default/graphite-carbon file. We’ve also provided an example /etc/default/graphite-carbon file. Let’s get started.

Download the init script.

Listing 4.43: Download the Carbon Cache init script on Ubuntu

$ wget https://raw.githubusercontent.com/jamtur01/aom-code/master/4/graphite/carbon-cache-ubuntu.init

Copy the init script into /etc/init.d/ and set its permissions.

Listing 4.44: Install the Carbon Cache init script on Ubuntu

$ sudo cp carbon-cache-ubuntu.init /etc/init.d/carbon-cache
$ sudo chmod 0755 /etc/init.d/carbon-cache

Enable the daemon.

Listing 4.45: Enable the Carbon Cache init script on Ubuntu

$ sudo update-rc.d carbon-cache defaults

We also need to configure an init script for our carbon-relay daemon. Repeat the steps here.


Listing 4.46: Install the Carbon relay init script on Ubuntu

$ wget https://raw.githubusercontent.com/jamtur01/aom-code/master/4/graphite/carbon-relay-ubuntu.init
$ sudo cp carbon-relay-ubuntu.init /etc/init.d/carbon-relay
$ sudo chmod 0755 /etc/init.d/carbon-relay
$ sudo update-rc.d carbon-relay defaults

Finally, let’s configure Carbon to run by default by editing the /etc/default/graphite-carbon file.

Listing 4.47: Enable Graphite at startup on Ubuntu

$ sudo vi /etc/default/graphite-carbon

Change the value of CARBON_CACHE_ENABLED=false to CARBON_CACHE_ENABLED=true. This will tell the carbon-cache daemon to run at boot time.

We’re also going to configure how many carbon-relay and carbon-cache daemons we want to run inside this file. To do this we’ll use two variables: RELAY_INSTANCES and CACHE_INSTANCES. These variables will be used inside our carbon-relay and carbon-cache init scripts to launch the specified number of daemons. This will allow us to quickly scale daemons. Set these variables to 1 and 2 respectively.

Your final /etc/default/graphite-carbon file will look like:


Listing 4.48: The /etc/default/graphite-carbon file

CARBON_CACHE_ENABLED=true
RELAY_INSTANCES=1
CACHE_INSTANCES=2

Let’s start the carbon-cache and carbon-relay daemons.

Listing 4.49: Starting the Carbon daemons on Ubuntu

$ sudo service carbon-relay start
$ sudo service carbon-cache start

You should now be able to check the /var/log/carbon/console.log file to see if the daemons are running and Carbon is up.
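A quick way to eyeball this (a hedged sketch; adjust the ports if you’ve changed the configuration) is to tail the Carbon console log and confirm the relay and cache ports are listening:

$ tail /var/log/carbon/console.log
$ ss -ltn | grep -E ':(2003|2004|2013|2023)'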

NOTE We’ve included all example configuration including service management scripts in the book on GitHub.

Soon we’ll connect Riemann to Carbon and you’ll be able to see metrics flowing into Graphite.

Service management on Red Hat

On Red Hat we use systemd for service management. We’re going to create systemd unit files for both the carbon-cache and carbon-relay daemons. We need to run two instances of the carbon-cache daemon and a single instance of the carbon-relay daemon. We’re going to take advantage of systemd’s instantiated services to create a single unit file with multiple instances. We’ll start with the carbon-cache daemon. Let’s create a file called:

/lib/systemd/system/[email protected]

TIP The @ indicates we’re going to use instantiated services inside our unit file.

Populate that file like so:

Listing 4.50: systemd unit file for the Carbon Cache daemon

[Unit]
Description=carbon-cache instance %i (graphite)

[Service]
ExecStartPre=/bin/rm -f /var/run/carbon-cache-%i.pid
ExecStart=/usr/bin/carbon-cache --config=/etc/carbon/carbon.conf --pidfile=/var/run/carbon-cache-%i.pid --logdir=/var/log/carbon/ --instance=%i start
Type=forking
PIDFile=/var/run/carbon-cache-%i.pid

[Install]
WantedBy=multi-user.target

Our unit file runs instances of the carbon-cache daemon. In our case, each instance will correlate with the cache instances we configured earlier: 1 and 2. In the unit file this is represented by the %i variable. When the unit file is run this will be replaced with the instance number we’ll pass to it.

Let’s enable and start each carbon-cache daemon.

Listing 4.51: Enabling and starting the systemd Carbon Cache daemons

$ sudo systemctl enable [email protected] $ sudo systemctl enable [email protected] $ sudo systemctl start [email protected] $ sudo systemctl start [email protected]

These commands will enable two instances of the carbon-cache daemon, carbon-cache@1 and carbon-cache@2. We’ve also started each instance of the carbon-cache daemon.
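As a quick sanity check (not a required step) we can confirm both instances are running with systemctl:

$ sudo systemctl status carbon-cache@1 carbon-cache@2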

Now let’s create a unit file for the carbon-relay daemon. We’ll call it /lib/systemd/system/[email protected], and we’ll populate it.


Listing 4.52: systemd unit file for the Carbon relay daemon

[Unit]
Description=carbon-relay instance %i (graphite)

[Service]
ExecStartPre=/bin/rm -f /var/run/carbon-relay-%i.pid
ExecStart=/usr/bin/carbon-relay --config=/etc/carbon/carbon.conf --pidfile=/var/run/carbon-relay-%i.pid --logdir=/var/log/carbon/ --instance=%i start
Type=forking
PIDFile=/var/run/carbon-relay-%i.pid

[Install]
WantedBy=multi-user.target

Let’s enable and start the carbon-relay daemon.

Listing 4.53: Enabling and starting the systemd Carbon relay daemon

$ sudo systemctl enable [email protected] $ sudo systemctl start [email protected]

And let’s delete the old systemd unit files.


Listing 4.54: Remove the old systemd unit files for Carbon

$ sudo rm -f /lib/systemd/system/carbon-relay.service
$ sudo rm -f /lib/systemd/system/carbon-cache.service

You should now be able to check the /var/log/carbon/console.log file to see if the daemons are running and Carbon is up.

NOTE We’ve included all example configuration including service management scripts in the book on GitHub.

Shortly we’ll connect Riemann to Carbon and you’ll be able to see metrics flowing into Graphite.

Configuring Graphite-API

The default configuration file for Graphite-API is located at /etc/graphite-api.yaml. It’s installed automatically on Ubuntu, but on Red Hat we’ll need to create the file. The configuration file uses the YAML format.

On both distributions we’re going to use our own configuration file. Populate the /etc/graphite-api.yaml file with the following content:


Listing 4.55: The /etc/graphite-api.yaml file

search_index: /var/lib/graphite/api_search_index
finders:
  - graphite_api.finders.whisper.WhisperFinder
functions:
  - graphite_api.functions.SeriesFunctions
  - graphite_api.functions.PieFunctions
whisper:
  directories:
    - /var/lib/graphite/whisper
carbon:
  hosts:
    - 127.0.0.1:7012
    - 127.0.0.1:7022
  timeout: 1
  retry_delay: 15
  carbon_prefix: carbon
  replication_factor: 1
time_zone: UTC

TIP You can find a full description of Graphite-API’s configuration in their documentation.

Let’s examine what this does. The first option, search_index, is the location of a search index file that Graphite-API will use to index metrics. We’ve put it in our /var/lib/graphite directory. It needs to be writeable by the Graphite-API.


Let’s create it now and change its ownership to the _graphite user.

Listing 4.56: Creating the /var/lib/graphite/api_search_index file

$ sudo touch /var/lib/graphite/api_search_index
$ sudo chown _graphite:_graphite /var/lib/graphite/api_search_index

The next two sets of options, finders and functions, provide plugins to connect and manipulate various API functions. The finders are plugins that specify places where Graphite-API can find Graphite data. In our case we’re only defining one finder: graphite_api.finders.whisper.WhisperFinder. This finder locates and loads Whisper files on the local filesystem. If you were using another source of Graphite data—for example, a third-party data store—there are other finders you could use.

Functions are used to transform, combine, and perform computations on data pulled from Graphite. In this case Grafana makes calls to retrieve data from Graphite that include some functions it expects the Graphite API to understand. These functions are provided by Graphite-API.

The whisper option is used by the graphite_api.finders.whisper.WhisperFinder finder to specify the location of the Whisper files we want Graphite-API to query. This is a list of the directories where we can find Whisper files—for us, that’ll be /var/lib/graphite/whisper.

Sometimes the data you want might be inside Carbon’s cache. The carbon option allows Graphite-API to query that cache. We specify the hosts option with a list of Carbon Cache daemons and ports. Remember, we’ve already bound our two Carbon Cache daemons to 127.0.0.1 or localhost. The port specified is the cache query port defined by CACHE_QUERY_PORT for each cache daemon in the Carbon configuration we defined above. In this case, our first cache is on port 7012 and our second on port 7022.

The last option, time_zone, is the time zone of the Graphite data we’re pulling. It’s crucial this matches the time zone of the host. If your time zone is set wrong you can expect to see some odd results in your data as time series become confused. We’ve set it to UTC time, and we’ll see later in this chapter how to ensure our host and other services match and stay accurate.
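A quick way to confirm the host agrees with this setting (assuming a systemd-based host; on other hosts the date command will also show the configured time zone) is:

$ timedatectl | grep 'Time zone'
$ date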

Service management for Graphite-API

On Ubuntu we don’t need to configure any service management as the package we installed comes with an init script. You can use it to both start and stop Graphite-API.

Listing 4.57: Restarting the Graphite-API on Ubuntu

$ sudo service graphite-api start

Graphite-API will now be bound to port 8888 on all interfaces on the host. You can adjust the configuration in the /etc/init.d/graphite-api init script to bind it to other interfaces or to the local host.

WARNING There is a minor bug in the init script of the 1.0.1 release of the Graphite API. On line 47 of the init script at /etc/init.d/graphite-api you may need to change the variable $PID_FILE to $PIDFILE. This is fixed in later releases.

On Red Hat, though, we need to set up service management ourselves. We’re going to create a systemd unit file to run Graphite-API.


Let’s create a file called /lib/systemd/system/graphite-api.service. We’re going to populate that file with the following configuration:

Listing 4.58: Systemd script for Graphite-API

[Unit]
Description=graphite-api (graphite)

[Service]
ExecStartPre=/bin/rm -f /var/run/graphite-api.pid
ExecStart=/usr/bin/gunicorn --pid /var/run/graphite-api.pid -b 0.0.0.0:8888 --daemon graphite_api.app:app
Type=forking
PIDFile=/var/run/graphite-api.pid

[Install]
WantedBy=multi-user.target

NOTE We’ve included all example configuration including service management scripts in the book on GitHub.

This runs Graphite-API through the gunicorn daemon and binds it to all interfaces on port 8888. You could also pass gunicorn the -w flag to specify the number of workers for the server. A good rule of thumb for the number of workers is two times the number of cores on the host plus one.


Listing 4.59: Gunicorn worker calculations

(2 x number_of_cores) + 1

TIP For more information on Gunicorn workers you can read this useful FAQ entry.
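For example, on a two-core Graphite host that formula suggests five workers. A hedged tweak to the unit file above would be to add -w 5 to the ExecStart line:

ExecStart=/usr/bin/gunicorn --pid /var/run/graphite-api.pid -b 0.0.0.0:8888 -w 5 --daemon graphite_api.app:app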

We can then enable and start the Graphite-API service.

Listing 4.60: Enabling and starting the systemd Graphite-API daemons

$ sudo systemctl enable graphite-api.service
$ sudo systemctl start graphite-api.service

Testing the Graphite-API

To test that the API is working, we query a URL. On your Graphite hosts use a web browser to browse to http://hostname:8888/render?target=test. For example, on our graphitea host this would be: http://graphitea.example.com:8888/render?target=test

You should see an empty graph with a label of No data.


Figure 4.6: Testing Graphite-API

Don’t panic about the “No data” message. This isn’t an error—we just haven’t selected any specific data.
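If you’d rather test from the command line, you can ask the render API for JSON instead of an image. This is a hedged example; with no matching series you should get back an empty JSON response rather than an error:

$ curl 'http://graphitea.example.com:8888/render?target=test&format=json'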

Configuring Grafana

There are two places where we can configure Grafana: a local configuration file, /etc/grafana/grafana.ini, and the Grafana web interface. To access the web interface we need to start the Grafana web service, so let’s do that first with the service binary.

Listing 4.61: Starting the Grafana Server

$ sudo service grafana-server start

This will work on both Ubuntu and Red Hat. Grafana is a Go-based web service that runs on port 3000 by default. Once it’s running you can browse to it using your web browser.

Figure 4.7: The Grafana Console login

You’ll see a login screen initially. The default username and password are admin and admin. You can control this by updating the [security] section of the /etc/grafana/grafana.ini configuration file.

You can configure user authentication including integration with Google authentication, GitHub authentication, or local user authentication. The Grafana configuration documentation includes sections on user management and authentication. For our purposes, we’re going to assume the console is inside our environment and stick with local authentication.
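If you did want to change the default credentials, the relevant keys in that [security] section look something like this (a sketch; we’re sticking with the defaults in this book, and the password shown is purely illustrative):

[security]
admin_user = admin
admin_password = some_new_password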

Log in to the console by using the admin / admin username and password pair and clicking the Log in button. You should see the Grafana default console view.


Figure 4.8: The Grafana Console

Next, we need to connect Grafana to our Graphite data via Graphite-API. Remember, Graphite-API is running on our Graphite server on port 8888, so on graphitea we find it at http://graphitea.example.com:8888. We can use the curl command to test this is working.

Listing 4.62: Curling the Graphite-API

$ curl http://graphitea.example.com:8888

We need to add this as a data source to Grafana by clicking the Grafana logo in the top left of the console to open the menu.


Figure 4.9: The Grafana Console menu

Then we click on the Data Sources link, where we’ll find an empty list of data sources.


Figure 4.10: The Grafana Data Sources menu

Click on the Add new link to add a new data source. We need to specify a few details. First we need to name our data source; we’re going to call ours Graphite. Next we need to check the Default checkbox to tell Grafana to search for data in this source by default. We also need to ensure the data source Type is set to Graphite.

Next we need to specify the HTTP settings for our data source. This is the URL of the Graphite-API. We also need to set the Access option to proxy. Surprisingly this doesn’t configure an HTTP proxy for our connection, but it tells Grafana to use its own web service to proxy connections to the Graphite-API. The other option, direct, makes direct connections from the web browser. The proxy setting is much more practical as the Grafana service takes care of connectivity.


Figure 4.11: Adding a Grafana data source

To add our new data source, click the Add button. This will save it. On the screen we can now see our data source displayed. Click the Test Connection button to ensure this new connection is working.

Figure 4.12: Testing the Grafana connection
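If you’d prefer to automate this step, recent Grafana releases also expose an HTTP API for managing data sources. A hedged sketch, assuming the default admin credentials and our graphitea Graphite-API endpoint (the exact fields can vary between Grafana versions):

$ curl -X POST -H 'Content-Type: application/json' -u admin:admin \
    -d '{"name":"Graphite","type":"graphite","url":"http://graphitea.example.com:8888","access":"proxy","isDefault":true}' \
    http://localhost:3000/api/datasources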

Then click the Dashboards link to return to the main console view. As we don’t yet have any data in Graphite it would be a bit hard to create graphs right now. Let’s connect Riemann and Graphite to get that data flowing, and then we’ll come back to Grafana to create some initial graphs.


Configuring Riemann for Graphite

Now that we’ve got Graphite set up we can start to direct metric events from Riemann to it. To do this we’re going to add a Graphite output to Riemann. Riemann has a native plugin to send events to Graphite. We’re going to include this plugin in our /etc/riemann/examplecom/etc directory as a configuration snippet. This will keep our configuration neat. Let’s create a file to hold our new Graphite plugin snippet.

Listing 4.63: Creating the Riemann Graphite snippet file

$ sudo touch /etc/riemann/examplecom/etc/graphite.clj

Now let’s populate this file.

Listing 4.64: The graphite.clj file

(ns examplecom.etc.graphite
  (:require [riemann.graphite :refer :all]
            [riemann.config :refer :all]))

(defn add-environment-to-graphite [event]
  (str "productiona.hosts." (riemann.graphite/graphite-path-percentiles event)))

(def graph
  (async-queue! :graphite {:queue-size 1000}
    (graphite {:host "graphitea" :path add-environment-to-graphite})))

We first create a namespace, examplecom.etc.graphite, and then require Riemann’s Graphite library, riemann.graphite. We also require riemann.config to make async-queue! available in this namespace. You can see we’ve added a new function, add-environment-to-graphite, and a var called graph to Riemann.

The function add-environment-to-graphite adds a prefix to any metrics sent to Graphite. Graphite’s metric path format is usually in the form of something like:

Listing 4.65: Graphite metrics format

prefix.hostname.service.type.of.metric

The add-environment-to-graphite function prefixes this path on our riemanna.example.com host with the string productiona.hosts.. This will result in a new Graphite metrics path like so:

Listing 4.66: New Graphite metrics format

productiona.hosts.hostname.service.type.of.metric

This adds the environment that our host or service runs in to our metrics and helps us namespace those metrics in Graphite. On riemannb.example.com we’d add productionb.hosts. instead.

TIP Using a configuration management tool you could template our Riemann Graphite configuration to select the right environment depending on which Rie- mann host you are running on. This would remove the need to hardcode any values. If you didn’t want to prefix values you could just omit the function and the :path option.


Listing 4.67: The add-environment-to-graphite function

(defn add-environment-to-graphite [event]
  (str "productiona.hosts." (riemann.graphite/graphite-path-percentiles event)))

Let’s break this function down so we understand how it works. We’ve named the function add-environment-to-graphite, and we’re passing in a single argument, event. The event argument is Riemann’s shorthand for the event we’re processing, in this case an event we’re sending to Graphite.

We then combine the productiona.hosts. string with the Riemann Graphite plugin’s existing path construction function, riemann.graphite/graphite-path-percentiles. The riemann.graphite/graphite-path-percentiles function creates a Graphite path for each metric by taking the reversed, fully qualified domain name followed by the service, converting any spaces to dots. The function also converts trailing decimals like 0.95 to 95.
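To make that concrete, here’s a hypothetical REPL-style sketch (the event map is invented for illustration) showing what the function would produce for one of Riemann’s internal latency events:

(add-environment-to-graphite
  {:host "riemanna" :service "riemann streams latency 0.95"})
;; => "productiona.hosts.riemanna.riemann.streams.latency.95"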

TIP You can read about the Riemann Graphite plugin’s path construction in the Riemann Graphite documentation.

If we then turn to our graph var we see how this is combined.


Listing 4.68: The graph var

(def graph
  (async-queue! :graphite {:queue-size 1000}
    (graphite {:host "graphitea" :path add-environment-to-graphite})))

The graph var defines a connection to our Graphite server. We’ve again used the async-queue! stream we saw in Chapter 3. The async-queue! stream creates an asynchronous thread pool queue. In this case we’re using the async-queue! to ensure that any issues with our Graphite server do not block Riemann’s normal processing. We’ve called the queue graphite and specified a queue size of 1000 events.

Inside our asynchronous queue we’ve specified the graphite plugin and configured it. We’ve specified the hostname of our Graphite server using the :host parameter. Here on riemanna we’ve specified graphitea. On riemannb we’d specify graphiteb. And on riemannmc we’d specify graphitemc.

NOTE By default the Graphite plugin uses TCP to send the events but you can also use UDP. However we strongly recommend you don’t. UDP isn’t a safe protocol for your metrics as it has no guarantees of delivery.

We’ve also specified the :path parameter to tell the Graphite plugin how to construct the metric name, in this case by calling the add-environment-to-graphite function we’ve just created.

NOTE This assumes you’ve configured DNS or added the various Graphite servers to /etc/hosts or provided some other way for Riemann to resolve the hostnames.

Now that we’ve added our Graphite plugin, let’s go back to the /etc/riemann/riemann.config file. We’ll use the graph var in our streams to send events to Graphite.


Listing 4.69: Riemann Graphite configuration for riemanna

(require '[examplecom.etc.graphite :refer :all])

...

(let [index (index)
      downstream (batch 100 1/10
                   (async-queue! :agg {:queue-size 1000
                                       :core-pool-size 4
                                       :max-pool-size 32}
                     (forward
                       (riemann.client/tcp-client :host "riemannmc"))))]

  (streams
    (default :ttl 60
      index

      #(info %)

      (where (service #"^riemann.*")
        graph

        downstream))))

We first require our examplecom.etc.graphite namespace. Then, inside our streams, we use the graph var to send events through to Graphite. In our initial configuration we’ve added the graph var inside the where stream that sends Riemann events downstream to the riemannmc Riemann server.


This will take any Riemann-specific events and send them to Graphite to be used as metrics. Let’s restart or reload Riemann to start sending our events to Graphite.

Listing 4.70: Reloading Riemann to enable Graphite

$ sudo service riemann reload

In our /var/log/riemann/riemann.log log file we should see a connection established to our Graphite server.

Listing 4.71: Riemann connecting to Graphite

... clojure-agent-send-off-pool-3 - riemann.graphite - Connecting to {:port 2003, :host graphitea}
... clojure-agent-send-off-pool-1 - riemann.graphite - Connected

On our Graphite server, graphitea, we should see our Riemann metrics start to be created in the /var/log/carbon/creates.log log file.


Listing 4.72: Metrics being created on graphitea

14/02/2015 15:28:21 :: new metric productiona.hosts.riemanna.riemann.streams.latency.95 matched schema default
14/02/2015 15:28:21 :: new metric productiona.hosts.riemanna.riemann.streams.latency.95 matched aggregation schema default
14/02/2015 15:28:21 :: creating database file /var/lib/graphite/whisper/productiona/hosts/riemanna/riemann/streams/latency/95.wsp (archive=[(10, 360), (60, 10080), (900, 2880), (3600, 17520)] xff=None agg=None)

We see that the incoming metrics have been matched against the default storage schema that we configured earlier and that each metric is in the form of:

Listing 4.73: metric name formatting

productiona.hosts.host.service.type.of.metric

Or in the case of a specific metric:

Listing 4.74: The Riemann streams latency path

productiona.hosts.riemanna.riemann.streams.latency.95

Graphite calls this the metric path. The metric path is like a branching tree, with each branch separated by a period.


Listing 4.75: The metric path tree

productiona.
  hosts.
    riemanna.
      riemann.
        streams.
          latency.
            95

Additionally, for each metric, Whisper files are created:

Listing 4.76: Our Riemann Whisper files

/var/lib/graphite/whisper/productiona/hosts/riemanna/riemann/streams/latency/95.wsp

You can explore the /var/lib/graphite/whisper/ directory and see all of our Riemann metrics in the productiona/hosts/riemanna directory, and all of Carbon’s own metrics in the carbon directory.

After the initial creation we see each data point added in the /var/log/carbon/updates.log log file.


Listing 4.77: Graphite datapoints in the updates log

14/02/2015 15:29:22 :: wrote 1 datapoints for productiona.hosts.riemanna.riemann.streams.latency.0 in 0.00030 seconds
14/02/2015 15:29:22 :: wrote 1 datapoints for productiona.hosts.riemanna.riemann.streams.latency.5 in 0.00030 seconds

TIP These logs will get auto-rotated daily.

A brief introduction to Grafana

Now that we’ve got some data in Graphite, let’s look at creating our first graph. Browse back to our Grafana console and log in (remember, unless you changed it, the username and password are both admin).

Figure 4.13: The Grafana console again

We’re first going to create a new dashboard. Click on the Home button to open the menu, and then click the + New button to create a new dashboard.


A new dashboard, titled New Dashboard, will be created. You’ll see a menu next to the dashboard.

Figure 4.14: The Grafana Dashboard menu

Click on the last circular button to open the Settings menu. Inside this menu click on the Settings link. This will open the Settings sub-menu, where we name and configure our new Dashboard.

Figure 4.15: The Grafana Settings menu

In the Title box we’re going to call our dashboard Riemann. Update the field and then click the save button in the dashboard menu to save the new dashboard.

Each dashboard is made up of rows, and each row consists of one or more panels. This creates a grid-style dashboard. Inside each row, panels can consist of text, single metrics, graphs, or even lists of other dashboards.


Figure 4.16: The Riemann dashboard menu

Each new dashboard comes with a single auto-created row. You can edit that row by clicking the green bar that opens the Row menu. There are options that allow you to control the location, height, and content of the row. We’re going to click the Add Panel link and then the Graph link to add our first graph.

Figure 4.17: Adding our first graph

Our new graph will be the length of the row (the default), have a title of Panel Title, and will be empty because we’ve not yet specified any data to be graphed.

Figure 4.18: Our first graph


To edit the graph, click on the Panel Title title. When the edit menu pops up, click edit. This will bring up the graph editing controls.

Figure 4.19: The graph editing controls

There’s a menu bar at the top of the controls. The first entry, General, controls the base configuration and name of our graph. Let’s click on that now. On this screen we see the title of the graph. Let’s edit it to Riemann Rate. We can set the Span (how much of the row our graph panel will take up) and Height. We can also decorate our graph with a link, perhaps drilling down into another dashboard or even a destination like another service or view. The other menu items control:

• Metrics — The data the graph will display.
• Axes & Grid — The display, content, and legend of the graph’s axes and grid.
• Display Styles — The look and feel of the graph.
• Time range — The span of time and any time shift you want to apply.

TIP You can return to the dashboard screen at any time using the Back to dashboard link. If you try to browse away from the dashboard without saving your changes you’ll be prompted to discard or save them.


From the Metrics menu we can add metrics and data to our graph. We’re going to build a graph that shows us the rate of events flowing through our riemanna Riemann server. To do this we need to select our metrics. There are two ways we can do this:

• Select the metric via the select metric box.
• Create a metric via the edit menu (the drop-down icon next to the metrics box).

The first is to click the select metric box. This will trigger a query through Grafana to our Graphite-API and into our Graphite data. A list of potential metrics will be returned, broken down by the metric path. We saw that Graphite uses a path-style format to construct metrics. For example, the Riemann rate metric has a path of:

Listing 4.78: The Riemann rate path

productiona.hosts.riemanna.riemann.streams.rate

This will correlate with a file on our graphitea host at:

/var/lib/graphite/whisper/productiona/hosts/riemanna/riemann/streams/rate.wsp

Grafana will return the first element of every metric in the path in the select metric box, in our case:


Listing 4.79: The first element of the metric path

*
productiona
carbon

This indicates that our Graphite server is storing a series of metrics starting with productiona and some starting with carbon (these metrics are automatically created by Carbon to help keep track of the Carbon daemons' performance). The * is a glob that selects all metrics. This is useful when we might want to, say, select all hosts, all environments, or the like. If we were to select the * here, Grafana would match both productiona and carbon. To proceed we select the path element we want (or the * glob). In this case we want to select productiona, which is the environment we're managing and was created by the add-environment-to-graphite function we defined in Riemann earlier.
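To make the path and glob behavior a little more concrete, here are two hypothetical Graphite targets built from the path we saw above:

productiona.hosts.riemanna.riemann.streams.rate
productiona.hosts.*.riemann.streams.rate

The first selects the rate metric from the riemanna host only. The second uses the * glob in the host position and would match the same metric for every host that reports under productiona.hosts.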

Figure 4.20: Selecting metrics

You can see that it has selected the first element and created a new select metric box. If we click this box we'll see the next element in the path and so on until we drill down to the metric we want to graph. When we get down to the last element the metric will be selected and graphed.


Figure 4.21: Our graphed Riemann rate metric

Here we see our graphed productiona.hosts.riemanna.riemann.streams.rate metric. Alternatively we can add the metric directly by clicking on the dropdown icon (at the end of the metrics query box) and selecting Switch editor mode.

Figure 4.22: The metric dropdown menu

Click this icon and an empty query box will appear. Add your metric, with its full path, to the box and then click outside it to specify the metric.


Figure 4.23: Our graphed Riemann rate metric via the edit box

If we then click Back to dashboard and click the “Save” icon in the top dashboard menu, this will save the new graphs to the dashboard.

Figure 4.24: Our graph and dashboard

This is a quick introduction to Grafana. We’ll see more about how to create graphs and dashboards in subsequent chapters. Some of the more useful Grafana features worth exploring include:

• Template and variable-driven dashboards and graphs.
• Scripted graphs and dashboards.
• Sharing dashboards and graphs.
• Playlist dashboards.

TIP You can also find a Getting Started guide, some useful screencasts, and a reference manual for Grafana that can help you get started.


Graphite and Carbon Redundancy

If we want to provide more redundancy in our metrics storage we can adjust our existing configuration to send our metrics to two hosts. To do this we use the carbon-relay daemon. In this approach we'd add a stand-alone carbon-relay host and a second Graphite server identical to the host we've just created. This stand-alone carbon-relay would send a copy of each metric to both hosts.


Figure 4.25: A redundant Graphite architecture

Let’s look at this in some more detail. First, let’s rename our existing graphitea

Version: v1.0.3 (2c4a7d0) 192 Chapter 4: Introducing Graphite and Grafana host to graphitea1. This let’s folks know that we’re going to have more than one Graphite server.

Then we configure two new hosts. The first will be an exact replica of our graphitea1 host. Let's call it graphitea2 with an IP address of 10.0.0.211. We'll use the installation instructions from earlier in this chapter to install and configure Graphite and Carbon on the graphitea2 host.

We’ll call the second host graphite-relay. We’ll follow the above instructions to install Graphite and Carbon but we’d abbreviate them a little because we’re not going to be running any caches or web clients on this host. For example, on Ubuntu, we’d only install the following packages:

Listing 4.80: Installing the Graphite packages on our relay-only host

$ sudo apt-get update
$ sudo apt-get -y install graphite-carbon

This is because we only need the Carbon daemons, and more specifically only the carbon-relay daemon, rather than a full suite of Graphite and Carbon components. Our carbon.conf would also be abbreviated if we're only running the carbon-relay daemon. Let's create that now in /etc/carbon/carbon.conf.


Listing 4.81: The carbon-relay only host

[relay]
USER = _graphite
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 2
DESTINATIONS = 10.0.0.210:2004, 10.0.0.211:2004

We see that the carbon-relay daemon is configured similarly to the carbon-relay daemons on our graphitea1 and graphitea2 hosts. There are a couple of important differences though. We've bound the LINE_RECEIVER_INTERFACE to all interfaces; we could also bind it to a specific interface. More importantly, the graphite-relay host will write to two DESTINATIONS: our original graphitea1 host on IP address 10.0.0.210 and our new graphitea2 host on IP address 10.0.0.211, both using the Pickle receiver port. This is instead of writing to any local carbon-cache instances. You can see that we've again specified the RELAY_METHOD as consistent-hashing, which creates a hash ring of potential destinations and sends metrics to them. This ensures our metrics are balanced. However, it doesn't ensure that our metrics will be duplicated across the two destination hosts. To do that we've adjusted the REPLICATION_FACTOR option to 2. This option tells Carbon how many copies of each metric to distribute. In this case we want two copies, one copy of each metric for each destination, effectively mirroring our metrics. We'd then configure the carbon-relay's service management and ensure the daemon was running.


On Riemann, we’d update the graph var in the /etc/riemann/examplecom/etc/ graphite.clj file to point to the new graphite-relay host.

Listing 4.82: The updated graph var

(def graph (graphite {:host "graphite-relay"
                      :path add-environment-to-graphite}))

Now when we restart Riemann it will write metrics to the graphite-relay host. The graphite-relay host will write a copy of each metric to the graphitea1 and graphitea2 hosts. You should soon be able to check that an identical set of Whisper files exists on both hosts.
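If you want a quick way to compare the two hosts, counting the Whisper files on each is a rough but useful check (this assumes the Whisper storage directory we configured earlier in the chapter):

$ find /var/lib/graphite/whisper -name '*.wsp' | wc -l

Run on both graphitea1 and graphitea2, the counts should converge once the relay has been replicating for a little while.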

NOTE Don’t panic if it takes a few minutes for all metrics to appear on both the graphitea1 and graphitea2 hosts.

There are some caveats with the resilience of this approach. The first is that the carbon-relay host is a single point of failure. If this host stops then metrics stop flowing through to Graphite (this is also true of Riemann in our current architecture). You can mitigate this with something like HAProxy if you wish to remove that single point of failure. There's an example of this configuration in this gist by Jason Dixon, and another sample HAProxy configuration is available on GitHub. Additionally, if you lose either the graphitea1 or graphitea2 host and then recover it, you'll have a gap in the metrics on the host that was lost. There are, however, some useful tools for managing Carbon clusters, including Carbonate, which allows you to manage, balance, re-sync, and redistribute metrics and complete other tedious cluster management tasks. This can help you recover from this or similar issues.
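As a rough sketch of the HAProxy approach, a TCP frontend in front of two carbon-relay hosts might look something like the following. This is illustrative only: the relay host IP addresses here are assumptions, and the linked examples above cover more complete configurations (including the Pickle port 2004).

listen carbon-relay
  bind 0.0.0.0:2003
  mode tcp
  balance roundrobin
  server relay1 10.0.0.220:2003 check
  server relay2 10.0.0.221:2003 check

Riemann, or any other sender, would then point at the HAProxy address rather than at a single carbon-relay host.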

TIP You should also consider backing up your metrics data in the event you need to recover a Graphite host.

It’s also important to note that this is just one potential method for clustering Graphite and Carbon. There are a variety of others with varying levels of re- silience.

Here are some useful blog posts containing examples of Graphite clustering:

• Clustering Graphite
• The architecture of clustering Graphite
• DynDNS's Graphite clustering configuration including HAProxy

Time and time zones

The last thing we need to care about for our Riemann and Graphite servers is time. As you'd imagine, any event system is heavily dependent on the time being correct and consistent. Clock skew between your hosts will cause issues correlating, indexing, and matching events. We're going to install a Network Time Protocol or NTP client and use it to get accurate time for all our hosts. We're also going to set our hosts to the UTC time zone. This ensures our Riemann and Graphite hosts are in the same time zone. You can manage time on your hosts by installing an NTP client manually via a package or by using configuration management. We're going to see both options.


Managing time manually

You can install the NTP packages and set your time zone manually on both Ubuntu and Red Hat.

On Ubuntu

We’re going to configure NTP and set our time zone on Ubuntu.

Setting our time zone on Ubuntu

To set our time zone we’re going to use the dpkg-reconfigure command.

Listing 4.83: Setting the time zone on Ubuntu

$ sudo dpkg-reconfigure tzdata

A configuration screen will appear. Select None of the above and press Enter.

Figure 4.26: Configuring time zone on Ubuntu


On the following screen select UTC and hit Enter again.

Figure 4.27: Selecting the UTC time zone on Ubuntu

This will select the Etc/UTC time zone and put your host into UTC time.

Listing 4.84: The time zone configuration output on Ubuntu

Current default time zone: 'Etc/UTC'
Local time is now:      Mon Feb 16 23:18:09 UTC 2015.
Universal Time is now:  Mon Feb 16 23:18:09 UTC 2015.

TIP On Ubuntu 14.04 and later there’s also the timedatectl command. You could use it like so: timedatectl set-timezone Etc/UTC.


Installing NTP on Ubuntu

On Ubuntu we want to install the NTP client and supporting packages.

Listing 4.85: Installing NTP on Ubuntu

$ sudo apt-get -y install ntp

We then do an initial time synchronization.

Listing 4.86: Initial time sync on Ubuntu

$ sudo ntpdate -s ntp.ubuntu.com

The NTP daemon will be started automatically as part of the package installation process.

On Red Hat

We’re going to configure NTP and set our time zone on Red Hat.

Setting our time zone on Red Hat

On Red Hat we’re going to set our time zone with the new timedatectl command. This command was added in Red Hat 7 and later releases.

TIP On earlier Red Hat releases see http://thornelabs.net/2013/04/25/rhel-6- manually-change-time-zone.html.


The timedatectl command can be used with the set-timezone option to specify a time zone. We're going to set it to Etc/UTC, or UTC time.

Listing 4.87: Running the timedatectl command on Red Hat

$ sudo timedatectl set-timezone Etc/UTC

We won’t see any output from the command, but we can check it’s worked by running the timedatectl command on its own.

Listing 4.88: Checking the time zone on Red Hat

      Local time: Mon 2015-02-16 23:28:47 UTC
  Universal time: Mon 2015-02-16 23:28:47 UTC
        RTC time: Mon 2015-02-16 23:28:47
        Timezone: Etc/UTC (UTC, +0000)
     NTP enabled: no
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a

Installing NTP on Red Hat

On Red Hat we want to install the NTP client and supporting packages.


Listing 4.89: Installing NTP on Red Hat

$ sudo yum -y install ntp ntpdate ntp-doc

Then enable the NTP service.

Listing 4.90: Enabling the NTP service on Red Hat

$ sudo systemctl enable ntpd.service

And do an initial time synchronization.

Listing 4.91: Initial time sync on Red Hat

$ sudo ntpdate pool.ntp.org
16 Feb 23:26:47 ntpdate[31954]: step time server 199.102.46.73 offset 0.910678 sec

Start the NTP service itself.

Listing 4.92: Starting the NTP service on Red Hat

$ sudo service ntpd start

Finally, let’s ensure everything knows about our NTP server by again running the timedatectl command, this time with the set-ntp flag to turn on NTP.


Listing 4.93: Enabling NTP with timedatectl

$ sudo timedatectl set-ntp true

If we run timedatectl again we should see NTP is enabled and working.

Listing 4.94: Checking NTP is enabled with timedatectl

$ sudo timedatectl
      Local time: Mon 2015-02-16 23:30:31 UTC
  Universal time: Mon 2015-02-16 23:30:31 UTC
        RTC time: Mon 2015-02-16 23:30:31
        Timezone: Etc/UTC (UTC, +0000)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a

Managing Time via configuration management

There are a variety of options for managing the time on your hosts via configuration management.

You can find a Chef cookbook for NTP at:

• https://supermarket.chef.io/cookbooks/ntp.

You can find a Puppet module for NTP at:


• https://forge.puppetlabs.com/puppetlabs/ntp.

You can find an Ansible role for NTP at:

• https://galaxy.ansible.com/list#/roles/464.

TIP We recommend using one of these to manage the time on all your hosts rather than configuring time manually.
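As a brief example of the configuration management route, the puppetlabs/ntp module listed above can be declared with something like the following (the server list here is an assumption; substitute your preferred NTP sources):

class { 'ntp':
  servers => ['0.pool.ntp.org', '1.pool.ntp.org', '2.pool.ntp.org'],
}

The Chef cookbook and the Ansible role expose similar lists of servers via attributes and variables respectively.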

Checking the time status

We can then check the status of our NTP service using the ntpq command, or NTP query program.

Listing 4.95: Checking the NTP time status

$ sudo ntpq -p

A list will be generated showing the NTP peers we've connected to. It includes a when column that tells us how many seconds have passed since we last synchronized with each of them.
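The output will look something like this (the peers and timings shown are illustrative only; yours will differ):

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp.example.com .GPS.            1 u   45   64  377   12.345    0.123   0.456
+0.pool.ntp.org  192.0.2.1        2 u   12   64  377   23.456   -0.234   0.567

The asterisk marks the peer currently selected for synchronization, and the when column shows the seconds since the last update from each peer.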

NOTE It’s also possible to report on NTP’s statistics via Riemann or collectd using the NTPd plugin.


Alternatives to Graphite and Grafana

There are numerous alternatives to Graphite, both commercial and open source. This is not a definitive list but more a sampling of interesting tools you could use if the choices in this book aren't suitable or to your taste.

Commercial tools

A number of commercial Software-as-a-Service (SaaS) products exist including:

• Circonus
• Geckoboard
• Leftronic
• Librato
• New Relic
• SignalFx

NOTE Some of these SaaS tools may do more than just metrics and may have monitoring, application performance management, and alerting capabilities.

Open-source tools

There are also a number of open-source products. Some include collection and storage, but others, like D3, only provide visualization and need to be combined with appropriate collection mechanisms.


Storage

• Druid — Druid is a distributed, real-time analytics data store.
• OpenTSDB — OpenTSDB is a distributed metrics store that uses Hadoop and HBase.

Visualization and analysis

• D3 — A Javascript library for data visualization.
• Graphene — A dashboard for Graphite data.
• Rickshaw — A Javascript library for data visualization.
• Tessera — An alternative dashboard for Graphite.
• Facette — A multi-data source dashboard written in Go.
• Dusk — A hot spot detection dashboard for Graphite that uses D3.
• Graph Explorer — A Graphite dashboard written by the team at Vimeo.
• Giraffe — A Graphite dashboard.

There’s also a great list of tools that integrate or are related to Graphite at http: //graphite.readthedocs.org/en/latest/tools.html.

Whisper alternatives

One way to address concerns about Whisper's performance is to replace Whisper entirely. There are a couple of alternatives to using Whisper:

• InfluxDB
• Cyanite

We’re not going to cover either of them in much detail but it’s important to know they exist.


InfluxDB

The first of these options is InfluxDB. InfluxDB is a time-series database that's written in Go. InfluxDB has some embryonic support for ingesting Graphite metrics and a new clustering system designed to make it more resilient. It has some integration with Graphite Web, Grafana, and similar web interfaces. InfluxDB is a relatively new tool—all its features may not be available, and it may not be production ready yet.

NOTE There is also a Riemann-to-InfluxDB plugin available.

Cyanite

Another option is to store metrics in a Cassandra database using an integration called Cyanite. Cyanite is written in Clojure and exposes a Carbon daemon that allows you to send metrics to a Cassandra cluster. Given the complexity of Cassandra, however, it's not a drop-in replacement.

Summary

In this chapter we learned how to install, configure, and run Graphite and Grafana. We gathered events in Riemann, and we’ve sent those events in the form of metrics from Riemann to Graphite. We also created our first Grafana dashboard and added our first graph to that dashboard.

We configured NTP to manage the time on our Riemann and Graphite hosts. With NTP, we can ensure the time stamps of our events are consistent between Riemann and Graphite. To help with this we also configured all the hosts' time zones to be UTC time.

In the next chapters we're going to add more components to our monitoring framework: host-level metrics, logging, and better notifications. We're then going to see how we can use that framework to start monitoring a series of components, for both hosts and applications. We're going to send the metrics and events from the components we're monitoring to Riemann, store them in Graphite, and graph them in Grafana. In the next chapter we'll start with collecting and graphing host-based metrics.

Chapter 5

Host monitoring

In the last two chapters we installed and configured Riemann and Graphite. We got an introduction to Riemann and how it manages and indexes events. We also saw how to integrate Riemann servers in multiple environments. Finally, we saw how to send events from Riemann to Graphite and then graph them in Grafana.

In this chapter we're going to add another building block to our monitoring framework by collecting host-based data and sending it to Riemann. Host-based monitoring provides the basic information about our hosts and their performance. We can then collect and combine this data with the application data that we'll learn to collect in Chapter 9.


Figure 5.1: Collecting host-based data

NOTE We’ll also look at container monitoring, focusing on Docker containers in Chapter 7.


We’ll aim to implement host monitoring that:

• Is a scalable and high-performing solution. Our collection needs to be lightweight and not interfere with the actual running of our hosts and applications (refer to the observer effect we talked about in Chapter 2).
• Can ship data quickly and avoid queuing important information that we need.
• Has a flexible monitoring interface that can collect a wide variety of data "out of the box" but also allows us to collect custom data that is less common or unique to our environment.
• Accommodates our push-versus-pull architecture.

To satisfy these needs we’re going to look at a tool called collectd.

Introducing collectd

The collectd daemon, which acts as a monitoring collection agent, will do the local monitoring and send the data to Riemann. It will run locally on our hosts and selectively monitor and collect data from a variety of components.

We’ve chosen collectd because it is high performance and reliable. The collectd daemon has been around for about ten years, it’s written in C for performance, and is well tested and widely used. It’s also open source and licensed under a mix of the MIT and GPLv2 licenses.

WARNING This book assumes you’re running collectd 5.5 or later. Earlier versions may work with this configuration but some components may not behave exactly as described.


The daemon uses a modular design: a central core and an integrated plugin system. The core of collectd is small and provides basic filtering and routing for the collected data. The collection, storage, and transmission of data are handled by plugins that can be enabled individually.

For the collection of data, collectd uses plugins called read plugins. They can collect information like CPU performance, memory, or application-specific metrics. This data is then passed into collectd's core functions.

The data can then be filtered or routed to write plugins. The write plugins allow data to be stored locally—for example, written to files—or sent across the network to other services. In our case, they'll allow us to send that data to Riemann. There is a large collection of default plugins shipped with collectd, as well as a variety of community-contributed and developed plugins you can add to it. The collectd daemon also supports running plugins you have written yourself. This diagram shows our collectd architecture.


Figure 5.2: Our collectd architecture

Our applications and services are connected via collectd’s read plugins. These read plugins send events to a write plugin that will send events onto Riemann.

In this chapter we’re focusing on using collectd for host-based monitoring, but

Version: v1.0.3 (2c4a7d0) 212 Chapter 5: Host monitoring we’re also going to use it later in the book for collecting service and application data. Using collectd will allow us to run a single agent locally on our host and use it to send our data into our Riemann event router.

What host components should we monitor?

Before we jump into installing and configuring collectd, let's discuss what we might like to collect on our hosts. We're going to focus on collecting basic data that will show us the core performance of our hosts. We're going to configure a generic collection of monitoring data on all our hosts, and we'll add additional monitoring for specific use cases. For example, we would install our base monitoring on all hosts but add specific monitoring for a database or application server.

Our basic set of monitoring will include:

• CPU - Shows us how our host is running workloads.
• Memory - Shows us how much memory is being used and is available on our hosts.
• Load - The "system load": a rough overview of host utilization, defined as the number of runnable tasks in the run queue, averaged over one, five, and fifteen minutes.
• Swap - Shows us how much swap is being used and is available on our hosts.
• Processes - Monitors both specific processes and process counts, and identifies "zombie" processes.
• Disk - Shows us how much disk space is being used and is available on our filesystems.
• Network - Shows us the basic performance of our interfaces and networks, including errors and traffic.

This base set of data will allow us to identify host performance issues or provide sufficient supplemental data for fault diagnosis of application issues.


At this point some of you may be wondering, “But didn’t we say earlier that we should focus on application and business events and metrics?” We’re indeed going to focus on those application and business events and metrics, but they aren’t the full story. When we have a fault or need to further diagnose a performance issue we often need to drill down to more granular data. The data we’re gathering here will supplement our application and business events and metrics and allow us to diagnose and identify system-level issues that cause application problems.

NOTE Two topics we’re not covering directly in the book are monitoring of Microsoft Windows and monitoring of non-host devices: networking equipment, storage devices, data center equipment. We will discuss some options for this later in the chapter.

Installing collectd

We’ll install collectd on every host and configure it to collect our metrics and send them to Riemann. We’ll walk through the installation process on both Ubuntu and Red Hat family operating systems and provide you with appropriate configuration management resources to do the installation automatically.

NOTE It’s important to make a (slightly pedantic) distinction about polling versus pushing here. Technically a client like collectd is polling on the local host, but it is polling locally and then pushing the events to Riemann. This lo- cal polling scales differently from centrally scheduled and initiated polling—for example, from a monitoring server to many hosts and services.


Installing collectd on Ubuntu

We’ll install version 5.5 of collectd on Ubuntu from the collectd project’s own Landscape repositories and via the apt-get command.

Add the collectd repository configuration to your host:

Listing 5.1: Installing collectd on Ubuntu

$ sudo add-apt-repository ppa:collectd/collectd-5.5

Then update and install collectd.

Listing 5.2: Installing collectd on Ubuntu

$ sudo apt-get update
$ sudo apt-get -y install collectd

The collectd daemon and its associated dependencies will now be installed. We’ll test it is installed and working by running the collectd binary.

Listing 5.3: Testing collectd on Ubuntu

$ collectd -h
Usage: collectd [OPTIONS]

...

The -h flag will output collectd’s flags and command line help.


Installing collectd on Red Hat

We’ll install collectd on Red Hat using the EPEL repository. Let’s add the EPEL repository now.

Listing 5.4: Adding the EPEL repository for collectd

$ sudo yum install -y epel-release

Now we’ll use the yum command to install the collectd package. We’ll also install the write_riemann plugin, which we’ll use to send our events to Riemann, and the protobuf-c package it requires.

Listing 5.5: Installing collectd on Red Hat

$ sudo yum install collectd protobuf-c collectd-write_riemann

TIP On newer Red Hat and family versions the yum command has been replaced with the dnf command. The syntax is otherwise unchanged.

The collectd daemon and its associated dependencies will now be installed. We’ll test it is installed by running the collectd binary.


Listing 5.6: Testing collectd on Red Hat

$ collectd -h
Usage: collectd [OPTIONS]
...

The -h flag will output collectd’s flags and command line help.

Installing collectd via configuration management

There are a variety of options for installing collectd on your hosts via configuration management. You can find a Chef cookbook for collectd at:

• https://supermarket.chef.io/cookbooks?q=collectd.

You can find Puppet modules for collectd at:

• https://forge.puppetlabs.com/modules?sort=rank&q=collectd.

You can find an Ansible role for collectd at:

• https://galaxy.ansible.com/list#/roles/350.

You can also find Docker containers at:

• https://hub.docker.com/search?q=collectd.

And a Vagrant image at:


• https://github.com/httpdss/collectd-web-vagrant.

TIP We recommend using one of these methods to manage collectd on your hosts rather than configuring collectd manually.

Configuring collectd

Now that we have the collectd daemon installed we need to configure it. We’re going to enable some plugins to gather our data and then configure a plugin to send the events to Riemann. The collectd daemon is configured using a configuration file located at:

• /etc/collectd/collectd.conf on Ubuntu.
• /etc/collectd.conf on Red Hat.

The collectd package on both distributions installs a default configuration file. We're not going to use that file—instead, we're going to build our own configuration.

The collectd configuration is divided into three main concerns:

• Global settings for the daemon.
• Plugin loading.
• Plugin configuration.

Let’s look at our initial configuration file.


Listing 5.7: Our initial collectd configuration

# Global settings
Interval 2
CheckThresholds true
WriteQueueLimitHigh 5000
WriteQueueLimitLow 5000

# Plugin loading
LoadPlugin logfile
LoadPlugin threshold

# Plugin configuration
<Plugin logfile>
  LogLevel "info"
  File "/var/log/collectd.log"
  Timestamp true
</Plugin>

Include "/etc/collectd.d/*.conf"

TIP The collectd configuration files lend themselves to being turned into configuration management templates. Indeed, many of the collectd modules for various configuration management tools include templating for collectd configuration.

First, we need to set some global configuration options in our collectd configuration. The first option, Interval, sets the collectd daemon's check interval. This is the resolution at which collectd collects data. It defaults to 10 seconds. We're going to make our resolution more granular and move it to two-second intervals. This should either match or be slightly larger than the lowest retention period set in Graphite to ensure the collection periods sync up correctly. Remember that in Chapter 4 we installed Graphite and configured its retentions in the /etc/carbon/storage-schemas.conf configuration file.

TIP We could set the Interval to one second but Riemann’s one second precision sometimes means duplicate metrics are generated when Riemann rounds the time to the nearest second.

Listing 5.8: The Graphite storage schema

[default]
pattern = .*
retentions = 1s:24h, 10s:7d, 1m:30d, 10m:2y

You can see our shortest resolution in the storage schema is one second. Our collectd Interval should be the same as, or longer than, that resolution.

WARNING As we discussed in Chapter 4, if you change the resolution of your collection, then it will be difficult to compare data collected during periods with the previous resolution.

The second global option, CheckThresholds, can be set to true or false, and controls how state is set for collected data. The collectd daemon can check collected data for thresholds, marking data with metrics that exceed those thresholds as being in a warning or failed state. We're not going to set any specific thresholds right now, but we're turning on the functionality because it creates a default state—shown in the :state field in a Riemann event—of ok that we can later use in Riemann.

The WriteQueueLimitHigh and WriteQueueLimitLow options control the queue size of write plugins. This protects us from memory issues if a write plugin is slow, for example if the network times out. Each option takes a number: the number of metrics in the queue. If there are more metrics than the value of the WriteQueueLimitHigh option then any new metrics will be dropped. If there are fewer than WriteQueueLimitLow metrics in the queue, all new metrics will be enqueued. If the number of metrics currently in the queue is between the two thresholds the metric is potentially dropped, with a probability that is proportional to the number of metrics in the queue. This is a bit random for us, so we set both options to 5000. This sets an absolute threshold: if more than 5000 metrics are in the queue then incoming metrics will be dropped. This protects the memory on the host running collectd.

Next, we'll enable some plugins for collectd. Use the LoadPlugin command to load each plugin.


Listing 5.9: Loading collectd plugins

LoadPlugin logfile

<Plugin logfile>
  LogLevel "info"
  File "/var/log/collectd.log"
  Timestamp true
</Plugin>

LoadPlugin threshold

The first plugin, logfile, tells collectd to log its output to a log file. We've configured the plugin using a <Plugin> block. Each <Plugin> block specifies the plugin it is providing configuration for, here <Plugin logfile>. It also contains a series of configuration options for that plugin. In this case, we've specified the LogLevel option, which tells collectd how verbose to make our logging. We've chosen a middle ground of info, which logs some operational and error information but skips a lot of debugging output. We've also specified the File option to tell collectd where to log this information, choosing /var/log/collectd.log.

TIP We can also log in Logstash’s JSON format using the log_logstash plugin. See Chapter 8 for more information on logging and Logstash.

Lastly, we’ve set the Timestamp option to true which adds a timestamp to any log output collectd generates.


TIP We’ve specified the logfile plugin and its configuration first in the collectd.conf file so that if something goes wrong with collectd we’re likelyto catch it in the log.

We’ll also need to load the threshold plugin. The threshold plugin is linked to the CheckThresholds setting we configured in the global options above. This plugin enables collectd’s threshold checking logic. We’re going to use it to help identify when things go wrong on our host. Finally, we’ll set one last global configuration option, Include "/etc/collectd .d/*.conf. The Include option specifies a directory to hold additional collectd configuration. Any file ending in .conf will be added to our collectd configuration. We’re going to use this capability to make our configuration easier to manage. With snippets included via the Include directory, we can easily manage collectd with configuration management tools like Puppet, Chef, or Ansible. You’ll note that we’ve put the Include option at the end of the configuration file. This is because collectd loads configuration in a top down order. We want our included files to be done last. Let’s create the directory for the files we wish to include now (it already exists on some distributions).

Listing 5.10: Creating the /etc/collectd.d directory

$ sudo mkdir /etc/collectd.d

TIP You can find a list of other global configuration options on the collectd wiki.


Loading and configuring collectd plugins for monitoring

We’re now going to load and configure a series of plugins to collect data. We’re going to make use of the following plugins:

• cpu - The CPU plugin collects the amount of time spent by the CPU in various states.
• df - The DF plugin collects filesystem usage information, so named because it returns similar data to the df command.
• load - The Load plugin collects the system load.
• interface - The Interface plugin collects network interface statistics.
• protocols - The Protocols plugin records information about network protocol performance and data on the host, including TCP, UDP, and ICMP traffic.
• memory - The Memory plugin collects physical memory utilization.
• processes - The Processes plugin collects the number of processes, grouped by state: running, sleeping, zombie, etc.
• swap - The Swap plugin collects the amount of memory currently written to swap.

These plugins provide data for the basic state of most Linux hosts. We're going to configure each plugin in a separate file and store them in the /etc/collectd.d/ directory we've specified as the value of the Include option. This separation allows us to individually manage each plugin and lends itself to management with a configuration management tool like Puppet, Chef, or Ansible.
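By the time we've finished this chapter, the /etc/collectd.d/ directory will contain one file per plugin, something like this (shown as an illustration; the exact set depends on which plugins you enable):

$ ls /etc/collectd.d/
cpu.conf  df.conf  interface.conf  load.conf  memory.conf
processes.conf  protocols.conf  swap.conf  write_riemann.conf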

Now let’s configure each plugin.


The cpu plugin

The first plugin we’re going to configure isthe cpu plugin. The cpu plugin collects CPU performance metrics on our hosts. By default, the cpu plugin emits CPU metrics in Jiffies: the number of ticks since the host booted. We’re going to also send something a bit more useful: percentages. First, we’re going to create a file to hold our plugin configuration. We’ll put itinto the /etc/collectd.d directory. As a result of the Include option in the collectd .conf configuration file, collectd will load this file automatically.

Listing 5.11: Creating the /etc/collectd.d/cpu.conf file

$ sudo touch /etc/collectd.d/cpu.conf

Let’s now populate this file with our configuration.

Listing 5.12: Configuring the cpu plugin

LoadPlugin cpu

<Plugin cpu>
  ValuesPercentage true
  ReportByCpu false
</Plugin>

We load the plugin using the LoadPlugin option. We then specify a <Plugin cpu> block to configure our plugin. We've specified two options. The first, ValuesPercentage, set to true, will tell collectd to also send all CPU metrics as percentages. The second option, ReportByCpu, set to false, will aggregate all CPU cores on the host into a single metric. We like to do this for simplicity's sake. If you'd prefer your host to report CPU performance per core, you can change this to true.
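For example, a per-core variant of the cpu plugin configuration would look something like this:

LoadPlugin cpu

<Plugin cpu>
  ValuesPercentage true
  ReportByCpu true
</Plugin>

Keep in mind that per-core reporting multiplies the number of CPU metrics by the number of cores on the host.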

The memory plugin

Next let’s configure the memory plugin. This plugin collects information on how much RAM is free and used on our host. By default the memory plugin returns metrics in bytes used or free. Often this isn’t useful because we don’t know how much memory a specific host might have. If we return the value as a percentage instead, we can more easily use this metric to determine if we need to take some action on our host. So we’re going to do the same percentage conversion for our memory metrics. Let’s start by creating a file to hold the configuration for the memory plugin.

Listing 5.13: Creating the /etc/collectd.d/memory.conf file

$ sudo touch /etc/collectd.d/memory.conf

Now let’s populate this file with our configuration.

Listing 5.14: Configuring the Memory plugin

LoadPlugin memory

<Plugin memory>
  ValuesPercentage true
</Plugin>

We first load the plugin and then add a <Plugin memory> block to configure it. We set the ValuesPercentage option to true to make the memory plugin also emit metrics as percentages.


NOTE It’ll continue to emit metrics in bytes, too, in a separate metric so we can still use these if required.

The df plugin

The df plugin collects disk space metrics, including space used, on mount points and devices. Like the memory plugin, it outputs in bytes by default—so let's configure it to emit metrics in percentages, instead. That will make it easier to determine if a mount or device has a disk space issue. Let's start by creating a configuration file.

Listing 5.15: Creating the /etc/collectd.d/df.conf file

$ sudo touch /etc/collectd.d/df.conf

And now we’ll populate this file with our configuration.

Listing 5.16: Configuring the df plugin

LoadPlugin df

<Plugin df>
  MountPoint "/"
  ValuesPercentage true
</Plugin>

We first load the plugin and then configure it. The ValuesPercentage option tells the df plugin to also emit metrics with percentage values. We've also specified a mount point we'd like to monitor via the MountPoint option. This option allows us to specify which mount points to collect disk space metrics on. We've only specified the / ("root") mount point. If we wanted or needed to, we could add additional mount points to the configuration now. To monitor a mount point called /data we would add:

Listing 5.17: Configuring another mount point for the df plugin

LoadPlugin df

<Plugin df>
  MountPoint "/"
  MountPoint "/data"
  ValuesPercentage true
</Plugin>

We can also specify devices using the Dev option.

Listing 5.18: Configuring the df plugin to monitor a device

LoadPlugin df

<Plugin df>
  MountPoint "/"
  Dev "/dev/hda1"
  ValuesPercentage true
</Plugin>

Or we can monitor all filesystems and mount points on the host.


Listing 5.19: Configuring the df plugin to monitor everything

<Plugin df>
  IgnoreSelected true
  ValuesPercentage true
</Plugin>

The IgnoreSelected option, when set to true, tells the df plugin to ignore all configured mount points or devices and monitor every mount point and device.

The swap plugin

Let’s now configure the swap plugin. This plugin collects a variety of metrics on the state of swap on your hosts. Like other plugins we’ve seen that it returns metric values measuring swap used in bytes. We again want to make those easier to consume by emitting them as percentage values. Let’s first create a file to hold its configuration.

Listing 5.20: Creating the /etc/collectd.d/swap.conf file

$ sudo touch /etc/collectd.d/swap.conf

Now we’ll populate this file with our configuration.


Listing 5.21: Configuring the swap plugin

LoadPlugin swap

<Plugin swap>
  ValuesPercentage true
</Plugin>

We’ve only specified one option, ValuesPercentage, and set it to true. This will cause the swap plugin to emit metrics using percentage values.

The interface plugin

Now let’s configure our interface plugin. The interface plugin collects data on our network interfaces and their performance.

Listing 5.22: Creating the /etc/collectd.d/interface.conf file

$ sudo touch /etc/collectd.d/interface.conf

We’re not going to add any configuration to the interface plugin by default so we’re just going to load it.

Listing 5.23: The interface.conf file

LoadPlugin interface

Without configuration, the interface plugin collects metrics on all interfaces on the host. If we wanted to limit this to one or more interfaces instead, that can be specified via configuration using the Interface option.

Listing 5.24: Configuring the interface plugin to only monitor one interface

<Plugin interface>
  Interface "eth0"
</Plugin>

Other times we might want to exclude specific interfaces—for example, the loopback interface—from our monitoring.

Listing 5.25: Ignoring interfaces

<Plugin interface>
  Interface "lo"
  IgnoreSelected true
</Plugin>

Here we’ve specified the loopback, lo, interface with the Interface option. We’ve added the IgnoreSelected option and set it to true. This will tell the interface plugin to monitor all interfaces except the lo interface.

The protocols plugin

Also collecting data about our host's network performance is the protocols plugin. The protocols plugin collects data on the network protocols running on the host and their performance.


Listing 5.26: Creating the /etc/collectd.d/protocols.conf file

$ sudo touch /etc/collectd.d/protocols.conf

We’re not going to add any configuration to the protocols plugin by default so we’re just going to load it.

Listing 5.27: The protocols.conf file

LoadPlugin protocols

The load plugin

We’re not going to add any configuration for the next plugin, load. We’re just going to load it. This plugin collects load metrics on our host.

Let’s create a file for the load plugin.

Listing 5.28: Creating the /etc/collectd.d/load.conf file

$ sudo touch /etc/collectd.d/load.conf

And configure it to load the load plugin.


Listing 5.29: The load.conf file

LoadPlugin load

The processes plugin

Lastly, we’re going to configure the processes plugin. The processes plugin monitors both processes broadly on the host—for example, the number of active and zombie processes—but it can also be focused on individual processes to get more details about them. We’re going to create a file to hold our processes plugin configuration.

Listing 5.30: Creating the processes.conf configuration file

$ sudo vi /etc/collectd.d/processes.conf

We’re then going to specifically monitor the collectd process itself and set a default threshold for any monitored processes. The processes plugin provides insight into the performance of specific processes. We’ll also use it to make sure specific processes are running—for example, ensuring collectd is running.


Listing 5.31: The processes.conf file

LoadPlugin processes

<Plugin processes>
  Process "collectd"
</Plugin>

<Plugin threshold>
  <Plugin processes>
    <Type ps_count>
      DataSource "processes"
      FailureMin 1
    </Type>
  </Plugin>
</Plugin>

In some cases you might want to monitor a process (or any other plugin) at a longer interval than the global Interval of 2 seconds we set earlier. You can override the global Interval by setting a new interval when loading the plugin. To do this we convert the LoadPlugin directive into a <LoadPlugin> block like so:

Listing 5.32: Overriding the global interval

<LoadPlugin processes>
  Interval 10
</LoadPlugin>


TIP You can see the other options you can configure per-plugin in the LoadPlugin configuration section in the collectd configuration documentation.

Here, inside our new <LoadPlugin processes> block, we've specified a new Interval directive with a collection period of 10 seconds. We continue by configuring the processes plugin inside the <Plugin processes> block. It can take two configuration options: Process and ProcessMatch. The Process option matches a specific process by name, for example the collectd process as we've defined here. The ProcessMatch option matches one or more processes via a regular expression. Each regular expression is tied to a label, like so:

Listing 5.33: The ProcessMatch option

ProcessMatch label regex

Let’s take a look at ProcessMatch in action.

Listing 5.34: Using the ProcessMatch option

<Plugin processes>
  ProcessMatch "carbon-cache" "python.+carbon-cache"
  ProcessMatch "carbon-relay" "python.+carbon-relay"
</Plugin>

This example would match all of the carbon-cache and carbon-relay processes we configured in Chapter 4 under labels. For example, the regular expression python.+carbon-cache would match all Carbon Cache daemons running on a host and group them under the label carbon-cache.


In another example, if we wished to monitor the Riemann server running on our riemanna, riemannb, and riemannmc hosts, we’d configure the following ProcessMatch regular expression.

Listing 5.35: Using the ProcessMatch option for Riemann

ProcessMatch "riemann" "\briemann.jar:\sriemann.bin\b"

If we run a ps command on one of the Riemann hosts to check if monitoring for \briemann.jar:\sriemann.bin\b will find the Riemann process we’ll see:

Listing 5.36: Checking for Riemann in the ps command

$ ps aux | grep '\briemann.jar:\sriemann.bin\b'
riemann 14998 33.8 28.4 6854420 2329212 ? Sl Apr21 5330:50 java -Xmx4096m -Xms4096m -XX:NewRatio=1 -XX:PermSize=128m -XX:MaxPermSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow -cp /usr/share/riemann/riemann.jar:riemann.bin start /etc/riemann/riemann.config

We see here that our grep matches \briemann.jar:\sriemann.bin\b, hence collectd will be able to monitor the Riemann server process.

The output from the processes plugin for specific processes provides us with detailed information on:


• Resident Set Size (RSS), or the amount of physical memory used by the process.
• CPU user and system time consumed by the processes.
• The number of threads used.
• The number of major and minor page faults.
• The number of processes by that name that are running.

Now let’s look at how we’re going to configure some process thresholds using some of this data.

The processes plugin and thresholds

The last metric, the number of processes by that name, is useful for determining whether the process itself has failed. We’re going to use it to monitor processes to determine if they are running. This brings us to the second piece of configuration in our processes.conf file.

Listing 5.37: The processes threshold

...
<Plugin threshold>
  <Plugin processes>
    <Type ps_count>
      DataSource "processes"
      FailureMin 1
    </Type>
  </Plugin>
</Plugin>

This new <Plugin threshold> block configures the threshold plugin we loaded as part of our global configuration. You can specify one or more thresholds for every plugin you use. If a threshold is breached then collectd will generate failure events—for example, if we were monitoring the RSyslog daemon and collectd stopped detecting the process, we'd receive a notification much like:

Listing 5.38: A collectd notification

[2016-01-07 04:51:15] Notification: severity = FAILURE, host = graphitea, plugin = processes, plugin_instance = rsyslogd, type = ps_count, message = Host graphitea, plugin processes (instance rsyslogd) type ps_count: Data source "processes" is currently 0.000000. That is below the failure threshold of 1.000000.

The plugin we’re going to use for sending events to Riemann is also aware of these thresholds and will send them onto Riemann where we can use them to detect services that have failed.

To configure our threshold, we must first define which collectd plugin the threshold will apply to using a new <Plugin> block inside the <Plugin threshold> block. In this case we're defining thresholds for the processes plugin. We then need to tell the threshold plugin which metric generated by this plugin to use for our threshold. For the processes plugin this metric is called ps_count, which is the number of processes running under a specific name. We define the metric to be used with the <Type ps_count> block.

Inside the <Type> block we specify two options: DataSource and FailureMin. The DataSource option references the type of data we're collecting. Each collectd metric can be made up of multiple types of data. To keep track of these data types, the collectd daemon has a central registry of all data known to it, called the type database. It's located in a database file called types.db, which was created when we installed collectd. On most distributions it's in the /usr/share/collectd/ directory. If we peek inside the types.db file to find the ps_count metric we will see:

Listing 5.39: The ps_count metric

ps_count processes:GAUGE:0:1000000, threads:GAUGE:0:1000000

Note that the ps_count metric is made up of two data sources, both gauges (see Chapter 2 for information on what a gauge is):

• processes - which can range from 0 to 1000000.
• threads - which can range from 0 to 1000000.

This means ps_count can be used to either track the number of processes or the number of threads. In our case we care about how many processes are running. The second option, FailureMin, is the threshold itself. The FailureMin option specifies the minimum number of processes required before triggering a failure event. We've specified 1. So, if there is 1 process running, collectd will be happy. If the process count drops below 1, then a failure event will be generated and sent on to Riemann.

The threshold plugin supports several types of threshold:

• FailureMin - Generates a failure event if the metric falls below the minimum.
• WarningMin - Generates a warning event if the metric falls below the minimum.
• FailureMax - Fails if the metric exceeds the maximum.
• WarningMax - Warns if the metric exceeds the maximum.

If you combine failure and warning thresholds you can create dual-tier event notifications, for example:


Listing 5.40: Specifying a warning and a failure

FailureMin 1
WarningMin 3

Here, if the metric drops below 3 then a warning event is triggered. If it drops below 1 then a failure event is triggered. We use this to vary the response to a threshold breach—for example, a warning might not merit immediate action but the failure would. If the threshold breach is corrected then collectd will generate a “things are okay” event, letting us know that things are back to normal. This will also be sent to Riemann and we could use it to automatically resolve an incident or generate a notification.
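The recovery notification takes much the same shape as the failure notifications we've seen, but with a severity of OKAY. The exact message text varies between collectd versions, so treat this as an approximation:

[2016-01-07 05:02:20] Notification: severity = OKAY, host = graphitea, plugin = processes, plugin_instance = rsyslogd, type = ps_count, message = Host graphitea, plugin processes (instance rsyslogd) type ps_count: All data sources are within range again.

Matching on that severity in Riemann is one way to resolve an incident automatically.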

The threshold we’ve just defined is global for the processes plugin. This means any process we watch with the processes plugin will be expected to have a minimum of one running process. Currently, we’re only managing one process, collectd; if it drops to zero, it likely means collectd has failed. But if we were to define other processes to watch, which we’ll do in subsequent chapters, then collectd would trigger events for these if the number of processes was to drop below 1.

But sometimes normal operation for a service will be to have more than one running process. To address this we customize thresholds for specific processes. Let's say normal operation for a specific process meant two or more processes should be running, for example:


Listing 5.41: Matching multiple thresholds

<Plugin processes>
  ProcessMatch "carbon-cache" "python.+carbon-cache"
  ProcessMatch "carbon-relay" "python.+carbon-relay"
</Plugin>

<Plugin threshold>
  <Plugin processes>
    <Type ps_count>
      Instance "carbon-cache"
      DataSource "processes"
      WarningMin 2
      FailureMin 1
    </Type>
    <Type ps_count>
      Instance "carbon-relay"
      DataSource "processes"
      FailureMin 1
    </Type>
  </Plugin>
</Plugin>

NOTE In the source code for the book we've included collectd process monitoring configuration for Graphite and Riemann in the code for those chapters.

Here we have our carbon-cache and carbon-relay process matching configuration. We know that there should be two carbon-cache processes, and we configure the threshold plugin to check that. We've added a new directive, Instance, which we've set to the ProcessMatch label, carbon-cache. The Instance directive tells the threshold plugin that this configuration is specifically for monitoring our carbon-cache processes. We've also set both the WarningMin and FailureMin thresholds. If the number of carbon-cache processes drops below 2 then a warning event will trigger. If the number drops below 1 then a failure event will trigger.

TIP There are a number of other useful configuration items we can set for the threshold plugin. You can read about them on the collectd wiki's threshold configuration page.

If we were to stop the carbon-cache daemon now on one of our Graphite hosts, we’d see the following event in the /var/log/collectd.log log file.

Listing 5.42: The carbon-cache collect failure event

[2015-12-25 00:09:08] Notification: severity = FAILURE, host = graphitea, plugin = processes, plugin_instance = carbon-cache, type = ps_count, message = Host graphitea, plugin processes (instance carbon-cache) type ps_count: Data source "processes" is currently 0.000000. That is below the failure threshold of 2.000000.

We could then use this event in Riemann to trigger a notification and let us know something was wrong. We'll see how to do that later in this chapter. First though, we need our metrics to be sent on to our last plugin, write_riemann, which does the actual sending of data to Riemann.

NOTE We’ve also added a similar configuration for our carbon-relay process.

Writing to Riemann

Next, we need to configure the write_riemann plugin. This is the plugin that will write our metrics to Riemann. We're going to create our write_riemann configuration in a file in the /etc/collectd.d/ directory.

First, let’s create a file to hold the plugin’s configuration.

Listing 5.43: The write_riemann.conf configuration file

$ sudo touch /etc/collectd.d/write_riemann.conf

Now let’s load and configure the plugin.


Listing 5.44: Configuring the write_riemann plugin

LoadPlugin write_riemann

<Plugin write_riemann>
  <Node "riemanna">
    Host "riemanna.example.com"
    Port "5555"
    Protocol TCP
    CheckThresholds true
    StoreRates false
    TTLFactor 30.0
  </Node>
  Tag "collectd"
</Plugin>

We first use the LoadPlugin directive to load the write_riemann plugin. Then, inside the <Plugin write_riemann> block, we specify <Node> blocks. Each <Node> block has a name and specifies a connection to a Riemann instance. We've called our only node riemanna. Each node must have a unique name.

Each connection is configured with the Host, Port, and Protocol options. In this case we're installing on a host in the Production A environment, and we're connecting to the corresponding Riemann server in that environment: riemanna.example.com. We're using the default port of 5555 and the TCP protocol. To ensure your collectd events reach Riemann you'll need to ensure that TCP port 5555 is open between your hosts and the Riemann server. We're using TCP rather than UDP to provide a stronger guarantee of delivery.

NOTE Don’t use UDP to send events to Riemann. UDP has no guarantee of

Version: v1.0.3 (2c4a7d0) 244 Chapter 5: Host monitoring delivery, ordering, or duplicate protection. You will lose events and data.

The CheckThresholds option configures the plugin to honor any thresholds we might set using the threshold plugin we enabled earlier. Setting the StoreRates option to false configures the plugin to not convert any counters to rates. The default, true, turns any counters into rates rather than incremental integers. As we're going to use counters a lot (see especially Chapter 9 on application metrics) we want to use them rather than rates. The last option in the Node block is TTLFactor. This contributes to the default TTL set on events sent to Riemann. It's a factor in the TTL calculation:

Interval × TTLFactor = Event TTL

This takes the Interval we set earlier, in our case 2, and the TTLFactor and multiplies them, producing the value of the :ttl field in the Riemann event that will get set when events are sent to Riemann. Our calculation would look like:

2 × 30.0 = 60

This sets the value of the :ttl field to 60 seconds, matching the default TTL we set using the default function in Chapter 3. This means collectd events will have a time to live of 60 seconds if we choose to index them in Riemann.
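
Expressed as a throwaway Clojure snippet, purely for illustration, the calculation is just a multiplication:

(defn event-ttl
  "Illustrative only: the :ttl of a collectd-sourced Riemann event is
  the collectd Interval multiplied by the write_riemann TTLFactor."
  [interval ttl-factor]
  (* interval ttl-factor))

(event-ttl 2 30.0)
;; => 60.0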

If we want to send metrics to two Riemann instances we’ll add a second Node block, like so:


Listing 5.45: Configuring a second Node block

<Plugin write_riemann>
  <Node "riemanna">
    Host "riemanna.example.com"
    Port "5555"
    Protocol TCP
    CheckThresholds true
    TTLFactor 30.0
  </Node>
  <Node "riemannb">
    Host "riemannb.example.com"
    Port "5555"
    Protocol TCP
    CheckThresholds true
    TTLFactor 30.0
  </Node>
  Tag "collectd"
</Plugin>

You can see we've added a second <Node> block for the riemannb.example.com server. This potentially allows us to send data to multiple Riemann hosts, creating some redundancy in delivering events. This is also how we could handle sharding and partitioning if we needed to scale our Riemann deployment. Hosts can be divided into classes and categories and directed at specific Riemann servers. For example, we could create servers like riemanna1 or riemanna2, etc. With a configuration management tool it is easy to configure collectd (or other collectors) to use a specific Riemann server.

Lastly, we've also specified the Tag option. The Tag option adds a string as a tag to our Riemann events, populating the :tags field of an event with that string. You can specify multiple tags by specifying multiple Tag options.

NOTE You can find the full write_riemann documentation at the collectd wiki.

Finishing up

Now that we have a basic collectd configuration we can proceed to add that configuration to all of our hosts. The best way to do this—indeed, the best way to install and manage collectd overall—is to use a configuration management tool like Puppet, Chef, or Ansible. The collectd configuration file and the per-plugin configuration files we've just created lend themselves to management with most configuration management tools' template engines. This makes it centrally managed and easier to update across many hosts.

Enabling and running collectd

Now that we have configured the collectd daemon we enable it to run at boot and start it. On Ubuntu we enable collectd and start it like so:

Listing 5.46: Enabling and starting collectd on Ubuntu

$ sudo update-rc.d collectd defaults
$ sudo service collectd start

On Red Hat we enable collectd and start it like so:


Listing 5.47: Enabling and starting collectd on Red Hat

$ sudo systemctl enable collectd
$ sudo service collectd start

We can then check the /var/log/collectd.log log file to see if collectd is running.

Listing 5.48: The collectd log file

[2015-08-18 20:46:01] Initialization complete, entering read-loop.

The collectd events

Once the collectd daemon is configured and running, events should start streaming into Riemann. Let's log in to riemanna.example.com and look in the /var/log/riemann/riemann.log log file to see what some of those events look like. Remember, our Riemann instance is currently configured to output all events to a log file:


Listing 5.49: Our current riemann.config

(streams
  (default :ttl 60
    ; Index all events immediately.
    index

    ; Send all events to the log file.
    #(info %)

    (where (service #"^riemann.*")
      graph

      downstream))))

The #(info %) function will send all incoming events to the log file. Let’s look in the log file now to see events generated by collectd from a selection of plugins.


Listing 5.50: The collectd events in Riemann

{:host tornado-web1, :service df-root/df_complex-free, :state ok, :description nil, :metric 2.7197804544E10, :tags [collectd], :time 1425240301, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance free, :type df_complex, :plugin_instance root, :plugin df}
{:host tornado-web1, :service load/load/shortterm, :state ok, :description nil, :metric 0.05, :tags [collectd], :time 1425240301, :ttl 60.0, :ds_index 0, :ds_name shortterm, :ds_type gauge, :type load, :plugin load}
{:host tornado-web1, :service memory/memory-used, :state ok, :description nil, :metric 4.5678592E7, :tags [collectd], :time 1425240301, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance used, :type memory, :plugin memory}

Let’s look at some of the fields of the events that we’re going to work within Riemann.

Field           Description
host            The hostname, e.g. tornado-web1.
service         The service or metric.
state           A string describing state, e.g. ok, warning, critical.
time            The time of the event.
tags            Tags from the Tag fields in the write_riemann plugin.
metric          The value of the collectd metric.
ttl             Calculated from the Interval and TTLFactor.
type            The type of data. For example, CPU data.
type_instance   A sub-type of data. For example, idle CPU time.
plugin          The collectd plugin that generated the event.
ds_name         The DataSource name from the types.db file.
ds_type         The type of DataSource. For example, a gauge.

The collectd daemon has constructed events that contain a :host field with the host name we're collecting from. In our case all three events are from the tornado-web1 host. We also have a :service field containing the name of each metric collectd is sending. Importantly, when we send metrics to Graphite, it will construct the metric name from a combination of these two fields. So the combination of :host tornado-web1 and :service memory/memory-used becomes tornado-web1.memory.memory-used in Graphite. Or :host tornado-web1 and :service load/load/shortterm becomes tornado-web1.load.load.shortterm.

Our collectd events also have a :state field. The value of this field is controlled by the threshold checking we enabled with the threshold plugin above. If no specific threshold has been set or triggered for a collectd metric then the :state field will default to a value of ok.

Also familiar from Chapter 3 should be the :ttl field, which sets the time to live for Riemann events in the index. Here the :ttl is 60 seconds, controlled by multiplying our Interval and TTLFactor values as we discovered when we configured collectd.

The :description field is empty, but we get some descriptive data from the :type, :type_instance, and :plugin fields, which tell us what type of metric it is, the more granular type instance, and the name of the plugin that collected it, respectively. For example, one of our events is generated from the memory plugin with a :type of memory and a :type_instance of used. Combined, these provide a description of what data we're specifically collecting.

Also available is the :ds_name field, which is a short name or title for the data source. Next is the :ds_type field, which specifies the data source type. This tells

us what type of data this source is—in the case of all these events, a type of gauge. The collectd daemon has four data types:

• Gauges — A value that is simply stored. This is generally for values that can increase and decrease.
• Derive — These values assume any change in value is interesting, the derivative.
• Counter — Like the derive data source type, but if a new value is smaller than the previous value, collectd assumes the counter has rolled.
• Absolute — This is intended for counters that are reset upon reading. We won't see any absolute data sources in this chapter.

TIP You can find more details on collectd data sources on the collectd wiki, and you can find the full list of default data types in the types.db file in the /usr/share/collectd/ directory.

Lastly, the :metric and the :time fields hold the actual value of the metric and the time it was collected, respectively.

Sending our collectd events to Graphite

Now that we’ve got events flowing from collectd to Riemann, we want tosend them the step further and onto Graphite so we can graph them. We’re going to use a tagged stream to select the collectd events. The tagged stream selects all events with a specific tag. Our collectd events have acquired atag, collectd, via the Tag directive in the write_riemann configuration. We can match on this tag and then send the events to Graphite using the graph var we created in Chapter 4.


Let’s see our new streams.

Listing 5.51: Sending collectd metrics to Graphite

...

(tagged "collectd" graph)

...

Our tagged stream grabs all events tagged with collectd. We then added the graph var we created in Chapter 4 so that all these events will be sent on to Graphite.

Refactoring the collectd metric names

Now let’s see what the metrics arriving at Graphite look like? If we look at the /var/log/carbon/creates.log log file on our graphitea.example.com host we should see our metrics being updated. Most of them look much like this:


Listing 5.52: The collectd metrics in Graphite

...
productiona.hosts.tornado-web1.interface-eth0/if_packets/rx
productiona.hosts.tornado-web1.interface-eth0/if_errors/rx
productiona.hosts.tornado-web1.cpu-percent/cpu-steal
productiona.hosts.tornado-web1.df-root/df_complex-used
productiona.hosts.tornado-web1.interface-lo/if_packets/rx
productiona.hosts.tornado-web1.processes/ps_state-paging
productiona.hosts.tornado-web1.processes/ps_state-zombies
productiona.hosts.tornado-api1.df-dev/df_complex-used
productiona.hosts.tornado-api1.df-run-shm/df_complex-reserved
productiona.hosts.tornado-api1.df-dev/df_complex-reserved
productiona.hosts.tornado-api1.interface-eth0/if_octets/rx
productiona.hosts.tornado-api1.df-run-lock/df_complex-reserved
productiona.hosts.tornado-api1.interface-lo/if_octets/rx
productiona.hosts.tornado-api1.df-run-lock/df_complex-free
productiona.hosts.tornado-api1.df-run/df_complex-reserved
productiona.hosts.tornado-api1.processes/ps_state-blocked
...

We see that our productiona.hosts. prefix has been prepended to the metric name thanks to the configuration we added in Chapter 4. We also see the name of the host from which the metrics are being collected, here tornado-web1 and tornado-api1. Each metric name is a varying combination of the collectd plugin data structure: plugin name, plugin instance, and type. For example:


Listing 5.53: A processes metric

productiona.hosts.tornado-api1.processes/ps_state-blocked

This metric combines the processes plugin name; the process state, ps_state; and the type, blocked. You can see that a number of the metrics have fairly complex naming patterns. To make it easier to use them we’re going to try to simplify that a little. This is entirely optional but it is something we do to make building graphs a bit faster and easier. If you don’t see the need for any refactoring you can skip to the next section. There are several places we could adjust the names of our metrics:

• We can use the Chains construct in the collectd daemon.
• We can use rewrite rules inside Carbon via the Carbon aggregation daemon.
• We can rewrite the metric events inside Riemann.

We’re going to use the last option, Riemann, because it’s a nice central collection point for metrics. We’re going to make use of a neat bit of code written by Pierre- Yves Ritschard, a well-known member of the collectd and Riemann communities. The code takes the incoming collectd metrics and rewrites their :service fields to make them easier to understand. Let’s first create a file to hold our new code on our Riemann server.

Listing 5.54: Creating the service rewrite rules

$ sudo touch /etc/riemann/examplecom/etc/collectd.clj


Here we’ve created /etc/riemann/examplecom/etc/collectd.clj. Let’s populate this file with our borrowed code.


Listing 5.55: Rewriting services inside Riemann

(ns examplecom.etc.collectd
  (:require [clojure.tools.logging :refer :all]
            [riemann.streams :refer :all]
            [clojure.string :as str]))

(def default-services
  [{:service #"^load/load/(.*)$" :rewrite "load $1"}
   {:service #"^swap/percent-(.*)$" :rewrite "swap $1"}
   {:service #"^memory/percent-(.*)$" :rewrite "memory $1"}
   {:service #"^processes/ps_state-(.*)$" :rewrite "processes $1"}
   {:service #"^cpu/percent-(.*)$" :rewrite "cpu $1"}
   {:service #"^df-(.*)/(df_complex|percent_bytes)-(.*)$" :rewrite "df $1 $2 $3"}
   {:service #"^interface-(.*)/if_(errors|packets|octets)/(tx|rx)$" :rewrite "nic $1 $3 $2"}])

(defn rewrite-service-with [rules]
  (let [matcher (fn [s1 s2] (if (string? s1) (= s1 s2) (re-find s1 s2)))]
    (fn [{:keys [service] :as event}]
      (or
        (first
          (for [{:keys [rewrite] :as rule} rules
                :when (matcher (:service rule) service)]
            (assoc event :service
                   (if (string? (:service rule))
                     rewrite
                     (str/replace service (:service rule) rewrite)))))
        event))))

(def rewrite-service (rewrite-service-with default-services))

This looks complex but what's happening is actually pretty simple. We first define a namespace: examplecom.etc.collectd. We then require three libraries. The first is Clojure's string functions from clojure.string. When we require this namespace we've used a new directive called :as. This creates an alias for the namespace, here str. This allows us to refer to functions inside the library by this alias: str/replace. If we remember Chapter 3, refer lets you use names from other namespaces without having to fully qualify them, and :as lets you use a shorter name for a namespace when you're writing out a fully qualified name. We also require the clojure.tools.logging namespace, which provides access to some logging functions, for example the info function we used earlier in the book. Lastly, we require the riemann.streams namespace, which provides access to Riemann's streams, for example the where stream.

Next we’ve created a var called default-services with the def statement. This is a series of regular expression maps. Each line does a rewrite of a specific :service field like so:

Listing 5.56: The service rewrite line

{:service #"^cpu/percent-(.*)$" :rewrite "cpu $1"}

We first specify the content of the :service field that we want to match, here that's #"^cpu/percent-(.*)$". This will grab all of the collectd CPU metrics. We capture the final portion of the metric namespace. In the next element, :rewrite, we specify how the final metrics will look—here that's cpu $1, with $1 being the output we captured in our initial regular expression. What does this actually look like? Well, the :service field of one of our CPU metrics is: cpu-percent/cpu-steal. Our regular expression will match the steal portion of the current metric name and rewrite the service to: cpu steal. Graphite's metric names treat spaces as dots, so when it hits Graphite that will be converted into the full path of:

productiona.hosts.tornado-web1.cpu.steal

The rules in our code right now handle the basic set of metrics we're collecting in this chapter. You can easily add new rules to cover other metrics you're collecting, as shown in the example below.

Next is a new function called rewrite-service-with. This contains the magic that actually does the rewrite. It takes a list of rules, examines incoming events, grabs any events with a :service field that matches any of our rules, and then rewrites the :service field using the Clojure string function replace. It's reasonably complex but you're not likely to ever need to change it. We're not going to step through it in any detail.
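
As an example of extending default-services, a hypothetical extra rule for the collectd disk plugin might look like the entry below. The exact :service pattern depends on which plugins and types you have enabled, so treat it as a sketch rather than a drop-in rule:

;; Hypothetical rule: turn "disk-sda/disk_octets/read" into "disk sda read octets".
{:service #"^disk-(.*)/disk_(octets|ops|time)/(read|write)$"
 :rewrite "disk $1 $3 $2"}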

NOTE If you have any questions you can find the original code on GitHub at https://github.com/pyr/riemann-extra.

Lastly, we have a final var called rewrite-service. This is what actually runs the rewrite-service-with function and passes it the rules in the default-services var. Let's rewrite our original tagged filter in /etc/riemann/riemann.config to send events through our rewrite function. Here's our original stanza.

Listing 5.57: Original collectd to graph where filter

(tagged "collectd" graph)

Let’s replace it with:


Listing 5.58: Updated collectd to graph stream

(require '[examplecom.etc.collectd :refer :all])

...

(tagged "collectd" (smap rewrite-service graph))

We first add a require to load the functions in our examplecom.etc.collectd namespace. We're still using the same tagged filter to grab all events tagged with collectd, but we're passing them to a new stream called smap. The smap stream is a streaming map. It's highly useful for transforming events. In this case the smap stream says: "send any incoming events into the rewrite-service var, process them, and then send them on to the graph var".

This will turn a metric like:

Listing 5.59: Original metric

productiona.hosts.tornado-web1.load/load/shortterm

Into:


Listing 5.60: Rewritten metric

productiona.hosts.tornado-web1.load.shortterm

TIP When Graphite sees spaces or slashes it converts them into periods.

This makes the overall metric much easier to parse and use, especially when we start to build graphs and dashboards.
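
As a rough illustration only (this is not Riemann's or Graphite's actual code), the naming rule amounts to joining the :host field to the rewritten :service field and converting spaces and slashes to dots; the productiona.hosts. prefix is then added by the Graphite output we configured in Chapter 4:

(require '[clojure.string :as str])

(defn graphite-path
  "Illustrative only: build a Graphite-style path for an event from its
  :host and :service fields, converting spaces and slashes to dots."
  [event]
  (str (:host event) "." (str/replace (:service event) #"[ /]" ".")))

(graphite-path {:host "tornado-web1" :service "load shortterm"})
;; => "tornado-web1.load.shortterm"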

NOTE We’ve included all example configuration and code in the book on GitHub.

Summary

In this chapter we saw how to collect the base level of data across our hosts: CPU, memory, disk, and related data. To do this we installed and configured collectd. We configured collectd to use plugins to collect a wide variety of data on our hosts and direct it to Riemann for any processing and checks, and to forward it on to Graphite for longer-term storage.

In the next chapter we’ll look at making use of our collectd metrics in Riemann and in Graphite and Grafana.

Chapter 6

Using collectd events in Riemann

In Chapter 5 we set up monitoring and events in collectd and sent those events into Riemann. Now let’s see how we might use those events for monitoring. We’re going to look at several examples of checks we could create using our collectd events.

• We’re going to check collectd or other processes are running. • We’re going to replicate some traditional monitoring checks for CPU, mem- ory and disk, to help the transition from that more traditional monitoring model. • We’re going to learn how to create better checks with more sophisticated logic. • We’re going to create a host-centric Grafana dashboard displaying our col- lectd metrics as graphs.

NOTE This section builds upon our earlier look at statistical techniques in Chapter 2. Like there, this book isn't going to teach you statistics, but it's going to mention, where relevant, what sort of analysis is best for certain types of metrics.


If you need to learn some statistics (and you do!) there are numerous good books available that can get you started.

Checking processes are running

The first check we're going to create is for ensuring processes are running. In Chapter 5 we configured the processes plugin. At the time, we configured it to monitor our collectd process, and we looked at examples of how to monitor Riemann and Graphite's Carbon daemons.

The processes plugin generates a series of metrics for the processes it is monitoring, including the ps_count metric, which counts the number of processes of that name that are running. When we configured the processes plugin we also configured a threshold for that metric: a global minimum of 1 process needs to be running for any process being monitored. We're going to make use of the notifications produced when this threshold is breached to monitor running processes on our hosts.

But what if collectd fails? Then what sends the notifications? And will we ever receive one for collectd? This is a potential issue, but one we can work around by using both the metric and its existence to measure availability.

To do this we’re going to add three streams:

• A wrapper that uses the tagged stream to match collectd events,
• A stream that matches the threshold notification, and
• A stream that matches expired events.

TIP You can learn more about event filtering in Chapter 3 and on the Riemann website.

That way we catch all variations of process failures and metrics disappearing from the Riemann index when they are not sent.

Listing 6.1: Detecting collectd down or expired

...

(tagged "collectd"
  (tagged "notification"
    (by [:host :service]
      (changed :state {:init "ok"}
        (adjust [:service clojure.string/replace #"^processes-(.*)\/ps_count$" "$1"]
          (email "[email protected]")))))

  (where (and (expired? event)
              (service #"^processes-.+\/ps_count\/processes"))
    (adjust [:service clojure.string/replace #"^processes-(.*)\/ps_count\/processes$" "$1"]
      (email "[email protected]"))))

...

You can see we’ve used a tagged stream to match all events with collectd in the :tag field. This wraps around both our other streams and narrows our events down to those coming from collectd. We then create a second tagged stream that matches events tagged with

Version: v1.0.3 (2c4a7d0) 264 Chapter 6: Using collectd events in Riemann notification. These are our threshold notifications. When they reach Riemann they look like:

Listing 6.2: A collectd notification event

{:host graphitea, :service processes-rsyslogd/ps_count, :state critical, :description Host graphitea, plugin processes (instance rsyslogd) type ps_count: Data source "processes" is currently 0.000000. That is below the failure threshold of 1.000000., :metric 0.0, :tags [notification collectd], :time 1451577254, :ttl 60, :DataSource processes, :type ps_count, :plugin_instance rsyslogd, :plugin processes}

We see that they are tagged with notification, and when reporting a failure they have a :state of critical and a useful description.

We then use a by stream to split events by field. This creates a new child stream each time a unique field is encountered—here we're splitting on the :host and :service fields. The by stream will create a child stream for each unique host and service it encounters. This is useful when we want to run many copies of a particular stream, like we're doing here so we get a distinct host and service to track independently. If we didn't do this, we'd trigger notifications when, for example, servicea reports ok but serviceb reports critical.

Next, we’re going to take advantage of Riemann’s built-in change detection. Change detection works by identifying when an event differs from a previous event. We’ve used the changed var which performs our actual detection. We’ve specified we’re doing detection on the :state field. We’ve also given it a single argument, {:init "ok"}. This controls a base state for detection, and assumes that the default, initial value of the :state field in an event will be ok. This also protects us from false positives when Riemann first starts and has no history of

Version: v1.0.3 (2c4a7d0) 265 Chapter 6: Using collectd events in Riemann previous states.

TIP Another useful function to consider if you’re worried about your state changes being triggered by spikes or flapping (where services rapidly change state) is stable. The stable var tries to only notify if an event is consistently in a specific state.
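
A hypothetical sketch of combining stable with the check above: stable only passes events on once their :state has held steady, here for 60 seconds, which damps short spikes and flapping services. Treat the 60-second window as an assumption you'd tune for your environment:

(by [:host :service]
  ; Only pass on events whose :state has been the same for at least 60
  ; seconds, then detect transitions between those stable states.
  (stable 60 :state
    (changed :state {:init "ok"}
      (email "[email protected]"))))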

But Riemann also comes with a neat shortcut, changed-state, that builds on the changed var by breaking events into distinct streams. This replaces the need to specify a by stream. Let’s update our example to use our new var:

Listing 6.3: Detecting changed state

(tagged "notification"
  (changed-state {:init "ok"}
    (adjust [:service clojure.string/replace #"^processes-(.*)\/ps_count$" "$1"]
      (email "[email protected]"))))

We removed the need to specify a by stream, or to specify the field we're monitoring for change—the changed-state var defaults to using the :state field. Now if the processes plugin detects any failures in a process being monitored it'll generate an event with a :state of critical and trigger our change detection and a notification.

What’s also useful about Riemann’s state detection is that it works both ways—for example, when an event is returned to normal then Riemann detects this change

Version: v1.0.3 (2c4a7d0) 266 Chapter 6: Using collectd events in Riemann in state too. For example, if your event has a :state of critical, and a new event for that host and service arrives with a :state of ok, then Riemann will trigger a state change. You can use this to resolve a triggered notification if the host or service recovers.
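
A minimal sketch of routing failures and recoveries differently might look like the following; the resolution address is hypothetical and in practice you'd likely want something smarter than a second email:

(tagged "notification"
  (changed-state {:init "ok"}
    ; Failures: the state has changed to critical.
    (where (state "critical")
      (email "[email protected]"))
    ; Recoveries: the state has changed back to ok.
    (where (state "ok")
      (email "[email protected]"))))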

TIP We’ll go over using state detection in more depth in Chapter 10.

We then adjusted the contents of the :service field to make it easier to see what service is impacted. Since our only thresholds are being triggered by the processes plugin we can be specific about the expected content of the :service field.

Listing 6.4: Adjusting the :service field

(adjust [:service clojure.string/replace #"^processes-(.*)\/ps_count$" "$1"]

To do this we extract the name of the process being monitored using Riemann's adjust function and a clojure.string operation called replace. It uses a regular expression to match and capture the process and replace the contents of the :service field with the regular expression capture. This would change the :service field in the ps_count metric, for the rsyslogd service for example, from processes-rsyslogd/ps_count to rsyslogd.
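
We can see the effect of that replacement in isolation; for example, at a REPL:

(clojure.string/replace "processes-rsyslogd/ps_count"
                        #"^processes-(.*)\/ps_count$"
                        "$1")
;; => "rsyslogd"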

Lastly, any matching events will then be emailed to [email protected], letting us know if any process has failed.

Our third where stream checks for when collectd fails to identify that a process has failed—for example, when collectd itself has potentially failed.


Listing 6.5: Finding expired ps_count events

(where (and (expired? event)
            (service #"^processes-.+\/ps_count\/processes"))
  (adjust [:service clojure.string/replace #"^processes-(.*)\/ps_count\/processes$" "$1"]
    (email "[email protected]"))))

The where stream matches two conditions. The first condition uses the expired? var to check if an event is expired from the index. This will match on any event with a :state field of expired. Our second condition is a regular expression on the :service field to identify the ps_count/processes metric. This will identify if any ps_count/processes metric disappears from the index, likely letting us know that a process is no longer reporting.

You’ll note this is a different service from our notification check. This is anid- iosyncrasy of collectd. In the notification event the :service field is truncated to: processes-rsyslogd/ps_count

But the metric itself is called: processes-rsyslogd/ps_count/processes

So for our check for expired events we use the full metric name and, as we did in the previous check, we adjust the :service field to make our notification more elegant. We then pass our updated event to the email var to send an email to [email protected].

Now let’s see both checks in the context of the larger Riemann configuration.


Listing 6.6: Added collectd monitoring to our Riemann configuration

...

(streams
  (default :ttl 60
    (where (not (tagged "notification"))
      index)

    (tagged "collectd"
      (tagged "notification"
        (changed-state {:init "ok"}
          (adjust [:service clojure.string/replace #"^processes-(.*)\/ps_count$" "$1"]
            (email "[email protected]"))))

      (where (and (expired? event)
                  (service #"^processes-.+\/ps_count\/processes"))
        (adjust [:service clojure.string/replace #"^processes-(.*)\/ps_count\/processes$" "$1"]
          (email "[email protected]"))))

...

We see our full, current Riemann configuration here. Note that we've added the new streams to the file. It's the combination of both of these checks that will catch most failure cases, and it will form the core of any process monitoring we'll do in our framework. We also made one other change: we wrapped our initial index var in a where stream. This where stream matches events tagged with notification and skips indexing them. This is because notifications are singletons. We don't need to track their state in the index because they aren't a constant—they represent transitory events rather than the "whole of environment" picture the other events in our index represent.

TIP We’re using the email notifications to [email protected] as a stub. These are obviously not very sophisticated, lack a lot of useful context and don’t scale. In Chapter 10 we’ll enhance these notifications and identify some new potential destinations for our notifications.

Other actions and enhancements

We can do more with our collectd notifications and thresholds. We can set further thresholds inside collectd and consume the events generated in Riemann. However, the power and usefulness of configuring and managing thresholds and checks in Riemann generally outweighs collectd's nascent thresholding capabilities. Beyond our use of them with the processes plugin, we're not going to make use of other notification events.

On the Riemann side, though, we can certainly take other actions and refine our notifications. In Chapter 10 we'll explore the default notifications, such as our currently defined email var, and look at making them more useful, with added context. We can also keep track of notifications and measure exception rates to determine the most problematic hosts and services. We'll discuss this too in Chapter 10.

Version: v1.0.3 (2c4a7d0) 270 Chapter 6: Using collectd events in Riemann

We can also trigger actions based on collectd notifications. For example, we could attempt to restart a stopped service, re-deploy code, or trigger a configuration management run. We could do this by shelling out directly with Clojure's sh or a library like Conch, or with tools like MCollective, Fabric, or Ansible.
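
As a purely illustrative sketch of the shelling-out approach: it assumes the :service field has already been rewritten to the bare process name, that the process runs on the Riemann host itself, and that the Riemann user may run systemctl via sudo—all assumptions you'd want to think hard about before automating remediation:

(require '[clojure.java.shell :as shell])

(defn restart-process
  "Attempt to restart a failed service on the local host. Illustrative
  only; assumes :service holds the bare process name."
  [event]
  (shell/sh "sudo" "systemctl" "restart" (:service event)))

(tagged "notification"
  (changed-state {:init "ok"}
    (where (state "critical")
      restart-process)))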

Replicating some classic monitoring

Now let’s look at how we might replicate some of the typical host monitoring checks most of us have likely used in the past. We’re not doing this as a recom- mendation but more as an acknowledgement that changing the way you monitor is often an evolution, not a revolution. We’re not going to include every one of these examples, but we’ll show you how transitioning to a new monitoring frame- work doesn’t have to mean a complete and immediate cutover.

Let’s take some basic CPU, memory, and disk monitoring examples, starting with CPU monitoring. Here we use one of the collectd CPU metrics: cpu/percent-user.

Listing 6.7: The cpu/percent metric

{:host tornado-web1, :service cpu/percent-user, :state ok, :description nil, :metric 98.981873111782477, :tags [collectd], :time 1452374243, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance user, :type percent, :plugin cpu}

This metric tracks the percentage of CPU consumed by user processes on our hosts. It is generated using the cpu plugin we configured earlier in the chapter. Let’s see our check in action.


Listing 6.8: Basic CPU monitoring

(where (and (service "cpu/percent-user")
            (>= metric 80.0))
  (email "[email protected]"))

Here we’ve got a where stream that selects based on two criteria. The first matches if the :service field has a value of cpu/percent-user. The second criteria matches on the :metric field if it is greater than or equal to 80.0 or 80%. If both criteria match then an email is sent to [email protected]. We can do a similar match for memory monitoring using the memory/percent-used metric. This metric shows the percentage memory used on a host and is provided by the memory plugin.

Listing 6.9: Basic Memory monitoring

(where (and (service "memory/percent-used")
            (>= metric 80.0))
  (email "[email protected]"))

Here we send an email if the memory/percent-used metric goes above 80.0 or 80%. Finally, we do the same for disk monitoring by looking at any filesystems we're watching. To do this we use the df plugin's metrics. The df plugin generates percentage-of-storage-used metrics for every filesystem you've configured it to watch. We can use a regular expression to match every filesystem, or notify on a specific filesystem. Let's look at using a regular expression. The df metrics are named df-filesystem/percent_bytes-used, where filesystem is the name of the filesystem being watched. Here's one now:

Listing 6.10: The df plugin percent_bytes-used metric

{:host tornado-web1, :service df-root/percent_bytes-free, :ttl 60, :time 359630164907/250, :metric 96.41158294677734, :description nil, :state ok, :tags [collectd]}

This is the df plugin metric for monitoring the percent space used of the root filesystem. Now let’s create a check to match these metrics.

Listing 6.11: Basic disk monitoring

(where (and (service #"^df-(.*)/percent_bytes-used")
            (>= metric 90.0))
  (email "[email protected]"))

We’ve specified a where stream that matches on the :service field using a regular expression to grab all monitored filesystems, grabbing the root filesystem from our example metric. We’ve made it so if our monitored filesystem exceeds 90.0 or 90% capacity then an email will be generated.

But, as we learned in Chapter 2, these somewhat arbitrary thresholds are fragile. Is a match on our statically and arbitrarily defined threshold a spike or a sustained performance issue? This is where Riemann really starts coming into its own. With the full power of Clojure underneath it, we can create better, more sophisticated monitoring checks.


Better monitoring through smarter data

We’re going to make better use of our metric data through two mechanisms:

• Better data granularity
• More sophisticated check functions

First, we’re collecting observations at a reasonably high granularity, generally one to two seconds. This allows us to develop checks that look at data over a longer period of time rather than at a single point in time. So, for example, instead of ev- ery event arriving in Riemann and the metric value being checked for a threshold, we instead collect the metric values for a period of time, perform an appropriate calculation on them to combine them in an intelligent manner, and then check them against that data. Second, Riemann provides a powerful set of functions that we’ll use to replace our arbitrary thresholds. These include some of the methods—like percentiles— we discussed in Chapter 2 for summarizing our metric data.

Building a median-based check

Let’s revisit our CPU percentage check using some of Riemann’s capabilities. In- stead of checking one metric at a single point in time, we’re going to take a period of time. We’re then going to take all the metrics inside this period and calculate a better metric from them. We’re going to start with calculating the median, or 50th, percentile from our metric data. We’ve chosen the median because it can handle outliers and clusters of values better than the mean. A cluster of high val- ues or several outliers can significantly influence the mean. In these instances the median tends to provide a more representative sample of the central tendency of your data. This is a useful first step in seeing how to create checks with Riemann.


Listing 6.12: Median CPU monitoring over ten seconds

(where (service "cpu/percent-user")
  (by :host
    (fixed-time-window 10
      (smap folds/median
        (where (> metric 80.0)
          (email "[email protected]"))))))

In our new check our where stream matches on a :service of cpu/percent-user. We then use a by stream to split events by field. This creates a new child stream each time a unique field is encountered. Here we're splitting on the :host field. Events matching the service are copied via the by stream and then sent into a new stream: fixed-time-window. The fixed-time-window stream records a fixed window of events within the last n seconds, here 10 seconds. At the conclusion of every time period, it emits a vector of the events in the window to any child streams.

TIP In addition to fixed-time-window, Riemann also has some other useful related streams. These include moving-event-window, a sliding window of the last few events; moving-time-window, a sliding window in time; and fixed-event-window, which provides similar capabilities but for fixed windows of events. You can read about each of these in the Riemann stream API documentation.
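
For example, a hypothetical variation on the check above that uses a sliding, rather than fixed, 10-second window might look like this sketch:

(where (service "cpu/percent-user")
  (by :host
    ; Emits the last 10 seconds of events each time a new event arrives,
    ; rather than once per fixed window.
    (moving-time-window 10
      (smap folds/median
        (where (> metric 80.0)
          (email "[email protected]"))))))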

In our case the vector of events is sent to a child stream: folds/median. Folds are how Riemann reduces collections of events. You can perform a number of fold operations including mean, median, maximum, or count.

TIP You can find the full list of fold operations in the Riemann fold API documentation.

The folds/median stream converts the values of the metrics in our 10-second period into a median value. It then passes our new event with our median metric to another where stream. That where stream is our new threshold. If our calculated median event is over 80.0, or 80%, then a notification will be emailed to [email protected].

Using percentiles for host-based checks

But what if our data isn't a normal distribution? In Chapter 2 we discovered that percentiles are a good solution for using this sort of data for monitoring. Conveniently, Riemann allows us to calculate percentiles for our window of metrics. Let's update our code to emit multiple percentiles: the 50th (the median as above) as well as the 95th and 99th percentiles. We'll also output the max to capture any extreme outliers.


Listing 6.13: Percentile CPU monitoring over ten seconds

(where (service "cpu/percent-user")
  (by :host
    (percentiles 10 [0.5 0.95 0.99 1]
      (smap rewrite-service graph)

      (where (and (service "cpu/percent-user 0.99")
                  (> metric 80.0))
        (email "[email protected]")))))

We’ve again used our where and by streams to watch the metric we want and split the streams by the :host field. We’ve then used a new stream called percentiles that calculates percentiles and emits new events for each percentile calculated. In our check, the percentiles stream creates a new event for every percentile we want to measure over a spec- ified period; here that’s 10 seconds. We’ve specified each percentile as a vector: [0.5 0.95 0.99 1]. This will create new events for the 50th, 95th, and 99th percentiles, as well as 1, which is the maximum value. Each new event has the specific percentile suffixed tothe :service field. For example:


Listing 6.14: The percentile events

{:host graphitea, :service cpu/percent-user 0.5, :state ok, :description nil, :metric 32.48730964467005, :tags [collectd], :time 1440989473, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance user, :type percent, :plugin cpu}
{:host graphitea, :service cpu/percent-user 0.95, :state ok, :description nil, :metric 35.025380710659896, :tags [collectd], :time 1440989474, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance user, :type percent, :plugin cpu}
{:host graphitea, :service cpu/percent-user 0.99, :state ok, :description nil, :metric 33.015781291354883, :tags [collectd], :time 1440989474, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance user, :type percent, :plugin cpu}
{:host graphitea, :service cpu/percent-user 1, :state ok, :description nil, :metric 38.5050505050505, :tags [collectd], :time 1440989475, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance user, :type percent, :plugin cpu}

We see four new events have been created, each with a new service—for example, cpu/percent-user 0.99 for the event measuring the 99th percentile. The calculated percentile value is contained in the :metric field. So in our sample events 99% of our values are less than 33.015781291354883 or 33%. Our code wraps up by sending our new metrics to Graphite to be graphed, via an smap to the rewrite-service function and then on to the graph var. We're also creating a monitoring check that selects the 99th percentile and sends an email notification for any host where the CPU user metric exceeds 80.0 or 80%. This means we now have:


• New CPU metrics, measured in percentiles and the max value, that we can graph for each host. We can use the same method for any other metrics we'd like to work with.
• A notification if user CPU in the 99th percentile on a host exceeds our threshold over the specified window of events.

This provides a much more elegant monitoring check than a stock threshold—plus we get new metrics that we can graph and visualize!

Creating check abstractions

With many events and checks you can see how your Riemann configuration might get quite complex and extensive. We’ve also seen that the checks we’re building are broadly similar in structure. We alleviate this potential complexity by creating monitoring abstractions in the form of custom check functions.

Our threshold check

Let’s look at an example of this now. We’ll first create a file to hold our custom functions. We’re going to create it in the /etc/riemann/examplecom/etc/ direc- tory to ensure it is loaded when Riemann starts.

Listing 6.15: Creating a file to hold our custom check functions

$ sudo touch /etc/riemann/examplecom/etc/checks.clj

Now let’s create a custom function that replicates our threshold check.


Listing 6.16: Our first custom check function

(ns examplecom.etc.checks
  (:require [clojure.tools.logging :refer :all]
            [riemann.streams :refer :all]))

(defn set_state [warning critical]
  (fn [event]
    (assoc event :state
           (condp < (:metric event)
             critical "critical"
             warning "warning"
             "ok"))))

(defn check_threshold [srv window func warning critical & children]
  (where (service srv)
    (fixed-time-window window
      (smap func
        (where (< warning metric)
          (smap (set_state warning critical)
            (fn [event]
              (call-rescue event children))))))))

We first create a new namespace: examplecom.etc.checks. We then require two libraries: clojure.tools.logging, which contains the info function and related logging actions, and the riemann.streams library, which contains Riemann's stream functions, like where. Here we've defined two new functions: set_state and check_threshold. The

set_state function allows us to set the :state field of an event in response to a specific threshold. This allows us to create two-tier notifications, such as a warning event and a critical event. It takes two parameters: a warning and a critical threshold. Inside the set_state function we run each incoming event through the assoc function. When applied to a map, the assoc function returns a new map that potentially re-maps fields. In this case we use assoc to set the :state field to the result of a conditional, condp, evaluated against the value of the :metric field. The conditional has three clauses:

• If the value of the :metric field exceeds the critical threshold then set the :state field of the event to critical.
• If the value of the :metric field exceeds the warning threshold then set the :state field of the event to warning.
• If the value of the :metric field is below both thresholds then set the :state field of the event to ok, which it should be by default.
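
For example, calling set_state with a warning threshold of 80 and a critical threshold of 90 against a hypothetical event would yield:

; A hypothetical event with a metric of 95.0, tried at a REPL.
((set_state 80 90) {:service "cpu/percent-user" :metric 95.0 :state "ok"})
;; => {:service "cpu/percent-user", :metric 95.0, :state "critical"}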

The second function, check_threshold, is our check abstraction. It takes six parameters:

• srv — The service we wish to check.
• window — The time window we wish to monitor.
• func — The fold function we wish to use.
• critical and warning — The warning and critical thresholds we wish to check.
• children — An optional argument (indicated by the & symbol). These are any child streams we might want to pass to the function—for example, sending our event as an email notification.

Inside our function we use the value of the srv parameter to match a specific metric. We then use the window parameter to specify the window of time from which we want to take events. We pass our vector of events into the smap stream and apply the specific fold function we wish—for example, folds/median—to the events.


A where stream then matches the resulting event metric against our warning threshold. If it matches, it's passed to our set_state function which sets the :state based on which threshold—warning or critical—has been breached.

Lastly we have an anonymous function, defined with fn, that takes our final event and uses the call-rescue function to pass it to any defined children, such as sending it to Graphite or triggering a notification. What does this mean? Well, we know that Riemann events travel through streams. We've also learned that streams can have one or more children. For example:

Listing 6.17: A child stream explained

(where (service "cpu/percent-used") #(info %))

Here #(info %) is a child stream of our where stream. In our case we’re saying, “If there are any child streams defined to our function, then pass any matching events to them.” The call-rescue var calls each child stream in turn with an event and will rescue any errors and log any failures. So if we look at an example check we will see:


Listing 6.18: Using the check_threshold function

(by :host
  (check_threshold "cpu/percent-user" 10 folds/median 80.0 90.0
    (email "[email protected]"))
  (check_threshold "memory/percent-used" 10 folds/median 80.0 90.0
    (email "[email protected]")))

...

Here we’ve again split our events by :host using the by stream.. We’ve then spec- ified the check_threshold function and passed it parameters. We’ve specified the cpu-percent-user service, a 10 second time window, the folds/median function, a warning threshold of 80.0 or 80%, and a critical threshold of 90%. We’ve also specified a child stream: (email "[email protected]"). If an event matches our threshold then an email will be generated to [email protected]. We’ve created a second threshold check for the memory/percent-used metric, and we can add further checks for any other metrics we’d like to monitor.

Our percentiles check

We create a similar abstraction for our percentiles check. Add the following code to our /etc/riemann/examplecom/etc/checks.clj file.


Listing 6.19: Percentile CPU monitoring over ten seconds

(defn check_percentiles [srv window & children]
  (where (service srv)
    (percentiles window [0.5 0.95 0.99 1]
      (fn [event]
        (call-rescue event children)))))

We’ve defined a new custom function called check_percentiles. Our new func- tion has three parameters:

• The srv parameter takes the name of the service we wish to monitor.
• The window parameter specifies the period for which we'll collect events.
• The & children parameter handles any child streams to which we might want to pass our events.

We use a where stream and the srv parameter to match the events we want, and pass them to the percentiles stream. Our window parameter provides the timespan for collecting events, and the percentiles stream calculates the percentiles from the events collected in that window. We've defaulted to producing the 50th, 95th, and 99th percentile results, and the maximum value. This will produce a new event for each percentile we want calculated. We then call the same function, call-rescue, that we used in our previous custom function to pass our events to any child streams.

Let’s see how we’d use our new check_percentiles function.


Listing 6.20: Using the check_percentiles function

(by :host
  (check_percentiles "cpu/percent-user" 10
    (smap rewrite-service graph)

    (where (and (service "cpu/percent-user 0.99")
                (> metric 80.0))
      (email "[email protected]"))))

We again split our streams by :host and then call the check_percentiles function with cpu/percent-user as our service and 10 seconds as our window. We then pass two child streams to this function. The first child stream is (smap rewrite-service graph), for sending our new events to Graphite. The second stream is a where stream that watches for any newly created 99th-percentile events with a metric value of 80.0 or greater. If any event is matched then an email notification will be generated.

TIP You can be smarter again about these checks by having them pull thresholds or time windows dynamically from a service discovery tool such as Zookeeper. This avoids having to hardcode them and allows us to select thresholds per host or per service—for example, having all database hosts use a threshold of 80% whilst app servers use 90%. You can find an example of how to do this with Zookeeper in a blog post I wrote last year.


Organizing our checks

Let’s look at how we might organize a series of checks, such as our updated mon- itoring of CPU, memory, and disk. We are going to bundle together multiple check_percentiles functions inside a tagged stream.

Listing 6.21: A set of monitoring checks

(tagged "collectd"
  (by :host
    (check_percentiles "cpu/percent-user" 10
      (smap rewrite-service graph))
    (check_percentiles "memory/percent-used" 10
      (smap rewrite-service graph))
    (check_percentiles #"^df-(.*)/percent_bytes-used" 10
      (smap rewrite-service graph))))

We first specify a tagged stream that selects only the events tagged with collectd. This tag is added to the write_riemann plugin configuration by the Tag directive. We then use the by stream to split our events by the :host field. Each check_percentiles function will match on a different metric :service field:

• cpu/percent-user
• memory/percent-used
• #"^df-(.*)/percent_bytes-used"

Each check will collect 10 seconds' worth of events and calculate our percentiles and maximum values from them. These new events will be passed to any child streams defined. In this case we're only defining one child stream for each check: (smap rewrite-service graph), to send the events to Graphite. These functions would generate four metrics from our events. For example, for the memory/percent-used event on the graphitea host we'd get:

• productiona.hosts.graphitea.memory.used.5
• productiona.hosts.graphitea.memory.used.95
• productiona.hosts.graphitea.memory.used.99
• productiona.hosts.graphitea.memory.used.1

Plus the original percentage:

• productiona.hosts.graphitea.memory.used

We could also use any of these metrics as a threshold or just visualize them in Grafana.

NOTE You can find other examples of how to use Riemann to perform a variety of monitoring tasks in the Riemann HOWTO.

Graphing collectd metrics with Grafana

Our collectd events and check outputs are now flowing into Graphite, and we can now use them to construct a new dashboard in Grafana. To start we’re going to create some CPU and memory graphs, and then we’ll explore some of the other available metrics. These are example graphs we’re creating. They’ll expand our knowledge of Grafana and working with metrics. You should consider the sort of graphs you need for your environment based on your requirements.


Creating the Hosts dashboard

We’ll start by creating a new dashboard. First, let’s log in to our Grafana dash- board. For example, on our graphitea host we’d browse to the URL: http://graphitea.example.com:3000

Now we can log into Grafana using our username and password or the default admin / admin. Next, we click on the Home button to open our list of dashboards.


Figure 6.1: Creating a new host dashboard

We click the +New button to create a new dashboard. When the new dashboard opens we click on the Settings circle and select Settings to name our dashboard.

Figure 6.2: Naming our new host dashboard

Update the Title field with the name of the new dashboard. We're going to call ours Hosts. Close the Settings box and click the Save button to save the dashboard.

Creating our first host graph

Let’s create our first graph by clicking on the green row bar and selecting Add Panel -> Graph.


Figure 6.3: Creating our first host graph

Click on the Panel Title link to bring up the edit box, then click Edit to start building our graph. Let’s first give our graph a name by clicking the General tab and adding a title for the graph in the Title box. We’re going to call this graph CPU Usage.

Figure 6.4: Naming our CPU Usage graph

The first graph we’re going to create monitors CPU on all our hosts and displaysit in graph form. To construct this graph we’re not going to select a single, specific

Version: v1.0.3 (2c4a7d0) 290 Chapter 6: Using collectd events in Riemann metric. Instead we’re going to construct an algorithm that will use several metrics to display CPU usage across all our hosts. We’re going to sum the user and system CPU usage metrics and display the resulting sum per host, so we have a combined usage value for each host. To do this we select the Metrics tab then the dropdown icon and we select Switch editor mode to specify our graph definition. We then add the following to the input:

Listing 6.22: CPU Usage graph definition

groupByNode(productiona.hosts.*.cpu.{user,system},2,'sumSeries')

When we looked at Grafana in Chapter 4 we specified a graph by using a single metric. This example shows us the next level of using Grafana: using functions to produce a graph. Functions are used to transform, combine, and perform computations on metrics. Graphite functions are defined roughly like:

Listing 6.23: Graphite functions

function(series_list_of_metrics,parameter_foo,parameter_bar)

In this case we’re using a function called groupByNode. The groupByNode or “group by node” function takes what Graphite calls a “series list.” A series list is essen- tially the name of a metric or series of metrics. We’re not just selecting a single metric—we’re selecting a series of metrics like so:


Listing 6.24: A Graphite series list

productiona.hosts.*.cpu.{user,system}

We select the metrics in two ways:

• Via a wildcard, *
• Via a set of path names in curly braces: {}

Our wildcard selects the host portion of the metric name, selecting all hosts recording this metric. Our curly braces enclose both the user and system paths, selecting every metric with either name.

We pass our series list of metrics to the function as a callback. Our function takes two parameters:

• An identifier for the element we're grouping by, here 2.
• The operation to perform on the series list, sumSeries.

Our first parameter, 2, is the common element from the metric that we're going to group our nodes by. The 2 is a 0-indexed count from the start of the metric name. In our metric that element matches the hostname wildcard: productiona.hosts.*.

Listing 6.25: The 0-indexed metric name

productiona . hosts . *
     0          1     2


This tells the groupByNode function to group by the hostname from the metric. So our callback series will look like:

Listing 6.26: The groupByNode callback

sumSeries(productiona.hosts.riemanna.cpu.{user,system})
sumSeries(productiona.hosts.graphitea.cpu.{user,system})
sumSeries(productiona.hosts.tornado-web1.cpu.{user,system})

The second parameter, sumSeries, refers to the sumSeries function. This function sums metrics and returns the result. Here the sumSeries function will return:

Listing 6.27: The sumSeries function output

productiona.hosts.riemanna.cpu.user+system
productiona.hosts.graphitea.cpu.user+system
...

The sumSeries function will add the values of the productiona.hosts.riemanna.cpu.user and the productiona.hosts.riemanna.cpu.system metrics together to provide a single metric, grouped by the host, that will show CPU usage of the combined user and system metrics. Once we've created the function and series list we click away from the input box, and Grafana will apply the specified definition. We'll then see our CPU Usage graph.


Figure 6.5: Our final CPU Usage graph

If we then click Back to dashboard and then the Save dashboard icon, we’ll have saved our graph onto the dashboard.

We can then use the time resolution controls to specify the time period we’re interested in and configure how often we’d like our graph’s data to refresh using the Auto-Refresh controls.

Figure 6.6: The time-series resolution controls

TIP You can find a full list of Graphite functions in the documentation.

Creating a memory graph

We'll use the same principles to create another graph, this one showing memory usage. For this graph we're going to use the productiona.hosts.*.memory.used metric. As it is a complete metric, expressed as a percentage, we don't need to sum any data. We do, however, want to use another function: aliasByNode. The aliasByNode function tells Grafana to name each metric we want to graph using an element of our metric path rather than the whole metric path. In this case we want to name each metric using the host name of each host being monitored.

Let's create our memory graph by clicking on the green row bar and selecting Add Panel -> Graph. Again we click on the Panel Title link to bring up the edit box and then click Edit to start building our graph. We give our graph a name by clicking the General tab and adding a title for the graph in the Title box. We're going to call this graph Memory Usage.

We then select the Metrics tab, then the dropdown icon, and select Switch editor mode to specify our graph definition. We add the following to the input:

Listing 6.28: Memory Usage graph definition

aliasByNode(productiona.hosts.*.memory.used,2)

Here we have the productiona.hosts.*.memory.used metric. Note the wildcard after the productiona.hosts. element. This tells us to select all hosts sending this metric. The metric is then wrapped in the aliasByNode function. This uses the specified parameter, here 2, as the name of each metric. In our zero-counted view of the metric name this is the third field: the hostname element we've just wildcarded. So, if we were to receive a metric entitled productiona.hosts.riemanna.memory.used, this graph would refer to that metric as riemanna. Click Back to dashboard to save our metric input, and then click Save dashboard to save our new memory graph.


Figure 6.7: The Memory usage graph

Single host graphs

We’re also going to create a host-centric graph for a single host—for example, making use of the CPU percentile metrics we created earlier. Let’s build a CPU utilization graph for our graphitea host. We’ll start by creating a new graph and calling it graphitea CPU.

Figure 6.8: The graphitea CPU graph

We'll then configure a metric name in the Metrics tab. We're going to select multiple metrics for a single host using a wildcard, *, in our metric path. We're again going to use the aliasByNode function to update the legend of our metrics. Instead of a single path entry as the alias, though, we're selecting two elements: the host name, graphitea, and the specific percentile value of each metric.

aliasByNode(productiona.hosts.graphitea.cpu.percent-user.*,2,5)

Figure 6.9: The single host Metrics tab

This will turn the metric legend into graphitea.5, graphitea.95, graphitea.99, etc. If we then save our resulting graph we see CPU performance measured in the median, 95th, and 99th percentile, as well as the maximum value.

Figure 6.10: The single host CPU graph

Additional graphs

From here we could easily create graphs from other metrics. Here are some additional graphs we could create from the data we're gathering.


• Disk used on the / (root) partition: aliasByNode(productiona.hosts.*.df.root.percent_bytes.used,2)
• Load average: aliasByNode(productiona.hosts.*.load.shortterm,2)
• Zombie processes: aliasByNode(productiona.hosts.*.processes.zombies,2)

TIP The Graphite functions documentation has many examples of metrics—especially collectd metrics—and how you might use them.

This collection of metric formulae would then result in a dashboard that shows the high-level current state of host-based metrics—like CPU, memory, and disk—in our environment.


Figure 6.11: A Hosts Dashboard

Network, device, and Microsoft Windows monitoring

One of the areas we're not covering in any detail in this book is monitoring network devices, data center equipment, or Microsoft Windows. However, none of the solutions in the book actively preclude doing this, and many are amenable to supporting this additional monitoring. We can use collectd to monitor network devices remotely, such as using the ping, curl, or SNMP plugins to monitor devices we can't run a collectd agent on. Many devices support collection via an API—for example, see the NetApp plugin for monitoring NetApp appliances. We'll also see more examples of scraping data from endpoints or via an API in Chapters 7 and 9. Additionally, in Chapter 8 we talk about logging, and most of these kinds of devices generate Syslog-style log output. We could configure them to output to our logging environment to supplement existing monitoring. Lastly, for Microsoft Windows, there are several collectd-compatible agents like CollectM and SSC Serv.

Alternatives to collectd

There are a number of commercial and open-source alternatives to collectd. This is not a definitive list but a sampling of some of the more interesting tools that you can use if collectd isn't suitable or to your taste.

Commercial tools

• New Relic — Commercial Application Performance Management tool. It also includes some support for host, service, and availability monitoring.


• Circonus — Commercial Software-as-a-Service (SaaS) monitoring solution.
• DataDog — A SaaS console and collection center for monitoring and metrics data.
• Librato — A SaaS console and collection center, primarily focused on metrics data.

NOTE Some of these SaaS tools may do more than just collection and can include application monitoring, application performance management, and notification.

Open source

• Square's Cube and Cubism — Cube allows the collection of time-series events, and Cubism is a D3-based visualization tool for those collected metrics.
• Ganglia — A monitoring tool with a focus on clusters and grids.
• Munin — A popular metric and monitoring tool that uses RRDTool.
• StatsD — A network daemon that listens for statistics sent over the network and sends aggregates to one or more pluggable back-end services, including Graphite. We'll see more of StatsD in Chapter 9 of the book.
• Diamond — An open-source metrics collector originally written by Brightcove but now maintained by a wider community.
• Fullerite — An open-source metrics collector written by the Yelp Engineering team. It's written in Go and designed for large-scale metrics collection.
• PCP and Vector — Used by Netflix, this combination provides high-resolution host performance metrics suitable for diagnostics.
• sumd — A lightweight Python collector that allows you to run processes, for example Nagios plugins, locally and send the results to Riemann.


NOTE There's also some overlap between these tools and the collection and graphing tools we looked at in Chapters 3 and 4.

Summary

In our Riemann configuration we saw how we can make use of this data to monitor our hosts and their components, and how we can notify on specific events or thresholds. We also saw how to create a basic host-centric dashboard and related graphs in Grafana.

In the next chapter we’ll look at another kind of host: containers.

Chapter 7

Containers: another kind of host

With the rise of container virtualization it's also important to understand how to monitor metrics on containers. Containers are a form of operating system-level virtualization, most clearly epitomized by Docker. Rather than running a full-scale hypervisor like a traditional virtual machine, they use kernel features like namespaces and cgroups to create small and lightweight compute instances.

NOTE We’re going to focus on monitoring Docker-specific containers in this chapter. There are other emerging container platforms but none with the depth of adoption. If that changes the chapter will be updated to include them.

We’re going to discuss the challenges of monitoring lightweight and fast-moving containers and how to make use of Docker’s API to monitor metrics. We’ll build on the work we did in Chapter 5 and use a collectd plugin to query statistics from that API and return system metrics, like CPU and memory, for each container running on our Docker daemon. In Chapter 8 we’ll look at integrating Docker containers’ log output into our monitoring environment.


NOTE This section assumes you’ve got Docker installed and running on a host. It doesn’t show you how to install or manage Docker. If you’re interested in that then we’ve got a book for you that’s all about it: The Docker Book.

Challenges with container monitoring

We’ve talked a little about how host-centric monitoring has changed as a result of faster-moving infrastructure, like the use of Cloud and virtualized hosts. This often means that the lifespan of a host is much shorter than it was in the past. With containers, which are designed to be lightweight and short lived, this is further exacerbated. Further, with containers being so lightweight, we often want to ensure capacity is focused on performing their designated purpose, rather than consumed by overhead from monitoring. This diagram articulates the changes that have occurred in compute architecture.


Figure 7.1: Host evolution

NOTE This diagram is based on one published in this DataDog blog post. The original image is copyright DataDog and modified with their permission.

The far left column is legacy applications. Generally legacy applications were each run on a single physical host. These hosts were long lived and hence easy to monitor: there was a single set of metrics that was tracked over time. This was the typical host architecture of about 15–20 years ago.

The middle column shows the change to hosts introduced by virtualization, largely VMware-driven, and the Cloud, largely the result of Amazon Web Services. Here two or more virtualized or cloud instances were running on top of a single hardware host. Correspondingly, operational and monitoring complexity increased as the number of hosts increased and the volume of metrics created was multiplied. Additionally, the lifespan of those instances tended to be shorter than that of single physical hosts. This model started to appear about 10 years ago.

The far right column represents a container-centric host architecture. Here we have a hardware host running a virtual environment (or Cloud instances), on top of which we're running a container service. In this world operational complexity is high. The application architecture can require understanding the interconnection between multiple containers, instances, and hosts. Added to this, the lifespan of a container might be seconds or minutes. This makes the traditional monitoring techniques used for a single host or instance problematic.

From a monitoring perspective there are three major issues with this new host model:

• Convergence and dynamism
• Performance
• Metric volume

Let's first talk about convergence and dynamism. The speed and limited lifespan of containers mean a lot of churn in your monitoring configuration: hosts appearing and disappearing quickly. Sometimes a host will even appear and disappear before your monitoring environment is aware of it. In many monitoring environments your configuration is applied after the installation of the host or service, either manually or via a configuration management tool like Puppet or Chef. In either case there is some delay between a host or service being active and it being monitored, both in terms of checks of its status and collection of any performance data. With containers, that delay could mean that the host or service lives a shorter period than the time required to converge your monitoring configuration.

There is one principal strategy we're going to employ to address our convergence challenge: the push-based architecture of our framework. A host or service can be preconfigured to emit events and metrics as soon as it appears, rather than having to wait to be discovered, configured, and then monitored. If something occurs to our container or the service or applications inside it, we'll have a record of it and the ability to notify. We'll make use of both the collectd infrastructure we created in Chapter 5 and logging to collect the relevant data. (More on the logging setup in Chapter 8.)

TIP A good rule of thumb: your monitoring system should be at least as dynamic as your infrastructure.

Second, for performance reasons, we want to ensure whatever monitoring we do on the host is as low impact as possible. This generally means avoiding approaches like installing agents or running additional processes inside the container. We’re going to focus on approaches that require as little overhead as possible, such as focusing monitoring on extracting information from the Docker daemon and API layer rather than from containers directly.

Our final issue is the metric volume produced by the container-based model. If we compare our typical metric volume:


Physical × Virtual × Services × Apps = Metrics

Versus our container metric volume:

Physical × Virtual × Containers × Services × Apps = Metrics
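To make the multiplier concrete, here's a purely illustrative example with made-up numbers: 10 physical hosts, each running 4 virtual hosts that each expose 5 services with 3 application metrics apiece, gives 10 × 4 × 5 × 3 = 600 metrics. Add 10 containers per virtual host and the same services and applications now produce 10 × 4 × 10 × 5 × 3 = 6,000 metrics, a tenfold increase.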

We see that adding the container layer adds a significant multiplier to the number of metrics being generated. There’s not much we can do about the incoming volume of metrics beyond setting up our Graphite environment to be as optimized as possible. We can, however, consider retention and storage of metrics, especially ones generated by short-lived containers. Later in the chapter we’re going to discuss setting a different retention for Docker metrics and cleaning up metrics that have not been updated recently.

Monitoring Docker containers

We’re going to use our existing collectd infrastructure to monitor Docker and Docker containers. There isn’t a default plugin for Docker in collectd yet, but we can take advantage of collectd’s plugin framework and drop in an open-source plugin to do the monitoring.

Let’s start by looking at Docker’s monitoring capabilities to understand what’s available to us and how this plugin gets its data. In the case of Docker, a daemon running on the host creates and manages any containers. Users interact with this daemon via a command line tool called docker or via an API that allows a user to manage and query data from the Docker daemon. Let’s look at what data we get out of Docker containers using the binary and the API.

The docker binary can reveal the performance statistics of running containers. It has a sub-command called stats that takes the name or ID of one or more containers. The command returns a live stream of container resource information.


Listing 7.1: The docker stats command

$ docker stats web_1 db_1
CONTAINER  CPU %   MEM USAGE / LIMIT     MEM %   NET I/O               BLOCK I/O
web_1      0.00%   26.6 MB / 1.042 GB    2.55%   3.48 kB / 1.084 kB    12.75 MB / 0 B
db_1       0.13%   13.46 MB / 1.042 GB   1.29%   1.732 kB / 2.922 kB   6.828 MB / 0 B

Here we’ve returned the statistics of the web_1 and db_1 containers. We see the CPU and memory used as well as some network and block I/O usage data. There’s also a Docker API endpoint called stats that returns more detailed data for a container. Let’s use the curl binary to query the Docker API for some container statistics.

TIP You can read more about the API in the Docker API reference.


Listing 7.2: Querying the Docker API stats endpoint

$ curl http://127.0.0.1:2375/containers/bb82f35bab39/stats?stream=false | python -mjson.tool
{
    "blkio_stats": {
        "io_merged_recursive": [],
        "io_queue_recursive": [],
        "io_service_bytes_recursive": [
            {
                "major": 253,
                "minor": 0,
                "op": "Read",
                "value": 28672
            },
            {
                "major": 253,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
...

Here we see a much fuller set of metrics returned by the Docker API. We’ve curled the stats endpoint for a specific container, here identified by its container ID, bb82f35bab39. It’s these metrics that we’re going to extract via our Docker collectd plugin and send onto Riemann.


Docker collectd plugin

We’re going to make use of a collectd plugin specifically designed to collect Docker container metrics. We’re going to install it on one of our Docker hosts, originally named docker1.example.com. We’ve also installed and configured collectd on that host as per the instructions specified in Chapter 5. Our plugin uses the Docker stats API to collect statistics on the containers running on the Docker daemon. The following container stats are reported for each container:

• Network bandwidth
• Memory usage
• CPU usage
• Block IO

The Docker collectd plugin is a third-party plugin that doesn't ship with the core collectd product. Third-party plugins for collectd can be written in a variety of languages. They can be written in C and run directly by collectd, or they can be written in other languages and executed using helper plugins. These helper plugins include the Exec plugin, which executes scripts and returns anything they write to STDOUT back to collectd. Another is the Python plugin, which specifically executes Python-based scripts and returns their output to collectd. The Exec and Python plugins are installed by default on both Ubuntu and Red Hat distributions when you install collectd.

The Python plugin is a bit more elegant than the Exec plugin. The Exec plugin starts a new process for each run of a plugin, shelling out each time. This consumes a reasonable amount of resources and isn't especially efficient. The Python plugin provides an interpreter that is run by the collectd daemon and doesn't spawn repeated external processes. But as a result the Python plugin requires somewhat more sophisticated development effort than the Exec plugin.


TIP There are also Perl and Java plugins that provide similar capabilities for plugins written in Perl and Java.

Installing the Docker collectd plugin

In this case our new Docker collectd plugin is written in Python. We’re going to show you how to do a manual installation, but the whole process would be much better managed using a configuration management tool. We’re going to store our new plugin with some of our existing collectd plugins in the /usr/lib/collectd/ directory.

TIP If you’ve installed collectd then this directory already exists on Ubuntu, Debian, and most Red Hat family distributions.

We download a Zip file containing the latest version of the plugin to this directory.

Listing 7.3: Downloading the Docker collectd plugin

$ cd /usr/lib/collectd/
$ sudo wget https://github.com/jamtur01/docker-collectd-plugin/archive/master.zip

Then we unzip it.


TIP You might need the unzip command on your host. If so, you can install it via the unzip package.

Listing 7.4: Unzip the master.zip file

$ sudo unzip master.zip
Archive:  master.zip
4f9dd778f30ba82cdcf27653d219dfde4232c925
   creating: docker-collectd-plugin-master/
 extracting: docker-collectd-plugin-master/.gitignore
  inflating: docker-collectd-plugin-master/LICENSE
...

Now let’s rename our directory and clean up our Zip file.

Listing 7.5: Rename the Docker collectd plugin directory

$ sudo mv docker-collectd-plugin-master docker
$ sudo rm master.zip

This will result in our Docker collectd plugin being installed in the /usr/lib/collectd/docker directory.

The plugin has some Python library prerequisites that we also need to install. We do this via the Python pip command. If you don't have the command on your host you can install it via the python-pip package on Ubuntu. On Red Hat you will need to enable the EPEL package repository before installing the python-pip package.

Listing 7.6: Installing the docker-collectd prerequisites via pip

$ cd /usr/lib/collectd/docker
$ sudo pip install -r requirements.txt

This will install some additional Python libraries including the Docker Python API library that the plugin uses to connect to the Docker API.

Configuring the Docker collectd plugin

Now that we’ve installed the plugin we need to configure it. Let’s add a file to hold our configuration.

Listing 7.7: Creating the /etc/collectd.d/docker.conf file

$ sudo touch /etc/collectd.d/docker.conf

Let’s now populate this file with our configuration.


Listing 7.8: Configuring the Docker collectd plugin

TypesDB "/usr/lib/collectd/docker/dockerplugin.db"
LoadPlugin python

<Plugin python>
  ModulePath "/usr/lib/collectd/docker"
  Import "dockerplugin"

  <Module dockerplugin>
    BaseURL "unix://var/run/docker.sock"
    Timeout 3
  </Module>
</Plugin>

<Plugin processes>
  Process "docker"
</Plugin>

This collectd configuration is a little more complex than what we've previously seen.

The first option, TypesDB, extends collectd's base list of metrics. The collectd daemon has a types database, generally found at /usr/share/collectd/types.db, that defines each metric collectd can generate by name and type. The TypesDB option allows you to specify a file of new metrics that will be appended to the core list of metrics.

Due to a quirk in collectd, when we specify custom metrics in a new types database we need to explicitly specify the default types database. To do this we add the following line to our /etc/collectd/collectd.conf on the host running Docker:

Listing 7.9: Adding the TypesDB default configuration

TypesDB "/usr/share/collectd/types.db"

The second option, LoadPlugin, loads the python plugin we’re using to execute our Docker collectd plugin.

We then use the <Plugin python> block to tell collectd where to find the Docker collectd plugin and how to load it. We specify the path to the plugin and we use the Import command to import the specific plugin file, dockerplugin. This is the name of the file that contains the Python code that gets the Docker container metrics (we drop the .py extension). Inside this block we then specify a <Module dockerplugin> block to configure the Docker collectd plugin itself. We use the BaseURL option to specify the path to the Docker daemon's socket—by default it is usually at /var/run/docker.sock. We specify the Timeout option to control when to time out requests to the Docker daemon.

We also add some configuration for the processes plugin to enable monitoring of the Docker daemon.

Listing 7.10: Monitoring the Docker daemon

<Plugin processes>
  Process "docker"
</Plugin>

This builds on the configuration from Chapter 5 and will ensure we’re notified when the Docker daemon itself is not running.


We then restart collectd to enable the plugin and our process monitoring.

Listing 7.11: Restarting collectd to enable the Docker plugin

$ sudo service collectd restart

You should see some log entries in the /var/log/collectd.log log file to reflect the Docker collectd plugin being started.

Listing 7.12: The Docker collectd plugin log output

...
[2015-08-25 06:43:22] Collecting stats about Docker containers from unix://var/run/docker.sock (API version 1.20; timeout: 3s).
[2015-08-25 06:43:22] Initialization complete, entering read-loop.
...

Now our events should be collected from the Docker stats API by the plugin and passed through to Riemann for processing.

Processing Docker collectd statistics with Riemann

Our Docker collectd events are a little different from our other collectd events—they are generated from containers run by our Docker daemon on a host. Let's create a new Docker container and start to monitor it to see how this works. We're going to assume you already have Docker installed.


We’re going to run a new container that runs a Redis database. Let’s start our container now.

Listing 7.13: Starting our Redis container

$ docker run --name redis -d redis
Unable to find image 'redis:latest' locally
latest: Pulling from library/redis
...
Status: Downloaded newer image for redis:latest
341163d2a74f3c3c8c7719bc5e226554d684f1d82f181a6689d99dc257167960

We can see we've launched a new container in daemonized mode, the -d flag, with a name of redis and using the image redis. Our launch has downloaded the image, and we've run a container from the image with an ID of:

341163d2a74f3c3c8c7719bc5e226554d684f1d82f181a6689d99dc257167960

Let’s see a bit more about that container.

Listing 7.14: Running docker ps

$ docker ps
CONTAINER ID  IMAGE  COMMAND                 CREATED     STATUS     PORTS     NAMES
341163d2a74f  redis  "/entrypoint.sh redis"  2 mins ago  Up 2 mins  6379/tcp  redis

With the container running, the Docker collectd plugin queries the Docker API endpoint at containers/341163d2a74f/stats via the Docker Unix socket. Let's look at one of the events from our Docker collectd plugin to see what this means. Inside our riemann.config we can capture just our Docker collectd events by using a where filtering stream.

Listing 7.15: A where filter for our Docker collectd events

(where (= (:plugin event) "docker") #(info %))

Here we've selected all events with a :plugin field value of docker and told Riemann to pass them to the log file. If we reload or restart Riemann we'll start to see events like the following in the log file.

Listing 7.16: A Docker collectd Riemann event

{:host docker1, :service docker-redis/cpu.percent, :state ok, :description nil, :metric 0.06, :tags [collectd], :time 1440563513, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type cpu.percent, :plugin_instance redis, :plugin docker}

We see that the event has the :host field set to docker1. That's the name of the host running the Docker daemon. The :service field has a value of docker-redis/cpu.percent. The first half of the :service field, before the /, is a combination of docker- and the name of a container, redis in this case. The second half is the name of the metric being sent—here that's cpu.percent or CPU percentage. This is constructed by the Docker collectd plugin.


NOTE Our event has also gotten the default TTL of 60 seconds provided by the default function we defined in Chapter 3.

This isn't overly useful. If we want to do something with this event, such as send it to Graphite, any metric path would be constructed using the :host and :service pair. Every Docker container would appear as a sub-metric underneath our docker1 host rather than a host in its own right. The prefix in the :service field can also make parsing metrics confusing. Additionally, if we want to notify about a specific Docker container, ensuring the :host and :service fields are set correctly will make that much easier, and means we don't need to build custom logic to cater for Docker containers.

To counteract all this, and make each Docker container a first-class citizen in our metrics, we need to update the :host and :service fields with the name of the container and the name of the metric being collected. We're also going to move any Docker containers to their own metric namespace. Remember the productiona.hosts prefix we added in Chapter 4? We're going to update that prefix to be productiona.docker for any Docker containers. Thankfully this will be pretty easy. The host and service information is contained in two other fields in the event: the container name is specified in the :plugin_instance field and the metric name in the :type field. We just need to apply the values in these fields to our :host and :service fields.

To do this we’re first going to need to exclude these events from being graphed with our other collectd events so we don’t end up with duplicate events being graphed. To exclude these events let’s update the where stream we created for our previous collectd events.


Listing 7.17: The updated collectd Graphite where stream

(tagged "collectd"
  (where (not (= (:plugin event) "docker"))
    (smap rewrite-service graph))

  (where (= (:plugin event) "docker")
    (smap #(assoc % :host (:plugin_instance %)
                    :service (cond-> (str (:type %))
                               (:type_instance %)
                               (str "." (:type_instance %))))
          (smap rewrite-service graph))))

We’re still grabbing all the events tagged with collectd. Inside the original tagged stream we’re now further selecting events with two additional where streams.

The first where stream grabs any events not from our Docker plugin. These events are passed to an smap function, have their metric paths rewritten by the rewrite-service function, and then are passed to the graph var to be graphed in Graphite.

The second where stream selects only the Docker plugin events, taking all events with the :plugin field value set to docker. Inside that stream we’ve again used an smap stream. Remember that the smap stream is a streaming map. It takes events, applies a function, and then passes them on. Let’s look at what happens in some more detail.


Listing 7.18: The Docker collectd plugin smap and graph

(where (= (:plugin event) "docker")
  (smap #(assoc % :host (:plugin_instance %)
                  :service (cond-> (str (:type %))
                             (:type_instance %)
                             (str "." (:type_instance %))))
        (smap rewrite-service graph)))

Our smap takes the events and passes them to the assoc function. The assoc function remaps some fields in our event. We first replace the value of the :host field with the value of the :plugin_instance field. This replaces our existing :host field value of docker1 with a value of redis, the container name.

We next replace the value of the :service field by combining the :type and optionally the :type_instance field. We have to do this because the collectd plugin breaks up the type of metric between the :type and :type_instance fields. But several metrics don't define a :type_instance field—so to construct the :service field we need to specify the :type field and optionally the :type_instance, if it exists, separated with a .. To do this we use the cond-> and str functions. The cond-> function takes an expression and a set of test/expression pairs, and the str function returns a string. Combined, we test whether the :type_instance field is present and, if it is, append it to the :type string with a . separator. Otherwise we only write the field that is present—by default, that's :type. Our new event will look like this for an event with only a :type field:


Listing 7.19: The updated Docker collectd :type Riemann event

{:host redis, :service cpu.percent, :state ok, :description nil, :metric 0.06, :tags [collectd], :time 1440563513, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type cpu.percent, :plugin_instance redis, :plugin docker}

We see that the :host field is now set to redis and the :service field is now set to cpu.percent, which provides a simpler, clearer metric to use in Riemann and Graphite. Alternatively, an event with both a :type and a :type_instance field will look like this:

Listing 7.20: The updated Docker collectd :type_instance Riemann event

{:host redis, :service memory.stats.total_pgpgout, :state ok, :description nil, :metric 2133.0, :tags [collectd], :time 1440563513, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance total_pgpgout, :type memory.stats, :plugin_instance redis, :plugin docker}
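To see the cond-> construction on its own, here's a minimal standalone sketch. The service-name helper and the abbreviated event maps are illustrative only and aren't part of the book's configuration.

;; A minimal sketch of the service name construction used above.
(defn service-name
  [{:keys [type type_instance]}]
  ;; Start with the :type string and append "." plus :type_instance
  ;; only when :type_instance is present.
  (cond-> (str type)
    type_instance (str "." type_instance)))

(service-name {:type "cpu.percent"})
;; => "cpu.percent"

(service-name {:type "memory.stats" :type_instance "total_pgpgout"})
;; => "memory.stats.total_pgpgout"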

The last thing our smap stream does is pass the event into (smap rewrite-service graph) to be sent onto Graphite and processed. This will create a new host in Graphite for each Docker container running on the Docker daemon and will start populating metrics data for that host. With this in mind we want to make our next change: updating the metric name prefix. In Chapter 4 we configured the graph var to prefix our metric names with the environment and a type of metric, in our current case: productiona.hosts.


This was done by adding a function to the graph var’s :path option. Let’s revisit the code in:

/etc/riemann/examplecom/etc/graphite.clj

And specifically the add-environment-to-graphite function.

Listing 7.21: The graph var from Chapter 4

(defn add-environment-to-graphite [event]
  (str "productiona.hosts." (riemann.graphite/graphite-path-percentiles event)))

(def graph
  (async-queue! :graphite {:queue-size 1000}
    (graphite {:host "graphitea"
               :path add-environment-to-graphite})))

We see that the add-environment-to-graphite function prefixes any metric name with productiona.hosts. Let's update this code to add a new prefix, productiona.docker, to any Docker container metrics. This will group all of our Docker containers under their own metric namespace.


Listing 7.22: The updated graph var

(defn add-environment-to-graphite [event]
  (if (= (:plugin event) "docker")
    (str "productiona.docker." (riemann.graphite/graphite-path-percentiles event))
    (str "productiona.hosts." (riemann.graphite/graphite-path-percentiles event))))

(def graph
  (async-queue! :graphite {:queue-size 1000}
    (graphite {:host "graphitea"
               :path add-environment-to-graphite})))

We see that we've updated our add-environment-to-graphite function to add a conditional clause. We've used Clojure's if function to test if the :plugin field of the event is set to docker. The if conditional takes a test, returning one result if the test is true and another if it is false. In our add-environment-to-graphite function, if the test is met then the metric name will be prefixed with productiona.docker. If not, the metric name will be prefixed with productiona.hosts, our previous default.

Now our metrics will be sent through to Graphite with an appropriate host, service, and path set, for example productiona.docker.redis.cpu.percent.

Adding metadata to our Docker events

Our rewrite of the :host and :service fields helps us in many cases with our Docker containers. But it doesn't address a common Docker use case: multi-container services. A common deployment model for Docker containers is to run many containers as part of a clustered service—for example, a cluster of web servers or a cluster of caching servers.


We’re going to address this concern by adding metadata to our Docker containers and then consuming that metadata inside Riemann. The preferred method of adding metadata to a Docker container or image is via a label. Docker labels can be added to the Docker daemon or to a Docker image using the LABEL instruction in your Dockerfile.

Listing 7.23: A Dockerfile label

...
LABEL com.example.version="2.7.3"
...

They can also be added at runtime using the --label flag on the docker run command.

Listing 7.24: Adding a label at runtime

$ sudo docker run --name daemon_dave --label com.example.application="tornado" -d run

Docker recommends that label keys be specified with namespaces that use reverse domain notation, for example: com.example.label

This reduces the risk of name collisions. The Docker team recommends some other best practices for label keys:

• Keys should only consist of lower-cased alphanumeric characters, dots, and dashes (for example, [a-z, 0-9, - and .]).
• Keys should start and end with an alphanumeric character.


• Keys may not contain consecutive dots or dashes.

Let’s recreate our redis container with a label.

Listing 7.25: Rerunning the redis container

$ sudo docker run --name redis --label com.example.application="tornado" -d redis

Here we’ve created a new redis container with the label of: com.example.application="tornado"

We inspect our container with the docker inspect command and see the label.

Listing 7.26: Inspecting the redis container

$ sudo docker inspect -f "{{json .Config.Labels }}" redis
{"com.example.application":"tornado"}

We’ve inspected our redis container with the -f flag to select only the "{{json .Config.Labels }}" section of our container configuration. Our single label has been returned:

"com.example.application":"tornado"

Our collectd plugin detects the presence of any labels and adds them to every event collected by the plugin. Unfortunately, collectd's event format is not readily extendable so it adds the labels to the :plugin_instance field as a suffix to the existing content, like so:

:plugin_instance "redis[com.example.application=tornado]"


NOTE Even worse than the lack of extendability, collectd's existing fields have a 64-character limit. This means one or more labels that total more than 64 characters will be truncated.

The collectd Docker plugin does this by creating a formatted dictionary of comma-separated key/value pairs.

Listing 7.27: The Docker collectd plugin format

def _d(d):
    """Formats a dictionary of key/value pairs as a comma-delimited
    list of key=value tokens."""
    return ','.join(['='.join(p) for p in d.items()])

NOTE I’ve added this capability in a fork of the Docker collectd plugin. It’s in a pull request on the upstream plugin.

These events are then sent to Riemann. We'll need to update our Docker event processing to extract this data. To do this, we're going to create new functions at the top of our existing /etc/riemann/examplecom/etc/collectd.clj file.

Let's start with two functions to extract any labels from the :plugin_instance field and convert them into a map. The first function does the label extraction; we've called it docker-attribute-map.


Listing 7.28: The docker-attribute-map function

(ns examplecom.etc.collectd
  (:require [clojure.tools.logging :refer :all]
            [riemann.streams :refer :all]
            [clojure.string :as str]
            [clojure.walk :as walk]))

(defn docker-attribute-map
  "Parses labels from the collectd plugin_instance field"
  [attributes]
  (let [instance (str/split (str/replace attributes #"^.*\[(.*)\]$" "$1") #",")]
    (walk/keywordize-keys
      (into {}
            (for [pair instance]
              (apply hash-map (str/split pair #"=")))))))

Our existing collectd.clj file already contains require statements for the clojure.string, riemann.streams, and clojure.tools.logging libraries. We've now added an additional require for the clojure.walk library. This library contains a function called keywordize-keys, which converts map keys from strings to keywords; we'll use it shortly to convert our Docker label keys.

Next, we define the docker-attribute-map function. It takes an argument called attributes, which is the value of the :plugin_instance field. We start the function with a let statement that creates a new var called instance. It uses a regular expression to extract the labels contained within the field: it extracts the content between the [] and then splits the result on any commas. We then use this new var in our next line, which loops through the instance var, extracts each key/value pair, splits it on =, and inserts it into a map. We wrap the final result with clojure.walk's keywordize-keys function, which turns our keys into keywords. We end up with a map like:

Listing 7.29: The label map

{:com.example.application "tornado"}
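As a quick check, calling the function directly from a REPL with a sample plugin_instance value would look roughly like this (a sketch, not a listing from the book):

(docker-attribute-map "redis[com.example.application=tornado]")
;; => {:com.example.application "tornado"}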

Now let's add a second function, below our docker-attribute-map function. We'll call this function docker-attributes. This function's job is to check if our event has any attributes and, if it does, send it off to be mapped. We do this because not every Docker event will have attributes added, and we only want to process those that do.

Listing 7.30: The docker-attributes function

(defn docker-attributes
  [{:keys [plugin_instance] :as event}]
  (if-let [attributes (re-find #"^.*\[.*\]$" plugin_instance)]
    (merge event (docker-attribute-map attributes))
    event))

Our docker-attributes function takes an event as its argument and destructures the :plugin_instance field from it. Destructuring is a new Clojure construct for us: it's a way to extract values from a data structure and bind them to symbols without having to explicitly traverse the whole structure. It's commonly used with vectors, maps, and function arguments, as we've done here. Our argument contains the :keys keyword, a shortcut that extracts the values of the listed keys and binds them to symbols of the same name, and the :as keyword, which binds the whole map to a symbol.


Listing 7.31: Destructuring keys

[{:keys [plugin_instance] :as event}]

So here we’re expecting that our docker-attributes function will take an event as an argument. We’ll extract the contents of the :plugin_instance field from that event and assign it to a symbol of event.

TIP We’ve already seen one kind of destructuring earlier in the book when we used & to specify optional arguments in Clojure functions.
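If destructuring is still a bit opaque, here's a tiny standalone sketch; the show-instance function and the sample map are made up for illustration.

;; Bind the value of :plugin_instance to a local symbol while keeping
;; the whole map available as event.
(defn show-instance
  [{:keys [plugin_instance] :as event}]
  (str "instance " plugin_instance " from " (count event) " fields"))

(show-instance {:plugin_instance "redis" :plugin "docker"})
;; => "instance redis from 2 fields"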

We then use a variant of the let expression called if-let. The if-let expression is a conditional binding.

Listing 7.32: The if-let binding

(if-let [attributes (re-find #"^.*\[.*\]$" plugin_instance)] ...)

The if-let expression will test if the :plugin_instance field contains attributes after the container name, for example: redis[com.example.application=tornado]

If the test is true, then it evaluates the first condition, with the binding (here attributes) bound to the value of the test. If it is false then it evaluates an else condition. Our conditions are:


Listing 7.33: The if-let conditions

(merge event (docker-attribute-map attributes))
event))

The first condition is used if the test, whether the field contains attributes, is true. We merge our existing event with the map of attributes created by the docker-attribute-map function. This extracts the attributes from the :plugin_instance field and creates them as new fields in the event, so that:

{:plugin_instance redis[com.example.application=tornado]}

Will become:

{:com.example.application tornado}

Our second condition is our else condition, if the result of the test is false. This condition returns the existing event, unchanged.
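Here's a small standalone sketch of if-let itself, using a made-up label string rather than a real event:

;; if-let binds the result of the test when it's truthy, otherwise it
;; evaluates the else branch.
(if-let [match (re-find #"\[.*\]" "redis[com.example.application=tornado]")]
  (str "found attributes: " match)
  "no attributes")
;; => "found attributes: [com.example.application=tornado]"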

Finally, let's add a third function to replicate our existing event parsing and rewriting of the :host and :service fields. We'll call this function parse-docker-service-host and add it below our docker-attribute-map and docker-attributes functions in the collectd.clj file.


Listing 7.34: The parse-docker-service-host function

(defn parse-docker-service-host
  [{:keys [type type_instance plugin_instance] :as event}]
  (let [host (re-find #"^\w+\.?\w+\.?\w+" (:plugin_instance event))
        service (cond-> (str (:type event))
                  (:type_instance event) (str "." (:type_instance event)))]
    (assoc event :service service :host host)))

We again use destructuring to extract the :type, :type_instance, and :plugin_instance fields, binding the whole incoming map to a symbol called event. We then bind some vars using let, reusing the code we developed earlier in the chapter to rewrite our :host and :service fields. Finally, we use the assoc function to update the fields with their new content.
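As a quick, hedged check, calling the function on a made-up, abbreviated event fragment (not a full Riemann event) would look roughly like this:

(parse-docker-service-host {:plugin_instance "redis[com.example.application=tornado]"
                            :type "cpu.percent"})
;; => the same map with :host "redis" and :service "cpu.percent" added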

To make use of these functions let’s now update the code that parses our Docker events in riemann.config.

Listing 7.35: Processing our Docker events

(where (= (:plugin event) "docker")
  (smap (comp parse-docker-service-host docker-attributes rewrite-service) graph))

We still select our Docker events with a where stream. We then pass our events into an smap. We use a new function: comp, which is short for composition. The comp function returns a composition of all the listed functions, applied right to left. So our smap sends the event through the rewrite-service function, then the docker-attributes function, and finally the parse-docker-service-host function. The combined event, which should now be fully processed, is then sent onto the graph var and then to Graphite.
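To see comp's right-to-left application in isolation, here's a tiny standalone sketch with made-up functions:

;; inc runs first, then the doubling: ((comp f g) x) is (f (g x)).
(def add-one-then-double (comp #(* 2 %) inc))

(add-one-then-double 4)
;; => 10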

Now a Docker event with a label will look like:

Listing 7.36: A rewritten Docker event Mark II

{:host redis, :service memory.stats.hierarchical_memory_limit, :state ok, :description nil, :metric 1.8446744073709552E19, :tags [collectd], :time 1458310314, :ttl 60.0, :com.example.application tornado, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance hierarchical_memory_limit, :type memory.stats, :plugin_instance redis[com.example.application=tornado], :plugin docker}

Our :host and :service fields have been rewritten to make our metrics more parseable. We also see a new field parsed from our label:

:com.example.application tornado

We can now make use of that field (or fields from other labels) in checks. For example, we can group checks by all the containers that specify a given label. We can also use that label or others to bundle together metrics from specific applications in Graphite. We can do this by further updating our metric name rewriting. Let's go back to our add-environment-to-graphite function in the /etc/riemann/examplecom/etc/graphite.clj file and make a change to use the contents of the com.example.application field in our metric name.


Listing 7.37: Updated our Graphite metric name write

(defn add-environment-to-graphite [event]
  (condp = (:plugin event)
    "docker" (if (:com.example.application event)
               (str "productiona.docker." (:com.example.application event) "."
                    (riemann.graphite/graphite-path-percentiles event))
               (str "productiona.docker." (riemann.graphite/graphite-path-percentiles event)))
    (str "productiona.hosts." (riemann.graphite/graphite-path-percentiles event))))
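Before walking through the change, here's a tiny standalone sketch of condp itself; the function name and return values are made up for illustration.

;; condp compares its expression against each test value in turn and
;; falls through to the trailing default when nothing matches.
(defn metric-prefix [plugin]
  (condp = plugin
    "docker" "productiona.docker."
    "productiona.hosts."))

(metric-prefix "docker") ;; => "productiona.docker."
(metric-prefix "cpu")    ;; => "productiona.hosts."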

We've replaced our earlier if with a condp conditional and included a further if clause. We check if the com.example.application field exists and, if it does, we adjust our Graphite metric name to productiona.docker.tornado.container.metric.name, assuming the value of the com.example.application field is tornado.

If the container is stopped or killed then the data will stop being sent, but any existing data will still be retained in Graphite. Let's look at how we could clean up some metrics.

Specifying different resolution for Docker metrics

One of the things we may consider is a shorter retention period for our more numerous Docker metrics. We do that by configuring a Carbon Storage schema specifically for Docker metrics. Remember in Chapter 4 that we defined a default schema in the /etc/carbon/storage-schemas.conf file.

Listing 7.38: The existing Carbon schemas

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[default]
pattern = .*
retentions = 1s:24h, 10s:7d, 1m:30d, 10m:1y, 1h:2y

We specify an additional schema specifically for Docker. Carbon Storage schemas are evaluated in top-down order in the storage-schemas.conf file, and the first schema that matches a metric is the one that is applied. Let's add that new schema.


Listing 7.39: A new storage schema for Docker events

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[docker]
pattern = ^production.\.docker\.
retentions = 1s:24h, 10s:7d, 1m:30d

[default]
pattern = .*
retentions = 1s:24h, 10s:7d, 1m:30d, 10m:1y, 1h:2y

We've inserted a new schema called [docker] into the file. It has a pattern of ^production.\.docker\. which will match all incoming Docker metrics to our potential production environments, for example productiona, productionb, etc. Unlike our [default] schema, the retentions are shorter. We keep Docker metrics at a high resolution: one second for 24 hours, 10 seconds for seven days, and finally one minute for 30 days, but that's it. Beyond that we don't keep any metrics. This means our Docker metric files should be smaller. Let's calculate that now using J. Javier Maestro's Whisper calculator.
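For example, an incoming metric named productiona.docker.redis.cpu.percent matches the [docker] pattern first and so gets the shorter retentions, while a host metric like productiona.hosts.riemanna.cpu.user falls through to the [default] schema.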

Listing 7.40: Calculating our new Docker metric file size

$ python ./whisper-calculator.py 1s:24h,10s:7d,1m:30d
1s:24h,10s:7d,1m:30d >> 2281012 bytes


We see that each Docker metric Whisper file will take up 2281012 bytes, a saving of 1261452 bytes over the 3542464 bytes the same file would need under our [default] schema.

To enable the new schema we’ll need to restart the Carbon Cache daemons. We’ll also need to delete any existing Docker metric Whisper files. Changes in schema are not automatically propagated to the Whisper files. If you don’t want to delete the files you can also use the Whisper resize utility.

Cleaning up old Graphite Docker metrics

The state of the art for managing Graphite metric files is not advanced. Since Graphite stores metrics in files, the fastest and easiest way to manage expired or no-longer-updated metrics is to manage the files. We do that with some find commands.

Listing 7.41: A find command to clean up old Graphite metrics

$ find /var/lib/graphite/whisper -type f -mtime +10 -name \*.wsp -delete; find /var/lib/graphite/whisper -depth -type d -empty -delete

Here we find all Whisper files last updated at least 10 days ago and delete them. The second find command deletes any empty directories left behind. If we want to limit the search to just our Docker files, we can update the command to target specific paths, such as all the Docker metrics in the productiona environment.


Listing 7.42: A find command to clean up Docker specific Graphite metrics

$ find /var/lib/graphite/whisper/productiona/docker -type f -mtime +10 -name \*.wsp -delete; find /var/lib/graphite/whisper/productiona/docker -depth -type d -empty -delete

This command narrows the find results to a sub-directory.

Using Docker metrics for monitoring

Now that we have container metrics flowing into Riemann we can make use of them for monitoring. Many of the same checks we looked at in Chapters 5 and 6 are equally appropriate for Docker and will carry over without requiring any reworking.

Firstly, we're monitoring the Docker daemon process, docker, itself. Earlier in this chapter we enabled the processes plugin to monitor the Docker daemon process. As a result of the threshold configuration we set up in Chapter 6, if the process drops below the 1 process threshold we set, a notification will be triggered.

Secondly, the metrics-based checks we built in Chapter 6 will also work. Let's look at how we could use the check_percentiles function we created in Chapter 6 on Docker containers to monitor some general container metrics.


Listing 7.43: Using the check_percentiles function for Docker

(where (= (:plugin event) "docker")
  (smap #(assoc % :host (:plugin_instance %)
                  :service (cond-> (str (:type %))
                             (:type_instance %)
                             (str "." (:type_instance %))))
    (by :host
      (sdo
        (check_percentiles "cpu.percent" 10
          (smap rewrite-service graph)
          (where (and (service "cpu.percent 0.99") (> metric 80.0))
            (email "[email protected]")))
        (check_percentiles "memory.percent" 10
          (smap rewrite-service graph)
          (where (and (service "memory.percent 0.99") (> metric 80.0))
            (email "[email protected]")))))))

Here we're selecting all events from the collectd docker plugin using a where stream. We've then used the smap function to rewrite the :host and :service fields as we did earlier in the chapter. We split the events, via the by function and the :host field, into a separate stream for each container. We then pass the rewritten events to the sdo stream. The sdo stream takes events and sends them to all child streams specified beneath it—here that's two check_percentiles functions. We're looking specifically for the cpu.percent and memory.percent services and specifying a 10-second window. Remember, check_percentiles runs the percentile function over the incoming events and generates the median, 95th, 99th, and max percentiles of the metric over the time window specified.

Inside this check we're first sending our new percentile metrics to Graphite via an smap. We're then using a where stream matching on one of the new percentile events—for example, on cpu.percent 0.99 with a :metric value greater than 80.0, or 80%. Any matching event that exceeds the threshold is then emailed as a notification. From here we can add more checks, in line with our other host- and service-based checks, to extend our monitoring framework to Docker containers.

Other container monitoring tools

Given Docker’s youth, the monitoring ecosystem around it is relatively immature. Most of the major SaaS-based monitoring tools, such as New Relic and DataDog, have added Docker integrations with varying levels of depth.

There are not many standalone open-source tools available yet, but there are some notable initial entrants.

• cAdvisor — A Google-developed tool that analyzes the resource usage and performance characteristics of running containers. Can run as a container on a Docker server.
• Heapster — If you're using Kubernetes, Heapster does resource analysis and monitoring of container clusters. It also has a native Riemann output.
• collectd cgroups — Docker uses cgroups or control groups to help allocate resources like CPU. You can use the collectd cgroups plugin to gather this information.


Summary

In this chapter we’ve looked at Docker and monitoring containers. We’ve explored how Docker generates statistics via its binary and API. We also installed a plugin for collectd that enables us to collect those statistics, turn them into metrics, and send them to Riemann.

Inside Riemann we’ve munged these metrics to make them easier to use and then sent them onto Graphite. We’ve also seen how we use typical host-based checks with our Docker containers to detect issues and send notifications.

In the next chapter we’re going to add the next layer of our monitoring framework, a logging system, to the environment. As part of that chapter we’ll also see how to collect logs from Docker.

Chapter 8

Logs and logging

In Chapters 5, 6, and 7 we looked at host and container-based monitoring. In this chapter we're going to add the next layer of our framework: logging. While our hosts, services, and applications generate crucial metrics and events, they also often generate logs that can tell us useful things about their state and status. Additionally, our logs are incredibly useful for diagnostic purposes when investigating an issue or incident. We're going to capture these logs, send them onto a central store, and make use of them to detect issues and provide diagnostic assistance. We're also going to generate metrics and graphs from our logs.

We're going to build a log management platform to complement the other components of our monitoring framework. We'll collect and send some of our logs to Riemann and Graphite, to be recorded as metrics. We'll also see how to integrate Docker containers into our logging.

In considering a log management solution, we need to choose something that:

• Provides lightweight log collection that does not interfere with the actual running of our hosts and applications (refer to the observer effect we talked about in Chapter 2). It should ship data quickly and avoid queuing important information that we need.


• Is a scalable and high-performing platform capable of processing large volumes of logs and storing them for an appropriate retention period to provide diagnostic value.
• Is able to parse and manipulate logs. The varying formats of logs require the ability to normalize aspects like time and date, tags, hosts, and other information.
• Can integrate with everything else we’ve built, especially to pass events and metrics into Riemann and vice versa.

To address these requirements we’re going to introduce you to the Elasticsearch-Logstash-Kibana or ELK stack.

Introducing Elasticsearch, Logstash, and Kibana

So why did we choose ELK? ELK is fast, easy to set up, and is modular and flexible. It allows collection from many sources and can transform and normalize our logs. It also comes with a powerful, searchable storage system and an excellent visualization interface. The ELK stack is made up of three components:

• Elasticsearch — A document search store.
• Logstash — A log-routing and management engine.
• Kibana — A web-based dashboard and visualization tool.

The first component, Logstash, provides an integrated framework for log collection, centralization, parsing, storage, and search. Logstash is free and open source (Apache 2.0 licensed), and developed by Jordan Sissel and the team at Elastic. Logstash is written in JRuby and runs on top of the Java Virtual Machine (JVM). It’s easy to set up, high performing, scalable, and easy to extend.


Logstash has a wide variety of input mechanisms: it can take inputs from TCP/UDP, files, Unix Syslog, Microsoft Windows EventLogs, STDIN, and a variety of other sources. As a result there’s likely little in your environment from which you can’t extract logs that can be sent to Logstash. When those logs hit the Logstash server, there is a large collection of filters that allow for modification, manipulation, and transformation of those events. Information can be extracted from logs to give them context or normalize them across multiple sources. Finally, when outputting data, Logstash supports a huge range of destinations, including TCP/UDP, email, files, HTTP, Nagios, and a wide variety of network and online services, including Riemann. Most usefully though, Logstash integrates with the search tool, Elasticsearch.

Elasticsearch is a powerful text-indexing and search tool. As the Elastic team puts it: “Elasticsearch is a response to the claim: ’Search is hard.’” Elasticsearch is easy to set up, has search and index data available RESTfully as JSON over HTTP, and is easy to scale and extend. It’s released under the Apache 2.0 license and is built on top of Apache’s Lucene project. The best metaphor for Elasticsearch is the index of a book. You flip to the back of the book, look up a word, and then find the reference to a page. That means that rather than searching text strings directly, it creates an index from incoming text and performs searches on the index, not the content. The result? It’s fast.

NOTE This is a simplified explanation. See the Elasticsearch site for more information and exposition.
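As a rough illustration of that RESTful interface, we could index a test document and then search for it over HTTP. This is only a sketch: the test-index and test-doc names are made up, and it assumes an Elasticsearch node listening on localhost:9200, which we won’t actually have until later in this chapter.

# Store a small JSON document, then search the index for it
$ curl -XPUT 'http://localhost:9200/test-index/test-doc/1' -d '{ "message": "Search is hard" }'
$ curl 'http://localhost:9200/test-index/_search?q=message:hard&pretty'

The first command stores a small JSON document; the second queries the index and, after a brief refresh interval, returns any matching documents as JSON.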

The last piece of the stack is Kibana. Kibana is a dashboard and visualization interface that attaches to Elasticsearch. It is primarily used as an interface for Logstash events, but can query any data stored in Elasticsearch. Kibana can create graphs and dashboards. It’s also licensed under the Apache 2.0 license. It comes with its own web server and can be run on any host that can connect to our Elasticsearch back end. We’re going to install, configure, and deploy each of these components. We’ll then configure our hosts to send logs through to our ELK stack. We’ll also send some events and derived metrics from the ELK stack onto our Riemann installation.

TIP If Logstash and logging interests you, or if this chapter lacks sufficient detail for you, we have another whole book on the topic: The Logstash Book.

Logstash architecture

Logstash is written in JRuby and runs in a Java Virtual Machine (JVM). Its architecture is message based and simple. In our Logstash logging framework there are four elements:

• Shipping — Sends logs from our host to a Logstash server.
• Indexing — Receives logging events and indexes them.
• Storage — Elasticsearch servers store and make logs searchable.
• Visualization — Kibana allows us to build graphs and dashboards.

We’re going to install each component starting with the Indexing and Storage components, and then we’ll connect our hosts to Logstash. In each environment we’re going to run a Logstash server and an Elasticsearch cluster. We’re going to install Logstash on a host aptly named logstasha.example.com and create a three-node Elasticsearch cluster to store our logs: esa1.example.com, esa2.example.com, and esa3.example.com. This would be our Production A Logstash environment. We’d create duplicates in Production B, etc.

Figure 8.1: Logstash Architecture

NOTE We’ll also install collectd on every host using the instructions from Chapter 5. This will allow us to monitor our monitoring.


Installing Logstash

We’ll start by installing Logstash on our logstasha.example.com host, but to do so, we need to install prerequisites for Logstash. Logstash’s only prerequisite is Java. Let’s install it.

On Debian & Ubuntu

We install Java via the apt-get command:

Listing 8.1: Installing Java on Debian and Ubuntu

$ sudo apt-get -y install default-jre

On Red Hat

We install Java via the yum command:

Listing 8.2: Installing Java on Red Hat

$ sudo yum install java-1.7.0-openjdk

TIP On newer Red Hat and family versions the yum command has been replaced with the dnf command. The syntax is otherwise unchanged.


Testing Java is installed

We then test that Java is installed via the java binary:

Listing 8.3: Testing Java is installed

$ java -version
java version "1.7.0_09"
OpenJDK Runtime Environment (IcedTea7 2.3.3) (7u9-2.3.3-0ubuntu1~12.04.1)
OpenJDK Client VM (build 23.2-b09, mixed mode, sharing)

Installing the Logstash package

Once we have Java installed we install the Logstash package.

On Ubuntu and Debian

For Ubuntu and Debian systems we first need to add the Elastic.co public key, like so:

Listing 8.4: Getting the Logstash public key on Ubuntu

$ wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -


We add the following Apt configuration so we can find the Logstash repository.

Listing 8.5: Adding the Logstash packages

$ echo "deb http://packages.elastic.co/logstash/2.2/debian stable main" | sudo tee -a /etc/apt/sources.list.d/logstash.list

Then we update Apt and install the logstash package with the apt-get command.

Listing 8.6: Updating Apt and installing the Logstash package

$ sudo apt-get update
$ sudo apt-get install logstash

On Red Hat

To install Logstash on Red Hat systems we first need to add the Elastic.co public key, like so:

Listing 8.7: Getting the Logstash public key on Red Hat

$ sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch

Then we add the following to our /etc/yum.repos.d/ directory in a file called logstash.repo.


Listing 8.8: The Logstash Yum configuration

[logstash-2.2]
name=Logstash repository for 2.2.x packages
baseurl=http://packages.elastic.co/logstash/2.2/centos
gpgcheck=1
gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1

Now we install Logstash using the yum command.

Listing 8.9: Installing Logstash on Red Hat

$ sudo yum install logstash

Installing Logstash via configuration management

We could also install Logstash via a variety of configuration management tools like Puppet or Chef or via Docker or Vagrant. You can find a Chef cookbook for Logstash:

• https://github.com/lusis/chef-logstash

You can find a Puppet module for Logstash:

• https://github.com/elastic/puppet-logstash

You can find an Ansible role for Logstash:


• https://github.com/valentinogagliardi/logstash-role

You can find Docker images for Logstash:

• https://hub.docker.com/_/logstash/

You can find a Vagrant configuration for Logstash:

• https://github.com/paulczar/vagrant-logstash

Testing Logstash is installed

Logstash is installed into the /opt/logstash directory. Inside that directory is the bin/logstash binary, which runs Logstash. The package also installs basic service management, including a logstash service. Let’s test that Logstash is installed and working using the binary.

Listing 8.10: Testing Logstash is installed

$ /opt/logstash/bin/logstash --version
logstash 2.2.2

Configuring Logstash

Now let’s add some basic configuration to Logstash to get started. We’re going to create a logstash.conf file in the /etc/logstash/conf.d/ directory to hold that configuration.


Listing 8.11: Creating the logstash.conf file

$ sudo vi /etc/logstash/conf.d/logstash.conf

Let’s populate that file now.

Listing 8.12: The logstash.conf configuration file

input {
  stdin { }
}

output {
  stdout {
    codec => rubydebug
  }
}

Our logstash.conf file contains two configuration blocks, one called input and one called output. These are two of three types of plugin components in Logstash that we can configure. The third type is filter. Each type configures a different portion of the Logstash server:

• inputs — How events get into Logstash.
• filters — How events in Logstash can be manipulated.
• outputs — How events from Logstash can be output.

In the Logstash world, events enter via inputs; they are manipulated, mutated, or changed in filters; and they exit Logstash via outputs.


Inside each component’s block we can specify and configure plugins. For example, in the input block above we’ve defined the stdin plugin which enables event input from STDIN. In the output block we’ve configured its opposite: the stdout plugin, which outputs events to STDOUT. For this plugin we’ve added a configuration option, codec, with a value of rubydebug. This outputs each event as a JSON hash, in the style of Ruby’s own debug output (hence the name).

Let’s test this now by running Logstash interactively.

Listing 8.13: Testing Logstash interactively

$ sudo /opt/logstash/bin/logstash -v -f /etc/logstash/conf.d/logstash.conf
Pipeline started {:level=>:info}
testing
{
    "message" => "testing",
    "@version" => "1",
    "@timestamp" => "2015-04-22T02:15:15.107Z",
    "host" => "logstasha"
}

We’ve run the logstash binary and passed it two flags. The -v flag runs Logstash in its verbose mode. The -f flag tells Logstash where to load the configuration file. Our configuration file configures Logstash to receive input from STDIN and then outputs it to STDOUT. We’ve typed testing onto the command line and Logstash has processed it into a Logstash event.

Another useful flag is --configtest. You can pass this flag to test the validity of a Logstash configuration.


Listing 8.14: Testing Logstash configuration validity

$ sudo /opt/logstash/bin/logstash --configtest -f /etc/logstash/conf.d/logstash.conf

Our event has been printed as a hash. Indeed it’s represented internally in Logstash as a JSON hash.

Listing 8.15: Our first Logstash event

{
    "message" => "testing",
    "@version" => "1",
    "@timestamp" => "2015-04-22T02:15:15.107Z",
    "host" => "logstasha"
}

The event has a set of fields and a format. Logstash calls these formats codecs. There are a variety of codecs that Logstash supports.

• plain — Events are recorded as plain text and any parsing is done using filter plugins.
• json — Events are assumed to be JSON. Logstash tries to parse an event’s contents into fields itself with that assumption.

We’re going to focus on the json format as it’s the easiest way to work with Logstash events and show how they can be used. The format is made up of a number of elements. A basic event has the following elements:


• @timestamp — An ISO8601 timestamp.
• message — The event’s message. Here, that’s testing, as that’s what we put into STDIN.
• @version — The version of the event format. The current version is 1.

Many of the plugins we’ll use also add extra fields. For example, the stdin plugin we’ve just used adds a field called host which specifies the host that generated the event. Other plugins—for example, the file input plugin which collects events from files—add fields like path, which reports the path of the file from which we’re collecting. We can also add elements like custom fields, tags, and other context to events.

TIP Running interactively we can stop Logstash using the Ctrl-C key combination.

Now that we’ve generated an event with Logstash we need to be able to send and store that event—and future events. To do that we need to install and configure Elasticsearch.

Installing Elasticsearch

We’re going to install Elasticsearch to provide our search capabilities. We’re going to do this on three hosts in each environment. This allows us to build a cluster of nodes to provide some resiliency for our log storage. We’ll learn a bit more about how Elasticsearch’s clustering works later in this chapter.

Like Logstash, Elasticsearch’s only prerequisite is Java. Let’s install that and then jump straight into installing Elasticsearch itself. We need to repeat these instructions, or preferably use a configuration management tool, on all of our Elasticsearch hosts.

TIP For security reasons, you should ensure you install an Elasticsearch version later than 1.4.4.

On Debian and Ubuntu

We install Java via the apt-get command.

Listing 8.16: Installing Java on Debian and Ubuntu

$ sudo apt-get -y install default-jre

We then test that Java is installed via the java binary:

Listing 8.17: Testing Java is installed

$ java -version
java version "1.7.0_09"
OpenJDK Runtime Environment (IcedTea7 2.3.3) (7u9-2.3.3-0ubuntu1~12.04.1)
OpenJDK Client VM (build 23.2-b09, mixed mode, sharing)

Next we install Elasticsearch itself. The Elastic.co team provides tarballs, RPMs, and DEB packages. You can find the Elasticsearch download page here. We’re going to install Elasticsearch via packages using Elastic.co’s package repository. For Ubuntu and Debian, we first need to install the Elastic.co public key.

Listing 8.18: Installing the public key for Elasticsearch

$ wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

Then add the following repository to our Apt configuration.

Listing 8.19: Adding the Elasticsearch Apt repository

$ sudo sh -c "echo deb http://packages.elastic.co/elasticsearch/2.x/debian stable main > /etc/apt/sources.list.d/elasticsearch.list"

Next we update our Apt repositories and install the elasticsearch package.

Listing 8.20: Installing Elasticsearch on Ubuntu

$ sudo apt-get update
$ sudo apt-get -y install elasticsearch

On Red Hat

On Red Hat, we install Java via the yum command.


Listing 8.21: Installing Java on Red Hat for Elasticsearch

$ sudo yum install java-1.7.0-openjdk

Then we need to add the Elastic.co public key.

Listing 8.22: Adding the Elasticsearch public key on Red Hat

$ sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch

Now add the Elasticsearch package repository to our Yum repositories. Create a file called elasticsearch.repo in the /etc/yum.repos.d/ directory.

Listing 8.23: Adding the Yum repo for Elasticsearch

$ sudo vi /etc/yum.repos.d/elasticsearch.repo

Now populate that file.


Listing 8.24: The Yum repo for Elasticsearch

[elasticsearch-2.x]
name=Elasticsearch repository for 2.x packages
baseurl=http://packages.elastic.co/elasticsearch/2.x/centos
gpgcheck=1
gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1

Now we install the elasticsearch package via the yum command.

Listing 8.25: Installing Elasticsearch on Red Hat

$ sudo yum -y install elasticsearch

Installing Elasticsearch via configuration management

We could also install Elasticsearch via a variety of configuration management tools like Puppet or Chef or via Docker or Vagrant. You can find a Chef cookbook for Elasticsearch:

• https://github.com/elastic/cookbook-elasticsearch.

You can find a Puppet module for Elasticsearch:

• https://forge.puppetlabs.com/elasticsearch/elasticsearch.

You can find an Ansible role for Elasticsearch:


• https://galaxy.ansible.com/list#/roles/1450.

You can find Docker images for Elasticsearch:

• https://hub.docker.com/_/elasticsearch/.

You can find a Vagrant configuration for Elasticsearch:

• https://github.com/comperiosearch/vagrant-elk-box.

Testing Elasticsearch is installed

We test that Elasticsearch is installed by running the elasticsearch binary. The package installs the binary into the bin directory under /usr/share/elasticsearch. Let’s run that binary now.

Listing 8.26: Testing the elasticsearch binary

$ /usr/share/elasticsearch/bin/elasticsearch --version
Version: 2.2.0, Build: 8ff36d1/2016-01-27T13:32:39Z, JVM: 1.7.0_79

The --version flag returns the Elasticsearch version.

Determining Elasticsearch is running

You can tell if Elasticsearch is running by browsing to port 9200 on your host, for example:


Listing 8.27: Checking Elasticsearch is running

http://esa1.example.com:9200

NOTE You may need to start Elasticsearch first, for example sudo service elasticsearch start.

This should return some status information that looks like:

Listing 8.28: Elasticsearch status information

{
  "name" : "esa1",
  "version" : {
    "number" : "2.2.0",
    "build_hash" : "8ff36d139e16f8720f2947ef62c8167a888992fe",
    "build_timestamp" : "2016-01-27T13:32:39Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
  },
  "tagline" : "You Know, for Search"
}

You can also browse to a more detailed status page:


Listing 8.29: Elasticsearch stats page

http://esa1.example.com:9200/_stats?pretty=true

This will return a page that contains a variety of information about the statistics and status of your Elasticsearch server.
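Another quick check is the cluster health endpoint, which returns a compact JSON summary (the host name assumes the esa1 node we’re configuring):

# Query the cluster health API on any node
$ curl 'http://esa1.example.com:9200/_cluster/health?pretty'

The status field in the response will be green, yellow, or red, alongside node and shard counts.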

TIP You can find more extensive documentation for Elasticsearch on the Elastic documentation site.

Configuring our Elasticsearch cluster and nodes

Our next step is to turn our individual Elasticsearch nodes into a cluster. Each Elasticsearch node is started with a default cluster name, elasticsearch, and a random, allegedly amusing node name—for example, “Franz Kafka” or “Spider-Ham.” A new random node name is selected each time Elasticsearch is restarted. New Elasticsearch nodes join any cluster with the same cluster name they have defined. To do this, Elasticsearch makes use of either unicast or multicast discovery to find other nodes.

To enable our own cluster we need to update the defined cluster name in Elasticsearch’s configuration. We’re going to customize our cluster and node names to ensure we have unique names and make sure our nodes join the right cluster.

To do this we need to edit the /etc/elasticsearch/elasticsearch.yml file. This is Elasticsearch’s YAML-based configuration file. We look for the following entries inside the file:


Listing 8.30: Initial cluster and node names

# cluster.name: elasticsearch
# node.name: "Franz Kafka"

We’re going to uncomment and change both the cluster and node name, and configure networking and clustering. We’re going to choose a cluster name of productiona for the environment our cluster is running in. In Production B we’d choose productionb, in Mission Control we’d choose missioncontrol, and so on. This is just a way of labelling the cluster and identifying what data is in it. We then choose a node name matching our node’s host name.

Listing 8.31: New cluster and node names

cluster.name: productiona
node.name: esa1
network.host: [ _local_, _non_loopback:ipv4_ ]
discovery.zen.ping.unicast.hosts: ["esa1.example.com", "esa2.example.com", "esa3.example.com"]

We’ve also specified the network.host option. This controls where Elasticsearch will be bound. In this case we’re binding to the local host and the first non-loopback IPv4 interface. We’re using unicast discovery to connect our Elasticsearch cluster members. We’ve listed each cluster member by host name (DNS will be needed to resolve them) in an array. For the details of this, look at the Discovery section of the /etc/elasticsearch/elasticsearch.yml configuration file. The file is well commented and self-explanatory.


TIP You can also read about Elasticsearch discovery in the Zen discovery guide on the Elasticsearch site.

Elasticsearch clusters have four types of nodes:

• Master-eligible — Nodes that are able to become master nodes and control a cluster.
• Data node — Data nodes hold data and perform data-related operations such as CRUD, search, and aggregations.
• Client node — Does not hold data and cannot become the master node. It behaves as a “smart router” and is used to forward cluster-level requests to the master node, and data-related requests to the appropriate data nodes.
• Tribe node — A tribe node is a special type of client node that can connect to multiple clusters and perform search and other operations across all connected clusters.

We’re only going to look at master and data nodes. By default, a freshly installed node is potentially both a master-eligible node and a data node. In our initial configuration we’re going to leave our nodes in their mixed master-eligible and data mode. This means a master will be automatically elected when the cluster is started. But indexing and searching your data is performance-intensive work. As you expand your cluster this can cause issues on your master node that could impact your cluster’s functionality. To ensure that the master node is stable in a bigger cluster, consider splitting the roles between dedicated master-eligible nodes and dedicated data nodes.

These decisions are controlled by the node.master and node.data configuration options in the /etc/elasticsearch/elasticsearch.yml file. So, if we wished to configure one of our cluster members as the master and have it not store data, we could do this:

Listing 8.32: Configuring our Elasticsearch cluster

cluster.name: productiona
node.name: esa1
node.master: true
node.data: false

We would reverse this configuration to specify a dedicated data-only node.

NOTE You can read more about Elasticsearch nodes here.

We need to restart Elasticsearch to reconfigure it.

Listing 8.33: Restarting Elasticsearch to enable clustering

$ sudo /etc/init.d/elasticsearch restart

We would now configure the remaining nodes: esa2 and esa3.
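With all three nodes restarted, one way to confirm they’ve actually formed a single cluster is the _cat nodes API; querying any member should do.

# List cluster members and their roles
$ curl 'http://esa1.example.com:9200/_cat/nodes?v'

You should see a line for each of esa1, esa2, and esa3, with an asterisk in the master column marking the elected master.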

Adding a cluster management plugin

Elasticsearch has a plugin system. Plugins provide additional capabilities for Elasticsearch servers, including extra management functionality. One of the most useful management plugins is Elasticsearch-head, written by Ben Birch. It’s a web-based cluster management front-end. Let’s add that plugin now.

TIP You can read more about Elasticsearch plugins here.

To add plugins we use the plugin binary from the /usr/share/elasticsearch/bin directory. Let’s do that now on the esa1.example.com host.

Listing 8.34: Installing an Elasticsearch plugin

esa1$ sudo /usr/share/elasticsearch/bin/plugin install mobz/elasticsearch-head

Here we’ve used the install flag and pointed the plugin command at mobz/elasticsearch-head, which is a combination of a plugin author namespace, mobz, and a plugin name, elasticsearch-head. This will install and enable the plugin. Plugins are generally available in the Elasticsearch server via URLs with a specific URL path, _plugin, reserved for them. To access the Elasticsearch-head plugin we browse to: http://esa1.example.com:9200/_plugin/head/

We should see an interface much like:


Figure 8.2: The Elasticsearch Head plugin

NOTE Your browser may not look exactly like this screen as cluster names, sizes, and status will vary.

The plugin shows the list of current cluster nodes and their status. Particularly useful for our current situation is the Info dropdown in the top right of the screen. Selecting the Cluster Health dropdown will show the health of the cluster.
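From the command line we can also confirm which plugins a node has installed; the plugin binary we used above should be able to list them.

# List the plugins installed on this node
esa1$ sudo /usr/share/elasticsearch/bin/plugin list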

Time and time zone

Like our Riemann and Graphite hosts, we need to ensure the time and time zone are correct on our Logstash and Elasticsearch hosts. This ensures events have consistent timestamps across all our hosts. To do this, follow the steps we documented in Chapter 4 when we installed Graphite. We recommend setting Logstash and Elasticsearch hosts to UTC time so they’ll match the Riemann and Graphite hosts.
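On systemd-based hosts, for example, setting the time zone might look like the following sketch; if timedatectl isn’t available on your distribution, fall back to the steps from Chapter 4.

# Set the time zone to UTC and confirm the change
$ sudo timedatectl set-timezone UTC
$ timedatectl | grep "Time zone"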


Integrating Logstash and Elasticsearch

Now that we’ve got both Logstash and Elasticsearch installed and running let’s hook them together. To do this we’re going to send some sample events through Logstash into our Elasticsearch cluster. To provide these sample events we’re going to read our local Syslog logs and send them through to Elasticsearch. As our logstasha.example.com host is running Ubuntu, we first need to add the logstash user to the adm group. This gives our local Logstash server the right access to read our Syslog files.

Listing 8.35: Granting access to logs on Ubuntu

$ sudo usermod -a -G adm logstash

For any file added, Logstash will need to have permission to read it. We’ll need to grant the user and group with which we’re running Logstash those permissions. On CentOS, we would adjust the permissions with the setfacl command.

Listing 8.36: Granting access to logs on CentOS

$ sudo setfacl -m g:logstash:r /var/log/messages /var/log/secure

This will grant permissions to the /var/log/messages and /var/log/secure files. Now we need to tell Logstash to use an input plugin called file to read events from files, and an output plugin called elasticsearch to send events through to our Elasticsearch cluster. Let’s update our /etc/logstash/conf.d/logstash.conf file with this new configuration.


Listing 8.37: The updated logstash.conf configuration file

input {
  stdin { }
  file {
    path => [ "/var/log/syslog", "/var/log/auth.log" ]
    type => "syslog"
  }
}

output {
  stdout { }
  elasticsearch {
    sniffing => true
    hosts => "esa1.example.com"
  }
}

Note that we’ve added a new input plugin, file, and a new output plugin called elasticsearch. The file input plugin reads log entries from files. The file plugin does some pretty useful things:

• It automatically detects new files matching our collection criteria.
• It can handle file rotation, such as if we were to run logrotate to rotate log files.
• It keeps track of where it is up to in a file. Specifically, it will load any new events from the point at which Logstash last processed an event. Any new files start from the bottom of the file.


TIP See the sincedb option of the file plugin for more information.

We’ve specified an array of files from which we’ll collect events in the path option. In our case, we’ve selected two files containing Syslog output: /var/log/auth.log and /var/log/syslog.

NOTE These file choices assume an Ubuntu or Debian host. On Red Hat we’d change these to /var/log/messages and /var/log/secure.

The path option also allows us to specify globbing. For example, we could collect events from all *.log files in the /var/log/ directory:

Listing 8.38: File input globbing

path => [ "/var/log/*.log" ]

Or even a recursive glob like:

Listing 8.39: File recursive globbing

path => [ "/var/log/**/*log" ]

TIP You can find more options of the file plugin here.


The last attribute, type, is a global attribute we can set for plugins to identify the type of events being generated. In this case we’ve set it to syslog. The type attribute can be used to identify events that you might want to filter, process, notify on, or otherwise manage in your Logstash configuration. In this case we’re going to mark these events as being of type syslog so we can use some filter plugins to process them into more usable events. We’ll see how to do that shortly.

The elasticsearch output plugin sends events from Logstash onto an Elastic- search server or cluster. The plugin has two configuration options: sniffing and hosts. The hosts option takes the name of an Elasticsearch host—in our case esa1. The sniffing option then connects to our Elasticsearch node and asks about the other nodes in the cluster. It automatically adds any nodes it finds to the hosts option.

To enable this we need to restart Logstash.

Listing 8.40: Restarting Logstash to enable file monitoring

$ sudo service logstash restart

Let’s now send some events to our new cluster. Most Unix and Unix-like platforms come with a handy utility called logger. It generates Syslog messages that allow us to easily test if our Syslog configuration is working. It can be used like so:

Listing 8.41: Testing with logger

$ logger "This is a syslog message"

TIP You can see full options to change the facility and priority of logger messages here.

This will generate messages that will end up in our /var/log/syslog file, be picked up by the file plugin, be processed into Logstash, and then will be sent onto Elasticsearch.
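A quick way to confirm the whole path is working is to search the cluster for the test message we just logged; the query below uses the URI search syntax and assumes the default logstash-* index names.

# Search all Logstash indexes for the test message
$ curl 'http://esa1.example.com:9200/logstash-*/_search?q=message:%22This%20is%20a%20syslog%20message%22&pretty'

After a few seconds the hits section of the response should contain our event.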

What happens inside Logstash?

When an event is collected by Logstash it is turned into a native JSON-based event like the one we saw earlier in this chapter. Let’s look at our Syslog event in this form.

Listing 8.42: A syslog log event

{
    "message" => "Apr 29 17:29:10 logstash root: This is a syslog message",
    "@version" => "1",
    "@timestamp" => "2015-04-29T21:29:10.830Z",
    "host" => "logstasha",
    "type" => "syslog",
    "path" => "/var/log/syslog"
}

Our plugin has grabbed our message from the file Syslog wrote it to and used it as the value of our message field. We also see the version of the event schema, a timestamp (this is when the event was received by Logstash), the host that generated it, and the type we assigned earlier, syslog. The plugin has also added a new field, path, which tells Logstash the file from which the event was plucked.


This is a useful event but we can make it even more useful by parsing it and extracting some more data using some of Logstash’s filter plugins. Let’s look at an updated configuration that does this.


Listing 8.43: Adding our first Logstash filters

input {
  stdin { }
  file {
    path => [ "/var/log/syslog", "/var/log/auth.log" ]
    type => "syslog"
  }
}

filter {
  if [type] == "syslog" {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      remove_field => ["message"]
    }
    syslog_pri { }
    date {
      match => [ "syslog_timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
}

output {
  stdout { }
  elasticsearch {
    sniffing => true
    hosts => "esa1.example.com"
  }
}

You can see we’ve added a new block called filter. This section contains filter plugins that allow Logstash to parse, manipulate, or change events. In this case we’re parsing Syslog logs.

First, we’re taking advantage of the syslog type we set earlier. This, combined with a conditional if statement, allows us to only select those events with a type field (to reference it, the field name is wrapped in [] block brackets) value of syslog.

Listing 8.44: Logstash conditional selection

filter {
  if [type] == "syslog" {
    ...
  }
}

TIP You can read more about conditionals in the Logstash documentation.

Inside our conditional we specify some plugins that will inspect and parse our Syslog logs.


Listing 8.45: Our Syslog parsing filters

grok {
  match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
  remove_field => ["message"]
}
syslog_pri { }
date {
  match => [ "syslog_timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss" ]
}

Here we’ve specified three plugins: grok, syslog_pri, and date.

The grok filter

The grok filter is one of Logstash’s most commonly used plugins. It provides the ability to match field contents with regular expressions and extract useful information, usually into new fields.


Listing 8.46: The grok filter

grok {
  match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
  remove_field => ["message"]
}

Inside the grok filter we specify a match attribute. This combines the name of the field we wish to parse—here we’re parsing the message field—and the regular expression we’re using to parse that field. Logstash tries to make regular expression matching really easy by providing a set of pre-baked expressions that match common log forms. This includes a set of expressions that match various components of Syslog log entries. Logstash calls these patterns.

In our regular expression these patterns are indicated by the capitalized declarations wrapped in %{ }, like %{SYSLOGTIMESTAMP:syslog_timestamp}. When Logstash sees a pattern, it knows to look it up and expand it into a regular expression. Here SYSLOGTIMESTAMP translates into the expression: %{MONTH} +%{MONTHDAY} %{TIME}. This too is a series of patterns, each of which Logstash expands into a lower-level regular expression. For example, the %{MONTH} pattern expands into:


Listing 8.47: The MONTH pattern

MONTH \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b

You can find these default patterns and others, which ship with Logstash, on GitHub here.

We take these patterns, as we have in our match attribute, and combine them to parse a specific field. Note that our patterns have two pieces: %{SYSLOGTIMESTAMP:syslog_timestamp}. After our pattern we’ve specified a colon and syslog_timestamp. This syntax allows us to capture any data we’ve matched from our pattern and store it in a new field. Here we’ve captured any timestamp matched by the SYSLOGTIMESTAMP pattern and added a new field to our event called syslog_timestamp. We’ve also done this for the hostname, the program, the process ID, and the Syslog message itself. This turns our original event...


Listing 8.48: Our original syslog log event

{
    "message" => "Apr 29 17:29:10 logstash root: This is a syslog message",
    "@version" => "1",
    "@timestamp" => "2015-04-29T21:29:10.830Z",
    "host" => "logstasha",
    "type" => "syslog",
    "path" => "/var/log/syslog"
}

...Into a much more useful parsed event:

Listing 8.49: Our grok-parsed event

{
    "@version" => "1",
    "@timestamp" => "2015-04-29T17:29:10.000Z",
    "type" => "syslog",
    "host" => "logstasha",
    "path" => "/var/log/syslog",
    "syslog_timestamp" => "Apr 29 17:29:10",
    "syslog_hostname" => "logstasha",
    "syslog_program" => "root",
    "syslog_message" => "This is a syslog message"
}


We see new custom fields—syslog_timestamp, syslog_hostname, syslog_program, and syslog_message—have all been created. This makes our event eminently more searchable and useful because we can now identify particular programs, hosts, and messages much more easily. Our message field is also gone. This is done by the remove_field attribute in our grok filter. We’ve done this because once the message field is parsed we don’t need it anymore, and storing it isn’t valuable. It’s a useful cleanup—we recommend you consider it. If the idea of creating these matches is a bit daunting (who likes regular expressions!) then here are some useful tools that can help you construct grok matches:

• The Grok Constructor http://grokconstructor.appspot.com/.
• The Grok Debugger https://grokdebug.herokuapp.com/.
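You can also experiment locally by passing a throwaway pipeline to Logstash with the -e flag, which accepts a configuration string. The pattern below is deliberately simplified, not the full Syslog match above:

# Read lines from STDIN, apply a simplified grok pattern, and print the parsed result
$ sudo /opt/logstash/bin/logstash -e '
input { stdin { } }
filter {
  grok {
    match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{GREEDYDATA:syslog_message}" }
  }
}
output { stdout { codec => rubydebug } }
'

Paste a sample line such as Apr 29 17:29:10 testing onto STDIN and the rubydebug output will show whether the pattern matched and which fields were extracted.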

The Syslog priority filter

The next filter, syslog_pri, processes the Syslog priority/severity and facility information from our Syslog event. Syslog facilities and severities tell us the source and importance of our event. For example, the auth facility is generally reserved for authentication events, and severities include info for informational messages and crit for critical ones. The plugin doesn’t take any options and further manipulates our event to become:


Listing 8.50: The syslog_pri parsed event

{
    "message" => "Apr 29 17:29:10 logstasha root: This is a syslog message",
    "@version" => "1",
    "@timestamp" => "2015-04-29T17:29:10.000Z",
    "type" => "syslog",
    "host" => "logstasha",
    "path" => "/var/log/syslog",
    "syslog_timestamp" => "Apr 29 17:29:10",
    "syslog_hostname" => "logstasha",
    "syslog_program" => "root",
    "syslog_message" => "This is a syslog message",
    "syslog_severity_code" => 5,
    "syslog_facility_code" => 1,
    "syslog_facility" => "user-level",
    "syslog_severity" => "notice"
}

We see that four new custom fields have been added, two showing the severity and facility code, and two more fields showing the matching textual description.

The date filter

The last plugin, date, helps us process dates inside events. By default, Logstash uses the time an event is received as the timestamp of that event. Depending on circumstances, this time could vary from the time the event was actually generated. The date plugin helps address this variance. It’s used to select a timestamp from a specified field, parse the date format, and then use that timestamp as the time of the event.

Listing 8.51: The date filter

date {
  match => [ "syslog_timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss" ]
}

To use the plugin, we select a field to parse and use as our event’s timestamp—in this case, syslog_timestamp. We specify this as the value of the match attribute. We also provide one or more potential date formats to match against our timestamp field, here MMM d HH:mm:ss and MMM dd HH:mm:ss. Between them these two formats should match most Syslog date formats.

When the event hits the date plugin, the syslog_timestamp field will be parsed for a usable timestamp, and if successful, the @timestamp of the event will be replaced with this timestamp. This ensures Logstash has the correct date the event was generated rather than the date Logstash received the event.

NOTE This is a basic demonstration of the power of Logstash’s filtering. There are a lot more filtering plugins available. We’ll see some more of them in subsequent chapters but there is also a full list on the Elastic site.

What happens inside Elasticsearch?

After the event is filtered it reaches the output block and is processed by the elasticsearch plugin. This plugin takes the event, indexes it, converts it into a form Elasticsearch can process, and then sends it into the cluster. After the event exits Logstash it is received by Elasticsearch and put into an index. Under the covers, Elasticsearch uses Apache Lucene to create this index. Each index is a logical namespace; in Logstash’s case the default indexes are named for the day the events are received. For example:

Listing 8.52: A Logstash index

logstash-2015.12.31

As we’ve seen, each Logstash event is made up of fields, and these fields become documents inside that index. To compare Elasticsearch to a relational database: an index is a table, a document is a table row, and a field is a table column. Like a relational database, you can define a schema too. Elasticsearch calls these schemas “mappings.”

TIP You can read more about mappings in the Elasticsearch documentation.

Like a schema, a mapping declares what data and data type fields documents contain, any constraints that are present, what the unique and primary keys are, and how to index and search each field. Unlike a schema, though, we can also specify Elasticsearch settings.

To see the mapping that’s currently applied on your Elasticsearch server, use the curl command.


Listing 8.53: Showing the current Elasticsearch mapping

$ curl http://esa1.example.com:9200/_template/logstash?pretty

You can also see mappings applied to specific indexes like so:

Listing 8.54: Showing index-specific mappings

$ curl http://esa1.example.com:9200/logstash-2015.12.31/_mapping?pretty

NOTE This default mapping has been provided by Logstash so it’s broadly optimized for managing Logstash events.

Indexes are stored in Lucene instances called “shards.” There are two types of shards: primary and replica. Primary shards are where your documents are stored. Each new index automatically creates five primary shards. This is a default setting; the number of primary shards can be adjusted when the index is created, but not after it has been created.

TIP A useful tool for managing indexes is Curator. Curator helps you set good retention policies on your indexes and clean up old indexes when required.


Replica shards are copies of the primary shards that exist for two purposes:

• To protect your data
• To make your searches faster

Each primary shard will have one replica by default but can have more if required. Unlike primary shards, this can be changed dynamically to scale out or make an index more resilient. Elasticsearch will distribute these shards across the available nodes and ensure primary and replica shards for an index are not present on the same node. Shards are stored on our Elasticsearch nodes. Each node is automatically part of an Elasticsearch cluster, even if it’s a cluster of one. Elasticsearch automatically distributes shards amongst all nodes in the cluster. It can move shards automatically from one node to another in the case of node failure or when new nodes are added. One of the great things about Elasticsearch’s clustering is that it is relatively simple and pain free. Nodes, as long as they can communicate, generally join their specified cluster cleanly and automatically. Data is mostly managed automatically on the cluster and protected from node failures, assuming sufficient nodes are present. As a result, we’re not going to do any tweaking of our clustering or cluster performance. You can view your shards and replicas as data flows into Elasticsearch using the Elasticsearch-head plugin we installed earlier.
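For example, we could bump the replica count for a single day's index by updating its settings over the HTTP API; the index name here is just an example, so substitute one of your own.

# Increase the number of replicas for a specific index
$ curl -XPUT 'http://esa1.example.com:9200/logstash-2015.12.31/_settings' -d '
{
  "index": {
    "number_of_replicas": 2
  }
}'

Elasticsearch then creates and distributes the extra replica shards in the background.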

TIP You can read more about Elasticsearch clustering in the documentation.


Installing Kibana

Now that we have data flowing into Elasticsearch we need to be able to review and search that data and potentially create visualizations from it. This is where the last piece of the ELK stack comes in: Kibana. Kibana is a dashboard for Elasticsearch-based data, written in JavaScript. Like the rest of the ELK stack, it is open source and licensed under the Apache 2.0 license.

Kibana runs its own web server and can be installed on any host that can connect to our Elasticsearch cluster. Let’s download and install it on our logstasha.example.com host.

Kibana is available as a download from the Elastic website and as packages for Ubuntu and Red Hat distributions.

We’re now ready to install Kibana. To do so we need to add the Kibana APT repository to our host.

Listing 8.55: Adding the Kibana APT repository

$ sudo sh -c "echo 'deb http://packages.elastic.co/kibana/4.4/debian stable main' > /etc/apt/sources.list.d/kibana.list"

TIP If we were running on Red Hat or a derivative we would install the appropriate Yum repository. See this documentation for details.

We then run an apt-get update to refresh our package list.


Listing 8.56: Updating the package list for Kibana

$ sudo apt-get update

And finally we install Kibana itself.

Listing 8.57: Installing Kibana via apt-get

$ sudo apt-get install kibana

Now we just need to ensure Kibana starts when our host boots.

Listing 8.58: Starting Kibana at boot

$ sudo update-rc.d kibana defaults 95 10

Kibana is now installed.

TIP The packages install Kibana into the /opt directory, /opt/kibana.

Configuring Kibana

To configure Kibana we use the kibana.yml file in the /opt/kibana/config directory. This is a well-commented, YAML-based configuration file. The major items we might want to configure are the interface and port we want to bind Kibana to and the Elasticsearch cluster we wish to use for queries. All of those settings are at the top of our configuration file.

Listing 8.59: The Kibana kibana.yml configuration file

# Kibana is served by a back-end server. This controls which port to use.
server.port: 5601

# The host to bind the server to.
server.host: "0.0.0.0"

# The Elasticsearch instance to use for all your queries.
elasticsearch.url: "http://localhost:9200"

...

Here we see that our Kibana console will be bound on port 5601 on all interfaces. The console will point to an Elasticsearch server located at http://localhost:9200 by default. We need to change this to point at a host in our Elasticsearch cluster. Let’s change it to:

Listing 8.60: Updating the Elasticsearch instance

# The Elasticsearch instance to use for all your queries.
elasticsearch.url: "http://esa1.example.com:9200"

...


The configuration file also contains other settings, including the ability to set up TLS-secured access to Kibana and to load plugins that enhance or provide additional functionality to the dashboard.

Running Kibana

To run Kibana we use the kibana service.

Listing 8.61: Running Kibana

$ sudo service kibana start

This will run the Kibana console as a service on port 5601 of our logstasha.example.com host. We browse to Kibana by opening a browser and connecting to http://logstasha.example.com:5601. When we launch Kibana we’ll be prompted to specify what Elasticsearch indexes to search on and configure a default timestamp field.

Figure 8.3: Configuring Kibana


The default index specification is logstash-*, which is a wildcard match for all our Logstash-delivered Elasticsearch indexes. We’ve also selected the @timestamp field to be used as Kibana’s global time filter.

Now click the Create button. This will complete Kibana’s configuration.

NOTE Kibana will create a special index called .kibana to hold its configuration. Don’t delete this index or you’ll need to reconfigure Kibana.

Next click Discover on the top menu bar to be taken to Kibana’s basic interface. The Discover interface lists all events received and all the fields available to us. It includes a historical graph of all events received and a listing of each individual event sorted by the latest event received. It also includes a query engine that allows us to select subsets of events.


Figure 8.4: The Kibana Dashboard

Using Kibana

We’re not going to show you how to do much with Kibana itself, primarily because web consoles change frequently and that sort of information dates quickly. Thankfully, there are some really good sources of documentation that can help.

The Kibana User Guide at the Elastic.co site is excellent and provides a detailed walkthrough of using Kibana to search, visualize, and build dashboards for logs. There are also some great blog posts on Kibana in the Digital Ocean documentation and in this blog post from Tim Roes.

Connecting our hosts to Logstash via Syslog

Now that we’ve got our ELK stack set up, we can start using it. To do this we need to configure hosts in our environment to connect to Logstash and send logs. But we’re not going to install Logstash on any of our hosts. It’s a JVM-based application and a little too performance-heavy to run on application and service hosts. Instead we’re going to make use of Syslog on these hosts and configure our central Logstash server to receive these events.

Configuring Logstash

Earlier in the chapter we configured Logstash to parse Syslog logs, albeit logs collected via the file input plugin. Now let’s add support for receiving them via the network. To do this, we’ll revisit our Logstash configuration and add a couple of new plugins to it. Let’s open our logstash.conf on the logstasha.example.com host.


Listing 8.62: Adding Syslog receivers

input {
  tcp {
    port => 5514
    type => "syslog"
  }
  udp {
    port => 5514
    type => "syslog"
  }
}

filter {
  if [type] == "syslog" {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
    }
    syslog_pri { }
    date {
      match => [ "syslog_timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
}

output {
  elasticsearch {
    hosts => [ "esa1.example.com", "esa2.example.com", "esa3.example.com" ]
  }
}

You can see we’ve added two new input plugins to our input block: tcp and udp. These are inputs that open ports on our Logstash host to receive events. In this case we’re opening TCP and UDP ports 5514 on our host. We’re opening a high port, 5514, because Logstash isn’t running as the root user and so we can’t bind Syslog’s traditional port: 514.

NOTE In most Syslog implementations port 514 is the default port used by Syslog to send events between hosts.

Now we restart Logstash to enable these new plugins.

Listing 8.63: Restarting Logstash to enable new inputs

$ sudo service logstash restart

We confirm the plugins have bound correctly via the netstat command.

Listing 8.64: Running netstat

$ sudo netstat -an | grep '5514'
tcp        0      0 0.0.0.0:5514       0.0.0.0:*       LISTEN
udp        0      0 0.0.0.0:5514       0.0.0.0:*

NOTE If you find that Logstash has unexpectedly only bound to IPv6 ports, you might need to tweak your Logstash Java configuration slightly. Open the /etc/default/logstash file and uncomment and update this line: LS_JAVA_OPTS="-Djava.net.preferIPv4Stack=true".

This will now allow us to send Syslog logs from the other hosts in our Production A environment. Let’s set that up now.

A quick introduction to Syslog

Syslog is one of the original standards for computer logging. It was designed by Eric Allman as part of Sendmail and has grown to support logging from a variety of platforms and applications. It has become the default mechanism for logging on Unix and Unix-like systems and is heavily used by applications, non-host devices like printers, and networking devices like routers, switches, and firewalls. As a result of its ubiquity on these types of platforms, it is a commonly used means to centralize logs from disparate sources. Each message generated by Syslog (and there are variations between platforms) is roughly structured like so:

Listing 8.65: A Syslog message

Dec 31 14:29:31 logstash systemd-logind[2113]: New session 31581 of user bob.

A Syslog message will contain a timestamp, the host that generated the message (here logstash), the process and process ID (PID) that generated the message, and the content of the message. Messages also have metadata attached to them in the form of facilities and severities. Messages refer to a facility like:

• AUTH
• KERN
• MAIL
• etc.

The facility specifies the type of message generated. Messages from the AUTH facility usually relate to security or authorization, messages from the KERN facility are usually kernel messages, and messages from the MAIL facility indicate they were generated by a mail subsystem or application. There are a wide variety of facilities, including custom facilities prefixed with LOCAL and a digit—LOCAL0 to LOCAL7—that you can use for your own messages. Messages also have a severity assigned, for example, EMERGENCY, ALERT, and CRITICAL, ranging down to NOTICE, INFO, and DEBUG.
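For example, logger lets us pick the facility and severity explicitly with the -p flag, which is handy for testing routing rules; the message text here is arbitrary.

# Emit test messages with explicit facility.severity values
$ logger -p local0.info "An informational message on the LOCAL0 facility"
$ logger -p auth.crit "A critical message on the AUTH facility"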

TIP You can find more details on Syslog here.

Configuring Syslog

Syslog is installed and running on most Linux hosts by default. Most modern Linux distributions use a Syslog daemon called RSyslog, which has become the default Syslog daemon on recent versions of Ubuntu, CentOS, Fedora, Debian, openSuSE, and others. It can process log files, handle local Syslog, and it comes with a modular plugin system.

TIP In addition to supporting Syslog output, Logstash also supports the RSyslog-specific RELP protocol.


To connect RSyslog to Logstash we’re going to configure RSyslog to forward all logs to Logstash. RSyslog is configured in two places. The basic configuration is in the file /etc/rsyslog.conf. Most distributions also have a /etc/rsyslog.d directory in which we place configuration snippets. To configure forwarding we’re going to create a snippet inside the /etc/rsyslog.d directory. First let’s create a file:

Listing 8.66: Creating an RSyslog snippet for forwarding

$ sudo vi /etc/rsyslog.d/30-forwarding.conf

RSyslog loads configuration files in alphanumeric order, so we’d normally prefix any file with a number, such as 30-, to help specify a load order. On Ubuntu, for example, the default RSyslog configuration is loaded in 50-default.conf in the /etc/rsyslog.d directory. We’ve specified a prefix of 30- so our configuration loads first. Let’s populate this file right now.

Listing 8.67: Configuring RSyslog for Logstash

*.* @@logstasha.example.com:5514

NOTE If you specify the hostname, here logstasha.example.com, your host will need to be able to resolve it via DNS. For performance reasons, it may be better to specify an IP address here.


This tells RSyslog to send all messages, indicated by: *.*, to the host logstasha. example.com. The *.* indicates all facilities and priorities. You can use specific facilities or priorities instead, if you wish. For example:

Listing 8.68: Specifying RSyslog facilities or priorities

mail.* @@logstasha.example.com:5514
*.emerg @@logstasha.example.com:5514

The first line would send all mail facility messages, and the second would send all messages of emerg priority.

The @@ tells RSyslog to use TCP to send the messages. Specifying a single @ uses UDP as a transport.

TIP We strongly recommend using the more reliable and resilient TCP protocol to send your Syslog messages. Unlike UDP it has guarantees of delivery.

We then restart the RSyslog daemon, like so:

Listing 8.69: Restarting RSyslog

$ sudo service rsyslog restart

Our host will now be sending all the messages collected by RSyslog to our Logstash server.
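To check the end-to-end flow, we could emit a test message on the forwarding host and then watch the document count of the Logstash indexes on the cluster grow; the commands below assume the host names used in this chapter.

# On the forwarding host, generate a test Syslog message
client$ logger "Testing RSyslog forwarding to Logstash"
# Then, from anywhere that can reach the cluster, list the Logstash indexes
$ curl 'http://esa1.example.com:9200/_cat/indices/logstash-*?v'

The docs.count column for today's index should increase as forwarded events arrive.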


The RSyslog imfile module

In addition to logs generated via Syslog, we can collect logs from specific files. One of RSyslog's modules provides another method of sending log entries from RSyslog: the imfile module sends the contents of files on the host via Syslog. The imfile module works much like Logstash's file input, supports file rotation, and tracks the currently processed entry in the file.

TIP If you don’t want to use RSyslog to collect log entries from files, Elastic has released a series of collection agents called beats. Specifically there is a beat called Filebeat for collecting log entries from files. Beats are also available for collecting wire data from Windows Event Logs and a variety of other sources.

To send a specific file via RSyslog we need to enable the imfile module and then specify the file to be processed. For example, to collect events from the /var/log/riemann/riemann.log log file, we would create a snippet in /etc/rsyslog.d/.

Listing 8.70: Collecting the riemann.log file

$ sudo vi /etc/rsyslog.d/riemann.conf

We’d populate this file like so:


Listing 8.71: Monitoring files with the imfile module

module(load="imfile" PollingInterval="10")

input(type="imfile" File="/var/log/riemann/riemann.log" Tag="riemann:")

The first line loads the imfile module and sets the polling interval for events to 10 seconds. It only needs to be specified once in the configuration and can be adjusted for more frequent updates, if required. The next block specifies the file from which to collect events. It has a type of imfile, telling RSyslog to use the imfile module. The File attribute specifies the name of the file to poll. The File attribute also supports wildcards.

Listing 8.72: Monitoring files with an imfile wildcard

input(type="imfile" File="/var/log/riemann/*.log" Tag="riemann:")

This would collect all events from all files in the /var/log/riemann directory with a suffix of .log. Lastly, the Tag attribute tags these messages in RSyslog with a tag of riemann:. Note the trailing : on the tag—this is important to ensure the tag is correctly parsed, so you should always end a tag with a colon. In another example we could add the /var/log/collectd.log log file we created in Chapter 5 to collect collectd's events.


Listing 8.73: Monitoring the collectd log file

input(type="imfile" File="/var/log/collectd.log" Tag="collectd:")

Here we’re again specifying the imfile type and the specific file, and we’re adding a tag of collectd: to the log entries.

The imfile module also supports state management. RSyslog keeps track of how much of the file has been processed, as well as tracking changes like file rotations.

TIP You can find the full RSyslog documentation on the RSyslog website.

Logging from Docker

In Chapter 7 we talked about metric generation from Docker. We can also extract logs from our Docker containers and daemon. The Docker daemon has a series of logging plugins, including plugins that write events to Syslog or to a file on disk. We're going to take advantage of a combination of the Syslog infrastructure we just configured and the syslog Docker logging plugin to send events from Docker and containers to Logstash.

The Docker daemon provides two levels of logging configuration: globally at the daemon level, and locally at runtime when a container is created. Initially we're going to configure Docker to log globally, but we'll demonstrate how to log at runtime too. To configure the Docker daemon for logging, we need to pass the daemon some arguments at startup.

Configuring the Docker Daemon for logging

On both the Ubuntu and the Red Hat family of distributions, the Docker daemon is configured by system configuration files. On Ubuntu we edit the /etc/default/docker file and on Red Hat the /etc/sysconfig/docker file. Both files have an environment variable called DOCKER_OPTS that controls the arguments passed to the docker daemon command executed when the Docker daemon is started.

TIP You may need to uncomment the variable in the file.

Let’s update that variable to include a logging plugin and some configuration. Re- calling that we configured a Logstash Syslog receiver on host logstasha.example .com on port 5514, let’s configure that as our destination.

Listing 8.74: Configuring the DOCKER_OPTS option for logging

DOCKER_OPTS="--log-driver=syslog --log-opt tag="{{.Name}}/{{.ID}} " --log-opt syslog-address=tcp://logstasha.example.com:5514"

NOTE You might need to append this configuration to the DOCKER_OPTS variable if you already have existing configuration in there.

We’ve configured two different options. The first, --log-driver, controls which

Version: v1.0.3 (2c4a7d0) 401 Chapter 8: Logs and logging logging driver plugin will be loaded. We’ve set it to syslog. We’ve also configured two --log-opt options.

WARNING Enabling the syslog Docker logging plugin will prevent you from being able to run the docker logs command locally on the Docker daemon, as all logs are being sent to Syslog.

The --log-opt flag can be specified more than once to configure elements of each logging plugin. In this case we've first specified the flag with the tag option. The tag option controls additional information about our containers that we might want to send to our logging service. Here we add the container name and container ID to our log entries. This results in a log entry like this:

Listing 8.75: Docker logging tags

2015-11-28T20:24:04Z docker docker/container_name/5790672ab6a0[9103]: Hello from Docker.

It looks like a standard Syslog entry, except that the container's name and ID appear between the program name, docker, and the process ID, 9103. We'll parse out this information using the grok filter when it arrives at Logstash. You can add a variety of tags, including the name of the image the container is running from and any labels applied to the container. The field is a Go template and has access to everything in the logging context.
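For example, if you also wanted the image name in the tag, one possible template (shown here as a sketch) would be:

--log-opt tag="{{.ImageName}}/{{.Name}}/{{.ID}}"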

For the second --log-opt flag we specified our logging destination using the syslog-address option. We configure a Syslog destination of: tcp://logstasha.example.com:5514


This points to the Logstash server listening on TCP port 5514 on the logstasha.example.com host. We could also use UDP, if we wished, by prefixing the server address with udp://.
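As a sketch, a per-container override using the UDP transport might look like the following; the image name is purely illustrative, and we'll discuss per-container overrides next:

$ sudo docker run --log-driver=syslog \
  --log-opt syslog-address=udp://logstasha.example.com:5514 \
  --log-opt tag="{{.Name}}/{{.ID}}" redis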

Any of these options can also be specified with the docker run command. Using a different logging driver when you run a container from the command line will override the choice and configuration of the logging plugin enabled for the Docker daemon. To enable our logging driver, we need to restart the Docker daemon.

Listing 8.76: Restarting Docker to enable logging

$ sudo service docker restart

Now the Docker daemon, or any Docker containers, will be sending events to the Logstash server. When those events land on the Logstash server they’ll be grabbed and marked as type syslog and processed by the grok filter. Let’s revisit that configuration.


Listing 8.77: Our Logstash Syslog receivers

input {
  tcp {
    port => 5514
    type => "syslog"
  }
  ...
}
filter {
  if [type] == "syslog" {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      remove_field => ["message"]
    }
    syslog_pri { }
    date {
      match => [ "syslog_timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
}
...

We see our tcp input on port 5514, which applies the syslog type to any received events. These events are then processed by our filters. But our current Syslog grok filter is configured for standard Syslog lines, and we'll quickly see that events from Docker aren't parsed correctly, resulting in events that look like:

Listing 8.78: Grok parse failed Docker event

{ "message" => "<30>2015-11-28T20:24:04Z docker docker/07 c15432c076[16829]: hello world", "@version" => "1", "@timestamp" => "2015-11-28T20:24:04.254Z", "host" => "docker", "type" => "syslog", "tags" => [ [0] "_grokparsefailure" ], "syslog_severity_code" => 5, "syslog_facility_code" => 1, "syslog_facility" => "user-level", "syslog_severity" => "notice" }

You can see our Docker log event in the message field, but the rest of our event hasn’t fared well. Indeed, the attempt by the grok filter to parse it has failed. When this happens, the grok filter will add a tag to the event called _grokparsefailure.

NOTE We recommend you monitor for this tag in your Logstash logs as it’ll let you know when your Logstash is receiving events and not correctly parsing them. Personally, we have a Kibana graph and a search specifically configured for this tag to let us know when it happens.
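One possible way to act on the tag, sketched here using the standard file output plugin and a hypothetical path, is to route unparsed events somewhere you can review them:

output {
  if "_grokparsefailure" in [tags] {
    file {
      path => "/var/log/logstash/grok_failures.log"
    }
  }
  ...
}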


To fix this we need to update the regular expression that matches the message field in the grok filter. Let's first identify what is different about our Docker log message.

Listing 8.79: Docker log message

<30>2015-11-28T20:24:04Z docker docker/daemon_dave/07c15432c076[16829]: hello world

We see we’re using a different timestamp from the default Syslog timestamp. Thankfully it’s a standard ISO8601 timestamp and there’s an existing Grok pat- tern for it. But we’ll also need to update our date filter to reflect that. We’ll do that next.

Note also that the Syslog program and PID now include the Docker container's name and ID that we added using the Docker logging tag option, here daemon_dave and 07c15432c076. We need to update our regular expression to handle that too.


Listing 8.80: An updated Docker grok regex

grok {
  match => { "message" => "(?:%{SYSLOGTIMESTAMP:syslog_timestamp}|%{TIMESTAMP_ISO8601:syslog_timestamp}) %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\/%{DATA:container_name}\/%{DATA:container_id})?(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
  remove_field => ["message"]
}

Now our grok match accepts optional ISO8601 timestamps as well as our regular Syslog timestamps. We've also added two new optional fields, container_name and container_id, that are only populated if the message is a Docker log message. These fields will contain the Docker container name and ID.

TIP Remember the Grok Debugger is useful for parsing and creating grok regular expressions.

We also need to update the date filter to support an additional time format, ISO8601. Adding a new format is easy, especially when it is a known and named format.


Listing 8.81: Updated date filter for Docker

date {
  match => [ "syslog_timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss", "ISO8601" ]
}

Here we’ve added the ISO8601 format to the end of our list of supported formats for the syslog_timestamp field. Now we if we HUP or restart Logstash we should be able to parse our incoming Docker log messages.

Listing 8.82: Restarting Logstash to enable Docker logging

$ sudo service logstash restart

If we now receive a similar Docker log message we'll see it's been properly parsed, and we'll have also gained some useful contextual information: the container name and ID.


Listing 8.83: A properly parsed Docker log message

{ "message" => "<30>2015-11-28T21:51:26Z docker docker/daemon_dave/d6cfce59a1d1[23338]: hello world", "@version" => "1", "@timestamp" => "2015-11-28T21:51:26.000Z", "host" => "docker", "type" => "syslog", "syslog_timestamp" => "2015-11-28T21:51:26Z", "syslog_hostname" => "docker", "syslog_program" => "docker", "container_name" => "daemon_dave", "container_id" => "d6cfce59a1d1", "syslog_pid" => "23338", "syslog_message" => "hello world", "syslog_severity_code" => 5, "syslog_facility_code" => 1, "syslog_facility" => "user-level", "syslog_severity" => "notice" }

With this, our Logstash server now has our Docker daemon and container logs flowing in. This allows us to log events from the daemon and any running containers, and provides a diagnostic history useful for identifying and resolving faults.


Sending data from Logstash to Riemann

Now that we have some events flowing from our hosts into Logstash, let's see how to send events and metrics from Logstash to Riemann and then on to Graphite. This is useful when we want to generate metrics from specific log events and emit them. Let's see how this works by creating a metric that measures event throughput in Logstash: the rate of events processed by Logstash.

First, let’s install a new output plugin, riemann, on our Logstash host. The riemann output plugin is a community-maintained plugin to connect Logstash to Riemann. It doesn’t ship with Logstash by default—we need to use the plugin command to install it.

Listing 8.84: Installing the Riemann Logstash output

$ sudo /opt/logstash/bin/plugin install logstash-output-riemann
Validating logstash-output-riemann
Installing logstash-output-riemann
Installation successful

TIP You can read more about working with external Logstash plugins in the Elasticsearch Logstash plugin documentation.

Now let’s take our existing Logstash configuration and add a new filter called metrics to it and our new riemann output.


Listing 8.85: The Logstash metrics filter

filter {

...

  metrics {
    meter => "events"
    add_tag => "logstash_events"
  }
}
output {

...

  if "logstash_events" in [tags] {
    riemann {
      host => "riemanna.example.com"
      map_fields => true
      riemann_event => {
        "service" => "Logstash events"
        "host" => "logstasha.example.com"
      }
    }
  }
}

The metrics filter creates metrics from Logstash events. It has two modes: meter and timer. In meter mode it counts events and then emits a new event containing the total number of events and the short, mid, and long-term rates of events. These rates are customizable but default to one minute, five minutes, and 15 minutes, much like Linux load averages.

In timer mode the metrics filter emits the same event but also includes statistics on a field we specify. For example, we might be collecting the access logs from an Apache web server, which include HTTP status codes and response times. We could create a metric from the response time, and the metrics filter could output rates as well as min, max, mean, standard deviation, and percentiles. In this case our meter => "events" line above will emit the following event:

Listing 8.86: Logstash metrics event

{ "@version" => "1", "@timestamp" => "2015-08-05T00:36:02.773Z", "message" => "logstash", "events.count" => 1066562, "events.rate_1m" => 1222.4379898784412, "events.rate_5m" => 1175.383369199709, "events.rate_15m" => 766.9274163646223, "tags" => [ [0] "logstash_events" ] }

Our event contains the typical fields: @timestamp, @version, and a message of logstash. We also see four custom fields that the metrics filter has automatically created in the emitted event: events.count, events.rate_1m, events.rate_5m, and events.rate_15m. This is the count of events that have passed through the metrics filter, plus the rate at one-, five-, and 15-minute intervals. The custom fields are prefixed with the value we specified for the meter option: events. This, plus a period, ., creates a name for the metric.

Here every event is passing through the metrics filter, so we've effectively created statistical rates for Logstash's event processing. We could also use tags, the type field, or other selectors to grab specific events and create metrics from them.
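For instance, a sketch that only meters Docker log events, assuming the syslog_program field we parsed earlier, might look like:

filter {
  if [syslog_program] == "docker" {
    metrics {
      meter => "docker_events"
      add_tag => "docker_event_rate"
    }
  }
}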

It’s also these metrics fields that we’re going to pass to Riemann. Note that we’ve added a tag to our filter of logstash_events. This helps us identify specific events we want to send to Riemann in the output section. In the output section we’ve matched all events with the tag logstash_events. Inside that match we’ve specified the riemann output.

Listing 8.87: The Riemann output

riemann {
  host => "riemanna.example.com"
  sender => "%{syslog_hostname}"
  map_fields => true
  riemann_event => {
    "service" => "Logstash events"
    "metric" => "%{events.rate_1m}"
  }
}

The host option tells the output which Riemann server to send our events to, here riemanna.example.com. The sender option controls the source host of the event—this will populate the :host field of our Riemann event. We're going to use the contents of the syslog_hostname field. The %{syslog_hostname} syntax is a field reference. Field references allow us to refer to event fields in Logstash configuration. The map_fields option tells the riemann output to map any custom fields in the Logstash event to custom fields in the corresponding Riemann event.

Finally, the riemann_event option allows us to map or create fields in the Riemann event. Here we've told the riemann output that the :service field of our Riemann event should be set to Logstash events, and the :metric field should take its value from the events.rate_1m field.

This configuration will result in a Riemann event like so:

Listing 8.88: Logstash metric event in Riemann

{:host logstasha.example.com, :service Logstash events, :state ok, :description logstash, :metric 1199.8928828641517, :tags [logstash_events], :time 1438738018, :ttl 60, :events.rate_15m 610.7294184757806, :events.rate_5m 1055.0701220952672, :events.rate_1m 1199.8928828641517, :events.count 768848, :message logstash}

We see the default Riemann fields, including our :metric field, but also some custom fields like :events.rate_5m. So, in addition to matching on the :metric field as we've done previously, we could match on one of these custom fields too.

TIP You can also output your metrics directly to Graphite using the graphite output, if you'd prefer.


Sending data from Riemann to Logstash

In addition to sending metrics to Riemann from Logstash, we can do the reverse. If data coming into Riemann is useful for diagnostic purposes we can also send that data into Logstash. To make this easier, Riemann comes with a Logstash plugin we can enable on our Riemann server. Let's start by doing that.

We’re going to add some configuration for the Logstash plugin, much like howwe configured the Graphite plugin in Chapter 4. Let’s create a filecalled logstash. clj in /etc/riemann/examplecom/etc/ to hold that configuration.

Listing 8.89: The logstash.clj configuration file

(ns examplecom.etc.logstash
  (:require [riemann.logstash :refer :all]))

(def logstash (async-queue! :logstash {:queue-size 1000}
                (logstash {:host "logstasha.example.com" :port 2003 :pool-size 25})))

We first add a namespace, examplecom.etc.logstash, and require the riemann.logstash library which contains Riemann's Logstash driver. We've then created a new var called logstash. Like our Graphite connection from Chapter 4, we've specified an async-queue! to ensure that sending events to Logstash isn't a blocking action for Riemann. We've called that queue :logstash and specified a queue size of 1000. Inside our queue we've specified the logstash plugin with three options: :host, :port, and :pool-size. We've set the :host option to logstasha.example.com, the name of our Logstash server. We're going to assume our DNS will resolve this correctly. We've also specified a :port of 2003. Lastly, we've specified a :pool-size of 25. The :pool-size controls how many connections Riemann will keep open to Logstash. If you experience issues with throughput from Riemann to Logstash you can experiment with increasing the pool size.

If we were to use the logstash var, much like the graph var we created in Chapter 4, then any events would be sent from Riemann to Logstash. Let’s see that in action now.

Listing 8.90: Using the logstash var

(require '[examplecom.etc.logstash :refer :all])

...

(tagged "logstash_events" logstash )

We’ve first required our examplecom.etc.logstash functions in riemann.config. Here, every event tagged with logstash_events is identified by a where stream and passed to the logstash var. Now we need to configure the Logstash end of the connection. To do this we’re going to add a new incoming TCP input running on port 2003. Our incoming Rie- mann events are emitted as JSON hashes, so parsing them is easy using Logstash’s built-in json codec. Let’s add some configuration to our /etc/logstash/conf.d/logstash.conf file.


Listing 8.91: Adding a Riemann receiver

input {
  tcp {
    port => 5514
    type => "syslog"
  }
  tcp {
    port => 2003
    type => "riemann"
    codec => "json"
  }
...

We’ve added a new tcp input running on port 2003. We set a type of riemann for any events received on this input, and we use the json codec to parse any incoming events into Logstash’s message format from JSON. To enable our configuration we need to restart Logstash.

Listing 8.92: Restarting Logstash for our Riemann events

$ sudo service logstash restart

Now, if an event is sent via the logstash var, it’ll appear in Logstash. Let’s look at an example collectd event from Chapter 5 as it would be seen in Logstash.


Listing 8.93: A Riemann event in Logstash

{ "host" => "tornado-web1", "service" => "cpu user", "state" => "ok", "description" => nil, "metric" => 0.9950248756218906, "tags" => [ [0] "collectd" ], "time" => "2015-12-09T12:43:58.000Z", "ttl" => 60.0, "source" => "tornado-web1", "ds_index" => "0", "ds_name" => "value", "ds_type" => "gauge", "type_instance" => "user", "type" => "percent", "plugin" => "cpu", "@version" => "1", "@timestamp" => "2015-12-09T12:44:08.444Z" }

Our collectd event has been sent from Riemann, converted from JSON into Logstash's event format, and output here in debug form. We wouldn't generally send our collectd events to Logstash—we're already graphing them in Graphite after all—but this shows how easy it is to send events from Riemann to Logstash.


WARNING You need to be careful not to create feedback loops between Riemann and Logstash. Ensure any events sent from Riemann to Logstash, and vice versa, are not looped between the two destinations.

Scaling Elasticsearch and Logstash

In our current configuration, there’s a single Logstash server in each environment, each receiving our events and passing them to a cluster of three Elasticsearch servers. This addresses our immediate needs for an environment, but at some point we may need to scale Logstash and Elasticsearch.

Scaling Logstash

For Logstash, we have a few major options to scale our current solution:

• Partitioning
• Load balancing
• Brokering

The first option is partitioning. Partitioning means that Logstash servers process events from a collection of hosts—for example, a Logstash server for a specific application stack, rack, or environment. Our current configuration, a Logstash server per environment, is an example of partitioning. Setting up partitioning is relatively easy if you're using a configuration management system to manage your environments. Hosts, services, and applications can be configured to send events to specific servers—for example, configuring RSyslog forwarding for a subset of hosts to a specific Logstash server.


We can also partition at the next level up. The most expensive performance cost on a Logstash server is processing events, such as grok filtering. So we could configure a Logstash server to simply route events, for example:


Listing 8.94: A routing configuration for Logstash

input {
  tcp {
    port => 5514
    type => "syslog"
  }
  udp {
    port => 5514
    type => "syslog"
  }
}
output {
  if [host] == "tornado-web1" {
    tcp {
      host => "logstasha1.example.com"
      port => 5514
    }
  }
  if [host] == "tornado-db" {
    tcp {
      host => "logstasha2.example.com"
      port => 5514
    }
  }
}

Here we’ve configured a Logstash server to receive events using the tcp and udp input plugins on port 5514. We’ve not specified any filter plugins; instead we’ve

Version: v1.0.3 (2c4a7d0) 421 Chapter 8: Logs and logging passed our events straight to output plugins. In the output block we’ve specified if clauses to select events based on their source host and route them via the tcp output plugin to one of two new Logstash servers. Our events are then partitioned at that Logstash server, and specific processing would take place on the Logstash servers to which the events were forwarded. Upstream, on the Logstash servers, we configure Logstash itself to ensure each server has the required and associated configuration to process events from that subset. Our second method for scaling Logstash is load balancing. In this approach we leave our local configuration on hosts and services pointing at a DNS name, such as logstasha.example.com. We then run as many Logstash servers as required, each with identical configuration, behind a HAProxy cluster using something like keepalived to help manage the cluster. With identical configuration on each Logstash server, a round robin or least-connection algorithm on HAProxy will fairly distribute connections to the Logstash servers. Load balancing has more moving parts but is far simpler for ongoing maintenance, as the backends can be scaled without requiring configuration changes in the sender. Lastly, we can also insert a broker between our collection agent and the Logstash server, or between the Logstash server and Elasticsearch. A broker is a layer that buffers and stores log entries prior to sending them between ELK components. It usually has greater capacity for receiving events and allows you to broker high volumes of events prior to them hitting a component. Example brokers could include Redis and Kafka. All approaches have pros and cons. Partitioning is a technically simple solution that can be managed simply and easily with configuration management. But it can, depending on the granularity of the partition, require a large number of hosts and potentially result in excess capacity being provisioned. Load balancing is a more technologically complex solution and requires more in-depth and dynamic configuration and operational management. It does, however, have the advantage of generally being able to grow as additional capacity is required.
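As a minimal sketch of the load balancing approach, an HAProxy configuration fronting two identically configured Logstash servers might look something like this; the backend hostnames are illustrative, and a real configuration would also need the usual defaults and timeouts:

frontend syslog_in
    bind *:5514
    mode tcp
    default_backend logstash_pool

backend logstash_pool
    mode tcp
    balance leastconn
    server logstash1 logstasha1.example.com:5514 check
    server logstash2 logstasha2.example.com:5514 check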


Whatever solution we select, we'd then send all our events into our existing Elasticsearch cluster, scaling that cluster as needed.

Scaling Elasticsearch

For Elasticsearch, our scaling story is pretty simple. The cluster can be expanded with additional nodes: some nodes can be designated as masters, some as client nodes, and others as pure data nodes to assist in optimizing performance. The Elastic team offers some excellent resources that discuss scaling:

• The Scaling Guide.
• Chartbeat's useful guide on scaling Elasticsearch and Logstash.
• And Wilfred Hughes' great post on taming Elasticsearch clusters is hugely useful.

Each of these contains ideas for tuning and growing Elasticsearch clusters.

Monitoring our components

Finally, we want to make sure all of the components we've just set up keep working and are efficient, and that we're notified if any of them fail. Let's build some monitoring for each component.

Monitoring RSyslog

First we want to make sure RSyslog keeps running and that we're told when it fails. To do this we're going to make use of collectd's process monitoring. In Chapters 5 and 6 we saw how to use the processes plugin to monitor specific processes. Let's add some monitoring for RSyslog.


Let’s create a file, rsyslogd.conf, in /etc/collectd.d/ to hold our configuration.

Listing 8.95: Creating the rsyslogd.conf file

$ sudo vi /etc/collectd.d/rsyslogd.conf

Now let’s populate that file.

Listing 8.96: The rsyslogd.conf file

LoadPlugin processes

<Plugin processes>
  Process "rsyslogd"
</Plugin>

Note that we’re monitoring the rsyslogd process, and events from this process will pass through to Riemann to be tracked. If the process drops below the one process threshold we configured in Chapter 6 then a failure event will be generated and we’ll be notified.

Monitoring Logstash

As Logstash and Elasticsearch are both JVM-based applications, we can take advantage of the Java Management Extensions, or JMX, monitoring framework. JMX is a monitoring and measurement framework for Java. It ships with most Java distributions, and we can enable it on a case-by-case basis for specific applications.


TIP An open source alternative to JMX is Kamon. Also available, for Java 7 and later, is jstat, which you could use with a simple Collectd plugin.

Configuring JMX for Logstash

Let’s enable JMX for Logstash. We do that by updating the starting arguments for Logstash. On Ubuntu, we do this by editing the /etc/default/logstash file. On Red Hat and related distributions we’d update the /etc/sysconfig/logstash file. In both cases we want to update the LS_JAVA_OPTS environment variable.

Listing 8.97: Updating JMX for the LS_JAVA_OPTS variable

LS_JAVA_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management .jmxremote.port=8855 -Dcom.sun.management.jmxremote. authenticate=false -Dcom.sun.management.jmxremote.ssl=false"

We’ve added the following options:

Listing 8.98: The JMX enabling options

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=8855
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

They enable JMX and configure it to report locally, bound to localhost on port 8855. We've disabled authentication and SSL. If you'd like to use usernames and passwords, you can find instructions on the JMX site. We could also enable SSL, but it's probably not required when bound to localhost.

Now we restart Logstash to enable JMX.

Listing 8.99: Restarting Logstash to enable JMX

$ sudo service logstash restart

Configuring collectd for Logstash’s JMX

We now need to configure a collectd plugin to scrape our JMX endpoint and retrieve the metrics. The collectd daemon comes with a Java plugin called GenericJMX to do this. We run the GenericJMX plugin by executing it with the java helper plugin, much like using the python helper plugin as we saw in Chapter 7.

Let’s create a file to hold our plugin configuration.

Listing 8.100: The collectd JMX configuration file for Logstash

$ vi /etc/collectd.d/logstash_jmx.conf

Now let’s populate that file. The whole configuration is a little bit large topaste here, but we’ve included a sampling to show you how it works. The whole file is available here.


Listing 8.101: The logstash_jmx.conf file

LoadPlugin java

<Plugin "java">
  JVMARG "-Djava.class.path=/usr/share/collectd/java/collectd-api.jar:/usr/share/collectd/java/generic-jmx.jar"
  LoadPlugin "org.collectd.java.GenericJMX"

  <Plugin "GenericJMX">
    <MBean "java_memory">
      ObjectName "java.lang:type=Memory,*"
      InstancePrefix "java_memory"
      <Value>
        Type "memory"
        InstancePrefix "heap-"
        Table true
        Attribute "HeapMemoryUsage"
      </Value>
      <Value>
        Type "memory"
        InstancePrefix "nonheap-"
        Table true
        Attribute "NonHeapMemoryUsage"
      </Value>
    </MBean>

    ...

    <Connection>
      ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:8855/jmxrmi"
      Collect "classes"
      Collect "compilation"
      ...
    </Connection>
  </Plugin>
</Plugin>

The first line loads the java plugin. This is the base plugin that enables all supporting Java code for collectd. We then configure the plugin's operation inside a <Plugin "java"> block. Inside this block we tell the java plugin where to find our GenericJMX plugin and then load that plugin. We then have the GenericJMX plugin's configuration nested inside a second <Plugin> block.

We’ve specified two types of configuration for our GenericJMX plugin.

MBeans

MBean, or Managed Bean, blocks define a mapping of attributes to the types used by collectd to generate metrics. An MBean is a managed Java object. It's like a JavaBean component but exposes a management interface through JMX. You can define MBeans for devices, applications, or resources inside applications or the JVM. Each MBean exposes readable or writeable attributes, a set of operations, and a description.

In our GenericJMX plugin configuration we’ve focused on general JVM-related metrics rather than application-specific attributes. We look at useful metrics for managing the JVM’s performance like memory, garbage collection, invocations, and memory pools. Let’s look at an example of an MBean.


Listing 8.102: An MBean

ObjectName "java.lang:type=Memory" InstancePrefix "memory-heap" Type "memory" Table true Attribute "HeapMemoryUsage" ObjectName "java.lang:type=Memory" InstancePrefix "memory-nonheap" Type "memory" Table true Attribute "NonHeapMemoryUsage"

Our MBean block has a name, here memory-heap. This is the name of the MBean definition inside collectd. There is a further name, the ObjectName, which is the name of the MBean inside JMX. In this case our MBean tracks memory state in the JVM. We next define the InstancePrefix, an optional setting that prefixes our MBean values, so each value we define in this case will be prefixed with memory-heap. This helps us identify the source of individual values.

Next we define any values we're sourcing from the MBean. We enclose these in <Value> blocks. Every MBean needs at least one <Value> block. The attributes in the <Value> block set the individual fields of our metrics. We first set the Type, which in our first value is memory. This will help construct the metric type and name. The Table option specifies whether our value is a composite value or not; if true, it is assumed to be a composite value. Composite values are collections of data, each of which can be looked up by a key. The last option, Attribute, is the name of the attribute inside our MBean from which we're going to read our value. In the case of our first value it'll read an attribute called HeapMemoryUsage.

So what metrics will we end up with from this block? In this case, collectd will look up the HeapMemoryUsage attribute, which is a composite attribute made up of multiple values. It'll grab all of these values, prefix them with the InstancePrefix and the Type, and generate metrics much like:

Listing 8.103: GenericJMX Java memory heap metrics

GenericJMX-memory-heap/memory-committed
GenericJMX-memory-heap/memory-init
GenericJMX-memory-heap/memory-max
GenericJMX-memory-heap/memory-used

Our collectd instance will then send these metrics on to Riemann. We can then use them to monitor the performance of Logstash. Specific metrics of interest that we might graph are:

• Memory. The memory-heap/memory-used and memory-nonheap/memory-used metrics can help identify memory usage and leaks. Watch for growth here over time.
• Garbage collection. Monitoring the frequency of garbage collection and the length of time it takes can indicate whether there are memory leaks.
• Threads. Each thread consumes memory. A lot of threads opening or running can indicate memory bottlenecks or precede OOM, or "out of memory", errors.

TIP You can find a full list of the GenericJMX plugin's settings in the collectd plugin documentation.

Connections

The <Connection> blocks control where the plugin should look to find our JMX server.

Listing 8.104: The Connections block

ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:8855/ jmxrmi" Collect "classes" Collect "compilation" Collect "garbage_collector" Collect "memory" Collect "memory_pool" Collect "thread" Collect "thread-daemon"

Inside our block we’ve defined the ServiceUrl that points to our

Version: v1.0.3 (2c4a7d0) 431 Chapter 8: Logs and logging

JMX instance running on localhost on port 8855. We also specify a series of Collect options that tell collectd which MBeans to process for this connection. In this case we’re retrieving all of the defined MBeans, including, for example, the memory MBean we just saw.

Enabling Logstash process monitoring

We also want to monitor the Logstash process. We do this using the same method and processes plugin as we used to monitor collectd and RSyslog.

First, let’s create a file, logstash.conf, in /etc/collectd.d/ to hold our configu- ration.

Listing 8.105: Creating the logstash.conf file

$ sudo vi /etc/collectd.d/logstash.conf

Now let’s populate that file.

Listing 8.106: The logstash.conf file

LoadPlugin processes

<Plugin processes>
  ProcessMatch "logstash" "logstash\/runner.rb"
</Plugin>

You can see that we’re monitoring the logstash process. If it fails our process threshold will trigger and failure will be generated and sent to Riemann. We’ll then be notified that Logstash has failed.


Restarting collectd

Finally, to enable both our plugins, we restart the collectd daemon.

Listing 8.107: Restarting collectd to enable JMX

$ sudo service collectd restart

Monitoring Logstash with Riemann

Now we’ve got incoming metrics from our Logstash JVM instance and from our Logstash process. Our metrics will be tagged with collectd and Riemann will be automatically sending them through to Graphite. Once they are in Graphite we can construct appropriate graphs and add them to dashboards, etc.

Monitoring Elasticsearch

For monitoring Elasticsearch we've got a few different options, including monitoring the process itself. These include:

• The same JVM monitoring using JMX that we used for the Logstash application.
• The monitoring metrics exposed by Elasticsearch's statistics API endpoint (see the example query below).
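If you want to see the raw statistics that API exposes, you can query it directly. This assumes Elasticsearch is listening locally on its default port of 9200:

$ curl -s http://localhost:9200/_nodes/stats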

We’re going to add a new plugin to our existing installation to collect metrics from Elasticsearch. Let’s start with grabbing that plugin. As we’ve learned, the collectd daemon comes with a few useful plugins that can help us run plugins written in a variety of languages. In this case our new plugin is written in Python so we’ll again use the Python plugin to run it. Let’s download the plugin.


Listing 8.108: Download the elasticsearch_collectd.py plugin

$ cd /usr/lib/collectd/
$ wget https://raw.githubusercontent.com/jamtur01/collectd-elasticsearch/master/elasticsearch_collectd.py

NOTE Credit for the original plugin goes to SignalFuse and Jeremy Carroll.

Here we’ve changed into the /usr/lib/collectd/ directory (on Red Hat this would be the /usr/lib64/collectd/ directory) where collectd keeps all of our plugins. We’ve then downloaded the plugin from GitHub. You can take a look at its source. It defines a list of metrics and some configurable options. It alsohasa series of methods that read the Elasticsearch stats API and parse the metrics.

TIP You can learn about writing your own Python plugins on the collectd Python man page.

Configuring the Elasticsearch collectd plugin

Now that we’ve installed the plugin we need to tell collectd about it and configure it. To do that we’re going to add a configuration file for the plugin to the /etc/ collectd.d directory called elasticsearch.conf.


Listing 8.109: Creating an elasticsearch.conf configuration file

$ sudo vi /etc/collectd.d/elasticsearch.conf

Now let’s populate that file.

Listing 8.110: The elasticsearch.conf configuration file

<LoadPlugin python>
  Globals true
</LoadPlugin>

<Plugin python>
  ModulePath "/usr/lib/collectd/"
  Import "elasticsearch_collectd"

  <Module elasticsearch_collectd>
    Verbose false
    Cluster "productiona"
  </Module>
</Plugin>

Our configuration file loads the python plugin. The Globals true line enables the Python standard libraries on your host so you can use them in your plugins. Next we configure the python plugin. We tell collectd where to find our plugins, here /usr/lib/collectd (this would be /usr/lib64/collectd on Red Hat). We then import the elasticsearch_collectd plugin we just installed. This loads the plugin into collectd. Then we configure the plugin itself. We tell it to disable verbose logging via the Verbose option, and we tell it the Elasticsearch cluster name via the Cluster option, here productiona.

Enabling Elasticsearch process monitoring

We also want to monitor the Elasticsearch process. We again do this using the processes plugin. We edit the file elasticsearch.conf in /etc/collectd.d/ to add our configuration.

Listing 8.111: Editing the elasticsearch.conf file

$ sudo vi /etc/collectd.d/elasticsearch.conf

Now we add to the bottom of this file:

Listing 8.112: Updating the elasticsearch.conf file

LoadPlugin processes

<Plugin processes>
  Process "elasticsearch"
</Plugin>

Note that we’re monitoring the elasticsearch process and events from this pro- cess will pass through to Riemann. As with all of our other process monitoring, if Elasticsearch stops, we’ll get a failure event and be notified.


Restarting collectd

Finally, to enable both our plugins, we restart the collectd daemon.

Listing 8.113: Restarting collectd to enable Elasticsearch monitoring

$ sudo service collectd restart

Now collectd should be scraping the Elasticsearch statistics API for metrics and watching the process for availability and performance.

Monitoring Elasticsearch with Riemann

Now we’ve got incoming metrics from our Logstash JVM instance and from our Logstash process. Our metrics will be tagged with collectd, and Riemann will be automatically sending them through to Graphite. Once they are in Graphite we can construct appropriate graphs, add them to dashboards, etc. Inside Graphite you should now find a set of metrics that include most of theJVM- related metrics we added for Logstash, including memory and garbage collection. We also get a set of metrics for Elasticsearch including request latency and timing. We can combine these with our CPU and memory metrics for each node to produce a definitive picture of how Elasticsearch is performing.

NOTE We’ve included all example configuration and code in the book on GitHub.


Alternatives to Logstash

There is a wide variety of log management solutions available, many of them boutique or designed for specific scenarios like security or network monitoring. As a result, we're only going to cover a few examples.

Splunk

Splunk is the major commercial log management product available. It has both on-premise and Cloud-based solutions.

Heka

Heka is an open-source log management solution designed by the team at Mozilla. It is a log and data collection tool written in Go. Like Logstash it provides a pluggable framework for collecting and processing logs, and it can output to a variety of destinations, including Elasticsearch. Heka is no longer maintained; a potential successor is Hindsight.

Graylog

Graylog is another open-source log management solution. Like Logstash and Heka it provides a pluggable framework and has native collection tools for Linux and Windows.

mtail

mtail is a log file metrics extraction program open sourced by Google, used for low-overhead metrics extraction from applications that can't be linked with their varz metrics libraries. It uses small awk-like programs to describe patterns and actions to export the metrics.


Summary

In this chapter we’ve added a log management layer to our monitoring environ- ment. We’ve done this by installing Logstash on a central host, an Elasticsearch cluster to hold our logs, and we’ve introduced you to the Kibana dashboard and console for visualizing events.

We’ve also integrated it with our Riemann solution to allow you to send notifica- tions from Logstash events, capture metrics from your logs, or—in reverse—send events from Riemann into Logstash.

We’ve added support for collecting logs on your hosts and Docker containers via Syslog and by sending them to Logstash. This solution complements the host- based metrics we established in Chapters 5, 6, and 7, and we’ll exercise it in subsequent chapters. In the next chapter we’re going to continue building our our monitoring environ- ment by looking at application metrics and events.

Chapter 9

Building Monitored Applications

We’ve been building our monitoring platform for the last few chapters. We’ve got our Riemann event engine and a way to visualize those events and metrics using Graphite and Grafana. We’ve also started to collect events and metrics to process and visualize. In the last three chapters we’ve gathered basic host metrics using collectd and set up the collection of logs from our hosts. In this chapter we’re going to extend our monitoring and collection to applications.


Figure 9.1: Application and business logic monitoring

We’re going to focus on three approaches to monitoring applications:

• Emitting metrics by instrumenting code.
• Generating structured or semantic log events.
• Building health checks and endpoints.

We’ll look at how to embed these methods in your applications and use the result- ing measurements and metrics to analyze the performance of your applications.

Firstly, though, we’re going to go through some high-level design patterns and principles you should consider when thinking about application monitoring.

An application monitoring primer

Let’s look at some basic tenets for application monitoring. Firstly, in any good application development methodology, it’s a good idea to identify what you want to build before you build it. Monitoring is no different. Sadly there is a common anti-pattern in application development of considering monitoring and other oper- ational functions like security as value-add components of your application rather than core features. Monitoring (and security!) are core functional features of your applications. So, if you’re building a specification or user stories for your appli- cation, include monitoring for each component of your application. Not building metrics or monitoring is a serious business and operational risk resulting in:

• An inability to identify or diagnose faults.
• An inability to measure the operational performance of your application.
• An inability to measure the business performance and success of an application or a component, like tracking sales figures or the value of transactions.

A second common anti-pattern is not instrumenting enough. We recommend you over-instrument your applications. You'll often complain about having too little data but rarely worry about having too much.

Thirdly, if you use multiple environments—for example development, testing, staging, and production—then ensure your monitoring configuration provides tags or identifiers so you know that a metric, log entry, or event is from a specific environment. This way you can partition your monitoring and metrics. We'll talk more about this later in the chapter.

Where should I instrument?

Good places to start adding instrumentation for your applications are at points of ingress and egress, for example:

• Measure and log requests and responses, such as to specific web pages or API endpoints. If you're instrumenting an existing application, make a priority-driven list of specific pages or endpoints and instrument them in order of importance.
• Measure and log all calls to external services and APIs, such as when your application uses a database, cache, or search service, or third-party services like a payments gateway.
• Measure and log job scheduling, execution, and other periodic events like cron jobs.
• Measure significant business and functional events, such as users being created or transactions like payments and sales.
• Measure methods and functions that read and write from databases and caches.

Instrument schemas

You should ensure that events and metrics are categorized and clearly identified by the application, method, function, or similar marker so that you know what generated a specific event or metric and where. You should develop a schema for your metric names and log events (we'll talk more about structured and semantic logging later in this chapter). Let's look at an example:


Listing 9.1: An example metric name schema

location.application.environment.subsystem.function.actions

Here we’ve created a metric schema that should cover most of our typical appli- cations and can easily be shortened to cater for other examples. Let’s apply it to an application: productiona.tornado.development.payments.collection.job.count

We’ve specified the data center where our application is located, here productiona . We then specify the name of the application. We’ve also listed the environment it is running in, e.g., production, development, and so on. Then a subsystem or component like payments or user, the function being measured, and beneath that, any methods or actions. For less complex applications we can shrink the path to suit.

Time and the observer effect

It’s also important to ensure, as we discussed in Chapter 4, that the time of events is accurate, and that the time on the hosts that run your applications is accurate. You can use a service like NTP to do this as we described in Chapter 4. Also ensure that the time zone on your host is set to UTC for consistency across your hosts.

You should ensure your events have timestamps. If you create events and metrics that contain timestamps, please use standards. For example, the ISO8601 standard provides dates and timestamps that are parseable by many tools.1

Lastly, wherever possible, minimize the load on your application by logging events asynchronously. In more than one case, outages have been caused or performance degraded by monitoring or metrics overloading an application. Also relevant here is the observer effect we introduced in Chapter 2: if your monitoring consumes considerable CPU cycles or memory, it could impact the performance of your application or skew the results of any monitoring.

1 Please don't invent your own timestamp format. Please.

Now, let’s look at each monitoring method in turn.

Metrics

Like much of the rest of our monitoring, metrics are going to be key to our application monitoring. So what should we monitor in our applications? We want to look at two types of metrics:

• Application metrics, which generally measure the state and performance of your application code.
• Business metrics, which generally measure the value of your application. For example, on an e-commerce site, that might be how many sales you made.

We’re going to look at examples of both types of metrics in this chapter.

Application metrics

Application metrics measure the performance and state of your applications. They include characteristics of the end user experience of the application, like latency and response times. Behind this we measure the throughput of the application: requests, request volumes, transactions, and transaction timings.

TIP A good example of how to measure application throughput is Brendan Gregg’s USE Method.


We also look at the functionality and state of the application. A good example of this might be successful and failed logins or errors, crashes, and failures. We could also measure the volume and performance of activities like jobs, emails, or other asynchronous activities.

Business metrics

Business metrics are the next layer up from our application metrics, and they are usually closely related. If you think of measuring the number of requests made to a specific service as an application metric, then the corresponding business metric usually does something with the content of the request. For example, an application metric might measure the latency of a payment transaction; the corresponding business metric might be the value of each payment transaction. Business metrics might include the number of new users/customers, number of sales, sales by value or location, or anything else that helps measure the state of a business.

Monitoring patterns, or where to put your metrics

Once we know what we want to monitor and measure, we need to work out where to put our metrics. In almost all cases the best place to put these metrics is inside our code, as close as possible to the action we're trying to monitor or measure. We don't, however, want to put our metrics configuration inline everywhere we want to record a metric. Instead we want to create a utility library: a function that allows us to create a variety of metrics from a centralized setup. This is sometimes called the utility pattern—metrics logging as a utility class that does not require instantiation and only has static methods.


The utility pattern

A common pattern is to create a utility library or module using tools like StatsD, Coda Hale's Java Metrics library, or directly in Riemann using one of the available clients. The utility library would expose an API that allows us to create and increment metrics. We can then use this API throughout our code base to instrument the areas of the application we're interested in.
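As a rough sketch of what such a library might look like, here's a hypothetical Ruby Metric module backed by a StatsD client. It assumes the statsd-ruby gem is installed; the host, port, and method names are illustrative rather than the implementation used later in this chapter.

# A hypothetical Metric utility module backed by a StatsD client.
# Assumes the statsd-ruby gem; host and port here are illustrative.
require 'statsd'

module Metric
  STATSD = Statsd.new('statsd.example.com', 8125)

  # Increment a counter, e.g. Metric.increment 'payment'
  def self.increment(name)
    STATSD.increment(name)
  end

  # Time a block and record it as a timer, e.g. Metric.time('payment.duration') { pay(...) }
  def self.time(name, &block)
    STATSD.time(name, &block)
  end
end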

Let’s take a look at an example of this. We’ve created some pseudo Ruby-esque code to demonstrate, and we’ve assumed that we have already created a utility library called Metric.

NOTE We’ll see several functioning examples of this pattern later in this chap- ter.


Listing 9.2: A sample payments method

include Metric

def pay_user(user, amount)
  pay(user.account, amount)
  Metric.increment 'payment'
  Metric.increment "payment.amount, #{amount.to_i}"
  Metric.increment "payment.country, #{user.country}"
  send_payment_notification(user.email)
end

def send_payment_notification(email)
  send_email(payment, email)
  Metric.increment 'email.payment'
end

Here we’ve first included our Metric utility library. We can see that we’ve spec- ified both application and business metrics. We’ve first defined a method called pay_user that takes user and amount values as parameters. We’ve then made a payment using our data and incremented three metrics:

• A payment metric — Here we increment the metric each time we make a payment.
• A payment.amount metric — This metric records each payment by amount.
• A payment.country metric — This metric records the country of origin of each payment.

Finally, we’ve sent an email using a second method, send_payment_notification , where we’ve incremented a fourth metric: email.payment. The email.payment

Version: v1.0.3 (2c4a7d0) 448 Chapter 9: Building Monitored Applications metric counts the number of payment emails sent.

The external pattern

What if you don’t control the code base, can’t insert monitors or measures inside your code, or perhaps have a legacy application that can’t be changed or updated? Then you need to find the next closest place to your application. The most obvious places are the outputs and external subsystems around your application.

If your application emits logs, then identify what material they contain and see if you can use their contents to measure the behavior of the application. This is an ideal use for Logstash (which we introduced in Chapter 8) and filters like grok. You can use filters, like the metrics plugin, to dissect your log entries and extract data that you can map and send to Riemann (and from there to Graphite) or Elasticsearch to be indexed. Often you can track the frequency of events by simply recording the counts of specific log entries.

If your application records or triggers events in other systems—things like database transactions, job scheduling, emails sent, calls to authentication or authorization systems, caches, or data stores—then you can use the data contained in these events, or the counts of specific events, to record the performance of your application.
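As a sketch of this approach, assuming Apache access logs in the combined format are already being shipped into Logstash, we could count error responses with the grok and metrics filters we saw in Chapter 8; the metric name and tag here are illustrative.

filter {
  grok {
    # Parse a combined-format Apache access log using the built-in pattern.
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  if [response] == "500" {
    metrics {
      # Count HTTP 500 responses and emit a rate event we could send to Riemann.
      meter => "apache.errors"
      add_tag => "apache_error_rate"
    }
  }
}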

Building metrics into a sample application

Now that we have some background on monitoring applications let’s look at an example of how we might implement this in the real world. We’re going to build an application that takes advantage of a utility library to send events from the application to a statistics aggregation server called StatsD.

NOTE An alternative to doing this yourself is to deploy an Application Performance Management, or APM, tool like New Relic or AppDynamics. These are usually plugins or add-ons to your application that collect and send application performance data from your application into a service that processes and displays it.

Introducing StatsD

StatsD is a daemon for aggregating statistics. It listens on a UDP or TCP port for incoming messages, parses them, and extracts any metrics contained in them. The original daemon was developed in NodeJS by Etsy as part of their focus on measuring everything in their environment. That daemon flushed extracted metrics to Graphite directly, but later releases have pluggable back ends that allow you to send metrics to a variety of destinations. The StatsD daemon has been rewritten numerous times in a variety of languages and has client support for a large number of languages and frameworks. Part of the reason for its popularity is the line protocol that it uses:

Listing 9.3: The StatsD protocol

<metric_name>:<value>|<type>

A typical metric looks like:

Listing 9.4: A typical StatsD metric

productiona.tornado.production.payments.amount:1|c

Like Graphite (and because it was originally intended to output directly to Graphite), StatsD uses periods (.) to divide metric namespaces or "buckets." So our metric would be called: productiona.tornado.production.payments.amount

After the : we specify the value of our metric, here 1, and then, separated by a |, the type of metric. In this case, that’s c for counter. We could also specify g for gauge or a variety of other metric types including timers. StatsD receives metrics and periodically aggregates then flushes them, by default every 10 seconds, to a back end. The exact aggregation behavior at flush depends on the type of metric. So let’s look at some of the available metric types.
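
To make the line protocol concrete, here's a minimal sketch that writes raw StatsD strings straight to a UDP socket using Ruby's standard library; it assumes a StatsD server is listening on localhost port 8125:

require 'socket'

# Send one counter and one gauge using the raw StatsD line protocol.
socket = UDPSocket.new
socket.send('productiona.tornado.production.payments.amount:1|c', 0, 'localhost', 8125)
socket.send('productiona.tornado.production.payments.job.active_jobs:232|g', 0, 'localhost', 8125)
socket.close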

Counters

We’ve already seen a counter.

Listing 9.5: A StatsD counter

productiona.tornado.production.payments:1|c

When this is triggered it would add 1 to the productiona.tornado.production.payments counter. In most StatsD implementations the client will send both the count and the rate. StatsD iterates over any counters it receives and increments them until flushed. So if you send seven increments of the productiona.tornado.production.payments counter, when StatsD flushes, generally every 10 seconds, it'll set the counter to 7 and calculate a rate of 0.7 increments per second.
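
As a quick sketch of that behavior, assuming a StatsD server on localhost port 8125 and the statsd-ruby client we'll install later in the chapter:

require 'statsd'  # provided by the statsd-ruby gem

statsd = Statsd.new('localhost', 8125)

# Seven increments inside a single 10-second flush interval: at flush,
# StatsD reports a count of 7 and a rate of 0.7 increments per second.
7.times { statsd.increment 'productiona.tornado.production.payments' }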

Timers

This is a StatsD timer. Timers can be units of time, bytes, numbers of items, or any other number collected.


Listing 9.6: A StatsD timer

productiona.tornado.production.payments.job.collection:10|ms|@0.1

You can see a metric name and the time (here in ms or milliseconds) of 10. The last option is a sample rate. You can apply sampling to StatsD metrics for frequent events that you are worried will overload a server. Here 0.1 is a sample rate of 1/10: only 10% of events are sampled, and StatsD will scale up the sampled data to estimate what 100% of the events would look like. StatsD will also automatically calculate percentiles (you can configure which percentile it'll calculate but it usually defaults to 90), average/mean, standard deviation, sum, and the lower and upper bounds of any timing data. It'll suffix the type of calculation to a new metric event, generally like so:

Listing 9.7: StatsD timer calculations

productiona.tornado.production.payments.job.collection-average
productiona.tornado.production.payments.job.collection-upper
productiona.tornado.production.payments.job.collection-lower
productiona.tornado.production.payments.job.collection-sum
productiona.tornado.production.payments.job.collection-count
...

So if we were to send the values 112, 133, and 156 to StatsD over the flush interval, then StatsD would calculate something like the following values:


Listing 9.8: StatsD timer calculations

productiona.tornado.production.payments.job.collection-average 133.66666666667
productiona.tornado.production.payments.job.collection-upper 156
productiona.tornado.production.payments.job.collection-lower 112
productiona.tornado.production.payments.job.collection-sum 401
productiona.tornado.production.payments.job.collection-count 3
...

If the client is configured to do so, then it will also calculate percentiles for each metric, resulting in metrics like:

Listing 9.9: StatsD timer percentile calculations

productiona.tornado.production.payments.job.collection_99

This would be the 99th percentile.
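
Looking back at the sample rate in Listing 9.6, here's a sketch of how a client might emit a sampled timer. It again assumes the statsd-ruby client we'll install later in the chapter and a StatsD server on localhost port 8125:

require 'statsd'  # provided by the statsd-ruby gem

statsd = Statsd.new('localhost', 8125)

# Record a 10ms timing, sampling roughly 1 in 10 calls; when a sample
# is sent, the client appends the |@0.1 sample rate for us.
statsd.timing 'productiona.tornado.production.payments.job.collection', 10, 0.1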

Gauges

We can also use StatsD to send gauge metrics, an arbitrary value at a point in time.

Listing 9.10: A StatsD gauge

productiona.tornado.production.payments.job.active_jobs:232|g

Here the metric is a gauge, g, and when flushed it'll send 232 as the value of the metric. If StatsD receives multiple gauges during a flush interval it'll only send the last value. For example, if StatsD received 143, 567, and 232 as values of the productiona.tornado.production.payments.job.active_jobs gauge, then only the last value, 232, would be flushed to the back end. If the gauge isn't updated at the next flush it'll send the previous value—or you can configure StatsD to delete the value instead. We can also change gauge values by prefixing a sign.

Listing 9.11: Changing StatsD gauge values

productiona.tornado.production.payments.job.active_jobs:232|g
productiona.tornado.production.payments.job.active_jobs:-15|g
productiona.tornado.production.payments.job.active_jobs:+4|g

So if our initial gauge is 232 then we’d subtract 15 from it and add 4 for a final value of 221.

Sets

StatsD also supports sets. (However this is a recent addition to StatsD—not all StatsD clients support them yet.) Sets allow you to track the number of unique elements belonging to a group, such as the number of unique users on a site.

Listing 9.12: A StatsD set example

productiona.tornado.production.payments.userid:3567|s

When flushed StatsD will push the number of unique elements in the set as a gauge value.
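
Because client support for sets varies, here's a sketch that uses the raw line protocol over UDP instead; it assumes a StatsD server listening on localhost port 8125:

require 'socket'

# Three values, two of them unique, so at flush StatsD reports 2.
socket = UDPSocket.new
[3567, 3568, 3567].each do |id|
  socket.send("productiona.tornado.production.payments.userid:#{id}|s", 0, 'localhost', 8125)
end
socket.close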


Our application collection architecture

Now that we’ve got an idea of how StatsD works, we’ll use our previously installed collectd infrastructure to provide a StatsD server via a plugin we’ll enable: statsd . The collectd statsd plugin provides a StatsD server and sends events to the write_riemann plugin, which in turn sends the events to Riemann. This means our events will flow directly from collectd to Riemann, and we’ll see howwecan make use of them when they arrive there. Then we’ll see what to do with them when they are added to Graphite and made available to Grafana. This diagram shows our application collection architecture.


Figure 9.2: Our application monitoring architecture

Our application generates metrics and acts as a StatsD client. We’ll configure it using the utility pattern suggested above.


Setting up the statsd plugin inside collectd

Let’s start by setting up the statsd plugin in our collectd configuration. We create a new collectd configuration file, let’s call it statsd.conf, in the /etc/collectd. d/ directory and populate it.

Listing 9.13: The statsd.conf configuration file

LoadPlugin statsd

<Plugin statsd>
  Host "localhost"
  Port "8125"
  TimerPercentile 90
  TimerPercentile 99
  TimerLower true
  TimerUpper true
  TimerSum true
  TimerCount true
</Plugin>

We first load the statsd plugin. We then configure the plugin and set the Host and Port to which we’re binding the StatsD server. Here we’re binding it to the localhost interface on port 8125.

We also configure a number of options that control StatsD's behavior. We mentioned earlier in the chapter that when using StatsD timer metrics we can configure a percentile calculation (amongst other calculations) of the timer value. Here we've specified the TimerPercentile option twice to configure two percentile values: the 90th and 99th percentiles. We've also used the TimerLower, TimerUpper, TimerSum, and TimerCount options to tell the statsd plugin to calculate the lower and upper bounds as well as the sum and count.

We can also control the behavior of metrics over flush intervals. The default statsd plugin behavior is to continue to send metrics with default values. For example, the rate of counters and size of sets will be zero, timers will report NaN, and gauges will be unchanged. We could use the DeleteCounters, DeleteTimers, DeleteGauges, and DeleteSets boolean options to change this behavior and delete unchanged metrics instead. The default for all of these options is false, meaning metrics are sent with their default values.

It's also important to note that the flush interval for the statsd plugin will be the check interval for collectd. In our case we set this to two seconds back in Chapter 5.

Tagging our application events

Our current collectd events are all tagged with collectd in the :tags field of Riemann events. This tag is added by the Tag directive in the write_riemann plugin’s configuration.


Listing 9.14: Revisiting the write_riemann plugin

LoadPlugin write_riemann

<Plugin write_riemann>
  <Node "riemanna">
    Host "riemanna.example.com"
    Port "5555"
    Protocol TCP
    CheckThresholds true
    StoreRates false
    TTLFactor 30.0
  </Node>
  Tag "collectd"
</Plugin>

Let’s add a new tag to this configuration that will identify that this host and its services are related to a new application called aom-rails. We do this by adding a new Tag directive to our configuration.

NOTE We’ll see more about this new application shortly.


Listing 9.15: Revisiting the write_riemann plugin

LoadPlugin write_riemann

<Plugin write_riemann>
  ...
  Tag "collectd"
  Tag "aom-rails"
</Plugin>

TIP This is another example of why managing collectd with a configuration management tool will make life a lot easier for you.

Our :tags field in Riemann will now be a vector containing:

:tags [collectd aom-rails]

We can use this new tag to better filter our events. Let’s restart collectd to enable the plugin and our new tags.

Listing 9.16: Restarting collectd to enable statsd

$ sudo service collectd restart

We now test that the StatsD server is working by sending it some sample metrics. We do this via the command line using the nc binary. The nc, or netcat, binary allows us to interact with arbitrary TCP and UDP services. Here we're going to send a raw StatsD metric string to our local StatsD server and then see if the corresponding event appears in Riemann.

Listing 9.17: Testing the statsd collectd client

$ echo "foo:1|c" | nc -u -w0 localhost 8125

This will echo a counter called foo with a metric of 1 (remember c means counter) to our statsd collectd plugin on localhost at port 8125. This metric will be processed by collectd and sent onto Riemann. If we look in Riemann we should see an event like:

Listing 9.18: Our first statsd event in Riemann

{:host tornado-web1, :service statsd/derive-foo, :state ok, :description nil, :metric 1, :tags [collectd aom-rails], :time 1446555358, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type derive, :type_instance foo, :type derive, :plugin statsd}

Let’s break our event down by field. We get a host, here tornado-web1, and a service, statsd/derive-foo. The statsd/derive- prefix comes from the statsd plugin. The :type_instance field contains the plain name of the metric, foo. We also get the :plugin field that contains the name of our statsd plugin. All of these are useful fields to filter in order to grab specific metrics. We can seethat our :tags field has been updated to contain both collectd and aom-rails. Lastly, we have the :metric field which contains our current count, 1. We can use this field in checks or send it to be graphed. The collectd daemon will continue to emit each metric until it is restarted unless we use the DeleteCounters option. The metric value will remain unchanged un- less no data is received for that metric. So, if timing a method and that method

Version: v1.0.3 (2c4a7d0) 461 Chapter 9: Building Monitored Applications wasn’t executed, the :metric field will changed to a value of NaN upon the next flush.

Now let’s see StatsD monitoring an actual application.

Our sample Rails application

We’ve created a sample Rails application using Rails Composer. We’re going to call it aom-rails or The Art of Monitoring Rails application. The aom-rails ap- plication allows us to create and delete users and sign in to the application.

NOTE You can find the aom-rails application on GitHub.

Say we want to get in early and instrument our application. We’ll first need to add support for StatsD using a Ruby-based client. The statsd-ruby gem allows us to create a StatsD client inside our application. We’re going to do that initially by adding the statsd-ruby gem to our Rails application’s Gemfile.

Listing 9.19: The aom-rails Gemfile

source 'https://rubygems.org'
ruby '2.2.2'
gem 'rails', '4.2.4'
...
gem 'statsd-ruby'
...

We then install the new gem using the bundle command.


Listing 9.20: Install statsd-ruby with the bundle command

$ sudo bundle install
Fetching gem metadata from https://rubygems.org/...
Fetching version metadata from https://rubygems.org/...
Fetching dependency metadata from https://rubygems.org/..
...
Installing statsd-ruby 1.2.1
...

We then test the statsd-ruby client using a Rails console. Let’s launch one now using the rails c command.

Listing 9.21: Testing statsd-ruby with the Rails console

$ rails c
Loading development environment (Rails 4.2.4)
[1] pry(main)> $statsd = Statsd.new 'localhost', 8125

We’ve launched a Rails console and created a StatsD client using the code:

Listing 9.22: Our first Ruby-based StatsD client

$statsd = Statsd.new 'localhost', 8125

We’ve created a new variable called $statsd which calls an instance of Statsd

Version: v1.0.3 (2c4a7d0) 463 Chapter 9: Building Monitored Applications and passes in two variables, localhost and 8125, the host and port of the target StatsD server respectively. In our case this is the StatsD server provided locally by the collectd statsd plugin.

TIP If you use a DNS name for the host in your StatsD client then you should use a local DNS caching service like nscd to ensure DNS resolution doesn’t impact the performance of your StatsD client. In our case we’re connecting to our local statsd plugin from collectd.

We’ll use this $statsd variable to send metrics to the StatsD server. Let’s look at some examples.

There are three major methods we can use to create metrics:

Listing 9.23: The basic statsd-ruby methods

$statsd.increment 'app.counter'
$statsd.timing 'app.timer', 30
$statsd.gauge 'app.gauge', 100

The increment method increments a counter, here app.counter. The timing method creates a timer; in this case we're saying the app.timer took 30 milliseconds. The last method, gauge, creates a gauge with a value of 100.

We're not going to manually create a client every time we want to log any metrics, so let's set up some utility code to do this for us. In the Ruby on Rails world, we'd generally add an initializer to create our StatsD client when Rails starts. Let's do that now by creating a statsd.rb file in the config/initializers directory containing:


Listing 9.24: The statsd-ruby configuration initializer

STATSD = Statsd.new("localhost", 8125)
STATSD.namespace = "aom-rails.#{Rails.env}"

Next we define a constant called STATSD by calling a new instance of Statsd. It uses the value of localhost for the host and 8125 as the port number. This will connect to our local statsd plugin running in collectd.

Then we set a namespace. The namespace sets a global prefix for any StatsD metrics. It allows us to identify what application our metric comes from. We've added a wrinkle: we're interpolating the Rails.env method, which returns the Rails environment in which the application is running. In the Rails world, the "environment" allows Rails applications to be configured differently for production, development, etc. This means we can ensure metrics are coming from the production variant of the application rather than any other environment. We could even configure StatsD to report somewhere else for non-production instances, for example:

Listing 9.25: The statsd-ruby configuration initializer

if %w(staging production).include?(Rails.env)
  STATSD = Statsd.new("localhost", 8125)
else
  STATSD = Statsd.new("statsd-dev.example.com", 8125)
end
STATSD.namespace = "aom-rails.#{Rails.env}"

Here we’re specifying that if the Rails.env method returns either production

Version: v1.0.3 (2c4a7d0) 465 Chapter 9: Building Monitored Applications or staging then we use the local collectd plugin at localhost, or else we use a new host, statsd-dev.example.com, for reporting of metrics for any other envi- ronment.

So what do our metrics look like with a namespace? Let’s create a new counter and take a look.

Listing 9.26: The statsd-ruby namespace method

STATSD.increment 'app.counter'

The namespace is prefixed in front of our metric name. So with a namespace of aom-rails.#{Rails.env}—assuming we're running in the production environment—and a metric name of app.counter, our new metric name would then become: aom-rails.production.app.counter. We then use the STATSD constant inside our code. Let's look at how we might add metrics. We'll start with a counter that increments when users are deleted.

Listing 9.27: Counter for aom-rails user deletions

def destroy
  user = User.find(params[:id])
  user.destroy
  STATSD.increment "user.deleted"
  redirect_to users_path, :notice => "User deleted."
end

This would create a counter called: aom-rails.production.user.deleted


That counter would be incremented every time a user is deleted. The counter then travels through the statsd plugin in collectd, into the write_riemann plugin, and from there to our Riemann server. On the Riemann server we then see the new event:

Listing 9.28: The aom-rails.production.user.deleted event in Riemann

{:host tornado-web1, :service statsd/derive-aom-rails.production.user.deleted, :state ok, :description nil, :metric 1, :tags [collectd aom-rails], :time 1449374333, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type derive, :type_instance aom-rails.production.user.deleted, :type derive, :plugin statsd}

We see that the event has a :service field of: statsd/derive-aom-rails.production.user.deleted

As we learned earlier in the chapter, the statsd/ prefix has been added by the statsd plugin in collectd. Remember, we can see the non-prefixed metric name in the :type_instance field. We also get the host tornado-web1 that generated the event. Finally, in the :metric field we see the actual counter value, currently 1 for one user deleted since the counter started recording. We could also create another counter for created users by adding it to the User model.


Listing 9.29: Counter for aom-rails user creation

class User < ActiveRecord::Base
  enum role: [:user, :vip, :admin]
  after_initialize :set_default_role, :if => :new_record?

  after_create do
    STATSD.increment "user.created"
  end
  ...
end

Here we’ve used an ActiveRecord callback, after_create, to increment a counter, aom-rails.production.user.created, when a new user is created. This will gen- erate an event much like our aom-rails.production.user.deleted counter.

Listing 9.30: The aom-rails.production.user.created event in Riemann

{:host tornado-web1, :service statsd/derive-aom-rails.production.user.created, :state ok, :description nil, :metric 1, :tags [collectd aom-rails], :time 1449386123, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type derive, :type_instance aom-rails.production.user.created, :type derive, :plugin statsd}

We can also time specific events, functions, and methods, either by using the timer method or by wrapping the action in a block. Here we’re instrumenting the show method to time how long it takes to find a user.


Listing 9.31: Time the User.find action

def show
  STATSD.time("user.find") do
    @user = User.find(params[:id])
  end
  unless current_user.admin?
    unless @user == current_user
      redirect_to :back, :alert => "Access denied."
    end
  end
end

This will create a timer event every time the show method is executed. When it reaches Riemann we’ll see an event like:

Listing 9.32: The user.find timer in Riemann

{:host tornado-web1, :service statsd/latency-aom-rails.production.user.find-average, :state ok, :description nil, :metric 0.003999999724328518, :tags [collectd aom-rails], :time 1446653797, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance aom-rails.production.user.find-average, :type latency, :plugin statsd}

The collectd statsd plugin has generated a new event with a :service field of statsd/latency-aom-rails.production.user.find-average. The statsd/latency prefix is added to denote the type of metric, and the -average suffix indicates that it's the average time that the method took. Since we've configured the statsd plugin to produce a variety of other timing metrics we'd also get events for:

• 90th and 99th percentile.
• Upper and lower bounds.
• Sum.
• Count.

Here’s the 99th percentile event.

Listing 9.33: The user.find 99th percentile

{:host tornado-web1, :service statsd/latency-aom-rails.production.user.find-percentile-99, :state ok, :description nil, :metric 0.005849609151482582, :tags [collectd aom-rails], :time 1449377543, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance aom-rails.production.user.find-percentile-99, :type latency, :plugin statsd}

NOTE You’ll have a wide variety of applications you may want to instrument. We’re creating an example application to show you how we might apply some of these principles. While you might not be able to reuse the code, the high-level principles apply to almost every framework and language you’re likely to have operational.


Using our metrics

Now our application has metrics being generated and we can make use of them in Riemann. There are a couple of things we could do here with the events from our aom-rails application. The first is that we can detect if any events don't appear.

Listing 9.34: Detecting missing aom-rails events

(where (service #"^statsd\/.*-aom-rails.production") (expired (email "[email protected]")))

Here we’ve added some code that matches any incoming aom-rails metrics coming only from the production environment, identifies if any metric events are expired (i.e., haven’t appeared in the index for 60 seconds), and then sends [email protected] an email notification. We could also target a specific metric. Here we’ll do that to check how our applica- tion is performing. Let’s look at our aom-rails.production.user.find-average timer and create a check from it.

Listing 9.35: Checking the user.find timer

(where (and (service "statsd/derive-aom-rails.production.user. find-average") (> metric 0.5)) (email "[email protected]"))

We’ve used a where stream to match on the average timer. If the metric exceeds 0.5 seconds then we’d send an email.

Finally, we want to send our metrics through to Graphite. Unlike previous metrics, though, these are application-centric rather than host-centric. Our typical Graphite host metrics have been written to a path like: productiona.hosts.hostname.metric.name

Our new metrics are related to our aom-rails application, so it’s preferable to bundle them together as an application rather than scatter them amongst the hosts that might make up the application. We’ll write them to Graphite as: productiona.aom-rails.hostname.metric.name

To do this we’re going to revisit rewriting our Graphite paths. We need to make two changes to our configuration:

1. Rewrite the :service field to strip out the statsd/ prefix from our metrics using our existing service rewriting functions.
2. Rewrite the Graphite metric name to put the application name first, again using previous functions we created.

Let’s start with rewriting our :service field by editing the:

/etc/riemann/examplecom/etc/collectd.clj

File on our Riemann host and adding a new service rewrite to the default- services var.

Listing 9.36: Rewriting our statsd rules

(def default-services
  [{:service #"^load/load/(.*)$" :rewrite "load $1"}
   ...
   {:service #"^statsd\/(gauge|derive|latency)-(.*)$" :rewrite "$2"}])


This will change events from the statsd plugin from: statsd/derive-aom-rails.production.user.find-average to: aom-rails.production.user.find-average

This strips away the statsd/ prefix and removes the metric type identifier as well. Next we want to take this updated event and ensure its metric name is rewritten when we send it to Graphite. To do this we need to edit the add-environment-to-graphite function we created in Chapter 4 and updated in Chapter 7 when we added our Docker metrics. This is inside the /etc/riemann/examplecom/etc/graphite.clj file on our Riemann host.


Listing 9.37: Updated graphite.clj

(ns examplecom.etc.graphite
  (:require [clojure.string :as str]
            [riemann.config :refer :all]
            [riemann.graphite :refer :all]))

(defn graphite-path-statsd [event]
  (let [host (:host event)
        app (re-find #"^.*?\." (:service event))
        service (str/replace-first (:service event) #"^.*?\." "")
        split-host (if host (str/split host #"\.") [])
        split-service (if service (str/split service #" ") [])]
    (str app, (str/join "." (concat (reverse split-host) split-service)))))

(defn add-environment-to-graphite [event]
  (condp = (:plugin event)
    "docker" (if (:com.example.application event)
               (str "productiona.docker.", (:com.example.application event), ".", (riemann.graphite/graphite-path-percentiles event))
               (str "productiona.docker.", (riemann.graphite/graphite-path-percentiles event)))
    "statsd" (str "productiona.", (graphite-path-statsd event))
    (str "productiona.hosts.", (riemann.graphite/graphite-path-percentiles event))))

(def graph
  (async-queue! :graphite {:queue-size 1000}
    (graphite {:host "graphitea" :path add-environment-to-graphite})))

Here we’ve updated the add-environment-to-graphite function to select a metric name rewrite depending on the source of the event. We’ve used a condp function to select the event based on the contents of the :plugin field. The first condi- tion controls metric name writes for our Docker events. The second matches on the statsd plugin’s events and writes their path using the graphite-path-statsd function. Our last clause catches all other matches and uses the standard path rewrite we established for all other events with the productiona.hosts. prefix.

Let’s look at the graphite-path-statsd function.

Listing 9.38: The graphite-path-statsd function

(defn graphite-path-statsd [event]
  (let [host (:host event)
        app (re-find #"^.*?\." (:service event))
        service (str/replace-first (:service event) #"^.*?\." "")
        split-host (if host (str/split host #"\.") [])
        split-service (if service (str/split service #" ") [])]
    (str app, (str/join "." (concat (reverse split-host) split-service)))))

The function takes an event as an argument and opens with a let instruction. Our let instruction sets several variables:

• host — From the :host field of the event.
• app — This pulls the application name from the :service field.
• service — The service itself, minus the application name.
• split-host — Handles fully-qualified host and domain names.
• split-service — Splits services on any spaces so we can rewrite those with periods in the next line.


The core of the function takes the variables we’ve created and combines them. We create a string for the path starting with the application name, the host name and potentially fully-qualified domain, and then the service, joined with periods. If we go back to our rewritten event: aom-rails.production.user.find-average

Then it’ll get written into Graphite as: productiona.aom-rails.tornado-web1.production.user.find-average

This groups all of our events together in Graphite by application name, then by host name, then environment, then service. You can obviously adjust this schema to suit your needs, perhaps writing them by application and then environment—production, etc.—rather than by host.

Logging

In addition to embedding metrics we can also add support for logging and log events to our applications. Metrics are useful for tracking performance or various pieces of state, but logs are often far more expressive. With logs we provide additional context or information about a situation, or highlight that something has occurred. An example of this is a stack trace generated when an error occurs. For diagnostic purposes log entries are hugely useful. So, to supplement our existing instrumentation, there are two ways of deducing application status and performance from logs:

• Creating structured log entries at strategic points of our application, in line with what we discussed in our application monitoring primer.
• Consuming existing log data, as we discussed in the external pattern, also introduced earlier in this chapter.

We’re going to look at both methods, starting with adding our own log entries.


Adding our own structured log entries

Most logging mechanisms emit log entries that contain a string value and the message or description of the error. The classic example of this is Syslog, used by many hosts, services, and applications as a default logging format. A typical Syslog message looks like:

Listing 9.39: A typical syslog message

Dec 6 23:17:01 logstash CRON[5849]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)

In addition to the payload, in this case a report on a Cron job, it has a datestamp and a source (the host logstash). While versatile and readable, the Syslog format is not ideal—it's basically one long string. This string is awesome from a human readability perspective—it's easy to glance at a Syslog string and know what's happened.

But are we the target audience of a string-based message? Probably back in the day when we had a small volume of hosts and we were connecting to them to read the logs. Now we have a pool of hosts, services, and applications, and our log entries are centralized. That means there is now a machine that consumes the log message before we, the human audience, see it. And because of the eminently readable string format, that consumption is not easy.

That format means we're likely to be forced, as we saw in Chapter 8, to resort to regular expressions to parse it. In fact, probably more than one regular expression. Again Syslog is a good example. Implementations across platforms are sometimes subtly different, and this often means more than one regular expression needs to be implemented and then maintained. We also saw this in Chapter 8 when we added our Docker logs to our existing Syslog implementation. The additional overhead means it's much harder to extract the value—diagnostic or operational—from our log data.


There is, however, a better way of generating logs: structured logs (also known as semantic or typed logs). There's currently no standard for structured logging. There have been a few attempts to create one but nothing has yet gained traction. Still, we can describe the concept of structured logging. Instead of a string like our Syslog examples, structured logs try to preserve typed, rich data rather than convert it into a string. Let's look at an example of some code that produces an unstructured string:

NOTE There have been some attempts to formalize a structured logging format, such as Common Event Expression and Project Lumberjack. None of them gained much traction, and they are largely unmaintained.

Listing 9.40: Unstructured log message example

Logger.error("The system had a hiccup trying to create user" + username)

Let’s assume the user being created was [email protected]. This pseudo-code would generate a message like: The system had a hiccup trying to create user [email protected]. We’d then have to send that message somewhere, to Logstash for example, and then parse it into a useful form.

Alternatively, we can create a more structured message.


Listing 9.41: Structured log message example

Logger.error("user_creation_failed", user=username)

Note that in our structured message we've gotten a head start on any parsing. Assuming we send the log message in some encoded format, JSON for example or a binary format like protocol buffers, then we get an event name, user_creation_failed, and a variable, user, which contains the username of the user that we failed to create, or even a user object containing all the parameters of the user being created. Let's look at what our JSON encoded event might look like:

Listing 9.42: JSON encoded event

[ { "time": 1449454008, "priority": "error", "event": "user_creation_failed", "user": "[email protected]" } ]

Instead of a string we’ve got a JSON array containing a structured log entry: a time, a priority, an event identifier, and some rich data from that event: the user that our application failed to create. We’re logging a series of objects that are now easily consumed by a machine rather than a string we need to parse.


Adding structured logging to our sample application

Let’s see how we might extend our sample application with some structured log events. Remember, it’s a Ruby on Rails application that allows us to create and delete users and not much else. We’re going to add two structured logging libraries—the first called Lograge, and the second called Logstash-logger—to our application. The Lograge library formats Rails-style request logs into a structured format, by default JSON, but can also generate Logstash-structured events. The second library, Logstash-logger, allows us to hijack Rails’ existing logging frame- work, emit much more structured events, then send them directly to Logstash. Let’s install these now and see what some structured logging messages might look like.

We first need to add three gems, lograge, logstash-event, and logstash-logger, to our application to enable our structured logging support.

The lograge gem enables Lograge’s request log reformatting. The logstash-event gem allows Lograge to format requests into Logstash events. The logstash-logger gem allows you to output log events in Logstash’s event format and enables a variety of potential logging destinations, including Logstash. We’re going to start by adding the gems to our Rails application’s Gemfile.


Listing 9.43: Adding our logging gems to the aom-rails Gemfile

source 'https://rubygems.org'
ruby '2.2.2'
gem 'rails', '4.2.4'
...
gem 'lograge'
gem 'logstash-event'
gem 'logstash-logger'
...

We then install the new gems using the bundle command.

Listing 9.44: Install the gems with the bundle command

$ sudo bundle install
Fetching gem metadata from https://rubygems.org/...
Fetching version metadata from https://rubygems.org/...
Fetching dependency metadata from https://rubygems.org/..
...
Installing lograge
Installing logstash-event
Installing logstash-logger
...

Next we need to enable all of our new logging components inside our Rails application's configuration. We're only going to enable each component for the production environment. To do this we add our configuration to the config/environments/production.rb file.

Listing 9.45: Adding logging to the Rails production environment

Rails.application.configure do
  # Settings specified here will take precedence over those in config/application.rb.
  ...
  config.log_level = :info
  config.lograge.enabled = true
  config.lograge.formatter = Lograge::Formatters::Logstash.new
  config.logger = LogStashLogger.new(type: :tcp, host: 'logstash.example.com', port: 2020)
end

Here we’ve configured four options. The first, config.log_level, is a Rails log- ging default for the log level. Here we’re telling Rails to only log events of an :info level or higher; by default, Rails logs at a :debug level. The second option, config.lograge.enabled, turns on Lograge, taking over Rails’ default logging for requests. The third option, config.lograge.formatter, controls the format in which those log events are emitted. Here we’re using Logstash’s event format. Lo- grage has a series of other formats available, including raw JSON. The last option, config.logger, takes over Rails’ default logging with Logstash-logger. It creates a new instance of the LogStashLogger class that connects to our Logstash server, logstash.example.com, via TCP on port 2020. Let’s look at the corresponding required configuration on our Logstash server. We need to add a new tcp input to receive our application events.


Listing 9.46: Adding a new TCP input to Logstash

input {
  tcp {
    port => 5514
    type => "syslog"
  }
  tcp {
    port => 2020
    type => "apps"
    codec => "json"
  }
...

We’ve added a new tcp input running on port 2020. We set a type of apps for any events received on this input, and we use the json codec to parse any incoming events into Logstash’s message format from JSON. To enable our configuration we would need to restart Logstash.

Listing 9.47: Restarting Logstash for our Application events

$ sudo service logstash restart

So what does this do for our sample application? Enabling Lograge will convert Rails’ default request logs into something a lot more structured and a lot more useful. A traditional request log might look like:


Listing 9.48: Traditional Rails request logging

Started GET "/" for 127.0.0.1 at 2015-12-10 09:21:45 +0400 Processing by UsersController#index as HTML Rendered users/_user.html.erb (6.0ms) Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)

With Lograge enabled the log request would appear more like:

Listing 9.49: A Lograge request log event

{ "method":"GET", "path":"/users", "format":"html", "controller":"users", "action":"index", "status":200, "duration":189.35, "view":186.35, "db":0.92, "@timestamp":"2015-12-11T13:35:47.062+00:00", "@version":"1", "message":"[200] GET /users (users#index)", "severity":"INFO", "host":"tornado-web1", "type":"apps" }


We see that the log event has been converted into a Logstash event. The original base message is now in the message field and each element of the request has been parsed into a field—for example, note that the request's method is in the method field and the controller is in the controller field. Logstash-logger will send this structured event to our Logstash server where we can parse it, create metrics from it (we now have things like the HTTP status code and timings from the request), and store it in Elasticsearch where we can query it.

We can also send stand-alone log events using Logstash-logger’s override of Rails’ default logger method. Let’s specify a message that gets sent when we delete a user.

Listing 9.50: Logging deleted users

def destroy
  STATSD.time("find.user") do
    @user = User.find(params[:id])
  end
  @user.destroy
  STATSD.increment "user.deleted"
  logger.info message: 'user_deleted', user: @user
  redirect_to users_path, :notice => "User deleted."
end

Here we’ve added a logger.info call to the destroy method. We’ve passed it two arguments, message and user. The message argument will become the value of our message field in the Logstash event. The user field will also become a field containing the @user instance variable, which in turn contains the details of the user being deleted. Let’s look at an event that might be generated when we delete the user james.


Listing 9.51: A Logstash formatted event for a user deletion

{ "message":"user_deleted", "user":{ "id":6, "email":"[email protected]", "created_at":"2015-12-11T04:31:46.828Z", "updated_at":"2015-12-11T04:32:18.340Z", "name":"james", "role":"user", "invitation_token":null, "invitation_created_at":null, "invitation_sent_at":null, "invitation_accepted_at":null, "invitation_limit":null, "invited_by_id":null, "invited_by_type":null, "invitations_count":0 }, "@timestamp":"2015-12-11T13:35:50.070+00:00", "@version":"1", "severity":"INFO", "host":"tornado-web1", "type":"apps" }

We see our event is in Logstash format with our user_deleted message and the contents of the @user instance variable structured as fields of a user hash. When a user is deleted this event will be passed to Logstash and could then be processed and stored or even sent onto Riemann. There are more than enough details to help us diagnose issues and track events.

TIP You can see some more usage examples for generating log events with Logstash-logger in the Github documentation.

This is an example of how structured logging can make monitoring applications so much easier. The basic principles articulated here can be applied in a variety of languages and frameworks.

Structured logging libraries

Just to get you started, here are some structured logging libraries and integrations for a variety of languages and frameworks. You should be able to find others by searching online.

Java

The Java community has the powerful and venerable Log4j. It’s hugely config- urable and flexible.

Go

Golang has Logrus, which extends the standard logger library with structured data.


Clojure

Clojure has a couple of good structured logging implementations, one from Puppet Labs and the other clj-log.

Ruby and Rails

We’ve already seen Lograge for Ruby and Rails. Other examples include Semantic Logger and ruby-cabin.

Python

Python has Structlog, which augments the existing Logger methods with struc- tured data.

Javascript and Node.JS

Javascript (and Node) has an implementation of .Net’s Serilog called Structured Log. Another example is Bunyan.

.Net

The .Net framework has Serilog.

PHP

PHP has Monolog.

Perl

Perl has a Log4j-esque clone called Log4perl.


Working with your existing logs

Sometimes we aren’t able to rewrite our application to make use of structured logging techniques. In these cases we have to work with the existing logs our application is generating. Much like the Syslog parsing we did in Chapter 8, we can make use of Logstash’s plugins to extract meaning from our applications logs. Let’s look at some sample application logs that we might want to parse.

Listing 9.52: Custom application logs

04-Feb-2016-215959 app=brewstersmillions subsystem=payments Payment to James failed for $12.23 on 02/04/2016 Transaction ID A092356
04-Feb-2016-220114 app=brewstersmillions subsystem=payments Payment to Alice succeeded for $843.16 on 02/04/2016 Transaction ID D651290
04-Feb-2016-220116 app=brewstersmillions subsystem=collections Invoice to Frank for $1093.43 was posted on 02/04/2016 Transaction ID P735101
04-Feb-2016-220118 app=brewstersmillions subsystem=payments Payment from Bob succeeded for $188.67 on 02/04/2016 Transaction ID D651291

We see that these application logs are somewhat inconsistent in format. They contain an unusual timestamp and several different types of logging, including key-value pairs, strings, another date, and a transaction ID. We're going to assume they are being sent to our Logstash server. We'll start with a tcp input on our Logstash server to receive those events.


Listing 9.53: Adding a TCP input for our applications

input {
  tcp {
    port => 2030
    type => "brewstersmillions"
    codec => "plain"
  }
...

Here we’ve added our tcp input on port 2030 with a type of brewstersmillions to mark our application’s events. We’ve also specified a codec of plain, as our events are plain text strings. Now Logstash will receive our log events. When they are received they’ll be in the form of an event. An example:

Listing 9.54: An unprocessed custom Logstash formatted event

{ "message":"04-Feb-2016-215959 app=brewstersmillions subsystem= payments Payment to James failed for $12.23 on 02/04/2016 Transaction ID A092356", "@timestamp":"2015-12-13T09:23:51.070+00:00", "@version":"1", "host":"tornado-web1", "type":"brewstersmillions" }


This unparsed event is not useful, so we need to parse our log events using a filter. The perfect filter is the grok plugin. We can use it to match elements of the message field and make our event more usable. Let's look at how we might do this.

Listing 9.55: Adding a grok filter for our applications

filter {
  if [type] == "brewstersmillions" {
    grok {
      patterns_dir => "/etc/logstash/patterns"
      match => { "message" => "%{APP_TIMESTAMP:app_timestamp} app=%{WORD:app_name} subsystem=%{WORD:subsystem} %{WORD:transaction_type} (to|from) %{WORD:user} %{WORD:status} for \$%{NUMBER:amount} on %{DATE_US:transaction_date} Transaction ID %{WORD:transaction_id}" }
    }
  }
}

We see inside the filter block that we're using a conditional to match on any events with a type of brewstersmillions. These events are passed to the grok filter. We've used a new option, patterns_dir, in our filter. The patterns_dir option specifies the location of additional, custom patterns we can use to parse log events. Let's create this directory before we continue.

Listing 9.56: Creating the new patterns directory

$ sudo mkdir -p /etc/logstash/patterns


This creates the /etc/logstash/patterns directory. Any files inside this directory will be loaded and parsed for grok patterns when Logstash starts. They will then be available to use when parsing log events. Let’s create our first custom pattern, APP_TIMESTAMP, which will match the unusual timestamp of our application logs.

Listing 9.57: The /etc/logstash/patterns/app file

APP_TIMESTAMP %{MONTHDAY}-%{MONTH}-%{YEAR}-%{HOUR}%{MINUTE}%{SECOND}

A grok pattern has a capitalized name, here APP_TIMESTAMP, and then a regular expression. In this case our custom pattern combines several other grok patterns from the set that ships with Logstash to match our log event's timestamp, for example 04-Feb-2016-215959. We then see the APP_TIMESTAMP pattern being used in our grok regular expression, which is matching on the message field.

Listing 9.58: Using the APP_TIMESTAMP pattern

"%{APP_TIMESTAMP:app_timestamp} app=%{WORD:app_name} subsystem=%{ WORD:subsystem} %{WORD:transaction_type} (to|from) %{WORD:user} %{WORD:status} for \$%{NUMBER:amount} on %{DATE_US: transaction_date} Transaction ID %{WORD:transaction_id}"

Our APP_TIMESTAMP pattern will assign the value of the regular expression match to a new field called app_timestamp. We then use a series of other patterns to extract specific data from the message and assign it to new fields. Ultimately, when the grok filter is complete, we should see an event much like:

Listing 9.59: A processed custom Logstash formatted event

{ "message":"04-Feb-2016-215959 app=brewstersmillions subsystem= payments Payment to James failed for $12.23 on 02/04/2016 Transaction ID A092356", "@timestamp":"2015-12-13T09:23:51.070+00:00", "app_timestamp":"04-Feb-2016-215959", "app_name":"brewstersmillions", "subsystem":"payments", "transaction_type":"Payment", "user":"James", "status":"failed", "amount":"12.23", "transaction_date":"02/04/2016", "transaction_id":"A092356", "@version":"1", "host":"tornado-web1", "type":"brewstersmillions" }

Using Logstash and our grok filter, we’ve turned a custom application log message into structured data. From here we could:

• Send transaction amounts to Riemann and Graphite.
• Send failed transactions to Riemann. For example, we could create an event containing the error—perhaps tagged with the application, the route, or class—with any relevant stack trace or error output as the description.


• Graph failed and successful transactions in Kibana or Grafana.
• Count specific transaction types as metrics.
• Use the event data for audit and diagnostic purposes.

Or we could perform a wide variety of other processing actions.

Health checks, endpoints, and external monitoring

One of the popular ways to monitor applications is an endpoint that displays state and/or metrics and that can be monitored or scraped by a pull-based system of some kind. This endpoint could potentially be the application itself, for example polling for an HTTP 200 status code on a website or API endpoint. Or it could be an endpoint that returns metric data, for example a JSON-based collection of metrics.

This is commonly used, and required, for:

• Gathering metrics, but more often state, for example: "Is the site up?" This is done by monitoring the external surface of a service or application.
• Services that you can't instrument internally or that are provided without internal instrumentation.

The former type of external monitoring is frequently done using a SaaS-based service of some kind, for example:

• Pingdom,
• Idera (formerly CopperEgg),
• Ruxit, or
• New Relic.


NOTE This is a list of some of the available services, not an endorsement of any specific service.

These external services are useful because our current monitoring assumes that a check from inside our network takes the same route and is subject to the same conditions as an incoming request from an external location. In many cases traffic conditions, security controls, or network configuration mean that breakages or issues won't be identified from an internal source. Additionally, it's often hard to ascertain if performance issues are related to latency inside our network or latency the customer is subject to en route to our application.

Most of these services provide notification and reporting in addition to monitoring. This is also useful in providing a secondary channel to identify outages if our internal monitoring system fails to detect an issue or is part of a broader failure in our environment.


Figure 9.3: External monitoring

You could potentially do your own external monitoring as well. You could set up a host in another data center or Cloud provider and use it to query your applications and services. Perhaps useful here, collectd provides a ping plugin that you can use to make ICMP queries of hosts. You could also supplement that with some of the plugins we’ll talk about in the next section.
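
Such a do-it-yourself external check can be very simple. Here's a sketch using Ruby's standard library; the URL is an illustrative assumption, and in practice you'd run this from a host outside your network and feed the result into your notification or monitoring path:

require 'net/http'
require 'uri'

# Poll an external-facing endpoint and report on the HTTP status code.
uri = URI('https://app.example.com/')
response = Net::HTTP.get_response(uri)

if response.code == '200'
  puts "OK: #{uri} returned 200"
else
  puts "PROBLEM: #{uri} returned #{response.code}"
end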


Checking an internal endpoint

We don’t always need to use an external polling system to query an application’s health check endpoint. We can also make use of collectd to grab the contents of an application’s health check endpoint via a plugin locally. The collectd framework has several plugins that can be useful here:

• curl_json — Connects to a JSON-enabled HTTP or HTTPS endpoint and returns the contents in JSON.
• curl_xml — Uses libcurl and libxml2 to curl XML-based endpoints and return the contents in XML.
• curl — Curls HTTP/HTTPS endpoints using libcurl and returns the contents in HTML.

Adding an endpoint route to our sample application

Let’s see an example of this in action using our sample application. First, we add an api route to our Rails application in config/routes.rb.

Listing 9.60: Adding an API route to config/routes.rb

Rails.application.routes.draw do
  ...
  resources :api, only: :index
end

Then we create a controller for that api route in app/controllers/api_controller.rb.


Listing 9.61: The api_controller.rb controller

class ApiController < ApplicationController
  protect_from_forgery with: :null_session
  respond_to :json

  def index
    users = { "users": User.count }

    render(json: users)
  end
end

We create a basic controller with a single action, index, that returns the user count in a JSON field called users.

Listing 9.62: The /api endpoint

{"users":1}

We could easily add additional data to the api controller that we might want to expose for consumption by collectd.
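
For example, a sketch of an extended index action might look like the following; the database key and the ActiveRecord::Base.connected? check are illustrative additions, not part of the original application:

def index
  # 1 if the application currently holds a database connection, 0 otherwise.
  database_up = ActiveRecord::Base.connected? ? 1 : 0

  payload = {
    "users":    User.count,
    "database": database_up
  }

  render(json: payload)
end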

Configuring collectd to scrape our endpoint

Now let’s configure the local collectd instance to scrape the endpoint and return the data to collectd and hence onto Riemann. To do this we’re going to use the curl_json plugin.


To configure it we add a configuration file, curl_json.conf, to our /etc/collectd.d directory. Let's populate that file.

Listing 9.63: The curl_json.conf file

LoadPlugin curl_json

<Plugin curl_json>
  <URL "http://localhost/api">
    Instance "aom-rails"
    <Key "*">
      Type "count"
    </Key>
  </URL>
</Plugin>

We first load the curl_json plugin with the LoadPlugin directive. We then configure it in the Plugin block. Next we add URL blocks for each endpoint we wish to scrape. In this case we're assuming our sample application is bound on the localhost interface, and we've specified the path to the /api controller.

By default, the plugin will check the endpoint using the value of the global Interval setting, in our case every two seconds. But we could override that locally here by specifying an Interval directive inside our URL blocks.

Inside our URL block we've specified the Instance directive. The directive sets the value of the :plugin_instance field in our event. We'd set it to the name of our application, here aom-rails. Next, we specify a Key block. The Key block controls what piece of JSON we're going to scrape from the endpoint. It matches the key of a JSON hash or the index of a JSON array. We can specify as many Key blocks as we need. A wildcard, *, can also be used to grab all of the keys, as we've done here. Inside a Key block we can also specify the Type directive, which controls what type of data is in our key. These can be custom types or drawn from the types.db file (see /usr/share/collectd/types.db on most distributions). We've specified count, which tells collectd that our event will be a counter. We can also specify another Instance directive. This would control the value of the :type_instance field in our event. If we don't specify this directive, the field will default to the value of the key or index being collected.

We'll need to restart collectd for our new check to be enabled.

Listing 9.64: Restarting collectd for curl_json

$ sudo service collectd restart

Now let’s look at what an event from the curl_json plugin might look like:

Listing 9.65: A curl_json event from our sample application

{:host tornado-web1, :service curl_json-aom-rails/count-users, :state ok, :description nil, :metric 1.0, :tags [collectd], :time 1450555094, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance users, :type count, :plugin_instance aom-rails, :plugin curl_json}

We see the tornado-web1 host, and the :service field is set to a value of curl_json-aom-rails/count-users. This is constructed from the name of the plugin, curl_json, the value of the Instance directive in the curl_json configuration, and the combination of the Type directive and the Instance directive inside the Key block. Since we haven't specified an Instance directive inside the Key block, we get the JSON key, here users. The :metric field contains the value of the users hash, in this case 1 user. The :type_instance field contains the key of the JSON, users, and the :plugin_instance field contains the aom-rails value from the Instance directive in the URL block. This metric will now be automatically graphed in Graphite, and we could also use it for checks inside Riemann itself.
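For example, a minimal check sketch, not from our configuration, might look like the following. The threshold and the recipient address are purely illustrative, and it assumes the email var we defined in Chapter 3.

;; A sketch only: alert if the user count drops below an illustrative threshold.
(where (and (service "curl_json-aom-rails/count-users")
            (< metric 1))
  (email "ops@example.com"))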

Deployments

Lastly, one of the key things we want to know about applications is when they change—or more specifically, when they are deployed. If we're tracking metrics and events from applications then it's important to understand when something happens that could change those metrics. This means we want to be informed when updates are made or changes are deployed to those applications. Traditionally, application deployment has been done by a variety of tools. In the Ruby world, for example, that's via Rake tasks or tools like Capistrano. Other people use configuration management tools like Puppet, Chef, or Ansible. Still others use scripts or tools written in languages ranging from shell to C. To track our deployments we need to notify Riemann that a deployment is taking place. We'd generally do that at the end of a deployment by creating a metric to track it. Using a variant of the schema we introduced earlier in the chapter, let's see what a metric might look like for a deployment.

Listing 9.66: An example code deployment metric

productiona.tornado-web1.aom-rails.production.code_deploy

Here we have a metric called code_deploy for the production release of the aom-rails application on the tornado-web1 host in the productiona environment. We would send a metric, with a 1 or even the SHA of a release as a value, and the time of a successful deployment.

Adding deployment notifications to our sample application

Let’s see an example of how we might do this with our sample application. We’re going to create a deployment Rake task for our application and add support for sending a notification of that deployment to Riemann. We’re then going tosee how to add that metric to a graph in Grafana. To send our deployment events we’re going to use Riemann’s native Ruby client. We’ll first add this gem to our Rails’ application’s Gemfile.

TIP In addition to the Ruby client, there are a whole series of other clients you can use to replicate this model in other frameworks and languages.

Listing 9.67: Adding the Riemann client to the aom-rails Gemfile

source 'https://rubygems.org'
ruby '2.2.2'
gem 'rails', '4.2.4'
...
gem 'riemann-client'
...

We then install the new gem using the bundle command.


Listing 9.68: Install the gem with the bundle command

$ sudo bundle install
Fetching gem metadata from https://rubygems.org/......
Fetching version metadata from https://rubygems.org/...
Fetching dependency metadata from https://rubygems.org/..
...
Installing riemann-client 0.2.6
...

We next add a Rake task to run our actual deployment. Let’s create one in the lib/tasks directory. Any tasks located in here are automatically loaded when we run the rake command. We’ll call our task file deploy.rake.


Listing 9.69: The deploy Rake task

require 'riemann/client'

namespace :deploy do
  desc "Deploy the aom-rails application"
  task :release do
    # Your deployment code would go here
    Rake::Task["deploy:notify"].invoke
  end

  desc "Notify Riemann of a deployment"
  task :notify do
    c = Riemann::Client.new host: 'riemanna.example.com', port: 5555, timeout: 5

    c << {
      service: "aom-rails.#{Rails.env}.code_deploy",
      metric: 1,
      description: "Application aom-rails deployed",
      tags: ["deployment"],
      time: Time.now.to_i
    }
  end
end

We first require riemann/client from the riemann-client gem. We then create a namespace called deploy and create two tasks, release and notify, in that namespace. Since we're not actually deploying, we're going to skip the deployment code. Instead our task only does one thing: it invokes our second Rake task, deploy:notify, which performs the actual notification. We create a Riemann client, c, that connects to riemanna.example.com on port 5555. This is the default network port Riemann receives events on that we configured in Chapter 3. The Riemann client defaults to TCP connectivity, and we've also set a timeout of five seconds. We then use that c Riemann client to send an event. That event has a :service field with a value of aom-rails.#{Rails.env}.code_deploy. The Rails.env method will be replaced with the current Rails environment: production, development, etc. The final :service field will look something like aom-rails.production.code_deploy. We've specified a :metric field value of 1 and then set the :description field. We've also added a tag to our event; every deployment event will be tagged with deployment. We've specified the :time field using the Ruby method Time.now.to_i, which returns the current time. This will send our event to Riemann where we can see it. A typical event would look like:

Listing 9.70: A sample deployment event

{:host tornado-web1, :service aom-rails.production.code_deploy, :state ok, :description Application aom-rails deployed, :metric 1, :tags [deployment], :time 1450585652, :ttl 60}

We can now make use of this event in Riemann and Graphite.

Working with our deployment events

Now that our deployment events are hitting Riemann we need to make use of them. The first thing we need to do is ensure they are graphed. We're going to add a tagged stream to select all deployment-tagged events and send them to the graph var.


Listing 9.71: Graphing our deployment events

(tagged "deployment" graph )

If we reload Riemann our deployment events will be sent to Graphite.

Listing 9.72: Reloading Riemann to graph deployment events

$ sudo service riemann reload

Now let’s create a new dashboard for our aom-rails application. To do this we sign in to Grafana, select the Home button, and click + New to create a new dash- board.


Figure 9.4: Creating the aom-rails dashboard

Once the new dashboard has been created we need to name it. To do this we click the Manage dashboard button then select Settings. Inside the Settings tab we need to update the Title field in the Dashboard Info box. We’re going to call our dashboard aom-rails.


Figure 9.5: Naming the aom-rails dashboard

Then we click the Save dashboard button.

Next, we're going to create a Grafana Annotation. Annotations are Grafana's approach to decorating graphs with specific events. We're going to create this Annotation from our aom-rails.production.code_deploy metric. We sent that metric to Riemann and then on to Graphite. When we sent it to Graphite it was prefixed with the environment and type of metric, productiona.hosts, and the host that generated it. In Graphite, our final metric from our sample application will be called:

productiona.hosts.tornado-web1.aom-rails.production.code_deploy

To create the Annotation, we again select the Manage dashboard button and then the Annotations menu item.


Figure 9.6: Creating an Annotation

To create a new Annotation, we click the + New button. We first need to give the Annotation a name. We're going to call ours aom-rails deployed. We then need to select a data source for our Annotation. In our case this is the default graphite. We can also configure the color and style of the annotation here. We then define the source of the Annotation. We're going to use this metric:

productiona.hosts.tornado-web1.aom-rails.production.code_deploy

(We could also specify wildcards, for example: productiona.*.*.aom-rails.production.code_deploy if we wanted to select all metrics that match these paths.)


Figure 9.7: Configuring our Annotation

We then click the Update button to save our Annotation. We'll close the Annotations box by clicking on the X symbol.

Now there will be an Annotation at the top of our dashboard.

Figure 9.8: The Annotation in the dashboard

If we click on the tick mark next to a specific Annotation and un-tick it, that Annotation will no longer show up on any graphs.

Now, if we create new graphs in this dashboard, we’ll get an Annotation on the graphs any time we deploy our aom-rails application.

TIP In the Graphite world we'd use the drawAsInfinite function to create these kinds of Annotations.

Let’s create a graph using the metric we created earlier by scraping our /api end- point: productiona.hosts.tornado-web1.curl_json-app.count-user_count

Let’s also run our Rake task and deploy our application. I’ve created that graph below and you can see it here:

Figure 9.9: Annotated user count graph

The Annotations are marked on the graph in the form of the three red lines, each ending in white arrows. They mark the points in time that our Rake deployment task was run. If our metric were to change as the result of one of those deploys, we could now identify the connection between the time of the deployment and the change in the metric.

Tracing

Before we move on from this chapter it would be remiss not to mention another diagnostic tool available for application monitoring: tracing, and more specifically, distributed tracing. This is a technique most useful for identifying performance problems and latency in distributed systems or microservices architectures. We're not going to cover it in the book but you should be aware of it if you go down the microservices path.

A good example of a distributed tracing tool is Twitter’s Zipkin project. Zipkin is in turn based on Google’s Dapper paper on tracing.

Summary

In this chapter we explored several ways to monitor and instrument our applications and their workflow, including understanding where to place our application monitoring.

We learned about building our own metrics into our applications and services, and we used StatsD as an example of how to transmit useful metrics. We also saw the usefulness of structured logging, how to generate our own logs, and how to parse existing logs using Logstash. We discovered how to extend our internal monitoring to external locations and how to generate and work with health checks and endpoints.

Finally, we discovered how to instrument the lifecycle of our applications and decorate our graphs with Annotations, letting us know when our applications are deployed and changed. In the next chapter we conclude our framework by looking at notifications.

Chapter 10

Notifications

I think we ought to take the men out of the loop. — “War Games,” 1983

In the Introduction and Chapter 2 we discussed some of the challenges of notifi- cations. The primary purpose of this chapter is to make notifications useful: to send them at the right time and put useful information in them.

Sending too many notifications is the monitoring equivalent of “the boy who cried wolf”. Recipients will become numb to notifications and tune them out. Crucial notifications are often buried in the flood of unimportant updates.

And even if a notification surfaces, it is not always useful. In Chapter 2 we saw a stock Nagios notification.


Listing 10.1: Sample Nagios notification redux

PROBLEM Host: server.example.com Service: Disk Space

State is now: WARNING for 0d 0h 2m 4s (was: WARNING) after 3/3 checks

Notification sent at: Thu Aug 7th 03:36:42 UTC 2015 (notification number 1)

Additional info: DISK WARNING - free space: /data 678912 MB (9% inode=99%)

This notification appears informative but it isn't really. Is this a sudden increase? Or has this grown gradually? What's the rate of expansion? For example, 9% disk space free on a 1GB partition is different from 9% disk free on a 1TB disk: the former is roughly 92MB of headroom, the latter roughly 92GB. Can we ignore or mute this notification or do we need to act now?

With this example in mind, we're going to level up our notifications. We're going to focus on three key objectives:

1. We’ll add appropriate context to notifications to make them immediately useful. 2. We’ll handle maintenance and downtime. 3. We’ll also do something useful with non-critical notifications that’ll help us identify patterns and trends.


Our current notifications

Up until now our notifications have largely consisted of sending emails and managing multiple notifications via throttling or roll ups inside Riemann itself. This isn't a flexible approach. Additionally, our current email notifications are the default notifications we configured in Chapter 3. Let's convert our existing notifications to be more of a framework with more context attached to our notifications and with the option to target a variety of destinations.

We're going to get some of our additional context by making use of Riemann's index. Remember in Chapter 3 we learned that events that are sent to the index are stored and indexed according to a map of their host and service. These events stay live inside the index until their TTL expires and then they spawn an expired event. Using a Riemann function we query the index for events currently live inside it. We're going to find any events we think might be related to an event which has generated a notification and then output those related events as part of the notification.

Updating expired event configuration

Since we’re also going to be using expired events, let’s revisit our expiration con- figuration in /etc/riemann/riemann.config.

Listing 10.2: Our Riemann expiration configuration

(periodically-expire 5 {:keep-keys [:host :service :tags]})

We see that the event reaper runs across the index every five seconds, and that we copy the :host, :service, and :tags keys to our expired events. There are other fields that might be useful to us for our notifications, so let's copy those into the new expired event too.

Listing 10.3: Our Riemann expiration configuration

(periodically-expire 5 {:keep-keys [:host :service :tags :description :metric]})

Here we’ve added the :description and :metric fields to our expired event fields.

Upgrading our email notifications

We want to next look at our existing email.clj configuration in the /etc/riemann /examplecom/etc/ directory.

Listing 10.4: The original email.clj

(def email (mailer {:from "riemann@example.com"}))

The email function uses the default formatting we saw in Chapter 3. We can see that formatting in the email function’s source code.

This will produce email notifications like so:


Figure 10.1: A Riemann email notification

Let’s expand the email function by adding some new formatting to our email notifications. To provide this additional formatting and context we’ll usethe : subject and :body options of the mailer function. These allow us to specify a function that will format the incoming events any way we like. Let’s add some formatting now.

NOTE You can find the code for our new email notifications on GitHub. You'll note in that code that the order of our functions inside the file is basically the reverse of how we've explored them here. In Clojure, the file is parsed from the top down. So we need to define a function before we use it. This can be worked around using the declare function.
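As a small sketch of that workaround (this is not how the GitHub code is laid out), declare lets us reference the formatting functions before they're defined:

;; Forward-declare the formatting functions so they can be defined later in the file.
(declare format-subject format-body)

(def email (mailer {:from "riemann@example.com"
                    :subject (fn [events] (format-subject events))
                    :body (fn [events] (format-body events))}))

;; format-subject and format-body can now be defined further down the file.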


Listing 10.5: The :subject and :body options

(ns examplecom.etc.email
  (:require [clojure.string :as str]
            [riemann.email :refer :all]))

(def email (mailer {:from "riemann@example.com"
                    :subject (fn [events] (format-subject events))
                    :body (fn [events] (format-body events))}))

We’ve added a requirement, clojure.string, which we’ve required as str. The clojure.string library has some useful functions that will allow us to manipulate and format strings to make our output look better.

We’ve also added two options to the mailer function.

Listing 10.6: The new mailer options

...
:subject (fn [events] (format-subject events))
:body (fn [events] (format-body events))
...

These options take a function which accepts incoming events and then formats them for the subject line and body of our emails respectively. In our case we've created two new functions, format-subject and format-body, to do the formatting for us. Let's look at each new function now, starting with format-subject.


Formatting the email subject

Listing 10.7: The format-subject function

(defn format-subject
  "Format the email subject"
  [events]
  (apply format "Service %s is in state %s on host %s"
         (str/join ", " (map :service events))
         (str/join ", " (map :state events))
         (map :host events)))

Our format-subject function has an argument of events for the incoming events. It then uses the apply and format functions to create a subject line. We have to use the apply function and map on the various fields because the mailer function assumes the events argument is a sequence of events rather than a single event.

Listing 10.8: The new subject line

(apply format "Service %s is in state %s on host %s" (str/join ", " (map :service events)) (str/join ", " (map :state events)) ( map :host events))

This will result in a subject line something like:

Listing 10.9: The new subject line applied

Service collectd is in state critical on host tornado-web1


This subject line is service-centric. If we want to craft subject lines for other circumstances then we could update the format-subject function to accommodate this. We could select formatting for events with specific tags or the contents of specific fields.
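As a sketch only, and not part of our configuration, a variant that branches on deployment-tagged events might look like this, assuming the same namespace and str alias as above:

(defn format-subject
  "Format the email subject, with a special case for deployment-tagged events"
  [events]
  (if (some (fn [event] (some #{"deployment"} (:tags event))) events)
    (apply format "Deployment of %s on host %s"
           (str/join ", " (map :service events))
           (map :host events))
    (apply format "Service %s is in state %s on host %s"
           (str/join ", " (map :service events))
           (str/join ", " (map :state events))
           (map :host events))))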

TIP If you’re interested in learning more about working with Clojure strings then this guide from the Clojure documentation will help.

Formatting the email body

The next function, format-body, is a bit more complex and has some supporting variables and functions. It’s also in this function that we’re going to add our extra context by searching the Riemann index for related or useful events and outputting them with our notification.


Listing 10.10: The format-body function

(defn format-body
  "Format the email body"
  [events]
  (str/join "\n\n\n"
    (map (fn [event]
           (str
             header
             "Time:\t\t" (riemann.common/time-at (:time event)) "\n"
             "Host:\t\t" (:host event) "\n"
             "Service:\t\t" (:service event) "\n"
             "State:\t\t" (:state event) "\n"
             "Metric:\t\t" (if (ratio? (:metric event))
                             (double (:metric event))
                             (:metric event)) "\n"
             "Tags:\t\t[" (str/join ", " (:tags event)) "]\n"
             "\n"
             "Description:\t\t" (:description event) "\n\n"
             (if-not (riemann.streams/expired? event)
               (context event))
             footer))
         events)))

TIP The \t and \n symbols represent tabs and new lines respectively.


There is a lot going on here so let's break it down. The format-body function takes a sequence of events as an argument and then outputs some string content for each event. It joins the output for each event with three new lines. We map over the events argument, passing each single event to a function that constructs our output.

Our function first uses the header variable, which provides some boilerplate header text.

Listing 10.11: The header variable

(def header "Monitoring notification from Riemann!\n\n")

We could adjust this to say whatever we feel is an appropriate opener for our notification email. We’ll then pull our event to pieces to construct some notification content.


Listing 10.12: The format-body function

"Time:\t\t" (riemann.common/time-at (:time event)) "\n" "Host:\t\t" (:host event) "\n" "Service:\t\t" (:service event) "\n" "State:\t\t" (:state event) "\n" "Metric:\t\t" (if (ratio? (:metric event)) (double (:metric event)) (:metric event)) "\n" "Tags:\t\t[" (str/join ", " (:tags event)) "] \n" "\n" "Description:\t\t" (:description event) "\n\n" (context event) footer))

We first use a function from the riemann.common namespace, time-at. The time-at function converts a Unix epoch timestamp into a more human-readable date. We're using it to convert our event's epoch time into something a bit easier to understand. For example, converting 1458385820 in the :time field of an event into: Sat Mar 19 11:10:20 UTC 2016.

Next we return the host, the name of the service, and the state of the service (taken from the :host, :service, and :state fields respectively). We also return the metric from the :metric field. We check if the :metric is a ratio. If it is, we coerce it into a double using the double function.
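As a quick illustration at a REPL (not part of our configuration), a ratio would otherwise render as a fraction in the email:

;; Illustrative only
(ratio? 1/3)
;; => true
(double 1/3)
;; => 0.3333333333333333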

We then join any tags on the event into a list and return the :description of the event.

Next comes the most complex part of our formatting: adding useful context to the notification. We have a context function that looks up contextual events in the Riemann index.

Let’s take a look at the context function.

Listing 10.13: The context function

(defn context
  "Add some contextual event data"
  [event]
  (str "Host context:\n"
       "  CPU Utilization:\t" (round (+ (:metric (lookup (:host event) "cpu/percent-system"))
                                        (:metric (lookup (:host event) "cpu/percent-user")))) "%\n"
       "  Memory Used:\t" (round (:metric (lookup (:host event) "memory/percent-used"))) "%\n"
       "  Disk(root) %:\t\t" (round (:metric (lookup (:host event) "df-root/percent_bytes-used"))) "% used "
       " (" (round (byte-to-gb (:metric (lookup (:host event) "df-root/df_complex-used")))) " GB used of "
       (round (+ (byte-to-gb (:metric (lookup (:host event) "df-root/df_complex-used")))
                 (byte-to-gb (:metric (lookup (:host event) "df-root/df_complex-free")))
                 (byte-to-gb (:metric (lookup (:host event) "df-root/df_complex-reserved"))))) "GB)\n\n"))

The context function takes an argument of the event we're notifying on. It then creates a string consisting of several pieces of contextual information about the host that generated the event, such as the percentage of system and user CPU usage. It gets this data by looking up events in the Riemann index using our lookup function. Let's explore that now.

Listing 10.14: The lookup function

(defn lookup
  "Lookup events in the index"
  [host service]
  (riemann.index/lookup (:index @riemann.config/core) host service))

The lookup function takes two arguments: a host, which is derived from the event's :host field, and a service, which is the name of the service that contains the metric we want—for example, memory/percent-used for the memory used on the host. It uses the lookup function from riemann.index to search the index for an event matching the host/service pair we specify. It then returns this event to the context function.

TIP In addition to the riemann.index/lookup function, there’s also a riemann.index/search function that allows you to query multiple events. You can find an example here.

Inside the context function, we extract the :metric value from the new event we pulled from the index and put it into the round function. The round function formats our metric to two decimal places.


Listing 10.15: The round function

(defn round
  "Round numbers to 2 decimal places"
  [metric]
  (clojure.pprint/cl-format nil "~,2f" metric))

We’ve also used another function: bytes-to-gb. In our context we show both the percentage used and the GBs used and free on the root mount. The bytes-to-gb function converts the bytes reported by collectd’s df plugin metrics into gigabytes.

Listing 10.16: The byte-to-gb function

(defn byte-to-gb
  [bytes]
  (/ bytes (* 1024.0 1024.0 1024.0)))

Once the context function has looked up all the events, it returns the string to the format-body function to be included in our email message body.

The format-body function finishes by returning the contents of the footer var, another piece of boilerplate text.

Listing 10.17: The footer variable

(def footer "This is an automated Riemann notification. Please do not reply.")

Put together, this results in an email body something like:


Listing 10.18: The new formatted email notification

Monitoring notification from Riemann!

Time:         Mon Dec 28 16:32:37 UTC 2015
Host:         graphitea
Service:      rsyslogd
State:        critical
Metric:       0.0
Tags:         [notification, collectd]

Description:  Host graphitea, plugin processes (instance rsyslogd) type ps_count: Data source "processes" is currently 0.000000. That is below the failure threshold of 1.000000.

Host context:
  CPU Utilization: 16.84%
  Memory Used:     8.46%
  Disk(root):      32.02% used. (18.85GB used of 58.93GB)

This is an automated Riemann notification. Please do not reply.

TIP We could also create an HTML-based email, as you can see in this blog post, using template tools like Selmer. We could also use many of these concepts for non-email notifications.


Adding graphs to notifications

Just quoting a few metrics in the email will be insufficient context in a lot of cases. How about we directly link to the source data for a specific host in our notification? We can include a link to some of the graphs specifically relevant to a host using Grafana's dashboard scripting capabilities. A dashboard script is a JavaScript application that loads specific dashboards and graphs based on query arguments sent to Grafana. We can then browse via a specific URL. For example, there is a sample scripted dashboard in the scripted.js file, installed when we installed Grafana. View it by browsing to:

http://graphitea.example.com:3000/dashboard/script/scripted.js

This scripted dashboard generates graphs that are customizable via query arguments passed to the URL. There are several examples available to review in the /usr/share/grafana/public/dashboards directory on the host on which we installed Grafana. Let's create a scripted dashboard of our own and then insert a link to it in our Riemann notifications. First, though, we need to create our own JavaScript application in the /usr/share/grafana/public/dashboards directory. We'll do that now.

Listing 10.19: Creating a new scripted dashboard

$ sudo touch /usr/share/grafana/public/dashboards/riemann.js

Our objective with this dashboard is to show a small number of critical status graphs that will give us an idea of the host’s health. We’re going to show graphs including:

1. CPU.
2. Memory.
3. Load.
4. Swap.
5. Disk usage.

We could easily add other graphs or create dashboards for specific applications or groups of hosts. Scripted dashboards are powerful, and there are all sorts of ways to build them. We've built a sample dashboard application that shows one approach, but you can also refer to the example dashboards shipped with Grafana. We've included additional examples later in this chapter. Our dashboard, contained in the riemann.js file, is a little large to show in its entirety here, but let's step through the functionality. Broadly, there are four major sections of a Grafana scripted dashboard.

NOTE You can find the full application in the book’s source code on GitHub.

1. Defining a connection to our data source.
2. Defining our query parameters.
3. Defining our graph panels and rows.
4. Rendering the dashboard.

We’ll look at extracts from each component in the context of the larger application.

Defining our data source

Let’s start by looking at defining our data source. Our riemann.js dashboard will use the Graphite-API we installed in Chapter 4 to query our Graphite metrics di- rectly. We’re going to define a variable to hold our Graphite connection, originally named graphite.


Listing 10.20: Querying the Graphite-API

var graphite = 'http://graphitea.example.com:8888';

Here we’re connecting to our graphitea.example.com host on port 8888. This is where the Graphite-API is running. You can update this or template the file with a configuration management tool to insert the appropriate host name foryour environment.

We’ll need to make a small change to our Graphite-API configuration file to al- low connections directly to it. The Graphite-API relies on Cross-Origin Resource Sharing, or CORS, to ensure only the correct hosts connect to it. We can define approved hosts in the Graphite-API configuration. To do this we update the /etc /graphite-api.yaml configuration file on our Graphite hosts.

Listing 10.21: Adding CORS controls to Graphite-API

...
allowed_origins:
  - graphitea.example.com:3000

Note that we’ve added a new configuration directive, allowed_origins, to our con- figuration file. This directive allows us to list the hosts that are allowed toconnect to the Graphite-API. We’ve specified our Grafana server, graphitea.example.com, running on port 3000. We’ll need to restart Graphite-API to enable this.


Listing 10.22: Restarting Graphite-API for CORS

$ sudo service graphite-api restart

NOTE We don’t need to do this for Grafana’s own data source configuration because we proxied the connection in Chapter 4.

Defining our query parameters

Next we define query parameters. These are URL parameters that we'll pass so we're able to query specific hosts, time frames, or other attributes. Grafana will pass this query to Graphite to retrieve our metric data and render our graphs. We'll define defaults for each parameter, too, so that our graphs will load, even without query parameters.


Listing 10.23: Our riemann.js query parameters

var arg_host = 'graphitea';
var arg_span = 4;
var arg_from = '6h';
var arg_env = 'productiona';
var arg_stack = 'hosts';

if (!_.isUndefined(ARGS.span)) {
  arg_span = ARGS.span;     // graph width
}
if (!_.isUndefined(ARGS.from)) {
  arg_from = ARGS.from;     // show data from 'x' hours until now
}
if (!_.isUndefined(ARGS.host)) {
  arg_host = ARGS.host;     // host name
}
if (!_.isUndefined(ARGS.env)) {
  arg_env = ARGS.env;       // environment
}
if (!_.isUndefined(ARGS.stack)) {
  arg_stack = ARGS.stack;   // stack
}

We’re specified four query parameters:

1. span — The span of our graphs, defaults to 4.
2. from — The age of the data to show, defaults to six hours.
3. host — The specific host to query, defaults to graphitea.
4. env — The environment to query, defaults to productiona.
5. stack — The stack, a normal host or a Docker container. Defaults to normal hosts.

The span and from are fairly self-explanatory. The host, env, and stack parameters help us construct the queries we're going to run to retrieve our metric data from Graphite. They'll also be used to decorate and configure our dashboard. Let's look at an example. To specify a host with a query parameter we'd browse to:

http://graphitea.example.com:3000/dashboard/script/riemann.js?host=riemanna

Grafana will pass the specific host, here riemanna, to riemann.js, which in turn will construct a query to return all the relevant metrics. We could specify a Docker container in a similar way by adding the stack parameter to our URL parameters.

http://.../riemann.js?host=dockercontainer&stack=docker

This would specify a host of dockercontainer and update our query to search the docker metric namespace. To run these queries, the riemann.js dashboard constructs a query for Graphite using our parameters and some defaults we’ve defined for our arguments. It uses two variables, prefix and arg_filter, for this.

Listing 10.24: Our query construction variables

var prefix = arg_env + '.' + arg_stack + '.';
var arg_filter = prefix + arg_host;

The prefix variable contains the first part of our metric name. Remember, our Graphite metrics are constructed like so: environment.stack.host.metric.value


The prefix variable constructs the environment.stack portion of that query using the arg_env and arg_stack arguments populated from our query parameters or defaults. We then combine this with the hostname, set via the arg_host variable. This forms the metric name that we wish to query, which we look up using the find_filter_values function:

Listing 10.25: The find_filter_values query function

function find_filter_values(query){
  var search_url = graphite + '/metrics/find/?query=' + query;
  var res = [];
  var req = new XMLHttpRequest();
  req.open('GET', search_url, false);
  req.send(null);
  var obj = JSON.parse(req.responseText);
  var key;
  for (key in obj){
    if (obj.hasOwnProperty(key)) {
      if (obj[key].hasOwnProperty("text")) {
        res.push(obj[key].text);
      }
    }
  }
  return res;
}

This query function uses the connection information we defined in the graphite variable plus a Graphite-API call to the /metrics/find/ API endpoint. The query variable would contain the metric name to search for, for example, productiona.hosts.graphitea.


Listing 10.26: The Graphite-API metrics find query

http://graphitea.example.com:8888/metrics/find/?query=productiona.hosts.graphitea

This would return all metrics in this path—every metric underneath: productiona.hosts.graphitea

We can then use these metrics to populate specific graphs.

Defining our graph panels and rows

We then define a series of functions to create rows of graphs and individual graph panels inside those rows. These are functions that feed the configuration for a row or a graph to Grafana to be rendered. Let’s look at the function for our memory graph.


Listing 10.27: The panel_memory function

function panel_memory(title, prefix){
  return {
    title: title,
    type: 'graphite',
    span: arg_span,
    y_formats: ["none"],
    grid: {max: null, min: 0},
    lines: true,
    fill: 2,
    linewidth: 1,
    stack: true,
    tooltip: {
      value_type: 'individual',
      shared: true
    },
    nullPointMode: "null",
    targets: [
      {
        "target": "aliasByNode(" + prefix + "[[host]].memory.used,4)"
      }
    ],
    aliasColors: {
      "used": "#ff6666",
    }
  };
}

Here we’ve defined a new graph with a series of attributes. The attributes cor-

Version: v1.0.3 (2c4a7d0) 536 Chapter 10: Notifications respond with the configuration options for a graph panel. The key attribute is targets, which are the specific metrics to display, here:

Listing 10.28: The targets attribute

targets: [
  {
    "target": "aliasByNode(" + prefix + "[[host]].memory.used,4)"
  }
],

We construct the metric to display using a combination of Graphite functions, our query, and Grafana's template capability. If we were to introspect these variables we'd get:

aliasByNode(productiona.hosts.graphitea.memory.used,4)

We first see the Graphite aliasByNode function, which we introduced in Chapter 6; it creates an alias from some element of the metric name. We combine that with the prefix variable we constructed earlier. We then see a Grafana template variable, [[host]]. This is populated inside our dashboard application from the arg_host option. When the graph is rendered, [[host]] will be replaced with the specific host we're querying. Lastly, we see the tail end of our metric name, memory.used, which returns the percentage memory used on the host. We're aliasing by the fourth element of the metric name, counting from zero: used. We can then add our graph panel to a row. Remember, Grafana dashboards are made up of rows that can contain a series of different components. We define a new function for each row.

We first see the Graphite aliasByNode function, which we introduced in Chapter 6; it creates an alias from some element on the metric name. We combine that with the prefix variable we constructed earlier. We then see a Grafana template variable, [[host]]. This is populated inside our dashboard application from the arg_host option. When the graph is rendered, [[host]] will be replaced with the specific host we’re querying. Lastly, we see the tail end of our metric name, memory .used, which returns the percentage memory used on the host. We’re aliasing by the fourth element from a zero count in the metric name used. We can then add our graph panel to a row. Remember, Grafana dashboards are made up of rows that can contain a series of different components. We define a new function for each row.


Listing 10.29: The row_cpu_memory function

function row_cpu_memory(title, prefix){
  return {
    title: title,
    height: '250px',
    collapse: false,
    panels: [
      panel_cpu('CPU %', prefix),
      panel_memory('Memory', prefix),
      panel_loadavg('Load avg', prefix)
    ]
  };
}

In this row we’ve defined three graph panels: panel_memory, which we’ve just seen, as well as panel_cpu and panel_loadavg, which you can look at in the book’s source code; the latter two define CPU and load graph panels respectively.

Rendering the dashboard

Lastly, our application contains a callback function that defines some attributes, like our [[host]] template variable, the dashboard title, and other items, and then renders the dashboard. If we run the default scripted dashboard, we’ll see something like this:


Figure 10.2: Our scripted dashboard

We see two rows of graphs containing the five types of data we defined in our graph panels. Now how do we add this to our notification?

NOTE Our scripted dashboard was modified from an example created by Bimlendu Mishra.

Adding our dashboard to the Riemann notification

We’ve already got our context function in our email.clj file. Let’s extend this slightly to add a link to our dashboard.


Listing 10.30: Adding a dashboard to our context function

(defn context
  "Add some contextual event data"
  [event]
  (str "Host context:\n"
       "  CPU Utilization:\t" (round (+ (:metric (lookup (:host event) "cpu/percent-system"))
                                        (:metric (lookup (:host event) "cpu/percent-user")))) "%\n"
       "  Memory Used:\t" (round (:metric (lookup (:host event) "memory/percent-used"))) "%\n"
       "  Disk(root):\t\t" (round (:metric (lookup (:host event) "df-root/percent_bytes-used"))) "% used.\n\n"
       "Grafana Dashboard:\n\n"
       "http://graphitea.example.com:3000/dashboard/script/riemann.js?host=" (:host event) "\n"))

Note that we’ve added the following lines to our context function.

Listing 10.31: A dashboard in context

"Grafana Dashboard:\n\n" " http://graphitea.example.com:3000/dashboard/script/riemann.js? host="(:host event)"\n\n"

This will add a URL for our scripted dashboard and insert the relevant hostname from the :host event field. When we get our notification now it’ll look like:


Listing 10.32: Our new graph-enhanced notification

Monitoring notification from Riemann!

Time:         Wed Dec 30 17:22:56 UTC 2015
Host:         graphitea
Service:      rsyslogd
State:        critical
Metric:       0.0
Tags:         [notification, collectd]

Description:  Host graphitea, plugin processes (instance rsyslogd) type ps_count: Data source "processes" is currently 0.000000. That is below the failure threshold of 1.000000.

Host context:
  CPU Utilization: 16.27%
  Memory Used:     8.18%
  Disk(root):      32.13% used.

Grafana Dashboard:

http://graphitea.example.com:3000/dashboard/script/riemann.js?host=graphitea

This is an automated Riemann notification. Please do not reply.

When we get our notification we can then click on the embedded link in the email and open up our dashboard to get some further context on the host with the issue.


Some sample scripted dashboards

To help you build different dashboards we’ve collected a few examples of scripted Grafana dashboards.

• A sample dashboard for Graphite and InfluxDB.
• Two other dashboard examples.
• Some utility methods for building scripted dashboards.
• A template for building a scripted dashboard.

Other context

This is not the limit of the possible context we could add to a notification. Some further ideas to consider:

• Longer-term metrics, using the Graphite API to populate the notification with more historic metric data.
• Application or business-specific metrics in addition to host-centric metrics. Remember, you can retrieve anything in the index or in Graphite.
• Links to runbooks, documentation, or a wiki.
• Graphs embedded directly into the notification. You can render graphs using the Graphite API.
• Decorating notifications with contextual configuration information about hosts and services. For example, pulling from a configuration management store like PuppetDB, from a distributed configuration tool like Zookeeper, or from a traditional CMDB.
• Shell-out from Riemann to gather additional context. For example, if a notification was a disk space notification, then your first step on that host would typically be to run the df -h command. You could have Riemann run that command on the impacted host via a tool like MCollective, PSSH, or Fabric, and add the resulting output to the notification.


TIP Of interest in this domain is Nagios Herald which aims to decorate Nagios notifications. It provides some excellent examples and ideas for adding appropriate context to notifications.

Adding Slack as a destination

Even though we’ve improved our email notifications, only having a single type of notification is a little limiting. Let’s add another type of notification toRiemann: Slack notifications. Slack is a SaaS-based team communication service with the concept of channels and private messages. Notifications to services like Slack, where you are notifying a channel or room of recipients, can be useful for sharing quick real-time notifications of situations that require “in the moment” reactions.

TIP This is just an example of adding chat notifications to Riemann. You aren't limited to using Slack—Riemann comes with plugins for other services like Campfire and Hipchat—but you can learn how to use Slack through their documentation.

We’re going to start by creating a new function to send notifications to Slack. Let’s create this function now in a new file, slack.clj, to be included in our Riemann configuration.


Listing 10.33: Adding the slack.clj file

$ touch /etc/riemann/examplecom/etc/slack.clj

Now let’s populate this file.


Listing 10.34: The slack notification configuration

(ns examplecom.etc.slack
  (:require [riemann.slack :refer :all]))

(def credentials {:account "examplecom", :token "123ABC123ABC"})

(defn slack-format
  "Format our Slack message"
  [event]
  (str "Service " (:service event) " on host " (:host event)
       " is in state " (:state event) ".\n"
       "See http://graphitea.example.com:3000/dashboard/script/riemann.js?host=" (:host event)))

(defn slacker
  "Send notifications to Slack"
  [& {:keys [recipient] :or {recipient "#monitoring"}}]
  (slack credentials {:username "Riemann bot"
                      :channel recipient
                      :formatter (fn [e] {:text (slack-format e)})
                      :icon ":smile:"}))

We’ve first required the riemann.slack functions which contain Riemann’s slack connector.


TIP We could also connect to other services like Campfire or Hipchat.

Next we’ve used the def statement to define a var called credentials that contains our Slack security credentials. To connect to Slack we need the name of our Slack organization, here examplecom, and a token. The Slack connection uses Incoming Webhooks to receive Riemann notifications. To use it, we’ve defined a Slack Incoming Webhook for our Riemann monitoring notifications.

We got back a URL when we created the Webhook:

Listing 10.35: The Slack Webhook URL

https://hooks.slack.com/services/T037T0K31/B048W3VQY/123ABC123ABC

The token is the last piece of your Slack Webhook URL, here 123ABC123ABC. Next, we’ve defined another function called slack-format that will format our actual Slack notification.

Listing 10.36: The slack-format function

(defn slack-format
  "Format our Slack message"
  [event]
  (str "Service " (:service event) " on host " (:host event)
       " is in state " (:state event) ".\n"
       "See http://graphitea.example.com:3000/dashboard/script/riemann.js?host=" (:host event)))

Our function constructs a message to be sent. It takes an argument of a standard Riemann event. It then uses the str function, which concatenates everything passed to it into a single string. In this string we extract the values of the :service, :host, and :state fields from the event to create a notification like:

Listing 10.37: The Slack notification

Service rsyslog on host graphitea is in state critical.
See http://graphitea.example.com:3000/dashboard/script/riemann.js?host=graphitea

We’ll see how to apply this function shortly.

Lastly, we’ve defined another function called slacker that will actually send events to our Slack instance. The slacker function has an argument construction we’ve not seen before.

Listing 10.38: Our default argument

(defn slacker
  "Send notifications to Slack"
  [& {:keys [recipient] :or {recipient "#monitoring"}}]
...

Here we’ve specified what looks like an optional argument for the function, rep- resented by the &, but is instead a default argument. We use the :keys and :or options to specify an argument of recipient, which will be the Slack room to which we send our notification. The default for this is #monitoring. We’re saying to Clojure: run this function with whatever argument we specify for recipient or default, using #monitoring as the recipient room.


TIP You can read more about functions and their arguments in the Clojure func- tion documentation.

Our function then calls the slack plugin, passes in the required credentials variable, and configures the details of our notification. We can configure a variety of options, in addition to the recipient, including:

• The :username of the bot that'll announce the Riemann events.
• The :icon that defines the icon the bot uses.

We’ve also defined an option called :formatter.

Listing 10.39: The Slack formatter option

:formatter (fn [e] {:text (slack-format e)})

The :formatter option is useful and allows us to structure the actual message our Slack channel receives. This is much like the formatting we’ve done for email subjects and bodies. In this case we’ve passed our slack-format function to the formatter :text option with a variable of e, which is shorthand for the event. The :text option formats the Slack message itself. We can also add attachments or more complex formatting to the message, as you can see in the Slack plugin source code. We can use our new destination like so:


Listing 10.40: Using the Slack notification

(slacker "#operations")

Or we can just skip the recipient if we want it to default to our #monitoring room.

Listing 10.41: Using the Slack notification default

(slacker)
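As a sketch only, and not part of our existing configuration, one way we might wire this into our streams is to send critical, notification-tagged events to Slack. It assumes we've required examplecom.etc.slack in our riemann.config.

(require '[examplecom.etc.slack :refer :all])

...

;; Send critical notification-tagged events to the #operations Slack channel.
(tagged "notification"
  (where (state "critical")
    (slacker "#operations")))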

Adding PagerDuty as a destination

Let’s add a final type of notification using a service called PagerDuty. PagerDuty is a commercial notification platform. It allows you to create users, teams, and on-call schedules and escalations. You can then send notifications to the platform that can be routed to those teams via a variety of mechanisms: email, SMS, or even via a voice call.

TIP You can learn how to use PagerDuty using their documentation.

We’re choosing PagerDuty as it’s a good example of these types of services. If PagerDuty isn’t a fit for you, there are several other commercial alternatives sup- ported in Riemann including VictorOps and OpsGenie. Let’s start by creating a new function to send notifications to PagerDuty. We add our new function to a file, pagerduty.clj, in the /etc/riemann/examplecom/etc/

Version: v1.0.3 (2c4a7d0) 549 Chapter 10: Notifications directory, to be included in our Riemann configuration.

Listing 10.42: Adding the pagerduty.clj file

$ touch /etc/riemann/examplecom/etc/pagerduty.clj

Now let’s populate this file.


Listing 10.43: The pagerduty.clj file

(ns examplecom.etc.pagerduty
  (:require [riemann.pagerduty :refer :all]
            [riemann.streams :refer :all]))

(defn pd-format
  [event]
  {:incident_key (str (:host event) " " (:service event))
   :description (str "Host: " (:host event) " " (:service event)
                     " is " (:state event) " (" (:metric event) ")")
   :details (assoc event :graphs (str "http://graphitea.example.com:3000/dashboard/script/riemann.js?host=" (:host event)))})

(def pd (pagerduty {:service-key "ABC123ABC" :formatter pd-format}))

(defn page []
  (changed-state {:init "ok"}
    (where (state "ok")
      (:resolve pd)
      (else
        (:trigger pd)))))

In our pagerduty.clj file, we first add a namespace, examplecom.etc.pagerduty, and require the riemann.pagerduty library. We then define a function called pd-format. This function formats our PagerDuty message. We don't need to specify a function for this; there's a default format, but we want to customize our message a bit. The function takes the event as a parameter and needs to return a map containing:

• The incident_key — PagerDuty’s incident key identifies a specific incident. We use the host and service pairing. • The description — A short description of the problem. • The details — The full event.

It’s here in this function that we can tweak our PagerDuty format. In our case we’ve only made one change.

Listing 10.44: The :details key with graph URL

:details (assoc event :graphs (str "http://graphitea.example.com:3000/dashboard/script/riemann.js?host=" (:host event)))

We’ve used the assoc function to add a new entry to our event map. The new entry is called :graphs and contains the URL to our scripted Grafana dashboard. If we wished, we could also use something like our context function from our email notifications to add context to our PagerDuty notifications.

We next define a var called pd which configures the pagerduty plugin. It has two arguments: our PagerDuty service key, here represented by the fake key ABC123ABC, and the name of our formatter function, pd-format. To create a service key, add an API service to PagerDuty and select a type of Riemann. Then take the resulting service key and add it to your Riemann configuration.

We also define a function called page to execute our adapter. Inside our page function we've used the changed-state var we introduced in Chapter 6. This detects changes in the value of the :state field of an event. It also implies a by stream that splits events on the :host and :service fields, generating an event for every host and service pair.

We’ve specified an initial state assumption of ok. The initial state allows us to start from a known state when Riemann first sees a service. With this configuration Riemann assumes that a new service it sees with a :state of ok is in its normal state, and to not take any action.

Inside the pagerduty adapter there are three actions we can take with an incoming event. These actions relate to PagerDuty’s incident management process.

• :trigger — Trigger an incident.
• :acknowledge — Acknowledge an incident exists but don't resolve it.
• :resolve — Resolve an incident.

In our case, we want any event with a :state of ok to resolve an incident. We also want any event whose :state is not ok to trigger an incident. We’ve done this with a where stream matching the :state field of the event.

NOTE We haven’t used the :acknowledge action at all.

We then use our page destination in a notification.


Listing 10.45: Using our PagerDuty notification

(require '[examplecom.etc.pagerduty :refer :all])

...

(tagged "notification" (changed-state {:init "ok"} (adjust [:service clojure.string/replace #"^processes-(.*)\/ ps_count$" "$1"] (page))))

We first need to require the functions in examplecom.etc.pagerduty. Here we've used our page function on our notification-tagged events to trigger or resolve PagerDuty incidents. If a process failure is detected then a PagerDuty incident will be triggered.

Figure 10.3: A PagerDuty incident

Alternatively, if the monitored process recovers, then the incident will be resolved.

Maintenance and downtime

One of the other aspects of notifications we should consider is when not to send them. On many occasions you don't want notifications to trigger because you're performing maintenance on a host or service. In these cases you know a service might fail or be deliberately stopped—you don't need a notification to be sent. With Riemann we solve this by injecting maintenance events. A maintenance event is a normal Riemann event that we identify by host, service, or a specific tag. The event will have an infinite TTL, or time to live. If we want to start a maintenance window we send Riemann one of these maintenance events with a :state of active. If we want to end the maintenance window we send another event with a :state of anything but active.

To check for maintenance events we're going to build a check that will execute before notifications. The check will search the Riemann index for any maintenance events. If it finds an event that matches the host and service that triggered the notification then it will check the event's :state. For any events that have a :state of active it will abort the notification. Let's start with the function that will perform that check. We're going to create a new file:

/etc/riemann/examplecom/etc/maintenance.clj

We’ll populate that file with this function.


Listing 10.46: The maintenance-mode? function

(ns examplecom.etc.maintenance
  (:require [riemann.streams :refer :all]))

(defn maintenance-mode?
  "Is it currently in maintenance mode?"
  [event]
  (->> '(and (= host (:host event))
             (= service (:service event))
             (= (:type event) "maintenance-mode"))
       (riemann.index/search (:index @core))
       first
       :state
       (= "active")))

We create a namespace, examplecom.etc.maintenance, and require the riemann.streams library to provide access to Riemann's streams.

This new function, called maintenance-mode?, takes a single argument: an event. It then uses a macro, ->>, known as the thread-last macro. The ->> macro threads the result of each expression into the last argument position of the next form, so the forms read in execution order. You can think about this expression as being:


Listing 10.47: The ->> expression expanded

(= "active" (:state (first (riemann.index/search (:index (clojure .core/deref core)) (quote (and (= host (:host event)) (= service (:service event)) (= (:type event) "maintenance-mode")) )))))

This series of expressions asks: does the first event returned by searching the index for a matching host, a matching service, and a custom :type field with a value of maintenance-mode have a :state of active? The index search is being done using the riemann.index/search function we discussed earlier in the chapter. We then wrap our notifications with the new function.

Listing 10.48: Checking maintenance prior to notifications

(tagged "notification" (where (not (maintenance-mode? event)) (changed-state {:init "ok"} (adjust [:service clojure.string/replace #"^processes-(.*) \/ps_count$" "$1"] (page)))))

Let’s schedule some maintenance. We can do this a number of different ways, such as by manually submitting an event using one of the Riemann clients like that Riemann Ruby client. We’ll install the Ruby client now.


Listing 10.49: Installing the Riemann Ruby client

$ sudo gem install riemann-client

We’ll use the irb shell to send a manual event.

Listing 10.50: Sending a Riemann maintenance event

$ irb
irb(main):001:0> require 'riemann/client'
=> true
irb(main):002:0> client = Riemann::Client.new host: 'riemanna.example.com', port: 5555, timeout: 5
irb(main):003:0> client << {service: "nginx", host: "tornado-web1", type: "maintenance-mode", state: "active", ttl: Float::INFINITY}
=> nil
irb(main):003:0>

We require the riemann/client and then create a client that connects to our Riemann server. We then send a manual event with a :service of nginx for the relevant host, tornado-web1, and with a custom field of :type set to a value of maintenance-mode. The :state field will be set to active and the TTL of the event is forever. If a notification were to now trigger on the tornado-web1 host, then Riemann would detect the active maintenance event and not send the notification.

If the maintenance window was over we could disable it like so:


Listing 10.51: Disabling the maintenance window

irb(main):002:0> client << {service: "nginx", host: "tornado-web1", type: "maintenance-mode", state: "inactive", ttl: Float::INFINITY}
=> nil

Using the Riemann client directly is a little clumsy so we’ve actually written a tool to help automate this process. It’s a Ruby gem called maintainer. It can be installed via the gem command.

Listing 10.52: Installing the maintainer gem

$ sudo gem install maintainer

We then use it like so:

Listing 10.53: Using the maintainer gem to generate an active event

$ maintainer --host riemanna.example.com --event-service nginx

This will generate a maintenance event for the current host (or you can specify a specific host with the --event-host flag). The event will look something like:


Listing 10.54: The maintainer active event

{:host tornado-web1, :service nginx, :state active, :description Maintenance is active, :metric nil, :tags nil, :time 1457278453, :type maintenance-mode, :ttl Infinity}

We can also disable maintenance events like so:

Listing 10.55: Using the maintainer gem to disable maintenance

$ maintainer --host riemanna.example.com --event-service nginx --event-state inactive

Which will generate an event like this:

Listing 10.56: The maintainer inactive event

{:host tornado-web1, :service nginx, :state inactive, :description Maintenance is inactive, :metric nil, :tags nil, :time 1457278453, :type maintenance-mode, :ttl Infinity}

We can then wrap this binary in a configuration management tool, a cron job, or whatever else triggers our maintenance windows.


Learning from your notifications

We’ve done our best to uplift the quality of our notifications in this chapter. We’ve also created a fairly small collection of checks through the book. In the next chapters we’ll put all of this together and you’ll see what the volume of checks and notifications might be for an application stack, and hence be able to extrapolate an overall volume for your environment.

With the full set of checks and notifications in your environment you’re likely to find that notifications generally fall into two categories:

• Notifications that are useful and that you need to action.
• Notifications that you delete.

This is also an evolutionary process rather than an immediate one. A notification that was useful one day may become an "oh, that again ... delete" notification a month later. Other times a notification is for a host, service, or application that is no longer important or mission critical but still exists. Or, more insidiously, a notification is for a component or aspect of a host, service, or application that is still important but doesn't rise to the level of immediate action. These are the sort of notifications we tend to address by placing a ticket in a backlog or putting them aside for a rainy day when we might have time to take a look.

There are several approaches we can take to these notifications. There's the obvious: deletion. We could also turn off notifications for these hosts, services, or applications. We could even let them pile up. Most of us have had, on occasion, a mail filtering rule that places spurious notifications from monitoring tools into a specific folder. The approach we're going to take, though, is to make a rule that any notification we don't action we count. We then use these spurious notifications to detect trends and highlight potential issues in our applications and infrastructure.


To do this we’re going to create a new function called count-notifications. Let’s create a file to hold our new counting function.

Listing 10.57: Adding the count-notifications.clj file

$ touch /etc/riemann/examplecom/etc/count-notifications.clj

Now let’s populate this file.

Listing 10.58: Adding the count-notifications function

(ns examplecom.etc.count-notifications
  (:require [riemann.streams :refer :all]))

(defn count-notifications
  "Counts notifications"
  [& children]
  (adjust [:service #(str % ".alert_rate")]
    (tag "notification-rate"
      (rate 5
        (fn [event]
          (call-rescue event children))))))

We have created a namespace, examplecom.etc.count-notifications, and required the Riemann streams. You can see we've defined a new function: count-notifications. It doesn't have any arguments, except any child streams passed to it (we're going to graph our notification rate, as you'll see shortly). We first adjust any :service field passed to our function to suffix .alert_rate to the end of the service. We also use the tag var to add a tag, notification-rate, to our :tags field. This will make it easier for us to work with our new event.

Next, we use the rate var to calculate a rate from our event. The rate var takes the sum of the :metric over a specified interval, here 5 seconds, and divides by the interval size. This calculates a rate that's emitted as an event every interval seconds. The rate starts as soon as an event is received, and stops when the most recent event expires.
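Before we apply it to our notifications, here's a tiny worked sketch of the arithmetic, using hypothetical numbers and a logging child stream rather than our real graphing stream.

; A worked sketch with hypothetical numbers: if five notification events,
; each with a :metric of 1, arrive inside a single five-second interval,
; rate sums them (5), divides by the interval (5 seconds), and emits one
; event with a :metric of 1.0, that is, one notification per second.
(rate 5
  #(info "notifications per second:" (:metric %)))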

So, in our case, we sum the :metric over five seconds, divide by five to calculate the rate, and then emit an event with the :service name suffixed by .alert_rate every five seconds. Let's see how we might use our new function. Let's say we have a check like so:

Listing 10.59: A memory usage check

(by :host
  (where (and (service "memory/percent-used")
              (>= metric 80.0))
    (email "[email protected]")))

This check uses the by stream to split events by :host and then checks the memory/percent-used metric from collectd. It notifies via email if the memory usage metric exceeds 80%. Let's say we get a lot of these notifications and they sit unactioned in an email folder. Rather than deleting the check or suppressing the notification, we update it to use our new count-notifications function. Here's our updated check.


Listing 10.60: A memory usage check notification count

(by :host
  (where (and (service "memory/percent-used")
              (>= metric 80.0))
    (count-notifications
      (smap rewrite-service graph))))

Note we’ve replaced the email function with count-notifications and directed the output into a child stream, (smap rewrite-service graph), which will write our new event to Graphite. Now when a notification is triggered we’ll see anew metric being created. For example, for the graphitea host and taking the rewrite rules we established in Chapter 6 into consideration:

• productiona.hosts.graphitea.memory.used.alert_rate

This would be the rate of notifications over the five-second interval. We then use this rate, either in a graph or by creating another check that monitors a rate threshold. Let's see how we might do this:


Listing 10.61: A memory usage check reinjection

(by :host
  (where (and (service "memory/percent-used")
              (>= metric 80.0))
    (count-notifications
      reinject
      (smap rewrite-service graph))))

(tagged "notification-rate" (by :service (where (>= metric 100) (with { :state "warning",:description "Notification count warning" } (email "[email protected]")))))

We pretty much see our previous memory usage check with one addition: the reinject var. The reinject var sends events back into the Riemann core. Reinjected events flow through all top-level streams, as if they've just arrived from the network.

Below our check we see that we've created another check. This one uses the tagged stream to select any events, including our reinjected events, with a tag of notification-rate. We then use a by stream to split our events by the :service field and a where stream to match any event with a :metric value equal to or higher than 100. This would match any service with a notification rate of 100 notifications in five seconds, almost certainly signaling some kind of issue.

Next, we use the with stream to create a new copy of our event with new values for certain fields. Here we change the :state field to a value of warning and add a :description field of Notification count warning. The new event is then emailed to [email protected]. Our event would end up something like this:


Listing 10.62: Our notification count warning notification

Monitoring notification from Riemann!

Time: Wed Jan 06 01:05:12 UTC 2016
Host: graphiteb
Service: memory/percent-used.alert_rate
State: warning
Metric: 100.986363814988014
Tags: [notification-rate, collectd]

Description: Notification count warning

Host context:
CPU Utilization: 78.50%
Memory Used: 88.31%
Disk(root): 78.49% used.

Grafana Dashboard:

http://graphitea.example.com:3000/dashboard/script/riemann.js? host=graphiteb

This is an automated Riemann notification. Please do not reply.

Now we get a notification when the rate of notifications themselves is abnormal or exceeds a specified threshold.


TIP Also of interest in this arena is OpsWeekly. It’s a tool written by the Etsy team that acts as a living life-cycle-reporting and State of the Union view of your operational environment.

Other alerting tools

• Flapjack - A monitoring notification router.
• Prometheus Alertmanager - Part of the Prometheus monitoring framework. Can be used as a standalone alert manager.

Summary

In this chapter we’ve extended our email notifications to provide additional con- text and information. We’ve used data from the Riemann index to give the reader of a notification some idea of the current state of the environment aroundthat host or service. We’ve also added the ability to reference external data from a variety of sources, such as by presenting a custom dashboard for a host or service. In addition, we’ve added new destinations that are useful for different types of notifications. Notifications to destinations like Slack can be useful for sharing quick real-time notifications of situations that require “in the moment” reactions. We can also make use of commercial notification services like PagerDuty, whose ability to manage teams of users, on-call schedules, and escalations make it easier to manage accountability for notifications.

We’ve also added a method for handling maintenance and downtime on our hosts and services and proposed a method by which we can learn from alerts we don’t plan to action. In the next chapter, we’re going to take all of the elements of our framework

Version: v1.0.3 (2c4a7d0) 567 Chapter 10: Notifications introduced in the book and demonstrate how we manage, measure, and notify on an entire application stack: from host to business metrics.

Chapter 11

Monitoring Tornado: a capstone

In the previous chapters of the book we built the base framework for monitoring. We started with our Riemann server to receive and route our events, added Graphite servers to store metrics, and added Grafana to visualize them. We then installed collectd on our hosts (including Docker containers) to allow us to collect metrics and deliver them to Riemann. We also added support for collecting logs via Syslog and feeding them into the ELK stack. We looked at how we could instrument an application to better understand its state and performance. Lastly, we updated our notifications and added some new notification destinations. That's a lot of moving pieces! You can see our current framework architecture here:


Figure 11.1: Current Monitoring Framework Architecture

In this and subsequent chapters, we're going to put all of those moving pieces together and monitor an application stack: from business metrics to application state to host metrics. The idea behind these chapters is not to definitively show you how to monitor specific tools or services, but rather to demonstrate potential techniques based on the framework we've just built. We'll provide self-contained and adaptable monitoring techniques for different types of services that can easily be used to monitor similar services or tools. You can then take these techniques, potentially extend them, and use them across your environment to provide monitoring coverage.

To do this we're going to look at a multi-tier, multi-host, multi-service application and use it to show you how to combine everything we've learned in previous chapters. We're going to look at the specific concerns of the application that we're monitoring and demonstrate how to gain insights into those concerns.

The Tornado application

Our application is called Tornado. It’s an order processing system that buys and sells items. The business owners of the Tornado application care:

• That items can be bought and sold with Tornado.
• That the Tornado user experience for buyers and sellers is acceptable.
• That they have visibility into the volume and value of items bought and sold.

The engineers who manage the Tornado application have built their concerns from the business owners' priorities:

• That the Tornado application is available, and that they are notified when the Tornado application is not available.
• That at least one Tornado web server is available as much as possible.
• That the 5xx error rate is less than 1% on the Tornado web servers.
• That the Tornado application's 99th-percentile latency is 100 milliseconds or less (see the sketch after this list).
• That the 99th-percentile latency of adding items to the Tornado DB is 3 milliseconds or less.
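To make the latency concern concrete, here's a minimal sketch, not the book's final implementation, of how the 99th-percentile target might be expressed as a Riemann check. The tornado.api.request.time service name and millisecond units are assumptions for illustration only.

; A minimal sketch, assuming request timing events arrive with a :service
; of "tornado.api.request.time" and a :metric in milliseconds. percentiles
; appends the point to the service name and emits an event every five
; seconds; we email if the 99th percentile exceeds 100 ms.
(where (service "tornado.api.request.time")
  (percentiles 5 [0.99]
    (where (> metric 100)
      (email "[email protected]"))))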


We’ll also add some additional checks and metrics that demonstrate more of our monitoring framework.

Now let’s take a look at the Tornado application’s architecture.

Application architecture

You can see the architecture of the Tornado application here:


Figure 11.2: Tornado’s architecture

The Tornado application is made up of three tiers.


Web tier

The Web tier is made up of three hosts: tornado-proxy, running HAProxy and load balancing two hosts, tornado-web1 and tornado-web2, running Nginx. We'll look at monitoring this tier in this chapter.

Application tier

The Application tier is made up of two hosts: tornado-api1 and tornado-api2. Each host contains a Clojure-based API running on the JVM that is the back end to our hosted website. We’ll look at monitoring this tier in Chapter 12.

Data tier

Our Data tier provides the database storage and data stores for our API. It’s made up of two hosts: tornado-db running MySQL and tornado-redis running a Redis database. We’ll look at monitoring this tier in Chapter 13.

Monitoring strategy

In the next three chapters we're going to walk through monitoring each tier, starting with the Web tier. We'll cover monitoring hosts, the services like web servers or databases on the hosts, as well as any application components on the hosts.

We’ll assume each host has a collectd daemon installed and Syslog configured to deliver to Logstash as we demonstrated in Chapters 5, 6, and 8 respectively.

We’re going to propose a small number of initial checks for our application, with a focus on application performance and state. We’ll build these checks so that you can easily expand or duplicate the design we’re proposing for other applications. These initial checks are just a selection of potential checks for the components in the Tornado application. Your environment may differ or your concerns may

Version: v1.0.3 (2c4a7d0) 574 Chapter 11: Monitoring Tornado: a capstone vary. You should use the examples here as starting points rather than merely duplicating them.

TIP We’re going to edit a lot of files and change a bunch of settings inthe chapter. In your real world environment don’t do it manually. Use a configuration management tool to manage your configuration files.

Tagging our Tornado events

In our first step, on all our hosts, let's tell Riemann that these events are for Tornado. In Chapter 5, when we installed the write_riemann plugin for collectd, we added some configuration.

Listing 11.1: The write_riemann configuration from Chapter 5

Host "riemanna.example.com" Port "5555" Protocol TCP StoreRates true Tag "collectd"

Note we specified a Tag option to add the tag collectd to our Riemann :tags field. Let's add another tag to identify the specific application being monitored on these hosts.

Listing 11.2: Adding a new tag for Tornado

Host "riemanna.example.com" Port "5555" Protocol TCP StoreRates true Tag "collectd" Tag "tornado"

Here we’ve added the tag tornado to collectd. We could also add the tag to existing Tag option.

Listing 11.3: Adding to an existing Tag

Tag "collectd" "tornado"

We’re going to use a second Tag option for ease of readability.

Now when events are generated by collectd and sent to Riemann their :tags field will look like:


Listing 11.4: The Riemann :tags field for Tornado

:tags [collectd tornado]

NOTE This is a host-level tag. You could specify more tags if there were other applications or components that you wished to tag on the host by adding more Tag options.
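To give a sense of how this tag is used on the Riemann side, here's a minimal sketch, not the book's final wiring, of selecting Tornado events by tag inside the existing collectd wrapper.

; A minimal sketch: select events carrying both the collectd and tornado
; tags and, for now, just log them. Later in the chapter these events are
; handed to a dedicated checks function instead.
(tagged "collectd"
  (tagged "tornado"
    #(info "tornado event" (:host %) (:service %))))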

Monitoring Tornado — Web tier

Our first tier is the Web tier, a load-balanced website using HAProxy and Nginx. For Tornado's Web tier, what do we care about? Let's break down our concerns:

• That HAProxy and Nginx are running on our hosts.
• That the site is load balanced, our nodes are up, and that at least two Tornado web servers are available at all times (a rough sketch of counting available servers follows this list).
• That the 5xx error rate is less than 1% on the Tornado web servers.
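As a taste of the second concern, here's a minimal sketch, not the chapter's eventual approach, of counting how many web servers Riemann has recently heard from. The folds alias and the email destination are assumed to be available in your Riemann configuration.

; A minimal sketch: coalesce keeps the latest nginx/connections-accepted
; event per host, folds/count counts them, and we email if fewer than two
; web servers are reporting.
(where (service "nginx/connections-accepted")
  (coalesce 10
    (smap folds/count
      (where (< metric 2)
        (email "[email protected]")))))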

We’ll start with configuring the monitoring for each component, HAProxy and then Nginx, and then we’ll look at using the events and metrics generated by that monitoring to address those concerns.

Monitoring HAProxy

The first service we’re going to monitor is HAProxy. HAProxy provides high avail- ability, load balancing, and proxying for TCP and HTTP-based applications. It’s

Version: v1.0.3 (2c4a7d0) 577 Chapter 11: Monitoring Tornado: a capstone written in C and, despite its age, is still commonly used by a wide variety of sites and organizations. In our case HAProxy is configured on the tornado-proxy host. We’re going to monitor this instance. HAProxy has several available methods for monitoring its state:

• It can create a Unix socket that can be queried for statistical data.
• It can create an administrative web page that contains statistics and state.
• It can generate logs.

We’re going to use the first and last methods to monitor HAProxy: we’re goingto extract statistics and state from the local Unix socket, and we’re going to collect HAProxy’s logs and send them to Logstash. The first method, the Unix socket, outputs a CSV dump of the current stateof the running HAProxy daemon. This is essentially a CSV dump of the same data displayed in HAProxy’s statistics and management HTTP console.

Figure 11.3: The HAProxy HTTP console

That dump contains all the metrics and information we’ll need to monitor HAProxy.


From Chapter 5 we already have collectd installed and configured on this host, and it’s sending host-based metrics and events to Riemann. We’re going to take advantage of this installation to install a custom plugin that will monitor HAProxy.

Adding the statistics socket

Before we can monitor using our custom plugin we need to enable HAProxy's statistics Unix socket. We do this in HAProxy's configuration on tornado-proxy, usually located in the /etc/haproxy directory. Let's take a look at the HAProxy configuration file, usually /etc/haproxy/haproxy.cfg; for Tornado the section we care about is principally the global configuration section.

Listing 11.5: The HAProxy configuration

global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    user haproxy
    group haproxy
    daemon

The global block configures global options for the HAProxy daemon. We need to add a new option called stats to this block.


Listing 11.6: Adding the stats option to our HAProxy configuration

global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    user haproxy
    group haproxy
    daemon
    stats socket /var/run/haproxy.sock mode 600
    stats timeout 5s

You can see that we’ve added two lines starting with stats. The first line config- ures our Unix socket at /var/run/haproxy.sock and sets the permissions of the socket to 600. The second line sets a timeout for queries to our socket of five seconds. If we then restart HAProxy:

Listing 11.7: Restarting the HAProxy service

$ sudo service haproxy restart

We should see the socket available in the /var/run directory.


Listing 11.8: The /var/run/haproxy.sock socket

james@tornado-proxy:/var/run# ls -l /var/run/haproxy.sock
srw------- 1 root root 0 Mar 29 23:22 /var/run/haproxy.sock

Now we’ll configure collectd to read from this socket.

Installing the HAProxy plugin

We saw in Chapter 5 that collectd uses read and write plugins to collect and write metrics. This includes community-contributed plugins that collectd can execute using helper plugins. We saw one of these when we enabled Docker monitoring in Chapter 7. In this case our HAProxy plugin is another Python plugin that we’ll execute using the collectd python plugin.

Let’s start by grabbing our plugin. We’ll create a directory to hold it, /usr/lib/ collectd/haproxy, and then download the plugin into it.

Listing 11.9: Download the haproxy.py plugin

$ mkdir /usr/lib/collectd/haproxy
$ cd /usr/lib/collectd/haproxy
$ wget https://raw.githubusercontent.com/jamtur01/collectd-haproxy/master/haproxy.py

NOTE Credit for the original plugin goes to Michael Leinartas, German Gutierrez, and SignalFx.


Here we’ve created the directory, changed into the /usr/lib/collectd/haproxy directory (on Red Hat this would be the /usr/lib64/collectd/haproxy directory). We’ve then downloaded the plugin from GitHub. You can take a look at its source. It defines a list of metrics and some configurable options. It also has aseriesof methods that reads the socket and parses the metrics.

TIP You can learn about writing your own Python plugins on the collectd Python man page.

Configuring the HAProxy plugin

Now that we’ve installed the plugin we need to tell collectd about it and configure it. To do that we’re going to add a configuration file for the plugin to the /etc/ collectd.d directory called haproxy.conf.

Listing 11.10: Creating a haproxy.conf configuration file

$ sudo vi /etc/collectd.d/haproxy.conf

Let’s populate that file.


Listing 11.11: The haproxy.conf configuration file

<LoadPlugin python>
  Globals true
</LoadPlugin>

<Plugin python>
  ModulePath "/usr/lib/collectd/haproxy/"
  Import "haproxy"

  <Module haproxy>
    Socket "/var/run/haproxy.sock"
  </Module>
</Plugin>

LoadPlugin processes
<Plugin processes>
  Process "haproxy"
</Plugin>

Our configuration file loads the python plugin. The Globals true line ensures Python's standard libraries are available so we can use them in our plugins. Next we configure the python plugin. We tell collectd where to find our plugins, here /usr/lib/collectd/haproxy (again, this would be /usr/lib64/collectd/haproxy on Red Hat). We then import the haproxy plugin we just installed. This loads the plugin into collectd. Then we configure the plugin itself. We tell it where to find the HAProxy socket, /var/run/haproxy.sock, that we created earlier.


Lastly, we’ve configured the processes plugin that we added in Chapter 5. The processes plugin monitors processes in general on the host but can also be con- figured to drill down and monitor specific processes in more detail. Inthiscase we’ve configured it to match the haproxy process. Once we have our configuration in place we enable it by restarting collectd.

Listing 11.12: Restarting the collectd daemon for HAProxy

$ sudo service collectd restart

Adjusting the HAProxy metric names in Riemann

We should now see our HAProxy events coming through to our Riemann server. Let’s look at one of them.

Listing 11.13: HAProxy events in Riemann

{:host tornado-proxy, :service haproxy/derive-backend.stats.response_4xx, :state ok, :description nil, :metric 0, :tags [collectd tornado], :time 1453656413, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type derive, :type_instance backend.stats.response_4xx, :type derive, :plugin haproxy}

We see that the event is tagged with collectd and tornado. It’s coming from our tornado-proxy host and, in this case, our :service field is: haproxy/derive-backend.stats.response_4xx

Which translates into: 4xx HTTP response codes from the back end. With our current configuration this metric will be sent on to Graphite and graphed.
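Since one of our stated concerns is the 5xx error rate, here's a minimal sketch, not the book's final check, of watching the matching 5xx counter. The response_5xx service name is an assumption mirroring the 4xx metric above, and the zero threshold is purely illustrative.

; A minimal sketch: alert whenever the HAProxy back end reports any 5xx
; responses. Assumes a haproxy/derive-backend.stats.response_5xx metric
; that mirrors the response_4xx metric we just saw.
(where (and (service "haproxy/derive-backend.stats.response_5xx")
            (> metric 0))
  (email "[email protected]"))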


And, like our other collectd metrics, the metric name may be a little hard to parse. Let’s use the collectd metric name rewriting we built in Chapter 6 to make this a little more palatable. Edit the /etc/riemann/examplecom/etc/collectd.clj file on our Riemann host and add a new service rewrite to the list.

Listing 11.14: Updating HAProxy metrics in Riemann

(def default-services
  [{:service #"^load/load/(.*)$" :rewrite "load $1"}

   ...

   {:service #"^haproxy/(gauge|derive)-(.*)$" :rewrite "haproxy $2"}])

We’ve added an entry at the bottom of the default-services var: a rewrite of our HAProxy metric names. It uses a regular expression to capture the gauge or derive statements from the front of our metric and then rewrite the remaining path via a capture. Now the metric with the :service field of: haproxy/derive-backend.stats.response_4xx

Will become: haproxy.backend.stats.response_4xx

Which is much easier to navigate. We’ll need to restart or reload Riemann to activate this updated rewriting.

Listing 11.15: Reloading Riemann for HAProxy rewrites

$ sudo service riemann reload


Collecting HAProxy logs

By default, HAProxy logs to Syslog. This is configured inside the /etc/haproxy/haproxy.cfg file on most distributions.

Listing 11.16: The HAProxy log configuration

global
    log /dev/log local0
    log /dev/log local1 notice
...

We see that there is a global configuration option, log, which will send all events to /dev/log, which in turn will write them to Syslog.

We’re going to make a slight adjustment to this configuration to help identify the Tornado application’s specific events. To do this we’re going to modify the Syslog program name sent by HAProxy. This normally defaults to haproxy but we’re going to change that with the log-tag configuration directive in our haproxy.cfg file. Let’s update that file now.

Listing 11.17: The updated HAProxy log configuration

global
    log /dev/log local0
    log /dev/log local1 notice
    log-tag tornado-haproxy
...

Now we need to restart HAProxy to enable this change.


Listing 11.18: Restarting HAProxy to change log tags

$ sudo service haproxy restart

TIP HAProxy’s log directive also allows us to send events directly via a Syslog daemon, for example: log logstasha.example.com:5514. We could hence send the events, without going through the local Syslog, directly to Logstash if we wished.

After Chapter 8, RSyslog has been happily sending these log entries on to Logstash. On the Logstash end we're not currently doing anything special to parse them, so they are just being treated by grok as standard Syslog messages. Let's change that by parsing those events in Logstash to extract the useful context from them. HAProxy has quite a complex, structured log format called the HTTP log format. Let's look at a typical HAProxy event.

Listing 11.19: HAProxy log entry

Jan 18 23:21:03 tornado-proxy tornado-haproxy[27604]: 66.108.110.85:57896 [18/Jan/2016:23:21:03.239] tornado-www tornado-web/tornado-web2 218/0/1/1/220 200 447 - - ---- 2/2/0/1/0 0/0 "GET / HTTP/1.1"

We see the event is prefixed with the typical components of a Syslog message: date, host, program (now our updated tornado-haproxy instead of the default haproxy), and PID. It then contains, from HAProxy, the details of HTTP connections or similar events, such as the source IP address and port, the front end and back end that responded to the request, HTTP status codes, and other data.

This event is then sent to Logstash, and we're going to parse out the components by adding a specific grok filter to our Logstash configuration. Let's do that now. On the logstasha host let's open up our /etc/logstash/conf.d/logstash.conf configuration file and edit our filter section.


Listing 11.20: The HAProxy updated Logstash configuration

filter {
  if [type] == "syslog" {
    grok {
      match => { "message" => "(?:%{SYSLOGTIMESTAMP:syslog_timestamp}|%{TIMESTAMP_ISO8601:syslog_timestamp}) %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\/%{DATA:container_name}\/%{DATA:container_id})?(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      remove_field => ["message"]
    }
    syslog_pri { }
    date {
      match => [ "syslog_timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss", "ISO8601" ]
    }
    if [syslog_program] == "tornado-haproxy" {
      grok {
        match => ["syslog_message", "%{HAPROXYHTTPBASE}"]
        remove_field => ["syslog_message"]
        add_field => { "tags" => "tornado" }
      }
    }
  }
}

We see our existing if conditional that checks if an event type is syslog and then parses out the Syslog message. This results in a new event with a number of fields prefixed with syslog_.

One of these fields, syslog_message, contains the event message minus the Syslog-prefixed components like date, program, and PID. We've added another if conditional inside our previous conditional because we want to take advantage of our existing Syslog parsing that selects events based on the contents of the syslog_program field created by our parsing. Here we're matching on tornado-haproxy. Any events that match are passed to our second grok filter. In this filter we're using the match attribute to match on our syslog_message field. To parse out and extract the components from the event, we're lucky that we have an existing grok pattern, installed with Logstash. That pattern, HAPROXYHTTPBASE, will parse the event and return a new event that will look something like this:


Listing 11.21: Our new parsed HAProxy event

{ ... "client_ip" => "66.108.110.85", "client_port" => "57896", "accept_date" => "18/Jan/2016:23:21:03.239", "haproxy_monthday" => "18", "frontend_name" => "tornado-www", "backend_name" => "tornado-web", "time_request" => "218", "time_backend_connect" => "1", "time_backend_response" => "1", "time_duration" => "220", "http_status_code" => "200", "bytes_read" => "447", "captured_request_cookie" => "-", "captured_response_cookie" => "-", "termination_state" => "----", "actconn" => "2", "retries" => "0", "srv_queue" => "0", "backend_queue" => "0", "http_verb" => "GET", "http_request" => "/", "http_version" => "1.1" "tags" => "tornado" }


NOTE Our grok filter will also remove the syslog_message field from the event, as it now serves no purpose and can be discarded to save space and tidy up after ourselves.

Note that we’ve added a field called tags with a value of tornado. This will be useful later when we want to identify Tornado events. The updated event now contains much of the data we’d need to diagnose or iden- tify any specific issues. From here the event will go into Elasticsearch tobein- dexed, and then would be available for searching or graphing in Kibana. We can also use our Logstash HAProxy events in Riemann. Let’s send our HAProxy events to Riemann using the riemann output we introduced in Chapter 8. To do this we need to construct the specific event we’re sending to Riemann inour Logstash output section. Let’s open our /etc/logstash/conf.d/logstash.conf configuration file on our Logstash server andthe riemann output plugin.


Listing 11.22: Sending HAProxy events to Riemann

...
output {
  if [syslog_program] == "tornado-haproxy" {
    riemann {
      host => "riemanna"
      sender => "%{syslog_hostname}"
      map_fields => true
      riemann_event => {
        "service" => "tornado.proxy.request"
        "metric" => "%{time_duration}"
        "state" => "ok"
      }
    }
  }

...
}

We’ve selected any events with a syslog_program of tornado-haproxy using an if conditional. We’re passing these events to the riemann plugin. We specify the target Riemann server, here riemanna, and specify the sender directive to set the value of the :host field of our Riemann event. We’re setting it tothe syslog_hostname field of our Logstash event.

We’ve also used the map_fields directive, which automatically maps Logstash fields to Riemann fields of the same name. For example, the http_status_code field in Logstash would become the :http_status_code field in the Riemann event. This might be overkill in many cases: this maps all fields. We could in-

Version: v1.0.3 (2c4a7d0) 593 Chapter 11: Monitoring Tornado: a capstone stead only set specific fields in the riemann_event directive. We use it here to set the :service field to tornado.proxy.request and map the :metric field to the time_duration field (notice we wrap fields in %{field} to identify them as part of an event). The time_duration field contains the full length of time a request took to process in HAProxy. We can then use this value to graph or notify on HAProxy response time. We also set the :state field to ok.
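As a hint of what that might look like, here's a minimal sketch, not the book's final check, that alerts on slow proxied requests. The 500 millisecond threshold is a hypothetical value.

; A minimal sketch: notify by email if a single proxied request takes
; longer than a hypothetical 500 ms, using the tornado.proxy.request
; events the Logstash riemann output creates.
(where (and (service "tornado.proxy.request")
            (> metric 500))
  (email "[email protected]"))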

TIP You can find the full breakdown of the HAProxy HTTP log format in the HAProxy documentation.

If we now restart Logstash we should start to see HAProxy events in Riemann.


Listing 11.23: A HAProxy event in Riemann

{:host tornado-proxy1, :service tornado.proxy.request, :state ok, :description nil, :metric 1.0,
 :tags tornado, :time 1453995163, :ttl 60, :frontend_name tornado-www, :time_queue 0, :srvconn 1,
 :termination_state ----, :haproxy_month Jan, :http_version 1.1, :captured_response_cookie -,
 :syslog_severity_code 5, :client_port 56855, :bytes_read 447, :time_duration 1,
 :time_backend_response 1, :syslog_facility user-level, :retries 0, :client_ip 85.17.156.11,
 :haproxy_monthday 28, :time_request 0, :haproxy_time 15:32:43, :type syslog, :actconn 1,
 :syslog_timestamp Jan 28 15:32:43, :port 58985, :srv_queue 0, :syslog_severity notice, :beconn 0,
 :backend_name tornado-web, :haproxy_hour 15, :http_request /, :accept_date 28/Jan/2016:15:32:43.241,
 :feconn 1, :captured_request_cookie -, :haproxy_year 2016, :syslog_facility_code 1, :syslog_pid 27604,
 :syslog_program tornado-haproxy, :server_name tornado-web1, :haproxy_second 43, :http_status_code 200,
 :backend_queue 0, :haproxy_minute 32, :http_verb GET, :syslog_hostname tornado-proxy,
 :time_backend_connect 0, :haproxy_milliseconds 241}

As you can see, this is a LOT of fields. We can trim this down by only setting the fields we want in the riemann_event directive rather than using the map_fields => true directive. We then direct this event to be graphed.


Listing 11.24: Graphing our HAProxy events in Riemann

(where (service "tornado.proxy.request") (smap rewrite-services graph))

Monitoring Nginx

We’ve monitored HAProxy on the tornado-proxy host. Next we want to monitor Nginx on the tornado-web1 and tornado-web2 hosts. Similar to HAProxy we’re going to monitor Nginx two ways:

• We’ll collect metric and status data from our running Nginx daemons. • We’ll collect logs from Nginx and send them onto Logstash.

Like our tornado-proxy host we’ve already installed collectd and configured log collection on these hosts, and we’re again going to make use of these existing installations.

Adding the Nginx status page

For collecting metrics from the Nginx daemons we make use of the Nginx status page. The Nginx status page provides metrics and state about the currently running Nginx daemons. We then have collectd use a plugin to gather data from that page.

The Nginx status page is provided by an optional module called: ngx_http_stub_status_module

The status page is an HTTP page that contains data like the following:


Listing 11.25: The Nginx status page

Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106

The stub status module provides information on active connections, accepted and handled requests, and the volume of client requests. The module is generally compiled and available with Nginx on most distributions. You can confirm it’s available for you like so:

Listing 11.26: Confirming the status module is available

$ 2>&1 nginx -V | tr -- - '\n' | grep status
http_stub_status_module

If you see a line, http_stub_status_module, then you have the stub_status module available. You can enable it by specifying a location in your Nginx configuration like so:


Listing 11.27: The Nginx status page

location /nginx_status {
  stub_status on;
  access_log off;
  allow 127.0.0.1;
  deny all;
}

Here we’ve defined a location block called /nginx_status (you can call it any- thing you like). We’ve enabled the status module with stub_status on;. We’ve also turned off access logging as we don’t want our visits to the status pagetogen- erate unnecessary log entries. Lastly, we’ve added some access controls to only allow access from 127.0.0.1 to prevent people not on the host from querying the status page. The deny all blocks any other access. We could also enable Nginx’s basic authentication. To enable the status page we need to reload or restart Nginx.

Listing 11.28: Restarting the Nginx service

$ sudo service nginx restart

We then view our status page on one of our web servers, for example http://tornado-web1/nginx_status, and we'll see metrics like this:


Listing 11.29: The Nginx status page

Active connections: 3
server accepts handled requests
 3 3 2
Reading: 0 Writing: 1 Waiting: 2

Enabling the Nginx collectd plugin

With our status page enabled we now configure collectd to scrape the page. We do this using the Nginx plugin. The Nginx plugin is installed by default on both Ubuntu and Red Hat distributions when you install collectd. Let's load and configure the plugin now. To do that we're going to add a configuration file for the plugin to the /etc/collectd.d directory called nginx.conf.

Listing 11.30: Creating a nginx.conf configuration file

$ sudo vi /etc/collectd.d/nginx.conf

Now let’s populate that file.


Listing 11.31: The nginx.conf collectd configuration

LoadPlugin nginx
<Plugin nginx>
  URL "http://127.0.0.1/nginx_status"
</Plugin>

LoadPlugin processes
<Plugin processes>
  Process "nginx"
</Plugin>

Here we’ve loaded the nginx plugin. We’ve also configured it by specifying the URL at which it can find the Nginx status page. It should match the location we configured above. We’ve also added our processes plugin configuration for the nginx process so we can drill down on it and use it to ensure Nginx is running.

Once we have our configuration in place we enable it by restarting collectd.

Listing 11.32: Restarting the collectd daemon for the Nginx plugin

$ sudo service collectd restart

We should now see Nginx events from our two web servers flowing into Riemann, for example:


Listing 11.33: An Nginx plugin event in Riemann

{:host tornado-web1, :service nginx/connections-accepted, :state ok, :description nil, :metric 306, :tags [collectd tornado], :time 1453658444, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type derive, :type_instance accepted, :type connections, :plugin nginx}

This shows us the Nginx connections accepted on the tornado-web1 host. As with our HAProxy metrics, we could rewrite the path here. However, it's usually fairly readable, so we're going to skip that rewrite in this case.
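For completeness, here's a sketch of what that rewrite would look like if you did want it; it would slot into the default-services list from Chapter 6 alongside the HAProxy entry.

; A sketch of the optional Nginx rewrite entry, mirroring the HAProxy one.
{:service #"^nginx/(.*)$" :rewrite "nginx $1"}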

TIP If you’re using Apache you can take a similar approach with the mod_status module. You can find out more in the Apache mod_status documentation. There’s also a corresponding Apache plugin for collectd.

Logging Nginx events

Like our HAProxy events, we also get Nginx’s log events from our web servers. There are two types of standard Nginx logs:

• Access logs — Logs showing incoming requests.
• Error logs — Logs showing Nginx server and application errors.

Your access log files have a predefined format controlled in your Nginx configuration by the log_format directive. The default format is called combined. It contains most of the information we'd likely need to debug issues: sources and destinations, request paths, HTTP status codes, and request sizes.

Error logs are controlled using the error_log directive. You can't adjust the format, but you can dial up and down the detail level of your logging. Unlike HAProxy, Nginx doesn't log to Syslog by default. On most distributions, though, Nginx writes logs into the /var/log/nginx directory, placing access logs into /var/log/nginx/access.log and error logs into /var/log/nginx/error.log. Nginx can also be configured to output logs for specific sites or locations, and the examples here can be extended to support that.

TIP You could also configure Nginx to log to Syslog if you wish.

As Nginx’s current logs aren’t going into Syslog, we need to collect them via RSys- log’s imfile module. We talked about the module in Chapter 8. The imfile module isn’t loaded by default but we’ll load and configure it by adding a new configuration file. We’ll create a configuration to collect Nginx access anderror logs now.

To start, we create a file in the /etc/rsyslog.d/ directory to configure our log collection. RSyslog loads configuration files in alphanumeric order. We normally prefix any file with a number, 00-filename, to help specify a load order. For example, we discovered in Chapter 8 that the default RSyslog configuration on Ubuntu loads in a file called 50-default.conf. Let's create and populate a file called 35-nginx.conf to hold our Nginx configuration.

We first create the file.


Listing 11.34: Create the nginx log configuration file

$ sudo touch /etc/rsyslog.d/35-nginx.conf

Then populate it.

Listing 11.35: The RSyslog Nginx configuration

module(load="imfile" PollingInterval="10")

input(type="imfile" File="/var/log/nginx/access.log" StateFile="nginx_access" Tag="tornado-nginx-access:" Severity="info" Facility="local7")

input(type="imfile" File="/var/log/nginx/error.log" StateFile="nginx_error" Tag="tornado-nginx-error:" Severity="error" Facility="local7")

In our first step we load the imfile module and tell it to poll the listed files. RSyslog will poll any listed files once every 10 seconds, set by the PollingInterval directive. Each file we wish to collect events from is then listed in an input block. The input block has a type, here imfile, and a series of directives. The File directive specifies the file we're collecting log events from. It should be the fully qualified path. The StateFile is a file that keeps track of where RSyslog is in the file. It should have a unique name: if you name two state files with the same name, expect weird clashes of state to occur. You don't need to specify a path—RSyslog will stash it for you, usually in /var/spool/rsyslog.

Next, we specify a Tag directive that will label our events. We will make use of this in Logstash to filter our events. Note the : at the end of the tag. This is important to correctly parse the tag. Lastly, we specify the Syslog Severity and Facility of our events. This will control how Syslog will process the incoming events. We use a severity of info for our Nginx access logs and error for our Nginx error logs. We use a Facility of local7. The local0 to local7 facilities are reserved for user purposes, to log specific daemons or applications. On most distributions the local0 to local7 facilities aren't otherwise processed, so these events won't generally be written into a file in the /var/log/ directory.

TIP This assumes an RSyslog version later than 6. Earlier versions used a different configuration file format.

We next restart RSyslog on the tornado-web1 and tornado-web2 hosts.

Listing 11.36: Restarting the RSyslog service for Nginx

$ sudo service rsyslog restart

This will start our collection of Nginx access and error logs. They’ll be shipped to Logstash as traditional Syslog events. A typical Nginx access log entry looks like:


Listing 11.37: A Nginx access log entry

45.55.148.142 - - [21/Jan/2016:05:01:44 +0000] "GET / HTTP/1.1" 200 219 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"

It contains the source IP, HTTP verb, path, status codes, and a variety of other useful information. Let's update our Logstash configuration to parse these events and extract what we need.


Listing 11.38: Updated Nginx Logstash configuration

filter {
  if [type] == "syslog" {

...

    if [syslog_program] == "tornado-haproxy" {
      grok {
        match => ["syslog_message", "%{HAPROXYHTTPBASE}"]
        remove_field => ["syslog_message"]
        add_field => { "tags" => "tornado" }
      }
    }
    if [syslog_program] == "tornado-nginx-access" {
      grok {
        patterns_dir => "/etc/logstash/patterns"
        match => { "syslog_message" => "%{NGINXACCESS}" }
        remove_field => ["syslog_message"]
        add_field => { "tags" => "tornado" }
      }
    }
  }
}

We’ve added another if conditional, again inside the parent conditional, to take advantage of the Syslog processing. We match on the syslog_program field and specifically look for the tornado-nginx-access as the Syslog program. This value

Version: v1.0.3 (2c4a7d0) 606 Chapter 11: Monitoring Tornado: a capstone is provided by setting the Tag directive in our imfile configuration. Inside our grok filter we’ve specified a new attribute, patterns_dir. This specifies a di- rectory in which we store patterns we’ve developed ourselves. We’re doing this because, unlike HAProxy, there isn’t a standard pattern for parsing Nginx access logs, and we’ll need to create our own. Let’s create that directory now on the logstasha host.

Listing 11.39: Make the patterns directory

$ sudo mkdir /etc/logstash/patterns

Now let’s create and populate a file called nginx in that directory with our Nginx access log pattern.

Listing 11.40: The nginx pattern

NGINXACCESS %{IPORHOST:remote_addr} - %{USERNAME:remote_user} \[%{HTTPDATE:time_local}\] "%{WORD:http_method} %{URIPATHPARAM:http_request} HTTP/%{NUMBER:http_version}" %{INT:http_status} %{INT:body_bytes_sent} %{QS:http_referer} %{QS:http_user_agent}

Patterns are listed in the file with a name, here NGINXACCESS, and then the pattern itself. Our pattern matches against our Nginx access log entries and pulls out useful data like the remote IP address, the method and request, and the status. It’ll turn our Nginx access log into something like this:


Listing 11.41: The parsed Nginx access log event

{ "syslog_hostname" => "tornado-web2", "syslog_program" => "tornado-nginx-access", ... "remote_addr" => "45.55.148.142", "remote_user" => "-", "time_local" => "21/Jan/2016:05:01:44 +0000", "method" => "GET", "request" => "/", "httpversion" => "1.1", "status" => "200", "body_bytes_sent" => "219", "http_referer" => "\"-\"", "http_user_agent" => "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36\"" "tags" => "tornado" }

NOTE Our grok filter will also remove the syslog_message field from the event, as it now serves no purpose and can be discarded to save space and tidy up after ourselves.

We see that the updated event now contains much of the data we'd need to diagnose or identify any specific issues with Nginx. From here the event will go into Elasticsearch to be indexed, and then would be available for searching or graphing in Kibana. Or, like our HAProxy events, we could send it through to Riemann. Here's an example configuration for the riemann output plugin.

Listing 11.42: Sending Nginx logs to Riemann

...

output {
  if [syslog_program] == "tornado-nginx-access" {
    riemann {
      host => "riemanna"
      sender => "%{syslog_hostname}"
      map_fields => true
      riemann_event => {
        "service" => "tornado.web.request"
        "metric" => "%{body_bytes_sent}"
        "state" => "ok"
      }
    }
  }

... }

Here we’re selecting our Nginx log events based on a syslog_program value of tornado-nginx-access. We’re passing them to our riemanna server via the host directive. We’ve specified the source host for the Riemann :host field from the syslog_hostname field. We’ve used the map_fields directive to map all possible fields from the Nginx log entries, but we could instead specify selective fieldsin

Version: v1.0.3 (2c4a7d0) 609 Chapter 11: Monitoring Tornado: a capstone the riemann_event directive. We’ve set the value of the Riemann :service field to tornado.web.request and the value of the :metric field to the body_bytes_sent field. This field contains the total bytes sent as part of the request. We’vealsoset the :state field to ok. If we now restart Logstash we should start to see HAProxy events in Riemann.

Listing 11.43: A Nginx event in Riemann

{:host tornado-web1, :service tornado.web.request, :state ok, :description nil, :metric nil,
 :tags tornado, :time 1454008963, :ttl 60, :http_version 1.1, :syslog_severity_code 5,
 :http_status 200, :http_user_agent "Pingdom.com_bot_version_1.4_(http://www.pingdom.com)",
 :syslog_facility user-level, :type syslog, :syslog_timestamp Jan 28 19:22:43, :port 57762,
 :syslog_severity notice, :http_request /, :remote_addr 45.55.148.142, :http_method GET,
 :syslog_facility_code 1, :syslog_program tornado-nginx-access, :body_bytes_sent 219,
 :http_referer "-", :remote_user -, :syslog_hostname tornado-web1, :time_local 28/Jan/2016:19:22:43 +0000}

You can see this is a LOT of fields. We can trim this down by only setting the fields we want in the riemann_event directive. We then direct this event to be graphed by extending the where stream we used for our HAProxy events.


Listing 11.44: Graphing our HAProxy and Nginx events in Riemann

(where (or (service "tornado.proxy.request")(service "tornado. web.request")) (smap rewrite-services graph))

Now Nginx events will be passed through Riemann into Graphite.

NOTE We’re not grok’ing our Nginx error logs. They are most useful stored for diagnostics, and their format is highly variable. We’re just storing them directly in Elasticsearch and making them available for search.

Addressing the Web tier monitoring concerns

Now that we have HAProxy and Nginx monitored and sending data, what can we do with that data? We identified a series of concerns for Tornado’s Web tier earlier. Those concerns were:

• That HAProxy and Nginx are running on our hosts.
• That at least two Tornado web servers are available at all times.
• That the 5xx error rate is less than 1% on the Tornado web servers.

Let’s look at how we might address these concerns.

The first concern, that HAProxy and Nginx are running on our hosts, we get for free thanks to Chapters 5 and 6. In those chapters we configured the collectd processes plugin to send notifications when any process being monitored falls below the threshold, in our case a minimum of one process per host. We also get part of our second concern, that our nodes are up, as a result of the process check. We could tweak this specifically to add thresholds for a specific process—for example, we could tweak the number of Nginx processes. Let's update our /etc/collectd.d/nginx.conf configuration file to do this now.

Listing 11.45: Matching multiple Nginx processes

...

LoadPlugin processes
<Plugin processes>
  Process "nginx"
</Plugin>

<Plugin "threshold">
  <Plugin "processes">
    <Type "ps_count">
      Instance "nginx"
      DataSource "processes"
      WarningMin 4
      FailureMin 1
    </Type>
  </Plugin>
</Plugin>

We’ve specified that a warning notification will be generated ifthe nginx process count drops below four, and to send a failure notification if the count drops below one process. These notifications will be processed in Riemann via our existing logic. Remember

Version: v1.0.3 (2c4a7d0) 612 Chapter 11: Monitoring Tornado: a capstone we have a wrapper around collectd-tagged events and inside that wrapper we further select all events tagged with notification. This code is below.

Listing 11.46: Added collectd monitoring to our Riemann configuration

... (tagged "collectd" (tagged "notification" (changed-state {:init "ok"} (adjust [:service clojure.string/replace #"^processes -(.*)\/ps_count$" "$1"] (page))))

(where (and (expired? event) (service #"^processes-.+\/ps_count\/processes ")) (adjust [:service clojure.string/replace #"^processes- (.*)\/ps_count\/processes$" "$1"] (page)))) ...

Let’s look at our notification logic. Events are checked for a change state, witha default of ok, and if they are in the new state then the :service field is adjusted to grab the process being monitored. The resulting event is paged out to the PagerDuty function, page, that we created in Chapter 10. Note our expired event logic, which detects when the processes plugin metrics disappear from the Riemann index. We again adjust the event to grab the process name and page out.


TIP Remember, as with our default configuration, every metric sent from collectd will be sent to Graphite and graphed. We don't need to do anything explicit for that to happen.

Setting up the Tornado checks in Riemann

Now let’s define some checks on our Tornado events. As we’re conscious of keeping our configuration neat and self-contained we’re going to define ournew streams in a separate namespace and include it in our main configuration. Let’s start by building some basic logic for our event notifications. We’re going to create a new file, tornado.clj in a new directory, /etc/riemann/examplecom/app. This file will hold the configuration for our Tornado monitoring. Now let’s create our new directory and the tornado.clj file.

Listing 11.47: Creating the examplecom directory and tornado.clj file

$ sudo mkdir -p /etc/riemann/examplecom/app
$ sudo touch /etc/riemann/examplecom/app/tornado.clj

We'll populate this file with some monitoring streams. We're going to create a core function to divide events into each tier. We'll call that function checks. We'll also create a function for each tier, starting with the Web tier, that will perform checks for hosts and services in that tier.

Let’s look at those initial streams now. We’re going to look at a subset of the file initially, but you can see the full file in the book’s source code.


Listing 11.48: The tornado.clj file

(ns examplecom.app.tornado
  "Monitoring streams for Tornado"
  (:require [riemann.config :refer :all]
            [clojure.tools.logging :refer :all]
            [riemann.folds :as folds]
            [riemann.streams :refer :all]))

...

(defn checks "Handles events for Tornado" [] (let [web-tier-hosts #"tornado-(proxy|web1|web2)" app-tier-hosts #"tornado-(api1|api2)" db-tier-hosts #"tornado-(db|redis)"]

(splitp re-matches host web-tier-hosts (webtier) app-tier-hosts (apptier) db-tier-hosts (datatier) #(info "Catchall - no tier defined" (:host %) (:service %)))) )

You can see we've added a new function and a new namespace called examplecom.app.tornado. We've called our library or application app because that's broadly its function: application monitoring. Finally, we've called the group of functions tornado. Literally: Example.com Application Tornado. This matches the path we created above.

TIP You can read more about Clojure namespaces in this documentation.

Next, we've added a documentation string to describe what our namespace will do. In this case, that's providing streams for monitoring the Tornado application. Lastly, we've used the require function to include some Clojure and Riemann base functions. From Clojure we included the clojure.tools.logging library to be able to log events; the info function comes from this library. We also require the riemann.config, riemann.folds, and riemann.streams libraries. The riemann.streams library contains the expired and where filtering streams. The riemann.folds library contains folds that we'll use later on.

Listing 11.49: Requiring the Riemann functions

...

(:require [riemann.config :refer :all]
          [clojure.tools.logging :refer :all]
          [riemann.folds :as folds]
          [riemann.streams :refer :all]))
...

The require function here includes the riemann.config and riemann.streams libraries, and the :refer :all operation loads all functions from these libraries. We also add the riemann.folds library to give us access to Riemann's folds streams, and we use the :as option to alias the library to folds. This gives us access to the functions we're going to need to work with our Tornado events inside the namespace.

We’ve defined a new function called checks using the defn statement. We saw the defn function in Appendix A. It combines the def statement, which defines variables, and the fn statement, which creates functions. It is used to create a function and bind it to a symbol. We then call this symbol to process events.
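
As a quick illustration of that equivalence, here's a minimal sketch using a hypothetical greet function rather than anything from our Tornado configuration:

;; defn is shorthand for def plus fn: define a var bound to a function.
(def greet (fn [name] (str "Hello " name)))

;; ...is roughly equivalent to:
(defn greet [name] (str "Hello " name))

(greet "Tornado")
;; => "Hello Tornado"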

Inside this initial function we're going to create the first streams we'll use to process Tornado's events.

TIP You can read more about writing functions in Appendix A and the Riemann HOWTO. Also useful is the explanation in the Clojure from the ground up blog post on functions.


Listing 11.50: The tornado checks function

(defn checks "Handles events for Tornado" [] (let [web-tier-hosts #"tornado-(proxy|web1|web2)" app-tier-hosts #"tornado-(api1|api2)" db-tier-hosts #"tornado-(db|redis)"]

(splitp re-matches host web-tier-hosts (webtier) app-tier-hosts (apptier) db-tier-hosts (datatier) #(info "Catchall - no tier defined" (:host %) (:service %)))) )

Let’s first look at the checks function. We’ve defined the arguments this function takes—in our case, none. We’ve added a documentation string to tell people what this function does.

Next, we've defined a let expression. Remember let bindings are lexically scoped, i.e., limited in scope to the expression itself. Outside of this expression, any symbols will be undefined. We've defined a series of symbols containing regular expressions that match the hosts that make up our Tornado application. We've broken them into three symbols: web-tier-hosts, app-tier-hosts, and db-tier-hosts. Each symbol contains a regular expression with a list of the hosts in each tier of the application.

TIP We could also query a data store or a service discovery tool like Zookeeper or PuppetDB for this application architecture.

Inside our let expression we only have one further stream: splitp. The splitp function is a Riemann stream that splits a stream into child streams using a binary predicate and an expression. In our case we’re splitting the stream on the :host field using a regular expression match.

Listing 11.51: The splitp predicate and expression

(splitp re-matches host

The match uses the Clojure re-matches function, which returns matches based on a pattern. Under the splitp function we then specify clauses with test expressions, for example:

Listing 11.52: A splitp clause

web-tier-hosts (webtier)

Here we're passing a test expression of web-tier-hosts, a symbol containing a regular expression list of the hosts in a specific tier of the Tornado application. We defined the symbol in our let expression. If the symbol matches the host field, then the event will be sent to the webtier function. Our remaining clauses send events to two other functions: apptier and datatier.
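
To see the matching behavior in isolation, here's a quick REPL sketch of re-matches against one of our tier patterns; any non-nil return selects the corresponding clause.

;; A full match returns the match plus any capture groups.
(re-matches #"tornado-(proxy|web1|web2)" "tornado-web1")
;; => ["tornado-web1" "web1"]

;; Anything else returns nil, so the clause is skipped.
(re-matches #"tornado-(proxy|web1|web2)" "tornado-api1")
;; => nil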

We also need to specify a catchall clause. If we omit a catchall and an incoming event doesn't match any clause then Riemann will emit an error. In our case we're sending all events that didn't match our clauses to the log file with a message indicating that we should investigate why a stray event has slipped into our streams. Let's look at our webtier function next. It'll be defined in the tornado.clj file above our checks function.

The webtier function

The webtier function matches any events in our web-tier-hosts regular expression, i.e., events from the tornado-proxy, tornado-web1, and tornado-web2 hosts. These are our HAProxy and Nginx servers. Inside the webtier function we've defined two checks that tell us HAProxy and Nginx are functioning in line with expectations.

• The first check alerts us if HAProxy reports that fewer than two back-end servers are available. This lets us know if HAProxy is functioning and can see our web servers.
• The second check reports if the percentage of 5xx HTTP status codes returned is higher than a threshold. This lets us know if our application is running correctly and not generating errors.

Let’s look at each check and how it is constructed in turn.


Listing 11.53: The webtier function

(defn webtier "Checks for the Tornado Web tier" [] (let [active_servers 2.0] (sdo (where (and (service "haproxy/gauge-backend.tornado-web. active_servers") (< metric active_servers)) (adjust #(assoc %:service "tornado-web active servers" :type_instance nil :state (condp = (:metric %) 0.0 "critical" 1.0 "warning" 2.0 "ok")) (changed :metric {:init active_servers} (slacker)))) (check_ratio "haproxy/derive-frontend.tornado-www. response_5xx" "haproxy/derive-frontend.tornado-www. request_total" "haproxy.frontend.tornado-www.5 xx_error_percentage" 0.5 1 (sdo (changed-state {:init "ok"} (where (state "critical") (page)) (where (state "warning") (slacker))) Version: v1.0.3 (2c4a7d0) 621 (smap rewrite-service graph)))))) Chapter 11: Monitoring Tornado: a capstone

We've first defined a let expression to hold a variable: the number of active Nginx servers in our HAProxy configuration. We've hard-coded the number here but we could template it via a configuration management system or pull it from a data store like Zookeeper. Our first function inside webtier is the sdo var. The sdo var takes a list of functions and submits a copy of an event to each of them. This allows us to send our events to multiple child streams. Each of our checks will be a child stream. This makes structuring our checks easier and ensures that events will be distributed to each. Our first check is in line with checks we've configured in earlier chapters.

Listing 11.54: Our first webtier check

(where (and (service "haproxy/gauge-backend.tornado-web. active_servers") (< metric active_servers)) (adjust #(assoc %:service "tornado-web active servers" :type_instance nil :state (condp = (:metric %) 0.0 "critical" 1.0 "warning" 2.0 "ok")) (changed-state {:init "ok"} (slacker))))

We use a where stream with two conditions. The first is a service called:

haproxy/gauge-backend.tornado-web.active_servers

Which tells us how many back ends HAProxy thinks are active. The second is a :metric value of less than our active_servers variable. The :service is generated by our haproxy plugin in collectd, and we set the variable in the let expression.


We then use the adjust var combined with an assoc to update our event's fields: we change the service name to tornado-web active servers, we remove the :type_instance field to tidy up our events, and—importantly—we change the :state field. We use a condp conditional to set the value of the :state field according to the value of the :metric field. If no active servers are present, 0.0, then the state will be critical; if only one server is active then the state will be warning; and if all servers are active then the state remains ok. We then pass this new event into the changed-state stream, with a default state of ok set. Here's an example of our new event.

Listing 11.55: The new tornado-web active servers event

{:host tornado-proxy, :service tornado-web active servers, :state ok, :description nil, :metric 2.0, :tags [collectd tornado], :time 1453671310, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type gauge, :plugin haproxy}

If the event’s state changes—for example, from ok to warning—then Riemann will send a notification to Slack.

Figure 11.4: Tornado Slack notifications for Nginx and HAProxy

Our second check measures how healthy our application is by identifying the ratio of 5xx HTTP response codes as a percentage of total requests. Let's look at it now.


Listing 11.56: Our second webtier check

(check_ratio "haproxy/derive-frontend.tornado-www.response_5xx" "haproxy/derive-frontend.tornado-www.request_total" "haproxy.frontend.tornado-www.5xx_error_percentage" 0.5 1 (sdo (changed-state {:init "ok"} (where (state "critical") (page)) (where (state "warning") (slacker))) (smap rewrite-service graph))))))

The check uses a new check abstraction we've called check_ratio. We've added this new abstraction to the checks.clj file we created in Chapter 6. We'll take a look at it now and come back to our Tornado check.


Listing 11.57: The check_ratio function

(defn check_ratio [srv1 srv2 newsrv warning critical & children]
  "Checks the ratio between two events"
  (project [(service srv1)
            (service srv2)]
    (smap folds/quotient-sloppy
      (fn [event]
        (let [percenta (* (float (:metric event)) 100)
              new-event (assoc event :metric percenta
                                     :service (str newsrv)
                                     :type_instance nil
                                     :state (condp < percenta
                                              critical "critical"
                                              warning "warning"
                                              "ok"))]
          (call-rescue new-event children))))))

We’ve defined a new function, check_ratio. The check_ratio function compares the ratio of two services. For example, in our Tornado case that’s the ratio of 5xx errors to normal responses. It emits a new event containing the ratio as a percentage as the value of the :metric field.

The function takes five arguments and optional child streams.

• The first two arguments are srv1 and srv2, which are the :service fields of two events we wish to compare.
• The newsrv argument contains the value of the :service field of the new event measuring the ratio.
• The warning and critical arguments are thresholds that will change the :state of the new event to a value of either warning or critical.


Lastly, like the other checks we defined in checks.clj, our check takes child streams—for example, we could send our new event to Graphite to be stored.

Inside the check_ratio function we use a stream called project to compare our two events. The project stream takes a vector of predicate expressions, much like the expressions we'd use in a where stream. It then maintains a vector of the one most recent event that matches each predicate. Incoming events, if they match, replace existing events, and the vector is passed to any child streams. In our case our child stream is an smap that feeds our events into a fold: quotient-sloppy. The quotient-sloppy fold divides the :metric values of each event in the vector, producing a new event. The -sloppy extension lets our quotient deal with nil values in metrics.

TIP If you know your metric values will never be nil then you can just use the folds/quotient fold.
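
Here's a rough sketch, with hypothetical metric values, of what project hands to that fold: a vector holding the most recent event that matched each predicate. quotient-sloppy divides the :metric values in order, so for this pair the resulting event's :metric would be 1.0 divided by 10.0, or 0.1.

;; The vector of matched events passed to folds/quotient-sloppy.
[{:service "haproxy/derive-frontend.tornado-www.response_5xx"  :metric 1.0}
 {:service "haproxy/derive-frontend.tornado-www.request_total" :metric 10.0}]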

We then feed our new event into a function to do some updates to its content.


Listing 11.58: Presenting our new check_ratio event

(fn [event]
  (let [percenta (* (float (:metric event)) 100)
        new-event (assoc event :metric percenta
                               :service (str newsrv)
                               :type_instance nil
                               :state (condp < percenta
                                        critical "critical"
                                        warning "warning"
                                        "ok"))]
    (call-rescue new-event children))))))

Inside our function, which takes an event, we define two symbols inside a let expression. The first symbol, percenta, takes our new event's :metric field and converts it into a percentage by multiplying it by 100. The quotient-sloppy fold will also emit our event :metric in the form srv1_value/srv2_value—for example, 1/10—so we'll need to coerce our :metric field into a float first. Our second symbol, new-event, uses assoc to take our event map and emit a new map with some updated fields. Firstly, we use the value of the newsrv argument as the new :service field. We then set the :type_instance field to nil to tidy our event a little. Lastly, we set the value of the :state field based on the value of the event's metric after we've calculated the percentage. If the percentage of events is above the critical argument threshold then the event will be set to a :state of critical, and so on.

Finally, our check_ratio passes our new-event symbol via the call-rescue function to any child streams we define.
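
As a quick worked example of that coercion, using the hypothetical 1/10 ratio mentioned above:

;; The raw quotient is a Clojure Ratio; coercing it to a float and
;; multiplying by 100 yields a percentage suitable for our thresholds.
(* (float 1/10) 100)
;; => roughly 10.0, i.e., 10%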

NOTE We've customized our check to emit percentages. You could configure it to emit the metric in a variety of formats.

Let’s go back to our Tornado checks to see what this looks like:

Listing 11.59: Our second webtier check returned

(check_ratio "haproxy/derive-frontend.tornado-www.response_5xx" "haproxy/derive-frontend.tornado-www.request_total" "haproxy.frontend.tornado-www.5xx_error_percentage" 0.5 1 (sdo (changed-state {:init "ok"} (where (state "critical") (page)) (where (state "warning") (slacker))) (smap rewrite-service graph))))))

We see our check calls the check_ratio function with two services:

haproxy/derive-frontend.tornado-www.response_5xx

And:

haproxy/derive-frontend.tornado-www.request_total

These are the metrics from the haproxy collectd plugin that record the 5xx errors in the tornado-www front end and the total requests to this front end. We then pass in the name of the new event :service we'd like for our comparison event:

haproxy.frontend.tornado-www.5xx_error_percentage


Lastly, we pass in warning and critical thresholds of 0.5% and 1% respectively. Our check_ratio function will compare both metric values and spit out a new event.

Listing 11.60: The haproxy.frontend.tornado-www.5xx_error_percentage event

{:host tornado-proxy, :service haproxy.frontend.tornado-www.5xx_error_percentage, :state ok, :description nil, :metric 0.016398819570895284, :tags [collectd tornado], :time 1454109239, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type derive, :type_instance nil, :type derive, :plugin haproxy}

Note the new event here with a :metric value of 0.016398819570895284. This indicates a healthy 5xx error rate of roughly 0.016% of our total requests and returns a state of ok.

The last part of our check decides what to do with our new event. We've again used the sdo function to send events to multiple child streams. In this case we've matched on the :state of the event, and if it is warning (or 5xx responses greater than 0.5% of responses) we'd send a Slack notification. If it's critical, or 5xx errors greater than 1% of responses, then we'll send a notification to PagerDuty.

Adding Tornado checks to Riemann

Before we go any further and look at our other tiers, let’s see how we’d use our new Tornado checks. To do this we need to add our new examplecom.app.tornado namespace to the top of our riemann.config configuration file. Let’s update it now.


Listing 11.61: Adding the tornado namespace to our Riemann configuration

(require 'riemann.client)

...

(require '[examplecom.app.tornado :as tornado])

We've used the require function to add our examplecom.app.tornado namespace to the configuration. We've included the quoted examplecom.app.tornado namespace with an alias of tornado.

TIP While there are solid Clojure reasons why the required namespace is quoted, those reasons are too detailed for the purposes of this book.

We’ve then referred to this namespace inside a where stream that matches tornado tagged events.

Listing 11.62: Sending our Tornado events to be checked

(tagged "tornado" (tornado/checks))

Now our Tornado events will flow into the tornado namespace and the checks function. And with that, our Web tier is being monitored.


Summary

In this chapter we've seen all the pieces of our monitoring framework operating together. We've started our monitoring with the Web tier. We've started monitoring a complex, multi-tier application in a simple and scalable manner. We've looked at ways to monitor the web components of the Tornado application, including HAProxy and Nginx, using collectd plugins. We've monitored the services themselves and their internals using plugins as well as by consuming their log output to produce useful metrics and diagnostic information. We've added a small set of checks in Riemann to notify us when issues occur.

In the next chapter we’ll move on to the application tier.

Chapter 12

Monitoring Tornado: Application Tier

In this chapter we’re going to monitor Tornado’s Application tier. Our Application tier is made up of two API servers running the Tornado API on top of the Java Virtual Machine (JVM). For Tornado’s application tier, what do we care about? Let’s break down our concerns:

• That the Tornado API is running on our hosts.
• That the Tornado application 99th-percentile latency is 100 milliseconds or less.
• That the percentage of JVM heap memory used is under a threshold.

NOTE We’ve also instrumented the Tornado API application in Chapter 9 and our Tornado application servers will be emitting those metrics too.

We'll start with configuring the monitoring for each component, the underlying JVM and then the Tornado API, and then we'll look at using the events and metrics generated by that monitoring to address our concerns.

Monitoring the Application tier JVM

We’re going to start with monitoring the JVM on both API servers. We saw how to monitor JVM applications using the GenericJMX plugin executed by the java collectd plugin in Chapter 8 when we looked at monitoring Logstash. The GenericJMX plugin uses Java Management Extensions or JMX. JMX is a framework that provides monitoring and management capabilities for Java resources. These resources are represented by MBeans, or Managed Beans. The MBean represents a resource running in the JVM, such as an application.

In our case we're going to monitor a basic collection of JVM metrics including memory, memory pools, threads, and garbage collection. To enable the monitoring we need to take two steps:

• Enable JMX on our Tornado API.
• Enable and configure the java plugin in collectd.

Let's start with enabling JMX by starting our application with the appropriate flags. We would update our Tornado API to launch something like so:

Listing 12.1: Updating Tornado API to run JMX

java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8855 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -jar tornado-api.jar


Here we’ve added the following flags:

Listing 12.2: The JMX enabling options

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=8855
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

They enable JMX and configure it to listen locally, bound to localhost on port 8855. We've disabled authentication and SSL. If you'd like to use usernames and passwords you can find instructions on the JMX site. We could also enable SSL, but it's probably not required when bound to the local host.

Configuring collectd for JMX

We then need to configure a collectd plugin to scrape our endpoint and retrieve the metrics. The collectd daemon comes with a Java-based plugin called GenericJMX which does exactly this. We run the GenericJMX plugin by executing it with the java helper plugin.

Let’s create a file to hold our plugin configuration.

Listing 12.3: The collectd JMX configuration file for Tornado API

$ vi /etc/collectd.d/tornado-api.conf

Now let's populate that file. The whole configuration is a little too large to paste in here, but we've included a sampling to show you how it works. The whole file is available in the book's source code.


Listing 12.4: The tornado-api.conf file

LoadPlugin java

<Plugin "java">
  JVMARG "-Djava.class.path=/usr/share/collectd/java/collectd-api.jar:/usr/share/collectd/java/generic-jmx.jar"
  LoadPlugin "org.collectd.java.GenericJMX"

  <Plugin "GenericJMX">

    ...

    <MBean "memory-heap">
      ObjectName "java.lang:type=Memory"
      InstancePrefix "memory-heap"
      <Value>
        Type "memory"
        Table true
        Attribute "HeapMemoryUsage"
      </Value>
    </MBean>

    ...

    <Connection>
      ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:8855/jmxrmi"
      Collect "memory_pool"
      Collect "memory-heap"
      Collect "memory-nonheap"
      Collect "gc-count"
      Collect "gc-time"
      Collect "thread"
      Collect "thread-daemon"
    </Connection>
  </Plugin>
</Plugin>

The first line loads the java plugin. This is the base plugin that enables all supporting Java code for collectd. We then configure the plugin operation inside a block. Inside this block we tell the java plugin where to find our GenericJMX plugin and then load that plugin. We then add a second block to configure the GenericJMX plugin itself. We then load each MBean we'd like to use to collect monitoring data.

We've specified two types of configuration for our GenericJMX plugin. As we learned in Chapter 8, MBeans are blocks that define a mapping of attributes to the types used by collectd to generate metrics. An MBean is a managed Java object. It's like a JavaBean component but for exposing a management interface through JMX. You can define MBeans for devices, applications, or resources inside applications or the JVM. Each MBean exposes readable or writeable attributes, a set of operations, and a description.

Listing 12.5: An MBean

ObjectName "java.lang:type=Memory" InstancePrefix "memory-heap" Type "memory" Table true Attribute "HeapMemoryUsage"

Our MBean block has a name, here memory-heap. This is the name of the MBean definition inside collectd. Next we specify the ObjectName, which is the name of the MBean inside JMX. In this case our MBean tracks memory state in the JVM.


We next define the InstancePrefix, which is an optional setting that prefixes our MBean values, so each value we define in this case will be prefixed with memory-heap. This helps us identify the source of individual values.

Next we define any values we're sourcing from the MBean. We enclose these in Value blocks. Every MBean needs at least one Value block. The attributes in the block set the individual fields of our metrics.

We then set the Type, which for this value is memory. This will help construct the metric type and name. The Table option specifies whether our value is a composite value or not. Composite values are collections of data, each of which can be looked up by a key. If true then it assumes a composite value. The last option, Attribute, is the name of the attribute inside the MBean from which we're going to read our value. In the case of our first value it'll read an attribute called HeapMemoryUsage.

So what metrics will we end up with from this block? In this case collectd will look up the HeapMemoryUsage attribute, which is a composite attribute made up of multiple values. It'll grab all of these values, prefix them with the InstancePrefix and the Type, and generate metrics much like:

Listing 12.6: GenericJMX Java memory heap metrics

GenericJMX-memory-heap/memory-committed
GenericJMX-memory-heap/memory-init
GenericJMX-memory-heap/memory-max
GenericJMX-memory-heap/memory-used

Our collectd instance will then send these metrics on to Riemann. We then define a Connection block that controls where the plugin should look to find our JMX server inside the JVM.


Listing 12.7: The Connections block

ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:8855/ jmxrmi" Collect "memory_pool" Collect "memory-heap" Collect "memory-nonheap" Collect "gc-count" Collect "gc-time" Collect "thread" Collect "thread-daemon"

Inside our Connection block we've defined the ServiceURL that points to our JMX instance running on localhost on port 8855. We also specify a series of Collect options that tell collectd which MBeans to process for this connection. In this case we're retrieving a whole series, including the memory-heap MBean we just saw.

TIP Also worth looking at is riemann-jmx which directly sends JMX events to Riemann.

Lastly, let's add the processes plugin and configure it to monitor for a process containing -jar tornado-api with a label of tornado-api.


Listing 12.8: Adding tornado-api process monitoring

LoadPlugin processes
<Plugin "processes">
  ProcessMatch "tornado-api" "-jar tornado-api"
</Plugin>

This links to the process monitoring threshold we set up in Chapter 5 and allows us to get notifications if the Tornado API stops running.

Collecting our Application tier JVM logs

Like our HAProxy and Nginx events, we also want to get our Tornado API logs. In our case the Tornado API uses a logging framework called Timbre, much like Java's Log4j framework. We've configured Timbre to log errors and requests to a file:

/var/log/tornado-api.log

We log events with a specific pattern, modeled roughly on a Syslog entry. As our application's current logs aren't going into Syslog, we need to collect them via RSyslog's imfile module. Let's create a configuration to collect our application's logs now. To start, we create a file in the /etc/rsyslog.d/ directory to configure our log collection. RSyslog loads configuration files in alphanumeric order by file name. We normally prefix any file with a number, 00-filename, to help specify a load order. Let's create and populate a file called 35-tornado-api.conf to hold our Tornado API configuration.


Listing 12.9: The RSyslog tornado-api configuration

module(load="imfile" PollingInterval="10")

input(type="imfile" File="/var/log/tornado-api.log" StateFile="tornado_api" Tag="tornado-api:" Severity="info" Facility="local7")

In our first step we've loaded the imfile module and told it to poll the listed files. Next, we've specified the Tag directive that will label our events and that we will use in Logstash to filter our events. Note the : at the end of the tag. This is important to correctly parse the tag. Lastly, we've specified the Syslog Severity and Facility of our events. This will control how Syslog will process the incoming events. We'll use a Severity of info for our application logs, and a Facility of local7.

TIP Remember from Chapter 11 that this assumes an RSyslog version later than 6. Earlier versions used a different configuration file format.

We next restart RSyslog on the tornado-api1 and tornado-api2 hosts.


Listing 12.10: Restarting the RSyslog service for Tornado API

$ sudo service rsyslog restart

This will start the collection of our application logs. They'll be shipped to Logstash via the RSyslog forwarding configuration we built in Chapter 8. A typical log entry looks like:

Listing 12.11: A Timbre log entry from our Tornado API

16-02-04 21:20:45 tornado-api1 INFO [ring.logger.timbre] - nil Finished :get /api for 66.108.110.85 in (956 ms) Status: 200

When it reaches Logstash it’ll then be processed by our generic Syslog parsing and become an event much like this extract:


Listing 12.12: Our initially parsed Tornado API event

{

...

"syslog_timestamp" => "Feb 4 21:33:24", "syslog_hostname" => "tornado-api1", "syslog_program" => "tornado-api", "syslog_message" => "16-02-04 21:33:23 tornado-api1 INFO [ring. logger.timbre] - nil Finished :get /api for 66.108.110.85 in (102 ms) Status: 200", "syslog_severity_code" => 5, "syslog_facility_code" => 1, "syslog_facility" => "user-level", "syslog_severity" => "notice" }

We can then take the lessons of our HAProxy and Nginx events and parse our event further using a grok filter.


Listing 12.13: Updated Tornado API Logstash configuration

}
filter {
  if [type] == "syslog" {

    ...

    if [syslog_program] == "tornado-nginx-access" {
      grok {
        patterns_dir => "/etc/logstash/patterns"
        match => { "syslog_message" => "%{NGINXACCESS}" }
        remove_field => ["syslog_message"]
        add_field => { "tags" => "tornado" }
      }
    }
    if [syslog_program] == "tornado-api" {
      grok {
        patterns_dir => "/etc/logstash/patterns"
        match => { "syslog_message" => "%{TORNADOAPI}" }
        remove_field => ["syslog_message"]
        add_field => { "tags" => "tornado" }
      }
    }
  }
}

We've added a new conditional block inside our Logstash configuration, selecting all events with a syslog_program of tornado-api. We've then used a grok filter pattern, TORNADOAPI, to parse out our log messages further. This pattern will be loaded from our /etc/logstash/patterns directory on our logstasha host.

Now let’s create and populate a file called tornadoapi in that directory with our Tornado API log pattern.

Listing 12.14: The tornado-api pattern

TORNADOAPI %{TIMESTAMP_ISO8601:app_timestamp} %{URIHOST:app_host} %{DATA:app_severity} %{SYSLOG5424SD} - nil %{DATA:app_request_state} \:%{DATA:app_verb} %{DATA:app_path} for %{URIHOST:app_source} (?:in \(%{INT:app_request_time:int} ms\) Status: %{INT:app_status_code:int}|%{GREEDYDATA:app_request})

Using this new pattern, we've extracted some new information from our output and put it into fields including the request path, HTTP verb, status code, and request time. We've also added a field, tags, and set it to a value of tornado. The tags field will allow us to identify our tornado events inside Logstash and Riemann.

Now if we look at our event we’ll see these new fields have been added.


Listing 12.15: Our updated Tornado API event

{

...

"app_timestamp" => "16-02-04 21:33:23", "app_host" => "tornado-api1", "app_severity" => "INFO", "app_request_state" => "Finished", "app_verb" => "get", "app_path" => "/api", "app_source" => "66.108.110.85", "app_request_time" => 102, "app_status_code" => 200 "tags" => "tornado" }

One of these fields is particularly interesting: app_request_time. This seems just like something we’d want to send to Riemann as an event and for which we’d want to create a metric. Let’s do that now using our riemann output.


Listing 12.16: Sending Tornado API events to Riemann

output {

  ...

  if [syslog_program] == "tornado-api" and [app_request_time] {
    riemann {
      host => "riemanna"
      sender => "%{syslog_hostname}"
      map_fields => true
      riemann_event => {
        "service" => "tornado.api.request"
        "metric" => "%{app_request_time}"
        "state" => "ok"
      }
    }
  }
}

Here we’ve selected all events with a syslog_program field of tornado-api and with the app_request_time field (the conditional if without a condition means test for the presence of a field). We’ve added the app_request_time field condition because not all our log events have this field—only finished requests do—so we only want to send those events. If we now restart Logstash we should be sending these events to Riemann. Let’s look at one of those events inside Riemann.


Listing 12.17: The Tornado API request in Riemann

{:host tornado-api1, :service tornado.api.request, :state ok, :description nil, :metric 52.0, :tags tornado, :time 1454684082, :ttl 60, :syslog_severity_code 5, :app_request_time 52, :app_host tornado-api1, :app_status_code 200, :syslog_facility user-level, :type syslog, :syslog_timestamp Feb 5 14:54:42, :app_request_state Finished, :port 48367, :syslog_severity notice, :app_path /api, :app_severity INFO, :app_verb get, :syslog_facility_code 1, :syslog_program tornado-api, :syslog_hostname tornado-api1, :app_timestamp 16-02-05 14:54:42, :app_source 198.179.69.250}

The event has a :service field of tornado.api.request and a :metric field value of 52 which is the response time of the Tornado API application.

Monitoring the Tornado API application

We’re going to monitor our API application itself using two methods:

• We'll query the API to ensure it is returning data.
• We'll instrument our API application to emit events when it operates.

For the first method, we're going to use the collectd curl_json plugin to check the API is serving data. Our Tornado API runs on the tornado-api1 and tornado-api2 hosts. It's a RESTful API that allows us to buy and sell items using POST and DELETE requests. It runs on port 8080 at the /api URL, for example:

http://tornado-api1:8080/api/

NOTE You can find the source code for the Tornado API application in the Tornado API repository.

We’ve seeded our API with a dummy item that we’re going to query to ensure the API is correctly serving data. To do this we’re going to request this dummy item by its item ID. The dummy item has an ID of: fffff3d1-835a-4f52-bf47-edc85345f4c5

We query our dummy item using the curl command, for example:

Listing 12.18: Querying our dummy item

$ curl http://tornado-api:8080/api/fffff3d1-835a-4f52-bf47-edc85345f4c5
{"id":"fffff3d1-835a-4f52-bf47-edc85345f4c5","title":"Dummy item","text":"This is a dummy item.","price":666,"type":"stock"}

This performs a GET request to return a JSON hash containing the item's id, title, text, price, and type fields. We're going to rely on this event's fields, in particular the price field, returning the correct results to help us determine that the API is working correctly.

We can create an event from this data using the curl_json plugin. We'll configure the plugin now. Let's add our configuration to the existing /etc/collectd.d/tornado-api.conf configuration file.


Listing 12.19: Adding to the tornado-api.conf configuration file

LoadPlugin curl_json

<Plugin curl_json>
  <URL "http://tornado-api1:8080/api/fffff3d1-835a-4f52-bf47-edc85345f4c5">
    Instance "tornado-api"
    <Key "price">
      Type "gauge"
    </Key>
  </URL>
</Plugin>

We first load the curl_json plugin with the LoadPlugin directive. We then configure it in the Plugin block. We add a URL block for the Tornado API endpoint we wish to scrape. In this case we're querying tornado-api. You'd use your configuration management tool to template and update this for the relevant server name:

http://tornado-api1:8080/api/fffff3d1-835a-4f52-bf47-edc85345f4c5

By default, the plugin will check the endpoint using the value of the global Interval setting, in our case every two seconds. Inside our URL block we've specified the Instance directive. The directive sets the value of the :plugin_instance field in our event. We'd set it to the name of our application, here tornado-api. Next, we specify a Key block. The Key block controls what piece of JSON we're going to scrape from the endpoint. It matches the price key of our JSON hash. Inside a Key block we also specify the Type directive, here a gauge, to indicate our price value will be recorded as a gauge.

We’ll need to restart collectd for our new check to be enabled.


Listing 12.20: Restarting collectd for curl_json

$ sudo service collectd restart

Now let’s look at what an event from the Tornado API might look like:

Listing 12.21: A curl_json event from the Tornado API

{:host tornado-api1, :service curl_json-tornado-api/gauge-price, :state ok, :description nil, :metric 666, :tags [collectd tornado], :time 1457163125, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance price, :type gauge, :plugin_instance tornado-api, :plugin curl_json}

We see that our event’s :service field has been constructed from our plugin name, the Instance directive, the Type directive, and the name of the JSON hash key we specified in the Key directive: curl_json-tornado-api/gauge-price

Our :metric field has been populated with the value of the price key, 666. We'll make use of this event in the next section to ensure our API is functional.

TIP We could also use collectd’s threshold plugin here to set a threshold for this check that would generate a notification if the price field returned a different result or if the event did not return at all.

Lastly, in our setup of monitoring for the Tornado API, we've also instrumented the application to generate StatsD metrics for specific events. Our API specifically creates, updates, or deletes items in our environment based on sales. We've added a Clojure StatsD client to our application. We've then instrumented the relevant API endpoints of our application.

Listing 12.22: Our instrumented Tornado API setup

(ns tornado-api.handler
...
  (:require [compojure.handler :as handler]
            [clj-statsd :as statsd]
...

(def statsd-prefix "tornado.api.")

...

We see our Tornado API requires the clj-statsd client. We’ve then defined a var called statsd-prefix with a value of tornado.api. We’ll use this var to prefix all of our StatsD metrics.
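
As a rough sketch of how that client gets wired up, assuming a local StatsD endpoint on port 8125 (the actual initialization lives in the Tornado API source), the calls look something like this:

;; Configure clj-statsd with the assumed host and port, then emit
;; metrics using the statsd-prefix var defined above.
(require '[clj-statsd :as statsd])

(statsd/setup "localhost" 8125)

(statsd/increment (str statsd-prefix "update.item"))           ; a counter
(statsd/gauge (str statsd-prefix "item.bought.total") 123.45)  ; a gauge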

We've instrumented three of our functions: buy-item, update-item, and sell-item. Let's look at each in turn.


Listing 12.23: Our instrumented Tornado API buy-item method

(defn buy-item [item]
  (let [id (uuid)]
    (sql/db-do-commands db-config
      (let [item (assoc item "id" id)]
        (sql/insert! db-config :items item)
        (statsd/gauge (str statsd-prefix "item.bought.total") (item "price"))))
    (wcar* (car/ping)
           (car/set id (item "title")))
    (get-item id)))

Each time these functions are called, a StatsD gauge or counter will be incremented. If, for example, we were to hit the /api API endpoint with a POST like so:

Listing 12.24: Hitting the buy-item API endpoint

$ curl -X POST -H "Content-Type: application/json" -d '{"title":"This is an item","text":"A Tornado application item","price":"123.45"}' http://tornado-api1:8080/api

Then we’d generate a new metric, like so, and send it to Riemann: tornado.api.item.bought.total 123.45

We’ve done the same with the update-item and sell-item methods.


Listing 12.25: Our instrumented Tornado API update and sell item methods

(defn update-item [id item]
  (sql/db-do-commands db-config
    (let [item (assoc item "id" id)]
      (sql/update! db-config :items item ["id=?" id])
      (statsd/increment (str statsd-prefix "update.item"))))
  (get-item id))

(defn sell-item [id]
  (sql/db-do-commands db-config
    (let [item (get-item id)
          price (get-in item [:body :price])
          item_state (get-in item [:body :type])]
      (when-not (= item_state "sold")
        (sql/update! db-config :items {:type "sold"} ["id=?" id])
        (statsd/gauge (str statsd-prefix "item.sold.total") price))))
  (get-item id))

To collect the metrics from this instrumentation we need to revisit setting up the statsd plugin for collectd that we explored in Chapter 9.

On each of our API servers we create a new collectd configuration file, let’s call it statsd.conf, in the /etc/collectd.d/ directory and populate it.


Listing 12.26: The statsd.conf configuration file

LoadPlugin statsd

<Plugin statsd>
  Host "localhost"
  Port "8125"
  TimerPercentile 90
  TimerPercentile 99
  TimerLower true
  TimerUpper true
  TimerSum true
  TimerCount true
</Plugin>

We first load the statsd plugin. We then configure the plugin and set the Host and Port to which we’re binding the StatsD server. In our case we’re binding it to the localhost interface on port 8125.

Let’s restart collectd to enable the plugin.

Listing 12.27: Restarting collectd to enable statsd

$ sudo service collectd restart

Now let’s look at our Riemann server to see our StatsD metrics as events:


Listing 12.28: A Tornado API StatsD metric in Riemann

{:host tornado-api2, :service statsd/gauge-tornado.api.item.bought.total, :state ok, :description nil, :metric 5, :tags [collectd tornado], :time 1454782475, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type derive, :type_instance tornado.api.item.bought.total, :type derive, :plugin statsd}

We've generated an event showing the metric for items bought via our API endpoint. We'll probably want to rewrite these events before they hit Graphite. To do this we again update our collectd rewrite rules inside the /etc/riemann/examplecom/etc/collectd.clj file.

Listing 12.29: Rewriting the Tornado API events

(def default-services [{:service #"^load/load/(.*)$" :rewrite "load $1"}

...

{:service #"^statsd\/(gauge|derive)-(.*)$" :rewrite "statsd $1 $2"}])

Here we've added a new rewrite line that will update our Tornado API event's :service field before it is written to Graphite, for example from:

statsd/gauge-tornado.api.item.bought.total

To a slightly simpler:

statsd.gauge.tornado.api.item.bought.total
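
To see the pattern's effect in isolation, here's a quick REPL sketch applying the regular expression with clojure.string/replace; it's assumed here that the rewrite-service machinery from earlier chapters applies the pattern in much the same way, with the spaces becoming dots on the way into Graphite, as shown above.

;; Apply the rewrite pattern and replacement string directly.
(require 'clojure.string)

(clojure.string/replace "statsd/gauge-tornado.api.item.bought.total"
                        #"^statsd\/(gauge|derive)-(.*)$"
                        "statsd $1 $2")
;; => "statsd gauge tornado.api.item.bought.total"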

We can now more easily use these metrics in graphs or checks.

Addressing the Tornado Application tier monitoring concerns

Now that we have the Application tier monitored and sending data, what can we do with that data? We identified a series of concerns for Tornado’s app tier earlier. Those concerns were:

• That the Tornado API is running on our hosts.
• That the Tornado application 99th-percentile latency is 100 milliseconds or less.
• That the percentage of JVM heap memory used is under a threshold.

Let’s look at how we might address these concerns.

Thanks to the processes plugin, we get process monitoring and notifications resulting from the thresholds we configured in Chapters 5 and 6. We don't need to make any specific changes here, but we could tweak the threshold for processes depending on the number of API back ends running, if required. We can also address our other concerns with checks we've previously created. Let's add a new function to our tornado.clj file to hold our Application tier checks. We'll call this function apptier and call it via our splitp var.


Listing 12.30: The tornado checks function

...

(splitp re-matches host
  web-tier-hosts (webtier)
  app-tier-hosts (apptier)
  db-tier-hosts (datatier)
  #(info "Catchall - no tier defined" (:host %) (:service %)))))

The apptier function will hold the checks for our Application tier.


Listing 12.31: The apptier function

(defn apptier "Checks for the Tornado App Tier" [] (sdo (where (service "curl_json-tornado-api/gauge-price") (where (!= metric 666) (slacker)) (expired (page))) (where (service #"^tornado.api.") (smap rewrite-service graph)) (check_ratio "GenericJMX-memory-heap/memory-used" "GenericJMX-memory-heap/memory-max" "jmx.memory-heap.percentage_used" 80 90 (alert_graph)) (where (service "tornado.api.request") (with { :service "tornado.api.request.rate" :metric 1 } (rate 1 (smap rewrite-service graph)))) (check_percentiles "tornado.api.request" 10 (smap rewrite-service graph) (where (and (service "tornado.api.request 0.99") (>= metric 100.0)) (changed-state { :init "ok"} (slacker))))))


Our first check uses the event generated from our dummy item by the curl_json plugin to confirm that our API service is functioning correctly. It matches on a :service of:

curl_json-tornado-api/gauge-price

And then performs two sub-checks. The first sub-check sends a Slack notification if the :metric value of the event is not 666, the price of the dummy item. This will let us know if something isn't right with our API. The second sub-check pages us if this event expires from the index. If the event has expired from the index it likely means that the API can't be queried and we should investigate this issue.

Our next check isn't really a check. Instead we're taking any events whose :service starts with tornado.api. and sending them to Graphite. This will collect any API metrics, like tornado.api.buy.item, generated by our Tornado API and ensure they are available as graphs.

The following check uses the check_ratio abstraction we created in Chapter 11. We're taking two events, the JVM heap memory used and heap memory max, and comparing them. We generate a new event:

jmx.memory-heap.percentage_used

And check that event’s :metric field against a warning and critical threshold. If the percentage of JVM heap memory used exceeds 80% then a warning will be triggered to Slack, and if it hits 90% a page will be generated. We’ve wrapped the notifications in a changed-state var to ensure we only notify when the state actually changes.

These are the same child streams we used in our previous check_ratio function in the webtier function. Rather than repeat the same child streams we’ve DRY’ed this up and created a function for this default set of child streams: alert_graph.

We’ve created the alert_graph function at the top of our tornado.clj file, above our webtier function.


Listing 12.32: DRY’ed child streams

(defn alert_graph []
  "Alert and graph on events"
  (sdo
    (changed-state {:init "ok"}
      (where (state "critical")
        (page))
      (where (state "warning")
        (slacker)))
    (smap rewrite-service graph)))

In the future, if we need to use the same set of child streams, we can make use of this alert_graph function. We can also go back and refactor our previous use of this pattern. Our next check creates a new metric showing the rate of API requests every second.

Listing 12.33: Creating a rate

(where (service "tornado.api.request") (with { :service "tornado.api.request.rate" :metric 1 } (rate 1 (smap rewrite-service graph))))

We start by selecting any tornado.api.request events and, using the with stream, we create a new event with a :metric set to 1 and a service name of tornado.api.request.rate. We set the metric to 1 because we don't want to count the request time; instead we want to count each instance of the event. We then pass this event to the rate stream. The rate stream takes a timeframe, here 1 second, that we'll calculate the rate over. We'll then write that metric to Graphite via the graph var. That metric will look something like:

Listing 12.34: The tornado.api.request rate

{:host tornado-api1, :service tornado.api.request.rate, :state ok, :description nil, :metric 7, :tags [tornado], :time 1455322655207/1000, :ttl 60 ... }

We then graph the metric from each API server like so:

Figure 12.1: Graphing our Tornado API rate

This graph shows us the rate of API requests per second for both API servers.

Our last check uses the check_percentiles check abstraction we created in Chapter 6. It takes the tornado.api.request event, generated from the Tornado API's logs, parsed in Logstash and then sent to Riemann, and creates 0.5, 0.95, 0.99, and 1 (or max) percentiles from it over a 10-second window. We've then added two child streams to the check. The first writes our new percentile events to Graphite so we can make graphs from them. The second checks if the 99th-percentile request time exceeds 100 milliseconds and, if so, sends a Slack notification. We've wrapped our Slack notification in the changed-state var to ensure we only get notified when appropriate.

Our initial checks for the Application tier should ensure we know when the application services have an issue as well as provide a picture of their current performance.

Summary

In this chapter we’ve used similar techniques to Chapter 11 to monitor the services that make up our Application tier. We’ve exposed both business and application performance metrics. We’ve then built checks and thresholds for letting us know when our Tornado Application tier is available and when it is not performing in accordance with our expectations.

In the next chapter we’ll conclude our monitoring of the Tornado application by looking at the Data tier. We’ll also expose some of our metrics as visualizations for business owners, developers, and operations teams.

Chapter 13

Monitoring Tornado: Data tier

Our last tier, the Data tier, is made up of two servers, tornado-db running MySQL and tornado-redis running Redis. For Tornado’s Data tier, what do we care about? Let’s break down our concerns:

• That MySQL and Redis are running on their respective hosts.
• 99th-percentile latency adding items to the Tornado DB is three milliseconds or less.
• That maximum MySQL connections don't exceed 80% of the available connections.
• That we can measure the rate of aborted connections to MySQL.
• That we are generating application metrics from data inside our MySQL database.
• That we can measure key queries to ensure they are performing within expectations.

We'll start with configuring the monitoring for each component and the underlying database or data store, and then we'll look at using the events and metrics generated by that monitoring to address those concerns.


Monitoring the Data tier MySQL server

Let’s start by monitoring our MySQL server. In our case we have a database called items running on our tornado-db host that we’d like to monitor. For MySQL monitoring the recommended approach is to look at the health and metrics of the MySQL server itself as well as the performance of application-specific queries. We’re going to start with monitoring the MySQL server itself.

In Chapter 11 we saw a community-contributed plugin for monitoring HAProxy. We're going to use a similar plugin to monitor MySQL. In this case our new plugin is another Python-based plugin that we'll execute using the collectd python plugin. It connects to our MySQL server and runs some useful commands, like SHOW GLOBAL STATUS, SHOW GLOBAL VARIABLES, and SHOW PROCESSLIST. It then converts the results of these commands into events and feeds them into collectd.

TIP We’re focused on MySQL here but there are plugins available for a wide variety of databases and data stores.

Let’s start by installing a prerequisite package, the Python MySQL bindings. On Ubuntu this is the python-mysqldb package.

Listing 13.1: Installing the python-mysqldb package

$ sudo apt-get -qqy install python-mysqldb

On Red Hat derivatives, this is the mysql-python package.


Listing 13.2: Installing the mysql-python package

$ sudo yum install mysql-python

Next we need to grab our plugin. We'll create a directory to hold it, /usr/lib/collectd/mysql, and then download the plugin into it.

Listing 13.3: Download the mysql.py plugin

$ mkdir /usr/lib/collectd/mysql
$ cd /usr/lib/collectd/mysql
$ wget https://raw.githubusercontent.com/jamtur01/collectd-python-mysql/master/mysql.py

Here we’ve created the directory and changed into the /usr/lib/collectd/mysql directory (on Red Hat this would be the /usr/lib64/collectd/mysql directory). We’ve then downloaded the plugin from GitHub. You can take a look at its source. It defines a list of metrics and some configurable options.

TIP You can learn about writing your own Python plugins on the collectd Python man page.

Next we need to create a MySQL user through which the plugin can monitor our database. We do this using MySQL’s standard user and privilege system. Let’s create a user called collectd.


Listing 13.4: Creating a MySQL collectd user

$ mysql -u root -p
Enter password: **********
mysql> CREATE USER 'collectd'@'localhost' IDENTIFIED BY 'strongpassword';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT PROCESS,USAGE,SELECT ON *.* TO 'collectd'@'localhost';
Query OK, 0 rows affected (0.00 sec)

You can see we’ve used the mysql command to connect to our MySQL server. We’ve then created a new user, collectd, with a password of strongpassword . You should replace this with an appropriately strong password of your own. We’ve then granted the USAGE, PROCESS, and SELECT privileges to this new user. The USAGE level of privilege is the lowest level of MySQL access and just allows our user to interact with the target database. The PROCESS privilege allows us to view the state of the MySQL server processes. The SELECT permission will allow us to query specific databases for metrics, as we’ll see later in this chapter. We’ve granted the access across all MySQL databases on the host, *.*. In your environment you could easily limit this to a subset of databases, as required.

Next we create a configuration file for the mysql plugin. Let's create it at /etc/collectd.d/mysql.conf. Now let's populate it.


Listing 13.5: The mysql.conf collectd configuration file

<LoadPlugin python>
  Globals true
</LoadPlugin>

<Plugin python>
  ModulePath "/usr/lib/collectd/mysql/"
  Import mysql

  <Module mysql>
    Host "localhost"
    Port 3306
    User "collectd"
    Password "strongpassword"
  </Module>
</Plugin>

LoadPlugin processes
<Plugin "processes">
  Process "mysqld"
</Plugin>

We first load the python plugin and then turn on Globals, which provides our plugin access to local Python libraries. We then use a block to tell collectd where to find the mysql plugin and how to load it. We specify the path to the plugin, and we use the Import command to import the specific plugin file, mysql. This is the name of the file that contains the Python code, dropping the .py extension. Inside this block we then specify a block to configure the MySQL plugin itself. Inside this block we use the Host option to specify the database host, and Port to specify the port. We then specify the User and Password options for the username and password we just created and will use to connect to the MySQL server.

The plugin will connect to our database and run commands to return details of the database’s state. This shows performance, command, handler, query, thread, and traffic statistics for the specific server.

TIP If you’re interested in data from specific databases there’s also the DBI plugin that we’ll see shortly.

At the end of our configuration we’ve also loaded and configured the processes plugin to monitor the mysqld process to ensure we know when it’s available. If we now restart collectd, the plugin will connect to MySQL, collect, and return our events to Riemann. Let’s look at an example event.

Listing 13.6: A typical MySQL event in Riemann

{:host tornado-db, :service mysql-innodb/counter-log_writes, :state ok, :description nil, :metric 8, :tags [collectd tornado], :time 1455399827, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type counter, :type_instance log_writes, :type counter, :plugin_instance innodb, :plugin mysql}

We see the event has a :service prefix of mysql-, the type of data it is (here InnoDB statistics), as well as the data class, a counter, and finally the name of the specific metric being recorded. Given this complexity, we'll probably want to rewrite these events before they hit Graphite. To do this we again update our collectd rewrite rules inside the /etc/riemann/examplecom/etc/collectd.clj file.

Listing 13.7: Rewriting the MySQL events

(def default-services [{:service #"^load/load/(.*)$" :rewrite "load $1"}

...

{:service #"^mysql-(.*)\/(counter|gauge)-(.*)$" :rewrite "mysql $1 $3"}])

Here we’ve added a new rewrite line that will update our MySQL event’s :service field before it is written to Graphite, for example from: mysql-innodb/counter-log_writes

To a simpler: mysql.innodb.log_writes

We can now more easily use these metrics in graphs or checks.
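As a quick illustration of a check, here's a minimal sketch that reuses the streams we built in earlier chapters: it graphs every MySQL event and sends a Slack notification if a connection gauge climbs too high. The mysql-status/gauge-Threads_connected service name and the threshold of 100 are assumptions for the example only; check which metrics your version of the plugin actually emits and tune the threshold accordingly.

; A sketch only: the service name and threshold below are illustrative.
(where (service #"^mysql-")
  (smap rewrite-service graph)
  (where (and (service "mysql-status/gauge-Threads_connected")
              (>= metric 100))
    (changed-state {:init "ok"}
      (slacker))))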

Using MySQL data for metrics

In addition to the state of our MySQL server we're going to use data inside our databases and tables as metrics. This allows us to expose application state and performance at multiple levels: inside our Web tier, inside the Application tier and our application itself, and inside the Data tier. To make use of our Tornado application data we're going to use another collectd plugin: dbi. The dbi plugin connects to a database, runs queries to return data, and converts that data into a collectd metric.

TIP We’re using the dbi plugin with MySQL but it also supports a wide variety of other databases.

Let’s start with installing a prerequisite package that the dbi plugin needs, the libdbi bindings. On Ubuntu this is the libdbd-mysql package.

Listing 13.8: Installing the libdbd-mysql package

$ sudo apt-get -qqy install libdbd-mysql

On Red Hat derivatives, this is the libdbi package.

Listing 13.9: Installing the libdbi package

$ sudo yum install libdbi

Now let’s look at querying some data. Inside our MySQL database we’ve got the items database that is being populated by our Tornado API servers. We’re going to query that data to return some key metrics. To do this we create a configuration for the dbi plugin in our existing configuration file: /etc/collectd.d/mysql.conf .


Listing 13.10: The dbi plugin configuration

LoadPlugin dbi
<Plugin dbi>
  <Query "get_item_count">
    Statement "SELECT COUNT(*) AS value FROM items;"
    MinVersion 50000
    <Result>
      Type "gauge"
      InstancePrefix "tornado_item_count"
      ValuesFrom "value"
    </Result>
  </Query>
  ...
  <Database "items">
    Driver "mysql"
    DriverOption "host" "localhost"
    DriverOption "username" "collectd"
    DriverOption "password" "collectd"
    DriverOption "dbname" "items"
    SelectDB "items"
    Query "get_item_count"
    ...
  </Database>
</Plugin>

We’ve added a LoadPlugin statement for our dbi plugin and then defined its con- figuration in a block. The dbi plugin is configured with two distinct blocks: queries and databases. Queries define the SQL statement and data we’d

Version: v1.0.3 (2c4a7d0) 671 Chapter 13: Monitoring Tornado: Data tier like to pull from our database. Databases configure connection details for each specific database and which queries apply to them.

We start with defining a query. Queries need to be defined before the database that will use them, as the file is parsed top down. Each query consists of a <Statement>, the SQL statement we want to run. We've also specified the MinVersion directive, which specifies the minimum version of the server on which we're running the statement. Here, if the database server we connect to is older than MySQL 5.0 then the query will not be run. Our statement will do a count of items in the items table, returning the count aliased to a name of value.

We then use the result returned from the statement in the <Result> block to construct our metric. We've used three directives: Type, InstancePrefix, and ValuesFrom. The Type directive controls what type of metric we're constructing; in our case that's a gauge. The InstancePrefix provides a prefix to our metric name; here we're using tornado_item_count. Lastly, the ValuesFrom directive controls what value will be used as the metric value. In our statement we've aliased the count from the items table with an AS clause to value. The value variable will contain the total count of items, and we're going to assign it to the ValuesFrom directive.

We’re also storing the price of each item we’re creating in the database. Let’s add another query to get that information and return a metric to track it.


Listing 13.11: Our price query

# The query name here is illustrative; it must match a Query directive in your <Database> block.
<Query "items_sold_price">
  Statement "SELECT SUM(price) AS total_price FROM items WHERE type = 'sold'"
  MinVersion 50000
  <Result>
    Type "gauge"
    InstancePrefix "items_sold_total_price"
    ValuesFrom "total_price"
  </Result>
</Query>

Here we’re running a query that sums the total value of the price field and returns it as a new field called total_price. We’ve then created a new metric result with an InstancePrefix of items_sold_total_price; it will have a :metric value of the contents of the total_price field. We’ve created a query to record the items bought as well, creating a metric called items_bought_total_cost that we haven’t shown here. See the complete file in the book’s source code for details. We could create a variety of other metrics from this data: total price of items added per day, total price of items deleted per day, and others.

TIP We could also add these metrics in the application code for the API but, as we have multiple API servers and the database represents a central source of truth, we’re querying them here.


Our <Database> block is fairly self-explanatory. We define the host and port of our database server (ours is our local host) and the user and password we just created. Additionally, we specify the target database to which we want to connect. Lastly, we specify Query directives that tell the dbi plugin what queries to run on this database.

Listing 13.12: The Query directives

Query "get_item_count" ...

TIP You can find the dbi plugin’s other configuration options in the collectd documentation.

If we then restart the collectd service on our tornado-db host we’ll start to see each of these queries executed and metrics being generated and sent to Riemann. Let’s examine these events inside Riemann.

Listing 13.13: The items database events in Riemann

{:host tornado-db, :service dbi-items/gauge-tornado_item_count, :state ok, :description nil, :metric 484621.0, :tags [collectd tornado], :time 1455747174, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance tornado_item_count, :type gauge, :plugin_instance items, :plugin dbi}

Here we see an event from the dbi plugin with a :service of:

dbi-items/gauge-tornado_item_count

And :metric of 484621.0, representing the total count of items in the items database. As we did with our MySQL metrics, we want to rewrite these events before they hit Graphite. To do this we again update our collectd rewrite rules inside the /etc/riemann/examplecom/etc/collectd.clj file.

Listing 13.14: Rewriting the dbi events

(def default-services [{:service #"^load/load/(.*)$" :rewrite "load $1"}

...

{:service #"^dbi-(.*)\/(gauge|counter)-(.*)$" :rewrite "dbi $1 $3"}])

Here we’ve added a new rewrite line that will update our dbi event’s :service field before it is written to Graphite, for example from: dbi-items/gauge-tornado_item_count

To a simpler: dbi.items.tornado_item_count

We can now more easily use these metrics in graphs or checks.

Query timing

Lastly, when monitoring a database, understanding the execution time of your queries is critical. This can be done in a number of different ways in the MySQL world:


• Via parsing the slow query log: useful but hard to parse and, at high volume, it adds considerable overhead to your host.
• Via network monitoring: complex to set up and requires building custom tools or using a third-party platform.
• Via the performance_schema database.

We’re going to use the last method because it allows us to leverage our exist- ing dbi plugin implementation to retrieve statistics from the performance_schema database. The performance_schema database is a mechanism for monitoring MySQL server execution at a low level. It was introduced in MySQL 5.5.3. The performance_schema database includes a set of tables that give information on how statements performed during execution. The tables we’re interested in contain current and historical data on statement execution. We’re going to extract information about the queries our Tornado API is running from this table: the events_statements_history_long table. The events_statements_history_long table was introduced in MySQL 5.6.3 and contains the last 10,000 statement events. This is the default that can be configured using the: performance_schema_events_statements_history_long_size configuration option. But before we get started we need to determine if the performance_schema database is enabled by checking the status of the performance_schema variable. We sign in to the MySQL console using the mysql command.


Listing 13.15: Checking the PERFORMANCE_SCHEMA

$ mysql -p

...

mysql> SHOW VARIABLES LIKE 'performance_schema';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| performance_schema | ON    |
+--------------------+-------+
1 row in set (0.00 sec)

If the performance_schema variable is set to ON then it is enabled. If it’s OFF we can enable it in our MySQL configuration file. We’ll also enable some performance_schema consumers to record our statements.

Listing 13.16: Enabling the PERFORMANCE_SCHEMA

[mysqld]
performance_schema=ON
performance-schema-consumer-events_statements_history=ON
performance-schema-consumer-events_statements_history_long=ON

We’ll need to restart MySQL to enable it. If you find the variable doesn’t exist, you may have an older version of MySQL that does not support the performance_schema database.


Now we’re going to add another query to our dbi plugin configuration to return the execution time of one of our key Tornado API queries, in this case the INSERT statement that adds an item to our database. Let’s do that now.


Listing 13.17: New query

...

<Query "insert_query_time">
  Statement "SELECT MAX(thread_id), timer_wait/1000000000 AS exec_time_ms FROM events_statements_history_long WHERE digest_text = 'INSERT INTO `items` ( `title` , TEXT , `price` , `id` ) VALUES (...)';"
  MinVersion 50000
  <Result>
    Type "gauge"
    InstancePrefix "insert_query_time"
    ValuesFrom "exec_time_ms"
  </Result>
</Query>

...

<Database "performance_schema">
  Driver "mysql"
  DriverOption "host" "localhost"
  DriverOption "username" "collectd"
  DriverOption "password" "collectd"
  DriverOption "dbname" "performance_schema"
  Query "insert_query_time"
</Database>


Our new <Query> block defines a new query we've called insert_query_time. It contains a statement that queries the events_statements_history_long table. We pull all statements from that table that match our Tornado API INSERT statement.

Listing 13.18: The Tornado API INSERT statement

INSERT INTO `items` ( `title` , TEXT , `price` , `id` ) VALUES (...)

We return the last thread that executed our query using the thread ID; this is the unique identifier of the instrumented thread. We then return the timer_wait field, which is the time the query took to execute in picoseconds. We've applied timer_wait/1000000000 to change those picoseconds into milliseconds, as the field exec_time_ms. We then use this exec_time_ms field in the <Result> block as the value of our :metric field and to construct an event for Riemann. We specify another <Database> block to execute our new query. If we then restart collectd we'll be able to run this query and send a new metric to Riemann. Let's look at an example of this event now.

Listing 13.19: The INSERT event in Riemann

{:host tornado-db, :service dbi-performance_schema/gauge-insert_query_time, :state ok, :description nil, :metric 0.370724, :tags [collectd tornado], :time 1455995585, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance insert_query_time, :type gauge, :plugin_instance performance_schema, :plugin dbi}


Here we see our new event with a :service of: dbi-performance_schema/gauge-insert_query_time

And a :metric value of 0.370724 milliseconds.

NOTE This whole section assumes you’re running a flavor of MySQL 5.6 or later.

Monitoring the Data tier’s Redis server

The last service we wish to monitor is Redis. We use another collectd plugin to do this. The collectd redis plugin connects to your Redis instances using the credis library and returns usage statistics.

To configure the redis plugin we need to add a configuration file with the con- nection details of our Redis instance. Let’s create /etc/collectd.d/redis.conf on our tornado-redis host now.


Listing 13.20: The redis.conf collectd configuration

LoadPlugin redis
<Plugin redis>
  <Node "tornado-redis">
    Host "localhost"
    Port "6379"
    Timeout 2
    Password "strongpassword"
  </Node>
</Plugin>

LoadPlugin processes
<Plugin processes>
  Process "redis-server"
</Plugin>

We first load the redis plugin. We then configure a <Plugin redis> block, and inside that a <Node> block for each Redis instance we wish to monitor. Each <Node> block needs to be uniquely named and contains the connection settings for that Redis instance. We specify the host, port, a timeout, and, optionally, a password for the connection. We've also included configuration for the processes plugin to monitor our redis-server process.

When we restart collectd we'll start collecting Redis metrics and reporting them to Riemann. We'll see metrics covering uptime, connections, memory, and operations. Each event's :service will be named for the name provided in the <Node> block and the name of the plugin. For example, for Redis memory usage you'll see an event like so:


Listing 13.21: A Riemann Redis event

{:host tornado-redis, :service redis-tornado-redis/memory, :state ok, :description nil, :metric 501248.0, :tags [collectd tornado], :time 1455055518, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type memory, :plugin_instance tornado-redis, :plugin redis}

Like our MySQL events, let's rewrite these events before they hit Graphite. To do this we edit the /etc/riemann/examplecom/etc/collectd.clj file on our Riemann server.

Listing 13.22: Rewriting the Redis events

(def default-services [{:service #"^load/load/(.*)$" :rewrite "load $1"}

...

{:service #"^redis-(.*)$" :rewrite "redis $1"}])

This will turn an event like: redis-tornado-redis/memory

Into: redis.tornado-redis.memory

Making it easier to parse and graph.
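To show one way we might use these rewritten Redis metrics, here's a small sketch in the style of our other checks: it graphs the memory event from Listing 13.21 and sends a Slack notification if usage climbs past a threshold. The threshold of roughly 1 GB is illustrative only and should be tuned for your instance.

; A sketch only: the 1073741824 byte (roughly 1 GB) threshold is illustrative.
(where (service "redis-tornado-redis/memory")
  (smap rewrite-service graph)
  (where (>= metric 1073741824)
    (changed-state {:init "ok"}
      (slacker))))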


NOTE We could also collect logs from both our MySQL and Redis services using the techniques we established in Chapters 11 and 12, but nothing we’re currently monitoring for explicitly requires it.

Addressing the Tornado Data tier monitoring concerns

Now that we have the Data tier monitored and sending data, what can we do with that data? We identified a series of concerns for Tornado’s Data tier earlier. Those concerns were:

• That MySQL and Redis are running on their respective hosts.
• 99th-percentile latency adding items to the Tornado DB is 3 milliseconds or less.
• That maximum MySQL connections don't exceed 80% of the available connections.
• That we can measure the rate of aborted connections to MySQL.
• That we're able to make use of the application metrics from the data inside our MySQL database.
• That we can measure key queries to ensure they are performing within expectations.

Let’s look at how we might address these concerns.

Like our earlier Web and Application tier components, we get process monitoring from the processes plugin and notifications thanks to the thresholds we configured in Chapters 5 and 6. We don't need to make any specific changes here but we could tweak the threshold for processes depending on the number of Redis or MySQL instances running, if required.


We can address our other concerns by using checks we’ve previously created too. Let’s add a new function to our tornado.clj file to hold our Data tier checks. We’ll call this function datatier and call it via our splitp var.

Listing 13.23: The tornado checks function

...

(splitp re-matches host
  web-tier-hosts (webtier)
  app-tier-hosts (apptier)
  db-tier-hosts (datatier)
  #(info "Catchall - no tier defined" (:host %) (:service %)))))

The datatier function will hold the checks for our Data tier. Let’s look at our checks now.


Listing 13.24: The datatier function

(defn datatier
  "Check for the Tornado Data Tier"
  []
  (sdo
    (check_ratio "mysql-status/gauge-Max_used_connections"
                 "mysql-variables/gauge-max_connections"
                 "mysql.max_connection_percentage"
                 80 90
                 (alert_graph))
    (create_rate "mysql-status/counter-Aborted_connects" 5)
    (check_percentiles "dbi-performance_schema/gauge-insert_query_time" 10
      (smap rewrite-service graph)
      (where (and (service "dbi-performance_schema/gauge-insert_query_time 0.99")
                  (>= metric 3.0))
        (changed-state {:init "ok"}
          (slacker))))))

We’ve defined three checks. Our first check, using our check_ratio function, is to create a percentage metric for maximum used connections. We set a threshold generating a warning notification if the percentage exceeds 80%, and a critical no- tification if the percentage exceeds 90%. We use the generic alert_graph function to handle the notification and send the resulting percentage metric to Graphite.

Our second check creates a rate for MySQL aborted connections. As we have created a couple of rates now, let's create a new function to manage this process. We'll create a function called create_rate and add it to our /etc/riemann/examplecom/etc/checks.clj file. Let's look at this new function.


Listing 13.25: The create_rate function

(defn create_rate
  [srv window]
  (where (service srv)
    (with {:service (str srv " rate")}
      (rate window
        index
        (smap rewrite-service graph)))))

Our create_rate function takes a service and a time window as parameters. It uses a where stream to filter events based on service. We then use a with stream to create a new copy of the event, appending " rate" to its :service field. We then pass this new event to the rate stream, where we calculate a rate over the window parameter and pass it into Graphite via an smap. We could add other child streams or checks to this function too.

Our last check uses one of the metrics we created with the dbi plugin. We're using our check_percentiles check to create percentiles from the dbi-performance_schema/gauge-insert_query_time metric. We're creating percentiles over a 10 second window, writing them to Graphite, and then using the resulting 99th percentile.


Listing 13.26: The INSERT query check

(check_percentiles "dbi-performance_schema/gauge-insert_query_time" 10
  (smap rewrite-service graph)
  (where (and (service "dbi-performance_schema/gauge-insert_query_time 0.99")
              (>= metric 3.0))
    (changed-state {:init "ok"}
      (slacker))))))

If the 99th percentile of the INSERT query time exceeds three milliseconds then we’ll trigger a Slack notification. This is a small set of the possible checks for our Data tier. You can easily expand on this to suit your needs.
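As one example of such an expansion, here's a sketch of create_rate with an added notification child stream so we're told when the aborted-connection rate itself climbs. The threshold of 5 is illustrative, and in a real configuration you would probably pass it in as another parameter rather than hard-coding it.

; A sketch only: the threshold of 5 is illustrative.
(defn create_rate
  [srv window]
  (where (service srv)
    (with {:service (str srv " rate")}
      (rate window
        index
        (smap rewrite-service graph)
        (where (>= metric 5)
          (changed-state {:init "ok"}
            (slacker)))))))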

The Tornado dashboard

With all of our metrics flowing into Graphite we can now build a dashboard. Let’s build a Tornado-specific dashboard that we’d share with the team managing it. We want to construct an eclectic dashboard with some business metrics and some more technical visualizations.

We sign in to our Grafana server and create a new Dashboard called Tornado. We're going to create two initial panels with business metrics, both taking advantage of data we're pulling from our MySQL database with the dbi plugin. But rather than create a graph, we're going to create a single stat panel.


Figure 13.1: Adding a single stat panel

A single stat panel contains a single number or metric. We’re going to create our first one with the current item count of the Tornado application.


Figure 13.2: Tornado item count panel

We specify the metric name of our item count: productiona.hosts.tornado-db.dbi.items.tornado_item_count

And we give the panel a title, Item Count, and then save it. We’ll do the same for our total items sold price using the metric: productiona.hosts.tornado-db.dbi.items.items_sold_total_price

And we give it a title of Total price of items sold. Then again for the metric: productiona.hosts.tornado-db.dbi.items.items_bought_total_cost

And we give it a title of Total cost of items bought. This will be the top row of our Tornado dashboard, allowing us to see the business metrics of our Tornado application.

Figure 13.3: Tornado business metrics panels

We also want to create some graphs showing the current dollar value of items bought and sold. We can use a method like:

Listing 13.27: The current value of items bought

alias(sumSeriesWithWildcards(productiona.hosts.*.statsd.gauge.tornado.api.item.bought.total, 2), 'Tornado API servers')


Here we’ve used a wildcard, *, to select all of the Tornado API metrics that record items bought: productiona.hosts.*.statsd.gauge.tornado.api.item.bought.total

We’ve used two Graphite functions: alias and sumSeriesWithWildcards. The first function, alias, wraps around the outside of our formula. It aliases the tornado -api1 and tornado-api2 servers into a single alias: Tornado API servers. The second function performs a sum of a series of metrics, in this case adding all the bought item metrics from the Tornado API servers.

We then use this formula on both the bought and sold items metrics to create two new graphs.

Figure 13.4: Tornado API bought and sold metrics panel

These new graphs show the total dollar value of all buying and selling transactions executed on Tornado API servers.

Let’s add a few more graphs that address our developers’ and operations team’s needs. Let’s start with the Tornado 5xx error percentage. Here we use the metric we created in Chapter 11: tornado-proxy.haproxy.frontend.tornado-www.5xx_error_percentage

We want to alias it by node to see the host name.


Listing 13.28: The Tornado 5xx error percentage

aliasByNode(productiona.hosts.tornado-proxy.haproxy.frontend.tornado-www.5xx_error_percentage,2)

Figure 13.5: Tornado 5xx percentage error

Next let’s create a graph for the insert query timing created from the performance_schema on tornado-db. We again alias by node to get the host name from the metric.

Listing 13.29: The MySQL database insertion query timing

aliasByNode(productiona.hosts.tornado-db.dbi.performance_schema.insert_query_time.99,2)

Then we create the graph.


Figure 13.6: Tornado DB Insert query

The remaining graphs in the dashboard use the following configuration:

• Tornado API Requests Time 0.99

Listing 13.30: Tornado API Requests Time 0.99

aliasByNode(productiona.hosts.*.tornado.api.request.99,2)

• Tornado API Request Rate

Listing 13.31: Tornado API Request Rate

aliasByNode(productiona.hosts.*.tornado.api.request.rate,2)

• Tornado DB Tier CPU Usage


Listing 13.32: Tornado DB Tier CPU Usage

groupByNode(productiona.hosts.tornado-{redis,db}.cpu.{user,system},2,'sumSeries')

• Tornado App Tier CPU Usage

Listing 13.33: Tornado App Tier CPU Usage

groupByNode(productiona.hosts.tornado-{api1,api2}.cpu.{user,system},2,'sumSeries')

• Tornado Web Tier CPU Usage

Listing 13.34: Tornado Web Tier CPU Usage

groupByNode(productiona.hosts.tornado-{proxy,web1,web2}.cpu.{user,system},2,'sumSeries')

• Tornado Swap Used

Listing 13.35: Tornado Swap Used

aliasByNode(productiona.hosts.tornado-{proxy,web1,web2,api1,api2,redis,db}.swap.used, 2)


• Tornado Memory Used

Listing 13.36: Tornado Memory Used

aliasByNode(productiona.hosts.tornado-{proxy,web1,web2,api1,api2,redis,db}.memory.used,2)

• Tornado Load (shortterm)

Listing 13.37: Tornado Load (shortterm)

aliasByNode(productiona.hosts.tornado-{proxy,web1,web2,api1,api2,redis,db}.load.shortterm,2)

• Tornado Disk used on the / partition

Listing 13.38: Tornado Disk used on the / partition

aliasByNode(productiona.hosts.tornado-{proxy,web1,web2,api1,api2,redis,db}.df.root.percent_bytes.used, 2)

And that completes our Tornado operational dashboard. You can see that business, ops, and developers can all make use of the dashboard to gain insight into the current state of the Tornado application. If you want to look at it in more detail, you can find the JSON source code for our dashboard in the book's source code.


Expanding monitoring beyond Tornado

One of the most useful things about Riemann is that much of our configuration and many of our checks are reusable across applications and services. If we had many iterations of the Tornado app we could easily update our checks function’s let statement to select all of the relevant hosts. Let’s assume our Tornado applications were numbered (and all Tornado-application events were tagged by collectd):

Listing 13.39: Expanding Tornado

(defn checks
  "Handles events for Tornado applications"
  []
  (let [web-tier-hosts #"tornado(\d+)?-(proxy|web1|web2)"
        app-tier-hosts #"tornado(\d+)?-(api1|api2)"
        db-tier-hosts #"tornado(\d+)?-(db|redis)"]

    (splitp re-matches host
      web-tier-hosts (webtier)
      app-tier-hosts (apptier)
      db-tier-hosts (datatier)
      #(info "Catchall" (:host %)))))

This would match hosts from the tornado-xxx, tornado1-xxx, or tornado10-xxx applications. To be more sophisticated we could also replace the regular expression lookups for web-tier-hosts and the others with lookups of an external data source from a configuration management tool like PuppetDB or service discovery tools like Zookeeper or Consul.
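If you want to convince yourself that these patterns behave as described, a quick check in a Clojure REPL with the db-tier pattern from Listing 13.39 shows the expected matches; the host names here are just examples.

user=> (re-matches #"tornado(\d+)?-(db|redis)" "tornado-db")
["tornado-db" nil "db"]
user=> (re-matches #"tornado(\d+)?-(db|redis)" "tornado10-redis")
["tornado10-redis" "10" "redis"]
user=> (re-matches #"tornado(\d+)?-(db|redis)" "avalanche-db")
nil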


Using this, and by modifying the regular expression, we could also match on events from a wide variety of applications. For example, we could create a generic set of three-tier web application checks by renaming our tornado namespace.

Listing 13.40: Renaming the Tornado namespace

(ns examplecom.app.webapps
  "Monitoring streams for Web Applications"
  ...

We can then update the stream inside our riemann.config to select events from multiple applications.

Listing 13.41: Sending more than our Tornado events to be checked

(tagged-any ["tornado" "avalanche"] (webapps/checks))

The tagged-any stream will match on events that have any of the tags listed; here events with either the tornado or the avalanche tags will be sent to the webapps/checks function. Assuming our host selection is also updated, the events will be routed to the relevant tier checks. We would then update any application-specific metric names in our streams to be generic. This allows us to reuse or build generic checks with our Riemann configuration.

Summary

In this chapter we monitored the Data tier of our Tornado application. We've again looked at both business and application performance metrics. We've constructed appropriate checks and thresholds for letting us know when the Data tier is available and when it is not performing in accordance with our expectations.

We’ve also exposed some of our metrics as visualizations in a dashboard for the business owners, as well as adding other graphs that meet the monitoring needs of our developers and operations teams.

Finally, combined, these last three chapters represent the capstone of the Art of Monitoring. They demonstrate how to combine the tools and techniques we've built in the preceding chapters to build a comprehensive monitoring framework. With these tools, techniques, and the real-world examples we've provided, you should be able to build and extend the framework to monitor your own environment.

Chapter A

An Introduction to Clojure and Functional Programming

Riemann is configured using a Clojure-based configuration file. This means your configuration file is actually processed as a Clojure program. So, to process events and send notifications and metrics, you'll be writing Clojure. Don't panic! You won't need to become a full-fledged Clojure developer to use Riemann. We'll teach you what you need to know in order to use Riemann. Additionally, Riemann comes with a lot of helpers and shortcuts that make it easier to write Clojure to do what we need to process our events.

Let's learn a bit more about Clojure and help you get started with Riemann. Clojure is a dynamic programming language that targets the Java Virtual Machine. It's a dialect of Lisp and is largely a functional programming language.

Functional programming is a programming style that focuses on the evaluation of mathematical functions and steers away from changing state and mutable data. It’s highly declarative, meaning you build programs from expressions that describe “what” a program should accomplish rather than “how” it accomplishes things.


NOTE Languages that describe more of the "how" are called imperative languages.

Examples of declarative programming languages include SQL, CSS, regular expressions, and configuration management languages like Puppet and Chef. Let's look at a simple example.

Listing A.1: A declarative statement

SELECT user_id FROM users WHERE user_name = "Alice"

In this SQL query we’re asking for the user_id for user_name of Alice from the users table. The statement is asking a declarative “what” question. We don’t really care about the “how”; the database engine takes care of those details.

In addition to their declarative nature, functional programming languages try to eliminate all side effects from changing state. In a functional language, when you call a function its output value depends only on the inputs to the function. So if you repeatedly call function f with the same value for argument x, f(x), it will produce the same result every time. This makes functional programs easy to understand, test, and predict. Functional programming languages call functions that operate like this “pure” functions.
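To make that concrete, here's a small peek ahead at the Clojure syntax we cover in the next sections. The first two calls are pure: they always return the same result for the same input. The last is not, because its result changes between calls.

(+ 1 2)       ; => 3, every time
(inc 41)      ; => 42, every time
(rand-int 10) ; => an integer from 0 to 9, different between calls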

The best way to get started with Clojure is to understand the basics of its syntax and types. Let’s get an introduction now.

WARNING This is going to be a high-level introduction to Clojure. It's designed to give you the knowledge and recognition of various syntax and expressions to allow you to work with Riemann. We will not be teaching you how to develop in Clojure in this book.

A brief introduction to Clojure

Let's step through Clojure's basic syntax and types. We'll also show you a tool called REPL that can help you test and build your Clojure snippets. REPL (short for read–eval–print loop) is an interactive programming shell that takes single expressions, evaluates them, and returns the results. It's a great way to get to know Clojure.

NOTE If you’re from the Ruby world then REPL is just like irb. Or, in Python, it’s like when you launch the python binary interactively.

We install REPL via a tool called Leiningen. Leiningen is an automation tool for Clojure that helps you automate the build and management of Clojure projects.

Installing Leiningen

In order to install Leiningen we'll need to have Java installed on the host. The prerequisite Java packages we installed on Ubuntu and Red Hat for Riemann will be sufficient for Leiningen too.

We're going to download a Leiningen binary called lein to install it. Let's download that into a bin directory under our home directory.


Listing A.2: Getting lein

$ mkdir -p ~/bin
$ cd ~/bin
$ curl -o lein https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein
$ chmod a+x lein
$ export PATH=$PATH:$HOME/bin

Here we’ve created a new directory called ~/bin and changed into it. We’ve then used the curl command to download the lein binary and the chmod command to make it executable. Lastly, we’ve added our ~/bin directory to our path so that we can find the lein binary.

TIP The addition of the ~/bin directory to your path assumes you're in a Bash shell. It also only applies to your current shell. You'd need to add the path to your .bashrc or a similar setup file for your shell.

Next we need to run lein to auto-install its supporting libraries.

Listing A.3: Auto-installing lein

$ lein ...

This will download Leiningen’s supporting Jar file.


Finally, we run REPL using the lein repl sub-command.

Listing A.4: Launching REPL

$ lein repl
...
user=>

This will download Clojure itself (in the form of its Jar file) and launch our interactive Clojure shell.

Clojure syntax and types

Let’s use this interactive shell to look at some of the syntax and functions we’ve just learned about. We’ll start by opening our shell.

Listing A.5: The REPL shell

user=>

Now we try a simple expression.

Listing A.6: Our first Clojure value

user=> nil nil

The nil expression is the simplest value in Clojure. It represents literally nothing.


We can also specify an integer value.

Listing A.7: Our first Clojure integer

user=> 1 1

Or a string.

Listing A.8: Our first Clojure string

user=> "hello Ms Event" "hello Ms Event"

Or Boolean values.

Listing A.9: Our first Clojure Booleans

user=> true
true
user=> false
false

Clojure functions

While interesting, these values aren’t exciting on their own. To do some more interesting things we can use Clojure functions. A function is structured like this:


Listing A.10: The Clojure function syntax

(function argument argument)

TIP If you’re used to the Ruby or Python world, a function is broadly the equiv- alent of a method.

Let’s look at a function in action by doing something with some values: adding two integers together.

Listing A.11: Our first Clojure function

user=> (+ 1 1) 2

In this case we’ve used the + function and added 1 and 1 together to get 2.

But there’s something about this structure that might look familiar to you if you’ve used other programming languages. Our function looks just like a list. This is because it is! Our expression might add two numbers together, but it’s also a list of three items in a valid list data structure.

NOTE Technically it’s an s-expression.

This is a feature of Clojure called homoiconicity, sometimes described as: "code is data, data is code." This concept is inherited from Clojure's parent language, Lisp.

Homoiconicity means that the program's structure is similar to its syntax. In this case Clojure programs are written in the form of lists. Hence you can gain insight into the program's internal workings by reading its code. This also makes metaprogramming really easy because Clojure's source code is a data structure and the language can treat it like one.

Now let's look more closely at the + function. Each function is a symbol. A symbol is a bare string of characters, like + or inc. Symbols have short names and full names. The short name is used to refer to it locally, for example +. The full name, or perhaps more accurately the fully qualified name, gives you a way to refer to the symbol unambiguously from anywhere. The fully qualified name of the + symbol is clojure.core/+. The clojure.core namespace is the fundamental library of the Clojure language. We refer to + in its fully qualified form here:

Listing A.12: The fully qualified + function

user=> (clojure.core/+ 1 1) 2

Symbols refer to other things; generally they point to values. Think about them as a name or identifier that points to a concept: + is the name, “adding” is the concept. When Clojure encounters a symbol it evaluates it by looking up its meaning. If it can’t find a meaning it’ll generate an error message, for example:


Listing A.13: Unable to resolve symbol

user=> (bob 1 2)
CompilerException java.lang.RuntimeException: Unable to resolve symbol: bob in this context, compiling:(NO_SOURCE_PATH:1:1)

Clojure also has a syntax for stopping that evaluation. This is called quoting, and it is achieved by prefixing the expression with a quotation mark: '.

Listing A.14: Quoting a symbol

user=> '(+ 1 1) (+ 1 1)

This returns the symbol itself without evaluating it. This is important because often we want to do things, review things, or test things without evaluating.

For example, if we need to determine what type of thing something is in Clojure we use the type function and quote the function like so:

Listing A.15: The type function

user=> (type '+) clojure.lang.Symbol

Here we see that + is a Clojure language symbol.


Lists

Clojure also has a variety of data structures. Especially useful to us will be collections. Collections are groups of values, for example, a list or a map.

Let’s start by looking at lists. Lists are core to all Lisp-based languages (Lisp means “LISt Processing”). As we discovered above, Clojure programs are essentially lists. So we’re going to see a lot of them!

Lists have zero or more elements and are wrapped in parentheses.

Listing A.16: A Clojure list

user=> '(a b c) (a b c)

Here we’ve created a list containing the elements a, b, and c. We’ve quoted it because we don’t want it evaluated. If we didn’t quote it then evaluation would fail because none of the elements, a, b, etc., are defined. Let’s see that now.

Listing A.17: An unquoted Clojure list

user=> (a b c)
CompilerException java.lang.RuntimeException: Unable to resolve symbol: a in this context, compiling:(NO_SOURCE_PATH:1:1)

We can do a few neat things with lists, such as adding an element using the conj function.


Listing A.18: Adding an element to a list

user=> (conj '(a b c) 'd) (d a b c)

You can see we've added a new element, d, to the front of the list. Why the front? Because a list is really a linked list and focuses on providing immediate access to the first value in the list. Lists are most useful for small collections of elements and when you need to read elements in a linear fashion. We can also return values from a list using a variety of functions.

Listing A.19: Working with lists

user=> (first '(a b c))
a
user=> (second '(a b c))
b
user=> (nth '(a b c) 2)
c

Here we've pulled out the first element, second element, and using the nth function, the third element. This last, nth, function shows us a multi-argument function. The first argument is the list, '(a b c), and the second argument is the index value of the element we want to return, here 2.

TIP Like most programming languages Clojure starts counting from 0.


We can also create a list with the list function.

Listing A.20: Creating a list

user=> (list 1 2 3) (1 2 3)

Vectors

Another collection available to us is the vector. Vectors are like lists but they are optimized for random access to the elements by index. Vectors are created by adding zero or more elements inside square brackets.

Listing A.21: A Clojure vector

user=> '[a b c] [a b c]

Like lists, we again use conj to add to a vector.

Listing A.22: Adding an element to a vector

user=> (conj '[a b c] 'd) [a b c d]


You’ll note the d element is added at the end because a vector isn’t focused on sequential access like a list.

There are some other useful functions we can use on lists and vectors. For example, to get the last element in a list or vector:

Listing A.23: Getting the last element in a vector

user=> (last '[a b c d]) d

Or count the elements:

Listing A.24: Counting elements in a vector

user=> (count '[a b c d]) 4

Because vectors are designed to look up elements by index, we can also use them directly as functions, for example:

Listing A.25: Using a vector as a function

user=> ([1 2 3] 1) 2

Here we’ve retrieved the value, 2, at index 1. We can create a vector with the vector function or change another data structure into a vector with the vec function.


Listing A.26: Creating or converting vectors

user=> (vector 1 2 3)
[1 2 3]
user=> (vec (list 1 2 3))
[1 2 3]

Sets

There’s a final collection related to lists and vectors called a set. Sets are unordered collections of values, prefixed with # and wrapped in curly braces, {}. They are most useful for collections of values where you want to check if a value or values are present.

Listing A.27: A Clojure set

user=> '#{a b c} #{a c b}

You’ll notice the set was returned in a different order. This is because sets are focused on presence lookups so order doesn’t matter quite so much.

Like lists and vectors we use the conj function to add an element to a set.


Listing A.28: Adding to a set

user=> (conj '#{a b c} 'd) #{a c b d}

Sets can never contain an element more than once, so adding an element that’s already present does nothing. You can remove elements with the disj function.

Listing A.29: Removing an element from a set

user=> (disj '#{a b c d} 'd) #{a c b}
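We can also quickly verify the claim that duplicates are ignored: adding an element that's already present simply returns the set unchanged (the print order of an unordered set may vary).

user=> (conj '#{a b c} 'a)
#{a c b}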

The most common operation with a set is to check for the presence of a specific value. For this we use the contains? function.

Listing A.30: Checking for a value inside a set

user=> (contains? '#{a b c} 'c)
true
user=> (contains? '#{a b c} 'd)
false

Like a vector, you can also use the set itself as a function. This returns the value if it is present, or nil if it is not.


Listing A.31: Using the set as a function

user=> ('#{a b c} 'c)
c
user=> ('#{a b c} 'd)
nil

You can make a set out of any other collection with the set function.

Listing A.32: Making a set

user=> (set '[a b c]) #{a c b}

Here we’ve made a set out of a vector.

Maps

The last data structure we’re going to look at is the map. Maps are key/value pairs enclosed in braces. You can think about them as being equivalent to a hash.

Listing A.33: A Clojure map

user=> {:a 1 :b 2} {:a 1, :b 2}

Here we’ve defined a map with two key/value pairs: :a 1 and :b 2.


You’ll note each key is prefixed with a :. This denotes another type of Clojure syntax: the keyword. A keyword is much like a symbol, but instead of referencing another value it is merely a name or label. It’s highly useful in data structures like maps to do lookups—you look up the keyword and return the value. We can use the get function to retrieve a value.

Listing A.34: Getting a Clojure map value

(get {:a 1 :b 2} :a) 1

Here we’ve specified the keyword :a and asked Clojure if it is inside our map. It’s returned the value in the key/value pair, 1.

If the key doesn’t exist in the map then Clojure returns nil.

Listing A.35: Getting a missing Clojure map value

user=> (get {:a 1 :b 2} :c) nil

The get function can also take a default value to return instead of nil if the key doesn’t exist in that map.

Listing A.36: Getting a default value from a map

user=> (get {:a 1 :b 2} :c :novalue) :novalue


We can use the map itself as a function, as well.

Listing A.37: Using a map as a function

user=> ({:a 1 :b 2} :a) 1

And we can use keywords as functions to look themselves up in a map.

Listing A.38: Using a keyword as a function

user=> (:a {:a 1 :b 2}) 1

To add a key/value pair to a map we use the assoc function.

Listing A.39: Using assoc to add a key/value

user=> (assoc {:a 1 :b 2} :c 3) {:a 1, :b 2, :c 3}

If a key isn’t present then assoc adds it. If the key is present then assoc replaces the value.


Listing A.40: Replacing a key/value with assoc

user=> (assoc {:a 1 :b 2} :b 3) {:a 1, :b 3}

To remove a key we use the dissoc function.

Listing A.41: Removing a key/value with dissoc

user=> (dissoc {:a 1 :b 2} :b) {:a 1}

NOTE If you’ve come from the Ruby or Python world, the terms list, set, vector, and map might be a little new. But the syntax probably looks familiar. You can think about lists, vectors, and sets as being similar to arrays, and maps being hashes.

Strings

We can also work with strings. Clojure lets you turn pretty much any value into a string using the str function.


Listing A.42: The str function

user=> (str "holiday") "holiday"

The str function turns anything specified into a string. We can also use it to concatenate strings.

Listing A.43: Concatentating a string

user=> (str "james needs " 2 " holidays") "james needs 2 holidays"
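And because str accepts pretty much anything, it also works on keywords, maps, and even nil.

user=> (str :holiday)
":holiday"
user=> (str {:a 1 :b 2})
"{:a 1, :b 2}"
user=> (str nil)
""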

Creating our own functions

Up until now we’ve run functions as stand-alone expressions. For example, here’s the inc function that increments arguments passed to it:

Listing A.44: The inc function again

user=> (inc 1) 2

This isn't overly practical except to demonstrate how a function works. If we want to do more with Clojure we need to be able to define our own functions. To do this, Clojure provides a function called fn. Let's construct our first function.


Listing A.45: The fn function

user=> (fn [a] (+ a 1))

So what’s going on here? We’ve used the fn function to create a new function. The fn function takes a vector as an argument. This vector contains any arguments being passed to our function. Then we specify the actual action our function is going to perform. In our case we’re mimicking the behavior of the inc function. The function will take the value of a and add 1 to it. If we run this code now nothing will happen because a is currently unbound—we haven’t defined a value for it. Let’s run our function now.

Listing A.46: Running our first fn function

user=> ((fn [a] (+ a 1)) 2) 3

Here we’ve evaluated our function and passed in an argument of 2. This is assigned to our a symbol and passed to the function. The function adds a, now set to 2, and 1 and returns the resulting value: 3. There’s also a shorthand for writing functions that we’ll see occasionally in the book.

Listing A.47: The fn function shortcut

user=> #(+ % 1)


This shorthand function is the equivalent of (fn [x] (+ x 1)) and we can call it to see the result.

Listing A.48: Calling the fn function shortcut

user=> (#(+ % 1) 2) 3

Creating variables

But we’re still a step from a named function, and we’re missing an important piece: how do we define our own variables to hold values? Clojure has a function called def that allows us to do this.

Listing A.49: Creating a var

user=> (def smoker "joker") #'user/smoker

The def statement does two things:

• It creates a new type of object called a var. Vars, like symbols, are references to other values. You can see our new var #'user/smoker returned as output of the def function.
• It binds a symbol to that var. Here the symbol smoker is bound to a var with a value of the string "joker".

When we evaluate a symbol pointing to a var it is replaced by the var’s value. But because def also creates a symbol, we can refer to our var like that too.


Listing A.50: Evaluating a symbol

user=> user/smoker
"joker"
user=> smoker
"joker"

Where did this user/ come from? It's a Clojure namespace. Namespaces are a way Clojure organizes code and program structure. In this case the REPL creates a namespace called user/ by default. Remember, we learned earlier that a symbol has both a short name, for example smoker, that can be used locally to refer to it, and a full name. That full name, here user/smoker, would be used to refer to this symbol from another namespace.

We'll talk more about namespaces and use them to organize our Riemann configuration in other parts of this book. If you'd like to read more about them, there is an excellent explanation in this post. We can also use the type function to see the type of value the symbol references.

Listing A.51: Using the type function on the symbol

user=> (type smoker) java.lang.String

Here we see that the value smoker resolves to is a string.

Creating named functions

Now, with the combination of def and fn, we can create our own named functions.


Listing A.52: Creating our first named function

user=> (def grow (fn [number] (* number 2))) #'user/grow

Firstly, we’ve defined a var (and symbol) called grow. Inside that we’ve defined a function. Our function takes a single argument, number, and passes that number to the * function, the mathematical multiplication operator in Clojure, and multiplies it by 2. Let’s call our function now.

Listing A.53: Calling our grow function

user=> (grow 10) 20

Here we’ve called the grow function and passed it a value of 10. The grow function multiplies that value and returns the result: 20. Pretty awesome eh?

But the syntax is a little cumbersome. Thankfully Clojure offers a shortcut, called defn, that creates a var and binds it to a function in one step. Let's rewrite our function using this form.

Listing A.54: Using the defn form

user=> (defn grow [number] (* number 2)) #'user/grow


That's a little neater and easier to read. Now how about we add a second argument? Let's make both the number to be multiplied and the multiplier arguments.

Listing A.55: Adding a second argument

user=> (defn grow [number multiple] (* number multiple)) #'user/grow

Let’s call our grow function again.

Listing A.56: Calling our grow function again

user=> (grow 10)
ArityException Wrong number of args (1) passed to: user/grow clojure.lang.AFn.throwArity (AFn.java:429)

Oops, not enough arguments! Let’s add the second argument.

Listing A.57: Calling grow with our second argument

user=> (grow 10 4) 40

We can also add a doc string to our function to help us articulate what it does.


Listing A.58: Adding a doc string to our function

(defn grow
  "Multiplies numbers - can specify the number and multiplier"
  [number multiple]
  (* number multiple))

We can access a function’s doc string using the doc function.

Listing A.59: Using the doc function

user=> (doc grow)
-------------------------
user/grow
([number multiple])
  Multiplies numbers - can specify the number and multiplier
nil

The doc function tells us the full name of the function, the arguments it accepts, and returns the docstring. That’s the end of our introduction. We’ll see and learn more Clojure elsewhere in the book.

Learning more Clojure

We recommend trying to get an understanding of the basics of Clojure to get the most out of Riemann. If you'd like to start to learn a bit about Clojure, Kyle Kingsbury's excellent Clojure from the ground up series is a great place to start. This section is very much an abbreviated summary of that tutorial, and we can't thank Kyle enough for writing it. A reading of his tutorial will greatly add to the knowledge we've shared here. For the purposes of this book, we recommend at least a solid reading of the first three posts in the series:

• The Welcome post.
• The post on Basic types.
• The post on Functions.

Also useful to writing good code is the Clojure Style Guide.

TIP If you’re interested in learning a bit more about the basics of Clojure, another good resource is Learn Clojure.

List of Figures

1 License ...... 5 2 ISBN ...... 5

1.1 Monitoring Maturity Model Distribution...... 13 1.2 Traditional Monitoring ...... 14

2.1 Monitoring framework ...... 22 2.2 A sample plot ...... 28 2.3 A sample gauge ...... 29 2.4 A sample counter ...... 30 2.5 A sample timer ...... 30 2.6 A histogram example ...... 32 2.7 An aggregated collection of metrics ...... 33 2.8 The flaw of averages - Copyright Jeff Danzinger ...... 44 2.9 Response time average ...... 45 2.10 Response time average Mk II ...... 46 2.11 Response time average and median ...... 47 2.12 Response time average and median Mk II ...... 48 2.13 The empirical rule ...... 49 2.14 Response time average, median, and percentiles ...... 51 2.15 Response time average, median, and percentiles Mk II ...... 52

3.1 Event Routing ...... 55 3.2 Metrics Architecture ...... 58 3.3 Connecting Riemann servers ...... 91


3.4 Email notification ...... 109 3.5 Sharded Riemann ...... 119

4.1 Metrics processing and visualization ...... 122 4.2 Metrics Architecture ...... 127 4.3 Graphite Architecture ...... 128 4.4 Graphite on Ubuntu installation prompt ...... 130 4.5 Carbon daemon ports and architecture ...... 145 4.6 Testing Graphite-API ...... 168 4.7 The Grafana Console login ...... 169 4.8 The Grafana Console ...... 170 4.9 The Grafana Console menu ...... 171 4.10 The Grafana Data Sources menu ...... 172 4.11 Adding a Grafana data source ...... 173 4.12 Testing the Grafana connection ...... 173 4.13 The Grafana console again ...... 183 4.14 The Grafana Dashboard menu ...... 184 4.15 The Grafana Settings menu ...... 184 4.16 The Riemann dashboard menu ...... 184 4.17 Adding our first graph ...... 185 4.18 Our first graph ...... 185 4.19 The graph editing controls ...... 186 4.20 Selecting metrics ...... 188 4.21 Our graphed Riemann rate metric ...... 189 4.22 The metric dropdown menu ...... 189 4.23 Our graphed Riemann rate metric via the edit box ...... 190 4.24 Our graph and dashboard ...... 190 4.25 A redundant Graphite architecture ...... 192 4.26 Configuring time zone on Ubuntu ...... 197 4.27 Selecting the UTC time zone on Ubuntu ...... 198

5.1 Collecting host-based data ...... 209


5.2 Our collectd architecture ...... 212

6.1 Creating a new host dashboard ...... 288 6.2 Naming our new host dashboard ...... 289 6.3 Creating our first host graph ...... 290 6.4 Naming our CPU Usage graph ...... 290 6.5 Our final CPU Usage graph ...... 294 6.6 The time-series resolution controls ...... 294 6.7 The Memory usage graph ...... 296 6.8 The graphitea CPU graph ...... 296 6.9 The single host Metrics tab ...... 297 6.10 The single host CPU graph ...... 297 6.11 A Hosts Dashboard ...... 298

7.1 Host evolution ...... 304

8.1 Logstash Architecture ...... 346 8.2 The Elasticsearch Head plugin ...... 367 8.3 Configuring Kibana ...... 389 8.4 The Kibana Dashboard ...... 390

9.1 Application and business logic monitoring ...... 441 9.2 Our application monitoring architecture ...... 456 9.3 External monitoring ...... 496 9.4 Creating the aom-rails dashboard ...... 507 9.5 Naming the aom-rails dashboard ...... 508 9.6 Creating an Annotation ...... 509 9.7 Configuring our Annotation ...... 510 9.8 The Annotation in the dashboard ...... 510 9.9 Annotated user count graph ...... 511

10.1 A Riemann email notification ...... 517 10.2 Our scripted dashboard ...... 539


10.3 A PagerDuty incident ...... 554

11.1 Current Monitoring Framework Architecture ...... 570 11.2 Tornado’s architecture ...... 573 11.3 The HAProxy HTTP console ...... 578 11.4 Tornado Slack notifications for Nginx and HAProxy ...... 623

12.1 Graphing our Tornado API rate ...... 661

13.1 Adding a single stat panel ...... 689 13.2 Tornado item count panel ...... 689 13.3 Tornado business metrics panels ...... 690 13.4 Tornado API bought and sold metrics panel ...... 691 13.5 Tornado 5xx percentage error ...... 692 13.6 Tornado DB Insert query ...... 693

Listings

1 Sample code block ...... 4 2.1 Sample Nagios notification ...... 35 3.1 Installing Java on Ubuntu ...... 59 3.2 Checking Java is installed on Ubuntu ...... 60 3.3 Fetching the Riemann DEB package ...... 60 3.4 Installing the Riemann package on Ubuntu ...... 60 3.5 Installing Java and prerequisites on RHEL ...... 61 3.6 Checking Java is installed on Red Hat ...... 62 3.7 Fetching the Riemann RPM package ...... 62 3.8 Installing the Riemann package on RHEL ...... 62 3.9 Starting and stopping Riemann ...... 64 3.10 Running Riemann interactively ...... 64 3.11 Installing supporting tools prerequisites on Ubuntu ...... 66 3.12 Installing supporting tools prerequisites on RHEL ...... 66 3.13 Installing Riemann’s supporting tools ...... 66 3.14 New /etc/riemann/riemann.config configuration file ...... 68 3.15 Riemann logging stanza ...... 69 3.16 The let form ...... 70 3.17 Exposing Riemann on all interfaces ...... 71 3.18 Changing the Riemann port ...... 72 3.19 Connecting to the Riemann REPL server ...... 72 3.20 SIGHUP from the Riemann REPL server ...... 73


3.21 Restarting Riemann ...... 73 3.22 Example Riemann event ...... 74 3.23 Child streams example ...... 75 3.24 Example Apache Riemann event ...... 76 3.25 Example expired Apache Riemann event ...... 77 3.26 More of our default riemann.config configuration file ...... 77 3.27 Copying more keys into expired events ...... 78 3.28 Using the Riemann default function ...... 79 3.29 Logging to the Riemann log file ...... 79 3.30 A Riemann log event ...... 79 3.31 Adding a prefix to Riemann logs entries ...... 80 3.32 Limiting Riemann log entries ...... 80 3.33 A filtered Riemann log event ...... 80 3.34 The Riemann prn function ...... 81 3.35 Reloading Riemann to enable our new configuration ...... 81 3.36 Riemann internal events ...... 82 3.37 The riemann-health command ...... 83 3.38 The riemann-health –host option ...... 83 3.39 Our incoming Riemann data ...... 84 3.40 A Riemann-health disk event ...... 84 3.41 Our first monitoring check ...... 85 3.42 Our prefixed warning event ...... 86 3.43 Using the where stream with a regular expression ...... 87 3.44 Using the where stream with booleans ...... 87 3.45 The tagged-any stream ...... 88 3.46 Using the where stream for complex matches ...... 88 3.47 Referring to an optional example field in Riemann ...... 88 3.48 Referring to an optional field in Riemann ...... 89 3.49 The optional type field in Riemann ...... 89 3.50 Referring to an optional field in Riemann ...... 89 3.51 Using the where stream with math ...... 90


3.52 Using the where stream for a range query ...... 90 3.53 Updated Riemann configuration ...... 93 3.54 Requiring the Riemann client ...... 94 3.55 Added downstream binding to Riemann ...... 94 3.56 Riemann client forwarding configuration ...... 95 3.57 The where filtering stream for forwards ...... 96 3.58 The downstream riemannmc server ...... 97 3.59 Restarting Riemann to enable forwarding ...... 98 3.60 Riemann agg events on riemanna or riemannb ...... 98 3.61 Combined events from upstream and downstream ...... 99 3.62 The downstream riemannmc server configuration ...... 101 3.63 Clojure namespace format ...... 102 3.64 Creating the examplecom.etc namespace path ...... 103 3.65 Creating the email.clj file ...... 103 3.66 Requiring the Riemann functions ...... 103 3.67 Referring functions ...... 105 3.68 Configuring email notifications in Riemann ...... 105 3.69 Configuring an SMTP server for Postal ...... 106 3.70 Adding our email function to Riemann ...... 107 3.71 Our expired Riemann event filter stream ...... 108 3.72 The riemanna expired streams event ...... 108 3.73 Our throttled expired Riemann event filter ...... 109 3.74 The throttle stream ...... 110 3.75 A rollup of our expired Riemann events ...... 110 3.76 Revisting our riemannmc configuration ...... 112 3.77 Adding a tap to our riemannmc index ...... 113 3.78 Adding tests to our riemannmc configuration ...... 114 3.79 Running the Riemann tests ...... 115 3.80 Download the Riemann syntax checker ...... 116 3.81 Build the Riemann syntax checker ...... 116 3.82 Syntax check a Riemann configuration ...... 117


3.83 Configuring additional RAM ...... 117 4.1 The Whisper file layout ...... 125 4.2 Installing the Graphite package on Ubuntu ...... 129 4.3 Adding the EPEL repository for Riemann ...... 130 4.4 Installing Graphite prerequisite packages on Red Hat ...... 131 4.5 Installing Graphite packages on Red Hat ...... 131 4.6 Creating new Graphite user and group on Red Hat ...... 132 4.7 Moving the /var/lib/carbon directory ...... 132 4.8 Changing the ownership of /var/log/carbon ...... 133 4.9 Removing the carbon user ...... 133 4.10 Adding the Graphite-API PackageCloud key ...... 134 4.11 Adding the PackageCloud exoscale repository listing ...... 134 4.12 Updating APT for Graphite-API ...... 134 4.13 Installing the graphite-api package on Ubuntu ...... 135 4.14 Install Graphite-API prerequisite packages on Red Hat ...... 135 4.15 Installing Graphite-API via pip ...... 135 4.16 Adding the Grafana repository listing ...... 136 4.17 Adding the PackageCloud key ...... 137 4.18 Installing the Grafana package ...... 137 4.19 Creating the Grafana Yum repository ...... 137 4.20 Yum repository definition for Grafana ...... 138 4.21 Installing Grafana via Yum ...... 138 4.22 Editing the Carbon settings ...... 140 4.23 Directory configuration for Carbon ...... 141 4.24 Setting the Carbon user and log rotation ...... 141 4.25 Configuring the Carbon Cache daemon ...... 142 4.26 The Carbon plaintext protocol ...... 143 4.27 The Carbon plaintext protocol metric example ...... 143 4.28 Sending a plaintext metric via Netcat to Graphite ...... 143 4.29 The Graphite pickle protocol ...... 144 4.30 Carbon Cache instance definitions ...... 144


4.31 A third carbon-cache instance ...... 146 4.32 The carbon-relay daemon configuration ...... 146 4.33 Adding a third instance destination to carbon-relay ...... 147 4.34 Configuring Carbon storage schemas ...... 150 4.35 Carbon storage schema ...... 151 4.36 Sample retention periods for Graphite ...... 151 4.37 Creating a new default Graphite schema ...... 152 4.38 Annoying Graphite log message ...... 153 4.39 Creating an empty storage aggregation file ...... 153 4.40 Creating a new default Graphite schema ...... 154 4.41 Downloading the calculator ...... 154 4.42 The whisper-calculator.py ...... 155 4.43 Download the Carbon Cache init script on Ubuntu ...... 156 4.44 Install the Carbon Cache init script on Ubuntu ...... 156 4.45 Enable the Carbon Cache init script on Ubuntu ...... 156 4.46 Install the Carbon relay init script on Ubuntu ...... 157 4.47 Enable Graphite at startup on Ubuntu ...... 157 4.48 The /etc/default/graphite-carbon file ...... 158 4.49 Starting the Carbon daemons on Ubuntu ...... 158 4.50 systemd unit file for the Carbon Cache daemon ...... 159 4.51 Enabling and starting the systemd Carbon Cache daemons ..... 160 4.52 systemd unit file for the Carbon relay daemon ...... 161 4.53 Enabling and starting the systemd Carbon relay daemon ...... 161 4.54 Remove the old systemd unit files for Carbon ...... 162 4.55 The /etc/graphite-api.yaml file ...... 163 4.56 Creating the /var/lib/graphite/api_search_index file ...... 164 4.57 Restarting the Graphite-API on Ubuntu ...... 165 4.58 Systemd script for Graphite-API ...... 166 4.59 Gunicorn worker calculations ...... 167 4.60 Enabling and starting the systemd Graphite-API daemons ...... 167 4.61 Starting the Grafana Server ...... 168


4.62 Curling the Graphite-API ...... 170 4.63 Creating the Riemann Graphite snippet file ...... 174 4.64 The graphite.clj file ...... 174 4.65 Graphite metrics format ...... 175 4.66 New Graphite metrics format ...... 175 4.67 The add-environment-to-graphite function ...... 176 4.68 The graph var ...... 177 4.69 Riemann Graphite configuration for riemanna ...... 179 4.70 Reloading Riemann to enable Graphite ...... 180 4.71 Riemann connecting to Graphite ...... 180 4.72 Metrics being created on graphitea ...... 181 4.73 metric name formatting ...... 181 4.74 The Riemann streams latency path ...... 181 4.75 The metric path tree ...... 182 4.76 Our Riemann Whisper files ...... 182 4.77 Graphite datapoints in the updates log ...... 183 4.78 The Riemann rate path ...... 187 4.79 The first element of the metric path ...... 188 4.80 Installing the Graphite packages on our relay-only host ...... 193 4.81 The carbon-relay only host ...... 194 4.82 The updated graph var ...... 195 4.83 Setting the time zone on Ubuntu ...... 197 4.84 The time zone configuration output on Ubuntu ...... 198 4.85 Installing NTP on Ubuntu ...... 199 4.86 Initial time sync on Ubuntu ...... 199 4.87 Running the timedatectl command on Red Hat ...... 200 4.88 Checking the time zone on Red Hat ...... 200 4.89 Installing NTP on Red Hat ...... 201 4.90 Enabling the NTP service on Red Hat ...... 201 4.91 Initial time sync on Red Hat ...... 201 4.92 Starting the NTP service on Red Hat ...... 201


4.93 Enabling NTP with timedatectl ...... 202 4.94 Checking NTP is enabled with timedatectl ...... 202 4.95 Checking the NTP time status ...... 203 5.1 Installing collectd on Ubuntu ...... 215 5.2 Installing collectd on Ubuntu ...... 215 5.3 Testing collectd on Ubuntu ...... 215 5.4 Adding the EPEL repository for Riemann ...... 216 5.5 Installing collectd on Red Hat ...... 216 5.6 Testing collectd on Red Hat ...... 217 5.7 Our initial collectd configuration ...... 219 5.8 The Graphite storage schema ...... 220 5.9 Loading collectd plugins ...... 222 5.10 Creating the /etc/collectd.d directory ...... 223 5.11 Creating the /etc/collectd.d/cpu.conf file ...... 225 5.12 Configuring the cpu plugin ...... 225 5.13 Creating the /etc/collectd.d/memory.conf file ...... 226 5.14 Configuring the Memory plugin ...... 226 5.15 Creating the /etc/collectd.d/df.conf file ...... 227 5.16 Configuring the df plugin ...... 227 5.17 Configuring another mount point for the df plugin ...... 228 5.18 Configuring the df plugin to monitor a device ...... 228 5.19 Configuring the df plugin to monitor everything ...... 229 5.20 Creating the /etc/collectd.d/swap.conf file ...... 229 5.21 Configuring the swap plugin ...... 230 5.22 Creating the /etc/collectd.d/interface.conf file ...... 230 5.23 The interface.conf file ...... 230 5.24 Configuring the interface plugin to only monitor one interface ... 231 5.25 Ignoring interfaces ...... 231 5.26 Creating the /etc/collectd.d/protocols.conf file ...... 232 5.27 The protocols.conf file ...... 232 5.28 Creating the /etc/collectd.d/load.conf file ...... 232


5.29 The load.conf file ...... 233 5.30 Creating the processes.conf configuration file ...... 233 5.31 The processes.conf file ...... 234 5.32 Overriding the global interval ...... 234 5.33 The ProcessMatch option ...... 235 5.34 Using the ProcessMatch option ...... 235 5.35 Using the ProcessMatch option for Riemann ...... 236 5.36 Checking for Riemann in the ps command ...... 236 5.37 The processes threshold ...... 237 5.38 A collectd notification ...... 238 5.39 The ps_count metric ...... 239 5.40 Specifying a warning and a failure ...... 240 5.41 Matching multiple thresholds ...... 241 5.42 The carbon-cache collect failure event ...... 242 5.43 The write_riemann.conf configuration file ...... 243 5.44 Configuring the write_riemann plugin ...... 244 5.45 Configuring a second Node block ...... 246 5.46 Enabling and starting collectd on Ubuntu ...... 247 5.47 Enabling and starting collectd on Red Hat ...... 248 5.48 The collectd log file ...... 248 5.49 Our current riemann.config ...... 249 5.50 The collectd events in Riemann ...... 250 5.51 Sending collectd metrics to Graphite ...... 253 5.52 The collectd metrics in Graphite ...... 254 5.53 A processes metric ...... 255 5.54 Creating the service rewrite rules ...... 255 5.55 Rewriting services inside Riemann ...... 257 5.56 The service rewrite line ...... 258 5.57 Original collectd to graph where filter ...... 259 5.58 Updated collectd to graph stream ...... 260 5.59 Original metric ...... 260


5.60 Rewritten metric ...... 261 6.1 Detecting collectd down or expired ...... 264 6.2 A collectd notification event ...... 265 6.3 Detecting changed state ...... 266 6.4 Adjusting the :service field ...... 267 6.5 Finding expired ps_count events ...... 268 6.6 Added collectd monitoring to our Riemann configuration ...... 269 6.7 The cpu/percent metric ...... 271 6.8 Basic CPU monitoring ...... 272 6.9 Basic Memory monitoring ...... 272 6.10 The df plugin percent_bytes-used metric ...... 273 6.11 Basic disk monitoring ...... 273 6.12 Median CPU monitoring over ten seconds ...... 275 6.13 Percentile CPU monitoring over ten seconds ...... 277 6.14 The percentile events ...... 278 6.15 Creating a file to hold our custom check functions ...... 279 6.16 Our first custom check function ...... 280 6.17 A child stream explained ...... 282 6.18 Using the check_threshold function ...... 283 6.19 Percentile CPU monitoring over ten seconds ...... 284 6.20 Using the check_percentiles function ...... 285 6.21 A set of monitoring checks ...... 286 6.22 CPU Usage graph definition ...... 291 6.23 Graphite functions ...... 291 6.24 A Graphite series list ...... 292 6.25 The 0-indexed metric name ...... 292 6.26 The groupByNode callback ...... 293 6.27 The sumSeries function output ...... 293 6.28 Memory Usage graph definition ...... 295 7.1 The docker stats command ...... 308 7.2 Querying the Docker API stats endpoint ...... 309


7.3 Downloading the Docker collectd plugin ...... 311 7.4 Unzip the master.zip file ...... 312 7.5 Rename the Docker collectd plugin directory ...... 312 7.6 Installing the docker-collectd prerequisites via pip ...... 313 7.7 Creating the /etc/collectd.d/docker.conf file ...... 313 7.8 Configuring the Docker collectd plugin ...... 314 7.9 Adding the TypesDB default configuration ...... 315 7.10 Monitoring the Docker daemon ...... 315 7.11 Restarting collectd to enable the Docker plugin ...... 316 7.12 The Docker collectd plugin log output ...... 316 7.13 Starting our Redis container ...... 317 7.14 Running docker ps ...... 317 7.15 A where filter for our Docker collectd events ...... 318 7.16 A Docker collectd Riemann event ...... 318 7.17 The updated collectd Graphite where stream ...... 320 7.18 The Docker collectd plugin smap and graph ...... 321 7.19 The updated Docker collectd :type Riemann event ...... 322 7.20 The updated Docker collectd :type_instance Riemann event ..... 322 7.21 The graph var from Chapter 4 ...... 323 7.22 The updated graph var ...... 324 7.23 A Dockerfile label ...... 325 7.24 Adding a label at runtime ...... 325 7.25 Rerunning the redis container ...... 326 7.26 Inspecting the redis container ...... 326 7.27 The Docker collectd plugin format ...... 327 7.28 The plugin-map function ...... 328 7.29 The label map ...... 329 7.30 The docker-attributes function ...... 329 7.31 Destructuring keys ...... 330 7.32 The if-let binding ...... 330 7.33 The if-let conditions ...... 331


7.34 The docker-parse-service-host function ...... 332 7.35 Processing our Docker events ...... 332 7.36 A rewritten Docker event Mark II ...... 333 7.37 Updated our Graphite metric name write ...... 334 7.38 The existing Carbon schemas ...... 335 7.39 A new storage schema for Docker events ...... 336 7.40 Calculating our new Docker metric file size ...... 336 7.41 A find command to clean up old Graphite metrics ...... 337 7.42 A find command to clean up Docker specific Graphite metrics ... 338 7.43 Using the check_percentiles function for Docker ...... 339 8.1 Installing Java on Debian and Ubuntu ...... 347 8.2 Installing Java on Red Hat ...... 347 8.3 Testing Java is installed ...... 348 8.4 Getting the Logstash public key on Ubuntu ...... 348 8.5 Adding the Logstash packages ...... 349 8.6 Updating Apt and installing the Logstash package ...... 349 8.7 Getting the Logstash public key on Red Hat ...... 349 8.8 The Logstash Yum configuration ...... 350 8.9 Installing Logstash on Red Hat ...... 350 8.10 Testing Logstash is installed ...... 351 8.11 Creating the logstash.conf file ...... 352 8.12 The logstash.conf configuration file ...... 352 8.13 Testing Logstash interactively ...... 353 8.14 Testing Logstash configuration validity ...... 354 8.15 Our first Logstash event ...... 354 8.16 Installing Java on Debian and Ubuntu ...... 356 8.17 Testing Java is installed ...... 356 8.18 Installing the public key for Elasticsearch ...... 357 8.19 Adding the Elasticsearch Apt repository ...... 357 8.20 Installing Elasticsearch on Ubuntu ...... 357 8.21 Installing Java on Red Hat for Elasticsearch ...... 358


8.22 Adding the Elasticsearch public key on Red Hat ...... 358 8.23 Adding the Yum repo for Elasticsearch ...... 358 8.24 The Yum repo for Elasticsearch ...... 359 8.25 Installing Elasticsearch on Red Hat ...... 359 8.26 Testing the elasticsearch binary ...... 360 8.27 Checking Elasticsearch is running ...... 361 8.28 Elasticsearch status information ...... 361 8.29 Elasticsearch stats page ...... 362 8.30 Initial cluster and node names ...... 363 8.31 New cluster and node names ...... 363 8.32 Configuring our Elasticsearch cluster ...... 365 8.33 Restarting Elasticsearch to enable clustering ...... 365 8.34 Installing an Elasticsearch plugin ...... 366 8.35 Granting access to logs on Ubuntu ...... 368 8.36 Granting access to logs on CentOS ...... 368 8.37 The updated logstash.conf configuration file ...... 369 8.38 File input globbing ...... 370 8.39 File recursive globbing ...... 370 8.40 Restarting Logstash to enable file monitoring ...... 371 8.41 Testing with logger ...... 371 8.42 A syslog log event ...... 372 8.43 Adding our first Logstash filters ...... 374 8.44 Logstash conditional selection ...... 375 8.45 Our Syslog parsing filters ...... 376 8.46 The grok filter ...... 377 8.47 The MONTH pattern ...... 378 8.48 Our original syslog log event ...... 379 8.49 Our grok-parsed event ...... 379 8.50 The syslog_pri parsed event ...... 381 8.51 The date filter ...... 382 8.52 A Logstash index ...... 383


8.53 Showing the current Elasticsearch mapping ...... 384 8.54 Showing index-specific mappings ...... 384 8.55 Adding the Kibana APT repository ...... 386 8.56 Updating the package list for Kibana ...... 387 8.57 Installing Kibana via apt-get ...... 387 8.58 Starting Kibana at boot ...... 387 8.59 The Kibana kibana.yml configuration file ...... 388 8.60 Updating the Elasticsearch instance ...... 388 8.61 Running Kibana ...... 389 8.62 Adding Syslog receivers ...... 392 8.63 Restarting Logstash to enable new inputs ...... 393 8.64 Running netstat ...... 393 8.65 A Syslog message ...... 394 8.66 Creating an RSyslog snippet for forwarding ...... 396 8.67 Configuring RSyslog for Logstash ...... 396 8.68 Specifying RSyslog facilities or priorities ...... 397 8.69 Restarting RSyslog ...... 397 8.70 Collecting the riemann.log file ...... 398 8.71 Monitoring files with the imfile module ...... 399 8.72 Monitoring files with an imfile wildcard ...... 399 8.73 Monitoring the collectd log file ...... 400 8.74 Configuring the DOCKER_OPTS option for logging ...... 401 8.75 Docker logging tags ...... 402 8.76 Restarting Docker to enable logging ...... 403 8.77 Our Logstash Syslog receivers ...... 404 8.78 Grok parse failed Docker event ...... 405 8.79 Docker log message ...... 406 8.80 An updated Docker grok regex ...... 407 8.81 Updated date filter for Docker ...... 408 8.82 Restarting Logstash to enable Docker logging ...... 408 8.83 A properly parsed Docker log message ...... 409


8.84 Installing the Riemann Logstash output ...... 410 8.85 The Logstash metrics filter ...... 411 8.86 Logstash metrics event ...... 412 8.87 The Riemann output ...... 413 8.88 Logstash metric event in Rieman ...... 414 8.89 The logstash.clj configuration file ...... 415 8.90 Using the logstash var ...... 416 8.91 Adding a Riemann receiver ...... 417 8.92 Restarting Logstash for our Riemann events ...... 417 8.93 A Riemann event in Logstash ...... 418 8.94 A routing configuration for Logstash ...... 421 8.95 Creating the rsyslogd.conf file ...... 424 8.96 The rsyslogd.conf file ...... 424 8.97 Updating JMX for the LS_JAVA_OPTS variable ...... 425 8.98 The JMX enabling options ...... 425 8.99 Restarting Logstash to enable JMX ...... 426 8.100 The collectd JMX configuration file for Logstash ...... 426 8.101 The logstash_jmx.conf file ...... 427 8.102 An MBean ...... 429 8.103 GenericJMX Java memory heap metrics ...... 430 8.104 The Connections block ...... 431 8.105 Creating the logstash.conf file ...... 432 8.106 The logstash.conf file ...... 432 8.107 Restarting collectd to enable JMX ...... 433 8.108 Download the elasticsearch_collectd.py plugin ...... 434 8.109 Creating an elasticsearch.conf configuration file ...... 435 8.110 The elasticsearch.conf configuration file ...... 435 8.111 Editing the elasticsearch.conf file ...... 436 8.112 Updating the elasticsearch.conf file ...... 436 8.113 Restarting collectd to enable Elasticsearch monitoring ...... 437 9.1 An example metric name schema ...... 444


9.2 A sample payments method ...... 448 9.3 The StatsD protocol ...... 450 9.4 A typical StatsD metric ...... 450 9.5 A StatsD counter ...... 451 9.6 A StatsD timer ...... 452 9.7 StatsD timer calculations ...... 452 9.8 StatsD timer calculations ...... 453 9.9 StatsD timer percentile calculations ...... 453 9.10 A StatsD gauge ...... 453 9.11 Changing StatsD gauge values ...... 454 9.12 A StatsD set example ...... 454 9.13 The statsd.conf configuration file ...... 457 9.14 Revisiting the write_riemann plugin ...... 459 9.15 Revisiting the write_riemann plugin ...... 460 9.16 Restarting collectd to enable statsd ...... 460 9.17 Testing the statsd collectd client ...... 461 9.18 Our first statsd event in Riemann ...... 461 9.19 The aom-rails Gemfile ...... 462 9.20 Install statsd-ruby with the bundle command ...... 463 9.21 Testing statsd-ruby with the Rails console ...... 463 9.22 Our first Ruby-based StatsD client ...... 463 9.23 The basic statsd-ruby methods ...... 464 9.24 The statsd-ruby configuration initializer ...... 465 9.25 The statsd-ruby configuration initializer ...... 465 9.26 The statsd-ruby namespace method ...... 466 9.27 Counter for aom-rails user deletions ...... 466 9.28 The aom-rails.production.user.deleted event in Riemann ...... 467 9.29 Counter for aom-rails user creation ...... 468 9.30 The aom-rails.production.user.created event in Riemann ...... 468 9.31 Time the User.find action ...... 469 9.32 The user.find timer in Riemann ...... 469


9.33 The find.user 99th percentile ...... 470 9.34 Detecting missing aom-rails events ...... 471 9.35 Checking the user.find timer ...... 471 9.36 Rewriting our statsd rules ...... 472 9.37 Updated graphite.clj ...... 474 9.38 The graphite-path-statsd function ...... 475 9.39 A typical syslog message ...... 477 9.40 Unstructured log message example ...... 478 9.41 Structured log message example ...... 479 9.42 JSON encoded event ...... 479 9.43 Adding our logging gems to the aom-rails Gemfile ...... 481 9.44 Install the gems with the bundle command ...... 481 9.45 Adding logging to the Rails production environment ...... 482 9.46 Adding a new TCP input to Logstash ...... 483 9.47 Restarting Logstash for our Application events ...... 483 9.48 Traditional Rails request logging ...... 484 9.49 A Lograge request log event ...... 484 9.50 Logging deleted users ...... 485 9.51 A Logstash formatted event for a user deletion ...... 486 9.52 Custom application logs ...... 489 9.53 Adding a TCP input for our applications ...... 490 9.54 An unprocessed custom Logstash formatted event ...... 490 9.55 Adding a grok filter for our applications ...... 491 9.56 Creating the new patterns directory ...... 491 9.57 The /etc/logstash/patterns/app file ...... 492 9.58 Using the APP_TIMESTAMP pattern ...... 492 9.59 A processed custom Logstash formatted event ...... 493 9.60 Adding an API route to config/routes.rb ...... 497 9.61 The api_controller.rb controller ...... 498 9.62 The /api endpoint ...... 498 9.63 The curl_json.conf file ...... 499


9.64 Restarting collectd for curl_json ...... 500 9.65 A curl_json event from our sample application ...... 500 9.66 An example code deployment metric ...... 501 9.67 Adding the Riemann client to the aom-rails Gemfile ...... 502 9.68 Install the gem with the bundle command ...... 503 9.69 The deploy Rake task ...... 504 9.70 A sample deployment event ...... 505 9.71 Graphing our deployment events ...... 506 9.72 Reloading Riemann to graph deployment events ...... 506 10.1 Sample Nagios notification redux ...... 514 10.2 Our Riemann expiration configuration ...... 515 10.3 Our Riemann expiration configuration ...... 516 10.4 The original email.clj ...... 516 10.5 The :format and :body options ...... 518 10.6 The new mailer options ...... 518 10.7 The format-subject function ...... 519 10.8 The new subject line ...... 519 10.9 The new subject line applied ...... 519 10.10 The format-body function ...... 521 10.11 The header variables ...... 522 10.12 The format-body function ...... 523 10.13 The context function ...... 524 10.14 The lookup function ...... 525 10.15 The round function ...... 526 10.16 The byte-to-gb function ...... 526 10.17 The footer variable ...... 526 10.18 The new formatted email notification ...... 527 10.19 Creating a new scripted dashboard ...... 528 10.20 Querying the Graphite-API ...... 530 10.21 Adding CORS controls to Graphite-API ...... 530 10.22 Restarting Graphite-API for CORS ...... 531


10.23 Our riemann.js query parameters ...... 532 10.24 Our query construction variables ...... 533 10.25 The find_filter_values query function ...... 534 10.26 The Graphite-API metrics find query ...... 535 10.27 The panel_memory function ...... 536 10.28 The targets attribute ...... 537 10.29 The row_cpu_memory function ...... 538 10.30 Adding a dashboard to our context function ...... 540 10.31 A dashboard in context ...... 540 10.32 Our new graph-enhanced notification ...... 541 10.33 Adding the slack.clj file ...... 544 10.34 The slack notification configuration ...... 545 10.35 The Slack Webhook URL ...... 546 10.36 The slack-format function ...... 546 10.37 The Slack notification ...... 547 10.38 Our default argument ...... 547 10.39 The Slack formatter option ...... 548 10.40 Using the Slack notification ...... 549 10.41 Using the Slack notification default ...... 549 10.42 Adding the pagerduty.clj file ...... 550 10.43 The pagerduty.clj file ...... 551 10.44 The :details key with graph URL ...... 552 10.45 Using our PagerDuty notification ...... 554 10.46 The maintenance-mode? function ...... 556 10.47 The -» expression expanded ...... 557 10.48 Checking maintenance prior to notifications ...... 557 10.49 Installing the Riemann Ruby client ...... 558 10.50 Sending a Riemann maintenance event ...... 558 10.51 Disabling the maintenance window ...... 559 10.52 Installing the maintainer gem ...... 559 10.53 Using the maintainer gem to generate an active event ...... 559


10.54 The maintainer active event ...... 560 10.55 Using the maintainer gem to disable maintenance ...... 560 10.56 The maintainer active event ...... 560 10.57 Adding the count-notifications.clj file ...... 562 10.58 Adding the count-notifications function ...... 562 10.59 A memory usage check ...... 563 10.60 A memory usage check notification count ...... 564 10.61 A memory usage check reinjection ...... 565 10.62 Our notification count warning notification ...... 566 11.1 The write_riemann configuration from Chapter 5 ...... 575 11.2 Adding a new tag for Tornado ...... 576 11.3 Adding to an existing Tag ...... 576 11.4 The Riemann :tags field for Tornado ...... 577 11.5 The HAProxy configuration ...... 579 11.6 Adding the stats option to our HAProxy configuration ...... 580 11.7 Restarting the HAProxy service ...... 580 11.8 The /var/run/haproxy.sock socket ...... 581 11.9 Download the haproxy.py plugin ...... 581 11.10 Creating a haproxy.conf configuration file ...... 582 11.11 The haproxy.conf configuration file ...... 583 11.12 Restarting the collectd daemon for HAProxy ...... 584 11.13 HAProxy events in Riemann ...... 584 11.14 Updating HAProxy metrics in Riemann ...... 585 11.15 Reloading Riemann for HAProxy rewrites ...... 585 11.16 The HAProxy log configuration ...... 586 11.17 The updated HAProxy log configuration ...... 586 11.18 Restarting HAProxy to change log tags ...... 587 11.19 HAProxy log entry ...... 587 11.20 The HAProxy updated Logstash configuration ...... 589 11.21 Our new parsed HAProxy event ...... 591 11.22 Sending HAProxy events to Riemann ...... 593


11.23 A HAProxy event in Riemann ...... 595 11.24 Graphing our HAProxy events in Riemann ...... 596 11.25 The Nginx status page ...... 597 11.26 Confirming the status module is available ...... 597 11.27 The Nginx status page ...... 598 11.28 Restarting the Nginx service ...... 598 11.29 The Nginx status page ...... 599 11.30 Creating a nginx.conf configuration file ...... 599 11.31 The nginx.conf collectd configuration ...... 600 11.32 Restarting the collectd daemon for the Nginx plugin ...... 600 11.33 An Nginx plugin event in Riemann ...... 601 11.34 Create the nginx log configuration file ...... 603 11.35 The RSyslog Nginx configuration ...... 603 11.36 Restarting the RSyslog service for Nginx ...... 604 11.37 A Nginx access log entry ...... 605 11.38 Updated Nginx Logstash configuration ...... 606 11.39 Make the patterns directory ...... 607 11.40 The nginx pattern ...... 607 11.41 The parsed Nginx access log event ...... 608 11.42 Sending Nginx logs to Riemann ...... 609 11.43 A Nginx event in Riemann ...... 610 11.44 Graphing our HAProxy events in Riemann ...... 611 11.45 Matching multiple Nginx processes ...... 612 11.46 Added collectd monitoring to our Riemann configuration ..... 613 11.47 Creating the examplecom directory and tornado.clj file ...... 614 11.48 The tornado.clj file ...... 615 11.49 Requiring the Riemann functions ...... 616 11.50 The tornado checks function ...... 618 11.51 The splitp predicate and expression ...... 619 11.52 A splitp clause ...... 619 11.53 The webtier function ...... 621


11.54 Our first webtier check ...... 622 11.55 The tornado checks and notifications function ...... 623 11.56 Our second webtier check ...... 624 11.57 The check_ratio function ...... 625 11.58 Presenting our new check_ratio event ...... 627 11.59 Our second webtier check returned ...... 628 11.60 The haproxy.frontend.tornado-www.5xx_error_percentage event . 629 11.61 Added monitoring.services to Riemann configuration ...... 630 11.62 Sending our Tornado events to be checked ...... 630 12.1 Updating Tornado API to run JMX ...... 633 12.2 The JMX enabling options ...... 634 12.3 The collectd JMX configuration file for Tornado API ...... 634 12.4 The tornado-api.conf file ...... 635 12.5 An MBean ...... 636 12.6 GenericJMX Java memory heap metrics ...... 637 12.7 The Connections block ...... 638 12.8 Adding tornado-api process monitoring ...... 639 12.9 The RSyslog tornado-api configuration ...... 640 12.10 Restarting the RSyslog service for Tornado API ...... 641 12.11 A Timbre log entry from our Tornado API ...... 641 12.12 Our initially parsed Tornado API event ...... 642 12.13 Updated Tornado API Logstash configuration ...... 643 12.14 The tornado-api pattern ...... 644 12.15 Our updated Tornado API event ...... 645 12.16 Sending Tornado API events to Riemann ...... 646 12.17 The Tornado API request in Riemann ...... 647 12.18 Querying our dummy item ...... 648 12.19 Adding to the tornado-api.conf configuration file ...... 649 12.20 Restarting collectd for curl_json ...... 650 12.21 A curl_json event from the Tornado API ...... 650 12.22 Our instrumented Tornado API setup ...... 651


12.23 Our instrumented Tornado API buy-item method ...... 652 12.24 Hitting the buy-item API endpoint ...... 652 12.25 Our instrumented Tornado API update and sell item methods ... 653 12.26 The statsd.conf configuration file ...... 654 12.27 Restarting collectd to enable statsd ...... 654 12.28 A Tornado API StatsD metric in Riemann ...... 655 12.29 Rewriting the Tornado API events ...... 655 12.30 The tornado checks function ...... 657 12.31 The apptier function ...... 658 12.32 DRY’ed child streams ...... 660 12.33 Creating a rate ...... 660 12.34 The tornado.api.request rate ...... 661 13.1 Installing the python-mysqldb package ...... 664 13.2 Installing the mysql-python package ...... 665 13.3 Download the mysql.py plugin ...... 665 13.4 Creating a MySQL collect user ...... 666 13.5 The mysql.conf collectd configuration file ...... 667 13.6 A typical MySQL event in Riemann ...... 668 13.7 Rewriting the MySQL events ...... 669 13.8 Installing the libdbd-mysql package ...... 670 13.9 Installing the libdbi package ...... 670 13.10 The dbi plugin configuration ...... 671 13.11 Our price query ...... 673 13.12 The Query directives ...... 674 13.13 The items database events in Riemann ...... 674 13.14 Rewriting the dbi events ...... 675 13.15 Checking the PERFORMANCE_SCHEMA ...... 677 13.16 Enabling the PERFORMANCE_SCHEMA ...... 677 13.17 New query ...... 679 13.18 The Tornado API INSERT statement ...... 680 13.19 The INSERT event in Riemann ...... 680


13.20 The redis.conf collectd configuration ...... 682 13.21 A Riemann Redis event ...... 683 13.22 Rewriting the Redis events ...... 683 13.23 The tornado checks function ...... 685 13.24 The datatier function ...... 686 13.25 The create_rate function ...... 687 13.26 The INSERT query check ...... 688 13.27 The current value of items bought ...... 690 13.28 The Tornado 5xx error percentage ...... 692 13.29 The MySQL database insertion query timing ...... 692 13.30 Tornado API Requests Time 0.99 ...... 693 13.31 Tornado API Request Rate ...... 693 13.32 Tornado DB Tier CPU Usage ...... 694 13.33 Tornado App Tier CPU Usage ...... 694 13.34 Tornado Web Tier CPU Usage ...... 694 13.35 Tornado Swap Used ...... 694 13.36 Tornado Memory Used ...... 695 13.37 Tornado Load (shortterm) ...... 695 13.38 Tornado Disk used on the / partition ...... 695 13.39 Expanding Tornado ...... 696 13.40 Renaming the Tornado namespace ...... 697 13.41 Sending more than our Tornado events to be checked ...... 697 A.1 A declarative statement ...... 700 A.2 Getting lein ...... 702 A.3 Auto-installing lein ...... 702 A.4 Launching REPL ...... 703 A.5 The REPL shell ...... 703 A.6 Our first Clojure value ...... 703 A.7 Our first Clojure integer ...... 704 A.8 Our first Clojure string ...... 704 A.9 Our first Clojure Booleans ...... 704


A.10 The Clojure function syntax ...... 705 A.11 Our first Clojure function ...... 705 A.12 The fully qualified + function ...... 706 A.13 Unable to resolve symbol ...... 707 A.14 Quoting a symbol ...... 707 A.15 The type function ...... 707 A.16 A Clojure list ...... 708 A.17 An unquoted Clojure list ...... 708 A.18 Adding an element to a list ...... 709 A.19 Working with lists ...... 709 A.20 Creating a list ...... 710 A.21 A Clojure vector ...... 710 A.22 Adding an element to a vector ...... 710 A.23 Getting the last element in a vector ...... 711 A.24 Counting elements in a vector ...... 711 A.25 Using a vector as a function ...... 711 A.26 Creating or converting vectors ...... 712 A.27 A Clojure set ...... 712 A.28 Adding to a set ...... 713 A.29 Removing an element from a set ...... 713 A.30 Checking for a value inside a set ...... 713 A.31 Using the set as a function ...... 714 A.32 Making a set ...... 714 A.33 A Clojure map ...... 714 A.34 Getting a Clojure map value ...... 715 A.35 Getting a missing Clojure map value ...... 715 A.36 Getting a default value from a map ...... 715 A.37 Using a map as a function ...... 716 A.38 Using a keyword as a function ...... 716 A.39 Using assoc to add a key/value ...... 716 A.40 Replacing a key/value with assoc ...... 717


A.41 Removing a key/value with dissoc ...... 717 A.42 The str function ...... 718 A.43 Concatentating a string ...... 718 A.44 The inc function again ...... 718 A.45 The fn function ...... 719 A.46 Running our first fn function ...... 719 A.47 The fn function shortcut ...... 719 A.48 Calling the fn function shortcut ...... 720 A.49 Creating a var ...... 720 A.50 Evaluating a symbol ...... 721 A.51 Using the type function on the symbol ...... 721 A.52 Creating our first named function ...... 722 A.53 Calling our grow function ...... 722 A.54 Using the defn form ...... 722 A.55 Adding a second argument ...... 723 A.56 Calling our grow function again ...... 723 A.57 Calling grow with our second argument ...... 723 A.58 Adding a second argument ...... 724 A.59 Using the doc function ...... 724

Index

Annotations, 508 carbon-cache, 124 Ansible, 139, 202, 217, 247, 271, 501 carbon-relay, 124 Apache configuration, 140 mod_status, 601 consistent hashing, 147 Apache Lucene, 344, 383, 384 metrics distribution, 147 Apophenia, 36 protocols, 143 AppDynamics, 450 Redundancy, 191 Application architecture, 442 relay rules, 147 Application deployment, 501 RELAY_METHOD, 191 Application logs, 491 REPLICATION_FACTOR, 191 Application metrics, 440 carbon-cache, 124 Application monitoring, 440, 442 carbon-relay, 124 assoc, 281, 321 Carbonate, 196 Availability, 41 Cassandra, 206 Average, 31 Chef, 63, 139, 202, 217, 247, 306, 350, Averages, 43, 46 359, 501 Circonus, 300 Bell Curve, 43 Clock skew, 196, 367, 444 Blackbox, 23 Clojure, 19, 57, 67, 102, 699 Borgmon, 24 *, 722 Business metrics, 440 +, 705 Campfire, 546 apply, 519 Capistrano, 501 assoc, 281, 321, 332, 552, 623, 716 Carbon, 124


cond->, 321 macro, 556 condp, 281, 475, 623 map, 519, 522, 714 conj, 708, 710, 712 merge, 331 contains?, 713 namespaces, 104, 721 count, 711 ns, 104 declare, 517 nth, 709 def, 105, 258, 546, 720 quoting, 707 default arguments, 547 ratio, 523 defn, 259, 722 re-matches, 619 deftest, 113 replace, 259, 267 destructuring, 329 require, 616, 630 disj, 713 second, 709 dissoc, 717 set, 712, 714 doc, 724 Shell, 271 double, 523 StatsD, 651 first, 709 str, 321, 547, 717 float, 627 string, 518 fn, 282, 718 strings, 520 format, 519 style, 725 function ordering, 517 symbols, 706 functions, 548 type, 721 get, 715 types, 703 hash, 714 var, 105, 720 homoiconicity, 706 vec, 711 if-let, 330 vector, 710 info, 616 walk, 328 join, 522 CMDB, 542 last, 711 collectd, 210, 211, 423, 497 Leiningen, 701 -h, 215, 217 let, 332, 622, 627 Apache, 601 list, 705, 708 cgroups, 340


CheckThresholds, 221 Logstash logging, 222 configuration, 223 memory, 224, 226, 272 configuration management, 247 ValuesPercentage, 226 cpu, 224, 225, 271 MySQL, 665 ReportByCpu, 225 Nginx, 599 ValuesPercentage, 225 nginx, 599 curl, 497 notifications, 613 curl_json, 647 Per-plugin intervals, 235 curl_json, 497 Perl, 311 curl_xml, 497 ping, 496 data sources, 252 plugins, 211 dbi, 669 processes, 224, 233, 338, 432, 584, df, 224, 227, 272 613 Dev, 228 Process, 235 IgnoreSelected, 229 ProcessMatch, 235 MountPoint, 228 protocols, 224, 232 ValuesPercentage, 228, 229 Python, 310, 582 Elasticsearch, 434 Redis, 681 Exec, 310 Required version, 210 GenericJMX, 426, 634 statsd, 455, 457, 653 help, 215 DeleteCounters, 458 Include, 223–225 DeleteGauges, 458 installation, 214 DeleteSets, 458 interface, 224, 231 DeleteTimers, 458 IgnoreSelected, 231 flush interval, 458 Interface, 231 TimerCount, 458 Interval, 220 TimerLower, 458 Java, 311, 634 TimerPercentile, 458 load, 224, 232 TimerSum, 458 LoadPlugin, 235 TimerUpper, 458 logfile, 222 swap, 224, 229


ValuesPercentage, 229, 230 Curator, 384 tags, 575 Cyanite, 206 template configuration, 219 D3, 205, 300 threshold, 222, 223 DataDog, 300 thresholds, 238 deftest, 113 types.db, 314 Deployment, 501 write_riemann, 455 DevOps, 13 write_riemann, 244 Docker, 19, 63, 139, 202, 217, 302, 303, CheckThresholds, 245 350, 359 Multiple nodes, 246 /etc/default/docker, 401 Node, 244 /etc/sysconfig/docker, 401 Tag, 247 API, 308 TTLFactor, 245 cgroups, 340 Writing plugins, 434, 582, 665 docker command, 308 CollectM, 299 DOCKER_OPTS, 401 condp, 281 labels, 325 Configuration Management, 460 logging, 400 Configuration management, 59, 60, 62, tags, 402 63, 69, 129, 133, 139, 175, 202, metadata, 325 214, 217, 219, 223, 224, 246, Downtime scheduling, 560 247, 271, 306, 311, 350, 356, 359, 419, 422, 501, 530, 542, Elasticsearch, 19, 343, 344, 355, 371, 575, 622, 649, 700 449, 485 Consul, 696 –version, 360 Container clustering, 362, 363 logging, 400 Curator, 384 monitoring, 302 DEB, 356 Containers, 302 discovery, 364 Count, 31 document, 383 Counters, 30 index, 383 credis, 681 introduction, 344


nodes, 385 allowed_origins, 530 packages, 356 Annotations, 508 Plugins, 365 dashboard scripting, 528 replica, 385 Installation, 136 RPM, 356 Introduction, 183 scaling, 423 Playlists, 190 shard, 384 scripted dashboard examples, 542 primary, 384 Scripted dashboards, 190 replica, 384 Sharing, 190 status, 360 Templating, 190 template, 383 Granularity, 28 unicast, 363 Graph, 29 ELK, 19, 343 Graphite, 19, 57, 123 Endpoints, 442, 494 alias, 691 EPEL, 130 aliasByNode, 295, 537 Error rates, 270 API, 125, 542 Events, 74 architecture, 126 External monitoring, 494 Carbon, 124, 140 Extra Packages for Enterprise Linux, carbon.conf, 140 130 changing resolution, 149 Functions, 298 Fabric, 271, 542 Graphite Web, 125 Filebeat, 398 installation, 129 Flapping, 266 line protocol, 143 Frequency distribution, 32 logging, 182 Functional programming, 699 metric format, 181 Ganglia, 120, 300 metric path, 181 Gauges, 29 Performance, 125 GenericJMX, 426, 634 pickle protocol, 143, 144 Git, 116 plaintext, 143 Grafana, 19, 57, 125, 287, 508 plaintext protocol, 143


protocols, 143 Instrumentation, 441 Redundancy, 191 Java, 57, 347, 356, 357 Render graphs, 542 JVM, 345 resolution, 148 Java Management Extensions, 424 retention, 149 JavaBean, 636 Storage schemas, 149, 181, 220 Jiffies, 225 sumSeriesWithWildcards, 691 JMX, 424, 633 Whisper, 124, 149, 220 Jordan Sissel, 343 Graphite Web, 125 JRuby, 345 Graphite-API, 125, 133 JSON, 344, 354, 416, 483, 497 Configuration, 162 jstat, 425 Documentation, 163 JVM, 57 finders, 164 Managed Bean, 636 functions, 164 MBean, 636 Installation, 133 Graylog, 438 Kamon, 425 grok, 590 keepalived, 422 Kibana, 19, 343, 344, 386, 485 HAProxy, 195, 422, 577, 578 kibana.yml, 388 HTTP log format, 587, 594 logging, 578 Latency, 53 Logs, 587 Leiningen, 116, 701 statistics, 578 libdbi, 670 Health checks, 442, 494 Librato, 300 Heka, 438 Log4j, 639 Hindsight, 438 logger, 371 Hipchat, 546 Logging, 342 Histogram, 32 Lograge, 480, 482 Host monitoring, 208 Logs, 26 Logstash, 19, 222, 343, 449, 578, 587, index, 75 607 InfluxDB, 206


-f, 353 patterns_dir, 492 -v, 353 grok patterns, 491, 492, 590 @timestamp, 355 input @version, 355 tcp, 417, 482, 483, 489, 490 _grokparsefailure, 405 inputs, 352 architecture, 346 installation, 348 beats, 398 introduction, 343 brokers, 422 JSON, 416 codec, 354, 416 JSON codec, 354 json, 417 Load balancing, 422 conditionals, 375 message, 355 configtest, 353 metrics, 449 configuration, 351 Nginx, 602 Custom grok patterns, 492 output Custom logs, 491 riemann, 410, 592, 610 date, 382 tcp, 422 elasticsearch, 371 outputs, 352 events, 372 plain codec, 354 field reference, 414 plugin, 410 file, 371 plugins, 353 filter RELP, 395 date, 590 Riemann, 413 grok, 491, 492, 588, 590, 607, riemann, 592, 610 642 Riemann output, 410 metrics, 412 scaling, 419 syslog_pri, 590 sincedb, 370 filtering, 375 stdin plugin, 353 filters, 352 stdout plugin, 353 Graphite, 414 Syslog, 391 grok, 376, 405, 449 syslog, 393 helpers, 380 syslog_pri, 380


udp, 393 Slack, 543 Logstash-logger, 480 NTP, 196, 367, 444 Logtash installation, 199 tcp, 393 ntpq, 203

Maintainer, 559 Observations, 28 Maintenance, 560 Observer Effect, 54 Managed Beans, 428, 636 OpsGenie, 549 MBean, 633 OpsWeekly, 567 MBeans, 428, 636 PagerDuty, 111, 551 MCollective, 271, 542 Percentiles, 31, 50, 52, 276 Mean, 43 Plot, 29 Median, 31, 47, 52 plugins Metric resolution, 148 file, 398 Metrics, 26, 447 Postal, 106 latency, 53 Prometheus, 24 Microsoft Windows, 214, 299 Pull-based architecture, 23 Munin, 120, 300 Puppet, 63, 139, 202, 217, 247, 306, MySQL 350, 359, 501, 542 performance_schema, 676 PuppetDB, 542, 619, 696 nc, 461 Push-based architecture, 23 netcat, 461 Rails, 462 Network monitoring, 214, 299 logger, 482 New Relic, 299, 450 logging, 480 Nginx, 596, 597, 602 Lograge, 480 error_log, 602 metrics, 462 log_format, 601 StatsD, 462 stub_status, 597 Rake, 501 Notifications, 33, 513 Rates of change, 31 Email, 516 Redis, 317 PagerDuty, 549


Reimann failover, 118, 246 REPL, 701 filtering streams, 87 REPL, 701 fixed-time-window, 275 Resolution, 28 folds, 275 Rieman folds/count, 276 email, 268 folds/maximum, 276 Riemann, 19, 56, 58, 410, 415 folds/mean, 276 /var/log/riemann/riemann.log, 79 folds/median, 275 adjust, 267, 623 forwarding events, 92 AGGRESSIVE_OPTS, 118 high availability, 118 async-queue, 92, 177 HOWTO, 287 asynchronous streams, 92 HUP, 73 by, 265, 275, 283, 339, 553, 563, index, 75, 78, 100, 525, 557 565 installation, 59 C client, 83 JMX, 120 call-rescue, 282, 284, 627 Leiningen, 701 changed, 266 logging, 65, 69, 80, 81 changed-state, 266, 552, 623 logstash plugin, 415 client, 92 lookup, 525 Clients, 502 mailer, 102, 517 Clojure DSL, 67 Measuring exceptions, 270 configuration, 64, 68 namespacing, 104 connecting servers, 91 network, 69 dashboard, 66 ns, 104 default, 78, 100 PagerDuty, 111, 551 email, 102, 517 percentiles, 284 event, 74 Performance, 117 events, 519 ports, 69 expired, 268, 613 Postal, 106 expired stream, 108 prn, 80 EXTRA_JAVA_OPTS, 117 project, 626


quotient, 626 testing, 113 quotient-sloppy, 626 throttle, 109 rate, 563, 661, 687 TLS, 71 reinject, 565 tools, 65 REPL, 72 TTL, 76, 245 require, 93 UDP, 245 rollup, 110 where, 87, 96, 108, 270, 272, 332, Ruby client, 502 471, 553, 565, 626, 630, 687 scaling, 118, 246 with, 565, 660, 687 sdo, 339, 622, 629 Zookeeper, 285 search, 525 RSyslog, 395, 587 Service discovery, 285 imfile, 398, 602 SIGHUP, 73 Selmer, 527 Slack, 111, 545 Semantic logging, 441, 478 smap, 260, 278, 320, 332, 339, 340, sendmail, 106 626 Server monitoring, 208 splitp, 619 Service discovery, 285 SSL, 71 Slack, 111, 543 stable, 266 SNMP, 299 state, 238 Splunk, 438 Streams, 75 SSC Serv, 299 streams, 78 Standard Deviation, 31, 50 syntax checking, 116 State detection, 266 tag, 460, 563 StatsD, 300, 447, 449, 450, 651 tagged, 87, 252, 253, 259, 263, 264, Counters, 451 286, 565 Gauges, 453 tagged-all, 87 Metric types, 451 tagged-any, 87, 697 namespaces, 466 tags, 247, 460, 575 Protocol, 450 tap, 113 Ruby-client, 462 TCP, 245


Sampling, 452 Calculator, 153, 336 Sets, 454 Capacity planning, 153, 336 Timers, 451 Database file size, 153, 336 statsd-ruby, 462 directory structure, 182 Streams, 75 Performance, 125 Structured logging, 441, 478, 480 Resize, 337 Sum, 31 whisper-info, 155 Syslog, 391, 394, 477, 590 Whitebox, 23 facility, 395 Windows, 214 severity, 395 XML, 497 Thresholds, 39, 42 Zipkin, 512 Throttle, 109 Zookeeper, 285, 542, 618, 622, 696 timbre, 639 Time, 196, 367, 444 ntpq, 203 Time series, 27 Time zone, 196, 367, 444 Red Hat, 199 Ubuntu, 197 time-at, 523 Timers, 30 Tracing, 512 Typed logging, 478

Vagrant, 63, 139, 202, 217, 350, 359 VictorOps, 549 Visualization, 36

Websockets, 69 Whisper, 124 Alternatives, 205

Thanks! I hope you enjoyed the book.

© Copyright 2016 - James Turnbull