Monitoring Systems and POWER5/6 LPARs with Ganglia

Michael Perzl – [email protected]

Agenda

- Ganglia – what is it?
- Ganglia components and data flow
- An introduction to RRDTool
- Ganglia metrics – what can be measured?
- New POWER5/6 metrics (AIX & Linux)
- Extending Ganglia with gmetric
- Add device specific information to Ganglia
- Ganglia network communication
- Installation issues
- Where to get Ganglia for AIX and Linux on POWER?
- Best practices
- Future additions / plans
- Discussion
- Links

Ganglia – what is it?

Ganglia – what is it? (1/3)

- Ganglia is an Open Source cluster performance monitoring tool and has been extended to include POWER5/6 features like shared processor LPARs, entitlement, physical CPU usage etc.

This session covers:
  – the technical details of Ganglia and the POWER5/6 extensions
  – how to set it up and use it to monitor all LPARs in a single machine and lots of machines

Ganglia – what is it? (2/3)

Ganglia properties:
- scalable distributed monitoring system for high-performance computing systems such as clusters and grids
- based on a hierarchical design targeted at federations of clusters
- relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state
- leverages widely used technologies such as
  – XML for data representation
  – XDR (eXternal Data Representation) for compact, portable data transport
  – RRDtool for data storage and visualization
- uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency
- robust implementation
- Open Source, written in C
  – Downloaded 110,000+ times, 145+ countries, 500+ clusters, 2000+ nodes

Ganglia – what is it? (3/3)

Ganglia properties (cont.):
- has been ported to an extensive set of operating systems and processor architectures:
  – AIX
  – Darwin
  – FreeBSD
  – HP-UX
  – IRIX
  – Linux
  – OSF
  – NetBSD
  – Solaris
  – Windows (via Cygwin)
- is in use on more than 500 clusters around the world
- has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000+ nodes
  – check http://ganglia.info/ for more details

Ganglia components and data flow

Ganglia components

The Ganglia system consists of:
- two unique daemons:
  – Ganglia Monitoring Daemon (gmond)
    • monitoring daemon, collects the metrics
    • runs on each node
  – Ganglia Meta Daemon (gmetad)
    • polls all gmond clients and stores the collected metrics in Round-Robin Databases (RRDs)
- a PHP-based web frontend
- a few other small utility programs
  – gmetric
    • can be used to easily extend Ganglia with additional user-defined metrics
  – gstat
  – gexec

Ganglia – Schematic View

[Figure: schematic view of Ganglia. From "Ganglia: Past, Present and Future" by Matt Massie, URL: http://ganglia.info/talks/lug_lbl_talk/]

Ganglia Architecture

[Figure: Ganglia architecture diagram]

Ganglia Monitoring Daemon (gmond)

- Ganglia Monitoring Daemon (gmond) is a multi-threaded daemon which runs on each cluster node you want to monitor. Installation is easy:
  – just the daemon and a configuration file (/etc/gmond.conf)
- gmond has four main responsibilities:
  1. monitor changes in host state
  2. announce relevant changes
  3. listen to the state of all other Ganglia nodes via a unicast or multicast channel
  4. answer requests for an XML description of the cluster state
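The fourth responsibility, answering requests for an XML description of the cluster state, can be tried from a shell. The following is a minimal sketch and not part of the original slides. It assumes that gmond accepts TCP connections on port 8649 (the commonly used default, adjustable in /etc/gmond.conf), that `nc` is installed, and that each HOST element in the XML starts on its own line with NAME as its first attribute. The function names `gmond_xml` and `list_hosts` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: dump the cluster-state XML that a gmond serves over TCP.
# Assumption: gmond listens on TCP port 8649 unless configured otherwise,
# sends the XML as soon as a client connects, and then closes the connection.
gmond_xml() {
    host="${1:-localhost}"
    port="${2:-8649}"
    nc "$host" "$port"
}

# Hypothetical helper: read Ganglia XML on stdin and print one host name per line.
# Assumption: every <HOST ...> element starts on its own line and carries
# NAME as its first attribute.
list_hosts() {
    sed -n 's/.*<HOST NAME="\([^"]*\)".*/\1/p'
}
```

A possible invocation, with a made-up host name, would be `gmond_xml lpar1.example.com | list_hosts`. Because every gmond holds the state of its whole cluster, asking a single node should be enough to list all hosts.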
- Each gmond transmits information in two different ways:
  – unicasting or multicasting host state in external data representation (XDR) format using UDP messages
  – sending XML over a TCP connection

Ganglia Meta Daemon (gmetad) (1/2)

- Ganglia Meta Daemon (gmetad) is a daemon which typically only runs on one specific cluster node, or on more when using a staged setup. Installation is easy:
  – just the daemon and a configuration file (/etc/gmetad.conf)
- Federation in Ganglia is achieved using a tree of point-to-point connections amongst representative cluster nodes to aggregate the state of multiple clusters. At each node in the tree a gmetad
  – periodically polls a collection of child data sources
  – parses the collected XML
  – saves all numeric volatile metrics to round-robin databases
  – exports the aggregated XML over a TCP socket to clients

Ganglia Meta Daemon (gmetad) (2/2)

- Data sources may be either
  – gmond daemons, representing specific clusters, or
  – other gmetad daemons, representing sets of clusters
- Data sources use source IP addresses for access control
  – Multiple IP addresses can be specified for failover
  – The capability is natural for aggregating data from clusters since each gmond daemon contains the entire state of its cluster

Ganglia PHP web frontend (1/2)

Web frontend properties:
- provides a view of the gathered information via real-time dynamic web pages
- displays Ganglia data in a meaningful way for system administrators and users
  – For example, one can view the CPU utilization over the past hour, day, week, month, or year
  – The web frontend shows similar graphs for memory usage, disk usage, network statistics, number of running processes, and all other Ganglia metrics

Ganglia PHP web frontend (2/2)

Web frontend properties (cont.):
- depends on the existence of gmetad, which provides it with data from several Ganglia sources
- opens the local port 8651 (by default) and expects to receive a Ganglia XML tree
- the web pages themselves are highly dynamic; any change to the Ganglia data appears immediately on the site
  – This behavior leads to a very responsive site, but requires that the full XML tree be parsed on every page access
  – Therefore, the Ganglia web frontend should run on a fairly powerful, dedicated machine if it presents a large amount of data
- is written in the PHP scripting language and uses graphs generated by gmetad to display history information
- has been tested on many flavors of Unix (primarily Linux) with the Apache web server and the PHP 4.1 module

Ganglia - data flow (1/4)

[Diagram: one gmond daemon per node/LPAR, configured by /etc/gmond.conf, reads the operating system performance stats through an API. Legend: file access, network, web.]

Ganglia - data flow (2/4)

[Diagram: adds gmetad, which runs on the web server, is configured by /etc/gmetad.conf, and uses rrdtool to maintain the database of statistics; a browser is shown as the consumer.]

Ganglia - data flow (3/4)

[Diagram: adds the Ganglia frontend (FE) scripts, served to the browser by Apache2 + PHP5.]

Ganglia - data flow (4/4)

[Diagram: adds the gmetric user command, which feeds additional metrics into gmond.]

Ganglia - data flow again

[Diagram: several gmond daemons, one per node/LPAR and each with its own /etc/gmond.conf, feed a single gmetad instance (/etc/gmetad.conf) located with the web server; gmetad stores the statistics with rrdtool, and PHP scripts under Apache2 + PHP5 present them to the browser.]

An introduction to RRDTool

RRDTool

- Homepage: http://oss.oetiker.ch/rrdtool/
- RRD is the acronym for Round-Robin Database. RRD is a system to store and display time-series data (e.g., network bandwidth, machine-room temperature, server load average). It stores the data in a very compact way that will not expand over time (fixed size of DB), and it presents useful graphs by processing the data to enforce a certain data density. It can be used either via simple wrapper scripts (from shell or Perl) or via frontends that poll network devices and put a friendly user interface on it.

RRDTool is the industry standard tool to store and display time-series data!

RRDTool example graph

[Graph taken from http://oss.oetiker.ch/rrdtool/gallery/index.en.html]

The graph shows inbound and outbound call traffic going in and out of the switch via the 6 trunks connected to the Diamond exchange. Inbound traffic is shown as positive and uses a lowest-free fill method. Outbound traffic is shown as negative and uses a distributed fill method. Tech details on RRDtrac.
RRDTool example

    # rrdtool create test.rrd \
        --start 920804400 \
        --step 300 \
        DS:km:COUNTER:600:U:U \
        RRA:AVERAGE:0.5:1:24

    # rrdtool update test.rrd 920804700:12345 920805000:12357 920805300:12363
    # rrdtool update test.rrd 920805600:12363 920805900:12363 920806200:12373
    # rrdtool update test.rrd 920806500:12383 920806800:12393 920807100:12399
    # rrdtool update test.rrd 920807400:12405 920807700:12411 920808000:12415
    # rrdtool update test.rrd 920808300:12420 920808600:12422 920808900:12423

    # rrdtool graph kilometer.png \
        --start 920804400 \
        --end 920808000 \
        DEF:mykm=test.rrd:km:AVERAGE \
        LINE2:mykm#FF0000

Ganglia metrics – what can be monitored?

Metrics

Definition of a metric:
- A metric is a certain observed property of the system.

Number of metrics:
- 34 standard metrics, i.e., metrics defined on all platforms
- Additional platform-dependent metrics available
  – Solaris
    • 8 additional metrics available
  – HP-UX
    • 4 additional metrics available
  – AIX
    • 18 additional new metrics available for POWER5/6
    • details later
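As a worked illustration of the RRDTool example, recall that a COUNTER data source stores the rate of change per second between consecutive updates instead of the raw counter value. The sketch below recomputes those rates with plain shell and awk, without calling rrdtool. It is a simplified assumption-laden model: the function name `counter_rates` is hypothetical, and the sketch ignores counter wrap-around, the 600-second heartbeat, and the interpolation rrdtool applies when updates do not fall exactly on step boundaries.

```shell
#!/bin/sh
# Hypothetical helper: read "timestamp:value" samples (space or newline
# separated) on stdin and print "timestamp rate" for every sample after the
# first, where rate = (value - previous value) / (timestamp - previous timestamp).
# Simplification: no wrap-around, heartbeat, or step interpolation handling.
counter_rates() {
    tr ' ' '\n' | awk -F: '
        NF != 2 { next }                      # skip empty or malformed fields
        seen    { printf "%d %.4f\n", $1, ($2 - v) / ($1 - t) }
                { t = $1; v = $2; seen = 1 }  # remember the previous sample
    '
}
```

For the first update line of the example, `echo "920804700:12345 920805000:12357 920805300:12363" | counter_rates` yields 0.0400 and 0.0200, since the counter advanced by 12 and then by 6 within 300-second intervals.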
