Monitoring the Backend and Logging Improvements for Operations and Support

27th EGOWS, Helsinki, 20th – 22nd Sep 2016, Marcus Werner, DWD

Overview
- Stakeholders and an enhancement request from operations & support
- Goals, ideas & challenges
- Extract, transform, reduce, store and present support information
- Good (free) technology stack available
- Non-existing standards

EGOWS 2016 - DWD

The meteorological workstation system - NinJo
- Developed by an international consortium
- Primary tool for our operational forecasters
- …and they need a stable and well-supported system (7 x 24h)

… typically forecasters ask for meteorological improvements

Technical Infrastructure staff (TI) operates NinJo …
- …now we have implemented some of their feature requests…
- … this is ultimately in the interest of the forecasters.

Our stakeholders
- End users: forecasters (work with meteorological data)
- Support and operations (work with monitoring systems, log files, admin tools)
- Adopting from IT operations (7x24h)
  - 1st level support (on call)
  - 2nd level support (office hours)
  - 3rd level support (office hours), might be external

Challenges
- Different support levels need specific "tailored" support information
- DWD operates ~200 NinJo servers (Routine, Dev/Eval, QA + special systems)
- Highly distributed environment over multiple DWD locations
- Up to ~75 server processes per NinJo site
- Resource limitations for support and operations (split into multiple teams)

Application developers vs. operators / 1st level support
- Operators & 1st level support take care of "many" applications, so they naturally have limited inside knowledge
- 3rd level support / application developers are highly specialized in the application; they know all the details
- TI admins do not like to work with application-specific graphical tools (they work in the shell)
- Log and monitoring information has to fulfil the needs of all involved support teams

Goals and ideas (I)
- Move towards "pro-active" monitoring
- Reduce reaction times
- Lower technical hurdles for administrators
- Increase the automation level (parsing, correlation)
- Use similar tools and methods for applications (for us NinJo) as used for data centre & infrastructure components
- Standardize the log content

Goals and ideas (II)
- Each support level should get the "right" information (& not more)
- Support daily operations as well as the time during / after changes (software version, data, hardware, firmware, network, configuration)
- Different kinds of log information need to be kept for specific retention times (a few days, up to weeks, months and years)
- We want to keep log data for all operational NinJo sites for 5 - 7 days (enough to cover a long weekend)
- Implement easier access to log / monitoring data (incl. history)

Monitoring vs. log message information (I)
- A monitoring application is like a […] (restricted to OK, Warning, Error)
- Log files contain most (or all) details
- Log files can be used as input for monitoring applications
- Monitoring applications can be fed via scripts (cron'ed + push)
- Monitoring applications can collect information from multiple machines / processes (pull)

Monitoring vs. log message information (II)
- Figure labels: Dump, Transport, Filter, Correlate, Transform, Clean

Standard operational monitoring via Nagios
- Most NinJo consortium members use Nagios (or a clone) for operational monitoring
- A Nagios service monitors one specific parameter (e.g. CPU load of server X)
- Typical views: (1) summary, (2) services by problem level, (3) services for a machine

NagVis – allows summarizing over NinJo sites
- Graphically restrict the view to the important environments

Monitoring summary: thread count of 52 processes OK
- Columns shown: 1) service name, 2) current value, 3) warn-level threshold, 4) critical-level threshold

Critical error, 1/203 imports failed, but 28/29 services OK

CPU load (model import process, getting data every 6h)

Memory (model import process, history over 3 days)

Classic logging (I)
- Processes / applications write text log files ("event messages") to a local file system or a network file system
- Typical log files include different log levels: INFO, WARN, ERROR, DEBUG
- Log messages are often free text
- In case of problems, logs might contain stack traces
  - Not useful for operators & 1st level support
  - Crucial for developers / 3rd level support
- Can easily fill the "available" log space within minutes

Classic logging (II)
- Each application (or component, resp. developer) might use its own way of logging
  + log level assignment
  + syntax
  + content
  + granularity
- Management of log data
  - Amount (under sufficient load, you cannot afford to run at DEBUG level all the time)
  - The log files are available for a limited time only (rolling behaviour: limited by file size or time)
  - Possibly end-of-day (EOD) compression and collection jobs
  - Single file vs. multiple files for each process
  - Content-specific files (using different formats)

Classic logging (III)
- SAMPLE: A certain amount of "application specific" knowledge is required to read and understand logs correctly…
- 4 elem. vs. no elem. imported: OK for a developer, but for an operator / administrator ????

Log event types (grade of implementation – subj. view)
- Negative events (software or resource errors)
- Positive events (successfully processed)
- Performance events (it took X sec to do action Y)
- Errors "that are not bad, but daily business" (e.g. organisation Z always sends corrupt data)
- "Static" log information
  - Software version installed
  - Extensions / plugin components installed and started
  - Configuration available / used
- Correlated "single" events (across machines / processes)

Logs - operating systems / network devices & security
- Network devices / HW logging
  - SNMP and syslog
- Standard tools for the Unix / Windows world
  - Syslog(s) / Windows Event Logs
- Security-related logging
  - Firewalls / intrusion detection
  - Network / resource management
  - User audit
  - Compliance / forensics
  - SIEM – Security Information and Event Management

Linux / *nix logging methods
- On Linux / *nix OSes, syslog is the standard logging mechanism
- Syslog is a daemon and a protocol, specified in RFC 3164 & RFC 5424
- Multiple implementations exist
  - sysklogd
  - journald (as part of systemd)
- syslog works locally and over the network
- syslog has a rich ecosystem / technology stack around it
- A lot of sources and destinations are supported, also filters & transformations
- NinJo 2.0 uses […], and that supports syslog

Application logging….
- Technical: libraries / logging / monitoring frameworks (dependent on your language)
  - Java: LOG4J2, SLF4J, Java Logging API, Logback, …, JMX, jmxtrans, …
  - C/C++: log4c, log4cxx, Boost.Log v2, …
  - Python: logging (standard library)
- Non-technical
  - Methods / best practices for developers / implementation guidelines
  - Books (rare; recommended: "Logging and Log Management")
  - Logging / events industry standards (finalized / accepted and used?)
    - Stopped all work
  - Headers, payload: structure with key & value, XML, JSON?, parsing vs. readable

Overview of our monitoring & logging types (selected tools)

Syslog destinations
A selection of possible destinations:
- Simple text files
- ELK stack (Splunk)
- NAGIOS
- Graphite
- Octopussy (picture)
- NoSQL DB
- RDBMS
- Key-value storage – Redis
- Report generators (e.g. BIRT, Crystal Reports)

ELK: open-source quasi-standard for logfile analysis
ELK stack is an acronym for the combination of the three open-source tools Elasticsearch, Logstash, and Kibana. Elasticsearch is a search engine based on Lucene, a high-performance full-text search library. Logstash is a log pipeline tool that accepts input from various sources, such as log files. It can filter and enrich the data and export the information to Elasticsearch and other targets. Kibana is a web interface for Elasticsearch, providing searches and visualizations (e.g. the sample on the next slide).

A sample of a multi-metric graph in Kibana

Conclusions
- Technical side for monitoring & logging
  - A lot of good tools are available (even for free)
  - A lot of things are possible, but complex configuration is required (a lot of small design decisions tbd)
  - Limit the parameters to observe (human / machine / storage)
  - You need to be careful: it is easy to build up something as complex as the application environment itself (which requires even further maintenance)
- Content (is the challenge)
  - Nothing around that can be adopted as-is
  - Possible to adopt some ideas from here and there
  - Difficult to get and agree on good application-specific parameters

Contact:
Marcus Werner
Referat FE ZE
Frankfurter Str. 135
63037 Offenbach
E-Mail: [email protected]
Tel.: +49 (0) 69 / 8062 - 2076
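
Example sketch: key & value log payload
The slides name "standardize the log content" as a goal and mention a key & value payload structure as one option. The following is only a minimal sketch of that idea using Python's standard logging library, which the deck lists. The helper names format_event and build_logger, the logger name "ninjo.import.sketch" and all field names are hypothetical illustrations; they are not NinJo's actual log format or API.

```python
import io
import logging


def format_event(event_type, **fields):
    """Render one event as 'type=... key=value ...' with keys in sorted order.

    Hypothetical helper: the event types (positive, negative, performance)
    follow the "Log event types" slide, the layout itself is an assumption.
    """
    parts = [f"type={event_type}"]
    for key in sorted(fields):
        value = str(fields[key])
        # Quote values that contain spaces so a parser can split on blanks.
        if " " in value:
            value = '"' + value.replace('"', "'") + '"'
        parts.append(f"{key}={value}")
    return " ".join(parts)


def build_logger(stream, name="ninjo.import.sketch"):
    """Create a logger that writes 'LEVEL name message' lines to a stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Performance event: "it took X sec to do action Y".
log.info(format_event("performance", action="model_import", duration_s=12.5))
# Negative event: a software or resource error.
log.error(format_event("negative", action="model_import", reason="file not found"))

print(buffer.getvalue(), end="")
```

In a real deployment the StreamHandler could be swapped for logging.handlers.SysLogHandler from the same standard library, so that such lines reach one of the syslog destinations listed above. Which keys to log is exactly the content question raised in the conclusions and is left open here.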
