Monitoring the Backend and Logging Improvements for operations and support

27th EGOWS, Helsinki, 20th – 22nd Sep 2016, Marcus Werner, DWD

Overview

• Stakeholders and an enhancement request from operations & support

• Goals, ideas & challenges

• Extract, transform, reduce, store and present support information

• Good (free) technology stack is available

• Standards do not exist

The meteorological workstation system - NinJo

• Developed by an international consortium

• Primary tool for our operational forecasters

… and they need a stable and well-supported system (7 x 24h)

… typically forecasters ask for meteorological improvements

Technical Infrastructure staff (TI) operates NinJo …

… now we have implemented some of their feature requests …

… this is ultimately in the interest of the forecasters.

Our stakeholders

• End users: forecasters (work with meteorological data)

• Support and operations (work with monitoring systems, log files, admin tools)

  • Adopting from

  • IT operations (7x24h)

  • 1st level support (on call)

  • 2nd level support (office hours)

  • 3rd level support (office hours), might be external

Challenges

• Different support levels need specific “tailored” support information

• DWD operates ~200 NinJo servers (Routine, Dev/Eval, QA + special systems)

• Highly distributed environment over multiple DWD locations

• Up to ~75 processes per NinJo site

• Resource limitations for support and operations (split into multiple teams)

Application developers vs. operators / 1st level support

• Operators & 1st level support take care of “many” applications, so they naturally have limited inside knowledge of each one

• 3rd level support / application developers are highly specialized in the application and know all the details

• TI admins do not like to work with application-specific graphical tools (they prefer the shell)

• Log and monitoring information has to fulfil the needs of all involved support teams

Goals and ideas (I)

• Move towards “pro-active” monitoring

• Reduce reaction times

• Lower technical hurdles for administrators

• Increase automation level (→ parsing, correlation)

• Use similar tools and methods for applications (for us NinJo) as used for data centre & infrastructure components

• Standardize the log content

Goals and ideas (II)

• Each support level should get the “right” information (& not more)

• Support daily operations as well as the periods during / after changes (software version, data, hardware, firmware, network, configuration)

• Different kinds of log information need to be kept for specific retention times (a few days, up to weeks, months and years)

• We want to keep log data for all operational NinJo sites for 5 - 7 days (enough to cover a long weekend)

• Implement easier access to log / monitoring data (incl. history)

Monitoring vs. log message information (I)

• A monitoring application reduces information to a few states (restricted to OK, Warning, Error)

• Log files contain most (or all) of the details

• Log files can be used as input for monitoring applications

• Monitoring applications can be fed via scripts (cron'ed + push)

• Monitoring applications can collect information from multiple machines / processes (pull)

Monitoring vs. log message information (II)

Dump

Transport

Filter

Correlate

Transform

Clean
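The processing steps named above can be illustrated with a small generator pipeline. This is only a sketch: the line layout (timestamp, host, process, level, message), the sample lines, and all function names are assumptions made for illustration, not the NinJo log format.

```python
from collections import Counter

# Numeric ranking of the log levels, used for filtering.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


def transform(lines):
    """Turn raw text lines into dicts; drop lines that do not match the layout."""
    for line in lines:
        parts = line.strip().split(" ", 4)
        if len(parts) == 5 and parts[3] in LEVELS:
            ts, host, process, level, message = parts
            yield {"ts": ts, "host": host, "process": process,
                   "level": level, "message": message}


def filter_min_level(events, minimum="WARN"):
    """Keep only events at or above the given level."""
    for event in events:
        if LEVELS[event["level"]] >= LEVELS[minimum]:
            yield event


def correlate(events):
    """Count events per (process, level) across all hosts."""
    return Counter((e["process"], e["level"]) for e in events)


# Hypothetical input collected from two hosts.
raw = [
    "2016-09-20T06:00:01 host1 model_import INFO 203 files found",
    "2016-09-20T06:00:07 host1 model_import ERROR import failed",
    "2016-09-20T06:00:09 host2 model_import ERROR import failed",
    "garbage line",
]
summary = correlate(filter_min_level(transform(raw)))
```

Because each stage is a generator, lines are processed one at a time, which matters when the dump is large.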

Standard operational monitoring via Nagios

• Most NinJo consortium members use Nagios (or a clone) for operational monitoring

• A Nagios service monitors one specific parameter (e.g. CPU load of server X)

• Typical views: (1) summary, (2) services by problem level, (3) services for a machine

NagVis – summarizes the status across NinJo sites

• Graphically restrict the view to the important environments

Monitoring summary: thread count of 52 processes OK

1) Service name
2) Current value
3) Warn-level threshold
4) Critical-level threshold
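How a current value is compared with the warn and critical thresholds can be sketched as follows. The exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL) and the `label=value;warn;crit` performance data layout follow the Nagios plugin convention; the service name and threshold values are hypothetical.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(value, warn, crit):
    """Map a measured value to a state (higher values are worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_check(service, value, warn, crit):
    """Return (exit_code, one-line plugin output with performance data)."""
    state = classify(value, warn, crit)
    line = f"{service} {LABELS[state]} - {value} | {service}={value};{warn};{crit}"
    return state, line


# Hypothetical service and thresholds, for illustration only.
code, text = format_check("thread_count", 52, 80, 100)
print(text)
```

A real plugin would end with `sys.exit(code)` so that the monitoring system can read the state from the process exit status.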

Critical error: 1/203 imports failed, but 28/29 services OK

CPU load (model import process, getting data every 6h)

Memory (model import process, history over 3 days)

Classic logging (I)

• Processes / applications write text log files (“event messages”) to a local file system or a network file system

• Typical log files include different log levels: INFO, WARN, ERROR, DEBUG

• Log messages are often free text

• In case of problems, logs might contain stack traces

  • Not useful for operators & 1st level support

  • Crucial for developers / 3rd level support

  • Can easily fill the “available” log space within minutes

Classic logging (II)

• Each application (or component, or even developer) might use its own way of logging

  + log level assignment
  + syntax
  + content
  + granularity

• Management of log data

  • Amount (under sufficient load, you cannot afford to run at DEBUG level all the time)

  • The log files are available for a specific time only (rolling behaviour: limited by file size or time)

  • Possibly end-of-day (EOD) compression and collection jobs

  • Single file vs. multiple files for each process

  • Content-specific files (using different formats)
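The rolling behaviour described above (limited by file size or by time) can be sketched with the handlers from Python's standard `logging` package. The logger name, file names, size limit and backup counts are assumptions chosen for illustration; seven daily backups roughly match a 5 - 7 day hold time.

```python
import logging
import logging.handlers
import os
import tempfile


def build_logger(log_dir, name="ninjo.import"):
    """Create a logger with one size-based and one time-based rolling file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # DEBUG stays off under normal load

    by_size = logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, "import.log"),
        maxBytes=1_000_000,   # roll over after about 1 MB
        backupCount=5,        # keep import.log.1 ... import.log.5
    )
    by_time = logging.handlers.TimedRotatingFileHandler(
        os.path.join(log_dir, "import-daily.log"),
        when="midnight",      # start a new file each day
        backupCount=7,        # keep about one week of history
    )
    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    for handler in (by_size, by_time):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger


# Hypothetical usage in a temporary directory.
log_dir = tempfile.mkdtemp()
logger = build_logger(log_dir)
logger.debug("not written at INFO level")
logger.info("import finished")
```

Old files beyond `backupCount` are deleted by the handler itself, so no separate clean-up job is needed for this simple case.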

Classic logging (III)

• SAMPLE: A certain amount of “application-specific” knowledge is required to read and understand logs correctly …

“4 elements imported” vs. “no elements imported”: clear to a developer, but to an operator / administrator?

Log event types (degree of implementation – subjective view)

• Negative events (software or resource errors)

• Positive events (successfully processed)

• Performance events (it took X sec to do action Y)

• Errors “that are not bad, but daily business” (e.g. organisation Z always sends corrupt data)

• “Static” log information

  • Software version installed

  • Extensions / plugin components installed and started

  • Configuration available / used

• Correlated “single” events (across machines / processes)
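A performance event of the form "it took X sec to do action Y" can be produced with a small context manager. This is a minimal sketch; the logger name `perf`, the `action=… duration_s=…` message layout and the in-memory handler are assumptions for illustration.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("perf")
log.setLevel(logging.INFO)


@contextmanager
def timed(action, logger=log):
    """Log how long the wrapped block took, also when it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("action=%s duration_s=%.3f", action, elapsed)


class ListHandler(logging.Handler):
    """Collect messages in memory (stand-in for a file or syslog handler)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


capture = ListHandler()
log.addHandler(capture)

# Hypothetical action name.
with timed("model_import"):
    time.sleep(0.01)
```

Writing the duration as a key/value pair keeps the message readable for humans while remaining easy to parse.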

Logs - operating systems / network devices & security

• Network devices / HW logging

  • SNMP and

• Standard tools for the Linux / Windows world

  • Syslog(s) / Windows Event Logs

• Security-related logging

  • Firewalls / intrusion detection

  • Network / resource management

  • User audit

  • Compliance / forensics

  • SIEM – Security Information and Event Management

Linux / *nix logging methods

• On Linux / *nix OSes, syslog is the standard logging mechanism

• Syslog is a daemon and a protocol, described in RFC 3164 & RFC 5424

• Multiple implementations exist

  • sysklogd

  • journald (as part of systemd)

• syslog works locally and over the network

• syslog has a rich ecosystem / technology stack around it

• A lot of sources and destinations are supported, also filters & transformations

• NinJo 2.0 uses logging components that support syslog
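As a worked example of the protocol side, RFC 5424 encodes facility and severity in one priority value, PRI = facility × 8 + severity, placed in angle brackets at the start of each message. The sketch below builds such a line by hand; host name, application name and message text are hypothetical, and structured data is left at the nil value `-`.

```python
def syslog_priority(facility, severity):
    """PRI value as defined by RFC 5424: facility * 8 + severity."""
    if not 0 <= facility <= 23:
        raise ValueError("facility must be in 0..23")
    if not 0 <= severity <= 7:
        raise ValueError("severity must be in 0..7")
    return facility * 8 + severity


def rfc5424_message(facility, severity, timestamp, host, app, text,
                    procid="-", msgid="-"):
    """Build one RFC 5424 line; "-" is the nil value, structured data is omitted."""
    pri = syslog_priority(facility, severity)
    return f"<{pri}>1 {timestamp} {host} {app} {procid} {msgid} - {text}"


LOCAL0, ERROR = 16, 3  # facility local0, severity "error"

# Hypothetical host, application and message.
line = rfc5424_message(LOCAL0, ERROR, "2016-09-20T06:00:07Z",
                       "host1", "model_import", "import failed")
```

In practice an application would rather use an existing handler, such as `logging.handlers.SysLogHandler` in Python, than format the lines itself.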

Application logging …

• Technical: libraries / logging / monitoring frameworks (depending on your language)

  • Java: Log4j 2, SLF4J, Java Logging API, Logback, …, JMX, jmxtrans, …

  • C/C++: log4c, log4cxx, Boost.Log v2, …

  • Python: logging (standard library)

• Non-technical

  • Methods / best practices for developers / implementation

  • Guidelines

  • Books (rare, recommended: “Logging and Log Management”)

• Logging / event industry standards (finalized / accepted and used?)

Stopped all work

• Headers, payload: structure with key & value, XML, JSON? Parsing vs. readable
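The trade-off between machine-parseable and human-readable payloads can be illustrated with a formatter that writes one JSON object per log line. This is a sketch under assumptions: the field names (`time`, `level`, `logger`, `message`) and the `fields` attribute used for extra key/value pairs are invented for the example and are not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional key/value pairs attached to the record as `fields`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


# Hypothetical record: an import that delivered 0 of 4 expected elements.
record = logging.LogRecord("ninjo.import", logging.ERROR, "example.py", 0,
                           "import failed", None, None)
record.fields = {"imported": 0, "expected": 4}

line = JsonFormatter().format(record)
parsed = json.loads(line)
```

With explicit keys such as `imported` and `expected`, a monitoring tool can apply thresholds without parsing free text, at the cost of lines that are less pleasant to read in a shell.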

Overview of our monitoring & logging types (selected tools)

Syslog destinations

A selection of possible destinations:

• Simple text files

• ELK stack (Splunk)

• Nagios

• Graphite

• NoSQL DB

• RDBMS

• Key-value storage – Redis

• Report generators (e.g. BIRT, Crystal Reports)

ELK: open-source quasi-standard for log file analysis

• ELK stack is an acronym for the combination of the three open-source tools Elasticsearch, Logstash, and Kibana.

• Elasticsearch is a search engine based on Lucene, a high-performance full-text search library.

• Logstash is a log pipeline tool, accepting input from various sources, such as log files. It is able to filter and enrich the data and to export the information to Elasticsearch and other targets.

• Kibana is a web interface for Elasticsearch, providing searches and visualizations (e.g. sample next page).

A sample of a multi-metric graph in Kibana

Conclusions

• Technical side of monitoring & logging

  • A lot of good tools are available (even for free)

  • A lot of things are possible, but complex configuration is required (a lot of small design decisions are still to be made)

  • Limit the parameters to observe (→ human / machine / storage)

  • You need to be careful: it is easy to build something as complex as the application environment itself (which requires even further maintenance)

• Content (is the challenge)

  • Nothing around that can be adopted as-is

  • Possible to adopt some ideas from here and there

  • Difficult to find and agree on good application-specific parameters

Contact:

Marcus Werner
Referat FE ZE
Frankfurter Str. 135
63037 Offenbach

Tel.: +49 (0) 69 / 8062 - 2076
