OpenBMC - Platform Telemetry

Neeraj Ladkani – [email protected]

Open Source Firmware Conference 2019 Cloud Telemetry Conundrums

• The rise and rapid evolution of data analytics, AI and machine learning workloads have significant impact on cloud hardware design.

• Commercial Cloud Infrastructure requires high availability and need state of art telemetry to build and predict failsafe models.

• BMC role has evolved from legacy hardware management service to central intelligent controller serving cloud control plane operations. Specialization with Standardization

• Processors • Power Supply • Processors errors and CPU Crash dump • Fault history • Memory • Energy storage attributes • Memory Correctable and uncorrectable errors • Consumption history • IO • BMC • PCIe Correctable and uncorrectable errors • SMART data for disks • Firmware Stats • Add on cards and custom silicon • Request and Response history • Thermal data • BMC CPU/Memory/Flash stability • Vendor specific telemetry • Mainboard HW • Host Subsystem • Hot Swap Controller Faults • OS heartbeat • Network link status • Voltage Regulator Faults Objective

• Standardize telemetry model

• Design a configurable BMC telemetry and health monitoring framework for OpenBMC platforms ( hardware, thermal, power, BMC and custom )

• Provide a generic interface to remotely access the metric data using both a push and pull model. Possible Solutions

• Custom Daemons for every subsystem and custom IPMI/Redfish to push telemetry information • Use native binary blobs and OEM URIs

• Custom methods to specify telemetry parameters like metric definition, sensing interval, specifying triggers Telemetry Collection Subsystem

• Use “” for collecting metrics.

• “collectd” plugins can be written or provided by subsystem owners to collect metrics ( Hardware as Service).

• Integrating IPMI and Redfish subsystems with collectd using intermediate translation services.

• Supports aggregation of metrics data, which enables space-efficient storage of data. Redfish Telemetry Model

• Use Standard Redfish telemetry model (Credit : Paul Vancil )

• Flexible, extendible and complete for OpenBMC client interfaces

• Supports push (Redfish event model) and pull model ( Event logs)

• Supports Triggers for specific scenarios Redfish Telemetry – Sample Metric Report

Source: https://www.dmtf.org/documents/redfish-spmf/redfish-telemetry-white-paper-010a Get Involved

• Workgroup call ( Bi-weekly) https://github.com/openbmc/openbmc/wiki/Platform-telemetry-and-health-monitoring-Wor k-Group

• Community requirements https://docs.google.com/spreadsheets/d/12gMMXB9r_WfWDf5wz-Z_zXsz6RNheC6p2LKp7H ePAEE/edit?usp=sharing

• Design proposals https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/22257

https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/23758

https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/24357 OpenBMC Metrics Collection

Proposal + Progress on Collectd Integration

Kun Yi ([email protected]) Open Source Firmware Conference 2019 Context: What are "metrics"?

● "a degree to which a software system or process possesses some property" -- Wikipedia ● Timeseries data ● Sensor value is a good example for BMC systems ● Metrics enable monitoring system data at scale such as: ○ How much performance improves across the fleet when the BMCs are updated? ○ How many times do BMCs report thermal throttling on a group of machines running a heavy load? Context: Characteristics of Metrics

Characteristics Metric Log Event

Generally numeric Yes No Maybe

Time Interval Regular Irregular Irregular

Urgent No No Maybe

Target Automation Human Human/Automation

Impact of losing a data point Low Medium High

Examples CPU load dmesg/kmesg catastrophic GPIO memory usage systemd journal sensor value over threshold disk usage rsyslog system disk full daemon restart count ...... system uptime ... Context: Characteristics of Metrics

Characteristics Metric Log Event

Generally numeric Yes No Maybe

Time Interval Regular Irregular Irregular

Urgent No No Maybe

Target Automation Human Human/Automation

Impact of losing a data point Low Medium High

Examples CPU load dmesg catastrophic GPIO memory usage systemd journal sensor value over threshold disk usage rsyslog system disk full daemon restart count ...... system uptime ...

Existing OpenBMC collection Adhoc solutions systemd journal IPMI SEL rsyslog Redfish events Redfish logging Context: Collectd and RRDTool

● Collectd [1] ○ Metrics collection daemon ○ Written in C ○ Highly configurable ○ Over 100 plugins available ○ Supports various data formats including CSV and RRD ○ Used in OpenWRT ● RRDTool [2] ○ Based on RRD format ○ Includes shared library "libRRD" Context: Round Robin Database (RRD)

● Stores data in a circular buffer ● Automatically aggregates data according to configuration ● Constant size

Primary Data Point

Updates PDP PDP PDP PDP Consolidation Consolidated Data Point Function CF

Round Robin CDP CDP CDP . . . Archive (RRA) RRD Format Illustration

Credit: Gabriel Matute Design: Requirements

● Must be able to persist certain critical metrics ● Resource-friendly ○ Trade-off between storage and amount of data to persist ○ Persist only the important data ○ External program can scrape from BMC frequently ● Common, simple interface for instrumenting System Diagram (Illustrative) Progress

● Created Proof-of-concept ○ Collect BMC load and memory usage using Collectd plugins ○ Use OEM IPMI command to transfer data to the host ○ Host translates data to feed into other collection frameworks ● Preliminary study on resource consumption ○ Default bitbake recipe for rrdtool includes too many dependencies Progress: Resource Consumption

● Tested based on OpenBMC 2.7, ARMv7a ● Image size ○ By default rrdtool recipes includes , python, graphic libs.. ○ Building default rrdtool+collectd takes >7MB of flash space after xz compression ○ Building the minimally required recipe trims it down to 2.6MB ● CPU/Memory ○ With a few metrics being collected, memory consumption is ~4.8MB ○ CPU usage is ~1% ○ Will increase with the number of metrics being collected ● RRD file size ○ 23KB for 1 metric updated every 30s and kept for a day Future

● Configurable RRDtool recipe to drop unnecessary dependencies ● More code into librrd+ (librrd C++ wrapper) ● Look into generating events ● Look into tagging metrics ○ RRD file has no intrinsic string meta fields ○ "Collectd is moving (slowly but calmly) towards implementing arbitrary key/value attributes attached to each value. " ● Redfish Telemetry Metric Report ○ Current proposal of JSON definition [3] References

[1] Collectd: https://github.com/collectd/collectd [2] RRDtool: https://github.com/oetiker/rrdtool-1.x [3] DMTF Redfish API JSON definition: https://redfish.dmtf.org/schemas/v1/MetricReportDefinition.v1_2_0.json Credits

Gabriel Matute for his awesome work as an intern! Questions?