NADOG Self-Service Monitoring through versioned Infrastructure and Configuration as Code

February 24th, 2021

ABOUT ME

Carlos Munoz Robles Global e2e Monitoring Lead

[email protected] https://www.linkedin.com/in/carlosmur

What I love about my current position as Global e2e Monitoring Lead is the chance to foster the DevOps culture across Allianz and to make an impact on our developers' day-to-day work by offering them a state-of-the-art solution that makes the DevOps principles easier to apply.

© Copyright Allianz

MONITORING LEGACY

COMPLEX MONITORING ECOSYSTEM

[Diagram: the existing monitoring tool landscape across Allianz entities and platforms. Areas shown include DCS (on-prem, public cloud, mainframe), AGCS, AZ AT, AZ Italy, AZ UK, AZ FR, AZ Tech, network devices, Oracle Exadata, FileNet, Active Directory, SAP, workplace, Hadoop and security. Tools shown include Nagios, Icinga, Zabbix, ZENOSS, Tivoli/Netcool/NetView, vROps, CloudWatch, Azure automation, Prometheus/Grafana, ELK, BAC, Dynatrace, AppDynamics, New Relic, SCOM, SAP Solution Manager, Oracle Cloud Control, Nexthink, AirWatch, Cloudera, netscout, ArcSight, Splunk and Qualys, plus custom scripts. Some entries are marked as planned or as RFP/PoC ongoing.]

MOTIVATION

The limitations in monitoring…

• Too many outages with impact on customers and their business
• No correlation between technology areas and no visualization in a single pane of glass that covers the entire vertical stack
• Complex technology landscape with isolated monitoring focused on specific service areas (Accenture, IBM, …)
• Mostly reactive alerting, only once the issue occurs
• Impact on quality, cost and customer satisfaction

… have been tackled with the implementation of an e2e monitoring

• Increase service quality by continuously improving reliability and stability
• Represent monitoring information in multidimensional, simplified and customized dashboards
• Apply predictive analytics to enable automated actions for incident handling and prevention
• Improve root cause analysis and incident resolution time by correlating all components
• Centrally consolidate all monitoring information and connect to the CMDB

COMMON DISADVANTAGES

• PANE OF GLASS? Complex and granular tool stack
• ROLLOUTS? Are done manually
• CONFIGURATION? Configuration is not traceable
• OVERLAP? Different tools to cover the same use cases
• AUTOMATION? Automation is missing

BREAKDOWN INTO SUBSTREAMS

Establish e2e monitoring as a service

Full-Stack Application & Infrastructure Monitoring (Cloud / On-prem), from top to bottom:

• End-User Experience
• Application Performance Monitoring (APM)
  o Business Transactions
  o Application
  o Middleware
  o Database
• Infrastructure Monitoring
  o Operating System
  o Server (BM / VM / CTR)
  o Storage, Backup, etc.
  o Network

GLOBAL E2E MONITORING – 1ST ITERATION

[Diagram: LOGICAL & CLOUD INFRASTRUCTURE MONITORING. Labels shown:]

• Multi-tenancy managed by Management zones (MZ): AZ Tech MZ, AZD MZ, OEx MZ, Shared services MZ
• EVENTS from NetCool, BAC and vROps: ship events to the Central Event Management
• GCCC: Dashboards, Event Management & Support (24/7)
• ITOM (Service Discovery & Mapping): forward critical events, create new incident
• CMDB (dependencies between components): get list of components affected, enrichment of CIs

DESIRED STATE

• Maintainable
• Understandable
• Reproducible
• Traceable
• Versioned
• Failure recovery
  o Roll-back configuration
  o Understand what happened
• Avoid manual rollouts

SELF-SERVICE PORTAL

More than a simple ordering tool

SELF-SERVICE TOOLCHAIN

[Diagram: toolchain components shown: User, Self-Service, Inventory, Personal Configuration, Jenkins, Playbook / Module, Monitoring.]

PERSONAL CONFIGURATION – DATA MODEL

Who is a user in context of MonitoringAsAService?

Enterprise structure: map enterprise structure to customer inventory

• top level: Operational Entity (67)
• sub level: Team (Service)
  - team name
  - cost-center id
  - users/members (developers/owners/admins)

A user corresponds to an entry of the teams of an oe.

Each user's Agent must be tagged with: --set-host-group --

Example inventory:

    stage: prod
    tenant_id: grh73865
    oes:
      az-it:
        teams:
          absi:
            cost-center: 1********8
            user-administration:
              admin-users:
                - [email protected]
            applications:
              - name: cisl
                domain:
                  - cisl.allianz.it
                id: '623262027030619284'
      az-de:
        teams:
          actuarialplatforms:
            cost-center: 1********3
            user-administration:
              admin-users:
                - [email protected]
            applications:
              - name: unify_lh
                domains:
                  - unifylife-prod-eu
                id: '4953220645034211986'
          pki:
            # (truncated on the slide)
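The mapping from operational entity to team to user can be sketched in a few lines of Python. This is only an illustration of the data model above, not the toolchain's actual code: the class names, the `find_user` helper, the e-mail addresses and the cost-center values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Team:
    # Sub level of the enterprise structure: a team (service).
    cost_center: str
    admin_users: List[str] = field(default_factory=list)
    applications: List[str] = field(default_factory=list)


@dataclass
class OperationalEntity:
    # Top level of the enterprise structure: an operational entity (oe).
    teams: Dict[str, Team] = field(default_factory=dict)


def find_user(inventory: Dict[str, OperationalEntity], user: str) -> List[Tuple[str, str]]:
    """Return every (oe, team) pair in which `user` is listed as an admin user."""
    return [
        (oe_name, team_name)
        for oe_name, oe in inventory.items()
        for team_name, team in oe.teams.items()
        if user in team.admin_users
    ]


# Hypothetical inventory; addresses and cost centers are placeholders.
inventory = {
    "az-it": OperationalEntity(teams={
        "absi": Team("cc-0001", ["alice@example.com"], ["cisl"]),
    }),
    "az-de": OperationalEntity(teams={
        "actuarialplatforms": Team("cc-0002", ["bob@example.com"], ["unify_lh"]),
    }),
}
```

Looking a user up then yields the team entry the user corresponds to, for example `find_user(inventory, "alice@example.com")`.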

AFTER 6 MONTHS…

PREPRODUCTION PRODUCTION

• 316 applications

• 27199 services

• 66135 processes

• 7392 hosts

• 18 datacenters

• 2145 users onboarded

PERSONAL CONFIGURATION

WHY DO WE NEED TO TALK ABOUT THIS?

User has full control over Agent properties:
- logical grouping of hosts (by adding names/tags)
- stage selection (prod/dev)
- turn monitoring on/off
- switch between full-stack/infra-only monitoring

Examples shown on the slide: dev, backend prod, frontend prod, az-tech-globalmonitoring-eag

What about monitored entities not configurable via the Agent? These are for example:
- applications
- synthetic monitors ("faked" network access to apps)
- Cloud APIs
- processes/services running on unmonitored hosts

Usually only configurable via the Web GUI.

What about data privacy?
- users want to keep alerts/problems/root causes private
- users want or even have to restrict their metrics' visibility (e.g. request logs) → GDPR
- users want to hide their infrastructure (number of hosts, computation resources, databases, …)
- sensitive credentials must be kept 100% secret!
- user's admin team may want to define maintenance windows where specific alerting is turned off

Permission issue

Dynatrace's architecture provides
- neither GUI access
- nor API access
limited by visibility filters.

Solution

Block settings GUI/API access.
Make everything invisible by default for all users!
Only explicitly set visibility filters can reveal information to explicitly onboarded user groups.

ConfigurationAsCode + GitOps + Automation = MonitoringAsAService
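The "invisible by default" rule amounts to a default-deny check. The following Python sketch shows the idea only; it is not how Dynatrace management zones are implemented, and the function name, entity names and group names are hypothetical.

```python
from typing import Dict, List, Set


def visible_entities(
    user_groups: Set[str],
    entities: List[str],
    filters: Dict[str, Set[str]],
) -> List[str]:
    """Default deny: an entity is visible only if a filter explicitly
    names at least one of the user's groups for it."""
    return [e for e in entities if filters.get(e, set()) & user_groups]


# Hypothetical entities and explicitly set visibility filters.
entities = ["host-a", "host-b", "db-1"]
filters = {
    "host-a": {"team-absi"},
    "db-1": {"team-absi", "team-actuarial"},
}
```

`host-b` has no filter at all, so no group can see it; a user without any onboarded group sees nothing.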

CONFIGURATION AS CODE? GITOPS?

Treat the application's configuration as if it were code.

CONFIGURATION AS CODE?

Advantages

• Human readable
• Comprehensible
• Transparent
• Versionable (Git)
• Automatable

Target groups

• Managers
• Service owners
• DevOps engineers
• Developers
• Operators

WHAT DO YOU NEED TO MONITOR?

Hosts
• VMs
• Container/Pods
• Bare Metal
• Mainframe

Cloud Infrastructure
• VMs, EC2
• Load Balancers
• Auto Scaling Groups
• Storage, RDB, Lambda, ...

Services
• DBs
• Web Services
• Micro Services
• Citrix

Applications
• Availability
• Response Times
• Load Times
• User Load

PERSONAL CONFIGURATION – MANAGEMENT ZONE

• Alerts & Notifications
• Cloud Platforms
• Credentials
• External Metrics Sources
• (Agent) Lifecycle Management
• Tags
• User Feedback

PERSONAL CONFIGURATION – KUBERNETES CLUSTER

PERSONAL CONFIGURATION – ENCRYPTION SERVICE

https://cac.e2e-mon.ec1.aws.aztec.cloud.allianz

PERSONAL CONFIGURATION - SECRET SUBSTITUTION

cac_credentials_manager.py: content of prod-mz-az-tech-globalmonitoring/credentials/

private_key: private counterpart to cac.e2e-mon.ec1.aws.aztec.cloud.allianz

    from pathlib import Path
    from typing import List, Optional

    def read_files(self, files: List[Path], private_key, passphrase: Optional[str] = None):
        # credential key -> list of (source file, value) occurrences
        parsed_keys = dict()

        for credentials_file in files:
            # Each credentials file holds encrypted, PEM-encoded content.
            with open(credentials_file, "r") as f:
                pem_content = f.read()

            decrypted_content = CMSDecryptor.decrypt(pem_content, private_key, passphrase)
            parsed_yaml = self.__parse_credentials_yaml(decrypted_content)

            for key, value in parsed_yaml.items():
                if key not in parsed_keys:
                    parsed_keys[key] = list()
                parsed_keys[key].append((credentials_file, value))

        for key, occurrence_list in parsed_keys.items():
            # The first occurrence of a key determines its source file.
            file, value = occurrence_list[0]
            self.__credential_sources[key] = file
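The bookkeeping in `read_files` (collect every occurrence of a key, then take the first one and remember its source file) can be tried out without encryption. The sketch below is a simplified stand-in, not the talk's implementation: `merge_credentials`, its return values and the file names are hypothetical, and it works on already-decrypted mappings instead of calling `CMSDecryptor`.

```python
from typing import Dict, List, Tuple


def merge_credentials(
    sources: Dict[str, Dict[str, str]],
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, List[str]]]:
    """Merge already-decrypted credential mappings, keyed by source file name.

    Returns (values, origin, duplicates):
      values     - credential key -> secret from the first file defining it
      origin     - credential key -> file name the secret was taken from
      duplicates - credential key -> all defining files (keys defined twice or more)
    """
    occurrences: Dict[str, List[Tuple[str, str]]] = {}
    for file_name, content in sources.items():
        for key, value in content.items():
            occurrences.setdefault(key, []).append((file_name, value))

    values: Dict[str, str] = {}
    origin: Dict[str, str] = {}
    duplicates: Dict[str, List[str]] = {}
    for key, occurrence_list in occurrences.items():
        file_name, value = occurrence_list[0]  # first occurrence wins
        values[key] = value
        origin[key] = file_name
        if len(occurrence_list) > 1:
            duplicates[key] = [name for name, _ in occurrence_list]
    return values, origin, duplicates
```

Keeping the list of all occurrences makes it possible to report a key that is defined in more than one credentials file instead of silently overwriting it.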

cac_credentials_manager.py: content of prod-mz-az-tech-globalmonitoring/kubernetes/

    import re
    from typing import Dict

    def __substitute_credentials(self, local_configurations: Dict[str, Dict]):
        # Search for "{{ <str> }}" and capture <str> as credential_ref_key,
        # e.g. it matches  authToken: {{ eks_ec1_bearertoken }}
        substitution_pattern = re.compile(r"^\s?{{\s?(?P<credential_ref_key>[^\s]+)\s?}}\s?$")

        for key in local_configurations.keys():
            configuration = local_configurations.get(key)

            for credential_key in self._get_credential_keys(configuration):
                if credential_key in configuration:
                    reference = configuration[credential_key]

                    credential_match = substitution_pattern.match(reference)
                    if credential_match is None:
                        # Not a "{{ ... }}" reference: leave the value untouched.
                        continue
                    credential_ref_key = credential_match.group("credential_ref_key")
                    credential_secret = self.__credentials_manager.get_credential(credential_ref_key)

                    # Replace the reference with the decrypted secret.
                    configuration[credential_key] = credential_secret
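A self-contained version of the placeholder substitution can be run on its own. This is a minimal sketch under assumptions: the secrets come from a plain dictionary rather than the credentials manager, and the field name `endpointUrl`, the URL and the token value are made up for illustration.

```python
import re
from typing import Dict

# Matches a value that consists only of "{{ name }}" and captures the name.
PLACEHOLDER = re.compile(r"^\s?\{\{\s?(?P<credential_ref_key>[^\s]+)\s?\}\}\s?$")


def substitute_credentials(configuration: Dict[str, str], secrets: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `configuration` in which every "{{ key }}" value
    is replaced by secrets[key]; other values are kept unchanged."""
    resolved = {}
    for field_name, value in configuration.items():
        match = PLACEHOLDER.match(value) if isinstance(value, str) else None
        if match is None:
            resolved[field_name] = value
        else:
            # Raises KeyError if the referenced credential is unknown.
            resolved[field_name] = secrets[match.group("credential_ref_key")]
    return resolved


# Hypothetical Kubernetes cluster configuration and secret store.
config = {
    "endpointUrl": "https://k8s.example.invalid",
    "authToken": "{{ eks_ec1_bearertoken }}",
}
secrets = {"eks_ec1_bearertoken": "dummy-token"}
resolved = substitute_credentials(config, secrets)
```

Returning a copy keeps the versioned configuration free of secrets; only the resolved copy that is sent to the monitoring API contains them.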

PERSONAL CONFIGURATION – GITOPS WORKFLOW

[Diagram: GitOps workflow, built up step by step]

• git push
• web hook trigger
• dedicated config job: git clone ..., git clone ...
• ActiveGate Cluster: copy certificates, update trusted.jks

1 Jenkins server with 1 set of credentials/token for
• all customers' config jobs
• all stages
• all global config jobs
• full scalability
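How a web hook trigger might be routed to a dedicated config job can be sketched as follows. This is an assumption-laden illustration: it supposes that repository names follow the pattern `<stage>-mz-<management zone>` (as in `prod-mz-az-tech-globalmonitoring`), and the `ConfigJob` type, the function and the `config-...` job naming are hypothetical, not the actual Jenkins setup.

```python
from typing import NamedTuple


class ConfigJob(NamedTuple):
    stage: str
    management_zone: str
    job_name: str


def job_for_repository(repository: str) -> ConfigJob:
    """Derive the dedicated config job from a repository name of the
    assumed form '<stage>-mz-<management zone>'."""
    stage, separator, zone = repository.partition("-mz-")
    if not separator or not stage or not zone:
        raise ValueError(f"unexpected repository name: {repository!r}")
    # Hypothetical job naming scheme, one job per stage and management zone.
    return ConfigJob(stage, zone, f"config-{stage}-{zone}")


job = job_for_repository("prod-mz-az-tech-globalmonitoring")
```

With one job per repository, a push only re-applies the configuration of the management zone that changed.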

PERSONAL CONFIGURATION - RESULT

Transformation is a journey that doesn’t really end

WHAT’S NEXT?

[Diagram: planned extensions around Event Management, the CMDB and the Metrics / Events / Logs data. Labels shown:]

• Continuous delivery: Quality Gates, Application Performance Monitoring, SLI/SLO
• Business Analytics (BA): monitoring and tracking the performance of BA, enrichment of CIs, propagate critical events to BA, handle real-time data from BA
• Data Science: analyse data, real-time applications, streaming data sources, Data Lake storage, BI and analytics tools, dynamic dashboards
• Specialized Security Monitoring: detect security anomalies, transfer raw data to the Data Lake, expose full historical data set for forensic analysis
• CMDB: query dependencies between components
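A quality gate in continuous delivery compares measured results (SLIs) with target objectives (SLOs). The sketch below illustrates that comparison only; the metric names and thresholds are invented, and it assumes every objective is an upper bound, which will not hold for all metrics.

```python
from typing import Dict, List


def evaluate_quality_gate(sli: Dict[str, float], slo: Dict[str, float]) -> List[str]:
    """Return the names of violated objectives; an empty list means the gate passes.

    Every objective is treated as an upper bound (lower measured values are
    better). A missing measurement counts as a violation.
    """
    return [
        name
        for name, limit in slo.items()
        if name not in sli or sli[name] > limit
    ]


# Hypothetical objectives and test results.
slo = {"response_time_ms": 500.0, "error_rate_pct": 1.0}
sli = {"response_time_ms": 420.0, "error_rate_pct": 2.5}
violations = evaluate_quality_gate(sli, slo)
```

A pipeline stage could stop the delivery whenever the returned list is not empty.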

SLI: the test results / SLO: the target objectives

THANK YOU!

Follow our socials for more impressions!