ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA Y SISTEMAS DE TELECOMUNICACIÓN

FINAL DEGREE PROJECT (PROYECTO FIN DE GRADO)

TITLE: ENTERPRISE ARCHITECTURE AUTOMATION WITH OPENSOURCE TOOLS

AUTHOR: FERNANDO GARZÁS MARTÍN DE ALMAGRO

DEGREE: BACHELOR'S DEGREE IN TELEMATICS ENGINEERING (GRADO EN INGENIERÍA TELEMÁTICA)

DIRECTOR: ALEJANDRO ALONSO FERNANDEZ

TUTOR: CARLOS RAMOS NESPEREIRA

DEPARTMENT: TELEMATICS AND ELECTRONICS ENGINEERING (INGENIERÍA TELEMÁTICA Y ELECTRÓNICA)

Approval (VºBº)

Members of the Examination Board:

CHAIR: MARTINA ECKERT

TUTOR: CARLOS RAMOS NESPEREIRA

SECRETARY: FRANCISCO JAVIER RAMIREZ LEDESMA

Date of defense:

Grade:

The Secretary,

Enterprise architecture automation with opensource tools


Summary (Resumen)

Any company whose business is based on developing and selling software products must guarantee the correct operation and performance of the products it delivers to its customers. To do so, proper monitoring of the tools is essential, which entails an operational cost that can sometimes be high. This operational cost may be borne by the provider company or by the end customer.

In the case of the Expresse® product, from ASSIA Inc. ([2]), this monitoring is performed manually by the engineers of each customer's support team. There is no software system that allows efficient monitoring of the different business processes executed by the software, nor proper monitoring of the platforms on which the products are deployed. In this project, tools that minimize that operational cost have been chosen, and an automatically deployed monitoring system has been implemented on top of them.

As part of the work, the key performance indicators (KPIs) of the Expresse product that must be monitored have been determined. Likewise, the performance indicators of the platform servers that are relevant to its operation have been determined. In addition, the types of architecture that must be monitored have been identified, resulting in two: single-customer architectures and multi-customer architectures.

Regarding monitoring tools, two have been evaluated: Kibana ([1]) and Grafana ([8]). Grafana was finally selected because it is already used by other departments within the company, which simplifies monitoring processes by providing a global monitoring platform that includes products other than Expresse.

Likewise, four deployment automation tools have been analyzed: Ansible ([16]), SaltStack ([18]), Rudder ([11]) and CFEngine ([12]). Of these, Ansible has been selected for the simplicity of its architecture, its versatility and scalability, and its suitability for the types of architecture that must be managed.


Abstract

Software companies that develop their own products need to ensure the quality of operation and the performance of those products after they are handed over to customers. It is therefore crucial to monitor them effectively. The resources needed to do so are an operational expense, which may sometimes be high. These costs may be borne either by the software provider or by the customer.

In the case of Expresse®, an ASSIA Inc. product ([2]), this monitoring work is currently performed manually by the ASSIA support engineers assigned to each account, who have no tool to efficiently monitor either business processes or platforms. In this project, tools have been identified to help reduce those operational expenses, and an automatically deployed monitoring system has been implemented.

As part of the work, KPIs have been identified for both business process monitoring and platform monitoring. In parallel, two different architectures have been defined: single-customer and multi-customer.

Regarding monitoring, two tools have been evaluated: Kibana ([1]) and Grafana ([8]). The latter has been chosen because it is already used in other ASSIA processes to monitor another family of products (Cloudcheck). This simplifies global monitoring processes and reduces the learning curve for those systems.

Additionally, four deployment automation tools have been evaluated: Ansible ([16]), SaltStack ([18]), Rudder ([11]) and CFEngine ([12]). Of these, Ansible has been chosen for its simplicity of operation, its versatility and scalability, and its suitability for the different architectures to be set up.


Revision History

Part Number: TR-XX-YYYY-DD-MM

Date         Comments
-            Document created with first page meeting UPM requirements.
-            Included abstract.
-            Included first version of Table of Contents, Acronym list, and introduction.
-            Included Introduction.
-            Included Technological Framework.
27/05/2019   Included changes from first review.
28/05/2019   Added KPI definitions.
10/06/2019   Improved KPI definitions. Included diagrams, compliancy and budget. Performed minor revisions.
17/06/2019   Reorganized documentation to add more details in Proposal Description chapter.
19/06/2019   Added architecture diagrams. Included more details on solutions used. Added implementation details. Added user guide and conclusions.
02/07/2019   Reviewed document format. Added chapter “Future Works”.


Table of Contents

REVISION HISTORY
ABBREVIATIONS, ACRONYMS AND SYMBOLS
1 INTRODUCTION
2 PRECEDENTS/TECHNOLOGICAL FRAMEWORK
  2.1 EXPRESSE SUITE OUTLINE
    2.1.1 EXPRESSE ARCHITECTURE DESIGN
    2.1.2 EXPRESSE HARDWARE
  2.2 THIRD PARTY TOOLS
    2.2.1 IT DEPLOYMENT AUTOMATION SELECTION PROCESS
    2.2.2 MONITORING SOFTWARE REQUIREMENTS
  2.3 HIGHLIGHTS OF SELECTED TOOLS
3 DESIGN SPECIFICATION AND RESTRICTIONS
  3.1 GLOBAL SPECIFICATIONS
  3.2 GENERIC RESTRICTIONS
  3.3 BUSINESS PROCESS KPI SPECIFICATION
  3.4 GENERIC KPI SPECIFICATION
    3.4.1 GENERIC HARDWARE INDICATORS
    3.4.2 DATABASE (ORACLE) SPECIFIC INDICATORS
  3.5 DASHBOARD DEFINITION
4 PROPOSAL DESCRIPTION
  4.1 IT DEPLOYMENT AUTOMATION
    4.1.1 SELECTION PROCESS AND FINAL DECISION
    4.1.2 ANSIBLE TO AUTOMATE DEPLOYMENTS
  4.2 MONITORING FRAMEWORK
  4.3 ARCHITECTURES PROPOSAL
  4.4 ANSIBLE INSTALLATION
  4.5 ANSIBLE ROLES SPECIFICATION
  4.6 ANSIBLE INVENTORIES
  4.7 ANSIBLE PLAYBOOKS
5 RESULTS
  5.1 INSTALL_INFLUX ROLE TESTING
  5.2 INSTALL_TELEGRAF ROLE TESTING IN NORMAL SERVER
  5.3 INSTALL_TELEGRAF ROLE TESTING IN DATABASE SERVER
  5.4 INSTALL_GRAFANA ROLE TESTING
6 BUDGET
7 CONCLUSIONS
8 FUTURE WORKS
REFERENCES
ANNEX A: DIAGRAMS
ANNEX B: OPEN SOURCE LICENSES OF ANALYZED SOFTWARE
ANNEX C: USER GUIDE
ANNEX D: DEPLOYMENT AUTOMATION TOOL ANALYSIS MATRIX


Table of Figures

FIGURE 1. SINGLE-CUSTOMER SCENARIO ARCHITECTURE
FIGURE 2. MULTI-CUSTOMER SCENARIO ARCHITECTURE
FIGURE 3. INSTALL_INFLUX ROLE PRE-EXECUTION CHECKS
FIGURE 4. INSTALL_INFLUX ROLE EXECUTION EVIDENCE
FIGURE 5. INSTALL_INFLUX ROLE POST-EXECUTION CHECK
FIGURE 6. INSTALL_TELEGRAF ROLE PRE-EXECUTION TESTS
FIGURE 7. INSTALL_TELEGRAF ROLE EXECUTION EVIDENCE
FIGURE 8. INSTALL_TELEGRAF ROLE POST-EXECUTION CHECK (1/3)
FIGURE 9. INSTALL_TELEGRAF ROLE POST-EXECUTION CHECK (2/3)
FIGURE 10. INSTALL_TELEGRAF ROLE POST-EXECUTION CHECK (3/3)
FIGURE 11. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. PRE-EXECUTION CHECK
FIGURE 12. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. EXECUTION EVIDENCE (1/4)
FIGURE 13. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. EXECUTION EVIDENCE (2/4)
FIGURE 14. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. EXECUTION EVIDENCE (3/4)
FIGURE 15. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. EXECUTION EVIDENCE (4/4)
FIGURE 16. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. POST-EXECUTION CHECKS (1/2)
FIGURE 17. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. POST-EXECUTION CHECKS (2/2)
FIGURE 18. INSTALL_GRAFANA ROLE PRE-EXECUTION CHECKS
FIGURE 19. INSTALL_GRAFANA ROLE EXECUTION EVIDENCE
FIGURE 20. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (1/8)
FIGURE 21. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (2/8)
FIGURE 22. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (3/8)
FIGURE 23. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (4/8)
FIGURE 24. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (5/8)
FIGURE 25. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (6/8)
FIGURE 26. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (7/8)
FIGURE 27. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (8/8)
FIGURE 28. GENERIC EXPRESSE INTEGRATION ARCHITECTURE
FIGURE 29. REGION REPRESENTATION FOR BIG DEPLOYMENTS
FIGURE 30. FULL-TIER APPLICATION SERVER
FIGURE 31. ROLE-SPECIFIC APPLICATION SERVERS
FIGURE 32. EXPRESSE SIMPLE DB ARCHITECTURE
FIGURE 33. EXPRESSE COMPLEX DB ARCHITECTURE
FIGURE 34. EXPRESSE DB SERVERS SYNCHRONIZED WITH ORACLE DATA GUARD
FIGURE 35. MOST COMPLEX EXPRESSE ARCHITECTURE


Abbreviations, Acronyms and Symbols

AAA        Authentication-Authorization-Auditing
AGPL       GNU Affero General Public License
AI         Artificial Intelligence
API        Application Programming Interface
ARPU       Average Revenue per User
ASSIA Inc  Adaptive Spectrum and Signal Alignment (Incorporated)
BSS        Business Support System
CLI        Command-Line Interface
CPE        Customer Premises Equipment
CPU        Central Processing Unit
DBA        Database Administrator
dbloader   Database Loader (Expresse module)
DcPc       Data Collection and Profile Change (Expresse module)
DLM        Dynamic Line Management
DS         Downstream
DSL        Digital Subscriber Line
DSLAM      Digital Subscriber Line Access Multiplexer
FCGT       Full Garbage Collection Time
FQDN       Fully Qualified Domain Name
GNU        GNU's Not Unix
GPL        GNU General Public License
GUI        Graphical User Interface
HA         High Availability
HTTP       Hypertext Transfer Protocol
IOPS       Input/Output Operations per Second
IP         Intellectual Property
IP         Internet Protocol
ISP        Internet Service Provider
JMX        Java Management eXtensions
JVM        Java Virtual Machine
KPI        Key Performance Indicator
LATAM      Latin America
LGPL       GNU Lesser General Public License
MABR       Maximum Achievable Bit Rate
MIT        Massachusetts Institute of Technology
MSAN       Multiservice Access Node
NAPI       Northbound API
NMS        Network Management System
NOC        Network Operations Center
ODN        Optical Distribution Network
OLT        Optical Line Terminal
OPEX       Operational Expense
OSP        Outside Plant
OSPM       Outside Plant Management
OSS        Operations Support System
PE         Performance Evaluator (Expresse module)
PO         Profile Optimizer (Expresse module)
PON        Passive Optical Network
POP_O      Pop operational data
POP_P      Pop performance data
QoS        Quality of Service
RAC        Real Application Clusters (Oracle product)
RAM        Random Access Memory
RHEL       Red Hat Enterprise Linux
ROI        Return on Investment
S2S VPN    Site-to-Site Virtual Private Network
SAN        Storage Area Network
SCAN       Single Client Access Name (Oracle feature)
SNMP       Simple Network Management Protocol
SOAP       Simple Object Access Protocol
SQL        Structured Query Language
SR         Service Recommender
SSH        Secure Shell
TICK       Telegraf, InfluxDB, Chronograf and Kapacitor
TPS        Transactions per Second
TR         Technical Report
TV         Television
UI         User Interface
UIWS       User-Interface Web-Service
URL        Uniform Resource Locator
US         Upstream
VM         Virtual Machine


1 Introduction

Adaptive Spectrum and Signal Alignment, Incorporated (ASSIA Inc.) is an American company based in Redwood City (San Francisco Bay Area, California, USA), founded in 2003 by Dr. John Cioffi, professor emeritus at Stanford University, whose work has helped to universalize broadband technologies. The company's vision is to apply Artificial Intelligence (AI) methods to improve global Internet connectivity by orders of magnitude. Consequently, its mission is to build products that make Internet connections faster and more reliable.

ASSIA Inc.'s activity is currently focused on the development of Operations Support Systems (OSS). Since its foundation, the company has made a strong research effort, developing innovative products and technologies that have resulted in up to two hundred (200) patents related to broadband technologies, most of them registered worldwide. Successful products have also been developed based on that Intellectual Property (IP). One of those products is the Expresse® Suite, an OSS that dynamically manages Digital Subscriber Line (DSL) and Passive Optical Network (PON) access networks.

Expresse Suite and all other ASSIA products target the telecommunications sector, so the company's main customers are telecommunications operators (telcos). ASSIA Inc. currently offers its products and services to 35 customers worldwide, managing more than 100 million lines, while also trying to enter new markets and win new customers.

To summarize, ASSIA has many platforms to maintain and monitor (both for its own use and as part of the products it sells), and this activity generates significant operational costs.

The purpose of this project is to find a way to minimize these costs, so that ASSIA Inc. can use its resources more efficiently and improve customer care by applying standardized monitoring processes that help the platform work under optimal conditions and deliver its best performance.

For that purpose, plenty of tools exist on the market. This document includes market research so that, considering the project requirements and restrictions, the right tools can be chosen. Two phases of this project are especially relevant: the KPI definition phase, in which specific indicators are identified and monitoring processes are established, and the architecture phase, in which different approaches are designed to meet both customer and ASSIA needs. Finally, a process to automatically deploy monitoring environments is implemented.

The next chapter introduces the Expresse suite and describes the relevant aspects of the product. It also presents the relevant concepts considered and the market research performed, and takes a closer look at specific products.


2 Precedents/Technological Framework

This chapter lays out the technological concepts and background knowledge needed throughout the project to fully understand its scope and the chosen approach.

2.1 Expresse suite outline

The Expresse suite is a Dynamic Line Management (DLM) software system that collects data from fixed-access network equipment, mainly Digital Subscriber Line Access Multiplexers (DSLAMs) and Optical Line Terminals (OLTs), and also from Multiservice Access Nodes (MSANs). With the collected data, Expresse can:

• Determine the Quality of Service (QoS) at the physical level of each managed line.
• Perform diagnostics, depending on the technology, on each line or its outside plant elements, detecting, locating, and specifying impairments that may be impacting customer experience.
• Perform automatic operations on each customer's line to improve its QoS and rate (in the case of DSL technologies).
• Recommend that the ISP send a field technician to the customer premises to perform fixes, when needed.
• Recommend a service upgrade or service downgrade, based on expert algorithms.
• Perform real-time operations.

These points summarize the core functionality. More generally, the Expresse suite can be considered a versatile toolbox for performing complex network operations easily and efficiently. It orchestrates different activities on equipment from several technologies and vendors, producing an overall positive impact on QoS and line rate, which in turn increases customer satisfaction.

Some of the benefits are:

• Strong reduction in access network maintenance costs
• Significant reduction in customer complaints
• Reduction in customer churn due to bad experience
• Increase in Average Revenue Per User (ARPU)

2.1.1 Expresse architecture design

The following sections describe some relevant concepts considered during architecture design.

a) User types

ISP operations need consolidated procedures in order to implement services that can be sold to customers. These include processes such as service activation, service monitoring, inventory registration, periodic reporting, customer care, and marketing campaigns. These activities are normally performed by different teams from different professional disciplines, and most of them can use the Expresse suite to improve overall operational efficiency.

Best practices for designing Expresse architectures recommend analyzing system users by looking at the following aspects:

• Activities they will need to perform
• Size of their teams
• Frequency of these operations
• Distribution of these operations


It is also recommended to identify the types of users. In general, the following groups of users can be found in any ISP:

• Front Office - Level 1 of support: Users belonging to a call center, where customer calls with complaints are first attended. This group may have from 500 to many thousands of users. Their technical knowledge may be basic and their professional qualification low, which normally goes with a high staff turnover rate. Their tasks are normally strictly guided, and many operations may be restricted for them. Due to the team size, the computational load needed to support this group may be high, and system high availability may be required, since the processes performed by this group are normally critical because of the volume of operations.
• Backoffice - Level 2 or Level 3 of support: Technicians with deep technological expertise, normally engineers. The team is usually not very large (between 10 and 50 people).
• Maintenance groups: This group specializes in performing preventive works with a high impact on QoS. Its members are field technicians with hands-on experience fixing faults in the outside plant, or even at customer premises. Its size can range from 50 to 300 users.
• Marketing groups: This group specializes in running marketing campaigns to offer existing customers a service upgrade, to find new customers, to offer discounts if the service is not meeting customer contract specifications, etc. This group can range from 20 to 100 users.
• Other Operations or Business Support Systems (OSSs/BSSs): Expresse may be integrated via the Northbound API (NAPI) with other support systems, which may have different purposes. These range from robots performing automatic nightly tasks, to call centers in different regions, to integrations with local regulators performing service quality investigations. A generic integration architecture diagram can be found in Figure 28. Generic Expresse integration architecture.

b) Network types

Another important factor when designing an Expresse architecture is the size of the managed network, in terms of the number of lines. Two types of network can be established:

• Small network: up to 3 million lines to be managed.
• Big network: more than 3 million lines to be managed.

The Expresse suite performs most of its operations over the whole network on a daily basis. This requires different computational resources depending on network size, as well as a different number of servers, with more specialized roles when managing bigger networks. An example of a big network can be found in Figure 29. Region representation for big deployments.
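The size threshold described in this section can be expressed as a small helper. The following is only a minimal Python sketch: the function name `classify_network` and the constant `SMALL_NETWORK_MAX_LINES` are illustrative assumptions and are not part of the Expresse suite.

```python
# Threshold taken from the text: up to 3 million lines is a "small" network.
# The names below are hypothetical and used for illustration only.
SMALL_NETWORK_MAX_LINES = 3_000_000


def classify_network(managed_lines: int) -> str:
    """Return "small" or "big" for a given number of managed lines."""
    if managed_lines < 0:
        raise ValueError("managed_lines must be non-negative")
    return "small" if managed_lines <= SMALL_NETWORK_MAX_LINES else "big"


if __name__ == "__main__":
    for lines in (500_000, 3_000_000, 12_000_000):
        print(lines, classify_network(lines))
```

A helper of this kind could be used, for example, to decide which architecture template to apply before sizing the servers.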

c) Customer management networks

The Expresse suite can impact customer management networks. It collects information so intensively that it requires considerable bandwidth to gather all the data and store it in the database. This activity can eventually saturate the management network, resulting in denial of service for other OSSs or BSSs and, ultimately, for network operation as a whole. Therefore, the management network needs to be analyzed before deploying this software.

Management networks differ from one ISP to another. Factors that lead to these differences include the geographic distribution of network elements, site accessibility, and customer dispersion. For example, LATAM countries can cover a vast area, with customers widely dispersed across the country. DSLAMs may require different types of connectivity to the core network, such as radio links, satellite links, or other links that may have low bandwidth. Those cases must be considered so that Expresse traffic shaping is configured accordingly.

d) Expresse server architecture

Expresse is normally deployed on general-purpose servers running CentOS/RHEL operating systems. It is composed of several specialized software modules. Depending on how these modules are deployed, several server roles can be defined:

• Core roles:
  o Application server: Can host all Expresse software modules, depending on the architecture or the usage.
  o Database server: Oracle database, used as a data warehouse (collections, diagnostics, recommendations, statistics, configuration, Authentication-Authorization-Auditing (AAA) information, etc.). It is normally attached to a storage server, which can be located next to the server or can be a Storage Area Network (SAN).
• Optional roles (apply only to application servers):
  o Web server: Only the Tomcat application server is deployed, normally to separate this module, which acts as the application front end, from its back end, where the other software modules run. This is normally done when customer security policies require it.
  o Remote DcPc (Data Collection and Profile Change) server: Hosts the DcPc software module, which is the gateway for all Expresse operations performed towards the equipment, whatever its nature. It is normally considered the southbound interface.
  o Storage server: The data layer may be separated from the data access layer. In this case, a set of servers accesses a centralized data repository.
  o Stand-by database storage server: Storage server in a stand-by data center.

A simple architecture may contain two servers: an application server and a database server. This basic deployment is valid for small networks. For big networks, dedicated DcPc servers may be deployed, defining collection regions, to segment network communication activity; additionally, more application and database servers may be required to meet the related computational requirements. See Figure 30. Full-tier application server and Figure 31. Role-specific application servers for a brief description of the most common architecture scenarios for application servers. See Figure 32. Expresse simple DB architecture and Figure 33. Expresse complex DB architecture for the different roles available for DB servers. Figure 34. Expresse DB servers synchronized with Oracle Data Guard gives a quick view of the database redundancy scenario with data replication to a remote storage server. When Expresse becomes systemic, an even more complex architecture is normally required, as shown in Figure 35. Most complex Expresse architecture.
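The server roles and the two deployment scenarios described above can be modelled with a small data structure. The Python sketch below is illustrative only: the `Role` and `Server` types, the `minimal_plan` function, the host names, and the rule of one remote DcPc server per collection region are assumptions made for the example; the text only states that dedicated DcPc servers may be deployed for big networks.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence


class Role(Enum):
    """Server roles named in this section (identifiers are illustrative)."""
    APPLICATION = "application"
    DATABASE = "database"
    WEB = "web"
    REMOTE_DCPC = "remote_dcpc"
    STORAGE = "storage"
    STANDBY_STORAGE = "standby_storage"


@dataclass(frozen=True)
class Server:
    name: str
    role: Role


def minimal_plan(managed_lines: int, regions: Sequence[str]) -> List[Server]:
    """Build a minimal server list for a deployment.

    Small networks (up to 3 million lines) get the two core servers.
    For big networks, one remote DcPc server per collection region is
    added; this one-per-region rule is an assumption of this sketch.
    """
    plan = [Server("app-1", Role.APPLICATION), Server("db-1", Role.DATABASE)]
    if managed_lines > 3_000_000:
        plan.extend(Server(f"dcpc-{region}", Role.REMOTE_DCPC) for region in regions)
    return plan


if __name__ == "__main__":
    for server in minimal_plan(5_000_000, ["north", "south"]):
        print(server.name, server.role.value)
```

A structure like this could later be translated into host groups for a deployment tool, although real deployments would also need to size the additional application and database servers mentioned above.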


2.1.2 Expresse hardware

As mentioned in previous sections, the Expresse suite can be deployed on general-purpose servers. These can be physical machines or virtual machines (VMs) created with virtualization frameworks such as VMware or OpenStack. Expresse can operate in both cases, although to guarantee platform performance the servers must meet the Expresse performance requirements in the following areas:

• Disk space
• Disk input/output operations per second (IOPS), relevant for DB servers
• RAM
• CPU processing capacity
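The four requirement areas listed above lend themselves to a simple automated check. The following Python sketch compares a server's measured capacity against a required minimum; the `ServerSpec` type, the `unmet_requirements` function, and all numeric values are hypothetical and do not represent actual Expresse requirements.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ServerSpec:
    """Capacity figures for the four areas listed in the text."""
    disk_gb: float
    disk_iops: float
    ram_gb: float
    cpu_cores: int


def unmet_requirements(actual: ServerSpec, required: ServerSpec) -> List[str]:
    """Return the names of the areas where `actual` is below `required`."""
    return [
        f.name
        for f in fields(ServerSpec)
        if getattr(actual, f.name) < getattr(required, f.name)
    ]


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    required = ServerSpec(disk_gb=500, disk_iops=5000, ram_gb=64, cpu_cores=16)
    candidate = ServerSpec(disk_gb=1000, disk_iops=3000, ram_gb=64, cpu_cores=8)
    print(unmet_requirements(candidate, required))
```

An empty list means the candidate server meets every listed requirement; otherwise the returned names indicate which areas need to be resized.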

2.2 Third party tools

For ASSIA Inc. to reduce OPEX, the best option is to find tools that help set up a framework able to monitor all relevant information, together with tools that present that information in an executive-level view, so that a centralized deployment operations center can benefit from it. This framework must be as generic as possible.

Two kinds of tools are needed to achieve this: monitoring tools and IT infrastructure automation tools. With them, the resources needed to monitor Expresse environments, from basic (2 servers) to complex (20 servers or more), should be drastically reduced.


2.2.1 IT deployment automation selection process

This section defines the requirements for IT deployment automation, the characteristics that are valued, and other aspects to be analyzed:

• Product/Architecture: The points below analyze basic product information that is considered relevant and important to meet:
  o License type: To minimize operational costs, software license costs must be minimal or non-existent. Products with open-source licenses are therefore prioritized. A brief license review is included in Annex B: Open Source licenses of analyzed software.
  o Launch date: Open-source projects are innovative, so they help to solve problems from perspectives other than those used previously. Because of that, projects launched within the last ten years are considered suitable for the project.
  o Last version and date: Open-source software has many benefits, but it also carries an important risk: it is not unusual to find open-source projects that have been abandoned. When a software project is closed and there is no community behind it, the product does not evolve and no bugs are fixed, which can limit the project. It is therefore very important to identify the products that are still evolving.
  o Community: One of the most valuable aspects of open-source software is the activity and size of its community. The bigger and more active it is, the better, so an active community is a requirement.
  o Implementation language: This is not a requirement, but it is a relevant aspect to consider. When using an open-source product, it is useful to know its implementation language in case the project has a specific need that the product does not yet implement. It also helps ASSIA Inc. to value positively new candidates able to develop in that programming language.

  o Configuration file language: Standard formats are prioritized, so that automated configuration and verification procedures can be incorporated.
  o User interface (UI): Tools with a graphical user interface (GUI) are prioritized over those with only a command-line interface (CLI), since this helps to reduce the learning curve.
  o RAM used by central server: This specification must be analyzed to determine whether a dedicated server is needed for this task. Software with low RAM usage is prioritized, so that it can be deployed on available hardware that is also used for other purposes.
  o RAM used by agent: Memory resources needed by software agents, if any, so that the impact of the deployment software on remote hosts can be estimated.
  o Disk used by central server: This specification must be analyzed to determine whether a dedicated server is needed for this task. Software with low disk usage is prioritized, so that it can be deployed on available hardware that is also used for other purposes.
  o Disk space required by agent: Disk resources needed by software agents, if any, so that the impact of the deployment software on remote hosts can be estimated.
  o Requires software agent in every host: Automation tools must be easy to deploy. It is relevant to know whether every server in an Expresse product platform must have a specific agent installed as part of the deployment automation software architecture.

o Type of nodes/elements: If the deployment software requires having nodes with

different roles.

o Host interconnection method: This point must be analyzed to evaluate so that impact

on network flow policies can be determined. Those products having the lower impact

would be prioritized.

o Scalability: Whether the software supports Expresse-like platforms, from simple to

complex. Scalable products would be prioritized.

o Ease of use / learning curve: The time it takes to master deployment software use

impacts directly on ASSIA Inc. operational costs. Products that are easy to learn

would be prioritized.

o Extensibility: Complex deployments may require many different software capabilities. Products offering software extensions, plugins, addons, or any other kind of feature expansion would be prioritized.

o Dependencies: As said above, the deployment automation central server software may be installed on multi-purpose servers. To reduce ASSIA Inc. operational costs, products with fewer software dependencies would be prioritized.

 Framework usage

o Allows role-based controls: Impact of deployment automation tools can be high, since

they can perform automated changes on many frameworks. Those products having the

possibility to define deployment user roles, to set specific operations that can be

performed by any of them, would be prioritized.

 Deployment automation

o Application management: Those products being able to start/stop applications that

have been deployed would be prioritized.

 Orchestration

o Allows orchestration: Complex deployments may require orchestration between

different deployment processes, so those products allowing orchestration would be

prioritized.

o Continuous configuration monitoring/fixing: Deployment configuration compliance is

important in phases other than initial product installation. So those products being

able to check whether expected configuration fits current configuration in already

deployed systems would be prioritized, as well as those products being able to fix

found inconsistencies.

o Configuration compliance reports: As a complement of previous specification, it

would be valuable for the product to generate reports of configuration compliance, so

that ASSIA Inc can determine the level of compliance their products have compared

to the expected configuration set.

o File blacklist / whitelist / etc.: In some operational situations, it may be necessary to blacklist specific configurations or specific nodes. Products with this capability would be prioritized.

o User/group management: Whether deployment automation products can handle

user/group management, since it is part of product deployment

procedure.

o File template: For some specific processes, it may be required to work with specific

templates that can later be customized for specific customers. Product supporting this

would be prioritized.

o Notifications: Whether the deployment software can report the issues it finds through a remote notification mechanism, that is, an automated offline notification sent by email, SNMP trap, etc.

o Automatic Inventory: Some products can discover automatically those nodes that may

be included under deployment automation architecture. This would help to reduce

configuration requirements and improve scalability management.

o Automatic Configuration rollback: Configuration templates may be wrong. Allowing

an easy transition to a previous configuration version would be a requirement.

o Configuration replication from specific node: Whether the deployment automation software supports automatic configuration by telling new nodes to replicate the configuration of existing ones.

 Service monitoring

o Ensures a service/process is running: Applications or services not being part of

configured deployment procedures may also need to be managed and monitored.

o Supports maintenance mode: For maintenance windows, sometimes it is needed to

stop automated tools operation so that manual changes can be done.

 Ad-hoc task execution

o Allow custom task execution: in many cases, Expresse deployment may require the

execution of non-standard processes, like a sequence of commands to get info and

make decisions depending on its results.

 SQL Database Management

o SQL Database Management: Expresse deployment procedures also need to execute SQL scripts. They would also need to start/stop databases, or even to automate some activities related to Oracle database management.

 Type of hardware supported: Whether deployment automation tools support working in

VMs, physical hardware, cloud environments, etc.
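
As an illustration of the continuous configuration monitoring and configuration compliance report requirements listed above, the following minimal Python sketch compares an expected configuration with the configuration found on a host and derives a compliance percentage. The function name, the dictionary layout and the sample keys are hypothetical and are not taken from any specific tool.

```python
def compliance_report(expected, current):
    """Compare expected vs. current settings and summarize the drift."""
    # Settings that are missing on the host or whose value differs.
    drift = {
        key: {"expected": value, "current": current.get(key)}
        for key, value in expected.items()
        if key not in current or current[key] != value
    }
    total = len(expected)
    compliant = total - len(drift)
    # Percentage of expected settings that match; 100.0 if nothing is expected.
    pct = 100.0 if total == 0 else 100.0 * compliant / total
    return {"compliance_pct": pct, "drift": drift}


if __name__ == "__main__":
    # Hypothetical settings, used only to exercise the function.
    expected = {"ntp_server": "10.0.0.1", "max_sessions": 200, "tls": True}
    current = {"ntp_server": "10.0.0.1", "max_sessions": 150, "tls": True}
    print(compliance_report(expected, current))
```

A tool that also fixes inconsistencies would go one step further and apply the expected value for every key reported in the drift.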

2.2.2 Monitoring software requirements

In this section, a number of requirements related to monitoring needs are provided, so that a market analysis can be performed.

 Product/Architecture: The points below cover basic product specifications considered relevant and important to meet:

o License type: To minimize operational costs, software license costs must be minimal or non-existent. Therefore, products with open-source licenses would be prioritized.

o Launch date: Open-source projects are innovative, so they help to solve problems from perspectives other than those used previously. Because of that, projects launched within the last ten years will be suitable for the project.

o Last version and date: Open-source software is valuable, but an important risk has been identified. It is not unusual to find open-source projects that have been abandoned. When a software project is closed and there is no community behind it, the product does not evolve and no bug-fixing is done, which can limit the project. It is therefore very important to identify those products that are still evolving.

o Community: One of the most valuable aspects of open-source software is the activity and size of its community. The bigger and more active it is, the better, so an active community would be a requirement.

o Implementation language: This would not be a requirement, but a relevant aspect to

consider. When using an open-source licensed product, it is relevant to know

implementation language in case there is any specific need for the product which has

not been already implemented. This helps ASSIA Inc. to positively value new

candidates able to develop in that programming language.

o Configuration files language: An important aspect to analyze, so that standard formats

would be prioritized. This way, automated configuration and verification procedures

can be incorporated.

o User interface (UI): Tools with a graphical user interface (GUI) would be prioritized over those with only a command-line interface (CLI), since a GUI helps to reduce the learning curve.

o RAM used by central server: This specification must be analyzed to determine whether a dedicated server is needed for this task. Software with low RAM usage would be prioritized, so that it can be deployed on available hardware that is also used for other purposes.

o RAM used by agent: Memory resources needed by software agents, if any, so that the impact of the software on remote hosts can be estimated.

o Disk used by central server: This specification must be analyzed to determine whether a dedicated server is needed for this task. Software with low disk usage would be prioritized, so that it can be deployed on available hardware that is also used for other purposes.

o Disk space required by agent: Disk resources needed by software agents, if any, so that the impact of the software on remote hosts can be estimated.

o Requires software agent in every host: Tools must be easy to deploy. It is relevant to know whether every server in an Expresse product platform must have a specific agent installed as part of the software architecture.

o Type of nodes/elements: Whether the software requires nodes with different roles.

o Host interconnection method: This point must be analyzed so that the impact on network flow policies can be determined. Products with the lowest impact would be prioritized.

o Scalability: Whether the software supports Expresse-like platforms, from simple to

complex. Scalable products would be prioritized.

o Ease of use / learning curve: The time it takes to master the software has a direct impact on ASSIA Inc. operational costs. Products that are easy to learn would be prioritized.

o Extensibility: Complex deployments may require many different software capabilities. Products offering software extensions, plugins, addons, or any other kind of feature expansion would be prioritized.

o Dependencies: As said above, the central server software may be installed on multi-purpose servers. To reduce ASSIA Inc. operational costs, products with fewer software dependencies would be prioritized.

 Data visualization capabilities

o Visualization software type: Whether data visualization software is a desktop

application, or a web application. This may be relevant to analyze platform access.

o Visual/interactive Dashboards: The monitoring software may have customizable

dashboards and interactive Dashboards so that any information being monitored can

be added, and additional information can be obtained from the Dashboard when

interacting with it.

o Dashboard templates: By using them, all monitoring platforms will be homogeneous,

and any fix or improvements on any of them would be easily propagated to any other

platform.

o Display format: The information must be displayed in plots, widgets, or tables, depending on the data.

o Historical information display: The tool must be able to show historical information for the last year.

o Data collection granularity: Monitoring data may have different collection schedules,

depending on data source.

o Zoom capability: All plots may have zoom capabilities, so that the user can zoom in on a specific period of time.

o Supports maintenance mode: During maintenance windows, it is sometimes necessary to stop the monitoring system so that manual changes can be made without contaminating the displayed monitoring data.

o Thresholds setting: The system may support threshold representation for plots, so that

when needed, specific colors may be applied accordingly.

 Services monitoring

o Processes uptime: The monitoring system may report how long a monitored service has been up and running.

o Memory usage: It may be relevant to monitor memory usage, so that patterns can be analyzed and abnormal situations can be detected. Full garbage collection time (FGCT) may also be monitored, since it is an indicator of performance issues.

 Data source

o SQL: Monitoring procedures may also need to extract data from the database by using SQL scripts. They would also need to start/stop databases, or even to automate some activities related to Oracle database management.

 Monitoring data storage

o Data storage may be relevant, since it can be external or internal to Expresse. Each of these approaches has advantages and disadvantages that should be analyzed.

 Type of hardware supported: Whether monitoring tools support working in VMs, physical

hardware, cloud environments, etc.
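
The threshold-setting requirement above can be pictured with a short Python sketch that maps a metric value to a display color. The threshold values and the color names are assumptions chosen only for illustration; the actual values would depend on each monitored metric.

```python
from bisect import bisect_right


def color_for(value, thresholds=(70.0, 90.0), colors=("green", "orange", "red")):
    """Return the color of the band in which `value` falls.

    `thresholds` must be sorted in ascending order and `colors` must have
    exactly one more entry than `thresholds`.
    """
    if len(colors) != len(thresholds) + 1:
        raise ValueError("colors must have len(thresholds) + 1 entries")
    # bisect_right counts how many thresholds are <= value, so a value
    # equal to a threshold falls into the upper band.
    return colors[bisect_right(thresholds, value)]


if __name__ == "__main__":
    for sample in (12.5, 70.0, 95.0):
        print(sample, color_for(sample))
```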

2.3 Highlights of selected tools

Considering the criteria described in the previous sections, a detailed analysis of different tools was made (see chapter “Proposal description” for details). Among the characteristics evaluated for the candidates, extra weight has been given to the following:

 simplicity of architecture: the simpler the tool is, the better. Simpler architectures have

lower operating costs

 deployment complexity: For the deployment automation tool, agent-less architectures have been preferred. For IT monitoring tools, the possibility of customizing the monitoring framework at deployment time has also been considered positive.

 community size and activity: this characteristic is crucial, since an open source tool depends

completely on the activity of its community.

Considering these points and those from the full analysis, Ansible has been chosen as the deployment automation tool and Grafana as the IT monitoring tool.
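
Giving extra weight to some criteria can be expressed as a weighted score. The sketch below is only an illustration of that idea: the weights, the 0-5 scale, the criterion names and the two fictitious candidates are all hypothetical and do not reproduce the analysis matrix used in this project.

```python
def weighted_score(scores, weights):
    """Weighted average of per-criterion scores (same keys in both dicts)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[c] * w for c, w in weights.items()) / total_weight


def rank(candidates, weights):
    """Return candidate names ordered from best to worst weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical weights: the three highlighted criteria count double.
    weights = {"simplicity": 2, "deployment": 2, "community": 2, "ui": 1}
    # Hypothetical scores for two fictitious candidates.
    candidates = {
        "tool_a": {"simplicity": 5, "deployment": 4, "community": 5, "ui": 2},
        "tool_b": {"simplicity": 3, "deployment": 3, "community": 4, "ui": 5},
    }
    print(rank(candidates, weights))
```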

3 Design specification and restrictions

This section depicts specification and restrictions of the final approach.

3.1 Global specifications

 Monitoring system must run on CentOS/RHEL systems.

 Monitoring system must operate automatically. Data collection and plot updates should not require any human intervention.

 The solution must enable automatic deployment and configuration of monitoring tools of

Expresse products. Templates must be used for customization.

 Solution must allow basic Key Performance Indicators (KPI) monitoring regarding hardware

performance (CPU, RAM, disk usage, disk IOPS, network usage, etc), as well as Expresse-

specific KPIs.

 System must be able to gather data from different ASSIA customers. It must support single-customer monitoring (exclusive mode), as well as multi-customer monitoring (multi-tenant mode).

 Monitoring information must be homogeneous between customers and may be represented in

executive dashboards. Customer-specific monitoring information will be placed in other

dashboards.

3.2 Generic restrictions

System restrictions are listed below:

 The system must have direct connectivity with the environments to be monitored, and with all their servers.

 Monitoring software deployed on any server belonging to an Expresse deployment may not use more than 10% of RAM, nor more than 10% of available CPU.

 Monitoring and automation software may not have restrictive software licenses, or any

license having any cost.

 The project will be initiated, planned, executed, monitored and controlled exclusively by

ASSIA Inc. employees.
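
The 10% RAM and CPU restriction listed above can be written as a simple check. This is a minimal sketch: the function name and its parameters are hypothetical, and it assumes that RAM and CPU figures for the monitoring software have already been measured on the host.

```python
def within_budget(used_ram_mb, total_ram_mb, cpu_pct, max_share=0.10):
    """Check the 10% RAM / 10% CPU restriction for monitoring software.

    used_ram_mb:  RAM used by the monitoring software on the host.
    total_ram_mb: RAM installed in the host.
    cpu_pct:      CPU used by the monitoring software, as a percentage
                  (0-100) of the available CPU.
    """
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    ram_ok = used_ram_mb / total_ram_mb <= max_share
    cpu_ok = cpu_pct / 100.0 <= max_share
    return ram_ok and cpu_ok


if __name__ == "__main__":
    # Hypothetical host with 8 GB of RAM.
    print(within_budget(used_ram_mb=512, total_ram_mb=8192, cpu_pct=5))
    print(within_budget(used_ram_mb=1024, total_ram_mb=8192, cpu_pct=5))
```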

3.3 Business process KPI specification

There are four different groups of KPIs that should be monitored:

 Expresse Network Statistics: DSL/PON network-wide analysis based on the Expresse network statistics module. These statistics are already available in the Expresse GUI, but for the convenience of the operation and monitoring team, they will have their own place in this monitoring framework. Included statistics will be, at least, the following:

o Stability: QoS as calculated by the Expresse platform. It must be separated into two plots, one for DSL and another for PON technologies.

o DS_SYNCH_RATE: Average downstream synch rate (only for DSL).

o US_MABR_ESTIMATED: Average upstream maximum achievable bit rate (only for DSL).

o DS_MABR_ESTIMATED: Average downstream maximum achievable bit rate (only

for DSL).

o US_SYNCH_RATE: Average upstream synch rate (only for DSL).

o Distribution of lines per Technology: Count of provisioned lines per technology (DSL

or PON)

o Distribution of DSL lines per DSLAM type: Count of provisioned DSL lines per

DSLAM model.

o Distribution of PON lines per chassis model: Count of provisioned PON lines per

chassis model.

o Distribution of DSLAMs per model - DSL: Count of provisioned DSLAMs per model.

o Distribution of PON chassis equipment per model: Count of provisioned PON chassis

per model.

 Expresse modules monitoring: Indicators to monitor module performance, as indicated in

[3], attending to its specific operation characteristics.

o Generic: Include common metrics and JVM monitoring: This group would apply to

all Expresse modules. Since they are written in Java, some aspects of the JVM can be

monitored.

o Provisioning application

. Data source type: SQL

. Execution duration per schedule: total duration of each execution in the last day

. Historical execution duration: time-series statistics with duration of

processes

. Provisioning errors per day/type: If any, the type and amount of

provisioning errors during current day

. Lines Added vs deleted vs updated per Technology / Per region: For

successful provisioned lines, percentage per type of update for each

defined region.

. DSLAMs Added vs deleted vs updated per Technology / Per region: For

successful provisioned DSLAMs, percentage per type of update for each

defined region.

. External plant lines Added vs deleted vs updated per Technology: For

successful provisioned records, percentage per type of update and

technology.

. Lines with disabled PE/PO (%): Percentage of all provisioned lines that have diagnostics / optimization capabilities disabled

o PE

. Data source type: SQL

 Duration per main processes (PE DSL, LINE SUMMARY, PO

Trigger, SR, PE PON, PON LINE_SUMMARY, PON SR)

 DSL Instability: Percentage of DSL lines for which QoS is degraded

(stability indicator in “UNSTABLE” or “VERY UNSTABLE” level).

 DSL Instability 1 month: Percentage of DSL lines that experienced instability for more than 30% of the time during the last 30 days.

 PON Instability: Percentage of PON lines for which QoS is degraded

(Link quality indicator in “SEVERELY DEGRADED” or “SERVICE

INTERRUPTED” level).

 Results distribution analysis:

o DSL

. % of DSL lines with diagnostics calculated running day,

per source type (POP_O/PerTone)

. % lines with PE per line card type

. % lines with PE per dslam type

. % lines without PE per CPE_type: statistics per

DSLAM type and line card type on CPEs not having

diagnostics, when more than 30 % of cases are

affected, if existing for more than 100 customers.

. Dispatch recommendation analysis

 % DSL lines with dispatch recommended per

type of recommendation

. PO Trigger

 % of lines that were evaluated by PO trigger,

and action was needed

 Number of lines that were sent to PO, by reason

(code with description)

 Number of lines that failed when being sent to

PO, by reason (with description)

. SR

 % of active lines with SR results: The number of

provisioned lines having service

recommendations.

 % of SR faults by code: SR executions that

ended up in error.

o PON

. % PON lines with PE per line_card_type

. % PON lines with PE per dslam_type

. % PON lines with PE per DSLAM type, line card type

and CPE_type: Useful to detect interoperability issues

. % PON lines with PE per CPE_type

. Dispatch recommendation analysis

 % PON lines with dispatch recommended per

type of recommendation

 % lines with dispatch recommended per location

(different than drop)

o PO. Extended information can be found in [5]

. Data source type: SQL

 Process duration

 PO process indicators:

o Lines exiting PO: The number of lines for which optimization

process has ended today.

o Lines entering PO: The number of lines for which optimization

process has started today.

o Lines already in PO: The number of lines for which

optimization process started before today.

o Lines in PO by source type: The number of lines that are

currently in PO, by type of requester.

o Total lines per number of days being optimized:

 PO Profile change efficiency:

o % of successful profile changes:

o % of profile change failures per type:

o PO profile changes per hour: number of requests executed per

hour

 For lines that ended optimization (PO_RECORD):

o Avg number of iterations per STABILITY_START

o Avg number of iterations per DS_STABILITY_START

o Avg number of iterations per US_STABILITY_START

o Distribution of number of profile changes, per

STABILITY_START

o Avg % improvement in DS rate

o Avg % improvement in US rate

o ReqGen

. generic JVM monitoring

. Start time of each collection block: The point in time every single activity

generated by ReqGen started.

. Duration time of generating schedules: The time it takes for a ReqGen block

to be executed.

o DBLoader

. Only generic JVM monitoring

o DcPc

. Data source type: SQL

 Collection distribution along the day per REQUEST_TYPE: Number

of requests processed by DcPc during last 15 minutes.

 Collections efficiency

o DSL

. Per DSLAM collections. One graph representing:

 % of DSLAMs with connectivity issues per region: those network elements that have not been reachable from the platform during the last 3 days.

. Per LINE (Collections):

 LINE_CARD

o % of lines with null or no line card information: If line card information has not been collected, this points to connectivity issues with network elements. Besides, this can impact other Expresse processes, like PO.

 POP_O

o DSLAM count per % of lines with

POP_O collections: A representation of

previous day collections, classifying

DSLAMs according to the % of lines in

those DSLAMs that have collections.

o % of lines with POP_O each hour.

 POP_P

o % of lines with POP_P each hour.

 PER_TONE

o % of lines with PER_TONE_DATA each hour and its validity for PE: not all collected data showing the operational status of each tone is eligible to be used for performance evaluation.

 BULK_VENDOR_ID

o % of lines with BULK_VENDOR_ID: Availability of information on customer premises equipment

. Line operations:

 Port management operations: Number of

operations per type per hour

 Port management efficiency: Result of

operations per result code per hour

 Profile changes: Number of operations per type

per hour and source type

 Profile change efficiency: Result of operations

per result code per hour

o PON

. LINES:

 Accumulated % of lines with PON operational

data collected each hour

 Accumulated % of lines with PON performance

data collected each hour

 Accumulated % of lines with PON customer

premises equipment related information, each

hour

. OLTs:

 Accumulated % of OLT ports with operational

data, each hour

. Per LINE (Actions):

 PON ONT management operations per type per

hour

 PON OLT Port management operations per type

per hour

o Tomcat

. Global indicators

 Transactions per second (tps) vs avg response times last 15 min: Transactions per second served by the Tomcat server, and how much time, on average, it took to serve them. This indicator allows both load monitoring and performance monitoring.

 Tps per invoked webservice and method: similar to previous indicator,

but with more detailed information, since it drills down into the

different web services that were attended.

 Error distribution last 15 min: In case there were errors when

attending requests, type of errors and count.

 Timed out request count last 15 min: Detail on the previous indicator, focused on time-out errors and distinguishing the different types of time-out.

 Tps + Avg response time for main users

. Dynamic reports

 Execution time for reports: The time it took for dynamic reports to

execute (start time to end time)

 Number of reports with issues in report generation

. Real Time operations

 Avg time per location (Tomcat queue vs dcpc /pe / po vs db): Time to

complete real-time operations, and average time distribution depending

on the module attending to the request.
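
Several of the KPIs above are percentages computed over the set of provisioned lines. As an example, the "DSL Instability 1 month" indicator could be computed as in the following Python sketch. It assumes, only for illustration, that instability is recorded as a number of unstable days per line within the 30-day window; the function name and the input layout are hypothetical and do not describe the Expresse data model.

```python
def instability_1_month_pct(unstable_days_by_line, window_days=30,
                            max_unstable_share=0.30):
    """Percentage of lines unstable for more than 30% of the window.

    unstable_days_by_line maps a line identifier to the number of days,
    within the window, in which the line was rated unstable.
    """
    if not unstable_days_by_line:
        return 0.0
    affected = sum(
        1
        for days in unstable_days_by_line.values()
        if days / window_days > max_unstable_share
    )
    return 100.0 * affected / len(unstable_days_by_line)


if __name__ == "__main__":
    # Hypothetical sample: two of four lines exceed the 30% share.
    sample = {"line-1": 10, "line-2": 5, "line-3": 0, "line-4": 30}
    print(instability_1_month_pct(sample))
```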

3.4 Generic KPI specification

Expresse deployments typically have standardized architectures, with specific roles defined for each server, as defined in the “Expresse server architecture” section. To monitor Expresse efficiently, the role of each server would need to be considered when representing its metrics. Nevertheless, there are server-specific raw metrics whose monitoring could help operation and maintenance teams to detect abnormal behaviors, bottlenecks and points of failure. Those metrics are the following:

3.4.1 Generic hardware indicators

 5min load: Average CPU load over the last 5 minutes

 CPU usage: Historical data, read at a frequency to be determined.

 Uptime: The time the monitored server has been up and running.

 Users connected monitoring: Number of users connected at a given time

 Temperature: If available (and reported), server temperature.

 Processes: Number of live processes at a specific point in time

 Memory Usage: RAM usage, and type

 Swap usage: Swap memory usage

 Disk usage (MBps + IOPS): Total amount of data transmitted or received, as well as the

measured read and write operations to the disk.

 Network traffic: per interface, amount of data transmitted and received.

With them, a global dashboard for server monitoring will be set, so that the ASSIA Inc. team can have a quick view on current (and historical) server status.
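
To give an idea of how some of these raw indicators can be read on a Linux host, the following Python sketch uses only the standard library. It is not the collector used in this project. It reads the 5-minute load average and, as a simpler stand-in for the disk indicators, the percentage of disk space used; disk throughput (MBps) and IOPS would need an additional data source. The function name and the returned keys are hypothetical.

```python
import os
import shutil


def host_snapshot(path="/"):
    """Collect a few generic host indicators using only the standard library."""
    # os.getloadavg() is available on Unix-like systems only.
    _load_1, load_5, _load_15 = os.getloadavg()
    disk = shutil.disk_usage(path)
    cpu_count = os.cpu_count() or 1
    return {
        "load_5min": load_5,
        # Load divided by the number of CPUs, to compare hosts of different size.
        "load_5min_per_cpu": load_5 / cpu_count,
        "disk_used_pct": 100.0 * disk.used / disk.total,
    }


if __name__ == "__main__":
    print(host_snapshot())
```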

3.4.2 Database (Oracle) specific indicators

a) Generic Oracle indicators (more information in [14] and [15])

 Data source type: SQL

o pctg_active_sessions: Percentage of Oracle sessions used

o pctg_active_processes: Percentage of Oracle processes used

o pctg_active_transactions: Percentage of Oracle transactions used

o pctg_opened_cursors: Percentage of open cursors used.

o oracle_transactions_per_second: Transactions per second per system stats interval

o Cache hit ratio: Number of I/O requests that were satisfied in the cache / Total IO

requests

o Redo generated per second:

o Redo writes per second:

o Total_Table_Scans_Per_Sec

o Database_CPU_Time_Ratio

o Row_Cache_Hit_Ratio

o Library_Cache_Hit_Ratio

o Cursor_Cache_Hit_Ratio

o CPU_Usage_Per_Sec

o Database_Wait_Time_Ratio

o Response_Time_Per_Txn

o Buffer_Cache_Hit_Ratio

o Physical_Write_IO_Requests_Per_Sec

o Physical_Read_IO_Requests_Per_Sec

o Physical_Read_Bytes_Per_Sec

o Physical_Write_Bytes_Per_Sec

o Physical_Read_Total_Bytes_Per_Sec

o Physical_Write_Total_Bytes_Per_Sec

o Physical_Reads_Per_Sec

o Physical_Writes_Per_Sec

o Redo_Allocation_Hit_Ratio

o Physical_Write_Total_IO_Requests_Per_Sec

o Physical_Read_Total_IO_Requests_Per_Sec

o I/O_Megabytes_per_Second

o I/O_Requests_per_Second

o Temp_Space_Used

o User_Transaction_Per_Sec

o SQL_Service_Response_Time

o % used space per tablespace and free space: This metric shows whether any action is needed, such as disk space clean-up, adding new datafiles, or adding disks.

o Number of invalid objects: Represents the count of objects that are in ‘INVALID’

status

o Number of locked objects: the count of objects of any type that are marked as

‘LOCKED’ status
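
Some of the indicators above are simple ratios. The following Python sketch shows the arithmetic behind the cache hit ratio, as defined in the list, and behind the tablespace usage indicator. The function names and the sample figures are hypothetical; in practice the input values would come from SQL queries against Oracle, which are not shown here.

```python
def ratio_pct(part, total):
    """Return part/total as a percentage, or None when total is zero."""
    if total == 0:
        return None
    return 100.0 * part / total


def cache_hit_ratio_pct(cache_hits, total_io_requests):
    """I/O requests satisfied in the cache / total I/O requests, in percent."""
    return ratio_pct(cache_hits, total_io_requests)


def tablespace_usage(used_mb, total_mb):
    """Used percentage and free space (MB) for one tablespace."""
    return {"used_pct": ratio_pct(used_mb, total_mb), "free_mb": total_mb - used_mb}


if __name__ == "__main__":
    # Hypothetical figures, used only to exercise the functions.
    print(cache_hit_ratio_pct(cache_hits=950, total_io_requests=1000))
    print(tablespace_usage(used_mb=750, total_mb=1000))
```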

3.5 Dashboard definition

For this project, we will implement a dashboard to display server metrics for specific customers.

Since this is a software platform, the operating status of the servers on which it runs may be monitored. Also, this dashboard may be used both by customers and by the ASSIA support team. The relevant KPIs that will be monitored are:

 This will include all servers, with subpanels per type of statistic.

o 5min load: Average CPU load over the last 5 minutes

o CPU usage: Historical data, read at a frequency to be determined.

o Uptime: The time the monitored server has been up and running.

o Users connected monitoring: Number of users connected at a given time

o Temperatures: If available (and reported), server temperature.

o Processes: Number of live processes at a specific point in time

o Memory Usage: RAM usage, and type

o Swap usage: Swap memory usage

o Disk usage (MBps + IOPS): Total amount of data transmitted or received, as well as

the measured read and write operations to the disk.

o Network traffic: per interface, amount of data transmitted and received.

4 Proposal description

After market research looking for tools that meet the aspects described in sections 2.2.1 IT deployment automation selection process and 2.2.2 Monitoring software requirements, a decision has been made that meets the Design specification and restrictions. The following paragraphs describe the selection and implementation processes.

4.1 IT deployment automation

4.1.1 Selection process and final decision

As listed in 2.2.1 IT deployment automation selection process, a number of requirements are specified for the selection of the automatic deployment tool. Many tools on the market meet some of the required specifications, but only four met most of the requirements: Ansible, SaltStack, Rudder and CFEngine. A requirements compliance matrix is included in Annex D: Deployment automation tool analysis matrix.

The investigation process lasted 1 week. The information sources were mainly web pages related to each tool, complemented by hands-on testing of the tools. Enough information was gathered to make a decision, and Ansible became the selected tool.

CFEngine was discarded early in the research process. The tool seemed to be very powerful and efficient, but some misalignments were found:

 It has no community version

 Very old launch date. The product has been around for 26 years, which in software is too much time.

 Configuration files format is not standard

 Requires an agent on every host, so initial configuration is complex

 Requires extra effort in connectivity, which adds difficulty, since the customer’s security team would need to configure specific firewall rules, which can introduce additional delays in project execution.

 Learning curve is not optimal

Next, Rudder was analyzed. The product seems promising, but some aspects made it ineligible. The most important drawbacks are listed below:

 No big community. This tool is not widely known, and the development team is small.

 Requires an agent installed on every host, so initial configuration is complex

 Requires extra efforts in connectivity, like the previous tool.

For the remaining options, Ansible and SaltStack, the decision was less clear. Both tools seemed equally good, with very good communities and extensive documentation. But again, one of them, SaltStack, required extra connectivity. So eventually, the chosen tool was Ansible. Needing only SSH access to hosts, and not requiring local agents on every node, makes it simple to orchestrate complex Expresse deployments. Setting up an environment like the one depicted in Figure 35. Most complex Expresse architecture is just a matter of a few minutes, once the support team has the initial set of configurations implemented. The only weakness found was that it needs at least Python 2.6 installed on each managed server, but for ASSIA customers this is not a problem, since most servers run RedHat 6.* or newer (or the equivalent CentOS version), so Python 2.6 is included.

4.1.2 Ansible to automate deployments

This document is not intended as a guide to the tool. However, for ease of understanding, the concepts referenced later are briefly explained here:

• Ansible connectivity. Ansible requires SSH connectivity with the target hosts.

• Ansible inventory. The inventory is the set of target hosts, groups of target hosts, variables (global, group-specific, host-specific) and any other configuration declared in variables that may apply to a specific scenario. In this project, there will be one inventory per customer, as different customers have different servers, architectures and configurations. The centralized monitoring scenario will also have its own set of configurations as an additional inventory.

• Ansible playbooks. A playbook is the set of tasks and/or commands that must be executed in the right sequence to perform an activity. The activity may be executed on specific hosts defined in the inventory, on a defined group of hosts, or on hosts given as a comma-separated list. A typical use case is a playbook containing all the tasks needed to download a piece of software from a repository, upload it to the target servers, install it and then alter the standard configuration to meet specific needs.

• Ansible roles. Roles are playbooks written to generalize activities regardless of the environment. They have their own set of default variables and tasks. Roles can later be imported by playbooks, which use them for a customer-specific environment.

• Variables. Variables can be custom or Ansible built-in. They can be used in playbooks or roles through the Jinja2 templating system ([17]). Normally they are used to adapt playbooks or roles to the specific needs of each environment. Their values can be inherited (from roles or groups), and inherited values can be overwritten by playbooks or at Ansible launch time.
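The inheritance and override behaviour described for variables can be pictured with a small Python sketch. It is a simplified model, not Ansible's actual implementation or its complete precedence list: it only layers the sources named above (role defaults, group variables, playbook variables and values passed at launch time). All variable names and values are hypothetical.

```python
from collections import ChainMap

# Hypothetical values; none of these names come from the project itself.
role_defaults = {"rpm_local_path": "/opt/rpms", "influxdb_port": 8086}
group_vars = {"influxdb_host": "MonServ"}
playbook_vars = {"rpm_local_path": "/srv/customer_a/rpms"}
launch_time_vars = {"influxdb_port": 9086}  # e.g. passed with --extra-vars


def resolve(*layers):
    """Merge variable layers; layers listed first win over later ones."""
    return dict(ChainMap(*layers))


# Highest-priority source first, role defaults last.
effective = resolve(launch_time_vars, playbook_vars, group_vars, role_defaults)
print(effective)
```

In this model the launch-time value of influxdb_port replaces the role default, the playbook value of rpm_local_path replaces the role default, and influxdb_host is simply inherited from the group.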

With these key concepts described, it should be noted that, for this project, Ansible will be installed and used from one ASSIA Inc. server, as specified in Generic restrictions. The server will not be dedicated to this purpose, since Ansible is extremely lightweight (its Linux package takes only 53 MB) and all the needed configurations are stored in plain text files, which should not need much disk space. RAM and CPU requirements are minimal, so any host with 1 GB of RAM and one CPU should do the work without issues. This way, ASSIA will not incur any extra expenses for new infrastructure.

Finally, as Ansible will be installed on ASSIA servers, only members of the corresponding customer support teams, all of them ASSIA employees, will be able to perform operations with this tool.

4.2 Monitoring Framework

A review of past projects at ASSIA shows that two monitoring tools have already been used: Grafana and Kibana. This analysis focuses on them, since doing so eases the learning curve and takes advantage of the expertise already available. The next paragraphs give a brief analysis of both tools.

Kibana ([1]) is a polished data visualization tool released under the Apache 2.0 open source license. It is used to represent data stored in Elasticsearch ([6]), an advanced search engine based on Apache Lucene. Both tools, together with Logstash, make up the ELK stack, which has become one of the most popular tool sets for data analysis. Elasticsearch stores data efficiently and allows quick searches in huge data stores. Many kinds of information can be stored, from logs to time-series statistics, and large volumes are handled efficiently. The product scales very well and covers many use cases. Regarding system requirements, the minimum setup is a server with 8 GB of RAM, fast disks and many CPU cores; for best performance, 64 GB of RAM is recommended. This is not aligned with the low-cost needs of this project. Additionally, it is unlikely to comply with one of the points in the Generic restrictions section, which specifies that memory usage may not exceed 10% of total RAM. This automatically makes it not eligible for this project.

In contrast, Grafana ([8]) is a lightweight web tool for data representation. It supports many data sources, including Elasticsearch, InfluxDB, Prometheus, Graphite and more. It also has a large community that has contributed hundreds of dashboards, which can be used free of charge with little integration effort, and it provides a TV mode that suits the multi-customer scenario. Finally, it is easy to install and configure and runs as a Linux daemon, so it can be configured to start automatically after a server reboot. All these points make it eligible for this project.

Regarding data sources, InfluxDB has been chosen to store all data. This software, together with Telegraf and other tools, is part of the TICK stack developed by InfluxData, an open-source-oriented company. Both tools meet the criteria used to choose Grafana and integrate easily with it. InfluxDB runs as a Linux daemon, so it can be configured to start automatically after a server reboot. Telegraf is also extremely easy to deploy and configure and meets all the requirements for monitoring tools. It is lightweight and versatile, and it supports a wide variety of plugins, most of them built in, that cover the data collection needs of this project. Finally, it allows the collection schedule to be customized per indicator and, like InfluxDB, runs as a Linux daemon that can be started automatically after a server reboot.

4.3 Architectures proposal

Once the automatic deployment and monitoring tools have been chosen, the next step is to define a solution that meets the project specifications. Two different architectures will be designed: the “single-customer scenario” and the “multi-customer scenario”. In both scenarios, servers are labeled with specific roles, depending on their function. The roles are:

• Data collection role: this role is responsible for gathering all data corresponding to the defined KPIs. Any node with this role needs a tool that collects system metrics as well as Expresse-specific metrics or KPIs. Telegraf has been chosen for this purpose, since it comes with many plugins that allow the collection of most of the metrics defined in 3.3 Business process KPI specification and 3.4 Generic KPI specification. The following plugins will be included and configured: cpu, disk, diskio, http, nstat, ntpq, processes/procstat, swap, system, exec. All of them are documented in [9]. The exec plugin will be used to obtain the KPIs gathered using SQL.

• Data store role: the purpose of this role is to receive all collected data and store it as time series, so that every value is labeled and timestamped and any metric can be represented as an evolution of values over time. This role is also responsible for serving the data requests issued by data representation role entities. InfluxDB will be the database used, as mentioned earlier. Product information can be found in [10].

• Data representation role: this role uses all the data stored by the previous role to plot it in the dashboards defined in the Dashboard definition section. The tool used for this purpose will be Grafana, as mentioned earlier.

• Deployment role: this role is responsible for orchestrating all the operations needed to deploy the different tools and files and for executing any action needed to have the monitoring framework installed and running with the corresponding configuration.
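To illustrate how the exec plugin mentioned for the data collection role can feed a SQL-based KPI to the data store, the sketch below formats a value in InfluxDB line protocol, one of the data formats that Telegraf can parse from a script's standard output. It is only a sketch under stated assumptions: the measurement name expresse_kpi, the field pending_jobs and the value 12 are hypothetical, the SQL query itself is omitted, and the code is written for Python 3.7 or newer, which may differ from the interpreter available on the managed hosts.

```python
import time


def _escape_tag(value):
    """Escape commas, equals signs and spaces in keys and tag values."""
    return str(value).replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def _format_field(value):
    """Render a field value using line protocol type conventions."""
    if isinstance(value, bool):  # bool must be checked before int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build one record: measurement,tag_set field_set timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    name = measurement.replace(",", r"\,").replace(" ", r"\ ")
    tag_part = "".join(
        f",{_escape_tag(k)}={_escape_tag(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_tag(k)}={_format_field(v)}" for k, v in fields.items()
    )
    ts = time.time_ns() if timestamp_ns is None else timestamp_ns
    return f"{name}{tag_part} {field_part} {ts}"


if __name__ == "__main__":
    # Hypothetical KPI; in a real script the value would come from a SQL query.
    print(to_line_protocol("expresse_kpi", {"host": "DBServ"}, {"pending_jobs": 12}))
```

A script of this kind prints one line per KPI, so each scheduled execution by Telegraf yields one timestamped sample per indicator.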

Once the roles are defined, the next step is to map them onto architecture proposals. The first one is the architecture design for the single-customer scenario:

Figure 1. Single-customer scenario architecture

In this architecture, all servers hosting the monitoring framework belong to the customer and are located on customer premises, while the server orchestrating the automatic deployment belongs to ASSIA Inc. Some highlights of the architecture:

• All servers located on customer premises have the data collection role. This role is performed by Telegraf.

• The server named “MonServ” has both the data store and data representation roles. Data representation is implemented with Grafana, which is accessible over HTTP on port 3000, for example at http://MonServ:3000/. The data store role is performed by InfluxDB, which listens on port 8086 both for incoming data collected by Telegraf and for data requests coming from Grafana.

• The server named “DBServ” is the one where the Expresse database is installed. It hosts a collection of scripts orchestrated by Telegraf to gather some of the indicators (those of SQL type) specified in 3.3 Business process KPI specification.

• The server named “DepServ” hosts Ansible and has direct connectivity to all servers on customer premises by means of a site-to-site VPN. Over that VPN, SSH connectivity is available through a jump host, which is not represented in the design above for clarity. For this project, Ansible has been installed directly on the jump host, to reduce ASSIA-internal networking overhead.
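The two kinds of traffic that InfluxDB receives on port 8086 can be made concrete with a short sketch that builds the corresponding URLs of the InfluxDB 1.x HTTP API: /write for incoming line-protocol data and /query for InfluxQL requests such as those issued by Grafana. The sketch only builds the URLs and does not contact any server. The database name telegraf and the sample query are assumptions for illustration.

```python
from urllib.parse import urlencode


def influx_write_url(host, database, port=8086):
    """URL that accepts line-protocol data via HTTP POST (InfluxDB 1.x)."""
    return f"http://{host}:{port}/write?" + urlencode({"db": database})


def influx_query_url(host, database, influxql, port=8086):
    """URL that runs an InfluxQL query via HTTP GET (InfluxDB 1.x)."""
    return f"http://{host}:{port}/query?" + urlencode({"db": database, "q": influxql})


# "telegraf" is an assumed database name; MonServ is the host from the design.
print(influx_write_url("MonServ", "telegraf"))
print(influx_query_url("MonServ", "telegraf", "SELECT * FROM cpu LIMIT 1"))
```

Because both endpoints share the same port, a single firewall rule towards MonServ on 8086 covers the collectors as well as the dashboards.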

Secondly, an architecture for centralized platform monitoring of Expresse deployments from many customers, for ASSIA Inc. internal use, has also been designed as part of this project. This is the multi-customer scenario named earlier. It requires every customer included to have the single-customer scenario already implemented, although the data representation role is not mandatory there. The architecture differs slightly from the previous scenario:

• Data collection roles are not represented, although they are assumed to be deployed as part of the respective single-customer scenario.

• The data representation role is implemented on ASSIA premises.

• The data representation role connects to the data store servers of each customer:

Figure 2. Multi-customer scenario architecture

The current project implements the single-customer scenario.

4.4 Ansible installation

To start working with Ansible, it is installed on DepServ following the instructions in [16]. The only additional information worth mentioning here is that, as indicated before, Ansible requires at least Python 2.6, and this requirement must be met on every managed host. It may also require the libselinux-python package when the managed hosts have SELinux enabled.

4.5 Ansible roles specification

The next step is the definition and implementation of the Ansible roles. The goal is to identify the activities to be done, enumerate the corresponding tasks, sequence them and, with that information, implement the Ansible roles. The identified activities, in execution order, are:

• Deploy Grafana. This activity requires the following tasks:

  ◦ upload the corresponding package to the target host

  ◦ install the package once uploaded

  ◦ ensure the required folders exist

  ◦ configure data sources and specific dashboards, depending on the scenario to be implemented

  ◦ configure the service to start automatically after a host reboot

  ◦ start the service

• Deploy InfluxDB:

  ◦ upload the corresponding package to the target host

  ◦ install the package once uploaded

  ◦ ensure the required folders exist

  ◦ configure the service to start automatically after a host reboot

  ◦ start the service

• Deploy Telegraf:

  ◦ upload the corresponding package to the target host

  ◦ install the package once uploaded

  ◦ depending on whether the target host is of type ServN or DBServ, apply the corresponding configuration, uploading all additional files needed for DBServ

  ◦ start the service

Each role will have its own set of default variables, which are the following:

• Local path on DepServ where the specific RPM package is stored

• Path on the target host where the RPM package will be uploaded

• Path to the configuration templates to be used, if any

• Configuration to be used to connect to the monitoring database, if needed

4.6 Ansible inventories

An inventory must be defined for each customer. As described before, a customer inventory may contain custom configurations that apply only to that customer. The information that may be provided in each inventory is:

• List of servers, grouped by server type, as specified in Architectures proposal. Servers must be reachable from DepServ.

• For each server, the SSH credentials to be used

• For DB servers, the Oracle credentials to be used

In addition, a further inventory will be defined to implement the multi-customer scenario. It contains information on the target host (ASSIAMonServ), such as its IP address, the SSH port if it is not the standard one, credentials and any other relevant configuration.
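As an illustration of the kind of information an inventory carries, the sketch below expresses one as an Ansible dynamic inventory script, which prints the groups and host variables as JSON. This is an alternative shown only for illustration: the project itself uses a static inventory file (environments/hosts). The group names are those used later in the Results section; host names, the ansible_user value and the port are hypothetical, and credentials are deliberately left out.

```python
#!/usr/bin/env python3
import json
import sys

# Hypothetical hosts; only the group names come from this document.
INVENTORY = {
    "servers": {"hosts": ["serv1.example.com", "serv2.example.com"]},
    "db_servers": {"hosts": ["dbserv.example.com"]},
    "monitoring_db_servers": {"hosts": ["monserv.example.com"]},
    "monitoring_gui_servers": {"hosts": ["monserv.example.com"]},
    "_meta": {
        "hostvars": {
            "dbserv.example.com": {"ansible_user": "deploy", "ansible_port": 22},
        }
    },
}


def main(argv):
    """Return the JSON that Ansible expects for --list or --host <name>."""
    if len(argv) > 1 and argv[1] == "--host":
        host = argv[2] if len(argv) > 2 else ""
        data = INVENTORY["_meta"]["hostvars"].get(host, {})
    else:
        data = INVENTORY
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    print(main(sys.argv))
```

An executable script like this can be passed to ansible-playbook with the -i option in place of a static file, which may be convenient if the server list is kept in another system.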

4.7 Ansible playbooks

For each customer, a playbook will be defined that imports all the defined roles, configured to implement the single-customer scenario. Each playbook will be paired with the corresponding customer inventory, so that all the needed configurations are ready to use.

5 Results

To test the implementation, the roles specified in 4.5 Ansible roles specification were implemented first. Initial tests focused on individual roles, using specific playbooks that imported only the role under test. Once this had been checked, the Ansible playbooks described in 4.7 Ansible playbooks were implemented and tested on virtual machines that ASSIA Inc. already had for testing purposes. A common inventory was used for all tests. The following sections describe all the tests performed.

5.1 install_influx role testing

The purpose of this test is to check that the InfluxDB software is installed properly on any target host. These target hosts are expected to be included in the Ansible inventory and to belong to a group named “monitoring_db_servers”.

• Step 1: check that the software is not installed

Figure 3. install_influx role pre-execution checks

This evidence is enough to confirm that the software was absent from the target host, which for this test is telefonica-chile-dev.

• Step 2: launch the playbook from DepServ. In this test, DepServ is fgarzas-dev.

Figure 4. install_influx role execution evidence

The output suggests that the service start failed, but a check on the target host shows that the service did start:

Figure 5. install_influx role post-execution check

The test was therefore successful.

Command used to install:

date && ansible-playbook -i environments/hosts playbooks/main.yml --limit monitoring_db_servers

In the inventory used, monitoring_db_servers contained only the target host, for testing purposes. playbooks/main.yml was also adapted for this test.
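One additional way to confirm that the service is listening after the playbook run is to open a TCP connection to the InfluxDB port. The sketch below is an assumption for illustration, not the check shown in the figures; it only tells whether something accepts connections on the given port, not that it is InfluxDB.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

For the test above, the call would be port_is_open("telefonica-chile-dev", 8086), run from a host with connectivity to the target.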

5.2 install_telegraf role testing in normal server

The purpose of this test is to check that the Telegraf software is installed properly on any target host, that data is being collected and that it is correctly sent to InfluxDB. These target hosts are expected to be included in the Ansible inventory and to belong to a group named “servers”. To tell Telegraf which InfluxDB instance must receive the collected data, the inventory must define the influxdb_connection variable accordingly. This test also requires InfluxDB to be started and listening for incoming data.

• Step 1: check that the software is not installed

Figure 6. install_telegraf role pre-execution tests

This evidence is enough to confirm that the software was absent from the target host, which for this test is telefonica-chile-dev.

• Step 2: launch the playbook from DepServ. In this test, DepServ is fgarzas-dev.

Figure 7. install_telegraf role execution evidence

On the target host, the telegraf service can be seen running:

Figure 8. install_telegraf role post-execution check (1/3)

Telegraf has also created a database in the previously installed InfluxDB to store the data it gathers:

Figure 9. install_telegraf role post-execution check (2/3)

In addition, data can be seen being stored in the database:

Figure 10. install_telegraf role post-execution check (3/3)

This shows that the test was successful. Command used:

date && ansible-playbook -i environments/hosts playbooks/main.yml --limit telefonica-chile-dev

5.3 install_telegraf role testing in database server

The purpose of this test is to check that the Telegraf software is installed properly on any host where the Expresse database is installed, that Expresse KPIs and Oracle indicators are being collected and that they are correctly sent to InfluxDB. These target hosts are expected to be included in the Ansible inventory and to belong to a group named “db_servers”.

• Step 1: check that the software is not installed

Figure 11. install_telegraf role for database server. Pre-execution check

• Step 2: launch the playbook from DepServ. In this test, DepServ is fgarzas-dev.

The evidence gathered for this step is split into several screenshots, since the playbook has many tasks and there are many files to upload:

Figure 12. install_telegraf role for database server. Execution evidence (1/4)

Figure 13. install_telegraf role for database server. Execution evidence (2/4)

Figure 14. install_telegraf role for database server. Execution evidence (3/4)

Figure 15. install_telegraf role for database server. Execution evidence (4/4)

After that, the following checks were performed:

Figure 16. install_telegraf role for database server. Post-execution checks (1/2)

Additionally, some checks were performed on DBServ to test whether Expresse and Oracle metrics were being sent to InfluxDB:

Figure 17. install_telegraf role for database server. Post-execution checks (2/2)

The test was therefore successful.

Command used:

date && ansible-playbook -i environments/hosts playbooks/main.yml --limit telefonica-chile-db

5.4 install_grafana role testing

As before, the purpose of this test is to check that Grafana is installed properly on any target host, is accessible and represents data from InfluxDB correctly. Target hosts are expected to belong to the group monitoring_gui_servers.

• Step 1: check that the software is not installed

Figure 18. install_grafana role pre-execution checks

• Step 2: launch the playbook from DepServ. In this test, DepServ is fgarzas-dev.

Figure 19. install_grafana role execution evidence

Once the playbook had run, Grafana was accessed from a browser at the following URL: http://telefonica-colombia-dev.assia-inc.com:3000/. There it was checked that the data source was correctly configured and that the initial dashboard was working, as shown in the snapshots below:

Figure 20. install_grafana role post-execution checks (1/8)

Figure 21. install_grafana role post-execution checks (2/8)

Figure 22. install_grafana role post-execution checks (3/8)

The following dashboard snapshots plot the corresponding information:

Figure 23. install_grafana role post-execution checks (4/8)

Figure 24. install_grafana role post-execution checks (5/8)

Figure 25. install_grafana role post-execution checks (6/8)

Figure 26. install_grafana role post-execution checks (7/8)

Figure 27. install_grafana role post-execution checks (8/8)

The test was therefore successful.

Command used:

date && ansible-playbook -i environments/hosts playbooks/main.yml --limit telefonica-colombia-db

6 Budget

As indicated several times throughout this document, the goal of this project is to reduce OPEX, so costs have been kept to a minimum. The budget to execute this project therefore does not include any equipment purchases, license costs or external professional services, since none were needed. The project has been planned, executed and closed using only ASSIA internal resources.

The resources are described in the following list:

• Project Manager. This resource was needed to perform the following tasks:

  ◦ Determine the activity list

  ◦ Define the project schedule

  ◦ Gather historical project documentation that might be required

  ◦ Determine the required resources

  ◦ Define the communications plan

• Director of Deployment Engineering. This resource was needed to:

  ◦ Approve the project scope

  ◦ Resolve the legal needs of the project

  ◦ Confirm project compliance with ASSIA's current needs

• Delivery experts board. This team was required to:

  ◦ Review the solution requirements

  ◦ Oversee the KPIs to be monitored

  It consisted of 3 experts with a range of experience in Expresse deployments and technology-specific aspects.

• Development experts board. This team was required to:

  ◦ Provide details on the relevant Expresse aspects to be monitored

  ◦ Help to focus better on the KPIs that should be monitored

  It consisted of 4 developers with deep knowledge of the Expresse solution.

• Deployment engineer. This resource was responsible for:

  ◦ Product research

  ◦ KPI proposal

  ◦ Project documentation

  ◦ Project execution

With those resources, project budget has been calculated, resulting in a final cost of 23.300€. Further details on budget calculation can be found in next table:

| Role description | Number of resources | Average cost per resource/day | Number of days | Final cost |
| --- | --- | --- | --- | --- |
| Project Manager | 1 | 1.000,00 € | 4 | 4.000,00 € |
| Director of Deployment Engineering | 1 | 2.000,00 € | 0,5 | 1.000,00 € |
| Delivery experts board | 3 | 900,00 € | 1 | 2.700,00 € |
| Development experts board | 4 | 900,00 € | 1 | 3.600,00 € |
| Deployment engineer | 1 | 600,00 € | 20 | 12.000,00 € |
| TOTAL | | | | 23.300,00 € |

7 Conclusions

Once the implementation steps are finished, the monitoring frameworks are set up and collection has started, the Expresse monitoring task changes completely compared to the previous approach. Data is now gathered and loaded into the deployed InfluxDB instances and then plotted in Grafana dashboards. With those dashboards, an ASSIA engineer needs just 5 minutes to review the status of a customer's Expresse servers, and the review itself is a much easier task. Additionally, the learning curve has been considerably shortened, since engineers no longer need a deep understanding of the processes to detect issues in the platform.

The implemented solution has also been a sound investment in financial terms. Before it, a deployment engineer spent half an hour per day per customer reviewing each deployment in the needed detail. Now engineers need only 5 minutes, a time reduction of 83%. A simple calculation shows the efficiency gained. It uses the following terms:

• C0: cost of a deployment engineer = 600 €/day

• N: number of customers

• T0: time per customer and day that deployment engineers need to monitor Expresse without the presented solution, in working days

• T1: time per customer and day that deployment engineers need to monitor Expresse with the presented solution, in working days

• Tt: total working days per year = 5 days/week * 52 weeks/year = 260 days/year

The total time per year, in working days, that an engineer used to spend monitoring the deployments of all customers is:

Tprev = N * T0 * Tt = N * (0,5 hours / 8 hours/day) * 260 days/year = 16,25 * N days/year

The time per year needed with the new monitoring framework is:

Tnew = N * T1 * Tt = N * ((5/60 hours) / 8 hours/day) * 260 days/year ≈ 2,71 * N days/year

The total time saved with the new framework, in working days per year, is:

Ttime_saved = Tprev - Tnew ≈ 13,5 * N days/year

Ttime_saved already includes the number of customers, so the total savings per year are Ttime_saved * C0 ≈ 13,5 * N * 600 = 8.100 * N €/year.

As the project budget is 23.300 €, if ASSIA uses this new framework to monitor at least 3 customers, the investment will be recovered in less than 1 year.
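The calculation above can be reproduced with a few lines of Python. This is only a check of the arithmetic using the figures given in this chapter (8-hour working day, 260 working days per year, 600 €/day, budget of 23.300 €); the function names are illustrative.

```python
HOURS_PER_DAY = 8
WORKING_DAYS_PER_YEAR = 5 * 52      # Tt
COST_PER_DAY_EUR = 600              # C0
BUDGET_EUR = 23300

T0 = 0.5 / HOURS_PER_DAY            # 30 minutes, expressed in working days
T1 = (5 / 60) / HOURS_PER_DAY       # 5 minutes, expressed in working days


def yearly_savings_eur(customers):
    """Savings per year for a given number of monitored customers."""
    days_saved = customers * (T0 - T1) * WORKING_DAYS_PER_YEAR
    return days_saved * COST_PER_DAY_EUR


def customers_to_break_even(budget=BUDGET_EUR):
    """Smallest number of customers whose first-year savings cover the budget."""
    n = 1
    while yearly_savings_eur(n) < budget:
        n += 1
    return n


print(round(yearly_savings_eur(1)))
print(customers_to_break_even())
```

Without intermediate rounding the savings come to about 8.125 € per customer and year, slightly above the rounded 8.100 € used in the text, and the break-even point is still 3 customers.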

Additionally, the implemented solution lays the foundation for future project extensions, in which new indicators will be defined and new activities automated, further increasing efficiency.

In summary, the main conclusion is that this approach helps ASSIA to significantly reduce the OPEX related to customer platform monitoring, meeting all specifications and restrictions, with a small investment that will be recovered in a very short period.

8 Future works

For ASSIA Inc., this project lays the first stone towards efficient monitoring automation. Being able to deploy and configure monitoring frameworks easily is good news for the company, but the project can be extended in many ways. The proposals below, each with a brief description, outline the work streams that will probably be implemented within this year.

• Expresse and Oracle KPI dashboard implementation. As the KPIs are already defined, the next step is to define the dashboards that represent them. They will be designed following a top-down approach: the top of the page will show a brief summary with the latest values of the main KPIs, and scrolling down will give ASSIA engineers much more detail on every process.

• Multi-customer environment implementation. As ASSIA grows and wins new customers, it becomes more important to have the main KPIs monitored in a central platform, which allows ASSIA to benchmark different customers easily and makes it possible to detect anomalies at a high level.

• New KPI definition and corresponding dashboard modification/implementation. This initiative has been very well received by ASSIA colleagues from other departments, such as Systems Engineering, who have shown interest in defining new KPIs and including them in the monitoring framework, to automatically obtain detailed information on different processes.

References

[1] Elasticsearch B.V. (2019). Kibana product page. Retrieved from https://www.elastic.co/products/kibana

[2] ASSIA Inc. (2019). Retrieved from ASSIA Inc. web site: https://www.assia-inc.com/

[3] ASSIA Inc. (2019). Application Monitoring. Redwood City: ASSIA Inc.

[4] ASSIA Inc. (2019). Expresse High Availability Setup. Redwood City: ASSIA Inc.

[5] ASSIA Inc. (2019). Profile Optimizer. Redwood City: ASSIA Inc.

[6] Elasticsearch B.V. (2019). Elasticsearch engine product page. Retrieved from https://www.elastic.co/products/elasticsearch

[7] FOSSA, Inc. (2019). Software Licenses in Plain English. Retrieved from tl;drLegal: https://tldrlegal.com

[8] Grafana Labs. (2019). Grafana product page. Retrieved from https://grafana.com

[9] InfluxData. (2019). Telegraf Input Plugins. Retrieved from InfluxData web page: https://docs.influxdata.com/telegraf/v1.10/plugins/inputs

[10] InfluxData. (2019). InfluxDB product page. Retrieved from https://docs.influxdata.com/influxdb/

[11] Normation. (2019). What is RUDDER? Retrieved from RUDDER: http://www.normation.com/en/rudder/what-is-rudder/

[12] Northern.tech, Inc. (2019). CFEngine product page. Retrieved from CFEngine: https://cfengine.com

[13] Open Source Software Initiative. (2019). Retrieved from Open Source Software Initiative web site: https://opensource.org/

[14] Oracle Corporation. (2019). Retrieved from Oracle Community: https://community.oracle.com

[15] Oracle Corporation. (2019). Documentation. Retrieved from Oracle Help Center: https://docs.oracle.com

[16] Red Hat, Inc. (2019). Automation for everyone. Retrieved from Red Hat Ansible: https://www.ansible.com/

[17] Ronacher, A. (2019). Jinja. Retrieved from http://jinja.pocoo.org/

[18] SaltStack, Inc. (2019). Retrieved from SaltStack: https://www.saltstack.com/

[19] InfluxData. (2019). InfluxData downloads. Retrieved from InfluxData web page: https://portal.influxdata.com/downloads/

Annex A: Diagrams

This section contains the diagrams referenced throughout this document, each followed by a brief description.

Figure 28. Generic Expresse integration architecture

The figure above represents a generic scenario with Expresse integrated into the customer's systems architecture. On the lower left side, inventory systems provide Expresse with information on which customers are to be included in Expresse, where they are provisioned, which services they are subscribed to and what their external plant looks like, in terms of primary/secondary cables, terminal boxes (DSL) or ODN distribution. This information is relevant to Expresse, as it is used to map network resources (lines as they exist in network elements) to business resources (any customer identification). Additionally, trouble tickets can be provided to Expresse, to improve the accuracy of its recommendations as well as to ease the assessments of the ASSIA support team.

On the lower right side, a cloud represents the network elements. Expresse connects directly to them, without using any NMS from the equipment vendor. Communication is commonly achieved using SNMP, although other protocols are supported.

On the upper right side, several actors and systems can be found. Some of them are users connecting to Expresse over HTTP/HTTPS in order to access the Expresse GUI; other users may connect to Expresse indirectly, through SOAP web services triggered by OSSs. These users can belong to any of the groups defined in the user types enumeration in the Expresse architecture design section.

Finally, on the upper left side there are a number of users or systems receiving information from Expresse. Those systems normally receive a CSV file with the results of dynamic reports, which can be consumed by many different departments.


Figure 29. Region representation for big deployments

Customers with a big network, offering services to more than 3 million customers, may have Expresse operations split into smaller regions. The number of customers is not the only aspect that can lead to region splitting: management network bandwidth may also require Expresse servers to be installed “closer” to the region they serve, in order to optimize network bandwidth usage.

Figure 30. Full-tier application server


Figure 31. Role-specific application servers

Figure 30 and Figure 31 show different degrees of functional specialization for application servers, from the simplest (the first one) to the most complex (the last one). The latter applies to deployments of Expresse in very big networks, over 6 million lines, although such deployments can also be found with web and application servers integrated, but mirrored, as specified in

Figure 32. Expresse simple DB architecture


Figure 33. Expresse complex DB architecture

Figure 32 and Figure 33 show different degrees of functional specialization for DB servers, from the simplest (the first one) to a more complex one (the last one). The database technology used for Expresse is Oracle Enterprise Database. In the first case, a single server with this product installed is enough, with just one Oracle instance accessing data stored on local disks. In the second case, Oracle RAC is needed, so that several Oracle instances can be deployed on different servers, all accessing the same data, which is stored in a storage server. A virtual IP exists to access Oracle SCAN, a technology that provides a logical clustered layer for accessing a database instance, together with a mechanism that makes node failures transparent to users.

Figure 34. Expresse DB servers synchronized with Oracle Data Guard


Another scenario that can be found in Expresse deployments is the one represented in Figure 34. In this case, data storage is kept synchronized with a stand-by database, so that downtime related to main database failures is decreased significantly.

Figure 35. Most complex Expresse architecture

The ultimate architecture, devised for customers where Expresse becomes a systemic OSS, is shown in the figure above. It can be observed that replication exists for all types of servers: the web server role is integrated into the application server role, with n servers having the application server role (3 in this example) and DcPc configured in distributor mode. DcPc role servers are also replicated, with 2 defined regions having stand-by servers that would be enabled automatically in case of failure of the main DcPc servers. Database replication is in place both for the database access servers and for the data layer. Additionally, this architecture covers full-deployment replication, so in case of a power outage in one data center, the other one would be ready to work within a short time.


Annex B: Open Source licenses of analyzed software

This annex summarizes the characteristics of the open-source software licenses of the products considered for this project. A brief description of each is included (for complete information, refer to [7]):

• MIT: There are no restrictions on using this software, even for commercial purposes. The only conditions are the following:
    o Copyright must be included
    o License must be included
    o Software author cannot be held liable. This means that the author of MIT-licensed software cannot be held responsible for any situation related to the software or its usage.

• Apache License 2.0 (Apache-2.0): Close to the MIT license. Additional points:
    o Changes must be stated
    o NOTICE files must be updated/appended with attribution notes whenever the original software had “NOTICE” files.
    o Contributors' names, trademarks or logos cannot be used.

• GNU Lesser General Public License v3 (LGPL-3.0):
    o Copyright must be included
    o License must be included
    o Install instructions must be included in case the software is used to build (as a part of) a consumer device
    o Changes must be published/stated, and modification dates must be tracked in source files
    o References to the install instructions of the original software, or to the place where they can be obtained, must be included
    o Software author cannot be held liable
    o Products cannot be sub-licensed. This implies that any derivative work of the library may be redistributed only under LGPL, but applications using that library don’t have to.
    o The right to practice the patent claims of contributors is granted

• GNU General Public License v3 (GPL-3.0). This license is very close in its definition to LGPLv3:
    o In addition to the LGPLv3 points, any change made, or any other software that includes GPL-licensed code, must be made available under GPL, along with build and install instructions.

• GNU Affero General Public License v3 (AGPL-3.0). It is also very close to the other GPL licenses, with the particularity that it is focused on network/web software. Nothing is stated in terms of patent claims in this case.


Annex C: User Guide

Since the purpose of this project is to have a set of tools that automatically deploy a monitoring framework, this annex lists the commands that may be used to achieve that, along with some variations that may apply.

1. Install Ansible

The guide for installing Ansible can be found in [16]. In this case, Ansible was installed on DepServ, where the EPEL repository was enabled so that the yum package manager could be used. To install the package, the following operations were performed as the root user:

• Create the /etc/yum.repos.d/epel.repo file with the following content:

[epel]
name=Extra Packages for Enterprise Linux 6 - $basearch
#baseurl=http://download.fedoraproject.org/pub/epel/6/$basearch
mirrorlist=https://mirrors.fedoraproject.org/metalink?repo=epel-6&arch=$basearch
failovermethod=priority
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 6 - $basearch - Debug
#baseurl=http://download.fedoraproject.org/pub/epel/6/$basearch/debug
mirrorlist=https://mirrors.fedoraproject.org/metalink?repo=epel-debug-6&arch=$basearch
failovermethod=priority
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
gpgcheck=1

[epel-source]
name=Extra Packages for Enterprise Linux 6 - $basearch - Source
#baseurl=http://download.fedoraproject.org/pub/epel/6/SRPMS
mirrorlist=https://mirrors.fedoraproject.org/metalink?repo=epel-source-6&arch=$basearch
failovermethod=priority
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
gpgcheck=1

• Clean the yum cache: yum clean all

• Install the Ansible package using yum: yum install ansible

• Check that Ansible has been correctly installed: rpm -qa | grep ansible

Once the above steps are completed, Ansible is installed.

2. Create directory structure and place corresponding files

Although it is not mandatory, most users follow best practices when writing Ansible playbooks, to ease sharing them with the community. We have tried to follow these best practices by applying the following directory structure:

• expresse_monitoring: the root folder
    o roles: a folder containing the required roles
        ▪ install_influx:
            • README.md: role documentation in Markdown format
            • tasks
                o main.yml: file containing all tasks needed to deploy InfluxDB
            • handlers
                o main.yml: file containing the handler configuration, which tells Ansible how to restart InfluxDB
            • defaults
                o main.yml: file containing all default variables needed for this role
            • templates
                o influxdb.conf: configuration template
            • files
                o influxdb-1.7.6.x86_64.rpm: rpm package previously downloaded from [19]
        ▪ install_grafana:
            • README.md: role documentation in Markdown format
            • tasks
                o main.yml: file containing all tasks needed to deploy Grafana
            • handlers
                o main.yml: file containing the handler configuration, which tells Ansible how to restart Grafana
            • defaults
                o main.yml: file containing all default variables needed for this role
            • templates
                o customer_datasource.yml.j2: template containing the data source for a specific customer, to be used by Grafana
                o sysmetrics_dashboard.json.j2: dashboard definition, containing by default just one customer server showing statistics
            • files
                o dashboards.yml: generic configuration indicating the path of the provisioned dashboards
                o grafana-6.2.2-1.x86_64.rpm: rpm package previously downloaded from [8]
        ▪ install_telegraf:
            • README.md: role documentation in Markdown format
            • tasks
                o main.yml: file containing all tasks needed to deploy Telegraf
            • handlers
                o main.yml: file containing the handler configuration, which tells Ansible how to restart Telegraf
            • defaults
                o main.yml: file containing all default variables needed for this role
            • templates
                o telegraf.conf.j2: template containing the configuration for a specific customer, to be used by Telegraf on servers not hosting the Expresse database
                o telegraf_db.conf.j2: template containing the configuration for a specific customer, to be used by Telegraf on servers hosting the Expresse database
                o execSQL.sh: script to get Expresse statistics
                o execSQL_sysdba.sh: script to get Oracle statistics
                o 15min: folder containing files with the SQL queries for statistics gathered every 15 minutes. The files are not included in this document for clarity.
                o hourly: folder containing files with the SQL queries for statistics gathered hourly. The files are not included in this document for clarity.
                o daily: folder containing files with the SQL queries for statistics gathered daily. The files are not included in this document for clarity.
            • files
                o telegraf-1.10.4-1.x86_64.rpm: rpm package previously downloaded from [19]
    o environments: a folder containing the Ansible inventories
        ▪ hosts: inventory used for this project
    o playbooks: a folder containing the Ansible playbook
        ▪ main.yml: a playbook that just imports the corresponding roles, depending on the scenario to be implemented
    o ansible.cfg: a file containing the Ansible configuration for the playbooks stored in the corresponding folder. Since it is so simple, its contents are included here:

[defaults]
roles_path = ~/expresse_monitoring/roles
display_skipped_hosts = false
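To illustrate how the files of a role fit together, the following is a minimal sketch of what tasks/main.yml and handlers/main.yml could look like for the install_influx role. It is not the project's actual role. The variable influxdb_rpm, the handler name and the destination path /etc/influxdb/influxdb.conf are assumptions made for this example; download_folder is the global inventory variable described in section 3.

```yaml
# roles/install_influx/tasks/main.yml (illustrative sketch, not the project's file)
- name: Upload the InfluxDB rpm package shipped in the role's files folder
  copy:
    src: "{{ influxdb_rpm }}"            # hypothetical variable, e.g. set in defaults/main.yml
    dest: "{{ download_folder }}/{{ influxdb_rpm }}"

- name: Install InfluxDB from the uploaded rpm package
  yum:
    name: "{{ download_folder }}/{{ influxdb_rpm }}"
    state: present
  become: true

- name: Render the InfluxDB configuration from the role template
  template:
    src: influxdb.conf
    dest: /etc/influxdb/influxdb.conf    # assumed configuration path
  become: true
  notify: restart influxdb               # triggers the handler below only on change

- name: Make sure InfluxDB is running and enabled at boot
  service:
    name: influxdb
    state: started
    enabled: true
  become: true

---
# roles/install_influx/handlers/main.yml (illustrative sketch)
- name: restart influxdb
  service:
    name: influxdb
    state: restarted
  become: true
```

The handler is kept separate from the tasks so that the service is restarted only when the rendered configuration actually changes, which keeps repeated playbook runs idempotent.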

3. Configure Ansible inventory

The Ansible inventory is a key element in this project. The relevant aspects to be considered when implementing it are the following:

• Global variables: default values for global variables are set in order to better manage all roles. All of them can be overridden by group variables or host variables. The list is the following:
    o ansible_become_pass: superuser password
    o ansible_become_method: method used by Ansible to become superuser
    o ansible_user: user for Ansible to connect through SSH
    o ansible_ssh_pass: password for the SSH connection
    o download_folder: path where Ansible will upload the rpm packages
    o customers_to_monitor: all target customers for this project

• Groups of servers: several groups of servers may be defined, to easily tell Ansible where each task has to be performed. The following groups are defined:
    o servers: those servers labeled as ServN, where Telegraf must be installed (with no Expresse/Oracle KPIs collection). This group has the following variables:
        ▪ ansible_user: user that Ansible must use to connect through SSH
        ▪ ansible_ssh_pass: corresponding password
        ▪ install_telegraf: variable that tells the corresponding role that Telegraf must be installed, not including Expresse/Oracle KPIs monitoring
    o db_servers: those servers labeled as DBServ, where Telegraf must be installed (additionally enabling Expresse/Oracle KPIs collection). This group has the following variables:
        ▪ ansible_user: user that Ansible must use to connect through SSH
        ▪ ansible_ssh_pass: corresponding password
        ▪ install_telegraf: variable that tells the corresponding role that Telegraf must be installed, including Expresse/Oracle KPIs monitoring
    o monitoring_db_servers: those servers labeled as MonServ or ASSIAMonServ, where InfluxDB must be installed. The corresponding variable is:
        ▪ install_influx: indicates that this software must be installed for this group
    o monitoring_gui_servers: those servers labeled as MonServ or ASSIAMonServ, where Grafana must be installed. The corresponding variable is:
        ▪ install_grafana: indicates that this software must be installed for this group
    o customer specific group: a group of servers named with a label listed in the customers_to_monitor global variable. It must contain all servers to be monitored, no matter the server role. It is used to limit playbook execution to specific customers.
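As a sketch of how these variables and groups could be laid out, the following inventory uses Ansible's YAML inventory format. It is only an illustration: the inventory actually used in the project may be written in a different format, and every host name, user, password, path and the customer label customer_a are placeholders invented for this example.

```yaml
# environments/hosts (illustrative sketch; all values are placeholders)
all:
  vars:
    ansible_become_method: su
    ansible_become_pass: "CHANGE_ME"
    ansible_user: deploy
    ansible_ssh_pass: "CHANGE_ME"
    download_folder: /tmp/monitoring_rpms
    customers_to_monitor:
      - customer_a
  children:
    servers:                       # ServN hosts: Telegraf without Expresse/Oracle KPIs
      hosts:
        serv1.customer-a.example:
      vars:
        install_telegraf: true
    db_servers:                    # DBServ hosts: Telegraf with Expresse/Oracle KPIs
      hosts:
        dbserv.customer-a.example:
      vars:
        install_telegraf: true
    monitoring_db_servers:         # MonServ hosts where InfluxDB is installed
      hosts:
        monserv.customer-a.example:
      vars:
        install_influx: true
    monitoring_gui_servers:        # MonServ hosts where Grafana is installed
      hosts:
        monserv.customer-a.example:
      vars:
        install_grafana: true
    customer_a:                    # customer specific group, used with --limit
      hosts:
        serv1.customer-a.example:
        dbserv.customer-a.example:
        monserv.customer-a.example:
```

A host may belong to several groups at once; here the monitoring server appears in both monitoring groups and in the customer group, so limiting a run to customer_a still reaches it.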

4. Configure Ansible playbook

This playbook is simple, since it only contains import statements that pull in the tasks implemented by the Ansible roles defined above. Its content is the following:


- hosts: monitoring_db_servers
  tasks:
    - import_role:
        name: install_influx

- hosts: servers,db_servers
  tasks:
    - import_role:
        name: install_telegraf

- hosts: monitoring_gui_servers
  tasks:
    - import_role:
        name: install_grafana
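The install_telegraf, install_influx and install_grafana inventory variables are described as telling each role whether the software must be installed. One possible way to honor such a flag, shown here only as a sketch and not as the project's actual implementation, is to attach a when condition to the role import, so that every task of the role is skipped on hosts where the flag is unset or false.

```yaml
# Variant of one play from playbooks/main.yml (illustrative sketch)
- hosts: servers,db_servers
  tasks:
    - import_role:
        name: install_telegraf
      # Run the role's tasks only on hosts whose inventory sets install_telegraf to true;
      # hosts without the variable fall back to false and are skipped.
      when: install_telegraf | default(false) | bool
```

Because import_role is static, the condition is applied to each imported task individually rather than evaluated once for the whole role.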

5. Run Ansible playbook

The command to run differs depending on the activity to be performed. Since this project aims to enable the full installation of the Expresse monitoring solution for a customer, the command to run once every previous step is done is the following:

ansible-playbook -i environments/hosts playbooks/main.yml --limit <customer>

where <customer> must be a customer defined in the Ansible inventory as a customer specific group.


Annex D: Deployment automation tool analysis matrix

Ansible SaltStack Rudder CFEngine Product/Architecture License type GPL-3.0 Apache 2.0 GPLv3+ GPLv3 Launch date 2012 2011 2012 1993 Last version and date 2.7.7 5.0.1 (2018-10-19) 3.12.0(2018-06-28) Implementation language Python Python C C Yes, very active Community Yes Not very big Yes, very powerful (like live chat) Yes* (Tower Not available in UI Web/CLI/API edition) Community version Configuration files language YAML YAML YAML Own format RAM used by central server Up to 4GB >= 2 GB

max_space = number of Depends on Directives * 100MB* number of Disk used by central server playbooks number of Nodes * nodes retention duration in days * 400 kB Can work Yes, and must have Normal operation: agentless, also can connectivity with Requires Agent in every host Yes agentless have agents called root server minions somehow RAM used by agent 20MB 30 MB

Disk spaces required by agent - 500MB 256 MB


Ansible SaltStack Rudder CFEngine cf-execd (cron), cf- serverd (file server Root/Policy server and agent (PostgreSQL) + Type of nodes/elements rpm package communication), Agent (+ relay cf-monitord( server) configuration check), cf-agent Method to connect hosts SSH SSH tcp 5309 Yes: tcp 5308 Yes: 514 (tcp/udp), Hosts must connect server tcp 4505/4506 Yes: tcp 5308 tcp 5309, tcp 5310 Yes, up to 250 Yes, can even have Yes, by installing Scalability nodes (community Very scalable SW load balancing the agent ed) Own command Not a one-shot Requires a steep Ease of use Very easy interface deployment tool learning curve Yes. Many modules Yes, many modules Very extensible, Extensibility available for task available plugins for AWS automation Vagrant - VMWare Agents require Dependencies >=Python v2.6 - VirtualBox syslogd Very fast. Can work Provides platform with intermediate abstraction, with nodes that spread minions used Other comments configs to other locally to convert nodes not instructions to local accessible to the system first one Framework usage


Ansible SaltStack Rudder CFEngine Allows role-based controls No Yes No Deployment automation Application management Yes Yes Yes Yes Orchestration Yes Allows orchestration Yes Yes Yes Yes Configuration Management Continuous configuration Yes Yes Yes Yes monitoring/fixing

File blacklist/greylist/etc Yes Yes Configuration compliance Not available in Yes Yes* Yes reports community version User/group management Yes Yes Yes Yes File template Yes Yes No Notifications No Yes, via reporting No Automatic Inventory Yes Yes No Automatic Configuration Yes Yes ? rollback Configuration replication from Manual Yes ? specific node Services monitoring Yes Ensures a service/process is Yes Yes? Yes Yes running

Supports maintenance mode Yes Yes ? Resources monitoring Yes Allow HW resources Yes No monitoring


Ansible SaltStack Rudder CFEngine Allow Net monitoring No

Allow disk monitoring yes No

Ad-hoc task execution Yes Allow custom task execution Yes yes ? No SQL Database Management SQL Database Management Yes Yes ? Yes Type of environments Yes, -cloud and Virtualization + cloud Yes salt-virt
