ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA Y SISTEMAS DE TELECOMUNICACIÓN
FINAL DEGREE PROJECT (PROYECTO FIN DE GRADO)
TITLE: ENTERPRISE ARCHITECTURE AUTOMATION WITH OPENSOURCE TOOLS
AUTHOR: FERNANDO GARZÁS MARTÍN DE ALMAGRO
DEGREE: BACHELOR'S DEGREE IN TELEMATICS ENGINEERING
DIRECTOR: ALEJANDRO ALONSO FERNANDEZ
TUTOR: CARLOS RAMOS NESPEREIRA
DEPARTMENT: TELEMATICS AND ELECTRONICS ENGINEERING
Approval (VºBº)
Members of the Examination Board:
CHAIR: MARTINA ECKERT
TUTOR: CARLOS RAMOS NESPEREIRA
SECRETARY: FRANCISCO JAVIER RAMIREZ LEDESMA
Date of defense:
Grade:
The Secretary,
Enterprise architecture automation with opensource tools
Resumen (Summary)
Any company whose business is based on developing and selling software products must guarantee the correct operation and performance of the products it delivers to its customers. To do so, it is essential to monitor the tools properly, which implies an operational cost that can sometimes be high. This operational cost may be borne by the supplier or by the end customer.
In the case of the Expresse® product, from ASSIA Inc. ([2]), this monitoring is performed manually by the engineers of each customer's support team. There is no software system for efficiently monitoring the business processes executed by the software, or for properly monitoring the platforms on which the products are deployed. In this project, tools that minimize that operational cost have been chosen and, based on them, an automatically deployed monitoring system has been implemented.
As part of the work, the key performance indicators (KPIs) of the Expresse product that must be monitored have been determined. Likewise, the performance indicators of the platform servers that are relevant to its operation have been determined. In addition, the types of architecture to be monitored have been determined, resulting in two: single-customer and multi-customer architectures.
Regarding monitoring tools, two have been evaluated: Kibana ([1]) and Grafana ([8]). Grafana was finally selected because it is already used by other departments within the company, which simplifies monitoring processes by providing a global monitoring platform that covers other products besides Expresse.
Four deployment automation tools have also been analyzed: Ansible ([16]), SaltStack ([18]), Rudder ([11]) and CFEngine ([12]). Of these, Ansible has been selected for the simplicity of its architecture, its versatility and scalability, and its suitability for the types of architecture to be managed.
Abstract
Software companies that develop their own products need to ensure product quality and performance after the products are handed over to their customers. It is therefore crucial to monitor them effectively. The resources needed to do so are an operational expense, which may sometimes be high. These costs may be borne either by the software provider or by the customer.
In the case of Expresse®, an ASSIA Inc. product ([2]), this monitoring work is currently performed manually by the ASSIA support engineers of each account, with no tool to efficiently perform either business process monitoring or platform monitoring. In this project, tools that help reduce those operational expenses have been identified, and an automatically deployed monitoring system has been implemented.
As part of the work, KPIs have been identified both for business process monitoring and for platform monitoring. In parallel, two different architectures have been defined: single-customer and multi-customer.
Regarding monitoring, two tools have been evaluated: Kibana ([1]) and Grafana ([8]). The latter has been chosen because it is already used in other ASSIA processes to monitor another family of products (Cloudcheck). This simplifies global monitoring processes and reduces the learning curve for those systems.
Additionally, four deployment automation tools have been evaluated: Ansible ([16]), SaltStack ([18]), Rudder ([11]) and CFEngine ([12]). Of these, Ansible has been chosen for its simplicity of operation, its versatility and scalability, and its suitability for the different architectures to be set up.
Revision History
Part Number        Date         Comments
TR-XX-YYYY-DD-MM                Document created with first page meeting UPM requirements. Included abstract.
                                Included first version of Table of Contents, Acronym list, and introduction.
                                Included Introduction.
                                Included Technological Framework.
                   27/05/2019   Included changes from first review.
                   28/05/2019   Added KPI definitions.
                   10/06/2019   Improved KPI definitions. Included diagrams, compliancy and budget. Performed minor revisions.
                   17/06/2019   Reorganized documentation to add more details in Proposal Description chapter.
                   19/06/2019   Added architecture diagrams. Included more details on solutions used. Added implementation details. Added user guide and conclusions.
                   02/07/2019   Reviewed document format. Added chapter "Future Works".
Table of Contents
REVISION HISTORY
ABBREVIATIONS, ACRONYMS AND SYMBOLS
1 INTRODUCTION
2 PRECEDENTS/TECHNOLOGICAL FRAMEWORK
  2.1 EXPRESSE SUITE OUTLINE
    2.1.1 EXPRESSE ARCHITECTURE DESIGN
    2.1.2 EXPRESSE HARDWARE
  2.2 THIRD PARTY TOOLS
    2.2.1 IT DEPLOYMENT AUTOMATION SELECTION PROCESS
    2.2.2 MONITORING SOFTWARE REQUIREMENTS
  2.3 HIGHLIGHTS OF SELECTED TOOLS
3 DESIGN SPECIFICATION AND RESTRICTIONS
  3.1 GLOBAL SPECIFICATIONS
  3.2 GENERIC RESTRICTIONS
  3.3 BUSINESS PROCESS KPI SPECIFICATION
  3.4 GENERIC KPI SPECIFICATION
    3.4.1 GENERIC HARDWARE INDICATORS
    3.4.2 DATABASE (ORACLE) SPECIFIC INDICATORS
  3.5 DASHBOARD DEFINITION
4 PROPOSAL DESCRIPTION
  4.1 IT DEPLOYMENT AUTOMATION
    4.1.1 SELECTION PROCESS AND FINAL DECISION
    4.1.2 ANSIBLE TO AUTOMATE DEPLOYMENTS
  4.2 MONITORING FRAMEWORK
  4.3 ARCHITECTURES PROPOSAL
  4.4 ANSIBLE INSTALLATION
  4.5 ANSIBLE ROLES SPECIFICATION
  4.6 ANSIBLE INVENTORIES
  4.7 ANSIBLE PLAYBOOKS
5 RESULTS
  5.1 INSTALL_INFLUX ROLE TESTING
  5.2 INSTALL_TELEGRAF ROLE TESTING IN NORMAL SERVER
  5.3 INSTALL_TELEGRAF ROLE TESTING IN DATABASE SERVER
  5.4 INSTALL_GRAFANA ROLE TESTING
6 BUDGET
7 CONCLUSIONS
8 FUTURE WORKS
REFERENCES
ANNEX A: DIAGRAMS
ANNEX B: OPEN SOURCE LICENSES OF ANALYZED SOFTWARE
ANNEX C: USER GUIDE
ANNEX D: DEPLOYMENT AUTOMATION TOOL ANALYSIS MATRIX
Table of Figures
FIGURE 1. SINGLE-CUSTOMER SCENARIO ARCHITECTURE
FIGURE 2. MULTI-CUSTOMER SCENARIO ARCHITECTURE
FIGURE 3. INSTALL_INFLUX ROLE PRE-EXECUTION CHECKS
FIGURE 4. INSTALL_INFLUX ROLE EXECUTION EVIDENCE
FIGURE 5. INSTALL_INFLUX ROLE POST-EXECUTION CHECK
FIGURE 6. INSTALL_TELEGRAF ROLE PRE-EXECUTION TESTS
FIGURE 7. INSTALL_TELEGRAF ROLE EXECUTION EVIDENCE
FIGURE 8. INSTALL_TELEGRAF ROLE POST-EXECUTION CHECK (1/3)
FIGURE 9. INSTALL_TELEGRAF ROLE POST-EXECUTION CHECK (2/3)
FIGURE 10. INSTALL_TELEGRAF ROLE POST-EXECUTION CHECK (3/3)
FIGURE 11. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. PRE-EXECUTION CHECK
FIGURE 12. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. EXECUTION EVIDENCE (1/4)
FIGURE 13. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. EXECUTION EVIDENCE (2/4)
FIGURE 14. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. EXECUTION EVIDENCE (3/4)
FIGURE 15. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. EXECUTION EVIDENCE (4/4)
FIGURE 16. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. POST-EXECUTION CHECKS (1/2)
FIGURE 17. INSTALL_TELEGRAF ROLE FOR DATABASE SERVER. POST-EXECUTION CHECKS (2/2)
FIGURE 18. INSTALL_GRAFANA ROLE PRE-EXECUTION CHECKS
FIGURE 19. INSTALL_GRAFANA ROLE EXECUTION EVIDENCE
FIGURE 20. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (1/8)
FIGURE 21. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (2/8)
FIGURE 22. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (3/8)
FIGURE 23. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (4/8)
FIGURE 24. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (5/8)
FIGURE 25. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (6/8)
FIGURE 26. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (7/8)
FIGURE 27. INSTALL_GRAFANA ROLE POST-EXECUTION CHECKS (8/8)
FIGURE 28. GENERIC EXPRESSE INTEGRATION ARCHITECTURE
FIGURE 29. REGION REPRESENTATION FOR BIG DEPLOYMENTS
FIGURE 30. FULL-TIER APPLICATION SERVER
FIGURE 31. ROLE-SPECIFIC APPLICATION SERVERS
FIGURE 32. EXPRESSE SIMPLE DB ARCHITECTURE
FIGURE 33. EXPRESSE COMPLEX DB ARCHITECTURE
FIGURE 34. EXPRESSE DB SERVERS SYNCHRONIZED WITH ORACLE DATA GUARD
FIGURE 35. MOST COMPLEX EXPRESSE ARCHITECTURE
Abbreviations, Acronyms and Symbols
AAA        Authentication-Authorization-Auditing
AGPL       GNU Affero General Public License
AI         Artificial Intelligence
API        Application Programming Interface
ARPU       Average Revenue per User
ASSIA Inc  Adaptive Spectrum and Signal Alignment (Incorporated)
BSS        Business Support System
CLI        Command-Line Interface
CPE        Customer Premises Equipment
CPU        Central Processing Unit
DBA        Database Administrator
dbloader   Database Loader (Expresse module)
DcPc       Data Collection and Profile Change (Expresse module)
DLM        Dynamic Line Management
DS         Downstream
DSL        Digital Subscriber Line
DSLAM      Digital Subscriber Line Access Multiplexer
FCGT       Full Garbage Collection Time
FQDN       Fully Qualified Domain Name
GNU        GNU's Not Unix
GPL        GNU General Public License
GUI        Graphical User Interface
HA         High Availability
HTTP       Hypertext Transfer Protocol
IOPS       Input/Output Operations per Second
IP         Intellectual Property
IP         Internet Protocol
ISP        Internet Service Provider
JMX        Java Management eXtensions
JVM        Java Virtual Machine
KPI        Key Performance Indicator
LATAM      Latin America
LGPL       GNU Lesser General Public License
MABR       Maximum Achievable Bit Rate
MIT        Massachusetts Institute of Technology
MSAN       Multiservice Access Node
NAPI       Northbound API
NMS        Network Management System
NOC        Network Operations Center
ODN        Optical Distribution Network
OLT        Optical Line Terminal
OPEX       Operational Expense
OSP        Outside Plant
OSPM       Outside Plant Management
OSS        Operations Support System
PE         Performance Evaluator (Expresse module)
PO         Profile Optimizer (Expresse module)
PON        Passive Optical Network
POP_O      Pop operational data
POP_P      Pop performance data
QoS        Quality of Service
RAC        Real Application Cluster (Oracle product)
RAM        Random Access Memory
RHEL       Red Hat Enterprise Linux
ROI        Return on Investment
S2S VPN    Site-to-Site Virtual Private Network
SAN        Storage Area Network
SCAN       Single Client Access Name (Oracle feature)
SNMP       Simple Network Management Protocol
SOAP       Simple Object Access Protocol
SQL        Structured Query Language
SR         Service Recommender
SSH        Secure Shell
TICK       Telegraf, Influx, Chronograf and Kapacitor
TPS        Transactions per Second
TR         Technical Report
TV         Television
UI         User Interface
UIWS       User-Interface Web-Service
URL        Uniform Resource Locator
US         Upstream
VM         Virtual Machine
1 Introduction
Adaptive Spectrum and Signal Alignment, Incorporated (ASSIA Inc.) is an American company based in Redwood City (San Francisco Bay Area, California, USA), founded in 2003 by Dr. John Cioffi, professor emeritus at Stanford University, whose work has helped to universalize broadband technologies. The company's vision is to apply Artificial Intelligence (AI) methods to improve global internet connectivity by orders of magnitude. Consequently, its mission is to build products that make Internet connections run faster and more reliably.
ASSIA Inc.'s activity is currently focused on Operations Support System (OSS) development. Since its foundation, the company has made a strong research effort, developing innovative products and technologies that have resulted in up to two hundred (200) patents related to broadband technologies, most of them registered worldwide. Additionally, successful products have been developed based on that Intellectual Property (IP). One of those products is Expresse® Suite, an OSS to dynamically manage DSL (Digital Subscriber Line) and PON (Passive Optical Network) access networks.
Expresse Suite and all other ASSIA products target the telecommunications sector, so its main customers are telecommunications operators (telcos). ASSIA Inc. currently offers its products and services to 35 different customers worldwide, managing more than 100 million lines, while trying to enter new markets and win new customers.
To summarize, ASSIA has many platforms to maintain and monitor (both for its own use and for the products it sells), and this activity causes significant operational costs.
The purpose of this project is to find a way to minimize the above-mentioned costs, so that ASSIA Inc. can use its resources more efficiently and improve customer care by applying standardized monitoring processes that help the platform work under optimal conditions and deliver the best performance.
Plenty of tools exist on the market for that purpose. This document includes market research so that, considering project requirements and restrictions, the right tools can be chosen. Two phases of this project are of special relevance: the KPI definition phase, in which specific indicators are identified and monitoring processes are established, and the architecture phase, in which different approaches are designed to meet both customer and ASSIA needs. Finally, a process to automatically deploy monitoring environments is implemented.
The next chapter introduces Expresse suite and describes its relevant aspects. It also presents the relevant concepts considered, contains the market research performed, and focuses on specific products.
2 Precedents/Technological Framework
This chapter establishes the technological concepts and knowledge needed throughout the project to fully understand its scope and the chosen approach.
2.1 Expresse suite outline
Expresse suite is a Dynamic Line Management (DLM) software system that collects data from fixed-access network equipment, mainly Digital Subscriber Line Access Multiplexers (DSLAMs) and Optical Line Terminals (OLTs), and also from Multiservice Access Nodes (MSANs). With the collected data, Expresse can:
Determine the Quality of Service (QoS) at the physical level of each managed line.
Perform diagnostics, depending on the technology, on each line or its outside plant elements, detecting, locating and specifying impairments that may be impacting customer experience.
Perform automatic operations on each customer's line to improve its QoS and rate (in the case of DSL technologies).
Recommend that the ISP send a field technician to the customer premises to perform fixes, when needed.
Recommend a service upgrade or service downgrade, based on expert algorithms.
Perform real-time operations.
Those points summarize the core functionality but, in general, Expresse suite can be considered a versatile toolbox to perform complex network operations easily and efficiently, orchestrating different activities on equipment of multiple technologies and vendors. This produces a global positive impact on QoS and line rate, resulting in increased customer satisfaction.
To enumerate some of the benefits:
Strong reduction in access network maintenance costs
Significant reduction in customer complaints
Reduction in customer churn due to bad experience
Increase in Average Revenue Per User (ARPU)
2.1.1 Expresse architecture design
The following sections describe some relevant concepts considered during architecture design.
a) User types
ISP operation needs consolidated procedures to implement the services sold to its customers. Examples are service activation, service monitoring, inventory registration, periodic reporting, customer care, marketing campaigns, etc. These activities are normally performed by different teams from different professional disciplines, and most of them can find in Expresse suite an excellent tool to improve overall operational efficiency.
Best practices when designing Expresse architectures recommend analyzing system users, by looking at the following aspects:
activities they will need to do
size of their teams
frequency of these operations
distribution of these operations
Also, it is recommended to identify the type of users. In general, some types or groups of users can be defined in any ISP:
Front Office - Level 1 of support: This is the group of users belonging to a call center, where customer calls with complaints are first attended. This group may have from 500 to many thousands of users. Their technical knowledge may be basic and their professional qualification low, which normally goes with a high staff turnover rate. Normally all their tasks are strictly guided, and many operations may be restricted for them. Due to team size, the computational load to support this group may be high, and system high availability may be required, since the processes performed by this group are normally critical due to the volume of operations.
Backoffice - Level 2 or Level 3 of support: These users are technicians with deep technological expertise, normally engineers. The team size is usually small (between 10 and 50 people).
Maintenance groups: This group specializes in preventive work with high impact on QoS. They are field technicians with hands-on experience fixing faults in the outside plant, or even at customer premises. Its size can range from 50 to 300 users.
Marketing groups: This group specializes in marketing campaigns to offer existing customers a service upgrade, to find new customers, to offer discounts if the service is not meeting the customer contract specifications, etc. This group can range from 20 to 100 users.
Other Operations or Business Support Systems (OSSs/BSSs): Expresse may be integrated via the Northbound API (NAPI) with other support systems, which may have different purposes. They range from robots performing automatic nightly tasks, to call centers from different regions, to integrations with local regulators performing service quality investigations. A generic integration architecture diagram can be found in Figure 28. Generic Expresse integration architecture.
b) Network types
Another important factor to consider when designing an Expresse architecture is the size of the managed network, in terms of the number of lines. Two types of network can be established:
Small network: up to 3 million lines to be managed.
Big network: more than 3 million lines
Expresse suite performs most of its operations over the whole network on a daily basis. This requires different computational resources depending on network size, as well as a different number of servers, with more specialized roles when managing bigger networks. An example of a big network can be found in Figure 29. Region representation for big deployments.
c) Customer management networks
Expresse suite can impact customer management networks. It collects information so intensively that it requires considerable bandwidth to gather all the data and store it in the database. This activity can eventually saturate the management network, resulting in denial of service for other OSSs or BSSs and for network operation overall. Therefore, the management network needs to be analyzed prior to deploying this software.
Management networks differ from one ISP to another. Factors that lead to these differences include the geographic spread of network elements, site accessibility, customer dispersion, etc. For example, LATAM countries can cover a vast area, with high customer dispersion all over the country. DSLAMs may require different types of connectivity to the core network, such as radio links, satellite links, or other links that may have low bandwidth. Those cases should be considered so that Expresse traffic shaping is configured accordingly.
d) Expresse server architecture
Expresse is normally deployed on general-purpose servers with CentOS/RHEL operating systems. It is composed of several specialized software modules. Depending on how these modules are deployed, several server roles can be defined:
Core roles:
o Application server: Can host all Expresse software modules, depending on the architecture or the usage.
o Database server: Oracle database, used as a data warehouse (collections, diagnostics, recommendations, statistics, configuration, Authentication-Authorization-Auditing (AAA) information, etc.). It is normally attached to a storage server, which can be located next to the server or can be a Storage Area Network (SAN).
Optional roles: Apply only to application servers
o Web server: Only the Tomcat application server is deployed, normally to separate this module, which acts as the application frontend, from its backend, where the other software modules run. This is normally done when customer security policies require it.
o Remote DcPc (Data Collection and Profile Change) server: This server hosts the DcPc software module, which is the gateway for Expresse operations performed on the equipment, regardless of its nature. It is normally considered the southbound interface.
o Storage server: The data layer might be separated from the data access layer. In this case, a set of servers accesses a centralized data repository.
o Stand-by database storage server: Storage server in the stand-by data center.
A simple architecture may contain two servers: an application server and a database server. This basic deployment is valid for small networks. For big networks, specific DcPc servers may be deployed, defining collection regions, to segment network communication activity; additionally, more application and database servers may be required to support the related computational requirements. See Figure 30. Full-tier application server and Figure 31. Role-specific application servers for a brief description of the most common architecture scenarios for application servers. See Figure 32. Expresse simple DB architecture and Figure 33. Expresse complex DB architecture for the different roles available for DB servers. Figure 34. Expresse DB servers synchronized with Oracle Data Guard gives an overview of the database redundancy scenario with data replication to a remote storage server. When Expresse becomes systemic, an even more complex architecture is normally required, as shown in Figure 35. Most complex Expresse architecture.
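The relationship between network size and server roles described in this section can be illustrated with a minimal Python sketch. It is not part of the Expresse product or of the project deliverables: the function names, the role labels and the host names (app01, db01, dcpc-<region>) are hypothetical, and only the 3 million line boundary and the two-server baseline come from the text. The additional application and database servers that big networks may require are not modelled.

```python
from __future__ import annotations

import json

# Boundary between "small" and "big" networks, as stated in section 2.1.1 b).
BIG_NETWORK_THRESHOLD = 3_000_000


def classify_network(managed_lines: int) -> str:
    """Return 'small' for up to 3 million managed lines, 'big' otherwise."""
    return "small" if managed_lines <= BIG_NETWORK_THRESHOLD else "big"


def baseline_roles(
    managed_lines: int, regions: list[str] | None = None
) -> dict[str, list[str]]:
    """Map each server role to a list of host names (hypothetical placeholders)."""
    # Simple architecture: one application server and one database server.
    roles = {"application": ["app01"], "database": ["db01"]}
    if classify_network(managed_lines) == "big":
        # Big networks: one remote DcPc server per collection region (assumption).
        roles["dcpc"] = [f"dcpc-{region}" for region in (regions or ["region1"])]
    return roles


if __name__ == "__main__":
    print(json.dumps(baseline_roles(5_000_000, ["north", "south"]), indent=2))
```

A grouping of this kind (role name mapped to hosts) is also the shape in which server roles are usually handed to a deployment automation tool, which is why it is shown as a plain dictionary.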
2.1.2 Expresse hardware
As mentioned in previous sections, Expresse suite can be deployed on general-purpose servers. Those servers can be physical machines or virtual machines (VMs) created with virtualization frameworks like VMware or OpenStack. Expresse can operate in both cases but, to guarantee platform performance, servers must meet the Expresse requirements for the following resources:
Disk space
Disk input/output operations per second (IOPS), relevant for DB servers
RAM
CPU processing capacity
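As an illustration only, the four resource dimensions listed above could be checked programmatically as in the following Python sketch. The class name, the function name and every numeric value are hypothetical placeholders; the actual Expresse requirements are not stated in this section.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Resources of a server, or the minimum required for a role."""
    disk_gb: float
    iops: int
    ram_gb: float
    cpu_cores: int


def unmet_requirements(
    actual: ServerSpec, required: ServerSpec, check_iops: bool = True
) -> list[str]:
    """Return the names of the resources where `actual` is below `required`."""
    names = ["disk_gb", "ram_gb", "cpu_cores"]
    if check_iops:
        # IOPS is mainly relevant for DB servers, so the check is optional.
        names.append("iops")
    return [n for n in names if getattr(actual, n) < getattr(required, n)]


if __name__ == "__main__":
    # Placeholder figures, not real Expresse requirements.
    required_db = ServerSpec(disk_gb=500, iops=5000, ram_gb=64, cpu_cores=16)
    candidate = ServerSpec(disk_gb=1000, iops=1000, ram_gb=32, cpu_cores=16)
    print(unmet_requirements(candidate, required_db))
```

For an application server, the same function can be called with check_iops=False so that only disk space, RAM and CPU are compared.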
2.2 Third party tools
If ASSIA Inc. wants to reduce OPEX, the best option is to find tools that help set up a framework able to monitor all relevant information and to present that information in an executive way, so that a centralized deployment operation center can benefit from it. This framework must be as generic as possible.
Two kinds of tools are needed to achieve this: monitoring tools and IT infrastructure automation tools. With those tools, the resources needed to monitor Expresse environments, from basic (2 servers) to complex (20 servers or more), should be drastically reduced.
2.2.1 IT deployment automation selection process
In this section, we define IT deployment automation requirements, valued characteristics and other aspects to be analyzed:
Product/Architecture: The points below analyze basic product information that is considered relevant and important to meet:
o License type: To minimize operational costs, software license costs must be minimal or non-existent. Products with open-source licenses are therefore prioritized. A light license review is included in Annex B: Open Source licenses of analyzed software.
o Launch date: Open-source projects are innovative and help to solve problems from perspectives other than those used previously. Because of that, projects launched within the last ten years are considered suitable for the project.
o Last version and date: Open-source software is great, but an important risk has been identified: it is not unusual to find abandoned open-source projects. When a software project is closed and there is no community behind it, the product does not evolve and no bug fixing is done, which can limit the project. It is therefore very important to identify the products that are still evolving.
o Community: One of the most valuable aspects of open-source software is the activity and size of its community. The bigger and more active it is, the better, so an active community is a requirement.
o Implementation language: This is not a requirement, but a relevant aspect to consider. When using an open-source product, it is useful to know the implementation language in case the project has a specific need that the product has not yet implemented. This also lets ASSIA Inc. positively value new candidates able to develop in that programming language.
o Configuration files language: Standard formats are prioritized, so that automated configuration and verification procedures can be incorporated.
o User interface (UI): Tools with a graphical user interface (GUI) are prioritized over those with a command-line interface (CLI), since this helps to reduce the learning curve.
o RAM used by central server: This specification must be analyzed to determine whether a dedicated server is needed for this task. Software with low RAM usage is prioritized, so that it can be deployed on available hardware that is also used for other purposes.
o RAM used by agent: Memory resources needed by software agents, if any, so that the impact of the deployment software on remote hosts can be estimated.
o Disk used by central server: This specification must be analyzed to determine whether a dedicated server is needed for this task. Software with low disk usage is prioritized, so that it can be deployed on available hardware that is also used for other purposes.
o Disk space required by agent: Disk resources needed by software agents, if any, so that the impact of the deployment software on remote hosts can be estimated.
o Requires software agent in every host: Automation tools must be easy to deploy. It is relevant to know whether every server in an Expresse platform must have a specific agent installed as part of the deployment automation software architecture.
o Type of nodes/elements: Whether the deployment software requires nodes with different roles.
o Host interconnection method: This point must be analyzed to determine the impact on network flow policies. Products with the lowest impact are prioritized.
o Scalability: Whether the software supports Expresse-like platforms, from simple to complex. Scalable products are prioritized.
o Ease of use / learning curve: The time it takes to master the deployment software directly impacts ASSIA Inc. operational costs. Products that are easy to learn are prioritized.
o Extensibility: Complex deployments may require many different software capabilities. Products with software extensions, plugins, add-ons, or any other means of feature expansion are prioritized.
o Dependencies: As said above, the deployment automation central server software may be installed on multi-purpose servers. To reduce ASSIA Inc. operational costs, products with fewer software dependencies are prioritized.
Framework usage
o Allows role-based controls: Impact of deployment automation tools can be high, since
they can perform automated changes on many frameworks. Those products having the
possibility to define deployment user roles, to set specific operations that can be
performed by any of them, would be prioritized.
Deployment automation
o Application management: Those products being able to start/stop applications that
have been deployed would be prioritized.
Orchestration
o Allows orchestration: Complex deployments may require orchestration between
different deployment processes, so those products allowing orchestration would be
prioritized.
o Continuous configuration monitoring/fixing: Deployment configuration compliance is
important in phases other than initial product installation. So those products being
able to check whether expected configuration fits current configuration in already
deployed systems would be prioritized, as well as those products being able to fix
found inconsistencies.
o Configuration compliance reports: As a complement to the previous specification, it
would be valuable for the product to generate reports of configuration compliance, so
that ASSIA Inc. can determine the level of compliance their products have compared
to the expected configuration set.
o File blacklist / whitelist / etc.: In some operational situations, it may be necessary to
blacklist some specific configurations or some specific nodes. Products having this
capability would be prioritized.
o User/group management: Whether deployment automation products can handle
operating system user/group management, since it is part of product deployment
procedure.
o File template: For some specific processes, it may be required to work with specific
templates that can later be customized for specific customers. Products supporting this
would be prioritized.
o Notifications: Whether the deployment software can report the issues it finds through a
remote notification mechanism, that is, an automated offline notification sent by email,
SNMP trap, etc.
o Automatic Inventory: Some products can discover automatically those nodes that may
be included under deployment automation architecture. This would help to reduce
configuration requirements and improve scalability management.
o Automatic Configuration rollback: Configuration templates may be wrong. Allowing
an easy transition to a previous configuration version would be a requirement.
o Configuration replication from specific node: Whether deployment automation
software supports automatic configuration by telling new nodes to replicate the
configuration of existing ones.
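The continuous configuration monitoring criterion listed above (checking whether the expected configuration fits the current one) can be illustrated with a minimal sketch. It assumes that a configuration can be flattened into a dictionary of settings; the function name and the setting names are hypothetical and do not come from any of the evaluated tools.

```python
def find_drift(expected, current):
    """Return {setting: (expected_value, current_value)} for every mismatch.

    Only settings present in the expected configuration are checked; a setting
    missing from the current configuration is reported with None as its value.
    """
    return {
        key: (value, current.get(key))
        for key, value in expected.items()
        if current.get(key) != value
    }


# Hypothetical settings, for illustration only.
expected_config = {"ntp_server": "10.0.0.1", "max_sessions": 200}
current_config = {"ntp_server": "10.0.0.1", "max_sessions": 150}
drift = find_drift(expected_config, current_config)
```

A tool that also fixes inconsistencies would re-apply the expected value for each reported setting; a compliance report could be built from the same dictionary.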
Service monitoring
o Ensures a service/process is running: Applications or services not being part of
configured deployment procedures may also need to be managed and monitored.
o Supports maintenance mode: For maintenance windows, sometimes it is needed to
stop automated tools operation so that manual changes can be done.
Ad-hoc task execution
o Allow custom task execution: in many cases, Expresse deployment may require the
execution of non-standard processes, like a sequence of commands to get info and
make decisions depending on its results.
SQL Database Management
o SQL Database Management: Expresse deployment procedures also need to execute
SQL scripts. Also, they would need to start/stop databases, or even to automate some
activities related to Oracle database management.
Type of hardware supported: Whether deployment automation tools support working in
VMs, physical hardware, cloud environments, etc.
2.2.2 Monitoring software requirements
In this section, a number of requirements related to monitoring needs are provided, so that a market analysis can be performed.
Product/Architecture: The points below cover the basic product specifications
considered relevant and important to meet:
o License type: To minimize operational costs, software license costs must be minimal
or non-existent. This way, those products having open-source licenses would be
prioritized.
o Launch date: Open-source projects are innovative, so that they help to resolve
problems from perspectives other than those used previously. Because of that,
projects launched within the last ten years would be suitable for the project.
o Last version and date: Open-source software is great, but an important risk has been
identified. It is not unusual to find open-source projects which are abandoned. When a
software project is closed, and there is no community behind it, the product does not
evolve, and no bug-fixing is done, so it can represent a limitation to project needs.
Having said that, it is very important to identify those products which are still
evolving.
o Community: One of the most valuable aspects of open-source software is the activity
and size of its community. The bigger and more active it is, the better; to benefit
from this, an active community would be a requirement.
o Implementation language: This would not be a requirement, but a relevant aspect to
consider. When using an open-source licensed product, it is relevant to know
implementation language in case there is any specific need for the product which has
not been already implemented. This helps ASSIA Inc. to positively value new
candidates able to develop in that programming language.
o Configuration files language: An important aspect to analyze, so that standard formats
would be prioritized. This way, automated configuration and verification procedures
can be incorporated.
o User interface (UI): This would be an aspect to analyze, so that those tools having a
graphical user interface (GUI) would be prioritized against those having only a
command-line interface (CLI). This would help to reduce the respective learning curve.
o RAM used by central server: To determine whether a specific server is needed for this
task, this specification must be analyzed. Software having low RAM usage would be
prioritized, so that they can be deployed on available hardware, which is also used for
other purposes.
o RAM used by agent: Memory resources needed by software agents, if any, so that
the impact of the software on remote hosts can be estimated.
o Disk used by central server: To determine whether a specific server is needed for this
task, this specification must be analyzed. Software having low disk usage would be
prioritized, so that they can be deployed on available hardware, which is also used for
other purposes.
o Disk space required by agent: Disk resources needed by software agents, if any,
so that the impact of the software on remote hosts can be estimated.
o Requires software agent in every host: Automation tools must be easily deployed. A
relevant characteristic is to know if every single server comprising any Expresse
product platform must have a specific agent installed, as part of the deployment
automation software architecture.
o Type of nodes/elements: Whether the software requires nodes with
different roles.
o Host interconnection method: This point must be analyzed so that the impact
on network flow policies can be determined. Products having the lowest impact
would be prioritized.
o Scalability: Whether the software supports Expresse-like platforms, from simple to
complex. Scalable products would be prioritized.
o Ease of use / learning curve: The time it takes to master the use of the software
impacts directly on ASSIA Inc. operational costs. Products that are easy to learn
would be prioritized.
o Extensibility: Complex deployments may require many different software
capabilities. Those products having software extensions, plugins, addons, or any other
kind of feature expansion possibilities would be prioritized.
o Dependencies: As said above, central server software may be
installed on multi-purpose servers. To reduce ASSIA Inc. operational costs, those
products having fewer software dependencies would be prioritized.
Data visualization capabilities
o Visualization software type: Whether data visualization software is a desktop
application, or a web application. This may be relevant to analyze platform access.
o Visual/interactive Dashboards: The monitoring software may have customizable
dashboards and interactive Dashboards so that any information being monitored can
be added, and additional information can be obtained from the Dashboard when
interacting with it.
o Dashboard templates: By using them, all monitoring platforms will be homogeneous,
and any fix or improvements on any of them would be easily propagated to any other
platform.
o Display format: The information must be displayed in plots, widgets, or tables,
depending on the data.
o Historical information display: The tool must be able to show historical information
for the last year.
o Data collection granularity: Monitoring data may have different collection schedules,
depending on data source.
o Zoom capability: All plots representing data may have zoom capabilities, so that if the
user wants to focus on a specific period of time, the plot can be zoomed to that period.
o Supports maintenance mode: For maintenance windows, it is sometimes necessary to
stop the monitoring system so that manual changes can be made without
contaminating the monitoring data displayed.
o Thresholds setting: The system may support threshold representation for plots, so that
when needed, specific colors may be applied accordingly.
Services monitoring
o Processes uptime: Monitoring system may report how much time a monitored service
is up and running.
o Memory usage: It may be relevant to monitor memory usage, so that patterns can be
analyzed, and abnormal situations can be detected. Also, full garbage collection time
(FGCT) may be also monitored, since it is an indicator of performance issues.
Data source
o SQL: Monitoring procedures may also need to extract data from the database by using
SQL scripts. Also, they would need to start/stop databases, or even to automate some
activities related to Oracle database management.
Monitoring data storage
o Data storage may be relevant, since it can be external to Expresse, or internal. Each
of these approaches has advantages and disadvantages that should be analyzed.
Type of hardware supported: Whether monitoring tools support working in VMs, physical
hardware, cloud environments, etc.
2.3 Highlights of selected tools
Considering the criteria described in previous sections, a deep analysis of different tools was made (see chapter “Proposal description” for details). Among the different characteristics evaluated across the candidates, extra weight has been given to the following:
simplicity of architecture: the simpler the tool is, the better. Simpler architectures have
lower operating costs
deployment complexity: For deployment automation tool, agent-less architectures have been
better received. For IT monitoring tools, having the possibility of customizing the monitoring
framework in deployment time has also been considered positive.
community size and activity: this characteristic is crucial, since an open source tool depends
completely on the activity of its community.
Considering these points and those from the full analysis, Ansible has been chosen as the deployment automation tool and Grafana as the IT monitoring tool.
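The idea of giving extra weight to some characteristics can be illustrated with a minimal weighted-scoring sketch. The criteria names follow the list above, but the weights, the candidate names and the scores are invented for illustration only; they are not the values used in the actual analysis.

```python
# Hypothetical weights: the three highlighted criteria count double.
WEIGHTS = {
    "architecture_simplicity": 2.0,
    "deployment_complexity": 2.0,
    "community": 2.0,
    "extensibility": 1.0,
}


def weighted_score(scores, weights=WEIGHTS):
    """Return the weighted average of per-criterion scores (0 to 5 scale)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores.get(name, 0.0) for name in weights) / total_weight


def rank(candidates):
    """Sort candidate names from best to worst weighted score."""
    return sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)


# Made-up candidates and scores, not real tool evaluations.
candidates = {
    "tool_a": {"architecture_simplicity": 5, "deployment_complexity": 4,
               "community": 5, "extensibility": 3},
    "tool_b": {"architecture_simplicity": 3, "deployment_complexity": 3,
               "community": 4, "extensibility": 5},
}
ranking = rank(candidates)
```

With these invented numbers, a candidate that is strong on the highlighted criteria ranks first even when another candidate scores better on a lower-weight criterion.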
3 Design specification and restrictions
This section depicts specification and restrictions of the final approach.
3.1 Global specifications
Monitoring system must run on CentOS/RHEL systems.
Monitoring system must operate automatically. Data collection and plots data updating
should not require any human intervention.
The solution must enable automatic deployment and configuration of monitoring tools of
Expresse products. Templates must be used for customization.
Solution must allow basic Key Performance Indicators (KPI) monitoring regarding hardware
performance (CPU, RAM, disk usage, disk IOPS, network usage, etc), as well as Expresse-
specific KPIs.
System must be able to gather data from different ASSIA customers. It must support single
customer monitoring (exclusive mode), as well as multi-customer monitoring (multi-tenant
mode).
Monitoring information must be homogeneous between customers and may be represented in
executive dashboards. Customer-specific monitoring information will be placed in other
dashboards.
3.2 Generic restrictions
System restrictions are listed below:
The system must have direct connectivity with the environments to be monitored, and with all
their servers.
Monitoring software to be deployed in any server belonging to an Expresse deployment may
not use more than 10% of RAM memory, nor more than 10% of available CPU.
Monitoring and automation software may not have restrictive software licenses, or any
license having any cost.
The project will be initiated, planned, executed, monitored and controlled exclusively by
ASSIA Inc. employees.
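The 10% RAM and CPU restriction above can be expressed as a simple check. The following is a minimal sketch: the function name and its parameters are hypothetical, and how the agent's memory and CPU consumption are actually measured is left out.

```python
MAX_SHARE = 0.10  # 10 % ceiling taken from the restriction above


def within_budget(agent_rss_bytes, total_ram_bytes, agent_cpu_share):
    """Return True when both the RAM and the CPU shares stay at or below 10 %.

    agent_cpu_share is the fraction (0.0 to 1.0) of available CPU used by the
    monitoring software on the server.
    """
    ram_share = agent_rss_bytes / total_ram_bytes
    return ram_share <= MAX_SHARE and agent_cpu_share <= MAX_SHARE


# Example: 512 MiB used on an 8 GiB server, 3 % CPU.
ok = within_budget(512 * 2**20, 8 * 2**30, 0.03)
```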
3.3 Business process KPI specification
There are four different groups of KPIs that should be monitored:
Expresse Network Statistics: DSL/PON network-wide analysis based on the Expresse network
statistics module. These statistics are already available in the Expresse GUI, but for the
convenience of the operation and monitoring teams, they will have their own place in this
monitoring framework. Included statistics will be, at least, the following:
o Stability: QoS as calculated by the Expresse platform. It must be separated into two plots,
one for DSL and another for PON technologies.
o DS_SYNCH_RATE: Average downstream synch rate (only for DSL).
o US_MABR_ESTIMATED: Average upstream maximum achievable bit rate (only
for DSL).
o DS_MABR_ESTIMATED: Average downstream maximum achievable bit rate (only
for DSL).
o US_SYNCH_RATE: Average upstream synch rate (only for DSL).
o Distribution of lines per Technology: Count of provisioned lines per technology (DSL
or PON)
o Distribution of DSL lines per DSLAM type: Count of provisioned DSL lines per
DSLAM model.
o Distribution of PON lines per chassis model: Count of provisioned PON lines per
chassis model.
o Distribution of DSLAMs per model - DSL: Count of provisioned DSLAMs per model.
o Distribution of PON chassis equipment per model: Count of provisioned PON chassis
per model.
Expresse modules monitoring: Indicators to monitor module performance, as indicated in
[3], attending to its specific operation characteristics.
o Generic: Include common metrics and JVM monitoring: This group would apply to
all Expresse modules. Since they are written in Java, some aspects of the JVM can be
monitored.
o Provisioning application
. Data source type: SQL
. Execution duration per schedule: total duration of each execution in the last
day
. Historical execution duration: time-series statistics with duration of
processes
. Provisioning errors per day/type: If any, the type and amount of
provisioning errors during current day
. Lines Added vs deleted vs updated per Technology / Per region: For
successful provisioned lines, percentage per type of update for each
defined region.
. DSLAMs Added vs deleted vs updated per Technology / Per region: For
successful provisioned DSLAMs, percentage per type of update for each
defined region.
. External plant lines Added vs deleted vs updated per Technology: For
successful provisioned records, percentage per type of update and
technology.
. Lines with disabled PE/PO (%): Percentage of lines that have diagnostics /
optimization capabilities disabled, out of the total provisioned lines
o PE
. Data source type: SQL
Duration per main processes (PE DSL, LINE SUMMARY, PO
Trigger, SR, PE PON, PON LINE_SUMMARY, PON SR)
DSL Instability: Percentage of DSL lines for which QoS is degraded
(stability indicator in “UNSTABLE” or “VERY UNSTABLE” level).
DSL Instability 1 month: Percentage of DSL lines that experienced
instability more than 30% of the time during the last 30 days.
PON Instability: Percentage of PON lines for which QoS is degraded
(Link quality indicator in “SEVERELY DEGRADED” or “SERVICE
INTERRUPTED” level).
Results distribution analysis:
o DSL
. % of DSL lines with diagnostics calculated running day,
per source type (POP_O/PerTone)
. % lines with PE per line card type
. % lines with PE per dslam type
. % lines without PE per CPE_type: statistics per
DSLAM type and line card type on CPEs not having
diagnostics, when more than 30 % of cases are
affected, if existing for more than 100 customers.
. Dispatch recommendation analysis
% DSL lines with dispatch recommended per
type of recommendation
. PO Trigger
% of lines that were evaluated by PO trigger,
and action was needed
Number of lines that were sent to PO, by reason
(code with description)
Number of lines that failed when being sent to
PO, by reason (with description)
. SR
% of active lines with SR results: The number of
provisioned lines having service
recommendations.
% of SR faults by code: SR executions that
ended up in error.
o PON
. % PON lines with PE per line_card_type
. % PON lines with PE per dslam_type
. % PON lines with PE per DSLAM type, line card type
and CPE_type: Useful to detect interoperability issues
. % PON lines with PE per CPE_type
. Dispatch recommendation analysis
% PON lines with dispatch recommended per
type of recommendation
% lines with dispatch recommended per location
(different than drop)
o PO. Extended information can be found in [5]
. Data source type: SQL
Process duration
PO process indicators:
o Lines exiting PO: The number of lines for which optimization
process has ended today.
o Lines entering PO: The number of lines for which optimization
process has started today.
o Lines already in PO: The number of lines for which
optimization process started before today.
o Lines in PO by source type: The number of lines that are
currently in PO, by type of requester.
o Total lines per number of days being optimized:
PO Profile change efficiency:
o % of successful profile changes:
o % of profile change failures per type:
o PO profile changes per hour: number of requests executed per
hour
For lines that ended optimization (PO_RECORD):
o Avg number of iterations per STABILITY_START
o Avg number of iterations per DS_STABILITY_START
o Avg number of iterations per US_STABILITY_START
o Distribution of number of profile changes, per
STABILITY_START
o Avg % improvement in DS rate
o Avg % improvement in US rate
o ReqGen
. generic JVM monitoring
. Start time of each collection block: The point in time every single activity
generated by ReqGen started.
. Duration time of generating schedules: The time it takes for a ReqGen block
to be executed.
o DBLoader
. Only generic JVM monitoring
o DcPc
. Data source type: SQL
Collection distribution along the day per REQUEST_TYPE: Number
of requests processed by DcPc during last 15 minutes.
Collections efficiency
o DSL
. Per DSLAM collections. One graph representing:
% of DSLAMs with connectivity issues per
region: Those network elements that were not
reachable from the platform during the last 3 days.
. Per LINE (Collections):
LINE_CARD
o % of lines with no line card
information: If line card has not been
collected, there will be connectivity
issues with network elements. Besides,
this can impact other Expresse processes,
like PO.
POP_O
o DSLAM count per % of lines with
POP_O collections: A representation of
previous day collections, classifying
DSLAMs according to the % of lines in
those DSLAMs that have collections.
o % of lines with POP_O each hour.
POP_P
o % of lines with POP_P each hour.
PER_TONE
o % of lines with PER_TONE_DATA each
hour and its validity for PE: not all data
collected showing operational status of
each tone is eligible to be used for
performance evaluation.
BULK_VENDOR_ID
o % of lines with BULK_VENDOR_ID:
Availability of information about customer
premises equipment
. Line operations:
Port management operations: Number of
operations per type per hour
Port management efficiency: Result of
operations per result code per hour
Profile changes: Number of operations per type
per hour and source type
Profile change efficiency: Result of operations
per result code per hour
o PON
. LINES:
Accumulated % of lines with PON operational
data collected each hour
Accumulated % of lines with PON performance
data collected each hour
Accumulated % of lines with PON customer
premises equipment related information, each
hour
. OLTs:
Accumulated % of OLT ports with operational
data, each hour
. Per LINE (Actions):
PON ONT management operations per type per
hour
PON OLT Port management operations per type
per hour
o Tomcat
. Global indicators
Transactions per second (tps) vs avg response times last 15 min:
Transactions per second that are served by the Tomcat server, and how
much time, on average, it took to serve them. This indicator allows
both load monitoring and performance monitoring.
Tps per invoked webservice and method: similar to previous indicator,
but with more detailed information, since it drills down into the
different web services that were attended.
Error distribution last 15 min: In case there were errors when
attending requests, type of errors and count.
Timed out request count last 15 min: Detail on the previous indicator,
focused on time-out errors, differentiating the types of time-out.
Tps + Avg response time for main users
. Dynamic reports
Execution time for reports: The time it took for dynamic reports to
execute (start time to end time)
Number of reports with issues in report generation
. Real Time operations
Avg time per location (Tomcat queue vs dcpc /pe / po vs db): Time to
complete real-time operations, and average time distribution depending
on the module attending to the request.
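The two DSL instability indicators defined for the PE module in this section can be illustrated with a short worked example. The specification states that the data source is SQL; the sketch below instead works on in-memory data to show only the arithmetic. It assumes that "more than 30% of time" can be approximated by the fraction of stability samples in a degraded level, and the level name "STABLE" and the line identifiers are hypothetical.

```python
DEGRADED = {"UNSTABLE", "VERY UNSTABLE"}  # levels named in the KPI definition


def dsl_instability(current_levels):
    """Percentage of lines whose current stability level is degraded.

    current_levels maps a line identifier to its latest stability level.
    """
    if not current_levels:
        return 0.0
    degraded = sum(1 for level in current_levels.values() if level in DEGRADED)
    return 100.0 * degraded / len(current_levels)


def dsl_instability_1_month(history, threshold=0.30):
    """Percentage of lines degraded in more than `threshold` of their samples.

    history maps a line identifier to the levels sampled over the last 30 days.
    """
    if not history:
        return 0.0
    flagged = 0
    for levels in history.values():
        if levels and sum(level in DEGRADED for level in levels) / len(levels) > threshold:
            flagged += 1
    return 100.0 * flagged / len(history)


# Hypothetical data: one sample per line per day.
today = {"line1": "STABLE", "line2": "UNSTABLE", "line3": "VERY UNSTABLE", "line4": "STABLE"}
last_30_days = {
    "line1": ["UNSTABLE"] * 10 + ["STABLE"] * 20,  # degraded 33 % of the time
    "line2": ["UNSTABLE"] * 6 + ["STABLE"] * 24,   # degraded 20 % of the time
}
```

In a real deployment the same counts would more likely be computed in the SQL query itself, so that only the resulting percentages reach the dashboard.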
3.4 Generic KPI specification
Expresse deployments typically have standardized architectures, with specific roles defined for each server, as defined in the “Expresse server architecture” section. To monitor Expresse efficiently, we would need to consider the role of the servers to represent their metrics. Nevertheless, there are server-specific raw metrics whose monitoring could help operation and maintenance teams to detect abnormal behaviors, bottlenecks and points of failure. Those metrics are the following:
3.4.1 Generic hardware indicators
5min load: For last 5 minutes, average CPU Load
CPU usage: historical data, read with a fixed frequency that is still to be determined.
Uptime: The time monitored server has been up and running.
Users connected monitoring: number of users being connected at a time
Temperature: If available (and reported), server temperature.
Processes: Number of live processes existing at a specific point in time
Memory Usage: RAM memory usage, and type
Swap usage: Swap memory usage
Disk usage (MBps + IOPS): Total amount of data transmitted or received, as well as the
measured read and write operations to the disk.
Network traffic: per interface, amount of data transmitted and received.
With them, a global dashboard for server monitoring will be set, so that the ASSIA Inc. team can have a quick view on current (and historical) server status.
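Some of the indicators above are exposed by Linux through the /proc filesystem. As a minimal sketch, and not as a description of how the selected monitoring tools collect them, the following functions parse the text of /proc/loadavg (5-minute load and process counts) and /proc/meminfo (swap usage). The function names are hypothetical.

```python
def parse_loadavg(text):
    """Parse the content of /proc/loadavg.

    The fourth field has the form "runnable/existing" and counts kernel
    scheduling entities (processes and threads).
    """
    fields = text.split()
    runnable, existing = fields[3].split("/")
    return {
        "load_5min": float(fields[1]),
        "entities_runnable": int(runnable),
        "entities_existing": int(existing),
    }


def parse_swap_usage(meminfo_text):
    """Return swap usage as a percentage from the content of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key] = int(rest.split()[0])  # sizes are reported in kB
    total = values.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - values["SwapFree"]) / total


# Sample file contents, for illustration only.
sample_loadavg = "0.20 0.18 0.12 1/80 11206\n"
sample_meminfo = "MemTotal: 1000 kB\nSwapTotal: 2000 kB\nSwapFree: 1500 kB\n"
```

On a monitored CentOS/RHEL server, the inputs would be read with `open("/proc/loadavg").read()` and `open("/proc/meminfo").read()`.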
3.4.2 Database (Oracle) specific indicators
a) Generic Oracle indicators (more information in [14] and [15])
Data source type: SQL
o pctg_active_sessions: Percentage of Oracle sessions used
o pctg_active_processes: Percentage of Oracle processes used
o pctg_active_transactions: Percentage of Oracle transactions used
o pctg_opened_cursors: Percentage of opened cursors.
o oracle_transactions_per_second: Transactions per second per system stats interval
o Cache hit ratio: Number of I/O requests that were satisfied in the cache / Total IO
requests
o Redo generated per second:
o Redo writes per second:
o Total_Table_Scans_Per_Sec
o Database_CPU_Time_Ratio
o Row_Cache_Hit_Ratio
o Library_Cache_Hit_Ratio
o Cursor_Cache_Hit_Ratio
o CPU_Usage_Per_Sec
o Database_Wait_Time_Ratio
o Response_Time_Per_Txn
o Buffer_Cache_Hit_Ratio
o Physical_Write_IO_Requests_Per_Sec
o Physical_Read_IO_Requests_Per_Sec
o Physical_Read_Bytes_Per_Sec
o Physical_Write_Bytes_Per_Sec
o Physical_Read_Total_Bytes_Per_Sec
o Physical_Write_Total_Bytes_Per_Sec
o Physical_Reads_Per_Sec
o Physical_Writes_Per_Sec
o Redo_Allocation_Hit_Ratio
o Physical_Write_Total_IO_Requests_Per_Sec
o Physical_Read_Total_IO_Requests_Per_Sec
o I/O_Megabytes_per_Second
o I/O_Requests_per_Second
o Temp_Space_Used
o User_Transaction_Per_Sec
o SQL_Service_Response_Time
o % used space per tablespace and free space: This metric will show whether any action
needs to be taken, such as cleaning up disk space, adding new datafiles, or adding disks.
o Number of invalid objects: Represents the count of objects that are in ‘INVALID’
status
o Number of locked objects: the count of objects of any type that are marked as
‘LOCKED’ status
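Several of the Oracle indicators above are plain ratios or percentages. The arithmetic can be sketched as follows; the function names, the input values and the tablespace name are hypothetical, and the queries that would obtain the raw counters from the database are not shown.

```python
def cache_hit_ratio(cache_hits, total_io_requests):
    """I/O requests satisfied in the cache divided by total I/O requests."""
    return cache_hits / total_io_requests if total_io_requests else 0.0


def pctg_used(current, limit):
    """Percentage of a limited resource in use (sessions, processes, cursors...)."""
    return 100.0 * current / limit if limit else 0.0


def tablespace_report(tablespaces):
    """Map each tablespace to (used %, free bytes) from (used, allocated) bytes."""
    return {
        name: (pctg_used(used, allocated), allocated - used)
        for name, (used, allocated) in tablespaces.items()
    }


# Illustrative numbers only.
report = tablespace_report({"USERS": (750, 1000)})
```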
3.5 Dashboard definition
For this project, we will implement a dashboard to display server metrics for specific customers.
Since this is a software platform, the operating status of the servers it runs on may be monitored. Also, this dashboard may be used both by customers and by the ASSIA support team. It will include all servers, with subpanels per type of statistic. The relevant KPIs that will be monitored are:
o 5min load: For last 5 minutes, average CPU Load
o CPU usage: historical data, read with a fixed frequency that is still to be determined.
o Uptime: The time monitored server has been up and running.
o Users connected monitoring: number of users being connected at a time
o Temperatures: If available (and reported), server temperature.
o Processes: Number of live processes existing at a specific point in time
o Memory Usage: RAM memory usage, and type
o Swap usage: Swap memory usage
o Disk usage (MBps + IOPS): Total amount of data transmitted or received, as well as
the measured read and write operations to the disk.
o Network traffic: per interface, amount of data transmitted and received.
4 Proposal description
After market research looking for tools that meet all aspects described in the chapters 2.2.1 IT deployment automation selection process and 2.2.2 Monitoring software requirements, a decision has been made that meets the Design specification and restrictions. The selection and implementation processes are described in the next paragraphs.
4.1 IT deployment automation
4.1.1 Selection process and final decision
As listed in 2.2.1 IT deployment automation selection process, a number of requirements are specified for automatic deployment tool selection. Many tools have been found in the market that meet some of the required specifications, but only a group of 4 met most of the requirements. These tools are Ansible, SaltStack, Rudder and CFEngine. A requirements compliance matrix is included in Annex D: Deployment automation tool analysis matrix.
The investigation process lasted 1 week. Information sources were mainly the web pages related to each tool, complemented by hands-on testing of the tools. Enough information was gathered to make a decision, Ansible becoming the selected tool.
CFEngine was discarded early in the research process. The tool seemed to be very powerful and efficient, but some misalignments were found:
It has no community version
Very old launch date. The product has been around for 26 years, which in software is too
long.
Configuration files format is not standard
Requires an agent in every host, so initial configuration is complex
Requires extra effort in connectivity, which adds difficulties since the customer’s security
team would need to configure specific firewall rules, which can introduce
additional delays in project execution.
Learning curve is not optimal
Next, Rudder was analyzed. It must be said that the product seems to be promising, but some aspects made it not eligible. The most important drawbacks are listed below:
No big community. This tool is not widely known, and development team is small.
Requires an agent installed in every host, so initial configuration is complex
Requires extra efforts in connectivity, like the previous tool.
For the remaining options, Ansible and SaltStack, the decision was not very clear. Both tools seemed to be equally good, with very good communities and extensive documentation. But again, one of them, SaltStack, required extra connectivity. So eventually, the chosen tool was Ansible. Needing only SSH access to hosts, and not requiring local agents in every node, makes it simple to orchestrate complex Expresse deployments. Setting up an environment like the one depicted in Figure 35 (most complex Expresse architecture) is just a question of a few minutes, once the support team has the initial set of configurations implemented. The only weakness found was that it needs at least Python 2.6 installed on each managed server, but for ASSIA customers this is not a problem since most servers have RedHat 6.* or newer operating system (or the equivalent CentOS version), so Python 2.6 is included.
4.1.2 Ansible to automate deployments
It is not the purpose of this document to be a guide for this tool. However, for ease of understanding, some concepts are referenced below with brief explanations:
Ansible connectivity. Ansible requires SSH connectivity to the target hosts.
Ansible inventory is the set of target hosts, groups of target hosts, variables (global, group-specific, host-specific) and any other configuration declared in variables that may apply to a specific scenario. For this project, one inventory will exist per customer, as different customers have different servers, architectures and configurations. The centralized monitoring scenario will also have its own set of configurations as an additional inventory.
Ansible playbooks are sets of tasks and/or commands that must be executed in the right sequence in order to perform an activity. The activity may be executed on specific hosts defined in the inventory, on a defined group of hosts, or on hosts given in a comma-separated list. A typical use case for a playbook is to define all the tasks needed to download a specific piece of software from a repository, upload it to the target servers, install it, and then alter the standard configuration to meet specific needs.
Ansible roles are reusable units of tasks written to generalize activities regardless of the environment. They have their own set of default variables and tasks. Roles can later be imported by playbooks that use them for a customer-specific environment.
Variables can be custom or Ansible built-in. They can be used in playbooks or roles through the Jinja2 templating system ([17]). Normally they are used to adapt playbooks or roles to the specific needs of each environment. Their values can be inherited (from roles or groups), and inherited values can be overridden by playbooks or at Ansible launch time.
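The inheritance and override behaviour of variables described above can be illustrated with a minimal Python sketch. It is a simplified model, not Ansible's actual implementation (real Ansible defines many more precedence levels), and all variable names and values are hypothetical:

```python
from collections import ChainMap


def resolve_variables(role_defaults, group_vars, playbook_vars, launch_vars):
    """Return the effective variables for a host (simplified precedence model)."""
    # ChainMap looks keys up from left to right, so the highest-priority
    # layer (values given at launch time) is listed first.
    return dict(ChainMap(launch_vars, playbook_vars, group_vars, role_defaults))


# Hypothetical variable names and values, for illustration only.
effective = resolve_variables(
    role_defaults={"rpm_upload_path": "/tmp", "influxdb_port": 8086},
    group_vars={"influxdb_host": "MonServ"},
    playbook_vars={"rpm_upload_path": "/opt/packages"},
    launch_vars={"influxdb_port": 9086},
)
print(effective)
```

In this model the role default for rpm_upload_path is replaced by the playbook value, and the port given at launch time wins over every other layer.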
With those key concepts described, it should be noted that, for this project, Ansible will be installed and used from one ASSIA Inc. server, as specified in Generic restrictions. The server will not be dedicated to this purpose, since Ansible is extremely lightweight (its Linux package only takes 53 MB) and all the configurations it needs are stored in plain text files, which should not need much disk space. RAM and CPU requirements are minimal, so any host with 1 GB of RAM and one CPU should do the job without issues. This way, ASSIA will not incur any extra expenses for new infrastructure.
Finally, as Ansible will be installed on ASSIA servers, only members of the corresponding customer support teams, who are all ASSIA employees, will have access to perform operations with this tool.
4.2 Monitoring Framework
A review of past projects at ASSIA shows that two monitoring tools have already been used: Grafana and Kibana. We focus on them to ease the learning curve and take advantage of the expertise already available. The next paragraphs give a brief analysis of both tools.
Kibana ([1]) is a polished data visualization tool released under the Apache 2.0 open source license. It is used to represent data stored in Elasticsearch ([6]), an advanced search engine based on Apache Lucene. Both tools, together with Logstash, form the ELK stack, which has become one of the most popular toolsets for data analysis. Elasticsearch stores data efficiently and allows quick searches in huge data stores. Many kinds of information can be stored, from logs to time-series statistics, and large volumes are handled efficiently. The product scales very well and covers many use cases. Regarding system requirements, the minimum setup is a server with 8 GB of RAM, fast disks and many-core CPUs; for best performance, 64 GB of RAM is recommended. This does not seem aligned with the low-cost needs of this project. Additionally, it is unlikely to comply with one of the points in the Generic restrictions section, which specifies that memory usage may not exceed 10% of total RAM; this alone makes it ineligible for this project.
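The memory argument can be made concrete with a small Python sketch. The 8 GB requirement and the 10% limit come from the text; the 64 GB host size is a hypothetical example, not a figure from any ASSIA deployment:

```python
def fits_memory_budget(required_gb, total_ram_gb, max_percent=10):
    """True if a tool needing required_gb stays within max_percent of host RAM."""
    # Integer arithmetic avoids floating-point surprises around the threshold.
    return required_gb * 100 <= total_ram_gb * max_percent


def minimum_host_ram_gb(required_gb, max_percent=10):
    """Smallest host RAM (in GB) for which the tool respects the limit."""
    return required_gb * 100 / max_percent


# Hypothetical 64 GB host: 10% of RAM is 6.4 GB, below the 8 GB minimum.
print(fits_memory_budget(8, 64))
print(minimum_host_ram_gb(8))
```

Under these assumptions, the minimum Elasticsearch setup would only respect the restriction on hosts with at least 80 GB of RAM.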
In contrast, Grafana ([8]) is a lightweight web tool for data representation. It supports many data sources, including Elasticsearch, InfluxDB, Prometheus, Graphite and more. It also has a large community that has contributed hundreds of dashboards, which can be used free of charge with little integration effort, and it provides a TV mode that suits the multi-customer scenario. Finally, it is easy to install and configure and runs as a Linux daemon, so it can be configured to start automatically after a server reboot. All those points make it eligible for this project.
Regarding data sources, InfluxDB has been chosen to store all data. This software, together with Telegraf and other tools, is part of the TICK stack developed by InfluxData, an open-source-oriented company. Both tools meet the criteria used to choose Grafana and integrate easily with it. Telegraf is extremely easy to deploy and configure and meets all the requirements for monitoring tools: it is lightweight and versatile, and it supports a wide variety of plugins, most of them built in, that cover the data collection needs of this project. It also supports custom collection schedules for different indicators. Like Grafana, both InfluxDB and Telegraf run as Linux daemons, so they can be configured to start automatically after a server reboot.
4.3 Architectures proposal
Once the automatic deployment and monitoring tools have been chosen, the next step is to define the right solution to meet the project specifications. Two different architectures will be designed: “single-customer scenario” and “multi-customer scenario”. For both scenarios, servers will be labeled with specific roles, depending on their function. The roles will be:
Data collection role: responsible for gathering all data corresponding to the defined KPIs. Any node with this role needs a tool that collects system metrics as well as Expresse-specific metrics or KPIs. Telegraf has been chosen for this purpose, since it comes with many plugins that allow collection of most of the metrics defined in 3.3 Business process KPI specification and 3.4 Generic KPI specification. The following plugins will be included and configured: cpu, disk, diskio, http, nstat, ntpq, processes/procstat, swap, system, exec. All those plugins are documented in [9]. The exec plugin will be used to collect the KPIs gathered using SQL.
Data store role: the purpose of this role is to receive all collected data and store it as time series: all data is labeled accordingly and a timestamp is added, so that any metric can be represented as an evolution of values over time. Additionally, this role is responsible for serving the data requests coming from entities with the data representation role. InfluxDB will be the database used, as referenced earlier. Product information can be found in [10].
Data representation role: all data stored by the previous role will be used by this one to plot it in the dashboards defined in the Dashboard definition section. The tool used for this purpose will be Grafana, as mentioned earlier.
Deployment role: responsible for orchestrating all the operations needed to deploy the different tools and files, and for executing any actions needed to have the monitoring framework installed and running with the corresponding configurations.
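As an illustration of how a script launched by the Telegraf exec plugin could hand a SQL-derived KPI to the data store role, the following Python sketch formats a value in InfluxDB line protocol, which the exec plugin can parse when it is configured with the influx data format. The measurement, tag and field names are hypothetical, the escaping rules are simplified, and the SQL query itself is left out:

```python
def _escape(text, chars):
    """Backslash-escape the given characters (simplified line-protocol rules)."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render a field value: booleans, integers (suffix i), floats, or quoted strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tag=value field=value [timestamp]."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


# Hypothetical KPI value, as if it had just been read from the Expresse database.
print(to_line_protocol("expresse_kpi", {"host": "DBServ", "kpi": "pending_jobs"}, {"value": 12}))
```

When the timestamp is omitted, as in the example call, the receiving side assigns the collection time.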
Once the roles are defined, the next step is to map them onto architecture proposals. First, we introduce the architecture design for the single-customer scenario:
Figure 1. Single-customer scenario architecture
In this architecture, all servers hosting the monitoring framework belong to the customer and are located in customer premises, while the server orchestrating the automatic deployment belongs to ASSIA Inc. Some highlights of the architecture:
All servers located in customer premises have the data collection role. This role will be performed by Telegraf.
The server named “MonServ” will have both the data store and data representation roles. Data representation will be implemented with Grafana, which will be accessible over HTTP on port 3000. An example URL would be http://MonServ:3000/. The data store role will be performed by InfluxDB, which will listen on port 8086 both for incoming data collected by Telegraf and for data requests coming from Grafana.
The server named “DBServ” is the one where the Expresse database is installed. It will host a collection of scripts orchestrated by Telegraf to gather some indicators (SQL-type) specified in 3.3 Business process KPI specification.
The server named “DepServ” will host Ansible and will have direct connectivity to all servers in customer premises by means of a site-to-site VPN. Over that VPN, SSH connectivity will be available through a jump host, which is not represented in the design above for clarity. For this project, Ansible has been installed directly on the jump host, to reduce ASSIA-internal networking overhead.
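The two kinds of traffic that InfluxDB receives on port 8086 can be sketched as follows. The sketch assumes an InfluxDB 1.x HTTP API (endpoints /write and /query) and a database named telegraf, which is Telegraf's default; neither detail is stated in this document, and no network connection is made:

```python
from urllib.parse import urlencode


def influx_write_url(host, database, port=8086):
    """URL that a collector such as Telegraf would POST line-protocol data to."""
    return f"http://{host}:{port}/write?" + urlencode({"db": database})


def influx_query_url(host, database, query, port=8086):
    """URL that a client such as Grafana would use to request stored data."""
    return f"http://{host}:{port}/query?" + urlencode({"db": database, "q": query})


print(influx_write_url("MonServ", "telegraf"))
print(influx_query_url("MonServ", "telegraf", "SHOW MEASUREMENTS"))
```

Both URLs target the same host and port, which is why a single firewall rule towards MonServ:8086 covers data collection and data representation.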
Secondly, an architecture for centralized platform monitoring of Expresse deployments from many customers, for ASSIA Inc. internal use, has also been designed as part of this project. This is the multi-customer scenario named earlier. To implement it, each customer included must already have the single-customer scenario implemented, although the data representation role is not mandatory there. The architecture differs slightly from the previous scenario:
Data collection roles are not represented, although they are assumed to be deployed as part of the respective single-customer scenario.
The data representation role is implemented in ASSIA premises.
The data representation role connects to the data store servers of each customer:
Figure 2. Multi-customer scenario architecture
The current project implements the single-customer scenario.
4.4 Ansible installation
To start working with Ansible, it will be installed on DepServ. Installation instructions can be found in [16]. The only additional information worth mentioning here is that, as indicated before, Ansible requires at least Python 2.6, and that requirement must be met on every managed host. Additionally, the libselinux-python package may be required on managed hosts that have SELinux enabled.
4.5 Ansible roles specification
The next step is the definition and implementation of the Ansible roles. The goal is to identify the different activities to be done, enumerate the corresponding tasks, sequence them and, with that information, implement the Ansible roles. The identified activities, in execution order, are:
Deploy Grafana. This activity requires performing the following tasks:
o upload corresponding package to target host
o install package once uploaded
o ensure required folders exist
o configure data sources and specific dashboards, depending on the scenario to be
implemented
o configure server to start automatically after host reboot
o start service
Deploy InfluxDB:
o upload corresponding package to target host
o install package once uploaded
o ensure required folders exist
o configure server to start automatically after host reboot
o start service
Deploy Telegraf:
o upload corresponding package to target host
o install package once uploaded
o depending on whether the target host is of type ServN or DBServ, apply the corresponding configuration, uploading all additional files needed for DBServ
o start service
Each role will have its own set of default variables, which will be the following:
Local path on DepServ where the specific rpm is stored
Path on the target host where the rpm will be uploaded
Path to the configuration templates to be used, if any
Configuration to be used to connect to the monitoring database, if needed
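How role defaults and a configuration template combine can be sketched in Python. Ansible itself renders templates with Jinja2; the standard-library string.Template is used here only as a stand-in so that the example runs without extra packages. The variable names and default values are hypothetical, while the rendered section follows the syntax of Telegraf's InfluxDB output plugin:

```python
from string import Template

# Stand-in for a Jinja2 template shipped with the role (hypothetical content).
TELEGRAF_OUTPUT_TEMPLATE = Template(
    "[[outputs.influxdb]]\n"
    '  urls = ["http://$influxdb_host:$influxdb_port"]\n'
    '  database = "$influxdb_database"\n'
)

# Hypothetical role defaults; a customer inventory may override any of them.
ROLE_DEFAULTS = {
    "influxdb_host": "localhost",
    "influxdb_port": 8086,
    "influxdb_database": "telegraf",
}


def render_output_section(overrides):
    """Merge the overrides on top of the role defaults and render the template."""
    values = {**ROLE_DEFAULTS, **overrides}
    return TELEGRAF_OUTPUT_TEMPLATE.substitute(values)


print(render_output_section({"influxdb_host": "MonServ"}))
```

Keeping the connection details in variables means the same role can serve every customer, with only the inventory changing between deployments.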
4.6 Ansible inventories
For each customer, an inventory must be defined. As described before, a customer inventory may have custom configurations that only apply to that customer. The information to be provided for each inventory is listed below:
List of servers, grouped by server type, as specified in Architectures proposal. Servers must be reachable from DepServ
For each server, the SSH credentials to be used
For DB servers, the Oracle credentials to be used
In addition, a separate inventory will be defined for the centralized monitoring (multi-customer) scenario, containing information on the target host (ASSIAMonServ), such as IP address, SSH port if other than the standard one, credentials, and any other relevant configuration.
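The grouping of servers by type can be sketched with a small Python helper that renders an INI-style Ansible inventory. The group names are the ones used in the tests of chapter 5; the host names are the generic ones from the architecture proposal, and their assignment to groups is a hypothetical example. Credentials are deliberately left out:

```python
def render_inventory(groups):
    """Render a mapping of group name to host list as an INI-style inventory."""
    lines = []
    for group, hosts in groups.items():
        lines.append(f"[{group}]")
        lines.extend(hosts)
        lines.append("")  # blank line between groups
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical single-customer inventory.
customer_inventory = {
    "servers": ["Serv1", "Serv2"],
    "db_servers": ["DBServ"],
    "monitoring_db_servers": ["MonServ"],
    "monitoring_gui_servers": ["MonServ"],
}

print(render_inventory(customer_inventory))
```

A file with this content can be passed to ansible-playbook with the -i option.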
4.7 Ansible playbooks
For each customer, a playbook will be defined that imports all the defined roles, configured to implement the single-customer scenario. Those playbooks will be paired with the corresponding customer inventories, so that all the needed configurations will be ready to use.
5 Results
To test the implementation, we first implemented the roles specified in 4.5 Ansible roles specification. Initial tests focused on individual roles, using specific playbooks that only imported the role under test. Once this had been checked, the Ansible playbooks indicated in 4.7 Ansible playbooks were implemented and tested on virtual machines that ASSIA Inc. already had for testing purposes. A common inventory was used for all tests. The next paragraphs describe all the tests performed.
5.1 install_influx role testing
The purpose of this test is to check that the InfluxDB software is installed properly on any target host. Target hosts must be included in the Ansible inventory and must belong to a group named “monitoring_db_servers”.
Step 1: check that software is not installed
Figure 3. install_influx role pre-execution checks
This evidence is enough to confirm that the software was absent from the target host, which for this test is telefonica-chile-dev.
Step 2: Launch playbook from DepServ. In this test, DepServ will be fgarzas-dev.
Figure 4. install_influx role execution evidence
At first sight the service start appears to have failed, but a check on the target host shows that it did not:
Figure 5. install_influx role post-execution check
The test was therefore successful.
Command used to install:
date && ansible-playbook -i environments/hosts playbooks/main.yml --limit monitoring_db_servers
In the inventory used, monitoring_db_servers only contained the target host, for testing purposes. playbooks/main.yml was also adapted for this test.
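For repeated test runs, the same command line can be assembled programmatically. The following Python sketch only builds and prints the command; the helper name is hypothetical, and only the -i and --limit options shown in the command above are used:

```python
import shlex


def playbook_command(inventory, playbook, limit=None):
    """Build the ansible-playbook argument list used in the tests."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if limit:
        # --limit restricts the run to one host or group of the inventory.
        cmd += ["--limit", limit]
    return cmd


cmd = playbook_command("environments/hosts", "playbooks/main.yml", "monitoring_db_servers")
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) on DepServ to launch the playbook and fail loudly on a non-zero exit code.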
5.2 install_telegraf role testing in normal server
The purpose of this test is to check that the Telegraf software is installed properly on any target host, that data is being collected, and that it is being correctly sent to InfluxDB. Target hosts must be included in the Ansible inventory and must belong to a group named “servers”. To tell Telegraf which InfluxDB instance must receive the collected data, the inventory used by the playbook must define the influxdb_connection variable accordingly. This test also requires InfluxDB to be started and listening for incoming data.
Step 1: check that software is not installed
Figure 6. install_telegraf role pre-execution tests
This evidence is enough to confirm that the software was absent from the target host, which for this test is telefonica-chile-dev.
Step 2: Launch playbook from DepServ. In this test, DepServ will be fgarzas-dev.
Figure 7. install_telegraf role execution evidence
On the target host, we can observe that the telegraf service is running:
Figure 8. install_telegraf role post-execution check (1/3)
We can also observe that Telegraf has created a database to store its gathered data in the previously installed InfluxDB:
Figure 9. install_telegraf role post-execution check (2/3)
In addition, we can see data being stored in the database:
Figure 10. install_telegraf role post-execution check (3/3)
This shows that the test was successful. The command used was:
date && ansible-playbook -i environments/hosts playbooks/main.yml --limit telefonica-chile-dev
5.3 install_telegraf role testing in database server
The purpose of this test is to check that the Telegraf software is installed properly on any host where the Expresse database is installed, that Expresse KPIs and Oracle indicators are being collected, and that they are being correctly sent to InfluxDB. Target hosts must be included in the Ansible inventory and must belong to a group named “db_servers”.
Step 1: check that software is not installed
Figure 11. install_telegraf role for database server. Pre-execution check
Step 2: Launch playbook from DepServ. In this test, DepServ will be fgarzas-dev.
The evidence gathered for this step is split across several screenshots, since the playbook has many tasks and there are many files to upload:
Figure 12. install_telegraf role for database server. Execution evidence (1/4)
Figure 13. install_telegraf role for database server. Execution evidence (2/4)
Figure 14. install_telegraf role for database server. Execution evidence (3/4)
Figure 15. install_telegraf role for database server. Execution evidence (4/4)
After that, the following checks were performed:
Figure 16. install_telegraf role for database server. Post-execution checks (1/2)
Additionally, some checks were performed in DBServ to test if Expresse and Oracle metrics were
being sent to InfluxDB:
Figure 17. install_telegraf role for database server. Post-execution checks (2/2)
The test was therefore successful.
Command used: date && ansible-playbook -i environments/hosts playbooks/main.yml --limit telefonica-chile-db
5.4 install_grafana role testing
As before, the purpose of this test is to check that Grafana is installed properly on any target host, is accessible, and represents data from InfluxDB correctly. Target hosts must belong to the monitoring_gui_servers group.
Step 1: check that software is not installed
Figure 18. install_grafana role pre-execution checks
Step 2: Launch playbook from DepServ. In this test, DepServ will be fgarzas-dev
Figure 19. install_grafana role execution evidence
Once the playbook had run, we accessed Grafana from a browser at the following URL: http://telefonica-colombia-dev.assia-inc.com:3000/. There we checked that the data source was correctly configured and that the initial dashboard was working, as shown in the snapshots below:
Figure 20. install_grafana role post-execution checks (1/8)
Figure 21. install_grafana role post-execution checks (2/8)
Figure 22. install_grafana role post-execution checks (3/8)
The dashboard snapshots below plot the corresponding information:
Figure 23. install_grafana role post-execution checks (4/8)
Figure 24. install_grafana role post-execution checks (5/8)
Figure 25. install_grafana role post-execution checks (6/8)
Figure 26. install_grafana role post-execution checks (7/8)
Figure 27. install_grafana role post-execution checks (8/8)
The test was therefore successful.
Command used: date && ansible-playbook -i environments/hosts playbooks/main.yml --limit telefonica-colombia-db
6 Budget
As indicated several times throughout this document, the goal of this project is to reduce OPEX, so costs have been kept to a minimum. The budget to execute this project therefore does not include any equipment purchases, license costs or external professional services, since none have been needed. The project has been planned, executed and closed with ASSIA internal resources only.
The resources are described in the following list:
Project Manager. This resource was needed to perform the following tasks:
o Determine activity list
o Define project schedule
o Gather historical project documentation that might be required
o Determine required resources
o Define communications plan
Director of Deployment Engineering: This resource was needed to:
o Approve project scope
o Resolve legal-related needs of the project
o Confirm project compliance with ASSIA current needs
Delivery experts board: This team was required to:
o Review solution requirements
o Oversee the KPIs to be monitored
It consisted of 3 different experts, with a range of experience in Expresse deployments and technology-specific aspects.
Development experts board: This team was required to:
o Provide details on relevant Expresse aspects to be monitored
o Help to better focus on the KPIs that should be monitored
It consisted of 4 different developers who have deep knowledge of the Expresse solution.
Deployment engineer: This resource was responsible for:
o Product research
o KPI proposal
o Project documentation
o Project execution
With those resources, the project budget has been calculated, resulting in a final cost of 23.300 €. Further details on the budget calculation can be found in the following table:
Role description | Number of resources | Average cost per resource/day | Number of days | Final cost
Project Manager | 1 | 1.000,00 € | 4 | 4.000,00 €
Director of Deployment Engineering | 1 | 2.000,00 € | 0,5 | 1.000,00 €
Delivery experts board | 3 | 900,00 € | 1 | 2.700,00 €
Development experts board | 4 | 900,00 € | 1 | 3.600,00 €
Deployment engineer | 1 | 600,00 € | 20 | 12.000,00 €
TOTAL | | | | 23.300,00 €
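The arithmetic behind the table can be checked with a short Python sketch. All figures are taken from the table; only the variable and function names are illustrative:

```python
from decimal import Decimal

# (role, number of resources, average cost per resource/day in EUR, number of days)
BUDGET_LINES = [
    ("Project Manager", 1, Decimal("1000"), Decimal("4")),
    ("Director of Deployment Engineering", 1, Decimal("2000"), Decimal("0.5")),
    ("Delivery experts board", 3, Decimal("900"), Decimal("1")),
    ("Development experts board", 4, Decimal("900"), Decimal("1")),
    ("Deployment engineer", 1, Decimal("600"), Decimal("20")),
]


def line_cost(resources, daily_cost, days):
    """Final cost of one budget line: resources x daily cost x days."""
    return resources * daily_cost * days


total = sum(line_cost(n, cost, days) for _, n, cost, days in BUDGET_LINES)
for role, n, cost, days in BUDGET_LINES:
    print(f"{role}: {line_cost(n, cost, days):.2f} EUR")
print(f"TOTAL: {total:.2f} EUR")
```

The deployment engineer accounts for 12.000 € of the total, which is why the savings estimate in the conclusions is expressed in terms of that resource's daily cost.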
7 Conclusions
Once the implementation steps are finished, the monitoring frameworks are set up and the collections are started, the Expresse monitoring task changes completely compared to the previous approach. Data is now gathered and loaded into the deployed InfluxDB instances, and then plotted in Grafana dashboards. With those dashboards, an ASSIA engineer needs just 5 minutes to review the status of a customer's Expresse servers, and the review itself is a much easier task. Additionally, the learning curve has been greatly reduced, since engineers no longer need a deep understanding of the processes to detect issues in the platform.
The implemented solution has also been a good investment in financial terms. Before it, a deployment engineer spent half an hour per day per customer reviewing each deployment in the needed detail. Now engineers only need 5 minutes, a time reduction of 83%. A very simple calculation shows the efficiency of this project, using the following concepts:
C0: cost of deployment engineer resource = 600€/day
N: Number of customers
T0: Time it takes per day for deployment engineers to monitor Expresse, without presented
solution, in working days
T1: Time it takes per day for deployment engineers to monitor Expresse, with presented
solution, in working days
Tt: Total working days per year = 5 days/week * 52 weeks/year = 260 days/year
The total time, in working days per year, that an engineer used to spend monitoring the deployments of all customers is:
Tprev = N * T0 * Tt = N * (0,5 hours / 8 hours/day) * 260 days/year = 16,25 * N days/year
The total time needed per year with the new monitoring framework is:
Tnew = N * T1 * Tt = N * ((5/60 hours) / 8 hours/day) * 260 days/year ~ 2,71 * N days/year
The total time saved with the new framework (in working days per year) is:
Ttime_saved = Tprev - Tnew ~ 13,5 * N days/year
With that, the total savings per year would be Ttime_saved * C0 ~ N * 8.100 €/year.
As the project budget is 23.300 €, if ASSIA uses this new framework to monitor at least 3 customers, yearly savings reach 3 * 8.100 € = 24.300 € and the investment is recovered in less than 1 year.
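The savings estimate can be reproduced with a short Python sketch. The input figures (30 and 5 minutes per customer per day, 8-hour working days, 260 working days per year, 600 €/day and the 23.300 € budget) come from the text; the function names are illustrative. Because the sketch does not round intermediate values, it yields about 8.125 € per customer per year, slightly above the rounded figure quoted in the text:

```python
import math

HOURS_PER_DAY = 8
WORKING_DAYS_PER_YEAR = 5 * 52   # Tt
ENGINEER_COST_PER_DAY = 600      # C0, in EUR
PROJECT_BUDGET = 23300           # in EUR


def days_per_year(minutes_per_customer_per_day, customers):
    """Working days per year spent monitoring the given number of customers."""
    fraction_of_day = minutes_per_customer_per_day / 60 / HOURS_PER_DAY
    return customers * fraction_of_day * WORKING_DAYS_PER_YEAR


def yearly_savings(customers, before_min=30, after_min=5):
    """Yearly savings in EUR when monitoring drops from before_min to after_min."""
    saved_days = days_per_year(before_min, customers) - days_per_year(after_min, customers)
    return saved_days * ENGINEER_COST_PER_DAY


def break_even_customers():
    """Smallest number of customers that recovers the budget within one year."""
    return math.ceil(PROJECT_BUDGET / yearly_savings(1))


print(days_per_year(30, 1), days_per_year(5, 1))
print(yearly_savings(1))
print(break_even_customers())
```

The break-even point stays at 3 customers with either the rounded or the unrounded savings figure.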
Additionally, the implemented solution lays the foundation for future project extensions, in which new indicators will be defined and new activities automated, further increasing efficiency.
In summary, the main conclusion is that this approach helps ASSIA to significantly reduce the OPEX related to customer platform monitoring, meeting all specifications and restrictions, with a small investment that will be recovered in a very short period.
8 Future works
For ASSIA Inc., this project lays the first stone towards efficient monitoring automation. The ability to easily deploy and configure monitoring frameworks is good news for the company, but the project can be extended in many ways. Below we enumerate a set of proposals, each with a brief description, outlining future work streams that will probably be implemented within this year.
Expresse and Oracle KPI dashboard implementation. As the KPIs are already defined, the next step is to define dashboards to represent them. They will be designed following a top-down approach: the top of the page will show a brief summary with the latest values of the main KPIs and, scrolling down, ASSIA engineers will find much more detail on every process.
Multi-customer environment implementation. As ASSIA grows and wins new customers, it becomes more important to have the main KPIs monitored in a central platform, allowing ASSIA to benchmark different customers easily and to detect anomalies at a high level.
New KPI definition and corresponding dashboard modification/implementation. This initiative has been very well received by ASSIA colleagues from other departments, such as Systems Engineering, who have shown interest in defining new KPIs and including them in the monitoring framework, to automatically obtain detailed information on different processes.
References
[1] Elasticsearch B.V. (2019). Kibana product page. Retrieved from https://www.elastic.co/products/kibana
[2] ASSIA Inc. (2019). Retrieved from ASSIA Inc. web site: https://www.assia-inc.com/
[3] ASSIA Inc. (2019). Application Monitoring. Redwood City: ASSIA Inc.
[4] ASSIA Inc. (2019). Expresse High Availability Setup. Redwood City: ASSIA Inc.
[5] ASSIA Inc. (2019). Profile Optimizer. Redwood City: ASSIA Inc.
[6] Elasticsearch B.V. (2019). Elasticsearch engine product page. Retrieved from https://www.elastic.co/products/elasticsearch
[7] FOSSA, Inc. (2019). Software Licenses in Plain English. Retrieved from tl;drLegal: https://tldrlegal.com
[8] Grafana Labs. (2019). Grafana product page. Retrieved from https://grafana.com
[9] InfluxData. (2019). Telegraf Input Plugins. Retrieved from InfluxData web page: https://docs.influxdata.com/telegraf/v1.10/plugins/inputs
[10] InfluxData. (2019). InfluxDB product page. Retrieved from https://docs.influxdata.com/influxdb/
[11] Normation. (2019). What is RUDDER? Retrieved from RUDDER: http://www.normation.com/en/rudder/what-is-rudder/
[12] Northern.tech, Inc. (2019). CFEngine product page. Retrieved from CFEngine: https://cfengine.com
[13] Open Source Software Initiative. (2019). Retrieved from Open Source Software Initiative Web Site: https://opensource.org/
[14] Oracle Corporation. (2019). Retrieved from Oracle Community: https://community.oracle.com
[15] Oracle Corporation. (2019). Documentation. Retrieved from Oracle Help Center: https://docs.oracle.com
[16] Red Hat, Inc. (2019). Automation for everyone. Retrieved from Red Hat Ansible: https://www.ansible.com/
[17] Ronacher, A. (2019). Jinja. Retrieved from http://jinja.pocoo.org/
[18] SaltStack, Inc. (2019). Retrieved from Saltstack: https://www.saltstack.com/
[19] InfluxData. (2019). InfluxData downloads. Retrieved from InfluxData web page: https://portal.influxdata.com/downloads/
Annex A: Diagrams
This section contains the diagrams referenced throughout this document, each followed by a brief description.
Figure 28. Generic Expresse integration architecture
The figure above represents a generic scenario with Expresse integrated in the customer systems architecture. On the lower left, inventory systems provide Expresse with information on which customers are to be included in Expresse, where they are provisioned, which services they are subscribed to, and what their external plant looks like in terms of primary/secondary cables, terminal boxes (DSL) or ODN distribution. This information is relevant to Expresse, as it is used to map network resources (lines as they exist in network elements) to business resources (any customer identification). Additionally, trouble tickets can be provided to Expresse, to improve the accuracy of recommendations as well as to ease ASSIA support team assessments.
On the lower right, a cloud representing network elements can be found. Expresse connects directly to them without using any NMS from the equipment vendor. Communication is commonly achieved using SNMP, although other protocols are supported.
On the upper right, several actors and systems can be found. Some of them are users connecting to Expresse over HTTP/HTTPS in order to access the Expresse GUI; other users may connect to Expresse indirectly, through SOAP web services triggered by OSSs. These users can belong to any of the groups defined in User types enumeration, back in the Expresse architecture design section.
Finally, on the upper left there are a number of users or systems receiving information from Expresse. Those systems normally receive a CSV file with the results of dynamic reports, which can be consumed by many different departments.
Figure 29. Region representation for big deployments
Customers with a big network, offering services to more than 3 million customers, may have Expresse operations split into smaller regions. The number of customers is not the only aspect that can lead to region splitting: management network bandwidth may also require Expresse servers to be installed “closer” to the region they serve, to optimize network bandwidth usage.
Figure 30. Full-tier application server
Figure 31. Role-specific application servers
Figure 30 and Figure 31 show different degrees of function specialization for application servers, from the simplest (the first one) to the most complex (the last one). The latter applies to deployments of Expresse in very big networks, over 6 million lines, although they can also be found with web and application servers integrated, but mirrored, as specified in
Figure 32. Expresse simple DB architecture
Figure 33. Expresse complex DB architecture
Figure 32 and Figure 33 show different degrees of function specialization for DB servers, from the simplest (the first one) to a more complex one (the last one). The database technology used for Expresse is Oracle Enterprise Database. For the first case, a single server with this product installed is enough, with just one Oracle instance accessing data stored on local disks. For the second case, Oracle RAC is needed, so that several Oracle instances can be deployed on different servers, all accessing the same data, stored in a storage server. A virtual IP exists to access Oracle SCAN, a technology that provides a logical clustered layer through which an instance of the DB is accessed, and a mechanism that makes node failures transparent to users.
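The failover behaviour described for the clustered case can be pictured with a small simulation. The sketch below is not Oracle code and uses no Oracle API: the `Instance` and `ClusterAddress` classes and the node names are hypothetical, and the real SCAN layer does considerably more (for example, load balancing). It only illustrates how a single logical address can hide the failure of one node from the client.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One database instance running on a cluster node (hypothetical model)."""
    node: str
    alive: bool = True


@dataclass
class ClusterAddress:
    """Single logical entry point in front of several instances."""
    instances: list = field(default_factory=list)

    def connect(self) -> str:
        # Hand the session to the first instance that is still reachable.
        for instance in self.instances:
            if instance.alive:
                return instance.node
        raise ConnectionError("no database instance available")


scan = ClusterAddress([Instance("dbnode1"), Instance("dbnode2")])
first = scan.connect()            # served by dbnode1
scan.instances[0].alive = False   # simulate a node failure
second = scan.connect()           # same address, now served by dbnode2
```

The client calls `connect()` on the same object before and after the failure, which is the property the clustered layer is meant to provide.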
Figure 34. Expresse DB servers synchronized with Oracle Data Guard
Another scenario that can be found in Expresse deployments is the one represented in Figure 34. In this case, data storage is kept synchronized with a stand-by database, so that downtime related to main database failures is decreased significantly.
Figure 35. Most complex Expresse architecture
The ultimate architecture, devised for customers where Expresse becomes a systemic OSS, is shown in the figure above. Server replication exists for all types of servers: the web server role is integrated in the application server roles, and there are n servers with the application server role (3 in this example), with DcPc configured in distributor mode. DcPc role servers are also replicated, with 2 defined regions having stand-by servers that would be enabled automatically in case of failure of the main DcPc servers. Database replication is in place as well, both for database access servers and for the data layer. Additionally, this architecture covers full-deployment replication, so in case of a power outage in one data center, the other one would be ready to work within a short time.
Annex B: Open Source licenses of analyzed software
This annex summarizes the characteristics of the open-source software licenses of the products considered for this project. A brief description of each is included (for complete information, refer to [7]):
MIT: There are no restrictions on using this software, even for commercial purposes. The only conditions are the following:
o Copyright must be included
o License must be included
o Software author cannot be held liable. This means that the author of MIT-licensed software is not responsible for any situation related to this software or its usage.
Apache License 2.0 (Apache-2.0): Close to the MIT license. Additional points:
o Changes must be stated
o NOTICE files must be updated/appended with attribution notes whenever the original software had “NOTICE” files.
o Contributors' names, trademarks or logos cannot be used.
GNU Lesser General Public License v3 (LGPL-3.0):
o Copyright must be included
o License must be included
o Install instructions must be included in case the software is used to build (as a part of) a consumer device
o Changes must be published/stated, and modification dates must be tracked in source files
o References to the original software install instructions, or the place where they can be obtained, must be included
o Software author cannot be held liable
o Products cannot be sub-licensed. This implies that any derivative work of the library may be redistributed only under LGPL, but applications using that library don’t have to.
o Users have the right to practice the patent claims of contributors
GNU General Public License v3 (GPL-3.0). This license is very close in its definition to LGPLv3:
o In addition to the LGPLv3 points, any change made, or any other software that includes GPL-licensed code, must be made available under GPL, along with build and install instructions.
GNU Affero General Public License v3 (AGPL-3.0). It is also very close to the other GPL licenses, with the particularity that it is focused on network/web software. Nothing is stated in terms of patent claims in this case.
Annex C: User Guide
Since the purpose of this project is to have a set of tools to automatically deploy a monitoring framework, this chapter includes the different commands that may be used to achieve that, and some variations that may apply.
1. Install Ansible
The guide to install Ansible in DepServ can be found in [16]. In this case, the EPEL repository was enabled so that the yum package manager could be used. To install the package, the following operations were performed (as root user):
Create /etc/yum.repos.d/epel.repo file with the following content:
[epel]
name=Extra Packages for Enterprise Linux 6 - $basearch
#baseurl=http://download.fedoraproject.org/pub/epel/6/$basearch
mirrorlist=https://mirrors.fedoraproject.org/metalink?repo=epel-6&arch=$basearch
failovermethod=priority
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
[epel-debuginfo]
name=Extra Packages for Enterprise Linux 6 - $basearch - Debug
#baseurl=http://download.fedoraproject.org/pub/epel/6/$basearch/debug
mirrorlist=https://mirrors.fedoraproject.org/metalink?repo=epel-debug-6&arch=$basearch
failovermethod=priority
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
gpgcheck=1
[epel-source]
name=Extra Packages for Enterprise Linux 6 - $basearch - Source
#baseurl=http://download.fedoraproject.org/pub/epel/6/SRPMS
mirrorlist=https://mirrors.fedoraproject.org/metalink?repo=epel-source-6&arch=$basearch
failovermethod=priority
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
gpgcheck=1
Clean yum cache: yum clean all
Install Ansible package using yum: yum install ansible
Check that Ansible has been correctly installed: rpm -qa | grep ansible
After the steps above, Ansible is installed in DepServ.
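Before running yum, the repository definition can be sanity-checked. The following is a minimal sketch, assuming Python is available on DepServ; the `check_repo` helper and the three rules it applies are illustrative assumptions, not part of yum or Ansible. It parses a shortened copy of the [epel] section with the standard `configparser` module.

```python
import configparser

# Shortened copy of the [epel] section; the other two sections
# of the file follow the same pattern.
EPEL_REPO = """\
[epel]
name=Extra Packages for Enterprise Linux 6 - $basearch
#baseurl=http://download.fedoraproject.org/pub/epel/6/$basearch
mirrorlist=https://mirrors.fedoraproject.org/metalink?repo=epel-6&arch=$basearch
failovermethod=priority
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
"""


def check_repo(text):
    """Return a list of problems found in a yum .repo definition.

    Hypothetical helper: the rules below are assumptions chosen for
    illustration, not checks performed by yum itself.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    problems = []
    for section in parser.sections():
        repo = parser[section]
        if repo.get("enabled") != "1":
            problems.append(f"{section}: repository is not enabled")
        if repo.get("gpgcheck") != "1":
            problems.append(f"{section}: GPG check is disabled")
        if not (repo.get("mirrorlist") or repo.get("baseurl")):
            problems.append(f"{section}: no mirrorlist or baseurl")
    return problems


problems = check_repo(EPEL_REPO)
```

An empty `problems` list means every section is enabled, signed and has a download source; any other result names the section to review.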
2. Create directory structure and place corresponding files
Although it is not mandatory, most users follow best practices when writing Ansible playbooks, to ease sharing them with the community. We have tried to follow these best practices by applying the following directory structure:
expresse_monitoring: the root folder
  o roles: a folder containing the required roles
    . install_influx:
        README.md: role documentation in Markdown format
        tasks
          o main.yml: file containing all tasks needed to deploy InfluxDB
        handlers
          o main.yml: file containing the handler configuration, which tells Ansible how to restart InfluxDB
        defaults
          o main.yml: file containing all default variables needed for this role
        templates
          o influxdb.conf: configuration template
        files
          o influxdb-1.7.6.x86_64.rpm: rpm package previously downloaded from [19]
    . install_grafana:
        README.md: role documentation in Markdown format
        tasks
          o main.yml: file containing all tasks needed to deploy Grafana
        handlers
          o main.yml: file containing the handler configuration, which tells Ansible how to restart Grafana
        defaults
          o main.yml: file containing all default variables needed for this role
        templates
          o customer_datasource.yml.j2: template that contains the data source for a specific customer, which will be used by Grafana
          o sysmetrics_dashboard.json.j2: dashboard definition, containing by default just one customer server showing statistics
        files
          o dashboards.yml: generic configuration that indicates the path of the provisioned dashboards
          o grafana-6.2.2-1.x86_64.rpm: rpm package previously downloaded from [8]
    . install_telegraf:
        README.md: role documentation in Markdown format
        tasks
          o main.yml: file containing all tasks needed to deploy Telegraf
        handlers
          o main.yml: file containing the handler configuration, which tells Ansible how to restart Telegraf
        defaults
          o main.yml: file containing all default variables needed for this role
        templates
          o telegraf.conf.j2: template that contains the configuration for a specific customer, which will be used by Telegraf in servers not hosting the Expresse database
          o telegraf_db.conf.j2: template that contains the configuration for a specific customer, which will be used by Telegraf in servers hosting the Expresse database
          o execSQL.sh: script to get Expresse statistics
          o execSQL_sysdba.sh: script to get Oracle statistics
          o 15min: folder containing files with the SQL queries for statistics gathered every 15 minutes. Files are not included in this document for clarity.
          o hourly: folder containing files with the SQL queries for statistics gathered hourly. Files are not included in this document for clarity.
          o daily: folder containing files with the SQL queries for statistics gathered daily. Files are not included in this document for clarity.
        files
          o telegraf-1.10.4-1.x86_64.rpm: rpm package previously downloaded from [19]
  o environments: a folder containing Ansible inventories
    . hosts: inventory used for this project
  o playbooks: a folder containing the Ansible playbook
    . main.yml: a playbook that just imports the corresponding roles, depending on the scenario to be implemented
  o ansible.cfg: a file containing the Ansible configuration for the playbooks stored in the corresponding folder. Since it is so simple, its contents are included next:
[defaults]
roles_path = ~/expresse_monitoring/roles
display_skipped_hosts = false
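The empty folder layout can also be created programmatically. The sketch below is one possible way to do it with the Python standard library; the `create_skeleton` function is a hypothetical helper and not part of the delivered project, and it only creates folders, README placeholders and ansible.cfg (role files, templates and rpm packages still have to be placed by hand).

```python
import tempfile
from pathlib import Path

ROLES = ("install_influx", "install_grafana", "install_telegraf")
ROLE_DIRS = ("tasks", "handlers", "defaults", "templates", "files")


def create_skeleton(root):
    """Create the empty folder layout under `root` and return that path."""
    root = Path(root)
    for role in ROLES:
        for sub in ROLE_DIRS:
            (root / "roles" / role / sub).mkdir(parents=True, exist_ok=True)
        (root / "roles" / role / "README.md").touch()
    for folder in ("environments", "playbooks"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    (root / "ansible.cfg").write_text(
        "[defaults]\n"
        "roles_path = ~/expresse_monitoring/roles\n"
        "display_skipped_hosts = false\n"
    )
    return root


# Build the layout in a throw-away directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    project = create_skeleton(Path(tmp) / "expresse_monitoring")
    created = sorted(p.name for p in (project / "roles").iterdir())
    has_tasks = (project / "roles" / "install_telegraf" / "tasks").is_dir()
```

Calling `create_skeleton` with a path under the user's home directory would produce the same layout that `roles_path` in ansible.cfg points to.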
3. Configure Ansible inventory
The Ansible inventory is a key element in this project. The relevant aspects to consider when implementing it are the following:
Global variables: Default values for global variables are set in order to better manage all roles. All of them can be overridden by group variables or host variables. The list is the following:
o ansible_become_pass: Superuser password
o ansible_become_method: To indicate method used by Ansible to become superuser
o ansible_user: User for Ansible to SSH
o ansible_ssh_pass: Password for SSH connection.
o download_folder: Path where Ansible will upload rpm packages.
o customers_to_monitor: All target customers for this project.
Groups of servers: Several groups of servers may be defined, to tell Ansible easily where it has to perform each task. Those defined are the following:
o servers: Those servers labeled as ServN, where Telegraf must be installed (with no
Expresse/Oracle KPIs collection). This group will have the following variables:
. ansible_user: To indicate user that Ansible must use to SSH
. ansible_ssh_pass: corresponding password
. install_telegraf: variable to tell the corresponding role that Telegraf must be
installed, not including Expresse/Oracle KPIs monitoring
o db_servers: Those servers labeled as DBServ, where Telegraf must be installed (additionally enabling Expresse/Oracle KPIs collection)
. ansible_user: To indicate user that Ansible must use to SSH
. ansible_ssh_pass: corresponding password
. install_telegraf: variable to tell the corresponding role that Telegraf must be installed, including Expresse/Oracle KPIs monitoring
o monitoring_db_servers: Those servers labeled as MonServ or ASSIAMonServ where
InfluxDB must be installed. Corresponding vars are:
. install_influx: To indicate that this software may be installed for this group
o monitoring_gui_servers: Those servers labeled as MonServ or ASSIAMonServ,
where Grafana must be installed.
. install_grafana: To indicate that this software may be installed for this group
o customer specific group: A group of servers named with a label listed in
customers_to_monitor global variable. It must contain all servers that must be
monitored, no matter the server role. It will be used to be able to limit playbook
execution to specific customers.
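The groups and variables listed above can be pictured as a data structure. The project uses a static `hosts` file; as an alternative illustration, the sketch below expresses the same layout in the JSON form that Ansible accepts from a dynamic inventory script. All host names, the `customer_a` label and the variable values are made-up placeholders, and credentials are deliberately left out.

```python
import json


def build_inventory():
    """Return the inventory in the JSON layout Ansible expects from
    a dynamic inventory script called with --list."""
    return {
        "all": {
            "vars": {
                "ansible_become_method": "su",          # placeholder value
                "download_folder": "/tmp/monitoring",   # placeholder path
                "customers_to_monitor": ["customer_a"],
            }
        },
        "servers": {
            "hosts": ["serv1.example"],
            "vars": {"install_telegraf": True},
        },
        "db_servers": {
            "hosts": ["dbserv.example"],
            "vars": {"install_telegraf": True},
        },
        "monitoring_db_servers": {
            "hosts": ["monserv.example"],
            "vars": {"install_influx": True},
        },
        "monitoring_gui_servers": {
            "hosts": ["monserv.example"],
            "vars": {"install_grafana": True},
        },
        # Customer-specific group: every monitored server of that customer.
        "customer_a": {"hosts": ["serv1.example", "dbserv.example"]},
        "_meta": {"hostvars": {}},
    }


inventory = build_inventory()
inventory_json = json.dumps(inventory, indent=2)
```

Each label in `customers_to_monitor` has a group of the same name, which is what later allows the playbook run to be limited to one customer.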
4. Configure Ansible playbook
This playbook is pretty simple, since it just contains import commands to add the tasks implemented in the Ansible roles defined above. Its content is the following:
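# Deploy InfluxDB on the monitoring database servers
- hosts: monitoring_db_servers
  tasks:
    - import_role:
        name: install_influx
# Deploy Telegraf on every monitored server
- hosts: servers,db_servers
  tasks:
    - import_role:
        name: install_telegraf
# Deploy Grafana on the monitoring GUI servers
- hosts: monitoring_gui_servers
  tasks:
    - import_role:
        name: install_grafana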
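Ansible resolves each imported role by the name of its folder under the roles path, so a name in the playbook that does not match a folder makes the run fail. A quick check can catch this before running anything. The sketch below is a simplified illustration: `missing_roles` is a hypothetical helper, and its regular expression treats every `name:` line as a role name, which is only valid for playbooks as simple as this one.

```python
import re
import tempfile
from pathlib import Path

PLAYBOOK = """\
- hosts: monitoring_gui_servers
  tasks:
    - import_role:
        name: install_grafana
- hosts: servers,db_servers
  tasks:
    - import_role:
        name: install_telegraf
"""


def missing_roles(playbook_text, roles_dir):
    """Return imported role names that have no folder under roles_dir."""
    imported = re.findall(r"^\s*name:\s*(\S+)\s*$", playbook_text, re.MULTILINE)
    return [name for name in imported if not (Path(roles_dir) / name).is_dir()]


with tempfile.TemporaryDirectory() as tmp:
    # Only one of the two roles exists in this throw-away roles folder.
    (Path(tmp) / "install_grafana").mkdir()
    missing = missing_roles(PLAYBOOK, tmp)
```

Here `missing` contains the role whose folder was not created, which is the kind of mismatch the check is meant to expose.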
5. Run Ansible playbook
Depending on the activity to be performed, the command may differ. Since this project enables the full installation of the monitoring solution for Expresse for a customer, the command to run, once every previous step is done, is the following:
ansible-playbook -i environments/hosts playbooks/main.yml --limit
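The value passed to `--limit` is not shown in the command above. Based on the customer-specific group described in the inventory section, a reasonable assumption is that it is the customer's group name. The sketch below builds the argument list under that assumption; `playbook_command` and the `customer_a` label are hypothetical.

```python
import shlex


def playbook_command(customer,
                     inventory="environments/hosts",
                     playbook="playbooks/main.yml"):
    """Build the ansible-playbook call restricted to one customer group."""
    if not customer or not customer.replace("_", "").isalnum():
        raise ValueError("customer must be a plain inventory group name")
    return ["ansible-playbook", "-i", inventory, playbook, "--limit", customer]


argv = playbook_command("customer_a")   # "customer_a" is a made-up label
command_line = shlex.join(argv)
```

The list in `argv` can be handed to `subprocess.run(argv, check=True)` from the project root folder, while `command_line` is the equivalent text to type in a shell.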
Annex D: Deployment automation tool analysis matrix
Values in each row follow the column order Ansible | SaltStack | Rudder | CFEngine. A row with fewer than four values has empty cells in the matrix.
Product/Architecture
License type: GPL-3.0 | Apache 2.0 | GPLv3+ | GPLv3
Launch date: 2012 | 2011 | 2012 | 1993
Last version and date: 2.7.7 | 5.0.1 (2018-10-19) | 3.12.0 (2018-06-28)
Implementation language: Python | Python | C | C
Community: Yes, very active (like live chat) | Yes | Not very big | Yes, very powerful
UI Web/CLI/API: Yes* (Tower edition) | Not available in Community version
Configuration files language: YAML | YAML | YAML | Own format
RAM used by central server: Up to 4GB | >= 2 GB
Disk used by central server: Depends on playbooks | max_space = number of Directives * number of Nodes * retention duration in days * 400 kB | 100MB * number of nodes
Requires Agent in every host: Normal operation: agentless | Can work agentless, also can have agents called minions | Yes, and must have connectivity with root server somehow | Yes
RAM used by agent: 20MB | 30 MB
Disk space required by agent: - | 500MB | 256 MB
Type of nodes/elements: rpm package | Root/Policy server (PostgreSQL) + Agent (+ relay server) | cf-execd (cron), cf-serverd (file server and agent communication), cf-monitord (configuration check), cf-agent
Method to connect hosts: SSH | SSH | tcp 5309 | tcp 5308
Hosts must connect server: Yes: tcp 4505/4506 | Yes: 514 (tcp/udp), tcp 5309, tcp 5310 | Yes: tcp 5308
Scalability: Yes, up to 250 nodes (community ed) | Very scalable | Yes, can even have SW load balancing | Yes, by installing the agent
Ease of use: Very easy | Own command interface | Not a one-shot deployment tool | Requires a steep learning curve
Extensibility: Yes. Many modules available for task automation | Yes, many modules available | Very extensible, plugins for AWS
Dependencies: >=Python v2.6 | Vagrant - VMWare - VirtualBox | Agents require syslogd
Other comments: Very fast. Can work with intermediate nodes that spread configs to other nodes not accessible to the first one | Provides platform abstraction, with minions used locally to convert instructions to local system
Framework usage
Allows role-based controls: No | Yes | No
Deployment automation
Application management: Yes | Yes | Yes | Yes
Orchestration: Yes
Allows orchestration: Yes | Yes | Yes | Yes
Configuration Management
Continuous configuration monitoring/fixing: Yes | Yes | Yes | Yes
File blacklist/greylist/etc: Yes | Yes
Configuration compliance reports: Not available in community version | Yes | Yes* | Yes
User/group management: Yes | Yes | Yes | Yes
File template: Yes | Yes | No
Notifications: No | Yes, via reporting | No
Automatic Inventory: Yes | Yes | No
Automatic Configuration rollback: Yes | Yes | ?
Configuration replication from specific node: Manual | Yes | ?
Services monitoring: Yes
Ensures a service/process is running: Yes | Yes? | Yes | Yes
Supports maintenance mode: Yes | Yes | ?
Resources monitoring: Yes
Allow HW resources monitoring: Yes | No
Allow Net monitoring: No
Allow disk monitoring: Yes | No
Ad-hoc task execution: Yes
Allow custom task execution: Yes | Yes | ? | No
SQL Database Management
SQL Database Management: Yes | Yes | ? | Yes
Type of environments
Virtualization + cloud: Yes | Yes, salt-cloud and salt-virt