Student Name: Matthew Boyd Email: Matthew
Total Page:16
File Type:pdf, Size:1020Kb
2019 Student Name: Matthew Boyd Email: [email protected] Department IT-CDA-IC Project Title: MAlt: Microsoft Alternatives Mail: Performance Metrics Supervisor: Giacomo Tenaglia Paweł Grzywaczewski CERN Meyrin, Switzerland 1 CONTENTS I Description of Technologies considered 5 II Collectd 5 III Telegraf 5 IV InfluxDB 5 V cAdvisor 5 VI vAdvisor 5 VII Prometheus 5 VIII Grafana 5 IX Technologies used 5 X First Task 5 XI Second Task 6 XII Third Task 6 XIII Scrum Master 6 XIV Zurich 6 References 6 MALT: Microsoft Alternative Project Mail: Metrics Monitoring Matthew Boyd Abstract Due to the fact that Microsoft has dropped CERN’s academic institute status, therefore leaving licensing 10 times more expensive, CERN has had to come up with an alternative to Microsoft products to try and phase out the need for their products and in turn save money. Mail is currently predominantly on Microsoft’s Exchange servers. CERN is migrating its mail system, hosting 50TB+ of data on 40,000 mailboxes, from Microsoft Exchange to Kopano, an Open Source Linux-based system. The aim of this project is to consolidate the various system and application metrics in use in the new systems, and to develop on top of that an uniform visualisation layer, providing a facility to implement anomaly detection and event correlation. The data will be collected from different services, ranging from systems performance / monitoring, to web servers and back-end databases, to migration-specific tools. The project will make extensive use of free and open source tools such as Collectd, Telegraf and Filebeat to collect the data, ElasticSearch and InfluxDB to store it, and Kibana/Grafana to visualise it. The main operating system that will be used for the project is Cent-OS Linux version 7. Other tools that are going to be heavily used are Puppet (configuration management system), OpenStack (Infrastructure-as-a-Service cloud computing), Openshift (Platform-as-a-Service for web hosting) and Docker/Kubernetes (container platform). 3 MALT: Microsoft Alternative Project Mail: Metrics Monitoring Matthew Boyd Acknowledgements There are a lot of people I would like to thank for my experience here at CERN. First and foremost, I would like to thank my supervisors Giacomo Tenaglia and Paweł Grzywaczewski as well as my team: Riccardo Candido, Dominik Taborsky, Marija Predarska and Leopold Gattinger. Their continued support allowed me to carry out my task effectively through answering the numerous questions I posed. Additionally, I would like to thank my family and girlfriend for supporting me whilst I was living in a foreign country for the first time and for visiting me. 4 I. DESCRIPTION OF TECHNOLOGIES CONSIDERED VII. PROMETHEUS Throughout the project, there were a lot of technologies Prometheus [7] is an open-source monitoring solution. that were evaluated for the given task. The following list Prometheus uses key-value pairs to store data regarding shows all the possible technologies that were considered as metrics. Prometheus allows for many different interactions well as a brief overview of their purpose. from third parties out-of-the-box. Additionally, it provides a graphical interface to see the results that are collected, as well as providing integration support for Grafana. • Collectd • Telegraf VIII. GRAFANA • InfluxDB Grafana [8] is an open-source representational tool for • cAdvisor displaying results provided by a time-series database visually • vAdvisor through the use of graphs and charts. There are a range of • Prometheus different visualisations that can be used such as heatmaps to • Grafana histograms, and geomaps. There is native support for a lot of databases within Grafana, including Prometheus and InfluxDB. II. COLLECTD As well as allowing for collaboration whereby all users will Collectd [1] is a daemon (runs as a background process) be able to see the dashboards simultaneously. which will collect system and application metrics periodically IX. TECHNOLOGIES USED and allows for these metrics to be sent to an appropriate For data collection: datastore. As an application it is very minimal. It simply The technology that was chosen for collects data and does not display it visually in anyway. data collection was Telegraf. The main reason for this was that Telegraf came with a predefined configuration setup that allowed for many inputs and outputs to be easily added, as III. TELEGRAF well as the fact that when testing Telegraf, it was able to also Telegraf [2], like Collectd is a metric collection agent. It gain metrics that other collection applications such as Collectd can collect metrics from a wide array of predefined input and CAdvisor were not able to find. Additionally, work had configurations, such as docker, CPU, disk, etc. It will then already commenced on part of the monitoring systems using write these metrics to a pre-configured output source, i.e. Telegraf to get application metrics, therefore to keep inline a time-based database. Not only does it have 200+ plugins with the pre-existing technology stack to preserve uniformity, already written for it, therefore only requiring some tweaks to Telegraf was chosen. the configuration, it also allows a lot of flexibility in running For data storage: InfluxDB ended up being the data storage your own scripts to monitor statistics. application of choice. This decision was made for a number of reasons, as there are pros and cons for both Prometheus IV. INFLUXDB and InfluxDB . For instance, both of them are written in GO language, therefore meaning that they are very quick to InfluxDB [3] is an open-source time-series database used be executed, however, in terms of Prometheus for instance, for monitoring metrics. Influx stores all the data with a time- it is bound by a schema, whereas InfluxDB is schema-free. stamp in epoch which can then be used in conjunction with InfluxDB also supports more languages than Prometheus and programs such as Grafana to represent the data visually due additional access methods (JSON over UDP). As well as the to influx only being a datastore. fact that InfluxDB’s query language is very powerful and development friendly. V. CADVISOR For data visualisation: Grafana was used over the built- CAdvisor [4] is an open-source application that allows for in visualisations of Prometheus. The main reasons behind this analysis of resource usage and performance characteristics of decision was that Grafana has more flexibility as well as being running containers. Again, like with Telegraf and Collectd alot easier to use than Prometheus. Grafana contains many rich it is a daemon that runs and will collect information about features to allow the user to display statistics easily through a docker containers, for instance: CPU usage, DISK space, etc. pluggable panel architecture. These statistics can then be sent to a database and displayed X. FIRST TASK in Grafana. The first task set out by my supervisor was to use an existing virtual machine to gather metrics around the emailing service VI. VADVISOR Postfix. Metrics were initially to be collected through Collectd VAdvisor [5] is an open-source sister software application to and then sent to the InfluxDB server and displayed through cAdvisor that is used to get information from virtual machines Grafana. running on a server. This will produce the same type of metrics The first operation carried out was editing the configuration as in cAdvisor: CPU usage and DISK space. of Collectd in /etc/Collectd.conf. Within the configuration, the ¯ 5 write graphite plugin had to be enabled, the database team at in the team who may have additional information surrounding CERN were able to process the request to install the graphite the subject. Each sprint would last approximately 2 weeks, plugin on the InfluxDB server to allow for the influx database and seeing that I was present for 8 weeks, I was delegated the to process the data in the correct format. role of scrum master for 2 weeks. This consisted of reminding After discussing with the team about Collectd, it was everyone about scrum, and taking notes in case any one was decided that Telegraf was the better option. The change from off and for auditing purposes. Collectd to Telegraf required some additional effort. Firstly, XIV. ZURICH the Telegraf configuration had to be amended to allow for Postfix metrics to be collected. For the metric collection, Additionally, as part of my experience in CERN, I was a script was written in python in order to scan through able to go to Zurich with the OpenLab Students. the maillogs and match given metrics. Once the script was At the start of the day, we went to ETH Zurich and we complete it ran every 5 seconds and the results were pushed were given two talks. The first was about the new internet to the InfluxDB server which was connected to Grafana in infrastructure known as SCION (Scalability, Control, and order to present the results visually. Isolation on Next-Generation Networks) [6] that is being rolled out into Swiss Banks. This new internet infrastructure XI. SECOND TASK is able to reduce the downfalls and bottlenecks of the current Next, I was given the task of getting information for each internet infrastructure by adding failure isolation and route of the docker containers. Examples of the information that control. needed to be extracted included: CPU usage time, file storage Then we were given a talk on the future of drones, and what amounts, load time, etc. This was to ensure that in the event of is in store for operating them autonomously. We found out a docker container going down, there would be metrics to show that drones will be used more extensively in sporting events this allowing other users to be able to act accordingly in a short once they find out a way in order to allow for full tracking amount of time.