Masaryk University Faculty of Informatics

Implementation of Systems for Intrusion Detection and Log Management

Master’s Thesis

Severin Simko

Brno, Spring 2018


This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Severin Simko

Advisor: doc. RNDr. Tomáš Pitner, Ph.D.


Acknowledgements

I would like to thank my family for their continuous support.

Abstract

The increasing number of IT threats requires advanced and sophisticated IT security solutions that can keep organizations and companies safe. Log Management and Intrusion Detection Systems are among these solutions. The aim of this master's thesis was to examine the Log Management system Graylog and the Host-based Intrusion Detection System OSSEC. Both systems were analyzed, deployed and integrated into the IT infrastructure of the AXENTA a.s. company. Based on the Graylog throughput testing performed in this project, we determined whether Graylog meets the requirements for the business purposes of AXENTA a.s. This thesis describes both technologies, presents the details of the deployment and integration of both systems, and finally interprets the testing results.

Keywords

Graylog, OSSEC, log management, intrusion detection, security, log, log analysis ...


Contents

1 Introduction

2 Overview and Description of Technologies
  2.1 OSSEC
    2.1.1 What are IDS systems
    2.1.2 Understanding of OSSEC and its Key Features
    2.1.3 Client-Server Architecture
    2.1.4 OSSEC Alternatives
  2.2 Graylog
    2.2.1 Log Management
    2.2.2 Technology Description and Key Features
    2.2.3 Lifecycle of a Log
    2.2.4 Graylog Components
    2.2.5 Graylog Alternatives

3 Deployment
  3.1 Introduction to the Project
    3.1.1 Project Environment
  3.2 OSSEC
    3.2.1 Server Configuration
    3.2.2 Alerting, Notifications, and Reporting
  3.3 Graylog
    3.3.1 Architectures Deployed
    3.3.2 Server Configuration
  3.4 OSSEC and Graylog Integration

4 Testing and Results
  4.1 Graylog Throughput Testing
    4.1.1 Graylog Buffering
    4.1.2 Testing Explanation
    4.1.3 Results and Findings
  4.2 Summary of OSSEC

5 Conclusion

Bibliography


List of Figures

2.1 Example File Integrity Configuration
2.2 Archived Logs Example
2.3 The client and server OSSEC services
2.4 Example Lifecycle of a Log in Graylog
2.5 High-level overview of Graylog components
3.1 AXENTA a.s. Log Management Architecture
3.2 OSSEC syslog-ng folder monitoring
3.3 OSSEC syslog-ng rules
3.4 The configuration of OSSEC Alerts
3.5 High-level OSSEC Processing
3.6 Graylog Cluster Architecture
3.7 Not parsed logs from pfSense firewall
3.8 Number Of Logs Received shown on Histogram
3.9 List of Devices sending logs as a Quick Values Analysis
3.10 Source IP Addresses shown using the Geo-Location Plugin
3.11 Multi-tenancy REST-API command for revoking Roles
3.12 Graylog Cluster Architecture
4.1 Graylog Internal Processing - Buffering
4.2 Testing throughput script
4.3 Graylog Throughput Testing Results
4.4 Graylog Server Monitoring Showing the Processes and CPU Usage


1 Introduction

Every organization and company, regardless of its size, needs secure IT. Intrusion Detection and Log Management belong to the most important parts of IT security. Organizations use log analysis, which is a part of log management, to become aware of security events that can potentially affect the entire organization and to perform an in-depth investigation of those events. Intrusion Detection Systems are used to detect anomalies and unusual behavior by analyzing network traffic and logs and by monitoring remote servers.

Graylog is an open-source log management system, and OSSEC is an open-source Host-based Intrusion Detection System that provides multiple features, such as File Integrity Checking, Log Analysis or Rootkit Detection. Both systems are described in chapter 2, which provides a high-level overview of both systems and explains their features and configuration.

The main goal of this project was the analysis, successful deployment and testing of these systems in the AXENTA a.s. infrastructure. AXENTA a.s. is a Czech IT company that deals with IT security and provides advanced and sophisticated Log Management and IT security solutions. Based on the analysis and deployment, it was necessary to determine whether or not these systems are useful for the purposes of AXENTA a.s., and if so, for which use-cases. The main part of this thesis focuses on providing details of the deployment process in this infrastructure and summarizing the experiences and problems that were encountered during the deployment. The details about the deployment and integration of both systems are summarized in chapter 3.

One part of this project deals with the Graylog throughput testing that was performed on two different servers and in a total of five different use-cases. The testing consists of 40 separate tests to get the most accurate throughput results. The findings and results are summarized in chapter 4.


2 Overview and Description of Technologies

This chapter explains in detail both technologies used in the project. It describes their use, features and functionality, and it should provide the technical background required for a general understanding of both technologies and of the project itself. As mentioned above, the two main technologies used in the project were the open-source HIDS1 security tool called OSSEC and the open-source Log Management system called Graylog.

2.1 OSSEC

OSSEC is an open-source Host-based Intrusion Detection System (HIDS) that performs log analysis, file integrity checking, rootkit detection and real-time alerting. OSSEC provides a centralized, multi-platform architecture that allows managing the security of computers from one central place.

2.1.1 What are IDS systems

Host-based Intrusion Detection Systems (HIDS), together with Network-based Intrusion Detection Systems (NIDS), are subgroups of Intrusion Detection Systems (IDS). An IDS is a network security system that monitors network traffic to detect suspicious and potentially malicious activities. Such activities may indicate a system or network attack from someone who is trying to compromise the data or the whole system. IDS are used to monitor the entire network, a portion of a network, or an individual system.[1] IDS use sophisticated detection methods and raise alerts when such activities are taking place. Although both NIDS and HIDS are used for the security management of networks and computers, they work differently and are used for different purposes. A NIDS is installed at a strategic point in the network infrastructure and provides broader traffic examination than a HIDS.

1. Host-based intrusion detection system

The NIDS listens to all the packets going through this strategic point and monitors the whole network segment. That is why this network-based approach is not a perfect way to monitor a particular host: there can be an alternative path to this host, and in that case the intrusion will not be detected. On the other hand, a HIDS runs as a service or an agent installed on a certain network endpoint and monitors unusual activity only for this endpoint. A HIDS monitors settings on the server, such as critical system or configuration files or file checksums, and so protects file or registry integrity. A HIDS is often an after-the-fact tool because it monitors log files to find anomalies, whereas a NIDS works much closer to real time because it monitors the packets currently going through the network. Both systems have to be fine-tuned to eliminate false positive alerts.[2]

2.1.2 Understanding of OSSEC and its Key Features

OSSEC can check the integrity of system files, detect rootkits, and has a powerful log analysis engine capable of analyzing almost every type of log created on a system. The log analysis can be done for services such as Apache, Bind or LDAP, and also for 3rd party logs from devices such as Cisco. Apart from this, OSSEC contains an active response module that can respond to detected attacks or threats.

File Integrity Monitoring: Also called syscheck, this is a periodic validation of the integrity of system or application files by comparing the current file state with a known, stored value. It is a very important part of intrusion detection, and it often uses cryptographic functions to calculate checksums for detecting changes or modifications. OSSEC uses MD5/SHA1 checksums for monitoring crucial configuration files in a system. OSSEC supports two versions of validation: validation in a user-defined period, by default set to every 6 hours, or near-real-time validation. The near-real-time version is supported on Windows and on modern Linux distributions, such as Ubuntu or CentOS. The OSSEC agent scans the system in a given period and sends the checksums to the central server where the known values are stored. The syscheckd service on the central server then compares both checksums and determines whether the file was changed or not. The integrity monitoring offers a variety of configuration options. A user can define which directories or files will be monitored, which cryptographic function will be used, at which specific scan times the checks should be performed, and whether the detected modifications should be alerted or not. [3]


<syscheck>
  <directories realtime="yes">/etc</directories>
  <directories>/bin</directories>
</syscheck>

Figure 2.1: Example File Integrity Configuration

Figure 2.1 shows a configuration example for the file integrity monitoring.[4] It demonstrates the use of the real-time configuration parameter for the /etc folder and the use of the standard integrity check option for the /bin folder.

Real-time Log Analysis

The log analysis, also called log inspection or log monitoring, is an examination of a system event (log) in order to detect unusual behavior. OSSEC supports real-time log analysis, which means that an event is examined immediately after it is generated. In OSSEC, there are two different processes responsible for the log monitoring: logcollector and analysisd. The logcollector process is running on the client, and it is responsible for monitoring the system for newly generated events and for collecting them. The analysisd process is running on the master instance, and it is responsible for decoding, filtering and classifying events. The analysisd process parses logs into different fields according to pre-defined rules, and it tries to detect keywords that indicate unusual behavior. Such keywords are, e.g., malformed, denied, failed or invalid. OSSEC provides a default set of rules for different types of logs and different formats. The rules are written in XML, so it is straightforward and easy to create own rules or modify the default ones. Apart from the default Syslog format, OSSEC supports another 14 different log formats (tested version OSSEC 2.8.1), including Apache, eventlog, snort and multi-line logs. OSSEC provides two different methods for the analysis. The first one is, as mentioned above, the log file analysis, and the second one, called process monitoring, is used when there is information that we want to monitor but this information is not included in any log file. In this case, OSSEC can monitor the output of a specified command. The logs can be either stored or discarded right after the analysis.
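Log sources for the analysis are defined in ossec.conf. As a minimal illustrative sketch of the two analysis methods described above (log file analysis and process monitoring); the monitored file and command are only examples, not a complete configuration:

  <localfile>
    <log_format>syslog</log_format>
    <location>/var/log/messages</location>
  </localfile>

  <localfile>
    <log_format>command</log_format>
    <command>df -P</command>
  </localfile>

The second entry corresponds to the kind of command-output monitoring visible as the df -P record in the archived logs in Figure 2.2 below.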


The log archiving is done on the master server, and the archived logs are stored in the ".../archives/YEAR/MONTH/ossec-archive-DAY.log.gz" structure. OSSEC attaches a header to the archived logs in the "YEAR MONTH DAY HH:MM:SS (AGENT_NAME) IP_ADDRESS->/PATH/TO/LOG or PERFORMED_ACTION" format.

2018 Feb 20 15:11:33 (gitlab) 192.168.XX.XX->rootcheck
Process '31444' hidden from /proc. Possible kernel level rootkit.
2018 Feb 20 03:53:20 (c6test01) 192.168.XX.XX->syscheck
Integrity checksum changed for: '/etc/logrotate.conf'
Size changed from '689' to '696'
Old md5sum was: '4cb185b696b63daf9b414a8b8e703a48'
New md5sum is: '49c43fbd3aff975807667348c88cfa57'
Old sha1sum was: 'de349711cdbf55da2480f6ca90cab1afc5405026'
New sha1sum is: '8a4804d596d5c0b5dbf4ac2a8410c8c708c59b4a'
2018 Feb 20 23:54:56 (c6test01) 192.168.XX.XX->df -P
ossec: output: 'df -P': /dev/sda1 487652 102949 359103 23% /boot
2018 Feb 20 05:24:02 (c6test01) 192.168.XX.XX->/var/log/messages ...

Figure 2.2: Archived Logs Example

Figure 2.2 shows an example of archived logs on the master server. We can see four different types of performed actions in the following order: rootcheck, syscheck, process monitoring and file monitoring. In the syscheck log example, we can see the detected modification of a file. [5]

Rootkit Detection

A rootkit is a malicious application hidden in the system, designed to provide access to the system and take administrator/root control over it. There are two different types of rootkits. The first is the user-mode rootkit, operating on the user level of an operating system. This rootkit modifies important application files and so provides backdoor access to the system. The second one is the kernel-mode rootkit; this type is more sophisticated and can replace or add portions of the core operating system. This type can heavily affect the system and is often a cause of many system crashes. It is notoriously difficult to detect the presence of a rootkit on a system, but OSSEC provides sophisticated methods to detect and identify both types of rootkits. [6] OSSEC uses the rootkit detection engine, also called rootcheck, which consists of multiple parts.

∙ Checking the rootkit_files.txt file that contains a known set of rootkits and files that these rootkits are using. The rootcheck runs system calls, such as fopen() to open and check these files and detect the presence of a user-mode rootkit.

6 2. Overview and Description of Technologies

∙ Checking the rootkit_trojans.txt file, which contains the trojan signatures; this helps to detect binary code modifications. A rootkit can add or modify data bytes in the executable code. Every file modification changes the file hash, and so by comparing the hashes this kernel-mode rootkit can be detected.

∙ Looking for the presence of hidden processes using the getsid() and kill() system calls.

∙ Looking for the presence of hidden ports using the bind() system call.

∙ Looking for network interfaces that are in promiscuous mode without being reported as such. Promiscuous mode allows a network device to read and analyze all incoming packets, so an interface silently put into this mode can indicate that potentially malicious traffic sniffing is taking place.

It is important to mention that the rootcheck runs separately on each server where the OSSEC client is installed; a sketch of the corresponding ossec.conf section is shown below.
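As a rough illustration only, and assuming the default locations of the signature files shipped with OSSEC, the rootcheck section of ossec.conf typically looks similar to this sketch:

  <rootcheck>
    <rootkit_files>/var/ossec/etc/shared/rootkit_files.txt</rootkit_files>
    <rootkit_trojans>/var/ossec/etc/shared/rootkit_trojans.txt</rootkit_trojans>
  </rootcheck>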

OSSEC provides two types of configuration files. This is partly related to the Client-Server architecture supported by OSSEC and explained in section 2.1.3. OSSEC supports shared configuration files that are shared by every OSSEC client server, and local configuration files used for specific configurations on particular OSSEC client servers. The local configuration file has higher priority, and so this configuration is preferred over the shared one (a small example of a shared configuration follows the list below).

∙ The /var/ossec/etc/ossec.conf is a local configuration file used only for that particular server, and

∙ the /var/ossec/etc/shared/agent.conf is a shared configuration file among all OSSEC clients in the infrastructure and it’s applied on each of them.
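As an illustrative, not production, sketch, a shared agent.conf entry that applies one syscheck setting to all Linux agents might look like this (the values are examples only):

  <agent_config os="Linux">
    <syscheck>
      <frequency>21600</frequency>  <!-- check every 6 hours -->
      <directories check_all="yes">/etc</directories>
    </syscheck>
  </agent_config>

Entries in agent.conf can also be scoped to a particular agent name or profile, which is how per-group configurations are typically distributed from the manager.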

Alerting and Active Response

OSSEC comes with a pre-configured set of active response tools that can be triggered by a client or a server as a response when a condition for active response is met. These tools can, for example:

∙ add an IP address to the /etc/hosts.deny file,


∙ add an IP address to the iptables deny list, or

∙ add a rule blocking an IP address to the firewalld service (Linux).

It is possible to configure the conditions for each response and also to configure the commands, with specific parameters, that can be executed, as the following sketch illustrates.
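A minimal sketch of such a configuration in ossec.conf, assuming the host-deny.sh script that ships with OSSEC; the level and timeout values are only example choices:

  <command>
    <name>host-deny</name>
    <executable>host-deny.sh</executable>
    <expect>srcip</expect>
    <timeout_allowed>yes</timeout_allowed>
  </command>

  <active-response>
    <command>host-deny</command>
    <location>local</location>
    <level>10</level>      <!-- trigger on alerts of level 10 and higher -->
    <timeout>600</timeout> <!-- remove the block after 600 seconds -->
  </active-response>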

OSSEC also supports multiple options for alerting. When an incident is detected, the OSSEC server can generate an alert log that can be forwarded via syslog to another server. Apart from the alert logs, OSSEC supports sending alerts via e-mail; for this purpose, OSSEC uses the maild service built into OSSEC. Outputs to different databases or storing alerts in a file are also supported. Alerting, notifications, reporting and active response are described in detail in section 3.2.2.
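For e-mail alerting, the relevant part of ossec.conf is the global section; the addresses and the SMTP server below are placeholders, not values used in the project:

  <global>
    <email_notification>yes</email_notification>
    <email_to>soc@example.com</email_to>
    <smtp_server>smtp.example.com</smtp_server>
    <email_from>ossec@example.com</email_from>
  </global>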

Figure 2.3 shows an overview of the services mentioned above and provides a clear picture of how the individual services work and how they communicate with each other. There is an agentd daemon running on the client side that communicates with the server and, on the other side, the remoted daemon that communicates with the client. Apart from the agents responsible for communication, there is a service called monitord which is responsible for monitoring agent connectivity and compressing daily logs. The monitord service monitors each OSSEC client associated with that particular server.

2.1.3 Client-Server Architecture

This subsection explains the Client-Server Architecture and describes in detail how the communication between clients and the server works. It explains the way clients are added and managed and also the security aspects of these actions.

OSSEC uses the Client-Server Architecture, which means that there is a single central dedicated server on which the OSSEC server application is running. The central server, also called a manager, is monitoring and receiving information from agents.


Figure 2.3: The client and server OSSEC services

The integrity checking databases, as well as all the archived logs, are stored on this instance. This server also contains all the decoders, rules and analytic engines needed for the analysis and detection of problems. By default, the UDP port 1514 is dedicated to the communication with agents. On the other side, the OSSEC client is an application installed on the system that should be monitored. The agent collects information and forwards it to the manager for further analysis. Adding, managing and removing clients is straightforward and done from the central place. This approach makes it easy to manage an architecture with a lot of monitored servers.

The management of the clients is done by using the /var/ossec/bin/manage_agents script, which is located on the manager side. To be able to add an agent to the monitoring, it is required to have the OSSEC client installed on the monitored server and, by using the script mentioned above, to add a record containing a unique ID, IP address and server name. The script generates a unique client authentication key, which is then imported on the client server, and after restarting the

client application, the monitoring is set up. The client's agentd service initiates the connection to the server using the specified port (by default UDP port 1514) and expects a reply back. If the authentication keys are configured correctly, the server's remoted service replies back, and the secure communication channel is established. To remove a client from the monitoring, the unique record and the key are deleted from the server, and the monitoring is no longer active. [4]
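The two sides of this channel correspond to two small pieces of ossec.conf; the following is only a sketch, and the IP address is a placeholder:

  <!-- ossec.conf on the manager: listen for agents -->
  <remote>
    <connection>secure</connection>
    <port>1514</port>
    <protocol>udp</protocol>
  </remote>

  <!-- ossec.conf on an agent: where to send events -->
  <client>
    <server-ip>192.168.0.10</server-ip>
  </client>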

2.1.4 OSSEC Alternatives

There exist a couple of alternatives to OSSEC, and in this subsection we will focus on two of them, namely Bro and Tripwire.

Bro

Bro is an open-source network security monitoring tool written in C++ that provides sophisticated features for the analysis and detection of threats, malware, vulnerability exploits, brute force attacks, etc. Bro uses an Event Engine responsible for analyzing network traffic and generating events when some unusual activity is detected, and Policy Scripts that analyze the events created by the Event Engine and define policies for actions such as e-mail notifications, raising alerts or executing specific system commands. Bro is very flexible because it allows users to create their own rules that are used for monitoring, it works efficiently in networks with large amounts of traffic, and it provides in-depth analysis of traffic with multiple protocols supported. On the other hand, Bro has a large and complex architecture, so it is not easy to handle and configure, and it requires programming experience for its configuration.

Tripwire

Tripwire is a security and data integrity tool for monitoring and alerting on specific file changes on a range of systems. Tripwire is available as open source but also comes in an enterprise version. It is the only one of the mentioned alternatives that provides an enterprise, officially supported edition. This tool is only available on Linux systems; Windows is not supported. Features such as centralized control and reporting,


master-agent configuration and advanced automation are only available in the enterprise version. The open-source version is intended only for a small number of servers where centralized control is not crucial. The advantage of using Tripwire is that this system is recognized by many of the leading security, auditing, and compliance certification organizations. Tripwire has great integration with Linux and is excellent for small and decentralized Linux systems. The disadvantages are that it runs only on Linux servers and that it requires a Linux expert in order to configure the system correctly and effectively. In the open-source version, real-time alerts and reports are not supported.

2.2 Graylog

This section provides a brief introduction to log management and focuses on providing an overview of the Graylog log management system. It explains the key characteristics and features, describes the different parts of Graylog, namely Elasticsearch, MongoDB and the Graylog engine, and provides information about log processing and supported log sources.

2.2.1 Log Management

Log management is the process of collecting, analyzing, storing and dealing with computer-generated records, also called logs. Logs are an important aspect of any production system, and every component of such a system, including the operating system and applications, is capable of generating logs, enabling effective troubleshooting and providing the ability to track actions performed on the system by providing detailed information about events that occurred in the system. In an infrastructure with a lot of different servers it is essential to have a centralized log management solution that collects, analyzes and forwards or stores logs to enable extended analysis. Without a log management solution, the logs are stored locally on each server, which makes the process of investigating what happened in the infrastructure difficult and problematic. On the other hand, log management tools make it easy to find the root causes of faults or errors that occurred

in the system, and having all crucial data on one single dashboard makes the analysis much more efficient.

2.2.2 Technology Description and Key Features

As mentioned above, Graylog is a powerful open-source log management tool which analyzes the incoming logs, extracts important data from them, provides search functionality and visualizes the logs on the web interface. Graylog is written in Java and uses a few open-source technologies such as Elasticsearch and MongoDB. These two, together with the Graylog engine and the Graylog UI, form a competitive log management solution. This section describes each of them and explains how they work. Apart from the open-source version, Graylog is also offered in two other options with extended features and provided support. These two versions are called Enterprise and differ only in the provided support. User Audit Logs is one of these features; it records and stores actions performed by a user or administrator on a Graylog server. Tracking all user activities with an audit log increases efficiency and security in a reliable and provable way. Another feature, called Offline Log Archival, enables storing older data on an external storage; in case they are needed, Graylog can re-import them so they become available for search.[7]

2.2.3 Lifecycle of a Log

To be able to understand the way Graylog receives, processes and forwards or stores logs, we should describe each stage related to this process. A log is first received by the Graylog server and then processed by the Message Filter Chain, which is a message processor responsible for parsing, filtering and setting the static fields of a particular log. Logs are filtered according to pre-defined rules and routed into categories called Streams. For different Streams, we can define specific field-based rules. To each Stream, an Index Set is applied. An Index Set controls how messages are stored in Elasticsearch; it specifies, for example, the number of Elasticsearch shards or the rotation and retention policies. From a Stream, the logs are either forwarded to another system or stored locally on the Graylog server.


Log Collection

Graylog supports three different types of input data sources:

1. Standard protocols and formats. Syslog is the most commonly used protocol for sending event messages. The Syslog protocol can be used to log different types of events, and it is supported by a wide range of devices. Event messages can be generated in this format by using either rsyslog, which is the default log manager in common Linux distributions, or the syslog-ng tool. Graylog is also able to receive raw logs, in plaintext or in JSON format. Graylog has built-in support for the TCP and UDP transport protocols and for the Apache Kafka and RabbitMQ (AMQP2) transport queues for all of them. Collecting Graylog's internal log messages is not supported by default, but it is possible to use a 3rd party plugin that allows the collection of internal logs.

2. 3rd party collectors. Graylog supports a system called Graylog Collector Sidecar, which is a service/daemon for Windows and Linux systems used as a log collector. A log collector installed on a server forwards log files or Eventlogs to the Graylog server. Each collector contains a configuration in which, among other things, the Graylog server address and log format are defined. The collector uses NXLog, Filebeat or Winlogbeat agents in the background, and these agents are responsible for collecting server logs. The collector forwards the collected logs to a defined Graylog server IP address and port on which the Graylog server is listening.

3. GELF. Graylog has its own log format called the Graylog Extended Log Format (GELF), which is a JSON string that should be used especially for forwarding and processing application logs. GELF is supported in many programming languages and is capable of logging every exception raised by a particular application. GELF provides compression and a structure optimized for Graylog purposes.

2. Advanced Message Queuing Protocol


Figure 2.4: Example Lifecycle of a Log in Graylog

Processing

Processing of received logs is done in Graylog Streams. Streams are virtual groups of logs that allow categorization of logs according to specified rules. This means that it is possible to group logs according to different fields, such as the log severity level or the source IP address. Streams support two different types of rules: either a message has to match all specified rules (logical AND), or a message has to match at least one of the specified rules (logical OR). Incoming logs are first processed in the Message Filter Chain. The Message Filter Chain is a pipeline system responsible for parsing logs, setting static fields and assigning logs to appropriate streams. This system parses logs by a component called an Extractor, which extracts static fields from a log message. The structure of each log format is different; that is why there are different extractors that can be used for different formats. An Index Set is a configuration that controls how logs are stored on the Graylog server. It defines the rotation and retention policies and configures the Elasticsearch storage. It is possible to set different rotation strategies based on stream size, time or message count, and retention strategies that are used to clean up old logs to prevent the over-use of


disk space. Each Stream has its own Index Set, which means that different strategies can be used for different groups of logs.

Forwarding and Storage

Graylog can forward logs to other systems or store them locally on the server. Graylog supports forwarding logs to other systems, such as a SIEM3 or another Linux-like server, and the only supported format is GELF. Storing logs is essential for analytical purposes. It is important if we want to do a log analysis over different periods of time and compare the results, or if we want to use search queries that show and track changes over time. For such cases, the logs have to be available. Logs older than a given period that are not required to be available at any time should be archived. Archiving policies are configured for each Index Set. Graylog is only able to archive logs locally and does not support archiving logs on other external systems such as a database or a NAS4.

Figure 2.4 shows the different processes, including the log collection and processing, streams, indexes and forwarding, that the Graylog system provides. In the figure, we can see five log inputs, namely Syslog TCP, GELF UDP, Raw, CEF TCP and GELF Kafka, that are configured for receiving logs. The figure also shows examples of three Streams, namely the Security, Apps, and Network Streams, and two Index Sets, one configured for a long-term log rotation, retention and archiving strategy and the other one for a short-term strategy. We can also see the forwarding to a 3rd party system. This is an example of how the Graylog system can be configured.

2.2.4 Graylog Components

Figure 2.5 shows a high-level overview of the Graylog components. Graylog consists of four main components, namely the Graylog UI, the Graylog Server, MongoDB, and Elasticsearch. To be able to configure Graylog correctly, it is essential to understand how the different components work, what they are responsible for and how they cooperate.

3. Security Information and Event Management
4. Network-Attached Storage


Figure 2.5: High-level overview of Graylog components

Elasticsearch is a very powerful and highly scalable open-source full-text search engine. It can search, analyze and store large amounts of data, and it works as a near-real-time analytic engine. This means that there is a minor latency between the time when the data is indexed and the time when it becomes available for search. Elasticsearch stores indices in a sophisticated format optimized for full-text search. An index is a collection of data, in Elasticsearch called documents, with similar characteristics. Graylog uses a dedicated Elasticsearch cluster that can consist of multiple nodes. All the Elasticsearch nodes are defined in the main Graylog configuration file: /etc/graylog/server/server.conf. Graylog also supports automatic node discovery to get a list of available Elasticsearch nodes. The Elasticsearch cluster used by Graylog can consist of multiple nodes, where a node is an instance of Elasticsearch. A node can store either data or data replicas. The purpose of storing data replicas is failover: in case the primary node crashes, the node that is storing the replicas is promoted to the role of the primary node and no data is lost. New nodes are added to the Elasticsearch cluster to increase performance, which means that the performance of the Graylog server is highly affected by the efficiency of the Elasticsearch cluster. Elasticsearch is written in Java; that is why the heap size is another very important

parameter when it comes to performance. Heap memory is an area of memory reserved for data caching. For optimal Graylog performance, it is strongly recommended to use around 50% of the available system memory for Elasticsearch.

MongoDB is a NoSQL database that stores data in a flexible structure of JSON-like documents. Graylog uses MongoDB for storing configuration, metadata and web UI entities, such as users, rights, streams, indexes, configs, etc. MongoDB does not store log data, nor does it have to run on a dedicated server, because it does not have a big impact on the Graylog server.

The Graylog User Interface gives access to the web interface that visualizes data, provides search and analysis capabilities and works with the aggregated data. The Graylog UI fetches all the data via HTTP(s) from the powerful Graylog REST API. The API is used as the main communication channel between the UI and the Graylog server. The advantage is that with the data from the REST API it is possible to build one's own frontend according to one's needs.

The Graylog Server is the component responsible for receiving data from clients, and its main purpose is to integrate and communicate with the other components.

2.2.5 Graylog Alternatives

There exist many different log management tools which provide similar features to Graylog. While some tools are more flexible than others, the purposes for which they are used can differ. An organization or a company that wants to have a log management solution should have a clear understanding of the purpose for which the log management tool is going to be used. Some tools can only provide basic insight into the organization's logs, while others are powerful enterprise solutions capable of running a large-scale log management system. This subsection provides information about some of Graylog's alternatives, namely Splunk, the ELK Stack and the syslog-ng Store Box (SSB). Splunk and SSB are only available as commercial solutions, whereas the ELK Stack is a completely open-source project.

The Elasticsearch-Logstash-Kibana (ELK) Stack is a simple but robust log management platform that consists of three open-source projects:


Elasticsearch, Logstash, and Kibana. ELK has recently become massively popular and belongs among the world's most popular log management solutions. Elasticsearch is already described in section 2.2.4, and it is used for the same analytic and search purposes as in Graylog. Logstash is a server-side data processing pipeline that collects and parses data from multiple sources and sends them to Elasticsearch. Logstash is capable of receiving logs from Kafka queues, RabbitMQ, Beats such as Filebeat or Winlogbeat, or any other log shipper such as syslog or rsyslog. Kibana, on the other hand, is used as a visualization tool and provides charts and graphs of the data from Elasticsearch.[8]

Pros:

∙ For an open-source solution, it's easy to install and easy to use.

∙ Enables a lot of control and customization.

∙ All the components are powerful, heavily maintained and well documented.

Cons:

∙ Lack of advanced features, such as real-time notifications, anomaly detection, dynamic correlations or powerful dashboards.

∙ It's a Stack, so one is dealing with three products, which is more complex.

Splunk is a proprietary log management tool that offers on-premise or on-cloud setups and mostly focuses on enterprise customers. ELK and Splunk use two different approaches to achieve the same goal. Whereas ELK is mostly used by smaller organizations or companies, Splunk, with its large range of functionality, aims at customers where a deeper understanding of data is needed. The numerous functionalities of Splunk can be too expensive and out of scope, especially for smaller organizations. Apart from the common log management features, Splunk supports indexing logs of any type, whether structured, unstructured or sophisticated application logs.[9]


Pros:

∙ Built-in alerting and reporting.

∙ Software as a Service (SaaS) solution.

∙ Scales easily.

∙ Feature-rich solution.

Cons:

∙ Price.

∙ Complex set-up process.

The syslog-ng Store Box (SSB) is a high-performance log management appliance that provides a powerful web-based search interface, customizable reporting, strong encryption and clear user role separation. SSB supports forwarding data to 3rd party analysis systems, such as a SIEM. SSB is based on syslog-ng, which is one of the most widely used implementations of the syslog protocol. The Balabit company, which is the creator of SSB, also provides the syslog-ng Premium Edition, which offers extended functionality such as processing of multi-line messages, disk-based buffering, or a client that can be installed on a Windows server to collect and forward logs to the SSB appliance. [10] SSB uses Log Spaces for storing logs; they are similar to the Streams used by Graylog. Log Spaces allow storing encrypted logs and allow using different rotation and retention policies and encryption strategies.


Pros:

∙ Unlimited number of Log Spaces that can have different policies defined.

∙ Written in the C programming language instead of notoriously slow Java.

∙ High performance (the largest SSB appliance can collect and index up to 100k events per second).[11]

Cons:

∙ Only basic reporting - no scheduler, and the analytical GUI is outdated.

∙ No log parsers provided; a customer is responsible for writing his own parsers and patterns. SSB comes with a pre-defined internal indexing, but because of its limitations a customer has to create own patterns and parsers.

∙ No dashboards supported.

3 Deployment

This chapter focuses on providing details of the deployment of both the OSSEC and Graylog systems in the AXENTA a.s. environment. It describes the problems encountered in this process and provides details of the solutions that were needed to fix the issues. It also describes the different configurations and architectures that were deployed. The end of the chapter is dedicated to the integration of both technologies.

3.1 Introduction to the Project

The whole project was developed for the IT company AXENTA a.s., which deals with information security management processes, provides analytical services, and designs and realizes optimal solutions for protecting customers' information assets and critical processes. The main services provided by AXENTA a.s. include the design, implementation, and maintenance of Security Operations Centers (SOC). The SOC provides operational and security oversight which includes advanced Log Management and SIEM solutions. The whole idea behind the project was to investigate and test both the OSSEC and Graylog systems in detail and determine whether or not they are suitable for SOC purposes. AXENTA a.s. uses the syslog-ng Store Box, described in section 2.2.5, as its main Log Management solution. If Graylog meets the requirements, it could be used as a replacement for the syslog-ng Store Box, or as an alternative solution for customers with lower requirements for Log Management, e.g., whose infrastructure does not consist of too many servers, so that they do not need too powerful a system, or for whom the proprietary syslog-ng Store Box is too costly. Graylog could also be used as a solution for customers that only need the Log Management system and not the whole SOC. OSSEC, on the other hand, could be used as additional functionality for the SOC. That is the main reason why these systems should have been tried and tested.


Figure 3.1: AXENTA a.s. Log Management Architecture

3.1.1 Project Environment

Figure 3.1 shows the AXENTA a.s. Log Management architecture in which the project was deployed. There are two different environments used in the infrastructure: one is for production, and the other one is dedicated to testing servers. The production servers are used mostly for Log Management purposes, and the testing servers are used for testing new services, features, and configurations. There are different types of servers and devices, such as ticketing and monitoring systems, an active directory, web or relay servers, and network devices such


as firewalls and switches, ESXi1 or external storages, etc. used in the infrastructure. AXENTA a.s. uses centralized log collection, which means that there is a central server, also called a Relay Server (RS), which receives logs from all servers and devices in the infrastructure and forwards them to a log management tool, e.g., the syslog-ng Store Box or Graylog. The Relay Server is a Single Point of Failure in the process of log collection, which means that if this server fails, the whole log management will stop working. If this server is for some reason not able to receive or forward logs, then potentially important logs can get lost. There are defense mechanisms (buffers) for preventing logs from being lost in case the RS is not able to receive logs, but they are strictly limited by the storage available on a particular server. With no logs received by the log management tool, the whole Log Management solution can be in jeopardy. That is one of the reasons why the RS is in a High-Availability configuration. A highly available RS provides a failover solution in case one of the RS servers is unavailable. As we can see in figure 3.1, there are two RS servers deployed in the Master/Slave concept, which means that the server that is up and receiving and forwarding logs is the active master server, while the slave (failover) server remains passive. Once the master server becomes unavailable, the slave server takes over as the new master server, so no data is lost and the log collection continues. There is a virtual cluster IP address shown in the figure that is responsible for routing logs to the currently active server.

The syslog-ng protocol is used as the main log shipper between the servers in the infrastructure, and the log transfer between the servers or devices and the RS is encrypted for servers that support TLS encryption; for those that do not support the encryption, such as switches, firewalls and printers, a plain TCP/UDP transfer is used. The transfer between the RS and a log management tool is always encrypted. On Windows servers, there is an agent installed which is responsible for collecting Windows logs that are saved in the EventLog containers; the agent converts them to a syslog format and forwards them to the RS server.

1. VMware’s enterprise server virtualization platform

3.2 OSSEC

This section describes the deployment process, the configuration, the different configuration parameters that we tried, and their impact on the OSSEC system. It demonstrates the use of alerts and notifications and explains the custom-made rules for monitoring specific configuration files that are crucial for AXENTA a.s. In our project, we used a single server dedicated to OSSEC that served as the OSSEC Master server; this server was deployed in the testing environment. The server that we used had the following configuration:

∙ Virtual Machine deployed on the VMware ESXi

∙ CentOS 6 (64-bit) with 4x CORE

∙ 8 GB RAM

∙ 60 GB HDD

3.2.1 Server Configuration

In the whole project, we had a few goals that we wanted to reach with OSSEC:

∙ Configure and monitor the key servers in the AXENTA a.s. infrastructure with OSSEC in real time,

∙ create the OSSEC templates for both Windows and Linux OS,

∙ create custom rules for syslog-ng monitoring,

∙ configure alerts, notifications, and reporting,

∙ integrate OSSEC with Graylog, and

∙ document the process of installation and configuration to have notes available in case of a production deployment.

We installed the OSSEC agents on a total of eighteen servers in the infrastructure, including both Windows and Linux devices. The initial plan was to place the OSSEC Master server before the Graylog


server, so all the logs from the infrastructure would be sent first to the OSSEC Master server and then, after the analysis, forwarded to the Graylog server. The problem with this setup was that OSSEC is not able to forward any logs, except alert logs, to other systems. OSSEC is capable of storing logs on the server, but for forwarding we would need to use some 3rd party technology or a script that would read the logs from files and forward them to Graylog. That is the reason why we decided to forward only alerts via syslog to Graylog. The integration of OSSEC and Graylog is described in detail in section 3.4.

The syslog-ng is an important tool in the Log Management services provided by AXENTA a.s.; that is why it was crucial to create custom-made rules and to configure the monitoring of the syslog-ng configuration files to detect unwanted changes in the configuration that could have an impact on the whole Log Management. By default, the syslog-ng configuration files, including the main configuration file syslog-ng.conf, are stored in /opt/syslog-ng/etc/ on Linux and in C:\Program Files\syslog-ng Agent on Windows. It was necessary to monitor any changes in these folders.

Figure 3.2 shows the configuration of the monitoring of the syslog-ng folders for both Linux and Windows. We configured and tested the real-time monitoring of these folders. We used the shared configuration file that is described in section 2.1.2, which means that we only added these options to the shared file and the configuration was available on each OSSEC agent server. The centralized management and maintenance is one of the advantages of OSSEC.

# Configuration for Linux
/opt/syslog-ng/etc

# Configuration for Windows
C:\Program Files\syslog-ng Agent

Figure 3.2: OSSEC syslog-ng folder monitoring

One part of the syslog-ng monitoring was the folder monitoring; the other part was the monitoring of logs and the detection of particular words, specific to syslog-ng, that indicate incorrect behavior. Figure 3.3 shows a few of the rules in the syslog-ng-rule.xml file that we created and tested; a sketch of how one such rule can be expressed in OSSEC's rule syntax follows the figure. The rules are stored on the OSSEC Master server, by default in the /var/ossec/rules folder.


Match pattern                    Description
^syslog-ng shutting down         Syslog-ng service down
^Invalid frame header            IETF syslog protocol
^Syslog connection closed        Syslog connection closed
^Destination queue full          Destination queue full
^dropping messages               Destination queue full
^unable to load certificate      Certificate SSL error

Figure 3.3: OSSEC syslog-ng rules
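The rules themselves are written in OSSEC's XML rule syntax. As a rough sketch (the rule ID below is only a placeholder in the custom-rule range, and level 7 is the severity we used for these rules), the first rule from Figure 3.3 could be expressed roughly as follows:

  <group name="syslog-ng-rules,">
    <rule id="100100" level="7">
      <match>^syslog-ng shutting down</match>
      <description>Syslog-ng service down</description>
    </rule>
  </group>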

3.2.2 Alerting, Notifications, and Reporting

As already mentioned in section 2.1.2, OSSEC supports multiple ways of alerting, notification and reporting, and in this subsection we will focus on providing more details about this topic and about the configuration that we used in our project. OSSEC supports these three types of alerts:

1. Log Alerts. A log is generated for each alert and is by default stored in the /var/ossec/logs/alerts folder; the logs are rotated every day, and such logs can be used for auditing purposes.

2. Email Alerts. OSSEC sends an alert to a specified email address.

3. Syslog Alerts. OSSEC supports sending alerts to one or more syslog servers on a specified port and in a specified format. The supported formats are CEF2, JSON, Splunk, and syslog. OSSEC only supports TCP transfer, and TLS transfer is not available (a configuration sketch is shown below).

2. Common Event Format by Arcsight
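A minimal sketch of such a syslog output in ossec.conf might look as follows; the destination address and port are placeholders:

  <syslog_output>
    <server>192.168.0.20</server>
    <port>514</port>
    <format>cef</format>
  </syslog_output>

A <level> element can additionally restrict which alerts are forwarded, and the syslog forwarder (ossec-csyslogd) typically has to be enabled separately, e.g. via ossec-control enable client-syslog.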


OSSEC allows the users to specify the minimum severity level that triggers an alert. OSSEC uses its own severity levels from 0 to 15, where level 0 is assigned to logs with no security relevance and level 15 to severe attacks with no possibility of a false positive. Figure 3.4 shows how the minimum severity levels were configured in our project.

<alerts>
  <log_alert_level>2</log_alert_level>
  <email_alert_level>7</email_alert_level>
</alerts>

Figure 3.4: The configuration of OSSEC Alerts

The logs with a severity level of 2 ("system low priority notification") and higher are stored in the folder mentioned above. Only the logs with a severity level of 7, the so-called "bad word matching" (meaning that the log contains words like "bad", "error", "warning", etc.), and higher are sent by e-mail. The rules for syslog-ng monitoring that we created have level 7, so if they are triggered, an e-mail notification is sent.

Reporting

Reporting in OSSEC is a summary of the alerts for the day sent by e-mail. It allows configuring different notifications for different groups of alerts. In figure 3.3, we can see that our custom rule group is called syslog-ng-rules; according to this identifier, we can set up the report. Reporting also allows including the logs related to the alerts in the e-mail. The only drawback of reporting in OSSEC is that it is not possible to set a notification period other than daily.
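As a hedged sketch of how such a report can be defined in ossec.conf (the recipient address is a placeholder, and the group name refers to the custom rule group mentioned above):

  <reports>
    <group>syslog-ng-rules,</group>
    <title>Daily report: syslog-ng alerts</title>
    <email_to>soc@example.com</email_to>
    <showlogs>yes</showlogs>  <!-- include the related logs in the e-mail -->
  </reports>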

Log Analysis

Figure 3.5 shows a high-level overview of the OSSEC log processing. It only shows four out of the eighteen servers mentioned above, and it should demonstrate how logs are received and processed by the OSSEC Master server. The log processing consists of multiple parts, such as decoding, analyzing and alerting, and this example aims at the custom-made rules that we created and used in the project. Decoding is used to capture certain information from the logs.


Figure 3.5: High-level OSSEC Processing

Decoding is the very first operation that happens after the logs are received by the master server. The process of decoding is the parsing of the different parts of a log so that they can be processed and analyzed. OSSEC comes by default with hundreds of decoders for different systems, applications and log formats, such as snare, Apache, Windows event logs or VMware and Cisco logs, etc. In our project, we were using the syslog-ng format, and so it was not necessary to create our own decoders, because OSSEC provides a syslog decoder responsible for parsing and decoding logs in the syslog format. The decoders are by default stored in the /var/ossec/etc/decoder.xml file, and in case a custom-made decoder for custom systems or applications is needed, the decoder is added to this XML file. For custom-made decoders, a regular expression pattern and the fields in which the parsed information should be stored have to be defined. OSSEC uses this decoder file and matches the logs according to the specified format. In the OSSEC configuration file, it is required to specify the log format for each particular input; according to that, OSSEC knows which decoder to use. After decoding, OSSEC analyzes the logs using the rules that are by default stored in the /var/ossec/rules folder. In case some unusual activity is detected, an alert is generated and stored in the alerts.log file.
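For completeness, a custom decoder for a hypothetical application (not one used in the project) could be sketched roughly as follows; the field names in the order element are the fields into which the captured groups are stored:

  <decoder name="example-app">
    <program_name>example-app</program_name>
  </decoder>

  <decoder name="example-app-login-failed">
    <parent>example-app</parent>
    <regex>^login failed for user (\S+) from (\S+)</regex>
    <order>user, srcip</order>
  </decoder>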

3.3 Graylog

This section presents the two different architectures that were deployed and provides details of the configuration of the whole Graylog system: rotation and retention policies, multi-tenancy, encrypted log collection, dashboards, etc.

3.3.1 Architectures Deployed

To review Graylog and the full functionality that it provides, it was essential to deploy the different types of setups in which Graylog can be deployed. We deployed a one-server setup that contained all the Graylog components, and we also deployed a High-Availability setup. The need for having servers in High Availability is described in section 3.1.1. If a customer has an increased demand for a reliable and powerful Graylog server designed to provide continuous uptime, it is crucial to be able to deploy the High-Availability setup. Graylog is not originally designed to be deployed as a High-Availability setup, but with the Red Hat Cluster services, such as GlusterFS or Ricci, it was possible to deploy Graylog in a High-Availability setup.

Production setup

The first setup that we deployed was the one-server production setup, which means that there was only a single central server containing all the Graylog components. The server used for this setup had the following configuration:

∙ Virtual Machine deployed on the VMware ESXi

∙ RHEL/CentOS 6 (64-bit) with 4x CORE

∙ 8 GB RAM

∙ 100 GB HDD

This setup is well documented, easy to deploy and quick to configure. In this setup, none of the components is redundant. The main disadvantage of this setup is the availability: the Graylog server is a single point of failure. That is why this setup can only be used in case

the 24/7 availability is not required. The details about how this setup was configured are given in section 3.3.2.

High-Availability setup

A cluster is a group of servers that act as a single server and provide high availability. The Graylog Cluster that we deployed consisted of two servers and one virtual cluster IP address, as follows:

∙ graylogweb.axenta.local - Virtual Cluster IP Address

∙ graylogha-server1.axenta.local - Graylog Server + Graylog UI + MongoDB + ElasticSearch + Luci

∙ graylogha-server2.axenta.local - Graylog Server + Graylog UI + MongoDB + ElasticSearch

We used two fully functional Graylog servers that shared the same configuration. The servers had the same configuration as described in section 4.1.2, and a Virtual Cluster IP Address through which the Graylog UI was accessible; all the logs from our infrastructure were routed to this address and then forwarded to the Master server, which was a fully functional Graylog server.

We used the Red Hat Cluster capabilities to build the Graylog Cluster. The Red Hat Cluster consists of multiple services, each responsible for something else. The Luci service provides the web-based graphical cluster management interface. The Ricci service propagates updated cluster information to the cluster nodes (servers), and GlusterFS, or the Gluster File System, provides shared storage accessible from all cluster nodes. The Red Hat Cluster offers many more services, but this section focuses on the services and features used in the project and provides a high-level understanding of the Red Hat Cluster rather than explaining the whole technology in detail.

The Luci service, or Luci server, provides a management interface for managing and maintaining the cluster. Luci is always installed on one of the cluster nodes; in our project, Luci was installed on the graylogha-server1.axenta.local server, and by default the web interface was accessible at the https://graylogweb.axenta.local:8084 address. Through this web interface, it is possible to create a cluster, add


or remove cluster nodes, manage the storage, and configure the services that should be periodically checked to prove node availability. The services critical for the Graylog functionality that we checked periodically were:

∙ /etc/init.d/graylog-server

∙ /etc/init.d/elasticsearch

∙ /etc/init.d/mongod

The cluster was configured in such a way that if one of these services is not in the running state, then the cluster is relocated to the other (previously slave) server.

Ricci creates and distributes the cluster configuration files on the nodes of the cluster. This service must be installed and running on each cluster node. The Ricci user used for the distribution is associated with the Ricci service, and its password is set right after the service installation. Once the Ricci service and the user are configured, the credentials are added through the Luci interface. The Ricci user communicates by default on the TCP port 11111, and this port must be open on each of the cluster nodes.

GlusterFS is a distributed file system that provides a shared space visible and accessible to each server node. This shared space was used for storing and sharing the MongoDB and Elasticsearch folders. We created a shared disk partition under /mnt/sharedfs/, and the Elasticsearch and MongoDB folders containing logs, the journal file, indexes, and configurations were physically stored on this partition. The journal file is described in detail in section 4.1.3. Symbolic links from the original paths were created referencing the shared partition. Figure 3.6 shows how the symbolic links and the shared partition were configured. In the same figure, we can also see how the incoming logs from the RS are routed based on the current cluster state.

The one-server setup was easy to deploy and straightforward to configure; the High-Availability setup, on the other hand, requires more resources and deeper knowledge about how the Red Hat system works, and it is a more advanced solution.


Figure 3.6: Graylog Cluster Architecture

3.3.2 Server Configuration

This subsection focuses on providing details about how the Graylog servers were configured. The production setup and the High-Availability setup both had the same configuration. In the project, we set a few requirements that we wanted to meet; they include both functional and non-functional requirements:

∙ Deploy the High-Availability setup to provide the fail-over solu- tion,

∙ examine the possible log sources and configure the TLS en- crypted log transfer,

∙ examine in detail the rotation and retention policies and archiv- ing and multi-tenancy as well,

∙ implement the custom patternDB for log parsing,

∙ analyze the analytical capabilities such as searching, filtering, and dashboards, and


∙ document the installation and configuration.

Log Collection

We configured two different types of inputs to receive logs from the Relay Server. The first was a plain syslog TCP transfer and the other one was a TLS-encrypted syslog transfer. The encrypted transfer is used in production, which is why it was more important to configure this type of input correctly. The Graylog UI allows you to define the TLS certificate and key paths for a particular input, and so enable the encrypted transfer. The certificates and keys were generated using the OpenSSL toolkit and signed by the AXENTA Certification Authority (an example OpenSSL command is shown at the end of this subsection). The concept of Streams is already described in section 2.2.3. For analysis purposes and the categorization of logs, we used a total of seven different Streams:

∙ AXENTA APP: containing application logs from applications such as Centreon, Bing, databases, Apache, virtualization system or Gitlab.

∙ AXENTA SECURITY: containing security logs from services and systems such as OSSEC, Flowmon, Nessus or syslog-ng Store Box.

∙ AXENTA OS: containing logs from operating systems.

∙ AXENTA NETWORK: containing network logs from firewall or Cisco network devices.

∙ AXENTA DEBUG: containing debug logs from testing servers

∙ Internal Logs: containing internal Graylog logs

∙ All Messages

The logs received from the Relay server are already tagged, meaning that on the RS server a tag describing the class, customer, and device is added to the logs. Such an approach is used to make it clear where the logs come from. We defined the rules according to which logs were routed to a particular Stream based on the class field.
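Returning to the TLS-encrypted input mentioned at the beginning of this subsection, the key and the certificate signing request can be generated roughly as follows; the file names and the certificate subject are only placeholders, and the resulting CSR is then signed by the AXENTA Certification Authority:

openssl genrsa -out graylog-input.key 2048
openssl req -new -key graylog-input.key -out graylog-input.csr \
    -subj "/C=CZ/O=AXENTA/CN=graylogweb.axenta.local"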


As mentioned earlier, using Index Sets it's possible to set different rotation, retention and archiving policies for different types of logs. We had logs organized in seven different Streams, but we used one common set of policies for all of them. We set the log rotation policy to one day, which means that each day at midnight the logs from the previous day were rotated to other files. These files are called indices. After the rotation, Graylog creates the indices with a numeric postfix, for example: axenta_network_set_0, axenta_network_set_1, axenta_network_set_2. This example shows three days of logs for the AXENTA NETWORK Stream rotated every day. The disadvantage of this approach is that it's not clear at first sight which day the logs are from. The OSSEC log structure described in 2.1.2 is much more straightforward and clear.

In the open-source version, there are a total of three retention strategies available, and the licensed version, which is the version that was tested in the project, offers one more. These four retention strategies are:

∙ Close indices in ElasticSearch to reduce the resource consumption.

∙ Delete the indices in ElasticSearch to minimize the resource consumption.

∙ Do nothing, which can have a negative impact on the system's free storage. If the log storage is not controlled and monitored, the system can run out of space.

∙ Archive indices, the commercial, licensed feature.

The log archiving is crucial, and that's the reason why we were mostly focusing on this type of policy. Even though it is a paid feature, Graylog (version 2.3.1) doesn't support archiving logs on external storage, such as NAS. NAS, or Network-Attached Storage, is a dedicated file storage device that provides file-based storage shared within the LAN 3. In the AXENTA's SOC, NAS is often used for archiving logs, but archiving Graylog logs there would require another custom-made solution

3. Local Area Network


such as a script that would archive the logs on the NAS. Graylog's archiving feature only allows archiving logs on the local file system, and it provides different compression types, such as gzip and LZ4, or leaving the archived logs with no compression.

Processing

Graylog offers a feature called Extractor; it allows users to create their own log patterns for parsing logs and assigning field names to particular field values. As an example, pfSense firewall logs are not parsed by default, which is why a custom-made log pattern is needed. Graylog supports multiple ways in which the patterns can be imported to the Extractors. The patterns can be created using regular expressions, the Grok language, which is a set of regular expressions that can be combined into more complex patterns, JSON, or the Split&Index approach. All the patterns can be created through the web interface, but a pattern can also be imported as a JSON file. Graylog allows the use of different delimiters, which are the characters that separate fields, and according to these delimiters the log fields are parsed. We used and tested the Split&Index option with the comma delimiter. In Split&Index you have to specify the position of the field in the log and then assign a particular field name.

361,,,1463297446,em5,match,block,in,4,0x0,,63,48370,0,DF,1,icmp,84,192.168.61.67,192.168.91.3,request,9050,514

Figure 3.7: Not parsed logs from pfSense firewall

Figure 3.7 shows an example of a pfSense firewall log message that needs a parser. From the pfSense documentation [12], where the log format and the log fields are defined, we can see that:

∙ "block" is a field that describes the action taken; we define this field as ax_network_action,

∙ "192.168.61.67" specifies the source IP address; we define this field as ax_network_srcIp,

∙ "192.168.91.3" specifies the destination IP address; we define this field as ax_network_dstIP,


∙ and "9050" and "514" are the source port and destination port fields that were defined as ax_network_srcPort and ax_network_dstPort.

This shows just a few of the field examples that we used in the project. For effective analysis and investigation, it's crucial to have the logs parsed, because filtering, searching and different aggregations can be done using the field names. With correctly parsed logs, it's possible to perform advanced searches. For example, it's possible to find firewall logs where the traffic comes from a particular source IP address and is received by a particular destination port. Without the field names, we would not be able to perform such fine-grained analysis. The Extractors are applied right after the logs are received by the Graylog server. Apart from the Extractors, Graylog supports creating custom log fields that combine different fields into one consisting of all of them. From the example above, we can create a combined field, such as "ax_network_dstIP:ax_network_dstPort", which finally gives us "192.168.91.3:514". In the project, we didn't use any of them, but this is how it can be done if needed.
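As a rough, hypothetical illustration of the Split&Index idea (outside of Graylog itself), the comma-delimited pfSense message from figure 3.7 can be split by position and mapped to the field names above, for example in a shell:

log='361,,,1463297446,em5,match,block,in,4,0x0,,63,48370,0,DF,1,icmp,84,192.168.61.67,192.168.91.3,request,9050,514'
IFS=',' read -ra field <<< "$log"

echo "ax_network_action=${field[6]}"    # block
echo "ax_network_srcIp=${field[18]}"    # 192.168.61.67
echo "ax_network_dstIP=${field[19]}"    # 192.168.91.3
echo "ax_network_srcPort=${field[21]}"  # 9050
echo "ax_network_dstPort=${field[22]}"  # 514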

Analysis

Analysis is one of the most important processes in any Log Management solution. One of the goals of having a Log Management solution is to make the process of analysis easier and more efficient. Graylog uses ElasticSearch as its main search engine, which provides not only searching but also analytical capabilities. device:Linux is an example of a search query used to find all Linux logs, or all logs containing the value Linux in the device field. Graylog uses the logical operators AND, OR and NOT, where all of them are case sensitive and have to be written in capital letters. ax_network_srcIP:192.168.51.1 AND ax_network_srcPort:555 AND NOT ax_network_dstIP:192.168.52.1 demonstrates the use of a search query with logical operators. When searching for particular logs, it's important to specify the correct period of time in which we want to find the logs. There are three ways to specify the time period in Graylog:


1. Relative: Graylog comes with a pre-defined set of relative time ranges such as Last 5 Minutes, Last 3 Days or Last 2 Hours that can be used by default to specify the time period. Apart from the pre-defined ranges, Graylog allows users to create their own relative time ranges. Graylog uses the ISO 8601 durations in the P[n]Y[n]M[n]DT[n]H[n]M[n]S format and all the custom-made periods have to follow this format. For example, PT15M stands for last 15 minutes, P5D stands for last five days, and PT8H stands for last 8 hours.

2. Absolute, where the absolute times and dates on the from-to basis are set manually or from the calendar.

3. Keyword: Graylog allows specifying the time frame in natural language. Before the keyword is used, Graylog shows a preview of the dates and times that are going to be used. Graylog uses a parser called Natty for natural language parsing. Examples of keywords are: last hour, 7 days ago, 27th of April to 2nd of June 2017 or yesterday midnight to today midnight. The only problem with the keywords is that there is no list of keywords that can be used; for example, there were a few reported issues related to parsing the keyword today midnight, so without a full list of keywords, it's not entirely clear which keywords can be used and which cannot.

Graylog also supports the Saved Searches functionality that allows users to save the search queries and time periods for further use, for example when the same search is performed repeatedly. The user has to specify the name of the saved search according to which this search is later found.

Graylog provides a search result histogram, which is a graph of statistical information that shows the number of logs received, grouped by a certain period of time. Figure 3.8 shows an example of a histogram that represents the number of received logs shown per hour. We can see, for example, that at 7 AM on May 4, 2018, we received 4152 logs from our infrastructure. The histogram, together with several other tools, helps to analyze the search results. Graylog allows users to choose the fields that will be shown on the Web Interface to make the analysis clearer and easier to do. There are three analytic tools provided by Graylog:

Figure 3.8: Number Of Logs Received shown on Histogram

1. Field Statistics. This tool is only useful for numerical fields and provides statistical information, such as total, mean, minimum, maximum, standard deviation, variance, sum, and cardinality.

2. Quick Values. This tool was the most commonly used in our project. It shows the value distribution for a field. It also shows the total number of times each value appeared in the result, and it contains a pie chart as well. An example of this type of analysis is shown in figure 3.9.

3. Field Graphs, which is a histogram for numerical fields that can be plotted for a particular field.

Figure 3.9: List of Devices sending logs as a Quick Values Analysis


Graylog supports powerful dashboards that provide pre-defined views of the important data. A user can use, for example, the Search Result Counts widget, which shows the total number of search results for a particular search. In the project, we used this type for showing how many HTTPS requests on port 443 we had on different servers in the infrastructure in the past 24 hours. For example, we used the device:Centreon AND ax_network_dstPort:443 search and used the result on the dashboard. This showed us how many HTTPS requests we had on the Centreon server. The other types of widgets that can be used on a dashboard include search result histogram charts, statistical values, stacked charts or quick values results. In the project, we used a combination of them to build the dashboards that could be used for potential customers.

Graylog officially offers a plugin called Geo-Location Processor that visualizes the geolocation information by scanning logs for fields containing an IP address. The basic requirement for this is the configuration of the geolocation database; Graylog supports MaxMind databases, from whose webpage it's possible to download the database and upload it to the Graylog server. When Graylog detects an IP address, it creates new fields in which the geolocation information for the plugin is stored. In the case of our project, we had two fields containing an IP address, ax_network_srcIP and ax_network_dstIP. The plugin creates three new fields using the following suffixes: _geolocation, _country_name and _city_name. So if the original message contained the ax_network_srcIP field, then in Graylog we had these fields: ax_network_srcIP, ax_network_srcIP_geolocation, ax_network_srcIP_country_name and ax_network_srcIP_city_name. The plugin shows the world map with the geolocation information. We used this plugin to display the source and destination IP addresses of the devices that are trying to connect to our infrastructure and to see where the servers from our infrastructure are trying to connect. Figure 3.10 shows an example of the geolocation plugin for the log source IP addresses.

Figure 3.10: Source IP Addresses shown using the Geo-Location Plugin

Alerting, Notifications and Reporting

An alert is a notification that delivers personalized and actionable content to specified recipients or subscribers. The alerts in Graylog are always based on Streams. For each Stream, we can define conditions that trigger alerts. Graylog triggers the alerts when the specified conditions are satisfied. Graylog supports three different types of conditions that we can define:

∙ Field Content Alert, where we define a specific value for a field and the incoming logs are checked whether they contain the defined value.

∙ Field Aggregation Alert, where we specify the aggregated val- ues such as sum, max or min for a particular field in a given period of time. For example, trigger an alert if in last five minutes there are X messages containing a specific value.

∙ Message Count Alert, where we specify the total number of messages for a period of time. For example, if there are more than a hundred messages each minute, it is a deviation from the standard behavior and the alert should be triggered.

On the Graylog Web Interface, it's possible to check the reason why an alert was triggered and also the alert timeline that shows the events that took place during the alert. It's important to mention that Graylog always triggers only a single alert for a particular condition during the alerting interval. It means that even if there are multiple messages that contain a specific value, the alert is raised only once. Apart from the alerts shown on the Web Interface, Graylog supports two types of notifications:


∙ HTTP Alert Notification, which means that in case an alert is triggered, Graylog contacts the specified endpoint and sends an HTTP request to the notification URL address containing the details about the alert.

∙ Email Alert Notification, which is used to send an email to the specified receivers. Graylog allows us to specify the E-Mail Subject, Sender, E-Mail Body and User or E-Mail Receivers.

In the project, we created and tested example alert conditions that were able to detect DOS4 attacks against the web server and to detect password cracking attempts. For the first case, the alert is triggered if there are more than a hundred firewall logs per minute containing axenta_network_dstPort=443 and axenta_network_dstIP=web_server_ip. If an attacker tries to request the encrypted web server running on port 443 more than a hundred times per minute, the alert is triggered and the notification is sent to the specified email address, which was [email protected], used in the whole infrastructure as a notification address. The notification for a detected password cracking attempt is triggered when more than fifty logs in one minute contain the "login failed" string in the message field. These are just examples of possible alerts that can be detected; the alerts and their conditions can be defined according to needs.

However, while the built-in alerting is useful for a lot of different scenarios, more advanced and more complicated scenarios cannot be created. Let's consider this example: trigger an alert if the same source IP address tried to connect to the Graylog server on port 443 more than fifty times in the past minute. This example is slightly different from the one mentioned above with the DOS, because it requires an aggregate search that groups the source IP addresses and returns the number of requested connections per source IP address. Such a scenario can be very useful for security purposes, but such conditions cannot be created using only the built-in capabilities. That's the reason why we tried the external plugin called Aggregates. This plugin provides capabilities that allow us to create more advanced and aggregated conditions. This plugin uses the built-in notifications when the alert is triggered.

4. Denial of Service


By default, Graylog doesn't support any type of reporting. It requires external plugins that would generate the reports. The problem is that, in fact, all the reporting plugins available for the version of Graylog that we have tested were just a combination of aggregate alerts that come with Graylog by default and a crontab script that is responsible for sending the reports in a specified time period. It means that there is no unified solution for reporting in Graylog.

The external plugins extend the basic Graylog functionality and are available free of charge on the Graylog Market webpage5. Graylog supports custom-made plugins, which gives the Graylog users control and the opportunity to create their own plugins based on their needs. The plugins have to be developed in the Java programming language. The main problem related to the external plugins that we identified in the project was that the plugins are not officially supported. The plugins offered on the Graylog Market have to meet Graylog's technical requirements [13], but they are developed by external entities and not by Graylog developers. There are often compatibility problems with newer Graylog releases. At the beginning of the project, we started with Graylog version 2.2.3 and at the end we were working with version 2.3.1; the problem is that the plugins that we were using at the beginning didn't work on the newer version, so the plugins that we were using on a daily basis, such as aggregate alerts, internal logs, reporting, etc., were no longer available. The external plugins highly depend on the entities that developed them. This is a problem especially in the production environment, because the customers require a particular level of functionality that has to be provided even after the upgrade to the newest version. If the entity that developed the plugin is no longer working on compatibility and bug fixes, then the plugin is not useful anymore. The more external plugins we used, the more difficult it was to upgrade Graylog.

Authentication, LDAP, Permissions and Multi-tenancy

Authentication is the process of recognizing a user's identity, and Graylog supports both internal and external authentication. Internal authentication means that the users are created through the Graylog

5. https://marketplace.graylog.org

web interface, and their credentials are stored on the Graylog server. External authentication means that the server connects to an Active Directory (AD) or LDAP6 server using a service account and validates the user's identity through the network. The SOC uses an Active Directory server for the authentication of the whole infrastructure. In the project, we used the encrypted LDAP authentication, which by default runs on port 636, using the service account called graylog_ldap. Graylog also supports authentication by passwords, Sessions, API Tokens, or Single Sign-On authentication.

Graylog doesn't support user groups, which means that it's not possible, for example, to create a group of users with particular permissions. In Graylog the ACL, or Access Control List, which is a list of permissions for a particular object (user), has to be configured on a per-user basis. This approach is not very handy when it comes to a large number of users, because in case some permission needs to be changed, it is required to change each user. Graylog supports Roles, which are sets of permissions assigned to a user. The Roles specify Streams and Dashboards permissions and also the read/write operations related to them. This means that for a particular Role we can configure the access to specific Streams, and so users with different Roles can access different Streams. One user can have one or more Roles, and the permissions are then combined. Graylog allows us to pre-define the group mapping. By group, in this case, we mean the group of users created on the Active Directory/LDAP server. It means that for users in a particular AD/LDAP group it's possible to automatically set the Role(s). Graylog comes with two built-in Roles that cannot be changed. The Admin Role grants all permissions and should be assigned only to the Graylog administrators. The Reader Role grants only basic permissions needed for all Graylog users. Each user needs to have either the Admin or the Reader Role in combination with more specific Roles that can be created through the Graylog web interface. This approach appeared to be problematic when we were trying out the multi-tenancy capabilities of Graylog.

A SOC usually has more than one customer, and the multi-tenancy capabilities are crucial to let different customers have access to different resources without knowing about each other. As an example, if we have

6. Lightweight Directory Access Protocol

a customer A and a customer B, we want them both to use the same Graylog application, where customer A has access to the logs in Stream A and customer B has access to the logs in Stream B, and they don't know that they both are using the same Graylog application. As mentioned above, in the tested version 2.3.1, Graylog requires each user to have either the Admin or the Reader Role. The problem is that Roles only specify the resource permissions and not the parts of the Graylog application that are visible for a particular Role. So even if both customer A and customer B have Reader Roles in combination with Roles that allow them to have access to logs in their Streams, they can still read the Input, Output, Stream, User, etc. configurations. So even if customer A doesn't have access to the logs in Stream B, this customer can see that there is a Stream B for another customer, and vice versa. The multi-tenancy is important for business purposes, and after intense testing, we found a workaround that allowed us to revoke the Reader Role from a customer's users and leave only the Role that specifies the customer's Streams. To do this, we used the Graylog REST API; the command is shown in figure 3.11. After running this command, the user customerA has only the Role A assigned.

curl -XDELETE -u ADMIN:PASSWORD http://graylog_ip_address:9000/api/roles/Reader/members/customerA

Figure 3.11: Multi-tenancy REST-API command for revoking Roles

On the one side, we reached our goal, which was to hide the customers from each other; on the other side, this workaround needs to be done for each user separately, and this method is not supported or documented by Graylog.
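Since the workaround has to be applied per user, the call from figure 3.11 can simply be repeated for every user of a customer; the user names below are hypothetical:

for user in customerA customerA_analyst customerA_operator; do
    # Revoke the built-in Reader role from each customer user.
    curl -XDELETE -u ADMIN:PASSWORD \
        "http://graylog_ip_address:9000/api/roles/Reader/members/${user}"
done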

Configuration Backup and Restore

During the testing, it happened that our Graylog server ran out of space. Not only the logs that were stored in ElasticSearch caused the problem, but also the journaling that was enabled. Graylog uses the journal file to store logs before they are processed by ElasticSearch. It's a mechanism that protects logs from being lost in case ElasticSearch is not able to process them. The journal file grows dramatically


if ElasticSearch cannot handle the load or the processing in Graylog takes too much time. Even after we deleted all the logs from the journal file and ElasticSearch, the log processing still caused problems, and the application started to behave unpredictably - there were indexing and searching problems, and sometimes ElasticSearch was not able to process logs for no particular reason. These problems led to the re-install of the server. If such problems occur and a re-install with the preserved system configuration is needed, the system's configuration needs to be backed up and restored on a newly created server. The problems that we had with Graylog gave us the opportunity to try the Backup&Restore capabilities of Graylog. The Graylog configuration backup consists of the following steps:

∙ Backup of the MongoDB by using the mongodump command, which creates the dump folder in the current directory and creates the database backup there.

∙ Backup of the main Graylog configuration file, by default stored in the /etc/graylog/server/server.conf file.

∙ Backup of the Apache configuration stored by default in the /etc/httpd/conf and /etc/httpd/conf.d folders. Apache is used for running the Graylog Web Interface.

The process of restoring the configuration requires the use of the mongorestore command in the dump folder and replacing the default main configuration file and Apache configurations by the backed up ones. Even though the process of restoring Graylog was successful and we deployed an identical copy of the previously running server, a few problems and bugs came to light. We identified the following problems related to the Backup&Restore process in the 2.3.1 version of Graylog that we were testing:

∙ The Backup&Restore process is very poorly documented, and a clear definition of which folders and files should be backed up is entirely missing.

∙ The restore was not successful when it comes to LDAP; the LDAP group mapping was missing, and so Graylog was not able to assign correct Roles to particular users. This needed to be fixed manually.


∙ The ElasticSearch indices changed their naming convention, for example from the previous /var/lib/elasticsearch/graylog/nodes/0/indices/axenta_security_set_0/ to /var/lib/elasticsearch/nodes/0/indices/wV7hZ-F2RVKyYSzQR7vbpw/, and the naming changed similarly for each of the indices. The problem is that it was no longer clear which ElasticSearch files belong to which Stream.

We also installed and tested the external configuration backup plugin [14] available on the Graylog Market, but unfortunately, in the Graylog version that we tested it has a bug, the plugin is not shown on the Graylog Web Interface, and so it cannot be used.
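A minimal sketch of the backup steps listed above, assuming the default paths mentioned in this subsection; the destination directory is a placeholder:

#!/bin/bash
BACKUP_DIR=/root/graylog-backup-$(date +%F)
mkdir -p "$BACKUP_DIR"
cd "$BACKUP_DIR"

# 1. Dump the MongoDB databases (creates the dump folder in the current directory).
mongodump

# 2. Copy the main Graylog configuration file.
cp /etc/graylog/server/server.conf .

# 3. Copy the Apache configuration used for the Graylog Web Interface.
cp -r /etc/httpd/conf /etc/httpd/conf.d .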

3.4 OSSEC and Graylog Integration

The last section of the Deployment chapter focuses on providing details about the integration of both systems that we used in our project and their deployment architecture within the AXENTA a.s. infrastructure. As described in section 3.2, OSSEC is not able to forward received logs, only the alert logs that OSSEC generated. OSSEC is capable of log analysis, but we decided not to use this feature for one main reason: the advanced log analysis, if needed, can be done either in Graylog or a SIEM, which is why we prefer to use these technologies for the log analysis instead of OSSEC. The other disadvantage of doing the log analysis in OSSEC and also sending logs to the Relay Server is that we would need to send all the logs twice through the network. In the project, we focused more on the file integrity monitoring for files and folders that are used for the Log Management in the SOC and AXENTA a.s. infrastructure. The files that we monitored were mostly related to the syslog-ng configuration. The monitoring was done in real time, and the alerts were sent to the AXENTA a.s. notification e-mail and forwarded to the Relay Server via syslog. The file integrity monitoring and OSSEC were installed, configured and tested on a total of eighteen critical servers in the AXENTA a.s. infrastructure.

The Graylog system was deployed as a High-Availability Graylog cluster consisting of two virtual servers and one cluster IP address responsible for routing the traffic to the current Master server. Even if Graylog


Figure 3.12: Graylog Cluster Architecture

is not designed to be deployed in a High-Availability setup, we used different 3rd-party technologies, such as Red Hat Cluster and GlusterFS, and created a fully functional High-Availability Graylog Cluster. Graylog received all the infrastructure logs from the Relay Server, which collected the logs from individual servers in the infrastructure. The Relay Server was also deployed as a High-Availability cluster, and such a setup with two High-Availability Log Management systems built a reliable and secure Log Management solution. Even if one of the Relay Servers and one of the Graylog servers became unavailable, the setup would still be able to collect and analyze logs without any changes. Log transfer in the infrastructure was secured by TLS encryption, except for receiving logs from servers that don't support TLS encryption, such as firewalls, switches or VMware, and alert logs from OSSEC, because OSSEC doesn't support encrypted syslog transfer.


4 Testing and Results

4.1 Graylog Throughput Testing

This section focuses on providing details about the Graylog throughput testing; it summarizes the testing requirements, explains the testing methods and testing process, and explains different configuration parameters and their impact on the Graylog throughput. The second part of this chapter focuses on the summary of the OSSEC system, explains the findings and describes our experiences with this system in our project. The Graylog testing that we performed during our project consisted of multiple parts:

1. Testing Server Preparation: We prepared two testing servers with different server configurations (CPU cores, RAM memory), on which we tested the throughput. Testing on different servers allowed us to get more accurate throughput results and showed us the limitations that each server configuration has. With the throughput results from different servers, it's possible to estimate what server configuration is needed to be able to process a particular number of logs per second.

2. Testing Scenarios Preparation: To get the most accurate results, it was required to test different processing configurations and their combinations. We prepared a total of five testing scenarios that consisted of a combination of four main configuration parameters, namely batch size, input, processing, and output buffers. Batch size is a parameter that defines how many messages can be sent to ElasticSearch at once. The buffer parameters, on the other hand, specify the number of processing threads in each buffer. Buffers and log processing are described in detail later in this section.

3. Testing Methods and Script Preparation: We developed a special script that was responsible for generating logs and forwarding them to the Graylog server, and for measuring the time in which Graylog processed the generated logs. A unified testing method


was required to get the most accurate results for each of the testing scenarios.

4. Results Summary and Result Understanding: It was necessary to understand the impact of each configuration on the Graylog throughput and to make a comparison of results for different testing scenarios. The detailed testing results are summarized in subsection 4.1.3.

4.1.1 Graylog Buffering

To understand the process of testing, the architecture and the configuration parameters that we tried in our project, it's essential to understand how Graylog internal processing works. Graylog uses different internal buffers designed to cache small amounts of messages for a very short time (milliseconds) on their way through different processors. Figure 4.1 shows how the different types of buffers are interconnected with the journal file and ElasticSearch. We describe this internal buffering structure because during the throughput testing the configuration of these buffers and the journal file appeared to have a big impact on the Graylog throughput. Graylog uses three different types of buffers:

1. Input Buffer. Stores the incoming messages.

2. Process Buffer. Stores the messages right after they are processed by the journal file.

3. Output Buffer. Stores the messages right before they are processed by ElasticSearch.

Graylog uses a journal file which ensures that the incoming messages are kept safe in case of a server failure. All the incoming messages are first written to the journal file and only after that processed. The journal file is also used as a mechanism for storing messages in case there are so many incoming messages that Graylog would not be able to effectively process them, or the message output to ElasticSearch is too slow. In these cases, messages are kept in the journal file instead of the main memory. The messages are processed in the FIFO processing mode, so messages received first are processed first. The use of the journal file can be enabled or disabled in the main configuration file /etc/graylog/server/server.conf using the message_journal_enabled = true/false parameter. The journal file is by default located in the /var/lib/graylog-server/journal folder. Disabling the journal file means that the messages can potentially be lost in case Graylog is not able to process messages fast enough. By default the journal file holds the messages for a maximum of 12 hours or 5 GB. These parameters are configurable.

Figure 4.1: Graylog Internal Processing - Buffering

In figure 4.1, we can see the different configuration parameters and their values. Each Graylog part, including ElasticSearch, the Graylog server and the processing buffers, is configurable, and we will explain the configuration options later in this section. Graylog also allows us to configure the number of processing threads, called processors, assigned to each buffer. It means that multiple threads for each buffer can run concurrently and process the incoming messages separately. An increased number of processing threads helps Graylog to be more dynamic and flexible, and to process the messages faster. In our project, we were testing different combinations of these parameters to choose the most powerful one regarding throughput. The configuration parameters that we tested in the project are:

∙ ring_size is the maximum number of messages in a buffer, and this number has to be a power of 2 (256, 512, 1024, ...). The size can be set for each buffer separately.


∙ batch_size is the maximum number of messages on the ElasticSearch output. It defines how many messages can be sent to ElasticSearch at once.

∙ output_flush_interval is a flush interval in seconds that defines how often the messages are sent from the output buffer to ElasticSearch. The messages are sent to the ES either when they reach the maximum batch size or when the flush interval is reached.

∙ input/process/output_processors is a parameter that specifies the number of processors assigned to each of the buffers. The number of processors, in fact, means the number of threads processing the messages, and in figure 4.1 we can see four input buffer processors, five process buffer processors and three output buffer processors.

In figure 4.1 we can also see how much RAM was assigned to Graylog and ElasticSearch. These parameters are defined in the /etc/sysconfig/graylog-server file. In our project and for all testing cases we used the recommended RAM configuration, which means at least 50% of the available RAM for ElasticSearch, 25% for the Graylog application and the remaining 25% for the server itself.
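The buffer and batch parameters discussed above sit in the same configuration file; a hedged sketch (parameter names assumed from the tested 2.3 release, values corresponding to one of the tested combinations) might look like this:

# Number of processing threads per buffer.
inputbuffer_processors = 5
processbuffer_processors = 20
outputbuffer_processors = 15
# Ring (buffer) size, which has to be a power of 2.
ring_size = 65536
# ElasticSearch output batching.
output_batch_size = 20000
output_flush_interval = 1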

GRAYLOG REST API

Graylog provides a powerful REST API for the exchange of information between the Graylog Web Interface and the Graylog Cluster. The same REST API can be used to get the metrics of different parts of Graylog, such as cluster and system statistics, alerts, user information, plugins, indexes, dashboards, etc. The average throughput and some basic metrics are shown on the Web Interface, but detailed, more accurate and up-to-date information is accessible only through the REST API. The REST API is accessible by default on http://IP_ADDRESS:9000/api/, but the port on which the API is listening is configurable in the main Graylog configuration file. During our testing we mostly focused on the following metrics:

∙ org.graylog2.buffers.OutputBuffer.incomingMessages which returns the number of messages received by the Output Buffer.


∙ org.graylog2.shared.buffers.ProcessBuffer.incomingMessages which returns the number of messages received by the Process Buffer.

∙ org.graylog2.buffers.input.usage which returns the number of messages received by the Input Buffer.

∙ org.graylog2.outputs.ElasticSearchOutput.writes which returns the number of messages written to the ElasticSearch from the Output buffer. This metric was crucial for us for testing purposes and it returns the total number of messages and 1, 5 and 15 minute average values.

∙ org.graylog2.journal.size which returns the current number of messages in the journal file

During the testing, we monitored all of these metrics to have a full overview of the Graylog state. The custom-made script that we used to determine the throughput of the application is described in the subsection 4.1.2.
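As a quick example, a single metric can be read from the REST API with curl and jq; the credentials and the address below are placeholders:

curl -XGET -s -u admin:admin_password \
    "http://IP_ADDRESS:9000/api/system/metrics/org.graylog2.outputs.ElasticSearchOutput.writes" | jq '.count'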

4.1.2 Testing Explanation

The main goal of the testing was to determine the average Graylog throughput with different configurations and on different servers. The throughput of a system is an important indicator that determines whether or not the particular system is capable of processing large amounts of messages. According to the average throughput, it's possible to determine the use cases in which Graylog can be used and in which it can't be used. In case there is a customer looking for a Log Management solution with an average number of logs/events per second higher than the maximum Graylog throughput, Graylog can't be used for this customer, or the whole Graylog Cluster has to run on a more powerful server, which brings higher costs. So this number is crucial in the design and planning phases of a Log Management project. To determine the throughput accurately, we used two different servers for testing purposes; one called Small Graylog Server was a less powerful server and the other one called Big Graylog Server was a more powerful one. The difference was mostly only in the number of CPUs and the amount of RAM. Testing on two different servers gives us an idea of how the scalability of Graylog works and how the performance increases when using the more powerful server.

Small Graylog Server:

∙ Virtual Machine deployed on the VMware ESXi

∙ RHEL/CentOS 6 (64-bit) with 4x CORE

∙ 8 GB RAM

∙ 100 GB HDD

Big Graylog Server:

∙ Virtual Machine deployed on the VMware ESXi

∙ RHEL/CentOS 6 (64-bit) with 12x CORE

∙ 32 GB RAM

∙ 100 GB HDD

For testing, we used three different technologies, namely the Graylog REST API, logger and a custom-made script that calculated the average throughput. We realized that the throughput which we want to determine is based on the time from when the Graylog engine receives the messages until they are processed by ElasticSearch. That's the reason why we used the org.graylog2.outputs.ElasticSearchOutput.writes metric as the most important one, because it returns the number of messages written to ElasticSearch. We used the syslog message generator called logger that can generate the syslog messages at a specified rate, size and for a specified period of time. The script is shown in figure 4.2, and it demonstrates how we determined the throughput. The script was running on the Graylog server and requested the information about the current number of messages in Graylog every 0.1 seconds. The script started to measure the time immediately after the very first message was received by Graylog. When 1 million messages were processed, the script stopped measuring and calculated the processing duration. We determined the throughput as the quotient of the number of messages (1 million) and the processing duration. This gave us the average number of messages per second that Graylog processed.
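To make the calculation concrete with a hypothetical example: if Graylog needed roughly 37 seconds from the first received message until the one-millionth message was written to ElasticSearch, the resulting average throughput would be 1,000,000 / 37 ≈ 27,000 messages per second, which is close to the maximum values reported in subsection 4.1.3.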


#!/bin/bash
counter=0
while true
do
    count=$(curl -XGET -s "http://admin:admin_password@ip_add:9000/api/system/metrics/org.graylog2.outputs.ElasticSearchOutput.writes" | jq '.count')

    if [ "$count" -gt 0 ] && [ "$counter" -eq 0 ]; then
        start=$(date +%s.%N)
        counter=$((counter + 1))
    fi

    if [ "$count" -ge 1000000 ]; then
        end=$(date +%s.%N)
        echo $count
        break
    fi
    sleep 0.1
done

duration=$(echo "$end - $start" | bc)
echo "The duration was: $duration"

throughput=$(echo "$count/$duration" | bc -l)
echo "The average throughput is: $throughput"

Figure 4.2: Testing throughput script

4.1.3 Results and Findings

As written above, it was not only necessary to test the Graylog throughput, but it was also important to understand the impact that each configuration parameter had on the throughput. This subsection explains the testing results and summarizes the findings from the testing. The main goal of this testing was to find out the maximum number of logs that Graylog can process in a second. Knowing what system configuration is needed to reach a required throughput is crucial when designing a log management system.

We tested combinations of the three most important configuration parameters, namely the batch size and the process and output buffers, while the value of the input buffer was set to 5. ElasticSearch, which is responsible for processing and indexing logs, always processes batches of logs. Processing data in batches, where a group of data is collected over a period of time and then processed at once, is an efficient way of processing high volumes of data. In Graylog, a batch is a group of logs collected from the output buffer over a period of time and processed at once. The details about Graylog buffers are provided in subsection 4.1.1. We used the following testing scenarios:


1. Batch size of 500, Process Buffer of 5 threads, Output Buffer of 5 threads: In this scenario, there are five processing threads in each buffer responsible for log processing. The batch size of 500 means that the ElasticSearch always processes a group of 500 logs at once.

2. Batch size of 1000, Process Buffer of 5 threads, Output Buffer of 10 threads: In this scenario, we increased both the batch size and the number of threads in the output buffer twice from the first scenario.

3. Batch size of 10000, Process Buffer of 10 threads, Output Buffer of 10 threads: In this scenario, ElasticSearch processes 10000 messages at once, and the processing is done using 10 threads in the processing buffer and 10 threads in the output buffer.

4. Batch size of 20000, Process Buffer of 20 threads, Output Buffer of 20 threads: In this scenario, ElasticSearch processes 20000 messages at once, and the processing is done using 20 threads in the processing buffer and 20 threads in the output buffer.

5. Batch size of 20000, Process Buffer of 20 threads, Output Buffer of 15 threads: Similarly to the fourth testing scenario, 20000 logs are processed at once, and the processing is done using 20 threads in the processing buffer and 15 threads in the output buffer.

The number of Input Buffer processors was set to 5 for each of the testing scenarios because, according to the documentation, this attribute doesn't have a big impact on the performance. Journaling was enabled to avoid the loss of log messages. We also performed tests with disabled journaling, but the results were not very different from the results with enabled journaling, and that's why we decided to use the journal and potentially keep the logs safe. Each testing scenario was tested four times on both servers, and the average of these four tests was the result of a particular testing scenario. This gives us a total of 40 separate tests.

Figure 4.3 shows the Graylog throughput testing results for all of the testing scenarios and both testing servers. On the Y-Axis we can


Figure 4.3: Graylog Throughput Testing Results

see the average number of events/logs per second, and on the X-Axis we can see the different testing scenarios described above. On the X-Axis, B stands for the batch size, P stands for the number of process buffer threads and O stands for the number of output buffer threads. What we realized is that in the case of the Small server the limiting factor was the number of CPU cores; that's the reason why there are no significant differences in the throughput results for the different testing scenarios. According to the information that we got from the Graylog support, the number of input, process, and output buffer threads is strongly limited by the number of CPUs that the particular server has. It means that the 4 CPUs (in the case of the Small server) are equally used by all the buffers, and increasing the number of buffer threads simply doesn't have any effect on the throughput because the server is not capable of running more threads at the same time and so is not able to process more logs at the same time.

On the other hand, we can see significant differences for the Big server between the 1st and the 5th testing scenario. Increasing the batch size from 500 to 20000, the number of process buffer threads from 5 to 20 and the number of output buffer threads from 5 to 15 appeared to be the most powerful configuration regarding throughput. We can see that the throughput of the 5th scenario is almost two times higher than the throughput of the 1st scenario on the Big server.

Not only the number of server CPUs but also the RAM played a role in the Graylog throughput. Having more RAM helps to increase the technical performance of a server and lower the processing latency. The Big server had four times more RAM available than the Small server, which means that we were able to assign more RAM to each of the Graylog processing parts. In the case of the Small server, we assigned 4 GB of RAM to ElasticSearch, 2 GB of RAM to the server and 2 GB of RAM to the Graylog system. For the Big server, on the other hand, we assigned 20 GB of RAM to ElasticSearch, 4 GB of RAM to the server and 8 GB of RAM to the Graylog system. The server itself doesn't need more than 4 GB of RAM for its own purposes.

During the whole testing, we monitored the servers with the Centreon monitoring tool, which is an open-source system and network monitoring application based on SNMP1. We monitored different Graylog server parameters such as Load, CPU Usage, Storage, Running Processes, Folder Size, etc., and in figure 4.4 we can see the graphs for the CPU usage and the running processes that consumed the most CPU during the testing of the Big Graylog Server. Graylog runs under Java, and so the Java CPU usage, as shown in the graph, is the most significant one. We can also see that the CPU usage for the whole server reached more than 60% a few times during the testing. The X-Axis in both graphs represents the time, and the Y-Axis represents the percentage of CPU used.

We were able to successfully test each testing scenario on both servers and compare the results. We realized that for an enterprise solution capable of processing up to a hundred thousand logs per second, a server with a much higher configuration is needed. The maximum throughput that we were able to reach was slightly

1. Simple Network Management Protocol


Figure 4.4: Graylog Server Monitoring Showing the Processes and CPU Usage

less than 27000 logs processed in a second, and we found out that the batch size is the most important parameter that can be configured. The number of messages processed by ElasticSearch at once had the biggest impact on the throughput of the whole system. On the other hand, our tests showed that journaling doesn't have any noticeable impact on the throughput, which means that logs can be kept safe even if there is a processing problem on the system; in such a case the incoming logs are stored in the journal file.

4.2 Summary of OSSEC

This section focuses on the summary of OSSEC and explains our findings and experiences with this system. The most significant things that we did with OSSEC include:

∙ Successful testing of the centralized HIDS system management from a single server with the agent and agent-less architecture. Agent-less architecture is required especially if it’s, for some


reason, not possible to install an agent on the monitored server. We were able to successfully run both of them at the same time, which means that the majority of the servers in the infrastructure were monitored using the OSSEC agent, but we still had a few servers running in agent-less mode.

∙ The creation of custom-made filters and rules for monitoring critically important configuration files, related mostly to syslog, Apache, and databases.

∙ Successful testing of forwarding the alert logs to Graylog, and verification that OSSEC doesn't support forwarding the raw logs received from the agent servers.

∙ The documentation of the deployment and configuration process, and the creation of a guide on how to deploy such a system in case a potential customer of AXENTA a.s. is interested in intrusion detection.

OSSEC is relatively easy to set up, and both the OSSEC Master and OSSEC Agent installations are straightforward and clear. One of the biggest advantages that we identified is that the whole monitoring can be configured from the master node, so it's not needed to configure each server separately. Exactly for this purpose, the shared configuration file located on the master node is used. It makes the configuration and tuning much easier and much more efficient. OSSEC also allows us to configure each server differently according to our needs; in this case, we would use the local configuration file instead of the shared one. OSSEC also provides a wide range of features such as Rootkit Detection, Log Monitoring, File Integrity Checking, Active Response, etc., that can be used either together or separately according to needs. This gives us the ability to use only what we need.

The main problem that we found was that OSSEC is not able to forward logs to other systems; it can only forward alert logs. Forwarding logs from OSSEC would require an additional 3rd-party technology or a script. OSSEC's Log Analysis is undoubtedly a very powerful feature, but in the AXENTA a.s.'s SOC solutions we use more detailed and more powerful tools for log analysis, such as a SIEM. The OSSEC Log Analysis engine parses logs according to the specified patterns and matches the


specific keywords to reveal unusual activities. Even though we didn't use this feature in our project, OSSEC can analyze logs in different formats and from different sources, which makes this feature very useful. The OSSEC system is well documented with a lot of useful examples and guides and has extensive community support.

Despite its advantages, OSSEC has some notable drawbacks. In our project, we only used one version of OSSEC, namely 2.9.0, but the process of upgrading is notoriously problematic. The upgrade overwrites the custom-made rules and parsers with the default ones, so to keep the custom-made configuration it's necessary to export and re-import them. This brings a lot of problems, because if the user forgets to export the configuration, or an error occurs during this process, then the whole custom-made configuration is lost and has to be done again. OSSEC also requires fine-tuning of rules to avoid false positives and requires the tuning of the log levels as described in section 3.2.2. The other problem with OSSEC is that the OSSEC Master server can only be a Linux-like server; OSSEC doesn't support a Windows OSSEC Master setup. In the AXENTA a.s. infrastructure, this is not a problem, but if we consider a customer with servers running only on Windows, then this can be a huge drawback for such customers and environments. Another problem that we found is that the OSSEC agent can die silently and the OSSEC Master is not capable of detecting it. According to the official OSSEC documentation, it's recommended to restart each OSSEC agent daily to avoid these problems. However, each service restart produces more logs, and a scheduled restart should not be considered a proper solution to the problem. In the AXENTA a.s. SOC we use a monitoring tool called Centreon that monitors the current state of the OSSEC agents, and it detects and notifies us in case an OSSEC Agent is not running.


5 Conclusion

The main objective of this master's thesis was to examine the OSSEC and Graylog technologies both theoretically and practically. The theoretical part includes the description of both technologies and their features. The practical part consisted of the deployment and integration of both systems, the creation of custom configurations and rules for the AXENTA a.s. purposes, the performance testing of the Graylog system and long-term testing of OSSEC in the AXENTA a.s. infrastructure. One of the main goals of the project was to figure out whether the OSSEC and Graylog tools are useful for the AXENTA a.s.'s SOC purposes, and if they are, how to deploy and configure them so they can be used as additional functionality to the current services offered by the SOC.

We realized that the OSSEC system is considerably easy to deploy in the default configuration even in a large infrastructure, and it provides centralized management, which is very useful in large infrastructures containing a large number of servers. The default configuration provides only a basic set of rules for intrusion detection, which is why fine-tuning of the detection methods and rules is required; this process can be very time-consuming and requires advanced knowledge of the system. We also found out that the OSSEC tool is not capable of forwarding logs to other systems. It only forwards alerts, which means that if we want both to collect the logs and to monitor them with OSSEC, all the infrastructure logs have to be sent twice - once to the OSSEC Master server and once to the Relay Server. That has an effect on the traffic in the infrastructure.

We also found out that Graylog in the single-node configuration is considerably easy to install and prepare for basic log collection. On the other hand, multi-node setups can be challenging, mostly because of the ElasticSearch configuration. Precise ElasticSearch configuration requires a deeper knowledge of how ElasticSearch works in the background. The other problem that we found is that in case the Graylog server runs out of storage, it's difficult to restore the before-failure state of the system. In our project, when this happened, we were forced to re-install the Graylog server and import the server configuration from the other Graylog server. When it comes to performance, Graylog is capable of processing large amounts of data, but to be able to do so it requires a powerful and, most of the time, multi-node setup on which the system runs. Such a multi-node system requires advanced maintenance compared to other out-of-the-box systems, which might be a disadvantage in some projects. Despite all these drawbacks, Graylog appears to be a very useful system, especially for a customer looking for an open-source log management solution with lower volumes of logs.

In the project, we were able to successfully deploy both systems and integrate them. One part of this project included the presentation of both technologies to the team of System Engineers of the AXENTA a.s. company, to show them the options, features, and drawbacks of both systems. This master's thesis fulfilled the requirements set in the initial phases of the project.

Bibliography

1. Intrusion detection system (IDS) [online]. USA: TechTarget, 2007 [visited on 2018-02-10]. Available from: http://searchsecurity.techtarget.com/definition/intrusion-detection-system.
2. Host- vs. Network-Based Intrusion Detection Systems [online]. USA: SANS Institute [visited on 2018-02-10]. Available from: https://cyber-defense.sans.org/resources/papers/gsec/host-vs-network-based-intrusion-detection-systems-102574.
3. IDS: File Integrity Checking [online]. USA: SANS Institute, 2014 [visited on 2018-02-10]. Available from: https://www.sans.org/reading-room/whitepapers/detection/ids-file-integrity-checking-35327.
4. OSSEC Official Documentation [online]. USA: OSSEC Project Team [visited on 2018-02-10]. Available from: https://ossec.github.io/docs/.
5. OSSEC Official Documentation [online]. USA: OSSEC Project Team [visited on 2018-02-10]. Available from: http://ossec-docs.readthedocs.io/en/latest/manual/monitoring/.
6. Linux Rootkit Detection With OSSEC [online]. USA: Sally Vandeven / GIAC (GCIA) Gold Certification, 2014 [visited on 2018-02-10]. Available from: https://www.giac.org/paper/gcia/8751/rootkit-detection-ossec/126976.
7. Graylog Official Documentation [online]. USA: Graylog, Inc [visited on 2018-02-10]. Available from: http://docs.graylog.org/en/2.4/.
8. Elasticsearch Official Documentation [online]. USA: Elasticsearch, 2018 [visited on 2018-02-10]. Available from: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html.
9. Splunk and the ELK Stack: A Side-by-Side Comparison [online]. USA: Asaf Yigal / DevOps, 2017 [visited on 2018-02-10]. Available from: https://devops.com/splunk-elk-stack-side-side-comparison/.


10. The syslog-ng Store Box 5 LTS Administrator Guide [online]. USA: Balabit, a One Identity business, 2018 [visited on 2018-02-10]. Available from: https://syslog-ng.com/documents/html/ssb-5.0-guides/en/ssb-guide-admin/pdf/ssb-guide-admin.pdf#index.
11. Log collecting performance [online]. USA: Balabit, a One Identity business, 2018 [visited on 2018-02-10]. Available from: https://www-prod.balabit.com/documents/ssb-5.0-guides/en/ssb-whitepaper-performance/html/log-collecting-performance.html.
12. Official pfSense documentation [online]. USA: pfSense, 2018 [visited on 2018-02-10]. Available from: https://doc.pfsense.org/index.php/Filter_Log_Format_for_pfSense_2.2.
13. Plugins [online]. USA: Graylog, Inc [visited on 2018-02-10]. Available from: http://docs.graylog.org/en/2.4/pages/plugins.html.
14. External Graylog Ba [online]. USA: GitHub, 2018 [visited on 2018-02-10]. Available from: https://github.com/fbalicchia/graylog-plugin-backup-configuration.
