Masaryk University Faculty of Informatics

Detection of network attacks using HTTP related information

Master’s Thesis

Lenka Kuníková

Brno, Spring 2017


This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Lenka Kuníková

Advisor: RNDr. Pavel Minařík, PhD.


Acknowledgement

I would like to thank my advisor RNDr. Pavel Minařík, PhD. and Mgr. Martin Juřen for their guidance and useful advice.

Abstract

This thesis deals with extended HTTP network flows and their application for the detection of various attacks and anomalies on the network. It highlights the advantages of extended HTTP flows for chosen attacks, implements and tests existing detection methods, and suggests numerous improvements. Furthermore, the thesis analyses the User-Agent request header in detail. It describes how this field can be used for anomaly detection and explains the problems related to User-Agent analysis.

Keywords

Flow monitoring, HTTP, Anomaly detection, User-Agent, Brute-force attack, SQL injection


Contents

1 Introduction 1

2 HTTP 3
2.1 Basic concepts 3
2.2 URI 4
2.3 Message format 5
2.3.1 HTTP request 5
2.3.2 Methods 7
2.3.3 Response message 9
2.3.4 Status code 10
2.4 Architectural Components of the Web 11
2.4.1 Virtual hosting 11
2.4.2 Proxy servers 12
2.4.3 Caching 13
2.4.4 Gateways 14
2.4.5 Tunnels 14
2.5 Authentication and secure HTTP 14
2.6 HTTP/2 16

3 User-Agent 17
3.1 Format 18
3.1.1 Non-browser 18
3.1.2 Browser 19
3.2 Compatibility and spoofing 20
3.3 User-agents in MU network 21

4 Network monitoring 23
4.1 Experiment setup 25
4.2 Shortcomings of monitoring method 26

5 Network Scanning 27
5.1 HTTP scanning 28
5.1.1 Incoming traffic 28
5.1.2 Outgoing traffic 29
5.2 Directory traversal 31
5.3 Summary 32

6 Brute Force Attacks 33
6.1 Targeted authentication methods 34
6.2 Attacks against HTTP based authentication 34
6.3 Attacks against authentication using POST method 35
6.4 Attacks against authentication using GET method 40
6.5 Summary 41

7 Code injection 43
7.1 SQL injection 43
7.1.1 Basic principle 44
7.1.2 Attack types 44
7.1.3 Detection 45
7.2 Cross-site scripting 47
7.2.1 Detection 48
7.3 Summary 50

8 User agent anomalies 51
8.1 Blacklists and pattern matching 52
8.1.1 Known malicious UA strings 52
8.1.2 Company policies and unwanted software 54
8.1.3 Code injection 55
8.2 Missing User-Agent 55
8.3 Many User-Agents from one IP address 57
8.3.1 Outgoing traffic 58
8.3.2 Incoming traffic 59
8.4 Unusual User-Agent 61
8.5 Discrepant User-Agent 63
8.6 Summary 65

9 Conclusion 67

A List of created scripts 69

Bibliography 71

1 Introduction

HTTP is one of the most commonly used application layer protocols. Every time a person tries to display a web page in a browser, the communication is carried out by HTTP or its secure version HTTPS. A significant part of network traffic is therefore performed via HTTP. However, everything that is frequently used is also frequently misused, and HTTP is no exception. On one hand, HTTP servers are common targets of various types of attacks, for example in order to take control over the server. On the other hand, botnets often use HTTP for their communication because it is easily hidden in the rest of the network traffic. Consequently, network administrators monitor their networks to detect and mitigate this malicious behaviour.

There are two main approaches to network monitoring: deep packet inspection and monitoring of network flows. This thesis deals with the second option. Although flows originally contained only information from the 3rd and 4th layer of the ISO OSI model, extended flows also support the export of fields from application layer protocols, including HTTP.

The aim of this thesis is to explore how HTTP fields can be used to detect various anomalies and attacks on the network. The thesis analyses chosen attack types that can be identified thanks to extended flows. It describes existing detection methods, tests them on data from the Masaryk University network and, where possible, suggests and implements several improvements. Furthermore, the thesis deals with the HTTP User-Agent field in detail. It highlights the diversity of User-Agents, explains the problems related to their analysis and outlines how this field can be used for anomaly detection. Every described method is also implemented and tested on real traffic.

The thesis begins with a theoretical chapter about the HTTP protocol, followed by a chapter dedicated to the User-Agent field. Chapter 4 briefly explains flow monitoring and describes the setup used for the experiments. The following chapters are dedicated to various attacks and anomalies. The first of them focuses on network scanning at the HTTP level, chapter 6 deals with brute force attacks, chapter 7 describes two attacks based on code injection and the possibilities of detecting them, and chapter 8 presents five distinct concepts of how to use the User-Agent field for anomaly detection.

2 HTTP

The Hypertext Transfer Protocol (HTTP) represents the base protocol for accessing the World Wide Web. It first appeared shortly after Tim Berners-Lee introduced a proposal of the World Wide Web in 1989. With his team at CERN, they were responsible for the creation of HTTP as well as the Hypertext Markup Language (HTML) [1]. Since its first documented version, HTTP/0.9, the protocol has undergone multiple important changes, but it still remains one of the most ubiquitous application layer protocols. Despite the fact that its newest version, HTTP/2, was published in 2015, this chapter describes the previous version – HTTP/1.1. The theory explained in the rest of the chapter is mostly based on RFC 7230 [2], defining the HTTP message format, and RFC 7231 [3], defining the semantics.

2.1 Basic concepts

HTTP uses a client-server model. The protocol defines the syntax and semantics of the messages that a client and a server exchange in order to deliver the web page the client has requested. Clients are usually represented by web browsers, but they are not the only option. An HTTP client can also be an antivirus program checking for updates, or a web crawler that helps an Internet search engine to create its database. Among HTTP servers, the most commonly used are Apache and Microsoft Internet Information Server.

Servers store web resources. A resource can be a simple HTML file, an image or dynamically generated content. Such objects are addressable by a Uniform Resource Identifier (URI). The client initiates a connection, creates a request for an object on a specified URI, and sends this request to a server. The server retrieves the requested object from its storage and sends it back to the client in an HTTP response message.

HTTP presumes a reliable, connection-oriented transport layer protocol. Therefore, HTTP does not address problems related to missing packets or their reordering because it assumes everything was delivered successfully. Normally, HTTP runs on top of the Transmission Control Protocol (TCP) and the default port is 80, but if the number

is explicitly stated, any port can be used. The common alternative is 8080. HTTP can use both persistent and non-persistent connections. In the case of non-persistent connections, each request/response pair is sent over a different TCP connection. In version 1.1 of the protocol, persistent connections are the default; multiple requests are combined into a single connection in order to reduce response delay. Another important property of HTTP is that it is stateless. This means that the server is not required to keep track of information about users over the course of multiple requests. Each request needs to be standalone and contain all the information necessary to satisfy it [4].

2.2 URI

A URI is a sequence of characters used to identify a resource. The most common type of URI is the Uniform Resource Locator (URL). It is a subset of URIs that, in addition to identifying a resource, provides the means of locating it [5]. Another option is the Uniform Resource Name (URN), which does not provide a way to locate the resource – it is location independent. URNs are not widely adopted and they will not be further discussed. The thesis only focuses on URLs, and those used in HTTP satisfy the following syntax:

scheme://host:port/path?query#fragment

The scheme defines the protocol used, which is HTTP in this case. The host component identifies the server hosting the resource. It can be either in the form of a hostname or an IP address. The next field determines the port the requested server is listening on. In HTTP, 80 is the default value. The path field specifies the location of the resource on the server. There is no official format for the query component, but key=value pairs separated by an ampersand (&) are commonly used. They can further specify the requested resource. For example, the content of web forms can be placed there. Usage of the last component (fragment) can be easily explained when referring to an HTML page. The fragment allows an entity inside the HTML, like a concrete paragraph, to be identified. The character set used by URLs is very limited. Only letters of the basic Latin alphabet, digits, and certain special characters are allowed. Some of these characters are reserved and they have special meaning,


for example: the question mark (?), colon (:), or hash (#). Others are not reserved and can be used arbitrarily, such as the dot (.), underscore (_) or tilde (~). This restricted charset is sometimes insufficient. When letters from different alphabets, reserved characters, or non-printable control characters are part of a URL, they need to be encoded. Every octet of the UTF-8 (8-bit Unicode Transformation Format) representation of a character is replaced by a triple: a percent sign and two hexadecimal digits. Therefore a space will be replaced by %20 and "ô" by %C3%B4.
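This encoding is easy to reproduce in Python (the language of the scripts attached to this thesis); a minimal sketch using only the standard library:

# URL percent-encoding with the Python standard library; the strings
# correspond to the examples in the text above.
from urllib.parse import quote

print(quote(" "))       # %20 (space is encoded)
print(quote("ô"))       # %C3%B4 (one %XX triple per UTF-8 octet)
print(quote("a.b_c~"))  # a.b_c~ (unreserved characters are left as-is)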

2.3 Message format

There are two types of HTTP messages: requests and responses. Requests are sent from clients to servers, and responses the other way around. The formats of both are presented in the following sections.

2.3.1 HTTP request

An HTTP request starts with a request line containing the method used, a requested URL and the version of the protocol. The request line is followed by zero or more header lines and then, separated by an empty line, an optional message body (see fig. 2.1). The method used defines the action that is supposed to be performed on the resource. An object can be retrieved, updated, or deleted. There are 8 basic methods: GET, HEAD, POST, PUT, DELETE, OPTIONS, TRACE and CONNECT. The protocol also allows custom methods to be defined.

The next field of the request line belongs to the URL. It is common that instead of the absolute URL just its part is used, starting from the path. The hostname and potentially the port number are defined in the Host header. When the request concerns the server as a whole and not a particular resource, an asterisk (*) appears in place of the URL. The last part of the starting line is dedicated to the version number. The most commonly used version, and the one presented in this chapter, is 1.1.

HTTP headers follow the starting line. A large number of them exists, but only a selected subset is presented in this thesis. For the full list, RFC 7231 [3] should be consulted. Headers contain additional information regarding the request, the communicating parties or the communication itself. They have a simple format: name and value


Figure 2.1: HTTP request message format [6]

separated by a colon. The most frequently used headers include the already mentioned Host, User-Agent and headers responsible for cache control or content negotiation. The most important ones for the purpose of this thesis are Host, User-Agent and Referer¹.

Host It contains the domain name of the server hosting the requested resource and a port number, if the default one is not used. This header is mandatory and it needs to be filled in even if the URL in the request line is in its absolute form. Although the field may seem redundant, since the server is already identified by the underlying IP protocol, it is important in the case of virtual hosting (see section 2.4.1).

Referer It identifies the resource from which the targeted URL was obtained. When the user clicks on a link in a web page, the URL of this web page should be placed in the Referer field. When the URL does not have a specific source (for example, it is typed by the user), this header is omitted. There are some security related concerns about the usage

1. The name is misspelled on purpose. There was an error in the original specification, and the incorrect spelling is still used for compatibility.


of this field, since it reveals information about the browsing history of the user. Some intermediaries (proxies or installed security software) may in some cases strip the Referer away. Information contained in this header is mostly used for generating back-links, various statistics or logging.

User-Agent The field contains information about the application that created the request. Because of its importance for this thesis, it will be covered in a separate chapter.

The header lines are followed by the content of the message. The message body is not mandatory, and whether it is present depends mostly on the method used. For example, the client can put the content of web page forms there. Unlike headers, the message body does not have to be ASCII text. The message can also contain images, videos or other binary content. A typical example of an HTTP request is given below.

GET / HTTP/1.1
Host: is.muni.cz
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

2.3.2 Methods

Although there are 8 official HTTP methods, just a few of them are commonly used. Not every method needs to be implemented on every resource. Figure 2.2 shows data collected on the network of Masaryk University (MU) over the period of one hour. As can be seen, GET and POST are the most prevalent ones, while the TRACE and DELETE methods did not occur at all.

GET The basic method of the HTTP protocol asks for a resource on a server specified by the URL. The GET message does not contain any


Figure 2.2: Portion of used HTTP methods at MU network

entity body. Although it is not its main purpose, GET also allows clients to send the contents of HTML forms to the server. This data is inserted into the query part of the URL. It is important to realise that this data is stored in browser history and eventually on a cache server. The method can also be used in the form of a conditional GET, when a requested entity is supposed to be sent only under certain circumstances. Conditional GET is intended to reduce unnecessary network usage. This concept is closely related to caching and it will be further elaborated on in section 2.4.3. Another way to reduce network congestion is the use of a partial GET. In this case, just part of the identified entity is sent.

POST This method was designed to send a block of data to the server. The actual action performed on the data is determined by the server. Contrary to the GET method, input data is sent in the message body and not directly in the URL. There are many guides describing which method should be used when dealing with forms. As a general rule, the GET method should be used for idempotent actions. Sensitive, binary or lengthy data should be passed by the POST method.
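The difference in where the form data travels can be illustrated with a short sketch; the host example.com and the form fields are made-up placeholders:

# Hypothetical form submission; the host and the fields are made up.
from urllib.parse import urlencode
from urllib.request import Request

fields = urlencode({"user": "alice", "lang": "en"})  # user=alice&lang=en

get_req = Request("http://example.com/form?" + fields)               # data in the URL query
post_req = Request("http://example.com/form", data=fields.encode())  # data in the message body

print(get_req.get_method(), get_req.full_url)    # GET http://example.com/form?user=alice&lang=en
print(post_req.get_method(), post_req.full_url)  # POST http://example.com/form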

HEAD This method is similar to GET, except that the server does not send back the entity body, just the headers. It is used to get meta-information about the resource without actually transferring it.


OPTIONS With this request, a client can ask the server about its capabilities. An asterisk can be used instead of a URL, meaning that the client wants information about the server itself, and not a particular resource.

CONNECT Used to establish an HTTP tunnel – another network protocol encapsulated using HTTP. The request with the CONNECT method is sent to the proxy responsible for the tunnelling.

PUT Requests that the entity sent as the message body is stored on the server under the supplied URL. The server can either create a new resource or replace an existing one. This is often used with web publishing tools.

DELETE The method asks the server to remove the resource at the supplied URL. It is up to the server to decide how to react to such a request. According to the specification, the resource does not have to be deleted.

TRACE The method is used for diagnostic purposes. It allows a loopback of the request to be invoked. The final recipient should reflect the request in the message body of its response. An HTTP message can pass through several intermediaries on its way to the server, and each of them has the possibility to change the request. This method allows the client to see what the request looks like when it reaches its final destination.

2.3.3 Response message

An HTTP response starts with a status line consisting of an HTTP version, a status code, and a reason phrase. It is followed by zero or more header lines and then, separated by an empty line, a message body, which can be empty. The status code is a 3-digit number describing the result of the request. There are 5 classes of response codes, which will be described in the next section. In the next field, the status code is explained in a human-readable format as a reason phrase. Similarly to the HTTP request, the next part consists of headers. These headers usually specify the type,

encoding and length of the sent entity, can be related to cache control, or they can specify the application that sent the response. A typical example of an HTTP response (omitting the message body) is given below.

HTTP/1.1 200 OK
Server: Apache/2.4.10 (Debian)
Cache-Control: max-age=15
Expires: Fri, 24 Mar 2017 22:46:22 GMT
Last-Modified: Fri, 24 Mar 2017 22:46:08 GMT
Content-Encoding: gzip
Content-Length: 16821
Content-Type: text/html; charset=utf-8
Date: Fri, 24 Mar 2017 22:46:08 GMT
Via: 1.1 varnish
Connection: keep-alive

2.3.4 Status code

Not all HTTP requests are successful. Sometimes the method used is not implemented, the resource does not exist, or the client does not have permission to access it. The status code is the way the client is informed about what happened with its request. According to the first digit, status codes are divided into 5 categories. For each category, several codes are given for demonstration.

∙ 1xx: Informational – This class of codes is not widely used; it has only two defined representatives. 100 Continue is intended for optimization, and 101 Switching Protocols indicates that the server agrees with the client’s request to change the communication protocol.

∙ 2xx: Success – These codes indicate that the request was processed successfully. 200 OK is the usual answer to a GET request; the requested resource is included in the entity body. 206 Partial Content is a positive answer to a partial GET request, while 204 No Content informs about a successful request for which no data is sent back to the client.


∙ 3xx: Redirection – This class of codes usually means that the resource was permanently (301 Moved Permanently) or temporarily (302 Found) moved to another location. To access the resource, a new request to the URL specified in a Location header needs to be sent. Code 304 Not Modified has a different meaning. It is used in conjunction with a conditional GET request and indicates that the client still has an up-to-date version of the resource.

∙ 4xx: Client Error – A class of codes used when there is a problem with the request sent by the client and the request cannot be fulfilled. When there is a syntactical issue with the request, the server sends code 400 Bad Request. If the resource is not present on the server, 404 Not Found is sent. Code 401 Unauthorized indicates that authentication is needed prior to accessing the resource.

∙ 5xx: Server Error – This class covers cases when the request is valid but for some reason it cannot be accomplished. The most common and most general response is 500 Internal Server Error. 503 Service Unavailable indicates a temporary problem, and it can include a Retry-After header specifying when the client should try to resend its request.

2.4 Architectural Components of the Web

Although in the simplest scenario the client and the server are the only two participants in the HTTP protocol, in many cases an intermediary element plays an important role in the communication. This chapter presents such elements, namely proxy servers, gateways and tunnels. Furthermore, it covers techniques such as virtual hosting and caching. The mentioned mechanisms are not crucial for the rest of this thesis, but they are included to preserve the completeness of the chapter.

2.4.1 Virtual hosting

In the early days of HTTP, each web server hosted one web site at most. In such conditions, it was unnecessary to specify the hostname in a request, so only relative URLs were used to specify the path to the document. A problem appeared with the expansion of virtual hosting.


The main idea behind virtual hosting is that a simple web page does not always exhaust the resources of the web server, so it is convenient to share the capacities of servers between several customers. Each customer has a completely independent web page, but all of them share the same physical server. In such circumstances, relative URLs became a problem. When a request for index.html arrives at a server, it does not know which virtual host is being accessed. That is the reason behind the introduction of the Host header, which became compulsory in version 1.1.

2.4.2 Proxy servers

Proxies are intermediaries between the client and the server, and they need to be able to implement the functionality of both. A proxy receives the client’s request and forwards it to the real server. As it comes into contact with the passing HTTP traffic, a proxy can modify it to implement many useful value-added web services. There is a wide range of scenarios where it can be used. Security can be enhanced by implementing a uniform access control strategy in corporate settings, or by employing an anonymizer removing identifying characteristics from HTTP messages. A proxy can also improve performance. One way is to implement web caching, another is to route requests to web servers based on Internet traffic conditions [6].

There are four basic techniques for redirecting an HTTP request to a proxy instead of the original server. First, the client’s browser can be configured directly to use a proxy. When a client sends a request to such a proxy, the request line contains the full URL, so the proxy can forward the request to the original recipient easily even in the absence of the Host header in older versions of the protocol. The second option is a surrogate (a proxy server placed in front of the web server). In this case, the proxy adopts the name and IP address of the web server. Another option is to modify the web server so that it sends a 305 Use Proxy response, with the proxy URL specified in a Location header, to the client on each request. The last option is the so-called transparent/intercepting proxy. This is achieved by modifying the network structure (using routing or switching techniques).


2.4.3 Caching

Caches are intended to keep copies of the most popular resources. When the client’s request goes through a web cache and the requested document is present, the cache can forward it to the client without the need to contact the original server. This approach has several advantages. It reduces redundant data transfers, the demand on the original server, and distance delays. Since web content often changes, copies in a cache can become obsolete. There needs to be a mechanism ensuring that the client always receives fresh data. Four different situations may arise [6].

∙ Cache miss – The resource is not present in a cache and it must be retrieved from a server. Its copy is saved in the cache and it is forwarded to a client.

∙ Cache hit – The requested object is stored in the cache and it is fresh enough. It can be sent directly back to the client without the need to recontact the server. To determine whether the copy is fresh, the headers Cache-Control: max-age or Expires can be consulted. Max-age defines the number of seconds that the document is valid after it was originally retrieved from the server. Expires specifies the exact date and time when the document becomes stale.

∙ Revalidate hit – When the resource is stored in the cache but its lifetime has expired, it needs to be revalidated. The conditional GET method is used for this purpose (see the sketch after this list). The condition is expressed in an HTTP header; If-Modified-Since and If-None-Match are the most popular ones. If-Modified-Since compares the time specified in the Last-Modified header of the stored resource with the time of the last modification stored on the original server. If-None-Match works differently. The server provides a special header, Etag, that acts as a serial number. When a document is changed on the server, it receives a different Etag. When these tags match, it means the copy on the server and the one in the cache are the same. If the condition is satisfied, the server replies with 304 Not Modified and the copy in the cache becomes valid again.


∙ Revalidate miss – When the conditional GET is evaluated as false, the document needs to be resent. The cache saves its new version and forwards a copy to the client.
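A revalidation can be sketched in Python as follows; the URL, the Etag and the date below are hypothetical placeholders, and a real cache would take the validators from the previously stored response:

# Cache revalidation via a conditional GET; URL and validators are made up.
from urllib.request import Request, urlopen
from urllib.error import HTTPError

req = Request("http://example.com/index.html", headers={
    "If-None-Match": '"686897696a7c876b7e"',              # Etag of the cached copy
    "If-Modified-Since": "Fri, 24 Mar 2017 22:46:08 GMT",
})
try:
    resp = urlopen(req)
    print(resp.status)        # 200: revalidate miss, fresh copy in the body
except HTTPError as err:
    if err.code == 304:       # revalidate hit: the cached copy is still valid
        print("cached copy is up to date")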

2.4.4 Gateways

Unlike an HTTP proxy, which is an intermediary between two parties communicating using the same protocol, a gateway allows protocol conversion. The term gateway is used in two different types of situations. One example is when we want to access a resource on an FTP server using an HTTP request. There must be a gateway on the way that understands the HTTP request, transforms it to FTP, contacts the FTP server, gets the result, and sends it back to the client as an HTTP response. Another use of this term concerns application servers. Nowadays, people use HTTP not only to access a specific resource, but also to interact with a wide range of applications. In this case, the gateway is part of the application server. It communicates with clients using the HTTP protocol and transforms their requests into commands for programs running on the server side.

2.4.5 Tunnels

Tunnelling is a way to encapsulate non-HTTP traffic using the HTTP protocol. It is used to bypass firewalls which do not allow a specific protocol, or to enable forwarding of the traffic through a proxy that does not support the intended protocol. To create such a tunnel, a request with the CONNECT method is sent to a remote proxy. The proxy establishes the connection with the remote server, and when the client receives a 200 Connection established response, it can start sending its data, which is blindly relayed through the proxy.

2.5 Authentication and secure HTTP

Sometimes it is desirable that the data sent through a network remains confidential, or that access to it is restricted. The following chapter describes the possibilities the HTTP protocol provides to achieve this goal.


HTTP comes with two authentication schemes: basic [7] and digest authentication [8]. Moreover, it provides a way to implement additional schemes using the authentication framework defined in RFC 7235 [9]. However, it does not provide a way to encrypt the data, and if encryption is desired, HTTPS should be used.

Basic authentication An HTTP server may divide its resources into groups called security realms and then associate different access rights with each realm. Once a client sends a request for a restricted document, the server returns a 401 Unauthorized response with a WWW-Authenticate header specifying the accessed realm. The client needs to know the username and the password, join these two together with a colon, and encode the result in BASE64 format. The resulting string is put in the Authorization header of a new request. If the credentials are valid, the server sends back the requested document. This authentication scheme cannot be considered secure because passwords are practically sent in plaintext. The scheme is supposed to be used in a friendly environment where confidentiality is not a necessary feature, but a convenient one. The secure way to use basic authentication is in conjunction with encrypted data transmission, such as SSL.
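The construction of the header is simple enough to show directly; in the sketch below, the username and password are made up:

# Building the Authorization header for basic authentication.
import base64

credentials = "alice:secret"   # username and password joined by a colon
token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
print("Authorization: Basic " + token)   # Authorization: Basic YWxpY2U6c2VjcmV0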

Digest authentication The scheme was invented to overcome the flaws of basic authentication. It added two key features. Firstly, the password is never sent as plaintext; instead, its message digest is used. The second important feature is the use of nonces – unique server-specified strings created for each request. These are added to the message digest to prevent replay attacks.

HTTPS The increasingly sensitive nature of the data travelling through the HTTP protocol requires a mechanism that ensures secure communication. HTTPS is basically HTTP sent over a secure channel created by SSL (Secure Sockets Layer)/TLS (Transport Layer Security). It protects the privacy of the whole HTTP message, including the method, URL and headers, and it supports server-side or mutual authentication. Nowadays, this protocol is widely adopted and it is used every time some level of security is required. Despite its popularity and wide range of use, this thesis will not deal

with HTTPS traffic. The main reason is that the URL and header fields used for the anomaly detection are encrypted in HTTPS and therefore unusable.

2.6 HTTP/2

HTTP/2 is the second major revision of the HTTP protocol, published in 2015. This new version is based on the SPDY protocol originally developed by Google. Compared to version 1.1, HTTP/2 leaves most of the high-level syntax intact; the changes are mostly related to data representation and transport. The new protocol is binary and thus more efficient to parse and less error-prone than the previous version. Other important modifications include multiplexing of requests and responses, header compression and a server push mechanism.

According to W3Techs, HTTP/2 is currently used by approximately 14% of all websites [10]. Despite the fact that encryption is not compulsory for the new version, the majority of web browsers decided not to support unencrypted HTTP/2 traffic. This is the main reason why the thesis only focuses on the previous version of the protocol.

3 User-Agent

HTTP is widely used in application software. The application software is often inhomogeneous; for example, it can differ in rendering capabilities. An HTTP server needs to know what type of application it is communicating with in order to adjust the content to the client’s needs. A demonstrative example is a server that uses different page layouts when sending a response to a browser on a mobile device and to an ordinary desktop browser. For this purpose, HTTP provides the User-Agent (UA) field, which identifies the browser or application being used and can also include some details about the underlying platform. According to the specification, the field is not compulsory, but the client is supposed to include it. Initially it had 3 main use cases [11].

∙ Statistical purposes – It may be interesting for a developer to know what kind of software people most commonly use to access their websites. Such information can be used for targeted advertisement.

∙ Tracing of protocol violations – Mostly used in the early days of HTTP. If a server is receiving a large amount of badly formatted requests, the administrator can check whether they are coming from the same application software. The administrator can then warn the developers of the client software about the protocol violation, or at least block requests from such software to protect the web server.

∙ Tailored responses – The most prevalent use of the User-Agent nowadays. Different content is sent to different users according to their client software. This technique is often denoted as User-Agent sniffing. The server can use special, browser-dependent features or send custom JavaScript or CSS files according to the capabilities of the receiving application. Even a completely different page layout may be sent to mobile devices. Nevertheless, the trend in web development is to avoid User-Agent sniffing, because the obtained information is not reliable.

3.1 Format

The RFC does not provide many details about how the User-Agent field should look. According to the latest version [3], the field should identify the user agent software and its significant subproducts. It consists of one or more product tokens, optionally followed by a comment. A product token contains a name and a possible version number. These product identifiers should be given in decreasing order of their significance. Furthermore, the RFC encourages clients not to use the identifiers of other applications in order to declare compatibility with them. However, this advice is often not taken into account. The current situation around User-Agents is chaotic because of the vague definition of the field’s format and purpose in the specification. While major browsers usually provide information about the format of their UAs and the meaning of individual tokens, the situation is even less transparent with non-browser applications.

3.1.1 Non-browser

Non-browser UAs are miscellaneous and information about them is often missing from the application documentation. If an unknown UA appears on the network, it is a non-trivial task to find the application it belongs to. In the best case scenario, the format is simple, including one token with the application name and its version:

Microsoft-CryptoAPI/10.0

This UA is sent by applications using the standard Microsoft Windows cryptographic library present on Windows 10. However, much more complex User-Agent strings appear on the network. A demonstrative example is the one sent by ESET antivirus software:

ERA Update (Windows; U; 32bit; RMV 1034; RAV 5.3.33.0; OS: 6.1.7601 SP 1.0 NT; Mirror; TDB 32668; x64s; APP era; PX 0; HWF: 01000000-2222-2222-0000-AAAAAAAAAAAA; PLOC en_us; RAF 1.0.0.1.; BPC 5.3.33.0; RACL 720.0.10; RALG 3731.0.90209.0.0.0.0.0.358.0; RART 0.0.0; RARI 0.-1; RAGR 1.11.0; RAPL 1; RANT 0; RARP 0; RAUP 1; RATS 8.71557)

It includes the software identification (ESET Remote Administrator), detailed information about the running operating system, a hardware fingerprint (anonymised in this example), and some application related data.


However, ESET does not specify this format anywhere, and the meaning of the individual subparts can only be guessed. User-Agents sent when playing certain online games may appear even more unusual:

ros ZxLmlgIv17oTL4HiBifK5lpiheQIYcrgWiPX

Such a UA is probably sent when playing an online version of the well-known game GTA (Grand Theft Auto) on the server rockstargames.com. This information is based on a thread [12] on Reddit where fans try to analyse the network traffic generated by this game. No other source of information could be found. Moreover, the User-Agent changes with every request and only the first three characters remain stable. There is another category of applications that do not identify themselves correctly. Instead, they pretend to be a browser. Some servers do not process requests that are not coming from regular browsers, so if a non-browser application wants to access such servers, it needs to fake its UA.

3.1.2 Browser

In the case of browser User-Agents, the situation is different and their format is more regular. It usually starts with the keyword Mozilla and a version, followed by platform details enclosed in brackets, the rendering engine used (with compatibility comments), and finally the browser name and version (possibly more than one):

Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36

Most contemporary browsers present themselves as Mozilla for compatibility reasons. In the example above, the underlying operating system is Windows 10. WOW64 indicates the presence of an x86 emulator that allows 32-bit applications to run in a 64-bit environment [13]. The client uses a version of Chrome from February 2017. Although the Safari keyword is present in the UA string, it does not mean that the Safari browser is installed on the user’s computer.

WebKit, KHTML and Gecko are examples of rendering engines. Such an engine is the component of the web browser responsible for content rendering. Although all of them are mentioned, none of them is actually used by

the client. This version of Chrome runs Blink, which was forked from WebKit in 2013 [14]. The technique when browsers claim to be something other than what they actually are is called User-Agent spoofing, and nowadays it is a common practice. There are only a few cases when a browser’s UA does not start with Mozilla (very old browsers being one of them). The User-Agents of Internet Explorer create further complications. For many years, it allowed third party applications to inject their identification into the UA. If an installed program adds its token to a specific registry key, it will appear in the UA. This leads to two main problems. Firstly, the field can become excessively long; secondly, it reveals too much information about installed software, which can be abused. Although Internet Explorer renounced this approach in 2010 with version IE9 [15], it is still not uncommon to see a UA of the following format:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; Microsoft Outlook 16.0.4498; ms-office; MSOffice 16)
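Because the browser format is comparatively regular, its main parts can be pulled apart with a simple regular expression. The pattern below is only an illustrative sketch for the common Mozilla/<version> (<platform>) <tokens> layout, not a complete parser:

# Illustrative (incomplete) split of a browser UA into its main parts.
import re

UA_PATTERN = re.compile(r"^Mozilla/(?P<version>[\d.]+) \((?P<platform>[^)]*)\) (?P<tokens>.+)$")

ua = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")
match = UA_PATTERN.match(ua)
if match:
    print(match.group("platform"))  # Windows NT 10.0; WOW64
    print(match.group("tokens"))    # AppleWebKit/... Chrome/... Safari/...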

3.2 Compatibility and spoofing

A large part of the User-Agent string is concerned with compatibility. It all began in 1996, when Netscape (presenting itself as Mozilla in the UA) came with new features, including the idea of frames. A frame is a part of a web page that is able to load content independently of the rest of the document. Other browsers of that period did not support frames. This was the origin of User-Agent sniffing. Developers started to examine the UA string and they sent the content with frames only to Mozilla browsers. When other browsers (Internet Explorer, for instance) started to support these advanced features later on, they needed to refer to themselves as Mozilla to receive the correct content.

The same happened with rendering engines. Gecko was an engine developed by the Mozilla Project. To announce compatibility with the engine, other browsers started to add the "like Gecko" string to their UA. Konqueror was one such browser. In reality, it was based on another engine – KHTML, developed by the KDE project. Later, when WebKit was


created based on KHTML, the browsers using it (Safari and Chrome) started announcing compatibility with all three rendering engines [16]. User-Agent sniffing, along with the lack of reliable information and the large number of applications violating RFC recommendations, makes User-Agent analysis a rather challenging task.

3.3 User-agents in MU network

The number of different User-Agents used today is immense, and it still increases with the expansion of software on mobile devices connecting to the Internet. There is a limited amount of desktop programs that frequently connect to the Internet, and people usually use one or two browsers to access web servers. In the world of smartphones, it works differently. The most popular websites (Facebook, Instagram or various online games) are not accessed through the browser. Instead, they are accessed through dedicated applications, and these applications usually have specific User-Agents.

Count  User-Agent
6575   Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
4869   Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
4812   Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
4411   Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
2585   Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0

Table 3.1: Popular browser-like UAs in MU network

32688 distinct User-Agent strings were collected on the network of Masaryk University on the 10th of March. Approximately 13000 of them belong to browsers. Table 3.1 shows the 5 most popular browser UAs and the number of different IP addresses using them.


The results are not surprising; most often, people were using some combination of Windows 7 (NT 6.1) or Windows 10 with the Firefox or Chrome browser. The second table presents the most commonly used non-browser User-Agents.

Count  User-Agent
4977   urlgrabber/3.10 yum/3.4.3
4123   Microsoft-CryptoAPI/10.0
3240   Microsoft-WNS/10.0
3033   Microsoft-CryptoAPI/6.1
2975   urlgrabber/3.9.1 yum/3.2.29

Table 3.2: Popular non-browser UAs in MU network

In the first place is a User-Agent belonging to the Linux software package manager YUM. Urlgrabber is its subpart – a Python package used for file fetching. This UA appears on the network during package installation on RPM-based Linux distributions. It is followed by Microsoft-CryptoAPI, indicating usage of the standard Microsoft cryptographic library. For instance, this library is used shortly after the start of the system, when Windows connects to its PKI repositories and updates its certificate revocation lists [17]. The last application not yet mentioned is the Windows Push Notification Services (WNS). It enables third-party developers to send updates from their own cloud services [18].

Although some User-Agent strings are common to many users, they represent only a minority. Only 0.6% of the intercepted UA types appeared with more than 100 distinct IP addresses. On the other hand, 23849 User-Agents appeared with only one IP address.
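Statistics like those in the tables above can be computed with a few lines of Python. The sketch below assumes flow records reduced to (source IP, User-Agent) pairs, which is a simplification of the real export format; the sample records are made up:

# Distinct source IP addresses per User-Agent string; 'flows' is a
# made-up stand-in for records read from the collector.
from collections import defaultdict

flows = [("10.0.0.1", "Microsoft-CryptoAPI/10.0"),
         ("10.0.0.2", "Microsoft-CryptoAPI/10.0"),
         ("10.0.0.2", "urlgrabber/3.10 yum/3.4.3")]

ips_per_ua = defaultdict(set)
for ip, ua in flows:
    ips_per_ua[ua].add(ip)

for ua, ips in sorted(ips_per_ua.items(), key=lambda item: -len(item[1])):
    print(len(ips), ua)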

4 Network monitoring

Network administrators usually want to have an overview of the traffic passing through their network. Network monitoring allows them to detect various anomalies, ongoing attacks or misconfigured devices. There are two main approaches to network monitoring: deep packet inspection and flow monitoring.

During deep packet inspection (DPI), whole packets, not only headers, are captured and used for analysis. Such an approach provides the administrator with detailed information about the traffic and allows enhanced detection methods. However, there are some disadvantages. It is computationally demanding, and if captured packets need to be stored, the requirements for storage space are immense. Moreover, it raises some privacy issues. Deep packet inspection tools are represented by programs such as tcpdump [19], Wireshark [20] or The Bro Network Security Monitor [21].

On the other hand, only packet headers are examined during flow monitoring. Packets are grouped according to IP addresses, ports and the protocol used to form a flow. Additionally, each flow contains information about the time when it started, how many bytes and packets were transferred, and other possible attributes. Classic flows are unidirectional, which means that a request and its response will be part of two distinct flows. Information that can be extracted from flows is not as detailed as in the case of DPI, but it still provides a useful overview of the network traffic. Analysis of flow data can provide answers to questions such as: "Which IP address generated the most traffic?" or "With how many distinct IP addresses did user A communicate?". Moreover, it allows for the detection of various anomalies, such as port scanning. The advantage of this approach is decreased demands on performance and storage.

The first widely adopted technology for flow monitoring was NetFlow, developed by Cisco. The latest version, NetFlow v9, was also adopted as an RFC standard [22]. This version also became the base for a new open standard, IPFIX (Internet Protocol Flow Information Export) [23]. The newest versions of both mentioned formats are template-based, so it is possible to extend basic flows with some higher-level information (e.g. DNS, HTTP).


Figure 4.1: Common architecture for flow monitoring [24]

Figure 4.1 illustrates how a network deploying flow monitoring may look. The device generating the flows, also known as an exporter, can be an ordinary router or a standalone device called a probe. Such a probe is usually connected to the network by a TAP (a device duplicating the traffic, positioned in inline mode) or a SPAN port (traffic passing through selected ports of the switch or router is mirrored to this port) [25]. Gathered flows are later sent to a collector, where they are stored and can be used for further analysis. A flow is sent to the collector only when it has already finished. To determine the end of a flow, two timeouts are used:

∙ Inactive timeout – If no data belonging to the flow was sent during a predefined period, the flow is considered terminated and it is exported.


∙ Active timeout – It is used to divide flows which last too long. If a flow lasts longer than the specified value, it is exported and any new incoming packets become part of a new flow.
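Both timeouts can be illustrated with a simplified piece of exporter logic; the constants are illustrative, as real probes make them configurable:

# Simplified exporter decision for the two timeouts (values are made up).
INACTIVE_TIMEOUT = 30   # seconds without a packet before export
ACTIVE_TIMEOUT = 300    # maximum lifetime of a single flow record

def check_flow(flow, now):
    """Decide whether a cached flow record should be exported."""
    if now - flow["last_packet"] > INACTIVE_TIMEOUT:
        return "export"              # the flow is considered terminated
    if now - flow["first_packet"] > ACTIVE_TIMEOUT:
        return "export_and_restart"  # split an overly long flow
    return "keep"

print(check_flow({"first_packet": 0, "last_packet": 290}, 310))  # export_and_restart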

There is a wide range of tools for flow monitoring, both open-source and commercial. YAF [26], nProbe [27] and Flowmon Probe [28] are examples of available exporters. In the role of the collector, nfdump [29] or IPFIXcol [30] can be used. In some cases, a single tool may incorporate the functionalities of both exporter and collector.

4.1 Experiment setup

All methods discussed in the thesis are verified practically on the network of Masaryk University. This section briefly describes the setup of the experiments, the tools used and possible shortcomings. Extended HTTP flows are obtained using a Flowmon probe processing raw packets. The probe is able to create ordinary network flows, but also to extend them with information from various application level protocols, HTTP being one of them. The probe exports the HTTP method, URL, Host, User-Agent and Referer headers from the request message, as well as the Status Code and Content Type from the response. Exported flows can be either unidirectional or bidirectional (i.e. an HTTP response is associated with its request). Further, the probe can either create a new flow for every HTTP request, or aggregate requests belonging to the same TCP connection. In the latter case, the URL, Referer and other fields are taken from the first request. The data used was obtained from two sources with slightly different settings.

1. One probe, on the edge between the university network and the Internet, monitoring part of the incoming/outgoing traffic. It exports collected flows in JSON format, which is later processed by a Python script (flowProcessing.py). This probe is set to create bidirectional flows, and these flows are divided based on HTTP data (one flow = one request).

2. A group of probes in different parts of the network, monitoring all traffic leaving from or coming to the university network. Exported flows from all probes are sent to the collector (IPFIXcol)


and they are stored in an fbitdump [31] database. Flows are unidirectional and all HTTP requests of the same TCP connection are aggregated into one flow.

The source of the data is chosen based on the requirements of a specific detection method. During the detection of brute force attacks, it helps to know the response code assigned to each request, so the first setup is used. When observing network scanning, it is interesting to see the URLs of all sent requests. Again, this can be achieved using the first setup. In all other cases, the second option is used.

4.2 Shortcomings of monitoring method

When a packet containing an HTTP header is fragmented, the probe does not reassemble the fragments, and the information gathered about the HTTP request remains incomplete. This usually happens when the sent URL is excessively long. Effects of this fragmentation were observed during the User-Agent analysis. Sometimes, strings obviously belonging to a valid browser UA appear, but they are cut at a seemingly random place:

Mozilla/5.0 (compatible; MSIE 9.0; Windows
Mozilla/5.0 (compatible; MSIE 9.0; Wind
Mozilla/5.0 (compatible; MSIE 9.0; W
Mozilla/5.0 (compatible; MSIE 9.
Mozilla/5.0 (compatible; MSI
Mozilla/5.0 (compatible; M
Mozilla/5.0 (compatib
Mozilla/5.0 (compat

The existence of these shortened UAs may slightly influence the performance of some detection methods. For example, it may seem that a device used a large amount of different UAs, while in reality they are all the same. There is no special flag indicating the affected flows, so it is impossible to exclude such requests automatically.
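One heuristic that could partly mitigate the problem – it is not part of the thesis implementation – is to treat a UA string as a probable fragment when it is a strict prefix of another UA observed on the network:

# Heuristic sketch (not from the thesis scripts): flag UA strings that
# are strict prefixes of another observed UA as probable fragments.
def probable_fragments(user_agents):
    uas = sorted(set(user_agents))  # a prefix sorts directly before its extensions
    return {shorter for shorter, longer in zip(uas, uas[1:])
            if longer.startswith(shorter)}

seen = ["Mozilla/5.0 (compatible; MSIE 9.0; Windows",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "curl/7.47.0"]
print(probable_fragments(seen))  # only the truncated MSIE string is flagged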

5 Network Scanning

Many different attacks have their first step in common – network scanning. Often, the attacks are not targeted at a particular application on a specific server. On the contrary, the attacker knows about some vulnerabilities that he can exploit, and he searches for an opportunity. Suppose a new WordPress (a widely adopted content management system [32]) vulnerability was revealed. The attacker, or some infected devices under his control, performs a scan of selected networks. He searches for devices with open ports running an HTTP server. Once he has the list of HTTP servers, he can contact each of them while trying to find out whether they are running an instance of WordPress. This can be achieved by searching for files which are standard for WordPress (for instance wp-login.php). After a successful scan, he can start with the exploit itself.

Port scanning A series of simple messages sent to well-known ports of one or multiple devices in the network. The goal is to reveal running services on the device. Port scanning can be divided into two categories: horizontal and vertical. During a horizontal scan, multiple IP addresses are scanned for a specific port number. On the other hand, a vertical scan targets numerous destination ports on a single host. Both techniques produce specific traffic patterns on the network, which makes them relatively easy to detect, and even the information contained in basic network flows is sufficient to reveal potential scanners [33].

HTTP scanning This term is often used in the context of antivirus software, where it denotes the inspection of received HTTP traffic in order to find potentially malicious content. However, for the purpose of this thesis, the term HTTP scanning is used to describe behaviour similar to horizontal port scans, but on the application level. It is a search for a resource with the same name (same path) on a large number of distinct servers. For instance, an attacker can search for a WordPress login page in order to perform a brute force attack. Ordinary network flows do not contain enough information to reliably identify these scans. However, when flows are extended with HTTP fields, the detection becomes straightforward.


Directory traversal This attack can be considered an analogy to the vertical scan. It is done by searching for a concrete file on various paths of one server. The attacker usually tries to locate commonly used operating system files like /etc/passwd that are outside of the web server’s root directory. If the application does not properly validate user-supplied filenames, the attacker can use the sequence ../ to access a parent directory [34].

5.1 HTTP scanning

5.1.1 Incoming traffic

An example of a successful scanning detection method was presented in 2015 after research conducted on the MU network [35]. The authors assumed that guests usually access only a limited number of hosts on the MU network, and that the requested paths differ from host to host. They defined an HTTP scan as follows:

HTTPScan(A) ⇐⇒ A = {F | ∀F, F′ : F(srcip) = F′(srcip) ∧ F(dstip) ≠ F′(dstip) ∧ F(path) = F′(path)} ∧ |A| > threshold

In other words, their method counted the number of distinct destination IP addresses the user contacted with the same path set in a request. If the value crossed the threshold, it was considered a scan. The authors claimed that with the threshold set to one fifth of the number of web servers in the network, the method achieves good results, and they did not observe any false positives.

The same experiment was conducted as a part of this thesis with a slightly modified method. Instead of destination IP addresses, the method counts the number of distinct hosts. This number may be higher because more virtual hosts can be scanned on one IP address. Two special paths are omitted from the calculations: the root directory (/) and /robots.txt. This file is commonly accessed by web crawlers because it is designed to give them instructions about files that crawlers are authorised to access. The experiment uses data collected on the 31st of March by the single probe on the edge of the university network. The probe only collects data from one part of the network, and so the threshold is set to only


20. The results confirm the success of the method. It revealed 6 malicious users and one probable false positive. The attackers were looking for various resources, including:

/wp-login.php
//cgi-bin/test.cgi
/xmlrpc.php
/wordpress/

The highest number of distinct paths searched by one attacker was 15; all were related to WordPress. The highest number of scanned hosts was 53. It seems to be a low value, but one should take into account that only a small portion of the network is monitored by the probe used. The fact that the whole HTTP URL (i.e. path + query) was used for the analysis revealed one attacker with a different motive. He contacted 47 servers with the following request:

/index.php?pg=ftp://genesys:[email protected]/envi.php?

Suspicious PHP code is still present at the specified location. The attacker probably wanted to find vulnerable servers that would download and run the file. Although this behaviour is not a network scan, it is definitely malicious.
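The counting at the core of the method can be sketched in a few lines of Python. The sketch assumes flows reduced to (srcip, host, path) tuples and is a simplification of the attached scanDetection.py script, not its exact code:

# Simplified incoming-scan detector: count distinct hosts contacted
# by each source IP with the same path.
from collections import defaultdict

THRESHOLD = 20
IGNORED_PATHS = {"/", "/robots.txt"}  # legitimately requested everywhere

def detect_scans(flows):
    hosts = defaultdict(set)          # (srcip, path) -> set of contacted hosts
    for srcip, host, path in flows:
        if path not in IGNORED_PATHS:
            hosts[(srcip, path)].add(host)
    return [(srcip, path, len(h))
            for (srcip, path), h in hosts.items() if len(h) > THRESHOLD]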

5.1.2 Outgoing traffic

If the method from the previous section is applied to outgoing traffic, it generates too many false positives. It is hard to predict how many distinct servers are visited by an ordinary user throughout one day. Moreover, there are files located on a standard path of many servers and regularly requested by users. favicon.ico is an example of such a file. It is usually located in the root directory and it contains a small icon associated with the website, which is usually displayed in the browser’s address bar.

To adapt to different conditions and to decrease the number of false positives, the original method was modified in this thesis. The changes are based on two main assumptions. Firstly, an attacker usually does not fill in the Referer field during a network scan, while in normal traffic the field is usually set. Of course, this does not apply to 100% of cases and it can lead to unnoticed scans, but on the other hand, it significantly decreases the number of false positives. The second assumption is

29 5. Network Scanning that a normal user visits more subpages of the server he accesses (at least additional requests for JavaScript and CSS files). On the other hand, the attacker sends just one request to each host during the scan. Actually, he can send more requests if he scans more distinct paths, but he probably does not visit any page out of his scanning list. The new detection method mimics the behaviour of the original method and creates a list of potential attackers in its first step. However, only requests without Referer are taken into account. Then, a scanning list is created for each attacker, which is a list of paths he visits repeat- edly. In the next step, the method checks the number of distinct paths for each attacker and each visited hostname. If the user only accesses paths from his scanning list on most of the visited websites, he is considered malicious. On the contrary, if he requests various other resources on most of the websites, he is considered benign. Method implementation can be found in file scanDetection.py which is part of electronic attachment.

User  Number of hosts  Requested path
A     1221             /wp-login.php
A     241              /xmlrpc.php
B     27944            /wp-login.php
B     14512            /xmlrpc.php
B     58               /cgi-sys/suspendedpage.cgi
B     109              /index.html
B     109              /not_found
B     86               /wordpress/wp-login.php
B     100              /wp/wp-login.php
C     35               /t51.2885-19/11906329_960233084022564_1448528159_a.jpg
D     36               /favicon.ico
E     39               /server-status

Table 5.1: Scans of the external network

Table 5.1 presents scans of the external network detected with the threshold set to 30. Only the last three events are probable false positives and they can be eliminated by slightly increasing the threshold. A problem of this method is that for extensive scans (>10,000 hosts), it is time-consuming to count the number of distinct visited paths on each host. However, these extensive scans probably do not need to be verified at all, and the verification can be made only for smaller scans that could be mistaken for legitimate traffic.

5.2 Directory traversal

Discovery of this type of attack is straightforward. It is enough to search for the sequence ../ (or possibly ..\ on Windows-based systems). However, initial experiments show that not every occurrence of this sequence in the URL is malicious. Two representative examples of non-malicious requests are explained below:

is.muni.cz/predmety/predmet.pl?id=340448;zpet=../predmety/katalog.pl

The sequence ../ is placed in the query part of the URL. The server saves the path from which a user accessed this subpage into the variable zpet. The saved path is used later, once the user clicks on the link Back to the previous page. There is nothing malicious about such a URL, as this path traversal was the intention of the server. The second example introduces another regular use of this sequence:

www.parea.sk/sidebar/loga-partneru/image_sidebar.php?file=../../images/loga/tatralandia.jpg

When someone visits the site www.parea.sk, his browser automatically sends this request, because the original page uses the ../ notation in its source code to refer to images.

Although a reference to the root directory may appear in the URL, it does not happen frequently. Moreover, attackers usually need to traverse more directories during the attack, so the sequence is repeated. However, no ordinary request that was caught used more than two repetitions. The created detector uses a regular expression to search for at least 3 consecutive sequences. Over 24 hours, it captured 136 requests and all of them were part of malicious activity. A short example of captured requests follows:

portal.dis.ics.muni.cz/cms/index.php?page=../../../../../../../../../etc/passwd
portal.dis.ics.muni.cz/static/../../../../../../../../../etc/passwd.
portal.dis.ics.muni.cz/..%5c..%5c..%5cboot.ini

The most frequently accessed files were /etc/passwd, boot.ini and wp-config.php.
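A check along these lines can be written in a few lines of Python. This is a hedged sketch, not the exact regular expression used by the detector; it additionally decodes %-encoding first so that variants such as ..%5c are caught.

import re
from urllib.parse import unquote

# Flag URLs containing at least three consecutive "../" (or "..\") sequences.
TRAVERSAL_RE = re.compile(r'(\.\./|\.\.\\){3,}')

def is_traversal(url: str) -> bool:
    return bool(TRAVERSAL_RE.search(unquote(url)))

For instance, the third captured request above matches only after the %5c sequences are decoded to backslashes.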

5.3 Summary

The chapter describes how to detect HTTP scans with the help of extended network flows. If a security administrator intercepts a scan of an internal network, he can block the responsible IP address and so prevent a possible attack that may follow the scan. Detection of scans in this direction is simple, but the chapter also presents a modified method targeted at outgoing traffic. The method successfully revealed devices on the university network that were scanning external IP addresses. Such behaviour indicates either a malicious user on the MU network or, more probably, an infected device. Later, section 6.3 will prove that at least one of these devices was infected.

The last part of the chapter presents a slightly different concept. Directory traversal is a vulnerability that allows the attacker to access restricted files. It appears in this chapter because it resembles a vertical scan. The performed experiment has shown that the directory traversal attack occurs often and that it can be successfully revealed using a simple regular expression.

6 Brute Force Attacks

A large number of web pages provide a way to authenticate the user. The most straightforward method is to use a combination of username and password. When users demonstrate the knowledge of the right combination, they are authenticated and they can access secured web pages. This authentication method can become a target for attackers trying to guess the correct combination of username and password. When the attacker systematically checks all possibilities, it is called a brute force attack. This type of attack is resource and time consuming. To try all possibilities for passwords which are 8 characters long and only use lower-case letters of the English alphabet, the attacker needs to test 26^8 = 208,827,064,576 passwords. Such an attempt is easily detectable because a normal user will not send such a large amount of requests to one website.

However, there are more efficient ways to perform a similar attack. The password picked by the user is usually not a random string. Passwords like "12345", "admin", "password" or "qwerty" are very common even though it is evident they are insecure. The only thing an attacker needs to do is to find a list of the most commonly used passwords and try it. This strategy is known under the name dictionary attack. It reduces the number of guesses the attacker needs to attempt to thousands or even hundreds. As a result, the detection is not as straightforward.

This chapter relies on two research papers focusing on HTTP flows. The first one [35] was conducted at Masaryk University. The authors have shown that when flows are aggregated on the source and destination IP address and URL (host + path), records with an excessively large number of flows (a few thousand) can mostly be attributed to brute force attacks or to a communication between the client and the proxy server. However, dictionary attacks using a small number of requests (several hundred) will stay hidden in legitimate traffic. The second related paper [36] suggests using the size of HTTP requests to further narrow down the criteria to detect potentially malicious traffic.

6.1 Targeted authentication methods

To propose a successful detection method, it is essential to distinguish between the different types of authentication that can be abused by the attacker. Possible authentication mechanisms are as follows:

∙ Native HTTP authentication – The authentication process is part of the HTTP protocol and it comes in two variants, Basic and Digest authentication. It uses a dedicated window external to the web page itself which requests user credentials. These credentials are placed in the request header, and the information regarding whether the process was successful or not is contained in the status code.

∙ Form-based authentication – This is the most widely used authentication method. Credentials are entered into web forms and processed by the application running on the server. It is up to the developer whether the content of these forms will be sent using a GET or POST request. POST is more secure and it is used significantly more often.

∙ XML-RPC – It allows making remote procedure calls over the Internet. A procedure call is incorporated in an XML file sent through HTTP. This functionality can be found in WordPress, for instance. It is not primarily used for user authentication, but it can be misused by the attacker. Some XML-RPC calls require the username and password and then they provide the user with a confirmation of whether the credentials are correct [37].

6.2 Attacks against HTTP based authentication

This category of attacks is the easiest to detect. Status code 401 Unauthorized is returned by the server if the user attempts to access restricted web content without an Authorization header, or if the header contains invalid credentials. A reliable indicator of compromise is a large amount of HTTP flows from the same source IP address to the same resource (host + path) with a 401 response.

All proposed detection methods for brute force attacks were evaluated on data collected throughout 3 days (30. 3. – 1. 4. 2017) on one probe located at the edge of the MU network. The probe was set to create bidirectional flows. These are indispensable, as the status code assigned to the request is needed for the detection. The threshold on the number of flows was set to 100. On the 30th of March two events were detected (see table 6.1). On both specified URLs there was actually a prompt for authentication. Results from the other two days show a similar outcome.

Flows  From        To          Path
189    MU network  Internet    /chat/chat/channels64/3349
1009   Internet    MU network  /admin/advanced

Table 6.1: Detected attacks against native HTTP authentication
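This check is simple enough to express directly over flow records. The sketch below is illustrative only and assumes the same flow-dictionary fields as the earlier sketches, plus a status field with the HTTP status code.

from collections import Counter

def detect_http_auth_bruteforce(flows, threshold=100):
    # Count flows per (source IP, host, path) answered with 401 Unauthorized.
    counts = Counter(
        (f["srcip"], f["host"], f["path"])
        for f in flows
        if f["status"] == 401
    )
    return {key: n for key, n in counts.items() if n > threshold}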

6.3 Attacks against authentication using POST method

Malicious authentication using the POST method is the most common; therefore, the detection of these kinds of attacks is given the most attention. The initial strategy is based on the following assumptions:

∙ Form-based authentication is not a part of the HTTP protocol. Even a request with invalid credentials is a valid HTTP request. The expected status code is 200 OK.

∙ Only a small amount of information needs to be delivered to the server. In the most simple case, the content of only two input fields needs to be sent, therefore the number of sent bytes will be limited. This assumption is supported by [36]; in their study, the authors tested various brute force attacking tools against the most popular content management systems and measured the number of bytes and packets the tool used. They estimated the lower (363) and upper (1130) boundaries on the number of sent bytes.

∙ During one attack, the number of sent bytes per request is stable. It may vary slightly according to the size of the username and password.


∙ The number of bytes sent in a response should stay stable as well. The server answers with the same HTML page announcing that the password was incorrect.

To estimate the variability of the number of sent bytes, the coefficient of variation is used as a metric. It is a statistical measure of the dispersion of data points around the mean, defined as the ratio of the standard deviation to the mean [38]:

cV = σ / µ

For this task it performs better than the variance. As it is relative to the mean, it can be used to compare the spread of datasets that have different means (web pages of different sizes). Prior to computing cV, outliers were eliminated from the dataset using the Z-score test [39]. This step prevents the computed variability from being influenced by a small number of unexpectedly longer or shorter flows. In the first attempt, the cV threshold was set to 10% for both incoming and outgoing bytes. The detection method was applied to data collected on the 31st of March. If the number of flows from one IP address to one resource exceeded 100 and all mentioned conditions were satisfied, the flows were marked as a potential attack.
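Computed over the byte counts of the aggregated flows, this looks roughly as follows. The Z-score cut-off of 3 is an assumption for illustration; the thesis does not state the exact value used.

import statistics

def coefficient_of_variation(byte_counts, z_cutoff=3.0):
    # Remove outliers whose Z-score exceeds the cut-off (assumed value),
    # then return the standard deviation divided by the mean of the rest.
    mu = statistics.mean(byte_counts)
    sigma = statistics.pstdev(byte_counts)
    if sigma == 0:
        return 0.0
    kept = [b for b in byte_counts if abs(b - mu) / sigma <= z_cutoff]
    return statistics.pstdev(kept) / statistics.mean(kept)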

Result of the first experiment In total, 49 events were reported; 14 against servers inside the university network and 35 potential attacks coming from the university to the Internet. Only one incoming attack seems to be a false positive – 1046 requests to the server loschmidt.chemi.muni.cz. On this server, a publicly available classifier for the prediction of the effect of amino acid substitutions is running, which explains the large number of POST requests. The 13 other attacks were either against the XML-RPC service or against a WordPress login page situated on the standard path /wp-login.php. The largest of the attacks consisted of 10090 requests against the server blokexpertu.cz/xmlrpc.php.

Most of the outgoing flows marked as attacks were false positives. Among them, there were a few game servers, advertisement-related servers and some unclassifiable websites. Only one event seemed suspicious – 121 requests for velatice.cz/index.php. There is a login form on the requested path of the website.


Manual attack search In the next step, POST requests between a single host and a resource appearing more than 100 times in the same day were reviewed manually. Because of the large volume of data, only requests towards the university network were checked to find attacks that could have been overlooked by the detection method. The manual inspection revealed 24 probable attack attempts, twice as many as the automatic method. The main characteristics of the undetected attacks are summarised in the table below.

Flows  Src  Host                       Path           Status  Request bytes    Response bytes
                                                      code    median  cV       median  cV
235    A    www.socstudia.fss.muni.cz  /wp-login.php  404     918     19.7%    34506   1.6%
122    B    psych.fss.muni.cz          /wp-login.php  404     988     10.8%    45103   0.4%
239    B    psychoterapie.fss.muni.cz  /wp-login.php  404     520     0.4%     3088    0.0%
240    B    socstudia.fss.muni.cz      /wp-login.php  404     675     8.1%     15088   0.0%
250    C    impact.cjv.muni.cz         /xmlrpc.php    301     681     0.4%     732     0.0%
420    D    katedry.ped.muni.cz        /wp-login.php  301     586     0.6%     543     0.0%
389    E    90.muni.cz                 /wp-login.php  403     820     1.2%     554     0.05%
10104  F    blokexpertu.cz             /xmlrpc.php    200     1306    0.5%     866     0.0%
103    G    koncepce.knihovna.cz       /wp-login.php  200     551     5.0%     5969    11.3%
7277   H    blokexpertu.cz             /xmlrpc.php    200     1286    0.3%     866     0.06%
803    I    polit.fss.muni.cz          /xmlrpc.php    200     85526   0.4%     62012   3.1%

Table 6.2: Attacks unnoticed during the first experiment

The first four events in the table seem to be unsuccessful attack attempts. The attacker probably wanted to attack the WordPress login page without checking that this login page is actually present on the provided path. The fact that the attacker tried the same thing on 3 different sites reinforces the idea that his behaviour is malicious. In the next two lines, the response code is 301 Moved Permanently; requests for these resources are redirected to use the HTTPS protocol. Whether the attacker later tried the attack through HTTPS cannot be verified, since HTTPS data was not collected. The next four events just slightly crossed the thresholds on either the stability or the median of the amount of bytes. However, the last event crossed the threshold on sent bytes significantly. According to [40], it can still represent an attack. There is a possibility that the attacker used the XML-RPC system.multicall method, which allows executing multiple methods inside a single request, so the attacker can make hundreds of guesses inside one request.


To cover all these attacks, the original criteria needed to be loosened. However, that brings additional false positives.

Similarities between attacks To increase the accuracy of the detector, all captured attacks were analysed to find their common properties.

∙ Timing – The duration of the attacks varies between 10 minutes and 20 hours, and the frequency of requests from 10 per second to 20 per hour. As a result, timing was not chosen as a metric.

∙ Referer – During all observed attacks, the Referer header was either not set or it was always the same. This assumption may reduce false positives related to advertisement tracking sites. Requests to such sites often differ in the Referer field.

∙ Content Type – The answer to an unsuccessful login is normally a simple HTML page, or an XML file in the case of an XML-RPC attack. In both cases, the content type contains the string "text". An example of a false positive that will be excluded if the Content Type is checked is the request for the application on loschmidt.chemi.muni.cz returning the JSON format.

∙ Number of visited paths – When the attacker performs a brute force attack, he usually focuses only on one resource on the server. He does not use a browser, so additional resources (JavaScript or CSS files) are not automatically requested, and he does not browse other subpages. The disadvantage of this assumption is that it does not hold for non-automatic attacks where the attacker first explores the website and then chooses an attack technique.

∙ Missing similar requests from other users – If a specific resource is accessed repeatedly with POST requests from a larger number of distinct IP addresses, it is most likely not an attack. It is probable that there is some specific service running on the target server that requires repeated POST requests. For example, 30 different addresses that requested dmd.metaservices.microsoft.com/dms/metadata.svc more than 50 times a day are not considered malicious. On the other hand, if no other device performs a large amount of POST requests for the resource, the one that does is suspicious. This assumption may limit the disclosure of distributed brute force attacks, so it is better to apply it only to outgoing traffic.

Adapted detection method The information gathered during the previous steps was used to improve the detection. The enhanced detector uses a threshold of 20% for the coefficient of variation in the number of bytes per request, 15% for bytes per response, and no upper boundary on the message size. The Referer needs to either not be set or to always contain the same value, and the string "text" should be present in the response Content Type. The detector checks for status codes 200, 403, 301 and 404 (it is questionable whether to check for attacks on non-existing resources, but such attempts are still a sign of malicious intentions, so it may be useful). The data obtained in the first step is further analysed to limit false positives. The event is not reported if the same resource was contacted by at least 5 distinct devices more than 50 times, or if the source IP address contacted at least 10 distinct paths on the target host. The method was tested on data gathered throughout 3 days and the results are presented in table 6.3. The improved method missed almost no known attacks and, at the same time, significantly decreased the number of false positives.

                        Incoming traffic                    Outgoing traffic
Date            Events  True pos.  False pos.  False neg.   True pos.  False pos.
30th of March   32      10         0           2            1          21
31st of March   38      24         0           0            1          13
1st of April    21      10         0           1            0          11

Table 6.3: Results of brute force detection using the enhanced method
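The conditions of the enhanced detector can be summarised in one predicate over an aggregate of flows sharing the same source IP, host and path. The sketch below uses hypothetical field names and the coefficient_of_variation helper from above; it illustrates the rules, it is not the actual attachment code.

def is_post_bruteforce(agg):
    # agg: one (source IP, host, path) aggregate of bidirectional flows;
    # all field names below are illustrative assumptions.
    return (
        agg["flows"] > 100
        and agg["status"] in (200, 301, 403, 404)
        and coefficient_of_variation(agg["request_bytes"]) < 0.20
        and coefficient_of_variation(agg["response_bytes"]) < 0.15
        and len(agg["referers"]) <= 1                 # unset or always the same
        and "text" in agg["content_type"]
        and agg["paths_on_host"] < 10                 # attacker sticks to one resource
        and agg["other_heavy_clients"] < 5            # applied to outgoing traffic only
    )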

On the 30th of March, interesting behaviour was revealed. The 424 requests for uromatalieslave.space/index.php are not caused by a brute force attack, but they are still malicious. It is the communication of an infected device with its command and control centre. The device is infected by the Sathurbot backdoor trojan [41], which is spread through a Baywatch 2017 movie torrent file. The malware was discovered because of the host it contacted. This malware generates a list of hostnames and then probes them for the presence of WordPress. Once it finds one, it tries to log in with predefined credentials. According to the analysis by ESET [41], it tries only one password and moves on. However, the captured data indicates that the malware contacts the discovered login page several times, but never more than 40 times, which is the reason why these attempts were not captured by the brute force detector. On the other hand, the first part of its attack was revealed by the HTTP scan detector.

6.4 Attacks against authentication using GET method

In the case of a brute force attack against GET forms, the interesting flows are the ones that share the same IP address, host and path. The query part of the URL changes because the credentials are placed there. An attack URL will contain at least two parameters (username and password), but most likely it will not be much longer. A login form usually appears on the main page of the website and no special parameters are part of the URL; 150 characters seems to be a reasonable upper boundary. Similarly to the previous attack category, the expected length of the answer should not vary too much (cV < 10%) and the expected answer is 200 OK. The assumptions about the Referer and Content Type from the previous section are also applicable. When applied to the data from the 31st of March, the detector reported 46 events. Several suspicious URLs are given below:

/mshowitemsajax.php?tuser=sdhr1192&d=1490991182996&ver=4 /mshowitemsajax.php?tuser=1919ayut&d=1490990112326&ver=4 /mshowitemsajax.php?tuser=1919ayut&d=1490988895764&ver=4

/kurikulum/index?do=login&sectok=8501bcd344e53fb92792ba6d8167ae14
/kurikulum/index?do=login&sectok=6a9e4f5a48015b09d1a82a1a050beb0d
/kurikulum/index?do=login&sectok=6d61e4f81481648ca7e78ed48b66ff5f

All of them are benign and do not represent any authentication attempt. To refine the detection method, examples of real-world attacks are indispensable. A manual inspection of the flows used in the previous section is not feasible due to a much larger amount of data. Instead, only repeating flows containing the strings "pass", "login" or "user" were reviewed manually. This inspection of the flows from 3 days revealed


only one brute force attack on the 30th of March with URLs of the following format:

/phpmyadmin/index.php?pma_username=root&pma_password=host /phpmyadmin/index.php?pma_username=root&pma_password=123456 /phpmyadmin/index.php?pma_username=root&pma_password=backup

The example above confirms all the assumptions made before and therefore validates them as correct. Unfortunately, one attack example is not enough to confirm or reject the success of the detection method. Because this type of attack is not as popular as the POST form attack, real-world examples are missing in the dataset. Moreover, the attack URL is almost indistinguishable from a valid HTTP request. For these reasons, this thesis does not focus further on this type of attack.

6.5 Summary

To conclude, HTTP extended flows can significantly help with the detection of brute force attacks. Although the method was not very successful with attacks against GET form authentication, it behaved much better with the most popular authentication, based on POST forms. It is impossible to determine the exact precision and recall values for the method in the absence of labelled data, but at least for incoming traffic, the method managed to reveal highly probable attacks with almost no evident false negatives. The performance of the method may be increased if it is adjusted to a specific network, for example by whitelisting; however, it will never be able to reveal all possible types of brute force attacks.


7 Code injection

Web applications usually process some type of user input, for example values filled in forms. Every input should be considered untrusted and thus properly validated. If the application programmer does not take sufficient precautions, an attacker can create a malicious input that is able to modify the application logic. The inserted code can help the attacker get access to private data, harm subsequent users of the application or even gain total control over the server.

To achieve code injection, the attacker can try to insert data in a different format than expected, include characters which have a special meaning in a particular programming language, or send overly long strings. This malicious data can be placed in HTTP forms and sent through a POST request, but it can also be placed in a URL query parameter or in some HTTP headers. Extended HTTP flows allow the detection of such behaviour if the malicious data is part of the URL, User-Agent or Referer. There are multiple variations of code injection attacks. The most popular ones, SQL injection and cross-site scripting, are analysed in this chapter.

7.1 SQL injection

Most web applications use databases and run an SQL (Structured Query Language) server. These servers are the targets of SQL injection attacks. During such attacks, malicious SQL commands are sent instead of the ordinary user input. If the input is not properly processed, the attacker can obtain, delete or modify sensitive data in the database. Although the main idea of the attack has been known for almost 20 years (it was first documented in 1998 in the online magazine for hackers Phrack [42]), a large number of susceptible servers is still deployed worldwide. The SQL injection vulnerability is regularly ranked among the top 10 most critical web application security flaws. The list is created every few years by The Open Web Application Security Project (OWASP) [43].


7.1.1 Basic principle

A simple login form is taken as an example in order to describe the basic idea of the attack. The user inserts his username and password, and the content of the fields is sent to the server. The code running on the server side connects to a database and calls a query similar to:

SELECT * FROM userInfo WHERE user = '$username' AND pass = '$password'

If an ordinary user types his credentials, a harmless query is sent to the database. When the database returns an entry, the user is considered authenticated. However, an attacker can insert a malicious string into the login form, for example ' OR '1'='1' --. The final query completely changes its meaning:

SELECT * FROM userInfo WHERE user = '' OR '1'='1' --' AND pass = '$password'

The double dash indicates a comment, so the last part of the query is ignored. The logical formula becomes always true and the database returns entries even though the password is not correct.

7.1.2 Attack types

SQL injection can be detected with the help of regular expressions (RE), but to make them efficient, one needs to know what possible attacks can look like. A nice overview can be found in [44]. The authors defined the following attack categories:

∙ Tautologies – The attacker transforms a condition into a tautology in order to force the database to return all table rows. A demonstrative example was presented in the previous section.

∙ Illegal/Logically Incorrect Queries – The attacker inserts a query that will cause a database error on purpose. If such an error is propagated to the user, it can reveal important information about the type of the database or the names of the tables.

∙ Union Query – The UNION operator enables the attacker to get data from an arbitrary table in the database. Example: UNION SELECT cardNo FROM CreditCards WHERE acctNo=10032 --


∙ Piggy-Backed Queries – The attacker does not transform the original query; instead, he closes it and inserts a new one. The new query can be any SQL command, including a call of a stored procedure. Example: '; DROP TABLE users --.

∙ Blind injection – This type of attack is used when the application does not provide sufficient visible feedback from the database. The attacker observes the differences in the web page behaviour when a condition evaluates to either true or false, and then he can ask yes/no questions. A special type is a timing attack, when the attacker inserts a time delay in order to distinguish between two possible outputs. In the following example, if the current user is root, the server response will be delayed by 5 seconds. Example: 1 AND IF(SUBSTRING(USER(),1,4)='root', SLEEP(5), 1) --

∙ Alternate Encodings – Attackers use various types of encoding to avoid detection. For example, char(120) is a different representation of the character "x", and the string exec(0x73687574646f776e) is interpreted as exec(SHUTDOWN).

7.1.3 Detection

To prevent SQL injections on the server, the developers need to recognise any potentially malicious string. However, attack detection is simpler than prevention. Attackers usually use a combination of multiple approaches to reach their objectives, and if any of the sent strings are caught by the detector, the attackers are exposed.

The first conducted experiment used a regular expression taken from scripts written for The Bro Network Security Monitor [21]. The RE checks for quotes and various reserved words such as SELECT, DELETE, UNION, AND, OR, CONVERT and so on. Throughout one day, more than 4000 possibly malicious URLs were detected. Approximately half of them proved to be benign. An interesting category of false positives were the 500 requests to the server query.yahooapis.com originating from various IP addresses. The following example shows part of the URL after removing the %-encoding:

&q=select * from partner.news.feeds where category = ’health’ and region = ’gb’


Yahoo provides developers of applications with a single interface to access data across the web. URLs sent by such applications often contain sequences of the Yahoo Query Language (YQL) [45] that have an SQL-like syntax. It is difficult to automatically distinguish such requests from SQL injection attempts. Therefore, the best solution is to put the server query.yahooapis.com on a whitelist. Besides queries using YQL, the regular expression chosen for the testing has also marked requests related to the European Union as suspicious, as well as many others containing reserved SQL words.

Part of the thesis is a new detector that enhances the Bro regular expressions. The three most important modifications are:

1. The regular expression matches suspicious words even if some characters are %-encoded.

2. Queries in SQL contain more than one special word. UNION is used with SELECT, SELECT with FROM, and so on. This fact is taken into account to reduce the number of false positives.

3. A host whitelist is introduced. For now, it contains only one rule, which excludes requests for hostnames containing the substrings query and yahoo.

The modified regular expressions are included in the electronic attachment (modifiedRE.txt), and they can be run using the created Python script patternMatching.py. Table 7.1 shows the comparison between the two methods. The improved regular expressions revealed more than 150 new attack attempts and significantly decreased the number of false positives.

             True positive  False positive
Bro RE       2365           2458
Modified RE  2631           50

Table 7.1: Comparison of methods used for SQLi detection
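The three modifications can be illustrated with a condensed sketch. The expressions below are deliberately simplified stand-ins for those in modifiedRE.txt; only the overall structure (decoding, keyword co-occurrence, host whitelist) follows the described detector.

import re
from urllib.parse import unquote

# Modification 3: whitelist hosts containing both "query" and "yahoo".
WHITELIST_RE = re.compile(r'(?=.*query)(?=.*yahoo)', re.I)
# Modification 2: require co-occurring SQL keywords, e.g. UNION with SELECT
# or SELECT with FROM, instead of matching single reserved words.
SQLI_RE = re.compile(r"union\s+(all\s+)?select|select\s+.+\s+from", re.I)

def is_sql_injection(host: str, url: str) -> bool:
    if WHITELIST_RE.search(host):
        return False
    # Modification 1: decode %-encoding before matching.
    return bool(SQLI_RE.search(unquote(url)))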

URLs sent in HTTP requests do not have any restrictions on their length. However, it may be convenient to store only a part of the URL with a fixed length to reduce the requirements for storage space. Moreover, shorter URLs can be processed faster by REs. The next experiment determines how many characters from the URL are usually needed to detect an SQL injection. The results presented in figure 7.1 suggest that 128 characters are enough. With the URLs shortened to only 128 characters, the detector was still able to reveal more than 98% of the malicious traffic. However, with shorter samples, the success rate diminishes quickly.

Figure 7.1: Success of the detection method on URLs of various length

The conducted experiments cannot prove that the chosen method detected all attacks on the network, because the number of requests throughout a day is too large to allow manual inspection. However, they showed that a significant number of injections can be detected with the help of the proposed regular expressions at a reasonable false positive rate (~1%). Moreover, the second experiment confirmed that the full length of the URL is not indispensable for the detection.
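The truncation experiment reduces to one measurement per prefix length: the fraction of known-malicious URLs still caught when only a prefix is kept. A hypothetical helper for it might look as follows; the detector argument can be any of the checks sketched earlier.

def detection_rate(malicious_urls, detector, prefix_len):
    # Fraction of known-malicious URLs still flagged by `detector`
    # when each URL is truncated to its first `prefix_len` characters.
    caught = sum(1 for url in malicious_urls if detector(url[:prefix_len]))
    return caught / len(malicious_urls)

# e.g. detection_rate(urls, lambda u: is_sql_injection("", u), 128)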

7.2 Cross-site scripting

XSS is another popular attack that makes use of insufficient input validation. However, its main target is not an HTTP server, but a client browser. The attacker creates malicious client-side code (usually JavaScript) and tries to insert it into a web page viewed by other users. This attack may appear in two forms: non-persistent/reflected and persistent. During a reflected attack, the attacker creates a URL with the malicious code, for instance:

http://vulnerable.com?search=<script>...</script>

The attacker then needs to deliver the URL to the victim (send it by email, put the link on some other website). Once the victim follows the link, his browser will download the malicious JavaScript code and run it. This code can, for example, steal the victim's session cookies and send them to the attacker. To perform a persistent XSS, the attacker needs to store the malicious string on the server. The string should become a part of the web page delivered to other users, for example as a comment in a forum.

7.2.1 Detection

A basic XSS attack involves a <script> element, so such an element can be searched for in the URL. Two examples of captured requests:

/cgi-bin/cgicso?query=<script>...</script>
/listserv/wa.exe?SHOWTPL=<script>...</script>
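A check of this kind can be sketched in the same style as the earlier detectors. This is a minimal illustration assuming the detector simply looks for a script tag in the %-decoded URL; the actual expression used for the experiments may be broader.

import re
from urllib.parse import unquote

# Flag URLs whose decoded form contains an opening <script> tag.
XSS_RE = re.compile(r'<\s*script[^>]*>', re.I)

def is_xss(url: str) -> bool:
    return bool(XSS_RE.search(unquote(url)))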

Most of the other requests are related to various advertisement websites like googlesyndication.com, adkmob.com or openx.net. The URLs are usually long and complicated, and they often contain a link to another website. It is not easy to understand the purpose of the code inside, and it is possible that they also represent malicious traffic. Even those with a simpler format are hard to interpret correctly. One decoded URL sent to us-u.openx.net is shown below:

/w/1.0/pd?plm=5&ph=d7066e05-92d3-4e83-b4f2-cbee552a2f6b>