City Research Online

City Research Online City, University of London Institutional Repository Citation: Acarali, D., Rajarajan, M., Komninos, N. and Herwono, I. (2016). Survey of Approaches and Features for the Identification of HTTP-Based Botnet Traffic. Journal of Network and Computer Applications, doi: 10.1016/j.jnca.2016.10.007 This is the accepted version of the paper. This version of the publication may differ from the final published version. Permanent repository link: https://openaccess.city.ac.uk/id/eprint/15580/ Link to published version: http://dx.doi.org/10.1016/j.jnca.2016.10.007 Copyright: City Research Online aims to make research outputs of City, University of London available to a wider audience. Copyright and Moral Rights remain with the author(s) and/or copyright holders. URLs from City Research Online may be freely distributed and linked to. Reuse: Copies of full items can be used for personal research or study, educational, or not-for-profit purposes without prior permission or charge. Provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page and the content is not changed in any way. City Research Online: http://openaccess.city.ac.uk/ [email protected] Journal Logo 00 (2016) 1{19 Survey of Approaches and Features for the Identification of HTTP-Based Botnet Traffic Dilara Acaralia, Muttukrishnan Rajarajana, Nikos Komninosa, Ian Herwonob aSchool of Engineering and Mathematical Science, City University London, London, United Kingdom. bSecurity Futures Practice, Research & Innovation, British Telecom, Ipswich IP5 3RE, United Kingdom. Abstract Botnet use is on the rise, with a growing number of botmasters now switching to the HTTP-based C&C infrastructure. This offers them more stealth by allowing them to blend in with benign web traffic. Several works have been carried out aimed at characterising or detecting HTTP-based bots, many of which use network communication features as identifiers of botnet behaviour. In this paper, we present a survey of these approaches and the network features they use in order to highlight how botnet traffic is currently differentiated from normal traffic. We classify papers by traffic types, and provide a breakdown of features by protocol. In doing so, we hope to highlight the relationships between features at the application, transport and network layers. c 2016 Published by Elsevier Ltd. Keywords: Bot, botnet traffic, network analysis, feature analysis, network-based detection 1. Introduction a server. In contrast, both IRC and HTTP-based botnets operate with a centralised architecture, where bots are re- The digital age has brought many benefits to society, quired to connect to a command and control (C&C) server with the Internet acting as an enabler for growth and de- in order to upload data or receive commands. For this velopment across practically all sectors of business and in- study, we have chosen to focus on HTTP-based botnets. dustry. Unsurprisingly, it has also become an attractive The wide-spread use of web-based services has pushed the and fertile environment for criminal activity. Malware pro- adoption of HTTP as an advantageous communication grams, which may take many forms, are now frequently protocol thanks to the fact that it is usually permitted used for financial theft, identity theft, espionage, and dis- by firewalls, and it allows bot traffic to blend in with vast ruption of services. A particularly troublesome type of volumes of benign activity. The benefits of using HTTP malware is a bot, an exploited system which acts as a re- rather than IRC are demonstrated and discussed by Farina mote tool for an attacker to control and use the resources et al. (2016) and Gu et al. (2008). Additionally, McAfee of a target system. Typically, the attacker (called the bot- (2015) listed at least 5 HTTP-based botnets as top spam- master) will do the same for multiple systems and then use mers in their 2014 Q4 Threat Report. This was also echoed them collectively, in what is known as a botnet, to launch by Symantec (2014), where HTTP-based botnets made up attack campaigns. over half of their 2014 top 10 spamming botnets. Botnet classification is based on the type of communica- The difficulty of detecting botnet traffic (especially for tion protocol used. The main classes currently defined by HTTP-based botnets) amongst web activity is a current the research community are P2P-based, IRC-based, and research challenge. Rostami et al. (2014) compared clean HTTP-based. P2P-based botnets use a decentralised ar- network traces to those collected from HTTP-based bot- chitecture, in which each node can act as both a client and nets. They focused on HTTP data in client-side PCAP files to highlight differences in clean and malicious traffic at Email addresses: [email protected] (Dilara the application layer. However, they do not consider how Acarali), [email protected] (Muttukrishnan Rajarajan), [email protected] (Nikos Komninos), this traffic may manifest at other layers, nor how it may [email protected] (Ian Herwono) be measured. Both Haddadi & Zincir-Heywood (2014) and 1 D. Acarali et al / 00 (2016) 1{19 2 Beigi et al. (2014) study the effectiveness of flow-based fea- 2. Behaviour of HTTP-Based Botnets tures in combination with machine learning techniques for botnet detection. The former analyses the use of flow ex- In order to interpret the traffic generated by bots, we porters (used to extract and aggregate flows), whilst the first need to understand the activities that they may be en- latter considers the use of different flow-based features as gaging in. In this section, we consider sequences of events inputs to machine learning algorithms. Whilst they pro- in a typical botnet's lifecycle, and split them into groups vide interesting insights, these works consider neither fea- based on the aims behind those events. We then consider tures from the application layer to help to characterise the what kind of traffic may be observed for each of these malware, nor the behaviours which may underpin them. groups. Meanwhile, Garcia et al. (2014) conduct a general survey on network-based detection methods, including a categori- 2.1. Lifecycle sation of algorithms and in-depth analysis of key works. Botnets inherently exhibit similar behaviours over the However, they do not identify the specific features those course of their active lifespan. These behaviours can be ab- approaches use. stracted into a number of sequential stages to form a model The aim of this work is to provide a study of the char- of the botnet lifecycle. The stages may be slightly different acteristics of HTTP-based botnet traffic and the types of depending on whether behaviours are viewed from the per- features used to measure or detect them. Such features or spective of the botmaster or of the defenders. Rodriguez- identifiers are the inputs to many detection mechanisms, Gomez et al. (2013) consider the former approach, and regardless of the methodology used. Therefore, it is im- define the lifecycle as Conception, Recruitment, Interac- portant to have a clear view of both what these identifiers tion, Marketing, Attack Execution and Attack Success. may be as well as how they fit together. To achieve this, we Conception covers the initial design and development of conduct a survey of recent network-based detection meth- the malware, including the botmaster's motivations. The ods, with specific focus on the types of traffic they observe Recruitment stage is the infection process through which and the parameters they define for the distinction of be- new bot victims are acquired. Next, the Interaction stage nign and malicious network data. Identified features are is where communication channels are established, includ- abstracted into sets, and classified by protocols and the as- ing those between the bots and C&C servers. Marketing sociated layers of the OSI model. Our goal is to enable a is (as the name suggests) the advertisement of the botnet better understanding of the nature of HTTP-based botnet or its services. Attack Execution will then depend on the traffic, including the underlying behaviours which cause it chosen campaign, and the end of this stage will be marked and the implementation of different protocols. (Note that by Attack Success. the term \HTTP-based botnet" refers to those that com- Meanwhile, the stages as defined by Silva et al. (2013) municate with C&Cs over HTTP, but they may use other are Initial Infection, Secondary Injection, Rally, Malicious protocols for other activities). Activities, and Maintenance. Initial Infection is the stage The main contributions of this paper are as outlined where the bot malware first comes into contact with a below: vulnerable node, and roughly aligns with the previous model's Recruitment stage. Then, a Secondary Injection • A survey of recent network-based detection ap- is triggered, where the executed malware downloads proaches for HTTP-based botnets additional configuration files. Rally denotes the initial • A survey of traffic-based features used to detect bot connection attempts made by bots to C&Cs from which traffic they will receive further instructions. This aligns with Interaction in the previous model. The commands of the • An abstraction of the main types of features in rela- botmaster will dictate the botnet's behaviours during tion to protocols and OSI layers Malicious Activities (similarly to Attack Execution and Attack Success). Maintenance covers the process of The paper is organised as follows: Section 2 will provide updating or patching the botnet. a background on the observed behaviours of HTTP-based botnets and the traffic which is generated as a result of The lifecycle conceptually encapsulates botnet be- those behaviours.

Load more