MASARYK UNIVERSITY FACULTY OF INFORMATICS

Framework for Easy Malware Analysis

BACHELOR'S THESIS

Radoslava Povalová

Brno, autumn 2014

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Advisor: Mgr. Vít Bukač

Acknowledgement

I would like to thank my supervisor Mgr. Vít Bukač and my consultant RNDr. Václav Lorenc for their continuous support, guidance and valuable feedback, which helped me to write this thesis.

Abstract

The primary purpose of this thesis is to study tools used for malware deobfuscation and to implement a web application for deobfuscation and detection of malware. The web application allows the user to set the types of deobfuscation methods, their keys and their order. It also supports automatic analysis without the need to set an analysis configuration. The application uses the pattern matching system YARA for malware detection.

Keywords

malware, deobfuscation, Python, web application, cluster, Celery, YARA

Contents

1 Introduction
2 Malware Overview
    2.1 General Malware Categories
    2.2 Malware Lifecycle
    2.3 Examples of Malware Lifecycle
        2.3.1 New Malware Sample Example Scenario
    2.4 Obfuscation
3 Analysis
    3.1 Overview of Existing Tools
    3.2 Specification of Requirements
        3.2.1 Functional Requirements
        3.2.2 Non-functional Requirements
4 System Design
    4.1 Available technologies
        4.1.1 Web Framework
        4.1.2 Database
        4.1.3 User Interface
        4.1.4 Parallel Processing
        4.1.5 Deobfuscation Heuristics
5 Implementation
    5.1 User Interface
        5.1.1 Home Page
        5.1.2 Administration
        5.1.3 Analysis Configuration
        5.1.4 Details of a Result
    5.2 Application Layer
        5.2.1 Flask
        5.2.2 Celery
        5.2.3 Creation of a New Deobfuscation Analysis
    5.3 Database Layer
    5.4 Presentation Layer
    5.5 Deployment
        5.5.1 Application Preparation
        5.5.2 Nginx and uWSGI Deployment
        5.5.3 Apache and uWSGI
        5.5.4 Alternative Deployment
    5.6 Extending the Application
        5.6.1 Adding New YARA Rules
        5.6.2 Changing the Configuration
        5.6.3 Adding New Operation
6 Testing
    6.1 Unknown Malware
    6.2 Kordeef Trojan Horse
    6.3 Infected Word Document
7 Conclusions
    7.1 Future Improvements
    7.2 Conclusion
A Attachments
B Contents of attached ZIP archive

Chapter 1 Introduction

Nowadays it is hard to imagine a world without computers. They are used in banking, medicine, various control and communication systems, business applications, etc. This IT era gives us many new possibilities and brings improvements in efficiency, but it also brings security risks. The best solution to these risks would be to completely prevent intrusions and attacks on the systems. However, in the real world, despite the use of various defensive and safety mechanisms, some attacks succeed, the system is infected, and it is necessary to resolve the security incident by analyzing it.

Malware analysis is dissecting the malware to understand how it works, how to identify it, and how to defeat or eliminate it [39, p. 29]. There are two types of analysis techniques. Dynamic analysis involves launching and also debugging an executable file in a controlled and monitored environment so that its effects on a system can be observed and documented [5, p. 287]. Static analysis includes loading the executable into a disassembler and examining the program instructions to discover what the program does. Dynamic analysis is outside the scope of this thesis; I will mainly focus on static analysis.

In static analysis the disassembled executable is often not readable, because malware authors usually use techniques to mask the code and to hide their targets, for example, common encryption, encoding or obfuscation. Therefore, at the beginning of the analysis it is necessary to convert the executable to its decoded form. It is more efficient and comfortable for a security analyst to use an automatic tool to perform this conversion.


In my work I will implement a tool which a security analyst could use during static analysis to obtain the decoded disassembled executable. The user will have the option to completely manage the process of analysis. It will be possible to choose the type, parameters and also the order of the functions used to decode the given source. In addition to the set of predefined malware detection pattern rules, the tool will also allow the user to define and use only custom pattern rules. This tool will be implemented as an intuitive and user-friendly web application. The application will be developed in collaboration with a global Computer Incident Response Team (CIRT) at Honeywell International Inc.

Chapter 2 Malware Overview

2.1 General Malware Categories

Malware is any software that does something that causes harm to a user, computer, or network [39, p. 29]. Software which is not primarily harmful, but performs potentially unwanted actions, such as gathering sensitive information, is also often considered malware. The most common forms of malware are executable programs; others include scripts, applets, plugins, etc. Malicious software can be categorized into the following groups:

• Rootkits modify the operating system so that they are capable of hiding themselves and other system components from users and even the operating system itself [5, p. 309]. They can operate in kernel mode (Ring 0 on MS Windows systems) or in userland (Ring 3) mode.

• Backdoors have the primary function of creating remote access to the compromised system or network. Most of them are used during direct attacks to ensure access to the victims and for data exfiltration. They are often combined with rootkits.

• Downloaders exist mainly to download other malicious code. They are usually a part of a botnet ecosystem that is available as a service for other malware authors who can pay the botmaster (owner of a botnet) to spread their malware.

• Spyware collects sensitive information. Collected data often includes information about bank accounts, user behavior, browsing history and user accounts.


• Worms use self-replication for rapid spread of infection. They do not modify files, but create copies of themselves and spread via security vulnerabilities. They may cause some harm, but they are not designed to modify systems.

• Trojan horses infect other executables to spread infection. They usually carry other malware payload and spread by modifying existing files, for example, executables, documents or scripts.

• Adware displays unwanted advertisements. It is often classified as a Potentially Unwanted Program (PUP), which is not intended to be malicious. It is often installed as a part of a legitimate program.

• Ransomware tries to obtain money by threatening users. Most ransomware displays either a notice purporting to be from the police, demanding a ransom to avoid legal prosecution for illegal activity, or a notice that the user's important files have been encrypted and that the user must pay the malware author to decrypt them.

• Bots connect to a botnet to receive commands from the botmaster. An infected computer becomes a zombie in a large botnet consisting of other infected systems receiving commands from the botmaster. A botnet is usually used for DDoS (Distributed Denial of Service) attacks, sending spam, mining bitcoins, spreading other malware and so on.

It is difficult to categorize malware exactly, because it is often a combination of several of the categories mentioned above.

2.2 Malware Lifecycle

A typical malware lifecycle begins with a delivery system that spreads the malware to potential targets, most often in the form of a script on a compromised web site. This is done by an exploit kit, a set of exploits - pieces of software, chunks of data or sequences of commands that take advantage of a vulnerability or a bug in other software to execute instructions intended by the attacker. It can cause unusual behavior in the vulnerable software [40, p. 191]. An exploit kit tries to detect outdated plugins or software on potential targets and exploits them to deliver malware. Another type of delivery is sending phishing emails with a malicious payload, such as an infected PDF or a ZIP archive containing an executable file with a double extension.

Figure 2.1: Malware lifecycle stages [1]

The next stage is to deliver the malware by exploiting the victim and to ensure persistence so that it survives a reboot of the compromised host. After persistence is ensured, the malware begins to perform the actions it has been designed for. Actions may differ per malware, but usually include establishing a callback (C2 channel) to the CnC server to receive commands, or data exfiltration.

The malware lifecycle can often be different. For example, the delivery system can be omitted, because the malware is deployed by another malware or by a stager. The stager is responsible for downloading a large payload (the actual malware), injecting it into memory, and passing execution to it. By using a stager, it is almost guaranteed that the malware will be deployed on the victim host.

2.3 Examples of Malware Lifecycle

Zeus [30] (trojan horse) and CryptoLocker [23] (ransomware) - A victim browses to a web site infected by the Blackhole exploit kit [21] (a well-known exploit kit). When the infected web site is opened, the exploit kit is executed (in the form of JavaScript) and detects that the victim has an outdated Java version. Blackhole then exploits a vulnerability in that outdated Java by using an applet to deliver the Zeus malware to the victim. Zeus ensures its persistence and connects to its Command and Control Center (CnC), from which it receives commands, such as collecting sensitive information, for example, banking account information, or delivering other executables. The author of CryptoLocker paid the owner of the Zeus botnet to deliver his malware. In this case the exploitation phase of CryptoLocker is omitted, because it is deployed by another malware, namely Zeus.

On the target side there is often a running antivirus or a deployed Intrusion Detection System (IDS), which are able to detect malware by using signatures. Malware authors therefore usually have to bypass signature detection by encoding or obfuscating the original executable or the communication channel to be successful and meet their objectives. In the malware lifecycle it is beneficial to use obfuscation in several stages. These stages are described in detail in the following example.

2.3.1 New Malware Sample Example Scenario

The programmer has created a new malware whose aim is to steal sensitive information. The first step is to infect as many victims as possible. To accomplish this, the hacker deploys exploit kits to web sites which he has compromised before. At this point, if he does not use some sort of code obfuscation for the deployed exploit kit, it is very likely that it will be spotted. For example, Google scans for malicious code while indexing web pages, and if some is found on a web site, it warns users when they try to open that site. That is not the intention of the malware author, because it might greatly decrease the number of infected victims.


Now the first victim (patient zero) browses to this site. The exploit kit is executed and detects that this victim has an outdated version of Java. A malicious applet is automatically inserted into the web page, exploits the old Java version and delivers the malware. This victim connects from a company network where an IDS or a web proxy is deployed. The victim also has an up-to-date antivirus installed. These two tools can easily detect known exploits and alert the user, so the exploitation would fail and the victim would not be infected. At this point, obfuscation can be used to bypass these defense mechanisms.

Malware most often drops some files on disk when ensuring its persistence. Many antivirus programs automatically scan files which are manipulated (created or modified); this is known as an on-access scan (OAS). This can be another point of failure for malware when it tries to deploy, and obfuscation can prevent detection in this case as well.

The malware then performs its objectives, for example, it collects the victim's credit card number and e-banking login information. After collecting this information, the malware sends it to its CnC to be used to steal money from the bank account. When the information is being sent to the CnC, the communication is monitored by an IDS or antivirus, which can match a pattern that looks like a credit card number, so that the user is alerted or the transfer is blocked. The communication can be obfuscated so that it cannot easily be seen what is being sent to the CnC, and thus it also avoids detection.

When the malware is finally detected, it is analysed by a security researcher. During the analysis he disassembles the executable into assembly language and then reviews the code to learn about the malware's behaviour and its objectives. In plain text it may be easy to spot strings, such as a credit card number regular expression, the CnC IP address and so on. With this information it is easy to block the malicious server, create a signature to detect the malware and take other steps to stop it, which significantly decreases the spread of the malware. The malware author wants to infect as many victims as possible, so he uses obfuscation to encode the malware body.


It is then more difficult and more time-consuming for the researcher to find out what the malware sample does, because decoding the malware body into a readable format for analysis is usually a long process, considering the many different ways of encoding.

2.4 Obfuscation

Obfuscation is a process of data manipulation that hides the intended purpose of the data, conceals communication or obstructs its interpretation. In IT, obfuscation is most frequently achieved by using simple functions which transform data, such as XOR or byte-shift. These simple transformations of data are often enough to break signature detection. Another type of obfuscation is the generation of equivalent code from its original form in a way that is very hard for a human to interpret, but this type of obfuscation is not the subject of my thesis.

Using obfuscation is a two-way process. On one side there is a function performing the obfuscation by using a predefined algorithm with a key. On the side of the receiver there is also a function, which performs deobfuscation of the data by using the reverse algorithm, producing the original data from its obfuscated form.

Nowadays there exist many methods to encrypt data in a way that is almost impossible to decrypt without the decryption key, for example, AES (Advanced Encryption Standard), RSA and ECC (Elliptic Curve Cryptography), which are widely used in common software, such as browsers, email clients, P2P applications and so on. Malware authors can use them to create encrypted data which cannot be easily decoded. However, these methods are not widely used in malware, because they often require including additional libraries and make the malware code more complex, so it grows in size. Adding third-party libraries also exposes some of the functionality of the malware, and the bigger size makes it more noticeable, which is not in the interest of malware authors. Because of that they usually use simple functions, such as XOR, AND, ROT, ADD/SUB and byte-shift. These simple functions have native support in almost every programming language, so they are easy to use and sufficient for the purposes of obfuscation.


• XOR - Exclusive or, the most commonly used obfuscation function. When this operation is performed on a byte with a given key, it can be reversed by applying the same operation with the same key. The function operates on the binary representations of a byte and a key by comparing them bit by bit and returning 1 wherever the bits differ. From this definition it is clear that applying the same key again yields the original number. This is why XOR is very popular: the same function that obfuscates the data can also be used to deobfuscate it, which results in smaller code.

• ADD/SUB - Addition or subtraction of a given key. These functions are used in a way that handles overflows and underflows, for example by wrapping around within the byte range, and thus prevents information loss.

• ROT - Rotates the bits of the binary representation of a number to the left or to the right by a given key. This is not a classical arithmetic or logical shift, where bits may be discarded at either end; instead they are carried over to the other end, hence the name rotation.

• AND - Combines the bits of the binary representation of a number with a key by using a logical AND operation. This function is not commonly used, because it is not reversible from the mathematical perspective, and for deobfuscation it is necessary to select the key carefully to be able to reverse it and recover the original information.
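The four functions above can be sketched for single bytes as follows. This is only a minimal illustration under stated assumptions: the function names (xor_byte, add_byte, sub_byte, rol_byte, ror_byte, and_byte, transform) are hypothetical and are not taken from the tool described in this thesis, and the key is assumed to be a single byte.

```python
def xor_byte(value: int, key: int) -> int:
    """XOR a byte with a one-byte key; applying it twice restores the byte."""
    return (value ^ key) & 0xFF


def add_byte(value: int, key: int) -> int:
    """Add the key modulo 256, so an overflow wraps around instead of losing data."""
    return (value + key) % 256


def sub_byte(value: int, key: int) -> int:
    """Subtract the key modulo 256; this is the inverse of add_byte."""
    return (value - key) % 256


def rol_byte(value: int, count: int) -> int:
    """Rotate the 8 bits left; bits pushed out on the left re-enter on the right."""
    count %= 8
    return ((value << count) | (value >> (8 - count))) & 0xFF


def ror_byte(value: int, count: int) -> int:
    """Rotate the 8 bits right; this is the inverse of rol_byte."""
    count %= 8
    return ((value >> count) | (value << (8 - count))) & 0xFF


def and_byte(value: int, key: int) -> int:
    """Bitwise AND with the key; bits cleared by the key cannot be recovered."""
    return value & key & 0xFF


def transform(data: bytes, operation, key: int) -> bytes:
    """Apply one of the byte operations above to every byte of the data."""
    return bytes(operation(byte, key) for byte in data)


plain = b"http://example.test"
# XOR is its own inverse, ADD is undone by SUB, and ROL is undone by ROR.
assert transform(transform(plain, xor_byte, 0x5A), xor_byte, 0x5A) == plain
assert transform(transform(plain, add_byte, 200), sub_byte, 200) == plain
assert transform(transform(plain, rol_byte, 3), ror_byte, 3) == plain
```

The AND case shows why that function is described as not reversible: and_byte(0x41, 0x0F) and and_byte(0x51, 0x0F) both give 0x01, so the original byte cannot be told apart afterwards.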

There are also variations of the functions described above that change the encoding key for each byte. A common enhancement of the obfuscation algorithm is to increment the encoding key by a defined value after each byte. This fools many tools for automatic analysis, because most of them expect the same key to be used to encode the whole data. I also implemented support for incremental keys in my tool to address this problem.
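As an illustration of the incremental-key variant, the following sketch applies XOR with a key that grows by a fixed step after every byte. The function name xor_incremental and its parameters are assumptions made for this example, as is the choice to wrap the key around at 256; the actual implementation in the tool may differ.

```python
def xor_incremental(data: bytes, key: int, increment: int = 1) -> bytes:
    """XOR every byte with a key that is incremented after each byte.

    The key is kept within one byte by wrapping around at 256 (an assumption
    of this sketch). Calling the function again with the same starting key
    and increment restores the original data.
    """
    output = bytearray()
    for byte in data:
        output.append(byte ^ key)
        key = (key + increment) % 256
    return bytes(output)


plain = b"MZ header"
encoded = xor_incremental(plain, key=0x10, increment=3)

# The same call reverses the transformation.
assert xor_incremental(encoded, key=0x10, increment=3) == plain

# No single constant key recovers the data, which is why tools that assume
# one key for the whole input miss this kind of encoding.
assert all(bytes(b ^ k for b in encoded) != plain for k in range(256))
```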

Chapter 3 Analysis

In this chapter I will describe requirements and specification of the system. I will also compare my framework with other existing tools.

3.1 Overview of Existing Tools

Some tools that address the problem of automatic deobfuscation have already been developed. In this section I will describe the tools I found during my research.

XORSearch [42] supports deobfuscation of XOR, ROL, ROT and SHIFT encoded data by brute forcing all possible combinations and looking for a given input string. If the given string is found by deobfuscation with the specified key, XORSearch can extract all other strings using that key.

XORStrings [42] is a combination of XORSearch, described above, and a search for strings in the usual format terminated by a null byte. After it runs a brute force on the provided input data for each operation and key, it searches for all strings in the decoded data. Its generated report displays the number of strings found for each key and their average and maximum length.

xorBruteForcer [13] is a very simple Python script that brute forces all one byte keys (if key is not provided) to search for a given string. After the string and corresponding key are found, it outputs all other strings found by deobfuscation with the key. It is almost the same as XORSearch, but it supports only XOR function.


iheartxor [19] also brute forces XOR using all one byte keys. No string needs to be provided. It searches only for strings that end with null byte by default.

NoMoreXOR [11] attempts to automatically guess a 256-byte long XOR key. Decoded data is scanned via the pattern matching system YARA [2]. When the decoded content matches YARA signatures, a potential key has been found. Custom matching rules can also be provided.

Balbuzard [26] consists of several modules: bbcrack, bbtrans and bbharvest. Bbharvest extracts all patterns from the input data during brute forcing. It is very useful when more than one function is used, because it combines several operations (XOR, ROL, ADD). Bbcrack tries to guess, based on patterns, which algorithm has been used to obfuscate the provided input data. Bbtrans can apply the obfuscation algorithm found by bbcrack to a file. Patterns are identified using YARA signatures.
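The brute-force idea shared by several of the tools above (trying every one-byte XOR key and looking for a known string) can be sketched in a few lines. This is a simplified illustration and not the code of any of the listed tools; the function name brute_force_xor and the sample data are hypothetical.

```python
def brute_force_xor(data: bytes, needle: bytes):
    """Try all 256 one-byte XOR keys and return (key, decoded) pairs
    for every key whose decoded output contains the needle."""
    hits = []
    for key in range(256):
        decoded = bytes(byte ^ key for byte in data)
        if needle in decoded:
            hits.append((key, decoded))
    return hits


# Hypothetical sample: a request line obfuscated with the key 0x42.
plain = b"GET http://c2.example/payload"
sample = bytes(byte ^ 0x42 for byte in plain)

for key, decoded in brute_force_xor(sample, b"http://"):
    print(hex(key), decoded)
```

Once a key is found, the whole decoded buffer is available, so other strings in it can be extracted with the same key.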

In contrast with most of the existing deobfuscation tools, which support only the XOR operation, I decided to implement deobfuscation of XOR, AND, ADD, SUB and ROT encoded data in my tool. For all of these functions, it is also possible to increment the key after each byte. In my application, the user can set a custom configuration and combination of deobfuscation methods, such as the range and increment value of the keys for each method. I did not find any tool that supports this kind of custom settings. My application uses YARA rules to scan for signatures to determine the deobfuscation algorithm and key; in contrast to some tools, no search string is required.
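A configurable chain of methods with key ranges could be brute forced along the following lines. This is a sketch under several assumptions: the configuration format (an ordered list of method names with key ranges), the names OPERATIONS, apply_steps and brute_force, and the simple substring check that stands in for YARA matching are all hypothetical and do not describe the internals of the application.

```python
from itertools import product

# Byte-level operations available to a configuration (illustrative subset).
OPERATIONS = {
    "xor": lambda byte, key: byte ^ key,
    "add": lambda byte, key: (byte + key) % 256,
    "sub": lambda byte, key: (byte - key) % 256,
}


def apply_steps(data: bytes, steps) -> bytes:
    """Apply (method name, key) steps to the data in the given order."""
    for name, key in steps:
        operation = OPERATIONS[name]
        data = bytes(operation(byte, key) for byte in data)
    return data


def brute_force(data: bytes, config, matches):
    """Try every key combination from the configured ranges, in step order,
    and yield the steps and output wherever the matcher accepts the result."""
    names = [name for name, _ in config]
    key_ranges = [keys for _, keys in config]
    for keys in product(*key_ranges):
        steps = list(zip(names, keys))
        candidate = apply_steps(data, steps)
        if matches(candidate):
            yield steps, candidate


plain = b"This program cannot be run in DOS mode"
obfuscated = apply_steps(plain, [("add", 7), ("xor", 0x21)])

# Deobfuscation reverses the order: undo the XOR first, then the ADD.
config = [("xor", range(0x20, 0x24)), ("sub", range(0, 16))]
results = list(brute_force(obfuscated, config, lambda d: b"DOS mode" in d))
```

The number of candidates is the product of the sizes of the key ranges, which is why narrowing the ranges per method matters for longer chains.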

All the tools I discovered operate in command line mode and do not have any graphical interface. I implemented my tool as a web application, which is more intuitive and easier to use than CLI applications. Although CLI applications are less user-friendly, their advantage is that they can be scripted in a shell. My interface uses an API which can also be used to integrate my application into other systems or scripts. The design as a web application has further advantages: it is installed only on a server, so the user needs only a browser to access the application, and data is shared among the users. It is also easier for the maintainer to update the tool and fix bugs.

3.2 Specification of Requirements

The main aim of my thesis is to create a tool which facilitates malware analysis. It should deobfuscate input data and identify patterns in it to determine whether it is malware or not. The tool should allow custom settings of analysis parameters, such as the deobfuscation methods, their keys and so on. It should also provide automatic malware analysis where the configuration is generated automatically.

3.2.1 Functional Requirements

• The user inputs data into the system by uploading a file or as plain text, which can optionally be formatted as a Base64 encoded string. The Base64 encoding option should also be offered to the user automatically when the input text is detected as a valid Base64 string.

• The user should be able to configure the deobfuscation steps: the types of deobfuscation functions, the order of their application to the original sample and the range of their keys for brute force cracking.

• In the tool the user can choose to use the set of predefined rules for the analysis or define custom rules.

• After the analysis is submitted, an interactive progress indicator showing the current status of the running deobfuscation is displayed to the user.

• When the analysis is completed, a list of the results found during brute force cracking is displayed to the user, who can open a detailed report for each result.

• In the detailed result report from the analysis, the user is able to download both the original sample and the deobfuscated file that has been created. The report also contains a list of the recognised rules with the location of the data matched by them.

• On the home page there is an overview of the latest analysis results.

• The user can terminate a specific deobfuscation task that is still running by removing it at the home page.

• The user does not have to specify analysis settings; it is possible to choose an automatic analysis option where settings are selected from previous successful results.

• A simple administration interface is available, where the user can see the list of online cluster nodes, reset the database and terminate all running tasks in the cluster.

• The tool should allow multiple analyses to run simultaneously.

• The configuration file of the application allows the expiration time for deobfuscation results in the database to be set.
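The first requirement in the list above, detecting whether pasted text is a valid Base64 string, could be checked roughly as follows. This is only a sketch using the Python standard library; the function name looks_like_base64 is hypothetical and the way the application actually performs this detection is not stated here.

```python
import base64
import binascii


def looks_like_base64(text: str) -> bool:
    """Return True if the text, ignoring whitespace, decodes as strict Base64."""
    stripped = "".join(text.split())
    # Valid padded Base64 is non-empty and a multiple of four characters long.
    if not stripped or len(stripped) % 4 != 0:
        return False
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        base64.b64decode(stripped, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


print(looks_like_base64("TVqQAAMAAAA="))  # Base64 of a buffer starting with "MZ"
print(looks_like_base64("abc$defg"))      # contains a character outside the alphabet
```

Such a check is only a heuristic: short plain words such as "abcd" are also valid Base64, which is one reason to offer the decoding as an option to the user rather than apply it silently.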

3.2.2 Non-functional Requirements

• The tool is implemented as a web application and thus no special client application has to be used.

• There are no user accounts, only one global user role without authentication. This is by design, because the application is not intended to be available as a public service, but rather to be used by analysts in their own environment or lab.

• The tool is scalable both vertically and horizontally by adding more nodes to the cluster or changing the configuration of a node.

• Stored data may be inconsistent or lost when the database is not shut down cleanly, because it keeps almost all data only in memory.


• The application is platform independent, because it runs as a web application and can be used from any platform that has a modern web browser.

• The application is open-source and can therefore be used for free in a company environment. Other features or bug fixes can be contributed by others.

Chapter 4 System Design

In the following chapter I will discuss the different technologies that can be used to implement the tool, their requirements and their advantages and disadvantages.

4.1 Available technologies

The tool should be implemented as a web application. The four most commonly used languages for developing web applications are PHP, Python, Java and Ruby. PHP is the major language for creating web applications; however, I chose Python as the implementation language for the following reasons:

• There are many high quality web frameworks that can be easily integrated with cluster frameworks to execute jobs running in the background.

• Existing cluster frameworks are very easily configurable and have a high scalability without a lot of requirements.

• Every commonly used database has a connector for Python, and a lot of them can be used as a broker for a cluster.

• The majority of existing tools for deobfuscation are written in Python because of its flexibility.

• Python has been primarily created as a language for data and string manipulation, their transformation and the processing of large volumes of data for scientific research, so it is very convenient for converting data to various formats and for various kinds of processing.

4.1.1 Web Framework

In Python there are many different web application frameworks, each with its own advantages and disadvantages. Probably the most popular framework nowadays is Django. This framework is very robust, because it includes many of the layers needed to build large-scale applications, such as an object-relational mapping (ORM) interface, separate components for the application parts, which act as plugins that can be reused across other applications, the Model-View-Controller (MVC) approach, and automatic generation of an administration interface. Many other Python libraries provide direct integration with Django. One of the disadvantages of the Django framework is that all of these components are installed by default and the developer cannot easily remove them, which results in unnecessary dependencies and a performance drawback. That is why I did not choose Django: in my application I do not use many of the features it provides, such as the ORM, the MVC approach and so on. Based on these facts, I decided to choose a more lightweight framework (a so-called microframework).

There is a wide range of microframeworks for Python to choose from, for example CherryPy [29], web.py [43], Flask [37], [20] and so on. These microframeworks are almost equal to each other regarding the features they provide, and in general they differ only in the syntax used to create a web application. From these frameworks I decided to use Flask. It is quite popular, very well tested, field-proven, easily deployed and has excellent documentation. In Flask, creating a web application is as easy as decorating a function which returns the data displayed on the client side.

4.1.2 Database

In general, databases can be categorised into two groups: SQL and NoSQL. At present SQL databases are the most popular. The SQL schema is a standard, well-defined concept, based upon rock-solid theory, mature and well-understood. However, NoSQL databases are gaining popularity these days, because a carefully selected type of NoSQL database can provide great capabilities. Other benefits of this type of database are map-reduce functions, advanced aggregation, no table scans, better locking mechanisms, cheaper scalability, etc.

I chose the NoSQL database Redis [38]. One of the primary reasons is that my application should provide an almost real-time overview of the current deobfuscation progress. This real-time feature involves writing the current status from the cluster to the database very often, which may be a problem in an SQL database due to its locking mechanism, because the web client also fetches that progress from the database in almost real time. The other reason is that the data my tool stores in the database is unstructured. I also do not use aggregation on my data, such as grouping records together to calculate extensive overall statistics. It is important to note that Redis can also be used as a broker for the cluster, which minimizes dependencies on other software that would otherwise have to be used and managed. I will describe the broker part later in this chapter. Redis is also a very good choice as a caching system, which I use in my application.

4.1.3 User Interface

I decided to implement my tool as a web application, because creating an additional client-side application is as difficult as creating a server for deobfuscation, and it is inconvenient for the user to install an additional client application to be able to use my tool. If I had created a client-side application, I would have had to take care of compatibility with the server when adding new features, maintain more software, which results in more bugs, and develop some sort of update mechanism to deliver new features. Nowadays web technologies (HTML5) provide almost all the features and the comfort of a standard desktop application. A web application is also easy to use, because almost everyone has a browser.


However, writing a web application may not be an easy task, because the developer has to ensure that it looks the same in different browsers, on different devices, such as tablets and phones, and at different screen resolutions. Fortunately, there now exist frameworks that address these problems by providing templates used to create a web application design that looks the same across different devices and browsers. Some of the most popular frameworks are Bootstrap [31], LessFramework [25] and Skeleton [17]. These frameworks are very similar to each other in that they provide ready-made stylesheets for the design and documentation on how to create a web application by combining widgets and components. I decided to use probably the most common framework - Bootstrap.

Bootstrap works by using a base HTML5 template with an included CSS stylesheet. The interface is built on top of a grid system in which the application is divided into rows, which contain columns, where the widgets displaying the actual data are placed. The Bootstrap documentation describes the HTML code that can be used to create a particular widget, such as a navigation bar, collapsible panel, responsive table and many other components.

However, using only HTML and CSS stylesheets is not enough to create an interactive application, because they only provide a design for displaying data in a pleasant format. For interactivity, it is necessary to use a scripting language (JavaScript), which makes it possible to respond dynamically to user actions without constantly reloading the web application to display different data. JavaScript provides the interface to manipulate data, fetch data from the server and modify different areas of the web application accordingly. Using only JavaScript poses a problem similar to creating a custom design, because there are also incompatibilities between different browsers, platforms and devices.

Because of that, there exist frameworks which solve this problem by providing an abstraction layer with the same interface for commonly used browser features. One of the most popular is jQuery [36], which provides CSS-like selectors for HTML elements, Ajax calls (an abstraction over XMLHttpRequest) and so on. Using this framework greatly speeds up development and reduces the size of the source code, thanks to the convenience functions it provides. However, one problem still persists: reflecting changed data in the interface. The developer has to search programmatically for all the locations where the data is displayed and has to modify each of them to display the changed data.

This problem is addressed by MVC frameworks, where data is stored in models, transformed in controllers and displayed in views, which always show the current data even after it has been modified. There exist several JavaScript MVC frameworks, such as Ember.js [12], AngularJS [28], Backbone.js [3] and so on. I decided to use AngularJS because of its support for rapid prototyping and its easy-to-use API. This allowed me, for example, to display almost real-time progress of the deobfuscation and to create a very interactive, responsive interface. AngularJS also integrates very well with the Bootstrap framework that I use as the base for the design of the application.

4.1.4 Parallel Processing

My application should be able to run several deobfuscations simultaneously. A standard program can execute only one instruction at a time, which may represent a problem, because when we run a deobfuscation process, we cannot do anything else until the deobfuscation function has finished. This is by design, because a normal program runs in a single thread, which cannot execute instructions in parallel. A simple solution to this problem is to create a multi-threaded application in which the deobfuscation function runs in a separate thread, which allows the main program to perform other actions at the same time. With a multi-threaded approach, access to shared data has to be synchronized with locks and semaphores. This solution seems very easy, but it has a few big drawbacks: it is hard to scale, because in Python it can effectively use only one core of the processor. Creating a thread in Python does not give the application true parallelism, because the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time.
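
The threaded approach described above can be sketched as follows. This is only an illustration written for Python 3: the function names (xor_decode, worker, run_in_background) are hypothetical and are not taken from the application, and a thread-safe queue stands in for the explicit locks mentioned in the text.

```python
import queue
import threading


def xor_decode(data: bytes, key: int) -> bytes:
    """Hypothetical deobfuscation function: XOR every byte with one key."""
    return bytes(b ^ key for b in data)


def worker(data: bytes, keys, results: queue.Queue) -> None:
    """Runs in a background thread and pushes each result to a thread-safe queue."""
    for key in keys:
        results.put((key, xor_decode(data, key)))


def run_in_background(data: bytes, keys) -> list:
    results = queue.Queue()  # the queue does its own locking internally
    thread = threading.Thread(target=worker, args=(data, keys, results))
    thread.start()
    # The main thread would be free to do other work at this point.
    thread.join()
    found = []
    while not results.empty():
        found.append(results.get())
    return found


obfuscated = xor_decode(b"cmd.exe", 0x59)
decoded = dict(run_in_background(obfuscated, range(0x58, 0x5B)))
```

Because of the GIL, the worker thread in this sketch does not make the CPU-bound loop any faster; it only keeps the main thread responsive while the loop runs.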


The second approach is to use multi-processing, for example by creating a separate process or by forking. This approach scales well within a single host, because separate processes can take advantage of a multi-core system. It also has drawbacks: it is more difficult to communicate between processes, since each runs in its own separate context. Communication between the processes can be done using sockets, file descriptors or other inter-process communication (IPC) mechanisms.
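
A minimal sketch of the multi-processing approach, assuming Python 3 and the standard multiprocessing module. The helper names (xor_decode, try_key, brute_force_parallel) are hypothetical and only illustrate the idea; they are not the functions used in the application.

```python
import multiprocessing


def xor_decode(data: bytes, key: int) -> bytes:
    """Hypothetical deobfuscation function: XOR every byte with one key."""
    return bytes(b ^ key for b in data)


def try_key(args):
    """Unit of work for one worker process; takes a (data, key) tuple."""
    data, key = args
    return key, xor_decode(data, key)


def brute_force_parallel(data: bytes, keys, processes: int = 2) -> dict:
    """Spread the keys over several worker processes and collect all results."""
    with multiprocessing.Pool(processes=processes) as pool:
        # Arguments and results are pickled and passed between processes,
        # which is the communication overhead mentioned in the text.
        return dict(pool.map(try_key, [(data, key) for key in keys]))
```

On platforms that start worker processes by spawning a fresh interpreter, brute_force_parallel should be called from code guarded by if __name__ == "__main__", otherwise the workers would re-run the calling code when they import the module.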

The last approach to parallel processing is to use a cluster that consists of workers which run on different machines and communicate via a broker or sockets. These workers receive tasks to execute, and when the execution of a task finishes, they return the data from the evaluated function. This is a very scalable approach, because the work is distributed among the machines connected to the cluster (nodes), and each node can also take advantage of a multi-core system if it uses forking or separate processes to execute tasks. A cluster can be scaled by adding new nodes, which can even be done dynamically by the system if the load is too high. Such cluster systems are complex to create from scratch, so developers often use an existing cluster engine. Python has several popular cluster engines, such as dispy [32], RQ [10], Celery [41] and so on. These frameworks differ in how they distribute tasks in the cluster: for example, via a message broker, SSH or sockets.

For my project I selected the Celery framework, which uses a message broker to distribute tasks and messages across the cluster. A message broker in Celery can be a database, for example MySQL [8], Redis, MongoDB [22] or CouchDB [15], or a messaging server, of which the most popular is RabbitMQ [34]. Full support for Celery features such as task expiration or advanced events is currently available only if Redis or RabbitMQ is used as the broker. As mentioned above, this is also one of the reasons why I selected Redis as the database: I also use it as the broker for the cluster, which eliminates a dependency on additional software. A broker basically only distributes messages.


4.1.5 Deobfuscation Heuristics

During the deobfuscation process the application needs to determine whether the right deobfuscation algorithm has been found. This can be done by creating a database or list of patterns and then searching for these patterns in the data. A simple solution is to create a list of regular expressions. This approach, however, has a few drawbacks: regular expressions are designed to search for patterns in readable text, not in data that consists of unprintable characters. Another drawback is that searching with regular expressions is relatively slow on large amounts of data.

There exists a framework called YARA [2] (see Listing A.1 for an example of a YARA rule), which was originally designed as a pattern matching engine for malware categorization. It is very similar to the engines that antivirus products use for signatures; however, it is not limited to malware, and any type of file can be categorized. YARA works with rules consisting of patterns which are searched for in a file, and any matching data sequence is returned. Patterns can take many forms, such as simple strings, regular expressions or groups of bytes, together with options such as the range of bytes in which to search, wildcards and pattern combinations (exclusion, inclusion). I use this engine in my application as a heuristic to find out whether deobfuscated data has been found: it searches for defined patterns with assigned weights and then sums the weights of the matched patterns to get a total rank of the resulting data.
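
The weighted ranking described above can be sketched in a few lines of Python 3. The Match class below is a stand-in invented for this illustration, not the type returned by the YARA bindings, and the rule names are made up. The sketch also assumes that a rule without an explicit weight counts as one.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    """Stand-in for one matched rule (hypothetical, not the YARA binding type)."""
    rule: str
    meta: dict = field(default_factory=dict)


def rank(matches) -> int:
    """Sum the weights of all matched rules; a missing weight counts as 1."""
    return sum(int(m.meta.get("weight", 1)) for m in matches)


matches = [
    Match("http_user_agent", {"weight": 5}),
    Match("registry_path"),  # no weight given, so it counts as 1
    Match("pe_section", {"weight": 2}),
]
total = rank(matches)
```

A result with no matches gets a rank of zero, which makes it easy to discard candidates that contain none of the expected artifacts.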

Chapter 5 Implementation

In this chapter I will describe the user interface, the project structure, the technologies used, the database structure and the deployment process in detail.

5.1 User Interface

5.1.1 Home Page

The Home page (Figure A.1) displays the list of analyses. Each analysis shows its status (failed, pending or finished) and its creation time. Buttons to open or remove an analysis are on the right side. By default all sessions expire after seven days, but this is configurable via the configuration file. If an analysis is removed while it is still running, it is terminated.

5.1.2 Administration

The Administration page (Figure A.2) offers a simple administration interface. A user can see the list of online cluster nodes, terminate all running tasks in the cluster and remove all data from the database, which effectively resets the application.

5.1.3 Analysis Configuration

The first step in creating a new analysis (Figure A.3) is to input the data to deobfuscate. This can be done either by uploading a file or by inserting text. The user selects the type of input data from a dropdown box. For inserted text, the user can also select from a dropdown box whether it is plaintext or Base64-encoded data.

24 5. IMPLEMENTATION

However, the inserted text is automatically checked by a JavaScript regular expression to see whether it is valid Base64-encoded data, and if so, a hint is displayed to the user. By default a set of built-in YARA rules is used for the analyses, but the user can also choose from the dropdown box to use his own YARA rules. The user rules are also automatically checked for syntax errors.
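
The Base64 check mentioned above runs in the browser as a JavaScript regular expression that is not reproduced here. The following Python 3 sketch shows one way such a check could look. The pattern and the function name are assumptions made for illustration only.

```python
import base64
import re

# Groups of four Base64 characters, optionally ending in a padded group.
BASE64_RE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)


def looks_like_base64(text: str) -> bool:
    """Return True if the text, ignoring whitespace, has valid Base64 shape."""
    stripped = "".join(text.split())
    return bool(stripped) and BASE64_RE.fullmatch(stripped) is not None


sample = "Y21kLmV4ZQ=="
is_b64 = looks_like_base64(sample)
payload = base64.b64decode(sample)
```

A check of this kind can only produce a hint: a short plain word such as "test" also has a valid Base64 shape, which is presumably why the final choice is left to the user.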

The user can choose a deobfuscation function from the provided list. After adding the function, a range of keys can be set via a slider, along with the value used to increment the key after each byte. There is no restriction on the number of added functions. An automated approach is also available by enabling the checkbox for the automated analysis option. This automation includes the XOR function with all keys, and it also selects all functions and keys that found any results in previous successful analyses.
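
To make the key range and the key increment concrete, here is a minimal Python 3 sketch of an XOR brute force. It is not the code from decoder.py; the function names are hypothetical, and two details are assumptions: the key range is treated as inclusive, and the incremented key wraps around at 256.

```python
def xor_with_increment(data: bytes, key: int, key_change: int = 0) -> bytes:
    """XOR each byte with the key, then advance the key by key_change."""
    out = bytearray()
    for byte in data:
        out.append(byte ^ key)
        key = (key + key_change) & 0xFF  # assumed wrap-around at one byte
    return bytes(out)


def brute_force(data: bytes, key_range=(0, 255), key_change: int = 0):
    """Lazily yield (key, candidate) for every key in the inclusive range."""
    first, last = key_range
    for key in range(first, last + 1):
        yield key, xor_with_increment(data, key, key_change)


obfuscated = xor_with_increment(b"cmd.exe", 0x59, 1)
hits = [key for key, out in brute_force(obfuscated, (0, 255), 1) if out == b"cmd.exe"]
```

Since XOR is its own inverse, applying the same key schedule twice restores the original data, so the same function serves for both obfuscation and deobfuscation in this sketch.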

When the analysis is submitted, a real-time progress bar is shown. If any results are found during deobfuscation, they are displayed immediately in a table below the progress bar. Each result has a rank score which reflects how many artifacts were found during deobfuscation. A detailed view of each result can be opened via a button.

5.1.4 Details of a Result

On the detail page (Figure A.4) of each result the user can download the original file and the deobfuscated file. This page displays the MD5 hash of the file and the creation time. For each matched YARA rule there is a panel, named after the rule description, displaying the matches of that rule. Each row contains a string or sequence of bytes that matches the rule, as well as its offset from the beginning of the file.
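
The two pieces of information shown on this page, the MD5 hash and the offsets of matched data, can be illustrated with the Python 3 standard library. This is a sketch only: in the application the offsets come from YARA, and the helper names below are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest, used here only as an identifier for the data."""
    return hashlib.md5(data).hexdigest()


def find_offsets(data: bytes, pattern: bytes) -> list:
    """Offsets from the beginning of the data where the pattern occurs."""
    offsets = []
    start = data.find(pattern)
    while start != -1:
        offsets.append(start)
        start = data.find(pattern, start + 1)
    return offsets


blob = b"xxcmd.exe--cmd.exe"
digest = md5_hex(blob)
offsets = find_offsets(blob, b"cmd.exe")
```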

5.2 Application Layer

The application has the following structure:

templates/ folder contains the templates for the Flask application which render the web interface.

static/ folder has three subfolders: css/, fonts/ and js/. The css folder contains all the stylesheets for the main Bootstrap theme as well as stylesheets for the custom widgets used in the interface, such as the slider. The Bootstrap theme also uses browser-compatible fonts, located in the fonts folder, to ensure a uniform look across multiple platforms. JavaScript source code is located in the js folder. This folder contains all the plugins for widgets that require JavaScript, as well as the main AngularJS module, which contains the controllers for the different views. The main module is located at js/ng/mainapp.js.

rules/generic.yara is the file containing all the rules for the YARA engine.

project.py is the main module, which contains the Flask application with all the functions bound to URL addresses.

crackers.py includes all functions that run inside the Celery cluster. They are used for automated and brute force deobfuscation.

decoder.py module contains all deobfuscation operations supported by the application. It also contains a Python decorator that wraps these functions so that they can be executed in a chained iterator.

utils.py consists of simple helper functions that are used in multiple places in the application.

heuristics.py contains a class that is used to load a set of YARA rules and to calculate a score for each result from them.

config.py is the configuration file for the application and the Celery cluster. The URI of the Redis database is set here; it has to be set when the application is deployed on a production server. The expiration time for the sessions created in the database can also be set here, in seconds.
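
The chained iterator mentioned for decoder.py can be illustrated with plain generator composition. The sketch below is written for Python 3 and is not the decorator from the application: the names chain and operation, the item layout (history, data) and the key ranges are all assumptions made for the example.

```python
def chain(*stages):
    """Compose generator functions: each item a stage yields feeds the next stage."""
    def run(item, remaining):
        if not remaining:
            yield item
            return
        head, rest = remaining[0], remaining[1:]
        for produced in head(item):
            yield from run(produced, rest)
    return lambda item: run(item, stages)


def operation(name, func, keys):
    """Wrap a byte-level function into a stage that tries every key."""
    def stage(item):
        history, data = item
        for key in keys:
            yield history + [(name, key)], bytes(func(b, key) for b in data)
    return stage


xor = operation("xor", lambda b, k: b ^ k, range(0x58, 0x5B))
add = operation("add", lambda b, k: (b + k) & 0xFF, range(0, 2))

obfuscated = bytes(b ^ 0x59 for b in b"cmd.exe")
pipeline = chain(xor, add)
results = list(pipeline(([], obfuscated)))
```

Each candidate is produced only when the consumer asks for it, so memory use stays roughly constant regardless of how many key combinations the chain covers. Here list() is used only to make the output easy to inspect.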

5.2.1 Flask

As mentioned in the previous chapter, the framework that I chose as the web application engine is Flask, which is categorized as a micro-framework. This family of frameworks provides only the basic functionality needed to build a web application, such as URL routing, a template engine or a simple session management system. However, they can usually be extended by various plugins to provide the more advanced functions and layers included in full-stack web application frameworks, for example an ORM, a database backend etc.

A typical web application written in Python begins by creating the main application object that serves all the incoming requests. In the case of Flask this is done by creating an instance of the flask.Flask class. The Web Server Gateway Interface (WSGI) engine imports the module containing this application and passes all the requests to this object. This design pattern is common to all Python web applications. Flask consists of two main parts: Werkzeug and Jinja. Werkzeug is responsible for the serialization and deserialization of HTTP requests into Python objects. Jinja is the template language used to render the response to the HTTP request. A simple Flask application can be created as illustrated below.

    import flask

    # main application object; Flask requires the module's import name
    app = flask.Flask(__name__)

    # route decorator is used to bind function to URI
    @app.route("/")
    def index():
        # response returned for the http request
        return "Hello world"

5.2.2 Celery

Celery is a full-featured framework for easily creating cluster tasks in Python. The first step in creating a cluster is to create a main application object of the celery.Celery class and to set the cluster configuration. The broker configuration is very important, because it defines the capabilities of the cluster. As I mentioned in the previous chapter, not all broker types support all the features. Currently, only RabbitMQ supports all of them, and Redis is the second best supported. All communication in the cluster goes through the broker, which is responsible for delivering and routing messages between the nodes.
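
As an illustration of such a configuration, a fragment in the style of a Celery 3.x settings module might look as follows. This is a hypothetical sketch, not the contents of the application's config.py: the host, port, database number and expiration value are placeholders, and the REDIS_URI name is invented for the example.

```python
# Hypothetical configuration fragment; all values are placeholders.
REDIS_URI = "redis://localhost:6379/0"

# Uppercase setting names as used by Celery 3.x.
BROKER_URL = REDIS_URI             # broker that routes messages between nodes
CELERY_RESULT_BACKEND = REDIS_URI  # where finished task results are stored

# How long task results are kept, in seconds (example value: seven days).
CELERY_TASK_RESULT_EXPIRES = 7 * 24 * 60 * 60
```

Pointing the broker and the result backend at the same Redis instance matches the design decision described earlier of using Redis both as the database and as the broker.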

When a cluster node is started, it connects to the broker and starts responding to received messages. These messages include events such as heartbeats, used for health monitoring, or events tracking the progress of executing tasks. When a task is created, a message is sent to the broker with the specified configuration: the task to be executed with its parameters and, optionally, other settings such as an expiration time. A node with free resources grabs this message and starts executing the task. When the execution is finished, a message with the task result is sent back to the broker, and the client can fetch this result.

5.2.3 Creation of a New Deobfuscation Analysis

When the analyst enters all the information into the web interface, it is briefly checked with JavaScript to verify that it contains all the details necessary for the analysis. At this stage I identified a problem: how to correctly and easily pass the data to be deobfuscated to the server, including support for reading that data from a file on the client computer. A common approach is to create an HTML form with an input element of the file type, but it is hard to make such a form interactive, because the form needs to be submitted, which generally results in a page refresh.


However, the HTML5 specification includes the capability of loading file content via JavaScript. It provides a FileReader interface which can be used to read file content asynchronously using events. During my research I found that sending a file via Ajax with this interface directly requires a lot of code. I instead chose a third-party library called FileReader.js [18], which is a thin wrapper around the FileReader interface. This library provides an event that is fired when the file input element changes, together with a function to load the file content into a buffer. Apart from this, it also includes built-in support for drag and drop without writing additional code.

Prior to the submission of a new analysis, the data is serialized to JSON with the following structure:

    {
        "analysis": "",
        "data": "",
        "dataType": "",
        "encoding": "",
        "name": "<name of the analysis session>",
        "ruleset": "<YARA rules: custom/default>",
        "rules": "<custom YARA rules>",
        "functions": [
            {
                "name": "<function name: xor/rol/add/and>",
                // range of keys for brute force
                "config": [0, 255],
                // value for incrementing the key
                "key_change": 0
            }
        ]
    }

This JSON is submitted to the API endpoint located at /crack, where it is processed by a Python function. This function spawns a new task in the Celery cluster, for either brute force or automated analysis depending on the configuration in the JSON.

At first this function computes the total number of operations, which is used to display the progress of the deobfuscation, and creates an instance of the heuristics.Heuristics class, which is later used to compute a rank score for each brute force iteration. By default, every rule has a weight of one, but this can be changed by adding a weight parameter with a specified value to the meta section of the rule.

The next step is the preparation of the list of operations that will be executed in each iteration. In the case of brute force, a chain of operations is constructed from the parameters provided in the JSON using the decorator wrapper decoder.chain. This decorator chains functions so that all results yielded by one function are provided as input to the next function in the chain. The whole construction acts as a Python generator that yields the results of the last function in the chain. A generator was chosen instead of a classical return statement in a for loop because it is much more efficient: only the code of the current iteration is executed.
If I had used a standard return statement, I would have needed to precompute all the results and store them in an array, which is very memory inefficient. With the generator, memory usage is approximately constant, because only the data of the current iteration is stored.

When the user selects the automated analysis option, he does not need to enter any other configuration for functions, because the application tries to determine the best possible options. This is done by selecting XOR, which is the most common type of obfuscation used in malware, with a range of keys from 1 to 255. In addition to XOR, the system also looks at all previous successful analyses and adds all the functions that found any hit. This list of operations is filtered so that the same operation is not added more than once.

The analysis task then runs the YARA heuristics on every yielded result to determine its rank score. If the rank is non-zero, the result is stored in the database along with other information, such as the sequence of operations that led to this result and details of the YARA matches. The progress is incremented in Redis after each iteration.

On the client side, AngularJS creates an Ajax request to the API endpoint located at /crack/<taskId>, which returns the progress of the deobfuscation task and all the results that have already been found. The taskId parameter is the ID of the Celery task created in the cluster. The first version of the deobfuscation tool created an array that was populated with the results found during deobfuscation and returned as the result of the cluster task. I found that this presents a problem, because the results could be displayed only after the deobfuscation finished, which may take a long time.
I decided to redesign it to insert each result into the database right after it has been found, so it can be displayed in the UI before the whole task finishes.

5.3 Database Layer

The primary data structure is a dumped JSON document stored as a string under the specified key. Redis provides a structure called a hash map, which is very similar to a Python dictionary with the restriction that it cannot contain nested structures. That is why I decided to use dumped JSON instead, because it can contain nested structures. One may argue that this is a drawback, because it is not possible to query nested data in such a structure. However, even with native hash maps, there is no operation in Redis to query the sub-keys of the map.

The analysis session data is stored under the session-<id> key, where id is the id of the Celery task that has been created for this analysis (all other id placeholders in the text also refer to the Celery task id). It mainly stores the JSON that has been received via the API from the interface. This data has the following structure:

    {
        "analysis": "<automated or bruteforce>",
        "dataType": "<file or text>",
        "name": "<name of the analysis>",
        "file": "<md5 of stored data>",
        "functions": [
            {
                # range of keys for operation
                "config": [0, 255],
                "name": "<name of an operation: xor/add/and...>",
                # optional increment value for key
                "key_change": 0
            }
        ],
        "ruleset": "<default or custom>",
        "sessionId": "<id of a celery task e.g. session id>",
        "time": "<creation timestamp>"
    }

I used a separate key, equal to the id of the Celery task, for storing the progress of the analysis.
This data is stored in a native hash map, because Redis has an operation to increment a sub-key in this structure. It has the following format:

    {
        # Number of operations processed
        "current": 500,
        # Total number of operations
        "total": 731
    }

The results found during the analysis are stored as a list of dumped JSON documents under the result-<id> key. Each of them has the following format:

    {
        # set of operations made on original data
        "operations": [
            # [name, key, increment value]
            ["xor", 108, 0]
        ],
        # Human readable string of operations
        "operations_str": "xor(0x6C) key change: +0",
        "rank": 100,  # Rank score of the result
        "md5": "<md5 of the deobfuscated data>"
    }

Using the MD5 hash it is possible to retrieve the details of the deobfuscated file, including the YARA matches, using the key decoded-<md5>, which has the following format:

    {
        "matches": "<output from YARA>",
        "rank": 100,  # Score rank of deobfuscated file
        "md5": "<md5 of deobfuscated data>"
    }

The content of the file can be retrieved using the MD5 checksum by querying the key file-<md5>.

5.4 Presentation Layer

In my application I chose to use a combination of two template engines. The first one is Jinja2, which is built into Flask and is used on the server to generate the basic structure. On the client side I decided to use the template support included in AngularJS to create an interactive interface, because server-side templates do not provide this level of interactivity.
When the data is fetched via Ajax, AngularJS's MVC model automatically updates all the locations where the data is displayed; for example, the application can display the real-time progress of a deobfuscation without reloading the web page.

I customized the AngularJS controller used in the new analysis view to include functions that verify that all the configuration needed for creating a new analysis, such as the brute force functions and the file for deobfuscation, is set correctly. If any part of the configuration is missing, a hint is displayed to the user describing what is missing.

During the interface development, I found that there is no widget in Bootstrap that can be used to easily set the key range of a selected operation. During my research I found multiple external libraries, but none of them provides sufficient support for AngularJS integration. To solve this problem, I created a custom directive for an external library; the directive monitors the slider values and updates the AngularJS scope variables accordingly. The custom directive provides a new HTML tag that AngularJS replaces with a defined HTML structure, and it initializes the external library to display the slider. This directive is included in static/js/ng/mainapp.js.

All the user controls interact with the controller's models, which are simply variables in a scope. When one of these controls, such as an input field, changes its value, AngularJS ensures that the scope variables are updated to contain the new data. When a new Ajax request is created, the variables from the scope are serialized to JSON, which is then sent to the server.
Any response from the server that is in JSON format is deserialized back into scope variables, and the interface is automatically updated by AngularJS.

5.5 Deployment

5.5.1 Application Preparation

The recommended setup is to use Python 2.7 inside a virtual environment. A virtual environment provides an isolated environment for a Python application in which all the dependencies are installed; this prevents accidental breakage of the application due to updated system-level packages. This level of isolation ensures that all the Python dependencies are left intact even when the system versions are updated. No administrative privileges are required to create this environment, which also improves security. To create a virtual environment, run the command virtualenv venv in the root directory of the application; this creates a virtual environment named venv. It is necessary to activate this environment to be able to make changes inside it, which can be done by executing the command source venv/bin/activate.

We are now inside the environment, as indicated by (venv) on the left side of the shell prompt. We can now proceed with the installation of the required Python dependencies. All the dependencies are listed in the requirements.txt file and can be installed by running the command pip install -r requirements.txt. Prior to the installation of the Python dependencies, the YARA library and the Python headers have to be installed in the OS. A Redis database also needs to be deployed, and a URI pointing to the database has to be configured in the application settings in config.py.

To verify that all dependencies are installed correctly, a simple server can be started by running the script project.py. It runs the web server built into Flask on port 5000, which is used during application development; I strongly discourage using this built-in server in production.
A deployment for a production server is described later in this chapter.

The same preparation procedure as described above is also advised for setting up the cluster nodes. At least one node is required to be able to run the application. This node can also be located on the same server where the web frontend of the application runs, using the same virtual environment. A cluster node can be started by executing the script cluster.sh.

5.5.2 Nginx and uWSGI Deployment

The recommended production setup is to use the Nginx [44] web server acting as a proxy, in combination with uWSGI [45] as the engine for the application. I chose this combination because Nginx provides excellent support for uWSGI (in the nginx-full version, not the minimal setup). uWSGI can be installed by running the pip install uwsgi command inside the virtual environment.

uWSGI can be started by executing the command uwsgi -s 127.0.0.1:3031 -w project:app --pidfile ./wsgi.pid --daemonize ./uwsgi.log in the application root. It runs uWSGI listening on port 3031 and logging to uwsgi.log.

Now Nginx has to be configured to proxy all traffic to the application. This can be done by modifying the configuration as described below:

    location / {
        try_files $uri @deobfuscator;
    }
    location @deobfuscator {
        include uwsgi_params;
        uwsgi_pass localhost:3031;
    }

5.5.3 Apache and uWSGI

By default, Apache [14] does not provide support for directly using applications running under uWSGI. However, this support can be added to Apache by installing and enabling two separate modules, mod_proxy and mod_proxy_uwsgi. The configuration on the uWSGI side is the same as I described above for Nginx. The configuration for Apache is also very similar to that of Nginx: we set a proxy to pass all the requests to uWSGI.
A sample configuration is illustrated below:

    <VirtualHost *:*>
        ProxyPass / uwsgi://127.0.0.1:3031/
        ServerName localhost
    </VirtualHost>

5.5.4 Alternative Deployment

One may assume that the proxy configuration for Nginx and Apache uses HTTP(S) to communicate with the uWSGI application; however, this is not true. The communication between the web server acting as a proxy and uWSGI uses a specialized protocol specific to uWSGI. However, uWSGI may also be configured to use HTTP(S) directly, so we can use it without the need for a web server. Other alternatives to uWSGI that can be used for running the application also primarily use HTTP as the communication protocol:

• Gunicorn [7] is a non-blocking server using a pre-fork model for creating workers that handle requests. It supports eventlets and greenlets to avoid blocking during requests.

• Tornado [16] is also a non-blocking server, based on epoll, that can handle thousands of requests per second. It is highly scalable, which makes it a good choice for high-traffic sites.

• Twisted Web [27] is part of a large framework aimed at network software development, which also includes a WSGI-compliant web server. It uses event-driven programming to handle requests.

5.6 Extending the Application

5.6.1 Adding New YARA Rules

A new YARA rule can be added by modifying the rules/generic.yara file. The syntax of YARA rules is available at the YARA project home page [2]. It is not necessary to restart any services. However, this file has to be distributed to all cluster nodes.

5.6.2 Changing the Configuration

The database and cluster configuration is located in the config.py file. The cluster behavior can be further customized by adding supported Celery options according to the Celery documentation [6].
If the configuration is changed, the cluster and the web interface have to be restarted.

5.6.3 Adding New Operation

New operations can be added by editing the file decoder.py. An added operation must have the same format as the existing operations, in terms of accepted arguments and yielded results. When a new operation is added, the dictionary named decoders in the same file has to be modified to include it. The key in the dictionary is the name used as the reference in the web interface. The reference key also has to be added to the select box element in templates/sandbox.tpl.

Additionally, the range of keys for the operation can be set by extending the array located in static/js/ng/mainapp.js in the SanboxCtrl controller for AngularJS.

Chapter 6 Testing

I received a few samples from my advisor for testing my application. Some of them cannot be published, so they are not attached to this thesis. Below I list some of the most interesting samples.

6.1 Unknown Malware

The first sample I tested was recognized as malicious (TR/Downloader.Gen) only by the Avira antivirus when tested on VirusTotal [46] (Table A.1A). This malware did not have any assigned signature at the time of writing, and during reverse engineering it was determined that it contains obfuscated data embedded inside the executable. For the analysis, I chose the automated analysis option to scan the file using the default operations. The analysis revealed the following deobfuscated data (Figure A.8) using the XOR(0x59) -> AND(0xFF) operation:

• User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729B)
• Accept-Encoding: gzip, deflate
• GET %s HTTP/1.1
• SYSTEM\CurrentControlSet\ServicesYYY|*
• cmd.exe
• %SystemRoot%
• YYYHelpSupportYSecurityYYYYhelpsvcYMYYYcmd.exe

Based on the findings listed above, we can conclude that this malware performs callbacks using the standard HTTP protocol with the given user agent, which can be used as a signature for a network IDS to find hosts potentially compromised by this malware.

6.2 Kordeef Trojan Horse

This trojan horse contains a DLL library kb.dll which loads and executes the content of the file C:\WINDOWS\system32\dll; however, this file is obfuscated using the XOR operation. Kordeef infects the system processes Explorer.exe and Winlogon.exe, which then use kb.dll to load the malware payload. At the time of writing, it was marked as malicious by 11 of 46 antivirus engines when scanned on VirusTotal (Table A.1B). I obtained this file and used the automated analysis to find the obfuscation algorithm used. A snippet of the findings from the analysis (Figure A.7) follows:

• HTTP/1.1
• Content-Type: text/html
• Software\Microsoft\Windows\CurrentVersion\Explorer\LowRegistry
• SOFTWARE\Microsoft\Windows NT\CurrentVersion\SystemRestore
• Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders
• SYSTEM\CurrentControlSet\Services\sr\Parameters
• http://rds.yahoo.com
• http://google.ru
• firefox.exe
• opera.exe
• chrome.exe
• iexplore.exe

As we can see from the indicators that were found during the analysis, Kordeef hooks into web browsers and modifies their behavior; in fact, it modifies the search results displayed on web pages.

6.3 Infected Word Document

I obtained several infected Word documents that contain the exploit for CVE-2011-0611 with an obfuscated payload. These documents are Japan Nuclear Program.doc (Figure A.5, Table A.1C) and Message from Anne.doc (Figure A.6, Table A.1D).
A selection of interesting results from the automated analysis (for operation XOR(0x85)) follows:</p><p>• 0xD0CF11E0A1B11AE1 - MS OLE2 header</p><p>• .rsrc</p><p>• CMd.eXe</p><p>• KERNEL32.DLL</p><p>• oleaut32.dll</p><p>• shell32.dll</p><p>• user32.dll</p><p>• Windows.Com</p><p>• self.bat</p><p>• Remove.dll</p><p>The exploit contained within these documents leverages the vulnerability through the Flash objects embedded inside the document to execute arbitrary code, hence the MS OLE2 header that was found during the analysis.</p><p>Chapter 7 Conclusions</p><p>7.1 Future Improvements</p><p>The current version of the application creates a separate task for each analysis. These tasks are automatically distributed between the cluster nodes. The disadvantage of this configuration is that the node that receives a task performs the whole brute force alone. This can be improved by splitting each analysis into smaller subtasks, which would utilize cluster resources better because they would be distributed among the other nodes.</p><p>Integration with the open-source antivirus ClamAV [24] can be implemented by automatically submitting each sample for scanning. The scan results can be used to select a specific deobfuscation algorithm. They can also serve as an extension to the YARA signatures, which are currently used to determine whether the deobfuscation algorithm and the key have been found.</p><p>Functionality similar to that of ClamAV can be implemented by using a public service. VirusTotal [46] scans samples using more than 40 antivirus engines. The disadvantage is that the current public API is limited to approximately one thousand requests per day.</p><p>Advanced heuristics can be integrated into the automated analysis by using a disassembler or debugger to look for a set of instructions that are used to decode the payload. 
There are many Python frameworks that can disassemble code or run an executable in an emulated environment, such as diStorm [9], Pyasm [4], and pyemu [33].</p><p>An option for users to upload their own YARA rules in the administration section can be added, so that they do not need to insert their rules every time they do not want to use the default set of rules.</p><p>Deobfuscation can be extended by adding a permutation option which would alter the order of the selected functions.</p><p>An event system for plugins can be added by using entry points of Python packages. It would allow external packages to extend the application and automatically post-process deobfuscated data.</p><p>Performance can be increased by modifying the code to take advantage of alternative Python interpreters, such as PyPy [35], which has a built-in Just in Time (JIT) compiler. An alternative speed-up can be achieved by rewriting a portion of the brute-force code in C/C++ and compiling it as a Python module.</p><p>7.2 Conclusion</p><p>In this thesis I have described common malware categories and the malware lifecycle. I have introduced malware obfuscation applied to each lifecycle phase. I have researched and described existing tools that help with deobfuscation during malware analysis. During the research, I found that many of these tools are aimed at only one specific obfuscation technique, so there is a lack of tools that support multiple techniques and their combinations. Based on my findings, I have defined a set of functional and non-functional requirements for a new deobfuscation tool and have analysed and compared different approaches and available technologies.</p><p>I have then created the deobfuscation tool with an intuitive interface using available web technologies. In this thesis I have also described various deployment options for different environments. 
I have collected several obfuscated malware samples and tested my application on them. The testing results are attached.</p><p>This tool will be used by a global Computer Incident Response Team (CIRT) at Honeywell International Inc.</p><p>Bibliography</p><p>[1] Gautam Aggarwal. “Be the Change.” Test Methodologies for Advanced Threat Prevention Products. https://www.fireeye.com/blog/executive-perspective/2013/10/be-the-change-test-methodologies-for-advanced-threat-prevention-products.html. [Online, 2014/12/20].</p><p>[2] Victor Manuel Alvarez. YARA - The pattern matching swiss knife for malware researchers. http://plusvic.github.io/yara/. [Online, 2014/12/18].</p><p>[3] Jeremy Ashkenas. Backbone.js. http://backbonejs.org/. [Online, 2014/12/20].</p><p>[4] Florian Boesch. Pyasm - Python x86 Assembler. http://codeflow.org/entries/2009/jul/31/pyasm-python-x86-assembler/. [Online, 2014/12/20].</p><p>[5] Harlan Carvey. Windows Forensic Analysis DVD Toolkit. 2nd ed. Syngress, 2009. ISBN: 978-1597494229.</p><p>[6] Celery - Configuration and defaults. http://docs.celeryproject.org/en/latest/configuration.html. [Online, 2014/12/15].</p><p>[7] Benoit Chesneau. Gunicorn - Python WSGI HTTP Server for UNIX. http://gunicorn.org/. [Online, 2014/12/15].</p><p>[8] Oracle Corporation. MySQL :: The world’s most popular open source database. http://www.mysql.com/. [Online, 2014/12/15].</p><p>[9] Gil Dabah. diStorm :: Powerful Disassembler Library for AMD64. http://www.ragestorm.net/distorm/. [Online, 2014/12/20].</p><p>[10] Vincent Driessen. RQ: Simple job queues for Python. http://python-rq.org/. [Online, 2014/12/15].</p><p>[11] Glenn Edwards. NoMoreXOR. https://github.com/hiddenillusion/NoMoreXOR. [Online, 2014/12/18].</p><p>[12] Ember.js - A framework for creating ambitious web applications. http://emberjs.com/. [Online, 2014/12/20].</p><p>[13] Jose Miguel Esparza. XORBruteForcer. http://eternal-todo.com/var/scripts/xorbruteforcer. [Online, 2014/12/15].</p><p>[14] The Apache Software Foundation. The Apache HTTP Server Project. http://httpd.apache.org/. [Online, 2014/12/15].</p><p>[15] The Apache Software Foundation. Apache CouchDB. http://couchdb.apache.org/. [Online, 2014/12/12].</p><p>[16] FriendFeed. Tornado Web Server. http://www.tornadoweb.org/en/stable/. [Online, 2014/12/15].</p><p>[17] Dave Gamache. Skeleton: Responsive CSS Boilerplate. http://getskeleton.com/. [Online, 2014/12/20].</p><p>[18] Brian Grinstead. FileReader.js. http://bgrins.github.io/filereader.js/. [Online, 2014/12/15].</p><p>[19] Alexander Hanel. iheartxor. http://hooked-on-mnemonics.blogspot.cz/p/iheartxor.html. [Online, 2014/12/18].</p><p>[20] Marcel Hellkamp. Bottle: Python Web Framework. http://bottlepy.org/docs/dev/index.html. [Online, 2014/12/12].</p><p>[21] Fraser Howard. Exploring the Blackhole exploit kit. http://nakedsecurity.sophos.com/exploring-the-blackhole-exploit-kit-3/. [Online, 2014/12/10].</p><p>[22] MongoDB Inc. MongoDB. http://www.mongodb.org/. [Online, 2014/12/20].</p><p>[23] Keith Jarvis. CryptoLocker Ransomware. http://www.secureworks.com/cyber-threat-intelligence/threats/cryptolocker-ransomware/. [Online, 2014/12/10].</p><p>[24] Tomasz Kojm. ClamAV. http://www.clamav.net/index.html. [Online, 2014/12/12].</p><p>[25] Joni Korpi. Less Framework 4. http://lessframework.com/. [Online, 2014/12/20].</p><p>[26] Philippe Lagadec. Balbuzard - malware analysis tools to extract patterns of interest and crack obfuscation such as XOR. http://www.decalage.info/python/balbuzard. [Online, 2014/12/15].</p><p>[27] Glyph Lefkowitz. TwistedWeb. http://twistedmatrix.com/trac/wiki/TwistedWeb. [Online, 2014/12/15].</p><p>[28] Brat Tech LLC, Google, and community. AngularJS - Superheroic JavaScript MVW Framework. https://angularjs.org/. [Online, 2014/12/20].</p><p>[29] Rolando Murillo. CherryPy - Minimalist Python Web Framework. http://www.cherrypy.org/. [Online, 2014/12/20]. 
</p><p>[30] Ben Nahorney and Nicolas Falliere. Trojan.Zbot. http://www.symantec.com/security_response/writeup.jsp?docid=2010-011016-3514-99&tabid=2. [Online, 2014/12/10].</p><p>[31] Mark Otto and Jacob Thornton. Bootstrap - The world’s most popular mobile-first and responsive front-end framework. http://getbootstrap.com/. [Online, 2014/12/20].</p><p>[32] Giridhar Pemmasani. dispy: Python framework for distributed and parallel computing. http://dispy.sourceforge.net/. [Online, 2014/12/20].</p><p>[33] Cody Pierce. pyemu - A Python IA-32 Emulator. https://code.google.com/p/pyemu/. [Online, 2014/12/20].</p><p>[34] Pivotal Software, Inc. RabbitMQ - Messaging that just works. http://www.rabbitmq.com/. [Online, 2014/12/12].</p><p>[35] PyPy. http://pypy.org/. [Online, 2014/12/15].</p><p>[36] John Resig. jQuery. http://jquery.com/. [Online, 2014/12/20].</p><p>[37] Armin Ronacher. Flask (A Python Microframework). http://flask.pocoo.org/. [Online, 2014/12/20].</p><p>[38] Salvatore Sanfilippo. Redis. http://redis.io/. [Online, 2014/12/12].</p><p>[39] Michael Sikorski and Andrew Honig. Practical Malware Analysis: The Hands-On Guide to Dissecting Malicious Software. No Starch Press, 2012. ISBN: 978-1593272906.</p><p>[40] Abhinav Singh. Metasploit Penetration Testing Cookbook. Packt Publishing, 2012. ISBN: 978-1849517423.</p><p>[41] Ask Solem. Celery: Distributed Task Queue. http://www.celeryproject.org/. [Online, 2014/12/20].</p><p>[42] Didier Stevens. XORSearch & XORStrings. http://blog.didierstevens.com/programs/xorsearch/. [Online, 2014/12/15].</p><p>[43] Aaron Swartz. web.py. http://webpy.org/. [Online, 2014/12/20].</p><p>[44] Igor Sysoev. Nginx. http://nginx.org/. [Online, 2014/12/15].</p><p>[45] The uWSGI project. https://uwsgi-docs.readthedocs.org/en/latest/. [Online, 2014/12/15].</p><p>[46] Virus Total - Free Online Virus, Malware and URL Scanner. https://www.virustotal.com/. 
[Online, 2014/12/20].</p><p>Appendix A Attachments</p><p>Figure A.1: Application homepage</p><p>Figure A.2: Administration interface</p><p>Figure A.3: Analysis configuration</p><p>Figure A.4: Details view of a result</p><p>Figure A.5: Details for Japan Nuclear Program.doc</p><p>Figure A.6: Details for Message from Anne.doc</p><p>Figure A.7: Details for Kardeef trojan</p><p>Figure A.8: Details for unknown malware sample</p><pre>rule ms_ole {
    meta:
        description = "Possible OLE2 header"

    strings:
        $str0 = {D0 CF 11 E0 A1 B1 1A E1}

    condition:
        $str0
}</pre><p>Listing A.1: Example YARA rule</p><table><tr><th>ID</th><th>Name</th><th>URL</th></tr><tr><td>A</td><td>Unknown malware</td><td>http://goo.gl/7hXyhJ</td></tr><tr><td>B</td><td>Kardeef trojan</td><td>http://goo.gl/2YeV6Q</td></tr><tr><td>C</td><td>Japan Nuclear Program.doc</td><td>http://goo.gl/jDxmFB</td></tr><tr><td>D</td><td>Message from Anne.doc</td><td>http://goo.gl/zlH6u6</td></tr></table><p>Table A.1: Testing samples uploaded to VirusTotal</p><p>Appendix B Contents of attached ZIP archive</p><p>• Folder thesis that contains:</p><p>– Source code of the thesis in .tex format</p><p>– Bibliography of the thesis in .bib format</p><p>– Generated PDF version of my thesis</p><p>– Images used in my thesis</p><p>• Folder source code that contains:</p><p>– createenv.sh script to create a virtual Python environment</p><p>– List of Python requirements in the requirements.txt file</p><p>– Source code of the tool in the webinterface folder</p> </div> </article> </div> </div> </div> </html>