<<

Masaryk University Faculty of Informatics

System for detection of websites with phishing and other malicious content

Bachelor's Thesis

Tomáš Ševčovič

Brno, Fall 2017

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Tomáš Ševčovič

Advisor: prof. RNDr. Václav Matyáš, M.Sc., Ph.D.

Acknowledgement

I would like to thank prof. RNDr. Václav Matyáš, M.Sc., Ph.D. for supervising this bachelor thesis and for his valuable advice and comments. I would also like to thank the consultant from CYAN Research & Development s.r.o., Ing. Dominik Malčík, for useful advice, dedicated time and patience during consultations and application development. Finally, I would like to thank my family and friends for their support throughout my studies and my work on this thesis.

Abstract

The main goal of this bachelor thesis is to create a system for the detection of websites with phishing and other malicious content, with respect to Javascript interpretation. The program should be able to download and process thousands of domains and produce positive results. The first step involves examining an overview of automated web testing tools to find an optimal tool to be used in the main implementation. The thesis contains an overview of technologies for testing, their comparison, an overview of malware methods on websites, and the implementation and evaluation of the system.

Keywords

Chrome, Javascript, link manipulation, malware, phishing, URL redirects, XSS, Yara

Contents

1 Introduction

2 Overview of approaches to website testing
  2.1 Manual testing
  2.2 Automated testing
    2.2.1 Selenium
    2.2.2 Other website testing tools

3 Comparison of tools for automated website testing
  3.1 Criteria
  3.2 Compared tools
  3.3 Evaluation
    3.3.1 Conclusion of comparison

4 Detection of phishing and other malicious content
  4.1 Malicious content on websites
    4.1.1 Phishing
    4.1.2 Other malicious content
  4.2 Detection of phishing
    4.2.1 Cross-site Scripting
    4.2.2 URL Redirects
    4.2.3 Link Manipulation
    4.2.4 Imitating trusted entity
    4.2.5 Detection of other malicious content

5 Implementation
  5.1 Design
  5.2 Tools and libraries
    5.2.1 Chrome
    5.2.2 Wget
    5.2.3 Beautiful soup
    5.2.4 PyLibs
    5.2.5 Yara patterns
  5.3 Input
  5.4 Output

6 Evaluation of results
  6.1 Optimization
    6.1.1 Parallelism
    6.1.2 Database
  6.2 Execution times
    6.2.1 Conclusion
  6.3 Results
    6.3.1 Comparison with Google Safe Browsing
    6.3.2 Effect of Javascript interpreter
  6.4 Further work

7 Conclusion

Bibliography

List of Figures

2.1 Selenium IDE plug-in for Mozilla Firefox.
3.1 Worldwide share of the usage of layout engines in November 2017. Data collected from [10].
3.2 Worldwide share of usage of Javascript engines in November 2017. Data collected from [10].
3.3 Example of script for downloading a website in PhantomJS.
3.4 Overview of main characteristics of headless browsers.
3.5 Result of /usr/bin/time -v command.
4.1 How phishing works [18].
5.1 A diagram of the program.
5.2 Output of one website.
6.1 Usage of RAM by Chrome.
6.2 The performance of the testing PC.
6.3 Average execution times per page.
6.4 Average execution times by percentage.
6.5 Ratio of exposed malware per million domains.
6.6 Average execution times by percentage.
6.7 Results of detection in one million domains.
6.8 Comparison of results of Chrome and Wget.

1 Introduction

Every day, malicious content on the Internet attacks numerous users in every corner of the world. Deceptive techniques designed to obtain sensitive information from the user, by acting like a trustworthy entity, often appear within web content. These techniques are known as phishing. Besides phishing, there are more threats on the Internet which can be injected via Javascript. All it often takes is downloading an unverified file that can contain a computer virus. The aim of this bachelor thesis is to explore the area of available test tools and technologies for the detection of such websites and applications. These instruments must be able to interpret the Javascript code of a website, acquire all its content and then work with it. Another aim is to compare the instruments with each other and to make an informed choice of the best one for the practical part of this thesis. A further objective of this thesis is to create a system that will be able to detect phishing and other dangerous content that appears on websites, using the selected tool. The created system has to work efficiently and has to be implemented and run on a Linux server. I prepared an overview of the available options for testing and retrieving the content of websites. For viewing, handling or automated testing of web content, a basic rendering layout engine is always required; for interpreting Javascript, a Javascript engine is needed within the tool. Among the most common options that can process and test web content are accessories for various web browsers and tools or libraries utilizing the environment of the browser as a means of obtaining content (e.g. Selenium). The only option for interpreting Javascript within websites while running in the background of a server are headless browsers. For the creation and implementation of a detection tool for malicious content, an extensive study of the techniques which are used by attackers to deceive users is necessary.
Then, patterns need to be found which can detect certain malware or which determine the occurrence of the searched malware. Phishing has many methods, like cross-site scripting for injecting dangerous code into a website, or URL redirects which move a user to an unwanted (mostly phishing) website. There are also viruses and trojan horses on websites which


can be detected by checking whether the Javascript code contains the malware patterns. The implementation is designed for use in the background of a Linux system, with the possibility to process thousands of domains. There are many ways to detect malicious content on a website. In this implementation, detection from the user's point of view, as when they come across an infected website, was chosen. This means that the program detects malware based on information from the DOM and on the domain name. Chapter 2 is an overview of methods for website testing. Chapter 3 compares tools for automated website testing and concludes with the choice of the tool for the main implementation. Chapter 4 includes a summary of website malware methods and their detection. Chapter 5 contains the design of the main implementation and the tools which were used there. The sixth chapter describes the evaluation of random inputs of the program, comments on the results of individual types of detection, and discusses how to fix weaknesses and slow parts of the program. The final chapter concludes this thesis.

2 Overview of approaches to website testing

2.1 Manual testing

One of the first types of testing at hand is manual testing, which has four major stages for minimizing the number of defects in the application: unit testing, integration testing, system testing and user acceptance testing. The tester must impersonate the end-user and use all the features of the application in order to ensure error-free behavior. Information in this chapter is gathered from [1].

Unit testing: Manual unit testing is not much used nowadays. It is an expensive method, because the test is done manually. Testers have a test plan and must go through all the prepared steps and cases. It is very time-consuming to perform all the tests. This disadvantage is solved through automated unit testing.

Integration testing: Tests are prepared not by a developer but by a test team. Flawless communication between the individual components inside the application must be verified. Integration can also be verified between the components and the hardware or system interface.

System testing: After the completion of unit and integration testing, the program is verified as a whole. It verifies the application from the customer's perspective. Various steps that might occur in practice are simulated based on prepared scenarios. They usually take place in several rounds. Found bugs are fixed and, in the following rounds, these fixes are tested again.

User acceptance testing: If all the previous stages of testing are completed without major shortcomings, the application can be given to the customer. The customer then usually performs acceptance tests with their team of testers. Found discrepancies between the application and the specifications are reported back to the development team. Fixed bugs are deployed to the customer's environment.

2.2 Automated testing

Automated testing is the process of automating manual tests using automated instruments such as Selenium. Automated testing has several advantages over manual testing. It prevents errors where a part of the test is left out. In automated testing, the same code is always performed, so there is no room for human error, such as a bad entry into an input field. Although at the beginning it is necessary to spend time creating the automated tests, a lot of time is ultimately saved, because everything is tested automatically; tests can be run in parallel on multiple platforms and faster than if the individual acts were carried out by people. In addition, tests can be run without greater effort after any major change in the tested application. Websites are tested through the DOM in combination with the XPath language, which is used to address individual elements of the website. Information in this paragraph is drawn from [2].

DOM: The Document Object Model (DOM) is a platform- and language-independent interface and a standard of the W3C international consortium. With the DOM, programs and scripts can access the contents of HTML and XML documents, or change the content. The DOM allows you to access the document tree and to add or remove individual nodes. The root node of the tree is always the Document node, representing the document as a whole. The specifications of this standard are divided into several levels, where a newer level extends the original levels and preserves backward compatibility. Information in this paragraph is drawn from [3].
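The tree manipulation described above can be illustrated with Python's standard xml.dom.minidom module; a minimal sketch, not part of the thesis implementation, showing node insertion and removal:

```python
from xml.dom.minidom import parseString

# Parse a small HTML-like document into a DOM tree.
doc = parseString("<html><body><p>Hello</p></body></html>")

# Navigate the tree; the Document node is the root, <html> its element child.
body = doc.getElementsByTagName("body")[0]

# Add a new child node...
div = doc.createElement("div")
div.appendChild(doc.createTextNode("injected"))
body.appendChild(div)

# ...and remove an existing one.
body.removeChild(doc.getElementsByTagName("p")[0])

print(doc.documentElement.toxml())
# → <html><body><div>injected</div></body></html>
```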

XPath: XPath is a language for addressing nodes in XML documents. The result of an XPath query is a set of elements, or the value of an attribute of a given element. XPath provides many ways to address a specific element. Both relative and absolute addresses (given by the individual elements from the root to the given element) can be used. Furthermore, it contains many predefined functions for addressing the offspring


and parents of the current node. It also contains commands for finding an element with a specific attribute value and many other features. Information in this paragraph is drawn from [4].
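Relative addressing and attribute predicates can be illustrated with Python's standard ElementTree module, which supports a limited subset of XPath (a full XPath engine, such as the one available through Selenium, offers much more); the sample markup below is made up for the example:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body>"
    "<a href='https://example.com'>ok</a>"
    "<a href='http://evil.test' class='login'>bait</a>"
    "</body></html>"
)

# Relative address: all <a> elements anywhere below the current node.
links = doc.findall(".//a")
print(len(links))  # → 2

# Attribute predicate, similar to //a[@class='login'] in full XPath.
bait = doc.find(".//a[@class='login']")
print(bait.get("href"))  # → http://evil.test
```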

2.2.1 Selenium

Selenium is a suite of open source tools for automated testing of web applications. It can be used with different browsers (Firefox, Chrome, Edge), including headless browsers like PhantomJS or HtmlUnit. Selenium is cross-platform and can be controlled by many programming languages and test frameworks. Selenium attempts to interact with the browser as a real user would. This set of tools includes four powerful components that are described in the following paragraphs.

Selenium IDE: Selenium IDE is used for creating tests and automated tasks using an add-on for Firefox (see figure 2.1). Tests are made by a user recording his or her activity, which is then written in the form of individual commands into a table. The table can also be manually edited with individual commands. These commands can be basic commands, such as click or insert text, but also advanced methods of verifying the title of the website, or various assertions similar to those of the JUnit test library for the Java language. Individual commands contain three items: the first item is the command name, the second one specifies the element targeted by the command, and the last one contains the command value. Created tests (commands) can also be displayed in HTML code, and it is also possible to export them into one of the programming languages Java, Ruby or C# and then use them in the Selenium WebDriver described below.

Selenium Remote Control: Also known as Selenium 1, it consists of two main parts. The first part is the Selenium Remote Control Server, which starts/ends the browser and interprets and runs Selenium commands from the test program. These commands are interpreted using the JavaScript interpreter of the given web browser. After executing a command, it returns the result of the command to the testing program.


Figure 2.1: Selenium IDE plug-in for Mozilla Firefox.

Furthermore, it works as an HTTP proxy, capturing and authenticating HTTP communication between the web browser and the testing program. The server accepts Selenium commands via simple HTTP GET/POST requests, so a wide variety of programming languages capable of sending HTTP requests can be used. The second part is the client libraries, which provide a software interface between the programming language and the Selenium RC Server. The supported programming languages are Java, Ruby, Python, Perl, PHP and .NET, and Selenium offers a client library for each of them. These client libraries take Selenium commands and transmit them to the Selenium server to be performed; the server then returns the return value of the command.

WebDriver: The main change in Selenium 2 is the integration of the WebDriver API. It was created to better support dynamic websites on which elements can change without the whole website being reloaded.


The WebDriver API works in such a way that when an instance of a class implementing the WebDriver interface is created, the browser opens a new window and communicates with that instance. The instance can be created using the ChromeDriver, FirefoxDriver and other classes. Selenium 2 makes direct calls to the web browser using its native support for automation. How the direct calls are implemented depends on the web browser, unlike Selenium RC, which is forced to inject Javascript code into the web browser and induce the appropriate actions that way. With an increasing number of tests and their increasing complexity, different limits begin to show. Selenium 2 can be very slow when in control of the browser, and the Remote Control Server may become a bottleneck of testing. Parallel execution of multiple concurrent tests on the same Remote Control Server reduces the stability of the tests (it is not recommended to run more than six concurrent tests; in some browsers this number is even smaller). Because of these limitations, Selenium tests are generally run consecutively in sequence, or only slightly in parallel. These problems are solved by Selenium Grid.

Selenium Grid: Selenium Grid takes advantage of the fact that the test program and the Remote Control Server/web browser do not have to be running on one computer, but can run on multiple computers communicating with each other using HTTP. A typical use of Selenium Grid is to have a certain set of Selenium Remote Control Servers that can be easily shared across different versions, test applications and projects, where each Selenium Remote Control may offer various platforms and versions, or types, of web browsers. Each Selenium Remote Control informs the Selenium Hub about which platforms and browsers it provides. The Selenium Hub allocates a Selenium Remote Control for specific test requirements; a test can request a specific platform or version of the web browser. Furthermore, the Hub limits the number of concurrent tests and hides the Selenium Grid architecture from the tests. This is advantageous because changing Selenium WebDriver to Selenium Grid in the test program code requires almost no change.


2.2.2 Other website testing tools

Watir: Project Watir (Web Application Testing in Ruby) is a set of open-source (BSD) Ruby libraries for automating the web browser. Watir interacts with the web browser just like a real user: it is able to click on links and buttons, fill out forms, and it lets you check whether the expected text is displayed on the website. The project is multi-platform and supports Chrome, Internet Explorer, Firefox, Opera and Safari. Browsers are controlled in a different way than in HTTP test libraries (e.g. Selenium). Watir controls the browser directly using the "Object Linking and Embedding" protocol, which is built on the "Component Object Model" (COM). The process of the web browser serves as an object, making its methods accessible for automation. These methods are then called from user programs in the Ruby language. Information in this paragraph is drawn from [5].

Huxley: A test system which, unlike the other systems mentioned in this document, tests content based on visual appearance (comparing screenshots). Ordinary automated tests cannot visually determine that something is not right; this problem is solved by the Huxley system. It is written in the Python scripting language and can work in two modes. The first mode is Playback, where individual tests created in record mode are run. It works by comparing the newly created screenshots with those that were taken earlier. If the images differ, a notification appears and the designer checks whether the error was indeed in the user interface. The second mode is Record, where a new browser window is opened using the Selenium WebDriver and user actions are recorded. If the user wants to compare screenshots at a given point, he or she presses enter. Information in this paragraph is drawn from [6].
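The core of such playback-mode comparison can be approximated by comparing digests of the stored and freshly taken screenshot bytes; a minimal sketch with made-up byte strings standing in for image files (a real tool like Huxley would do a pixel-level diff with tolerance):

```python
import hashlib

def same_screenshot(old_bytes, new_bytes):
    # Byte-exact comparison via SHA-256 digests: identical digests mean
    # the new screenshot matches the recorded one and no review is needed.
    return hashlib.sha256(old_bytes).hexdigest() == hashlib.sha256(new_bytes).hexdigest()

unchanged = same_screenshot(b"\x89PNG...frame1", b"\x89PNG...frame1")
changed = same_screenshot(b"\x89PNG...frame1", b"\x89PNG...frame2")
print(unchanged, changed)  # → True False
```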

Splash: A Javascript rendering service that is used for testing the behavior of a website or for taking screenshots of the website. Splash is written in Python, provides an HTTP interface, allows you to process multiple websites in parallel and to use Adblock Plus filters for faster rendering of websites. Its test scripts are written in the Lua language. Information in this paragraph is drawn from [7].


SauceLabs: It is a service that works on the principle of Selenium Grid and simplifies application testing for developers on different operating systems and web browsers. SauceLabs allows connection with the Selenium test library. There is a possibility of manual testing, where the user selects a platform and browser for testing. After the selection, a window opens with a remote desktop of the selected platform with a running browser. Furthermore, SauceLabs offers testing of websites on Android and iOS using the open source automation tool named Appium. This tool can test both native and hybrid applications; it can even test the mobile version of Safari on iOS. The cheapest version of this service costs $19 per month for a single virtual computer. The most expensive version costs $52 per month for eight virtual computers and 4,000 minutes of automated testing. These prices are as of 1 August 2017. Information in this paragraph is drawn from [8].

BrowserStack: This service is similar to SauceLabs; it offers manual testing called Live, automated testing using Selenium, and a service generating previews (screenshots) of the website. Compared to SauceLabs, BrowserStack offers many types of licenses. Automated testing with two virtual machines costs $99 per month, and with five machines it costs $199 per month. These prices are as of 1 August 2017. Information in this paragraph is drawn from [9].

Headless browser: It is a type of browser that connects to a specific website, but without a GUI. This means that the user cannot see any rendered result of the connected website. These browsers support Javascript and the DOM and contain a rendering layout core like WebKit or Gecko. They do not support audio, video, plugins, Flash, Java, WebGL, geolocation, CSS 3-D, ambient light or speech. Most of them can include and execute a user's script and use DOM events like clicking or typing to imitate a user. Headless browsers are used for providing content for programs or scripts, automated testing, analyzing performance, monitoring the network or taking a screenshot of the entire website content. They

are also used to perform DDoS attacks or to increase advertisement impressions. Headless browsers are described in more detail in Section 3.1.

3 Comparison of tools for automated website testing

In this chapter, the selected tools are briefly described and compared, with the goal of selecting the best tool for the main implementation of this bachelor thesis. Six different headless browsers were chosen according to different combinations of their cores and programming languages: PhantomJS, SlimerJS, HtmlUnit, ZombieJS, Ghost.py and Google Chrome (in headless mode). Headless browsers are the best choice for this implementation, as they all run without a GUI, only in the background of the system. This also decreases the load on operating memory and the CPU workload. Additionally, they contain a Javascript interpreter, which means that they can return the contents of a website to the output only after executing the Javascript code that the website contains.

3.1 Criteria

For a suitable tool, a solution is needed that can run in the background of a Linux operating system. Therefore, tools that work only with other operating systems have been excluded. Furthermore, the tool must be able to interpret Javascript code. Not all headless browsers guarantee a flawless and full implementation of Javascript, which is extremely important for the implementation, to make sure some kinds of suspicious behavior do not pass unnoticed. These criteria are given by the specifications of the bachelor thesis. Moreover, the tool should have comprehensible documentation for comfortable studying of the options of its API. It would also be good for the tool to have a user forum with developers or other users of the tool who would be able to help with potential problems that may arise during implementation. After discovering a malfunction of any of the tool's components, there should be the opportunity to report errors so that they can be fixed in future versions. In that case, it would be appropriate to choose a tool that is constantly being developed, possibly one with regular updates or at least one that fixes occurring bugs. These criteria are not present in the assignment of the bachelor


thesis; however, for an efficient and practical use of the tool, they need to be included, as not all tools necessarily fulfill them and they may be key to the illustrative function of the thesis implementation. The tool should be able to process the largest number of domains specified on the input in the shortest possible time (e.g. 100,000 domains) with the lowest RAM workload possible. These parameters are shown in Section 3.3, where a comparison of the speed and the operating-memory utilization of each tool is presented. The selected tool should contain a commonly used layout rendering core and a Javascript rendering core. These cores are also used by regular web browsers, and because of this use, they are a much better match than cores created just for a specific tool. These criteria are also not listed in the specifications of the bachelor thesis, but they are important for selecting a quality, universal tool.

Layout rendering core: Also known as a web browser engine, it is a program for rendering a website. It renders marked-up content like HTML, XML or images, and formatting information such as CSS styles. The layout rendering core does not have to be used only in web browsers, but also in every system or device that somehow accesses the content of a website. The layout core with the highest utilization is WebKit, which is used by Safari from Apple, and Google Chrome uses a newer fork of it called Blink. Second place is held by Gecko, which is mainly used in Mozilla Firefox. Internet Explorer uses the Trident core, Microsoft Edge uses EdgeHTML, and older versions of Opera used Presto.

Javascript rendering engine: This engine is a program that is responsible for executing Javascript code and is mainly used in web browsers. The engine with the largest worldwide share is V8, which is developed by Google and mainly used in Google Chrome and Opera. Firefox has the SpiderMonkey engine (early versions used RhinoJS). Internet Explorer and Microsoft Edge have the same Javascript engine, called Chakra. Safari uses JavaScriptCore, which is part of the present WebKit.


Figure 3.1: Worldwide share of the usage of layout engines in November 2017. Data collected from [10].

3.2 Compared tools

PhantomJS: A cross-platform headless browser written in C++ that uses the WebKit rendering layout engine with a scriptable Javascript API. Its Javascript rendering engine is JavaScriptCore (like Safari's). PhantomJS supports working with the DOM, HTML5, CSS3 selectors, JSON, SVG and Canvas. Its API allows the user to open a website, modify the content of the website, click on links, perform automated website testing with jQuery support, set or change cookies, capture screenshots, listen to network requests and subsequently convert them into the HAR format, simulate keyboard and mouse input, and read/write files. PhantomJS itself is not a testing framework; it must include a test runner like CasperJS, Jasmine, QUnit, etc. It may also be used as a browser (instance) for the Selenium WebDriver. PhantomJS is the most used headless browser, mainly for scraping the content of websites or testing websites. Its documentation is not complete, but the website of the tool also contains many useful examples of its different uses and the utilization of its options. Information in this subsection is drawn from [11].

Figure 3.2: Worldwide share of usage of Javascript engines in November 2017. Data collected from [10].

Figure 3.3: Example of script for downloading a website in PhantomJS.

SlimerJS: A tool similar to PhantomJS, but with other rendering engines. The layout rendering core is Gecko, from Firefox, and the Javascript rendering engine is SpiderMonkey. It uses a very similar API to PhantomJS, but some features are missing. The developers of this tool are trying to make the API exactly the same as PhantomJS's, but this is still in progress. SlimerJS is not really a headless browser: after it launches, a Firefox browser window is also opened until the script is completed (the dependence on Firefox should be removed in the next version of the tool). It runs on the platforms on which Firefox can run. It also supports features such as audio, video, etc. SlimerJS can become headless only when used with the xvfb tool (a virtual display server). The tool has a dependency on Firefox and will not launch without its installation. Information in this subsection is drawn from [12].

HtmlUnit: It is a headless browser written in Java and used for Java implementations. It supports two different rendering layout engines that you can choose from (WebKit and Gecko). The Javascript rendering engine is RhinoJS (developed by Mozilla), which is used very little nowadays. HtmlUnit supports many Javascript libraries for unit tests, such as jQuery, MochiKit, Dojo, etc. It can work with AJAX (Asynchronous JavaScript and XML) and, compared to PhantomJS, does not yet include full support of Javascript. Information in this subsection is drawn from [13].

ZombieJS: This headless full-stack testing tool [14] using Node.js allows the user to interact with the website, but the tool itself does not allow the user to do assertions. This requires a test framework that runs on Node.js, such as Mocha or Jasmine. The testing framework allows the user to run tests serially and to do flexible reporting of tests, which makes testing easier. To install ZombieJS, the Node.js and io.js libraries that are necessary for the functionality of the tool must be installed.

Ghost.py: A Python library [15] for testing and scraping of web content. Ghost.py uses the WebKit web client for accessing website content and the JavaScriptCore Javascript interpreter. For proper operation, an installation of PySide or PyQt is required.

Google Chrome: A freeware web browser developed by Google which also includes a headless environment, from version 59 on Mac and Linux (60 on Windows) [16]. Its web engine is Blink, which is a fork of WebKit, and its Javascript interpreter is V8. The headless environment in Linux is executed by the following command:

google-chrome-stable --headless --disable-gpu http://example.com

Listing 3.1: Command for the execution of headless Chrome.

The command supports three main features: creating a PDF (--print-to-pdf), printing the DOM (--dump-dom) and taking a screenshot of the website with a given resolution (--screenshot --window-size).
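These invocations can also be built and run from a script; a minimal Python sketch (the helper function name and its defaults are my own, not from the thesis) that assembles such a command line and runs it only if the Chrome binary is actually present:

```python
import shutil
import subprocess

def build_chrome_cmd(url, mode="--dump-dom", binary="google-chrome-stable"):
    """Build a headless-Chrome command line for one of the three modes
    mentioned above: --dump-dom, --print-to-pdf or --screenshot."""
    return [binary, "--headless", "--disable-gpu", mode, url]

cmd = build_chrome_cmd("http://example.com")
print(cmd)

# Only attempt to run it when the Chrome binary exists on this system.
if shutil.which(cmd[0]):
    dom = subprocess.run(cmd, capture_output=True, text=True).stdout
    print(dom[:80])
```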

3.3 Evaluation

Written in: The chosen tool should process even thousands of websites in a short time. The best option among the languages


in which the instruments are written is C++, due to its speed compared to the other languages of the selected instruments.

Supported language: The best choice of supported language is Javascript, because in the main implementation the best solution is to work with the Javascript code of the website to find patterns of phishing cases, which are mostly written in the code. It is also possible to use many Javascript libraries, such as jQuery, with many useful functionalities, and for better clarity of the code.

Layout rendering core: It determines that the website is rendered the same way the user sees it. To see exactly the same output as the majority of worldwide users, the best solution is to choose the most used layout engine, Blink.

Javascript rendering core: The most suitable Javascript engine for Blink is V8, as they are developed by the same company and work well together.

Community/forum: Only HtmlUnit, PhantomJS, SlimerJS and Chrome have active communities with a forum where users can help each other.

Execution time: A simple script was written for each tool that downloads the full text of a website after performing a simple Javascript code. The tested website is a locally stored basic HTML website with various basic elements. The Javascript code that appears on this website is again a simple script that changes the content of the elements on the website. The results in the table present the average speed of ten attempts at loading the website. The result is very relative, but it gives a basic overview of the range in which the results lie. The testing was conducted locally in order not to degrade the results by the Internet connection. The Linux command /usr/bin/time with the parameter -v was used to measure the execution time of each script in the specific tools; it also shows the maximum RAM usage during the run and other information about the memory.
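The figure of interest in that verbose output is the "Maximum resident set size" line; a small Python helper (my own sketch, with a shortened made-up sample of the /usr/bin/time -v output) that extracts it:

```python
def max_rss_kb(time_v_output):
    """Extract peak memory in kB from the verbose output of `/usr/bin/time -v`."""
    for line in time_v_output.splitlines():
        if "Maximum resident set size" in line:
            # The value follows the last colon on the line.
            return int(line.rsplit(":", 1)[1])
    return None

sample = (
    "\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:01.24\n"
    "\tMaximum resident set size (kbytes): 215044\n"
)
print(max_rss_kb(sample))  # → 215044
```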


Figure 3.4: Overview of the main characteristics of headless browsers.

Testing was conducted on a Lenovo ThinkPad e520 NZ35WMC laptop with a 64-bit operating system Fedora 23.

DOM comparison: The DOM printed after Javascript execution on randomly chosen websites was compared across the tools. Surprisingly, not every tool returned the same result. PhantomJS, SlimerJS, HtmlUnit and Ghost.py returned an empty page for http://youtube.com, and ZombieJS also for http://seznam.cz. Chrome in headless mode had no problem with any of the tested pages and returned the same result as the web browser.
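A naive way to flag the empty-page results mentioned above could look like the following sketch (the regular expressions are simplifying assumptions and would not handle every real page):

```javascript
// Naive check for an "effectively empty" DOM dump: extract the body,
// strip the remaining tags, and see whether any visible text is left.
function isEffectivelyEmpty(domDump) {
  const match = domDump.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  const body = match ? match[1] : '';
  const text = body.replace(/<[^>]+>/g, '').trim(); // drop remaining tags
  return text.length === 0;
}
```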

3.3.1 Conclusion of comparison

In this chapter, six different tools were chosen and compared with each other. The main criteria for selecting the best tool were support for Linux, the ability to run in the background, an active community, and modest technical requirements, including the highest processing speed and the most widely used rendering engines. The analysis shows that the Blink layout engine (a fork of WebKit) has the largest worldwide share of use in web browsers (including those on mobile devices); by using it, one therefore obtains exactly the same rendering of a website as the majority of Internet users see.

The assessment section describes the parameters of an ideal tool. Most of them are met by only one tool, PhantomJS. However, PhantomJS cannot return the DOM of all the tested websites. The only tool with correct results is Google Chrome's headless environment. Therefore this tool will be selected for the main implementation. It is written in C++, it fully and flawlessly supports Javascript, and it uses the Blink layout core and the V8 Javascript engine. Additionally, it has the largest community and the widest use among headless browsers. In the speed testing of all the tools, it finished as the fastest, with average operating-memory usage compared to the others.

Figure 3.5: Result of the /usr/bin/time -v command.

4 Detection of phishing and other malicious content

This chapter gives a definition of the malicious content that is found on websites. Furthermore, it contains a description of methods for detecting some types of these attacks and malicious content. My thesis specializes in detecting malicious content based on the content of the tested website and on processing the URL given as input.

4.1 Malicious content on websites

4.1.1 Phishing

Phishing is a way to obtain sensitive data (passwords or credit card details) from users for subsequent abuse by posing as a trustworthy entity. Most often it takes advantage of the gullibility of the user in a way that the user does not notice at first glance; in the worst case, the attacker obtains the user's data without their knowledge.

One of the first phishing tactics was sending fraudulent e-mails. These e-mails contain links to malicious websites that look, for example, like the website of the user's bank. Such a website can contain a form for obtaining the user's information or a link to download a malicious file.

There are four basic ways to spread phishing content to users. The first one is the email-email way, where all communication happens by email. The second one is email-website, where a user receives an email with a link to an infected website. Another way is website-website, when a user finds a link on an uninfected website that leads to an infected one. And the last way is browser-website, when a user installs a dangerous browser extension which redirects the user to phishing websites.

Another technique is phone phishing: a fake call from the user's bank in which the caller requires the user's personal data. The victim can also be induced over the phone to share their information by an SMS.

Nowadays, the number of websites and users is constantly increasing, and with it the creation and spreading of phishing websites, which employ increasingly intelligent tactics to trick users. Attackers


are always trying to find new ways to confuse users with something they have not encountered before, so that they are deceived more easily. The information in this subsection is drawn from [17].

Figure 4.1: How phishing works [18].

Cross-site scripting: Also known as XSS, cross-site scripting is a technique for disrupting a website by exploiting untreated input errors in the scripts of the website. Using this security loophole, the attacker is able to insert custom Javascript code that can retrieve sensitive information, damage the website or display different content to the user [19].
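The core of the problem can be illustrated with a small sketch (the function names are hypothetical, chosen only for illustration): building HTML by concatenating untreated input lets that input become executable markup, while escaping it does not.

```javascript
// Vulnerable pattern: attacker-controlled input is concatenated into HTML,
// so an input such as '<script>...</script>' survives into the page.
function renderGreeting(name) {
  return '<p>Hello ' + name + '</p>';
}

// Treated input: the special characters are escaped before insertion.
function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;')
          .replace(/>/g, '&gt;').replace(/"/g, '&quot;');
}
function renderGreetingSafe(name) {
  return '<p>Hello ' + escapeHtml(name) + '</p>';
}
```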

URL redirects: In this type of phishing, after visiting the website the user is redirected to another website. The


redirection can occur in Javascript, in HTML meta tags or in plug-ins. The redirection does not have to lead to only one address; it can also lead to several different infected websites.
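A detector working on the page source might look for such redirects with simple patterns like these (illustrative heuristics only, not the thesis's actual rules):

```javascript
// Naive source-code patterns for the redirect mechanisms mentioned above:
// Javascript location changes and HTML meta refresh tags.
const redirectPatterns = [
  /window\.location(\.href)?\s*=/i,     // Javascript assignment redirect
  /location\.replace\s*\(/i,            // Javascript replace() redirect
  /<meta[^>]+http-equiv=["']?refresh/i  // HTML meta refresh
];

function looksLikeRedirect(source) {
  return redirectPatterns.some((re) => re.test(source));
}
```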

Link manipulation: Most phishing methods involve a link, found in emails or directly on a website. These links look like links to credible sites, for example the user's bank. The link is altered so that it contains typos compared to its original version, or it uses domains and subdomains that the original version does not. The goal is to confuse the victim, who has no idea that he or she is being redirected to an infected page. Link manipulation can be implemented directly in an HTML document or in Javascript; in the latter case the address of the link changes right after the page is loaded, or after the user clicks on the link.
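One simple signal of link manipulation, sketched below under the assumption that the visible link text itself looks like a URL, is a mismatch between the host shown to the user and the host the link actually targets (a hypothetical helper for illustration only):

```javascript
// If a link's visible text is a URL whose hostname differs from the
// hostname in its real href, the link may be manipulated.
function linkTextMismatch(href, visibleText) {
  try {
    const hrefHost = new URL(href).hostname;
    const textHost = new URL(visibleText).hostname;
    return hrefHost !== textHost;
  } catch (err) {
    return false; // the visible text is not a URL, so this signal does not apply
  }
}
```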

Imitating a trusted entity: The most sophisticated method of phishing is when the attacker poses as a trusted entity with which the victim may have an account. This method can take the form of email communication between the user and the entity, or the victim can come across a website which is almost identical to the original. The user enters his or her private information on the infected site and may then be redirected to the original website without noticing. There are even more refined ways to deceive a user: for example, the victim can open a link which opens the original website in a new browser window together with a fake form for entering the victim's private data; this type of phishing is also called "tabnabbing".

4.1.2 Other malicious content

Malicious software, or malware for short, includes adware, spyware, viruses, trojan horses, ransomware, rootkits and the above-mentioned phishing. Malware is harmful software which enables the attacker to access the victim's computer.

Different types of malware occur on websites mostly as downloadable binary files, which are difficult to detect without downloading the file. However, there are some malware types, described below, which can appear within the source code of the website.


Spyware: One type of spyware which can be encountered on a website is a keylogger. It is software which records the activity of the user's keyboard and sends it back to the attacker's server. The recording can be restricted to scan only input tags whose type parameter is set to the value password. By this method an attacker can easily obtain a victim's passwords or credit card numbers.
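A source-code scanner could flag this keylogger pattern with a naive heuristic such as the following sketch (the regular expressions are assumptions for illustration, not the thesis's actual rules):

```javascript
// Flag source code that both listens for keystrokes and refers to
// password fields, a combination typical of the keylogger described above.
function looksLikeKeylogger(source) {
  const listensForKeys =
    /addEventListener\s*\(\s*['"]key(down|press|up)['"]/.test(source) ||
    /on(keydown|keypress|keyup)\s*=/.test(source);
  const targetsPasswords = /type\s*=\s*["']?password/.test(source);
  return listensForKeys && targetsPasswords;
}
```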

Ransomware: An "extortionate" method which encrypts all files on the victim's storage and forces the user to pay a ransom, usually in Bitcoin because Bitcoin transactions are difficult to trace, in exchange for the password that restores access to the storage. Websites are also used for spreading ransomware: the attacker infects a website server with this type of software [20] and the owner is forced to pay for decryption.

4.2 Detection of phishing

There are several situations in which a website can be infected. In this thesis, the detection of malicious content is viewed from the user's point of view. This means that the key question is what can actually happen to a user when he or she comes across an infected website. From this point of view, detection can be performed on the source code and on the given URL. An infected downloadable file is not detected, nor is it detected (that would be the administrator's point of view) whether the site has been damaged or whether infected files have been uploaded to it. This is the reason why this thesis does not include the detection of trojan horses, viruses and other such types of malware.
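For the URL side of the detection, simple heuristics like the following sketch are commonly used (these particular rules are illustrative assumptions, not the thesis's actual checks):

```javascript
// Classic phishing-URL signals: an IP address instead of a hostname,
// an '@' sign in the address, or an unusual number of subdomains.
function suspiciousUrl(urlString) {
  const u = new URL(urlString);
  const ipHost = /^\d{1,3}(\.\d{1,3}){3}$/.test(u.hostname);
  const hasAt = urlString.includes('@');
  const manySubdomains = u.hostname.split('.').length > 4;
  return ipHost || hasAt || manySubdomains;
}
```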

4.2.1 Cross-site Scripting

There are three types of XSS by which an attacker can inject their Javascript code into a website. All of them involve inserting the attacker's own script into inputs on the site or into the URL.

DOM-based: A local type of XSS in which untreated URL variables can make their way into the code. For this attack to work, the Javascript code must use such a URL variable in a function like document.write(). The harmful code can then be written directly into the address, for example in the following URL:

http://website.com/page.html?variable=<script>alert('XSS')</script>
