<<

Masaryk University Faculty of Informatics

System for detection of websites with phishing and other malicious content

Bachelor's Thesis

Tomáš Ševčovič

Brno, Fall 2017

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Tomáš Ševčovič

Advisor: prof. RNDr. Václav Matyáš, M.Sc., Ph.D.

Acknowledgement

I would like to thank prof. RNDr. Václav Matyáš, M.Sc., Ph.D. for supervising this bachelor thesis and for his valuable advice and comments. I would also like to thank the consultant from CYAN Research & Development s.r.o., Ing. Dominik Malčík, for useful advice, dedicated time and patience during consultations and application development. Finally, I would like to thank my family and friends for their support throughout my studies and my work on this thesis.

Abstract

The main goal of this bachelor thesis is to create a system for the detection of websites with phishing and other malicious content, with respect to Javascript interpretation. The program should be able to download and process thousands of domains and produce positive results. The first step involves examining an overview of automated web testing tools to find an optimal tool to be used in the main implementation. The thesis contains an overview of technologies for testing, their comparison, an overview of malware methods on websites, and the implementation and evaluation of the system.

Keywords

Chrome, Javascript, link manipulation, malware, phishing, URL redirects, XSS, Yara

Contents

1 Introduction

2 Overview of approaches to website testing
  2.1 Manual testing
  2.2 Automated testing
    2.2.1 Selenium
    2.2.2 Other website testing tools

3 Comparison of tools for automated website testing
  3.1 Criteria
  3.2 Compared tools
  3.3 Evaluation
    3.3.1 Conclusion of comparison

4 Detection of phishing and other malicious content
  4.1 Malicious content on websites
    4.1.1 Phishing
    4.1.2 Other malicious content
  4.2 Detection of phishing
    4.2.1 Cross-site Scripting
    4.2.2 URL Redirects
    4.2.3 Link Manipulation
    4.2.4 Imitating trusted entity
    4.2.5 Detection of other malicious content

5 Implementation
  5.1 Design
  5.2 Tools and libraries
    5.2.1 Chrome
    5.2.2 Wget
    5.2.3 Beautiful soup
    5.2.4 PyLibs
    5.2.5 Yara patterns
  5.3 Input
  5.4 Output

6 Evaluation of results
  6.1 Optimization
    6.1.1 Parallelism
    6.1.2 Database
  6.2 Execution times
    6.2.1 Conclusion
  6.3 Results
    6.3.1 Comparison with Google Safe Browsing
    6.3.2 Effect of Javascript interpreter
  6.4 Further work

7 Conclusion

Bibliography

List of Figures

2.1 Selenium IDE plug-in for Mozilla Firefox.
3.1 Worldwide share of the usage of layout engines in November 2017. Data collected from [10].
3.2 Worldwide share of usage of Javascript engines in November 2017. Data collected from [10].
3.3 Example of script for downloading a website in PhantomJS.
3.4 Overview of main characteristics of headless browsers.
3.5 Result of /usr/bin/time -v command.
4.1 How phishing works [18].
5.1 A diagram of the program.
5.2 Output of one website.
6.1 Usage of RAM by Chrome.
6.2 The performance of the testing PC.
6.3 Average execution times per page.
6.4 Average execution times by percentage.
6.5 Ratio of exposed malware per million domains.
6.6 Average execution times by percentage.
6.7 Results of detection in one million domains.
6.8 Comparison of results of Chrome and Wget.

1 Introduction

Every day, malicious content on the Internet attacks numerous users in every corner of the world. Deceptive techniques designed to obtain sensitive information from the user, by acting like a trustworthy entity, often appear within web content. These techniques are known as phishing. Besides phishing, there are more threats on the Internet which can be injected via Javascript. All it often takes is downloading an unverified file that can contain a computer virus. The aim of this bachelor thesis is to explore the area of available test tools and technologies for the detection of such websites and applications. These instruments must be able to interpret the Javascript code of a website, acquire all its content and then work with it. Another aim is to compare the instruments with each other and to make an informed choice of the best one for the practical part of this thesis. A further objective of this thesis is to create a system that will be able to detect phishing and other dangerous content that appears on websites, using the selected tool. The created system has to work efficiently and has to be implemented and run on a Linux server. I prepared an overview of the available options for testing and retrieving the content of websites. For viewing, handling or automated testing of web content, a basic rendering layout engine is always required; for interpreting Javascript, a Javascript engine is needed within the tool. Among the most common options that can process and test web content are accessories for various web browsers and tools or libraries utilizing the environment of the browser as a means of obtaining content (e.g. Selenium). The only option for interpreting Javascript within websites while running in the background of a server are headless browsers. For the creation and implementation of a detection tool for malicious content, an extensive study of the techniques which are used by attackers to deceive users is necessary.
Then, patterns need to be found which can detect certain malware or which determine the occurrence of the searched malware. Phishing has many methods, like cross-site scripting for injecting dangerous code into a website, or URL redirects which move a user to an unwanted (mostly phishing) website. There are also viruses and trojan horses on websites which


can be detected by checking whether the Javascript code contains the malware patterns. The implementation is designed for use in the background of a Linux system, with the possibility to process thousands of domains. There are many ways to detect malicious content on a website. In this implementation, detection from the user's point of view, as when they come across an infected website, was chosen. This means that the program detects malware based on information from the DOM and on the domain name. Chapter 2 is an overview of methods for website testing. Chapter 3 compares tools for automated website testing and concludes with the choice of the tool for the main implementation. Chapter 4 includes a summary of website malware methods and their detection. Chapter 5 contains the design of the main implementation and the tools which were used there. The sixth chapter describes the evaluation of random inputs of the program, comments on the results of individual types of detection, and discusses how to fix weaknesses and slow parts of the program. The final chapter concludes this thesis.

2 Overview of approaches to website testing

2.1 Manual testing

One of the first types of testing at hand is manual testing, which has four major stages for minimizing the number of defects in the application: unit testing, integration testing, system testing and user acceptance testing. The tester must impersonate the end-user and use all the features of the application in order to ensure error-free behavior. Information in this chapter is gathered from [1].

Unit testing: Manual unit testing is not much used nowadays. It is an expensive method, because the test is done manually. Testers have a test plan and must go through all the prepared steps and cases. It is very time-consuming to perform all the tests. This disadvantage is solved through automated unit testing.

Integration testing: Tests are prepared not by a developer but by a test team. Flawless communication between the individual components inside the application must be verified. Integration can also be verified between the components and the hardware or system interface.

System testing: After the completion of unit and integration testing, the program is verified as a whole. It verifies the application from the customer's perspective. Various steps that might occur in practice are simulated based on prepared scenarios. They usually take place in several rounds. Found bugs are fixed and, in the following rounds, these fixes are tested again.

User acceptance testing: If all the previous stages of testing are completed without major shortcomings, the application can be given to the customer. The customer then usually performs acceptance tests with their team of testers. Found discrepancies between the application and the specifications are reported back to the development team. Fixed bugs are deployed to the customer's environment.

2.2 Automated testing

Automated testing is the process of automating manual tests using automated instruments such as Selenium. Automated testing has several advantages over manual testing. It prevents errors where a part of the test is left out. In automated testing, the same code is always performed, so there is no room for human error, such as a bad entry into an input field. Although at the beginning it is necessary to spend time creating the automated tests, a lot of time is ultimately saved, because everything is tested automatically; tests can be run in parallel on multiple platforms and faster than if the individual acts were carried out by people. In addition, tests can be run without greater effort after any major change in the tested application. Websites are tested through the DOM in combination with the XPath language, which is used to address individual elements of the website. Information in this paragraph is drawn from [2].

DOM: The Document Object Model (DOM) is a platform- and language-independent interface and a standard of the W3C international consortium. With the DOM, programs and scripts can access the contents of HTML and XML documents, or change the content. The DOM allows you to access the document tree and to add or remove individual nodes. The root node of the tree is always the Document node, representing the document as a whole. The specifications of this standard are divided into several levels, where a newer level extends the original levels and preserves backward compatibility. Information in this paragraph is drawn from [3].
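The tree manipulation described above can be illustrated with Python's standard xml.dom.minidom module; a minimal sketch, not part of the thesis implementation, showing node insertion and removal:

```python
from xml.dom.minidom import parseString

# Parse a small HTML-like document into a DOM tree.
doc = parseString("<html><body><p>Hello</p></body></html>")

# Navigate the tree; the Document node is the root, <html> its element child.
body = doc.getElementsByTagName("body")[0]

# Add a new child node...
div = doc.createElement("div")
div.appendChild(doc.createTextNode("injected"))
body.appendChild(div)

# ...and remove an existing one.
body.removeChild(doc.getElementsByTagName("p")[0])

print(doc.documentElement.toxml())
# → <html><body><div>injected</div></body></html>
```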

XPath: XPath is a language for addressing nodes in XML documents. The result of an XPath query is a set of elements, or the value of an attribute of a given element. XPath provides many ways to address a specific element. Both relative and absolute addresses (given by the individual elements from the root to the given element) can be used. Furthermore, it contains many predefined functions for addressing the offspring


and parents of the current node. It also contains commands for finding an element with a specific attribute value and many other features. Information in this paragraph is drawn from [4].
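Relative addressing and attribute predicates can be illustrated with Python's standard ElementTree module, which supports a limited subset of XPath (a full XPath engine, such as the one available through Selenium, offers much more); the sample markup below is made up for the example:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body>"
    "<a href='https://example.com'>ok</a>"
    "<a href='http://evil.test' class='login'>bait</a>"
    "</body></html>"
)

# Relative address: all <a> elements anywhere below the current node.
links = doc.findall(".//a")
print(len(links))  # → 2

# Attribute predicate, similar to //a[@class='login'] in full XPath.
bait = doc.find(".//a[@class='login']")
print(bait.get("href"))  # → http://evil.test
```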

2.2.1 Selenium

Selenium is a suite of open source tools for automated testing of web applications. It can be used with different browsers (Firefox, Chrome, Edge), including headless browsers like PhantomJS or HtmlUnit. Selenium is cross-platform and can be controlled by many programming languages and test frameworks. Selenium attempts to interact with the browser as a real user would. This set of tools includes four powerful components that are described in the following paragraphs.

Selenium IDE: Selenium IDE is used for creating tests and automated tasks using an add-on for Firefox (see figure 2.1). Tests are made by a user recording his or her activity, which is then written in the form of individual commands into a table. The table can also be manually edited with individual commands. These commands can be basic commands, such as click or insert text, but also advanced methods of verifying the title of the website, or various assertions similar to those of the JUnit test library for the Java language. Individual commands contain three items: the first item is the command name, the second one specifies the element targeted by the command, and the last one contains the command value. Created tests (commands) can also be displayed in HTML code, and it is also possible to export them into one of the programming languages Java, Ruby or C# and then use them in the Selenium WebDriver described below.

Selenium Remote Control: Also known as Selenium 1, it consists of two main parts. The first part is the Selenium Remote Control Server, which starts/ends the browser and interprets and runs Selenium commands from the test program. These commands are interpreted using the JavaScript interpreter of the given web browser. After executing a command, it returns the result of the command to the testing program.


Figure 2.1: Selenium IDE plug-in for Mozilla Firefox.

Furthermore, it works as an HTTP proxy, capturing and authenticating HTTP communication between the web browser and the testing program. The server accepts Selenium commands via simple HTTP GET/POST requests, so a wide variety of programming languages capable of sending HTTP requests can be used. The second part is the client libraries, which provide a software interface between the programming language and the Selenium RC Server. The supported programming languages are Java, Ruby, Python, Perl, PHP and .NET, and Selenium offers a client library for each of them. These client libraries take Selenium commands and transmit them to the Selenium server to be performed; the server then returns the return value of the command.

WebDriver: The main change in Selenium 2 is the integration of the WebDriver API. It was created to better support dynamic websites on which elements can change without the whole website being reloaded.


The WebDriver API works in such a way that when an instance of a class implementing the WebDriver interface is created, the browser opens a new window and communicates with that instance. The instance can be created using the ChromeDriver, FirefoxDriver and other classes. Selenium 2 makes direct calls to the web browser using its native support for automation. How the direct calls are implemented depends on the web browser, unlike Selenium RC, which is forced to inject Javascript code into the web browser and induce the appropriate actions that way. With an increasing number of tests and their increasing complexity, different limits begin to show. Selenium 2 can be very slow when in control of the browser, and the Remote Control Server may become a bottleneck of testing. Parallel execution of multiple concurrent tests on the same Remote Control Server reduces the stability of the tests (it is not recommended to run more than six concurrent tests; in some browsers this number is even smaller). Because of these limitations, Selenium tests are generally run consecutively in sequence, or only slightly in parallel. These problems are solved by Selenium Grid.

Selenium Grid: Selenium Grid takes advantage of the fact that the test program and the Remote Control Server/web browser do not have to be running on one computer, but can run on multiple computers communicating with each other using HTTP. A typical use of Selenium Grid is to have a certain set of Selenium Remote Control Servers that can be easily shared across different versions, test applications and projects, where each Selenium Remote Control may offer various platforms and versions, or types, of web browsers. Each Selenium Remote Control informs the Selenium Hub about which platforms and browsers it provides. The Selenium Hub allocates a Selenium Remote Control for specific test requirements; a test can request a specific platform or version of the web browser. Furthermore, the Hub limits the number of concurrent tests and hides the Selenium Grid architecture from the tests. This is advantageous because changing Selenium WebDriver to Selenium Grid in the test program code requires almost no change.


2.2.2 Other website testing tools

Watir: Project Watir (Web Application Testing in Ruby) is a set of open-source (BSD) Ruby libraries for automating the web browser. Watir interacts with the web browser just like a real user: it is able to click on links and buttons, fill out forms, and it lets you check whether the expected text is displayed on the website. The project is multi-platform and supports Chrome, Internet Explorer, Firefox, Opera and Safari. Browsers are controlled in a different way than in HTTP test libraries (e.g. Selenium). Watir controls the browser directly using the "Object Linking and Embedding" protocol, which is built on the "Component Object Model" (COM). The process of the web browser serves as an object, making its methods accessible for automation. These methods are then called from user programs in the Ruby language. Information in this paragraph is drawn from [5].

Huxley: A test system which, unlike the other systems mentioned in this document, tests content based on visual appearance (comparing screenshots). Ordinary automated tests cannot visually determine that something is not right; this problem is solved by the Huxley system. It is written in the Python scripting language and can work in two modes. The first mode is Playback, where individual tests created in record mode are run. It works by comparing the newly created screenshots with those that were taken earlier. If the images differ, a notification appears and the designer checks whether the error was indeed in the user interface. The second mode is Record, where a new browser window is opened using the Selenium WebDriver and user actions are recorded. If the user wants to compare screenshots at a given point, he or she presses enter. Information in this paragraph is drawn from [6].
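The core of such playback-mode comparison can be approximated by comparing digests of the stored and freshly taken screenshot bytes; a minimal sketch with made-up byte strings standing in for image files (a real tool like Huxley would do a pixel-level diff with tolerance):

```python
import hashlib

def same_screenshot(old_bytes, new_bytes):
    # Byte-exact comparison via SHA-256 digests: identical digests mean
    # the new screenshot matches the recorded one and no review is needed.
    return hashlib.sha256(old_bytes).hexdigest() == hashlib.sha256(new_bytes).hexdigest()

unchanged = same_screenshot(b"\x89PNG...frame1", b"\x89PNG...frame1")
changed = same_screenshot(b"\x89PNG...frame1", b"\x89PNG...frame2")
print(unchanged, changed)  # → True False
```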

Splash: A Javascript rendering service that is used for testing the behavior of a website or for taking screenshots of the website. Splash is written in Python, provides an HTTP interface, allows you to process multiple websites in parallel and to use Adblock Plus filters for faster rendering of websites. Its test scripts are written in the Lua language. Information in this paragraph is drawn from [7].


SauceLabs: It is a service that works on the principle of Selenium Grid and simplifies application testing for developers on different operating systems and web browsers. SauceLabs allows connection with the Selenium test library. There is a possibility of manual testing, where the user selects a platform and browser for testing. After the selection, a window opens with a remote desktop of the selected platform with a running browser. Furthermore, SauceLabs offers testing of websites on Android and iOS using the open source automation tool named Appium. This tool can test both native and hybrid applications; it can even test the mobile version of Safari on iOS. The cheapest version of this service costs $19 per month for a single virtual computer. The most expensive version costs $52 per month for eight virtual computers and 4,000 minutes of automated testing. These prices are as of 1 August 2017. Information in this paragraph is drawn from [8].

BrowserStack: This service is similar to SauceLabs; it offers manual testing called Live, automated testing using Selenium, and a service generating previews (screenshots) of the website. Compared to SauceLabs, BrowserStack offers many types of licenses. Automated testing with two virtual machines costs $99 per month, and with five machines it costs $199 per month. These prices are as of 1 August 2017. Information in this paragraph is drawn from [9].

Headless browser: It is a type of browser that connects to a specific website, but without a GUI. This means that the user cannot see any rendered result of the connected website. These browsers support Javascript and the DOM and contain a rendering layout core like WebKit or Gecko. They do not support audio, video, plugins, Flash, Java, WebGL, geolocation, CSS 3-D, ambient light or speech. Most of them can include and execute a user's script and use DOM events like clicking or typing to imitate a user. Headless browsers are used for providing content for programs or scripts, automated testing, analyzing performance, monitoring the network or taking a screenshot of the entire website content. They

are also used to perform DDoS attacks or to increase advertisement impressions. Headless browsers are described in more detail in Section 3.1.

3 Comparison of tools for automated website testing

In this chapter, the selected tools are briefly described and compared, with the goal of selecting the best tool for the main implementation of this bachelor thesis. Six different headless browsers were chosen according to different combinations of their cores and programming languages: PhantomJS, SlimerJS, HtmlUnit, ZombieJS, Ghost.py and Google Chrome (in headless mode). Headless browsers are the best choice for this implementation, as they all run without a GUI, only in the background of the system. This also decreases the load on operating memory and the CPU workload. Additionally, they contain a Javascript interpreter, which means that they can return the contents of a website to the output only after executing the Javascript code that the website contains.

3.1 Criteria

For a suitable tool, a solution is needed that can run in the background of a Linux operating system. Therefore, tools that work only with other operating systems have been excluded. Furthermore, the tool must be able to interpret Javascript code. Not all headless browsers guarantee a flawless and full implementation of Javascript, which is extremely important for the implementation, to make sure some kinds of suspicious behavior do not pass unnoticed. These criteria are given by the specifications of the bachelor thesis. Moreover, the tool should have comprehensible documentation for comfortable studying of the options of its API. It would also be good for the tool to have a user forum with developers or other users of the tool who would be able to help with potential problems that may arise during implementation. After discovering a malfunction of any of the tool's components, there should be the opportunity to report errors so that they can be fixed in future versions. In that case, it would be appropriate to choose a tool that is constantly being developed, possibly one with regular updates or at least one that fixes occurring bugs. These criteria are not present in the assignment of the bachelor


thesis; however, for an efficient and practical use of the tool, they need to be included, as not all tools necessarily fulfill them and they may be key to the illustrative function of the thesis implementation. The tool should be able to process the largest number of domains specified on the input in the shortest possible time (e.g. 100,000 domains) with the lowest RAM workload possible. These parameters are shown in Section 3.3, where a comparison of the speed and the operating-memory utilization of each tool is presented. The selected tool should contain a commonly used layout rendering core and a Javascript rendering core. These cores are also used by regular web browsers, and because of this use, they are a much better match than cores created just for a specific tool. These criteria are also not listed in the specifications of the bachelor thesis, but they are important for selecting a quality, universal tool.

Layout rendering core: Also known as a web browser engine, it is a program for rendering a website. It renders marked-up content like HTML, XML or images, and formatting information such as CSS styles. The layout rendering core does not have to be used only in web browsers, but also in every system or device that somehow accesses the content of a website. The layout core with the highest utilization is WebKit, which is used by Safari from Apple, and Google Chrome uses a newer fork of it called Blink. Second place is held by Gecko, which is mainly used in Mozilla Firefox. Internet Explorer uses the Trident core, Microsoft Edge uses EdgeHTML, and older versions of Opera used Presto.

Javascript rendering engine: This engine is a program that is responsible for executing Javascript code and is mainly used in web browsers. The engine with the largest worldwide share is V8, which is developed by Google and mainly used in Google Chrome and Opera. Firefox has the SpiderMonkey engine (early versions used RhinoJS). Internet Explorer and Microsoft Edge have the same Javascript engine, called Chakra. Safari uses JavaScriptCore, which is part of the present WebKit.


Figure 3.1: Worldwide share of the usage of layout engines in November 2017. Data collected from [10].

3.2 Compared tools

PhantomJS: A cross-platform headless browser written in C++ that uses the WebKit rendering layout engine with a scriptable Javascript API. Its Javascript rendering engine is JavaScriptCore (like Safari's). PhantomJS supports working with the DOM, HTML5, CSS3 selectors, JSON, SVG and Canvas. Its API allows the user to open a website, modify the content of the website, click on links, perform automated website testing with jQuery support, set or change cookies, capture screenshots, listen to network requests and subsequently convert them into the HAR format, simulate keyboard and mouse input, and read/write files. PhantomJS itself is not a testing framework; it must include a test runner like CasperJS, Jasmine, QUnit, etc. It may also be used as a browser (instance) for the Selenium WebDriver. PhantomJS is the most used headless browser, mainly for scraping the content of websites or testing websites. Its documentation is not complete, but the website of the tool also contains many useful examples of its different uses and the utilization of its options. Information in this subsection is drawn from [11].

Figure 3.2: Worldwide share of usage of Javascript engines in November 2017. Data collected from [10].

Figure 3.3: Example of script for downloading a website in PhantomJS.

SlimerJS: A tool similar to PhantomJS, but with other rendering engines. The layout rendering core is Gecko, from Firefox, and the Javascript rendering engine is SpiderMonkey. It uses a very similar API to PhantomJS, but some features are missing. The developers of this tool are trying to make the API exactly the same as PhantomJS's, but this is still in progress. SlimerJS is not really a headless browser: after it launches, a Firefox browser window is also opened until the script is completed (the dependence on Firefox should be removed in the next version of the tool). It runs on the platforms on which Firefox can run. It also supports features such as audio, video, etc. SlimerJS can become headless only when used with the xvfb tool (a virtual display server). The tool has a dependency on Firefox and will not launch without its installation. Information in this subsection is drawn from [12].

HtmlUnit: It is a headless browser written in Java and used for Java implementations. It supports two different rendering layout engines that you can choose from (WebKit and Gecko). The Javascript rendering engine is RhinoJS (developed by Mozilla), which is used very little nowadays. HtmlUnit supports many Javascript libraries for unit tests, such as jQuery, MochiKit, Dojo, etc. It can work with AJAX (Asynchronous JavaScript and XML) and, compared to PhantomJS, does not yet include full support of Javascript. Information in this subsection is drawn from [13].

ZombieJS: This headless full-stack testing tool [14] using Node.js allows the user to interact with the website, but the tool itself does not allow the user to do assertions. This requires a test framework that runs on Node.js, such as Mocha or Jasmine. The testing framework allows the user to run tests serially and to do flexible reporting of tests, which makes testing easier. To install ZombieJS, the Node.js and io.js libraries that are necessary for the functionality of the tool must be installed.

Ghost.py: A Python library [15] for testing and scraping of web content. Ghost.py uses the WebKit web client for accessing website content and the JavaScriptCore Javascript interpreter. For proper operation, an installation of PySide or PyQt is required.

Google Chrome: A freeware web browser developed by Google which also includes a headless environment, from version 59 on Mac and Linux (60 on Windows) [16]. Its web engine is Blink, which is a fork of WebKit, and its Javascript interpreter is V8. The headless environment in Linux is executed by the following command:

google-chrome-stable --headless --disable-gpu http://example.com

Listing 3.1: Command for the execution of headless Chrome.

The command supports three main features: creating a PDF (--print-to-pdf), printing the DOM (--dump-dom) and taking a screenshot of the website with a given resolution (--screenshot --window-size).
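These invocations can also be built and run from a script; a minimal Python sketch (the helper function name and its defaults are my own, not from the thesis) that assembles such a command line and runs it only if the Chrome binary is actually present:

```python
import shutil
import subprocess

def build_chrome_cmd(url, mode="--dump-dom", binary="google-chrome-stable"):
    """Build a headless-Chrome command line for one of the three modes
    mentioned above: --dump-dom, --print-to-pdf or --screenshot."""
    return [binary, "--headless", "--disable-gpu", mode, url]

cmd = build_chrome_cmd("http://example.com")
print(cmd)

# Only attempt to run it when the Chrome binary exists on this system.
if shutil.which(cmd[0]):
    dom = subprocess.run(cmd, capture_output=True, text=True).stdout
    print(dom[:80])
```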

3.3 Evaluation

Written in: The chosen tool should process even thousands of websites in a short time. The best option among the languages


in which the instruments are written is C++, due to its speed compared to the other languages of the selected instruments.

Supported language: The best choice of supported language is Javascript, because in the main implementation the best solution is to work with the Javascript code of the website to find patterns of phishing cases, which are mostly written in the code. It is also possible to use many Javascript libraries, such as jQuery, with many useful functionalities, and for better clarity of the code.

Layout rendering core: It determines that the website is rendered the same way the user sees it. To see exactly the same output as the majority of worldwide users, the best solution is to choose the most used layout engine, Blink.

Javascript rendering core: The most suitable Javascript engine for Blink is V8, as they are developed by the same company and work well together.

Community/forum: Only HtmlUnit, PhantomJS, SlimerJS and Chrome have active communities with a forum where users can help each other.

Execution time: A simple script was written for each tool that downloads the full text of a website after performing a simple Javascript code. The tested website is a locally stored basic HTML website with various basic elements. The Javascript code that appears on this website is again a simple script that changes the content of the elements on the website. The results in the table present the average speed of ten attempts at loading the website. The result is very relative, but it gives a basic overview of the range in which the results lie. The testing was conducted locally in order not to degrade the results by the Internet connection. The Linux command /usr/bin/time with the parameter -v was used to measure the execution time of each script in the specific tools; it also shows the maximum RAM usage during the run and other information about the memory.
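The figure of interest in that verbose output is the "Maximum resident set size" line; a small Python helper (my own sketch, with a shortened made-up sample of the /usr/bin/time -v output) that extracts it:

```python
def max_rss_kb(time_v_output):
    """Extract peak memory in kB from the verbose output of `/usr/bin/time -v`."""
    for line in time_v_output.splitlines():
        if "Maximum resident set size" in line:
            # The value follows the last colon on the line.
            return int(line.rsplit(":", 1)[1])
    return None

sample = (
    "\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:01.24\n"
    "\tMaximum resident set size (kbytes): 215044\n"
)
print(max_rss_kb(sample))  # → 215044
```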


Figure 3.4: Overview of the main characteristics of headless browsers.

Testing was conducted on a Lenovo ThinkPad e520 NZ35WMC laptop with a 64-bit operating system Fedora 23.

DOM comparison: The DOM printed after Javascript execution on randomly chosen websites was compared across the tools. Surprisingly, not every tool returned the same result. PhantomJS, SlimerJS, HtmlUnit and Ghost.py returned an empty page for http://youtube.com, and ZombieJS also for http://seznam.cz. Chrome in headless mode had no problem with any of the tested pages and returned the same result as the web browser.
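A naive way to flag the empty-page results mentioned above could look like the following sketch (the regular expressions are simplifying assumptions and would not handle every real page):

```javascript
// Naive check for an "effectively empty" DOM dump: extract the body,
// strip the remaining tags, and see whether any visible text is left.
function isEffectivelyEmpty(domDump) {
  const match = domDump.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  const body = match ? match[1] : '';
  const text = body.replace(/<[^>]+>/g, '').trim(); // drop remaining tags
  return text.length === 0;
}
```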

3.3.1 Conclusion of comparison

In this chapter, six different tools were chosen and compared with each other. The main criteria for selecting the best tool were support for Linux, the ability to run in the background, an active community, and modest technical requirements, including the highest processing speed and the most widely used rendering engines. The analysis shows that the Blink layout engine (a fork of WebKit) has the largest worldwide share of use in web browsers (including those on mobile devices); by using it, one therefore obtains exactly the same rendering of a website as the majority of Internet users see.

The assessment section describes the parameters of an ideal tool. Most of them are met by only one tool, PhantomJS. However, PhantomJS cannot return the DOM of all the tested websites. The only tool with correct results is Google Chrome's headless environment. Therefore this tool will be selected for the main implementation. It is written in C++, it fully and flawlessly supports Javascript, and it uses the Blink layout core and the V8 Javascript engine. Additionally, it has the largest community and the widest use among headless browsers. In the speed testing of all the tools, it finished as the fastest, with average operating-memory usage compared to the others.

Figure 3.5: Result of the /usr/bin/time -v command.

4 Detection of phishing and other malicious content

This chapter gives a definition of the malicious content that is found on websites. Furthermore, it contains a description of methods for detecting some types of these attacks and malicious content. My thesis specializes in detecting malicious content based on the content of the tested website and on processing the URL given as input.

4.1 Malicious content on websites

4.1.1 Phishing

Phishing is a way to obtain sensitive data (passwords or credit card details) from users for subsequent abuse by posing as a trustworthy entity. Most often it takes advantage of the gullibility of the user in a way that the user does not notice at first glance; in the worst case, the attacker obtains the user's data without their knowledge.

One of the first phishing tactics was sending fraudulent e-mails. These e-mails contain links to malicious websites that look, for example, like the website of the user's bank. Such a website can contain a form for obtaining the user's information or a link to download a malicious file.

There are four basic ways to spread phishing content to users. The first one is the email-email way, where all communication happens by email. The second one is email-website, where a user receives an email with a link to an infected website. Another way is website-website, when a user finds a link on an uninfected website that leads to an infected one. And the last way is browser-website, when a user installs a dangerous browser extension which redirects the user to phishing websites.

Another technique is phone phishing: a fake call from the user's bank in which the caller requires the user's personal data. The victim can also be induced over the phone to share their information by an SMS.

Nowadays, the number of websites and users is constantly increasing, and with it the creation and spreading of phishing websites, which employ increasingly intelligent tactics to trick users. Attackers


are always trying to find new ways to confuse users with something they have not encountered before, so that they are deceived more easily. The information in this subsection is drawn from [17].

Figure 4.1: How phishing works [18].

Cross-site scripting: Also known as XSS, cross-site scripting is a technique for disrupting a website by exploiting untreated input errors in the scripts of the website. Using this security loophole, the attacker is able to insert custom Javascript code that can retrieve sensitive information, damage the website or display different content to the user [19].
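The core of the problem can be illustrated with a small sketch (the function names are hypothetical, chosen only for illustration): building HTML by concatenating untreated input lets that input become executable markup, while escaping it does not.

```javascript
// Vulnerable pattern: attacker-controlled input is concatenated into HTML,
// so an input such as '<script>...</script>' survives into the page.
function renderGreeting(name) {
  return '<p>Hello ' + name + '</p>';
}

// Treated input: the special characters are escaped before insertion.
function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;')
          .replace(/>/g, '&gt;').replace(/"/g, '&quot;');
}
function renderGreetingSafe(name) {
  return '<p>Hello ' + escapeHtml(name) + '</p>';
}
```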

URL redirects: In this type of phishing, after visiting the website the user is redirected to another website. The


redirection can occur in Javascript, in HTML meta tags or in plug-ins. The redirection does not have to lead to only one address; it can also lead to several different infected websites.
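A detector working on the page source might look for such redirects with simple patterns like these (illustrative heuristics only, not the thesis's actual rules):

```javascript
// Naive source-code patterns for the redirect mechanisms mentioned above:
// Javascript location changes and HTML meta refresh tags.
const redirectPatterns = [
  /window\.location(\.href)?\s*=/i,     // Javascript assignment redirect
  /location\.replace\s*\(/i,            // Javascript replace() redirect
  /<meta[^>]+http-equiv=["']?refresh/i  // HTML meta refresh
];

function looksLikeRedirect(source) {
  return redirectPatterns.some((re) => re.test(source));
}
```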

Link manipulation: Most phishing methods involve a link, found in emails or directly on a website. These links look like links to credible sites, for example the user's bank. The link is altered so that it contains typos compared to its original version, or it uses domains and subdomains that the original version does not. The goal is to confuse the victim, who has no idea that he or she is being redirected to an infected page. Link manipulation can be implemented directly in an HTML document or in Javascript; in the latter case the address of the link changes right after the page is loaded, or after the user clicks on the link.
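One simple signal of link manipulation, sketched below under the assumption that the visible link text itself looks like a URL, is a mismatch between the host shown to the user and the host the link actually targets (a hypothetical helper for illustration only):

```javascript
// If a link's visible text is a URL whose hostname differs from the
// hostname in its real href, the link may be manipulated.
function linkTextMismatch(href, visibleText) {
  try {
    const hrefHost = new URL(href).hostname;
    const textHost = new URL(visibleText).hostname;
    return hrefHost !== textHost;
  } catch (err) {
    return false; // the visible text is not a URL, so this signal does not apply
  }
}
```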

Imitating a trusted entity: The most sophisticated method of phishing is when the attacker poses as a trusted entity with which the victim may have an account. This method can take the form of email communication between the user and the entity, or the victim can come across a website which is almost identical to the original. The user enters his or her private information on the infected site and may then be redirected to the original website without noticing. There are even more refined ways to deceive a user: for example, the victim can open a link which opens the original website in a new browser window together with a fake form for entering the victim's private data; this type of phishing is also called "tabnabbing".

4.1.2 Other malicious content

Malicious software, or malware for short, includes adware, spyware, viruses, trojan horses, ransomware, rootkits and the above-mentioned phishing. Malware is harmful software which enables the attacker to access the victim's computer.

Different types of malware occur on websites mostly as downloadable binary files, which are difficult to detect without downloading the file. However, there are some malware types, described below, which can appear within the source code of the website.


Spyware: One type of spyware which can be encountered on a website is a keylogger. It is software which records the activity of the user's keyboard and sends it back to the attacker's server. The recording can be restricted to scan only input tags whose type parameter is set to the value password. By this method an attacker can easily obtain a victim's passwords or credit card numbers.
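A source-code scanner could flag this keylogger pattern with a naive heuristic such as the following sketch (the regular expressions are assumptions for illustration, not the thesis's actual rules):

```javascript
// Flag source code that both listens for keystrokes and refers to
// password fields, a combination typical of the keylogger described above.
function looksLikeKeylogger(source) {
  const listensForKeys =
    /addEventListener\s*\(\s*['"]key(down|press|up)['"]/.test(source) ||
    /on(keydown|keypress|keyup)\s*=/.test(source);
  const targetsPasswords = /type\s*=\s*["']?password/.test(source);
  return listensForKeys && targetsPasswords;
}
```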

Ransomware: An "extortionate" method which encrypts all files on the victim's storage and forces the user to pay a ransom, usually in Bitcoin because Bitcoin transactions are difficult to trace, in exchange for the password that restores access to the storage. Websites are also used for spreading ransomware: the attacker infects a website server with this type of software [20] and the owner is forced to pay for decryption.

4.2 Detection of phishing

There are several situations in which a website can be infected. In this thesis, the detection of malicious content is viewed from the user's point of view. This means that the key question is what can actually happen to a user when he or she comes across an infected website. From this point of view, detection can be performed on the source code and on the given URL. An infected downloadable file is not detected, nor is it detected (that would be the administrator's point of view) whether the site has been damaged or whether infected files have been uploaded to it. This is the reason why this thesis does not include the detection of trojan horses, viruses and other such types of malware.
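For the URL side of the detection, simple heuristics like the following sketch are commonly used (these particular rules are illustrative assumptions, not the thesis's actual checks):

```javascript
// Classic phishing-URL signals: an IP address instead of a hostname,
// an '@' sign in the address, or an unusual number of subdomains.
function suspiciousUrl(urlString) {
  const u = new URL(urlString);
  const ipHost = /^\d{1,3}(\.\d{1,3}){3}$/.test(u.hostname);
  const hasAt = urlString.includes('@');
  const manySubdomains = u.hostname.split('.').length > 4;
  return ipHost || hasAt || manySubdomains;
}
```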

4.2.1 Cross-site Scripting

There are three types of XSS by which an attacker can inject their Javascript code into a website. All of them involve inserting the attacker's own script into inputs on the site or into the URL.

DOM-based: A local type of XSS in which untreated URL variables can make their way into the code. For this attack to work, the Javascript code must use such a URL variable in a function like document.write(). The harmful code can then be written directly into the address, for example in the following URL:

http://website.com/page.html?variable=<script>alert('XSS')</script>
