A Keyword-Based Combination Approach for Detecting Phishing Webpages
Total Page:16
File Type:pdf, Size:1020Kb
computers & security 84 (2019) 256–275 Available online at www.sciencedirect.com j o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / c o s e A keyword-based combination approach for detecting phishing webpages ∗ Yan Ding a, Nurbol Luktarhan a, , Keqin Li b, Wushour Slamu a a College of Information Science and Engineering, Xinjiang University, Urumqi, China b Department of Computer Science, State University of New York, New Paltz, New York, USA a r t i c l e i n f o a b s t r a c t Article history: In this paper, the Search & Heuristic Rule & Logistic Regression (SHLR) combination detec- Received 16 January 2018 tion method is proposed for detecting the obfuscation techniques commonly used by phish- Accepted 9 November 2018 ing websites and improving the filtering efficiency of legitimate webpages. The method is Available online 23 March 2019 composed of three steps. First, the title tag content of the webpage is input as search key- words to the Baidu search engine, and the webpage is considered legal if the webpage do- Keywords: main matches the domain name of any of the top-10 search results; otherwise, further eval- Heuristic rule uation is performed. Second, if the webpage cannot be identified as legal, then the webpage Machine learning is further examined to determine whether it is a phishing page based on the heuristic rules Phishing defined by the character features. The first two steps can quickly filter webpages to meet Search engine the needs of real-time detection. Finally, a logistic regression classifier is used to assess the URL obfuscation techniques remaining pages to enhance the adaptability and accuracy of the detection method. The experimental results show that the SHLR can filter 61.9% of legitimate webpages and iden- tify 22.9% of phishing webpages based on uniform/universal resource locator (URL) lexical information. The accuracy of the SHLR is 98.9%; thus, its phishing detection performance is high. © 2019 Elsevier Ltd. All rights reserved. users, and the annual economic losses caused by phishing 1. Introduction websites amount to several hundred million dollars. Phishing attacks are cybercrimes that steal user privacy 1.1. Phishing data via social engineering or technical methods. Attackers use e-mail or text messages to send false security warn- In 2017, the “Double 11” shopping carnival experienced a ings or prize information to trick users into clicking on the record high number of sales that amounted to hundreds of phishing webpage and submitting critical personal informa- billions of Chinese e-commerce transactions. The continuous tion. Phishing attacks can lead to a series of serious problems development of e-commerce has played a vital role in pro- for users, such as stolen bank account information, which moting the world economy and satisfying the needs of the can cause huge financial losses. Moreover, the use of iden- population. However, during this period, phishing webpages tity information to substantiate false information will neg- and other online fraud behavior have also shown rapid growth atively affect the users credibility. Phishing webpages and trends. Phishing webpages, a form of cyberattack, seriously their uniform/universal resource locators (URLs) are the pri- threaten the credibility and financial security of Internet mary sources used by phishing attackers to steal private data ∗ Corresponding author. E-mail addresses: [email protected] (Y. Ding), [email protected] (N. Luktarhan), [email protected] (K. Li), [email protected] (W. Slamu). https://doi.org/10.1016/j.cose.2019.03.018 0167-4048/© 2019 Elsevier Ltd. All rights reserved. computers & security 84 (2019) 256–275 257 Fig. 1 – Phishing attack flow chart. from users. According to the 2016 China Internet Security Re- 1. General phishing attacks are attacks that are not targeted port ( Qihoo360, 2017 ) released by the 360 Internet Security to specific subjects. The attacker needs only to send the Center, 1,199,000 new phishing websites were intercepted in phishing URL to the user’s terminal through various means 2016 compared with 2015, representing an increase of 25.5% and does not set a targeted scam plan. Over time, phishing (1,569,000). Over 225 phishing websites are created on average attacks have become easy to conduct because of the evo- every hour. The scale of phishing webpages and their URLs lution of the cybercrime market. An attacker can buy and shows a continuously growing trend. sell tools for phishing kits, and these kits are usually priced With the advent of the big data era, phishing attacks have at only $2–$10. Moreover, buyers do not need specialized become more sophisticated. To increase the probability of expertise to execute the attack. Because obvious flaws are attack success, attackers collect and analyze various data exposed with general phishing attacks, the probability of information that users leave on the Internet, such as their success is relatively low. name, gender and contact information, and even basic infor- 2. Spear phishing attacks are attacks on a specific class of mation of the users’ family members and social relations. As a subjects. Because such users are often the key information result, phishing e-mails appear to be authentic and trustwor- holders of a business or a group, the attack relies on well- thy, making users more likely to fall into the trap of phishing planned social scams. An attacker sends false messages when they receive messages that appear relevant. According to the targets. If the attack does not work within a certain to Symantec’s Internet Security Threat Report ( Symantec, time frame, the attacker will launch other types of exploits 2017 ), business email compromise (BEC) scams using spear against targets with the same background. However, in re- phishing tactics attack more than 400 companies a day and cent years, attackers have begun to change tactics to avoid have stolen more than $3 billion over three years. According exposure after many failures. The number and sophisti- to the Anti-Phishing Working Group (APWG) statistical report cation of spear phishing attacks have increased substan- ( APWG, 2017 ), a total of 255,065 phishing attacks against tially compared with those of general phishing attacks. In different enterprises or entities were detected globally in 2015, 34.9% of all spear phishing e-mails were sent to finan- 2016, representing an increase of 10% from 2015. The effective cial enterprises, and financial industry enterprises had at detection of phishing webpages is a key measure to maintain least an 8.7% likelihood of experiencing at least one attack Internet security. throughout the year ( Symantec, 2016 ). 3. Whale phishing attacks are phishing attacks against corporate CEOs and political leaders. Attacker vectors use infected e-mails, collect and analyze users’ online and 1.2. Phishing categories offline information, and exploit various cyber vulnerabil- ities to develop a detailed attack strategy. Whale phishing As shown in Fig. 1 , phishing attackers need to perform has similarities to spear phishing, with both targeting preparatory work before conducting phishing attacks, includ- specific groups of people; however, the losses caused by ing selecting subjects and methods of dissemination and cre- whale phishing are huge and catastrophic. For example, ating fake webpages. A reasonable classification of phishing in 2016, Walter Stephan, the CEO of the Austrian aircraft attacks can aid the adoption of effective identification mea- parts maker FACC caused a $47 million loss because of a sures ( Aleroud and Zhou, 2017 ). Therefore, this paper classifies phishing e-mail. phishing attacks based on key processes used in the attack. First, phishing attacks can be divided into the following three Building a phishing webpage is a key part of a phishing types according to the target of the attack. attack, and how to disseminate URLs to clients is one issue 258 computers & security 84 (2019) 256–275 that attackers must consider. The methods of disseminating can be divided into four parts, the domain, path, file, phishing webpages can be divided into three categories. and parameters. Many unrelated words are included in phishing URLs; thus, phishing URLs present certain dif- 1. E-mail. Although instant messaging technology is gaining ferences in their characters with respect to legitimate popularity among corporate and personal user groups, e- URLs ( Mcgrath and Gupta, 2008 ). For example, Ma et al. mail still dominates the field of digital communications. (2009, 2011) extracted a number of features, including E-mail represents a simple and affordable dissemination the length of the URL string, the number of domain method to spread phishing pages. In 2016, one out of every labels, and the number of segmented characters to 2596 e-mails was a phishing attempt ( Symantec, 2017 ). detect phishing webpages. 2. Short Message Service (SMS). In 2016, 360 mobile guards 2) Domain. The domain name is a unique identification blocked 17.35 billion spam messages across the China, of that consists of two or more words separated by a which fraud messages ranked second and accounted for period. The domain can be divided into three levels 2.8% of the total spam messages ( Qihoo360, 2017 ). However, based on location, with the word at the far right of because of the development of mobile Internet technology, the domain name representing the top-level domain, users have become reliant on mobile devices, such as mo- which includes country names, such as.cn representing bile phones; therefore, SMS represents a major method of China’s top-level domain, and top-level international delivering phishing URLs. domain names, such as.com (for business) and.org (for 3. Others. Voice phishing is an attack technique that uses non-profit organizations).