Applicability of Machine Learning in Spam and Phishing Email Filtering: Review and Approaches

Tushaar Gangavarapu†,* · Jaidhar C.D.‡ · Bhabesh Chanduka‡

Received: 13 November, 2018 / Revised: 22 January, 2020 / Accepted: 29 January, 2020

Abstract With the influx of technological advancements and the increased simplicity in communication, especially through emails, the upsurge in the volume of Unsolicited Bulk Emails (UBEs) has become a severe threat to global security and economy. Spam emails not only waste users' time, but also consume a lot of network bandwidth, and may include malware as executable files. Phishing emails, in contrast, fraudulently solicit users' personal information to facilitate identity theft and are comparatively more dangerous. Thus, there is an intrinsic need for the development of more robust and dependable UBE filters that facilitate automatic detection of such emails. There are several countermeasures to spam and phishing, including blacklisting and content-based filtering. However, in addition to content-based features, behavior-based features are well-suited to the detection of UBEs. Machine learning models are being extensively used by leading internet service providers like Yahoo, Gmail, and Outlook, to filter and classify UBEs successfully. There are far too many options to consider, owing to the need to facilitate UBE detection and the recent advances in this domain. In this paper, we elucidate how to extract content-based and behavior-based email features, which features are appropriate for detecting UBEs, and how to select the most discriminating feature set. Furthermore, to accurately handle the menace of UBEs, we present an exhaustive comparative study using several state-of-the-art machine learning algorithms.
Our proposed models resulted in an overall accuracy of 99% in the classification of UBEs. The text is accompanied by snippets of Python code to enable the reader to implement the approaches elucidated in this paper.

Keywords Feature Engineering · Machine Learning · Phishing · Python · Spam

This is a post-peer-review, pre-copyedit version of an article published in Artificial Intelligence Review. The final authenticated version is available online at: https://doi.org/10.1007/s10462-020-09814-9.

*Corresponding author. (T. Gangavarapu completed most of this work at the National Institute of Technology Karnataka, India.)
†Automated Quality Assistance (AQuA) Machine Learning Research, Content Experience and Quality Algorithms, Amazon.com, Inc., India. E-mail: [email protected] (T. Gangavarapu)
‡Department of Information Technology, National Institute of Technology Karnataka, Surathkal, Mangaluru, 575025, India.

1 Introduction

Digital products and services increasingly mediate human activities. With the advent of email communication, unsolicited emails have, in recent years, become a serious threat to global security and economy [11]. As a result of the ease of communication via emails, a vast number of issues involving the exploitation of technology to elicit personal and sensitive information have emerged. Identity theft is one of the most profitable crimes, and felons often lure unsuspecting online users into revealing confidential information such as social security numbers, account numbers, and passwords. Unsolicited emails disguised as coming from legitimate and reputable sources often attract innocent users to fraudulent sites and persuade them to disclose their sensitive information. As per the report by Kaspersky Lab, in the first quarter of 2019, such unwanted emails were responsible for 55.97% of traffic (0.07% more than in the fourth quarter of 2018).
Unsolicited Bulk Emails (UBEs) can be broadly categorized into two distinct yet related categories: spam and phishing. Spam emails are essentially UBEs that are sent without users' consent, primarily for marketing purposes such as selling unlicensed medicines, illegal products, and pornography [86]. The growth of spam traffic is a worrisome issue as such emails consume a lot of network bandwidth, waste memory and time, and cause financial loss. Phishing emails, on the other hand, are a much more serious threat that involves stealing individuals' confidential information such as bank details, social security numbers, and passwords. Most phishing attacks are directed at financial institutions (e.g., banks); however, attacks against government institutions, although not as targeted, cannot be overlooked [11]. To understand the impact of phishing, consider pharming, a variant of phishing, where the attackers misdirect users to fraudulent sites through domain name server hijacking [2].

The effect of spam and phishing on valid users is multi-fold:

- Generally, UBEs promote products and services with little real value, pornography, get-rich-quick schemes, unlicensed medicines, dicey legal services, and potentially illegal offers and products.
- UBEs often hijack real users' identities to send spam to other users (e.g., business email compromise scams such as email spoofing and domain spoofing, which amounted to almost $1.3 billion in 2018 (20,373 victims), twice as much as in 2017 (15,690 victims) [1]).
- Phishing, in particular, involves identity theft in the form of financial identity theft, criminal identity theft, identity cloning, or business/commercial identity theft.
- Mailing efficiency and recipients' productivity are drastically affected by UBEs. A study by the McKinsey Global Institute revealed that an average person spends 28% of the workweek (≈ 650 hours a year) reading and responding to emails [28].
Additionally, research on SaneBox's internal data revealed that, on average, only 38% of emails are relevant and important [28], equivalent to ≈ 11% of the workweek (38% of the 28% spent on email). Furthermore, a study by the Danwood Group found that it takes an average of 64 seconds to recover from an email interruption and return to work at the rate before the interruption [28], adversely affecting the recipients' productivity, especially in the case of irrelevant UBEs.

Based on the Kaspersky Lab report, in 2015, the UBE volume fell by 50% for the first time since 2003 (≈ three to six million). This decline was attributed to the reduction (in billions) of major botnets responsible for spam and phishing. However, by the end of 2015, the UBE volume escalated. Furthermore, the Kaspersky spam report revealed an increase in the presence of pernicious email attachments (e.g., malicious macros, malware, ransomware, and JavaScript) in spam email messages. By the end of March 2016, the UBE volume (≈ 22,890,956) had quadrupled in comparison with that witnessed in 2015. In 2017, the Internet Security Threat Report (ISTR) [84] estimated that the volume of spam emails had skyrocketed to an average of 55% (≈ 2% more than that in 2015 (52.7%) and 2016 (53.4%)). Clearly, spam and phishing rates are rapidly increasing. The overall phishing rate in 2017, according to the ISTR [84], was nearly one in every 2,995 emails, while the number of Uniform Resource Locators (URLs) related to phishing rose by 182.6%, which accounted for 5.8% (one in every 224) of all malicious URLs.

Over the years, extensive research in this domain revealed several plausible countermeasures to detect UBEs. Approaches such as secure email authentication result in high administrative overhead and hence are not commonly used. Machine learning and knowledge engineering are two commonly used approaches in filtering UBEs.
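The difference between the two approaches can be sketched in a few lines of Python. The following is only a minimal illustration under stated assumptions, not one of the models evaluated in this paper: the keyword list, the toy training emails, the "ube"/"legit" labels, and all function names are hypothetical, and multinomial Naive Bayes with add-one smoothing is used merely as a simple stand-in for a learning algorithm. A knowledge engineering filter applies rules written by hand, whereas a machine learning filter estimates its decision rule from labeled examples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Knowledge engineering: a fixed, hand-written rule set (hypothetical keywords).
BLOCKED_WORDS = {"lottery", "winner", "viagra"}


def rule_based_is_ube(text):
    """Flag an email if it contains any keyword from the predefined rule set."""
    return any(token in BLOCKED_WORDS for token in tokenize(text))


# Machine learning: word statistics are estimated from labeled training emails.
def train_naive_bayes(examples):
    """Collect per-class word and document counts from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for token in tokenize(text):
            if token in vocab:  # skip words never seen during training
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up training data; a real filter needs a large labeled corpus.
TRAINING_EMAILS = [
    ("verify your account password now", "ube"),
    ("urgent: confirm your bank account details", "ube"),
    ("cheap medicines, limited offer", "ube"),
    ("meeting agenda for monday", "legit"),
    ("please review the attached project report", "legit"),
    ("lunch on friday?", "legit"),
]

model = train_naive_bayes(TRAINING_EMAILS)
test_email = "please confirm your account password"

print("rule-based flags it:", rule_based_is_ube(test_email))
print("learned model says: ", predict(model, test_email))
```

In this toy setting, the test email contains none of the hand-picked keywords, so the fixed rules miss it, while the learned model picks up on words that co-occurred with UBEs in its training data. Covering such a case with rules alone would require someone to notice the gap and extend the keyword list by hand.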
In knowledge engineering, UBEs are classified using a set of predefined rules. However, knowledge engineering approaches require constant rule updates to account for the dynamic nature of UBE attacks and often suffer from scalability issues. In machine learning approaches, the algorithm itself learns the classification rules from a training set. Determining the email type by analyzing the email content and structure has emerged as a result of the success of AI-assisted approaches in UBE classification. This area of research is actively being developed to account for the dynamic nature of UBE attacks.

Past works explore several informative features, and many machine learning algorithms have been developed and utilized to classify incoming mail into junk and non-junk categories [86,19,85,58,27,79]. Many leading internet service providers, including Yahoo Mail and Gmail, employ a combination of machine learning algorithms, such as neural networks, to handle the threat posed by UBEs effectively. Since machine learning models have the capacity to adapt to varying conditions, they not only filter junk emails using predefined rules but also generate new rules to adapt to the dynamic nature of UBE attacks. Despite the success, adaptability, and predictability of machine learning models, preprocessing, including feature extraction and selection, plays a critical role in the efficacy of the underlying UBE classification system [87,57]. Thus, there is a need to determine the most discriminative and informative feature subset that facilitates the classification of UBEs with a higher degree of confidence.

Due to the vast heterogeneity in the existing literature, there is no consensus on which features form the most informative and discriminative feature set. Moreover, to the best of our knowledge, only a
