AN INTROSPECTIVE BEHAVIOR BASED METHODOLOGY TO MITIGATE E-MAIL BASED THREATS

BY MADHUSUDHANAN CHANDRASEKARAN

THESIS

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering in the Graduate School of the State University of New York at Buffalo, 2009

Buffalo, New York

© Copyright by Madhusudhanan Chandrasekaran, 2009. All Rights Reserved.

To my family.

Abstract

E-mail is touted as the backbone of present day communication. Despite its convenience and importance, the existing e-mail infrastructure is not devoid of problems. The underlying e-mail protocols operate on the assumption that users would not abuse the privilege of sending messages to each other. This weakness in design is consistently taken advantage of by attackers to carry out social engineering and security exploits on day-to-day e-mail users. As a result, three prominent e-mail based threats have surfaced, viz. (i) spam; (ii) phishing; and (iii) information leak. While spam e-mail classification has received a lot of attention in recent years, the other two threats still loom large. The main goal of this dissertation is to design and develop efficient behavior based classification techniques that help to address each of these threats in a piecemeal fashion.

The first part of this dissertation attempts to tackle the problem of detecting phishing e-mails before they reach users’ inboxes. To begin with, the shortcomings of existing spam filters toward classifying phishing e-mails are highlighted. To overcome them, a customizable and usable spam filter (CUSP) that detects phishing e-mails from the absence of personalized user information contained in them is proposed. However, as solely relying on the presence of personalized information as the criterion to detect phishing e-mails is not entirely foolproof, a novel machine learning based classifier that separates phishing e-mails based on their underlying semantic behavior is proposed. Experimentation on real world phishing and financial e-mail datasets demonstrates that the proposed methodology can detect phishing e-mails with over 90% accuracy while keeping the false positive rate at a minimum. Also, the feasibility of generating context-sensitive warnings that better educate users about the ill-effects of phishing attacks is explored.
Classification techniques that operate on features confined to the phishing e-mails’ body can be thwarted using simple obfuscation techniques, which substitute spurious content appearing in them with seemingly innocuous characters or images. To address such scenarios, the second part of this dissertation takes the classification process a step further to analyze the behavior and characteristics of Websites referred to by URLs contained in e-mails. Specifically, a challenge-response based technique called PHONEY is proposed to detect phishing Websites based on their inability to tell fake and genuine inputs apart. Experimental results based on evaluation on both “live” and “synthesized” phishing Websites reveal that PHONEY can detect almost all of the e-mails that link to live phishing Websites with zero false positives and minimal computation overhead. In a similar vein, this dissertation proposes a novel technique to identify spam e-mails by analyzing the content of the linked-to Websites. A combination of textual and structural features extracted from the linked-to Websites is supplied as input to five machine learning algorithms employed for the purpose of classification. Testing on live spam feeds reveals that the proposed technique can detect spam e-mails with an over 95% detection rate, thereby exhibiting better performance than two popular open source anti-spam filters.

Information leaks pose a significant risk to users’ privacy. An information leak could reveal users’ browsing characteristics or sensitive material contained in their e-mail inboxes to attackers, allowing them to launch more targeted social engineering attacks (e.g., spear phishing attacks). The third part of this dissertation focuses on addressing these two facets of information leaks, i.e., those triggered by spyware and those caused by users themselves, by detailing the limitations of state-of-the-art detection techniques.
In order to bring out the deficiencies in existing anti-spyware techniques, first, a new class of intelligent spyware that efficiently blends in with user activities to evade detection is proposed. As a defensive countermeasure, this dissertation proposes a novel randomized honeytoken based methodology that can separate normal and spyware activities with near perfect accuracy. Similarly, to detect inadvertent information leaks caused by users sending misdirected e-mails to unintended recipient(s), this dissertation advances the existing bag-of-words based outlier detection techniques by using a set of stylometric and linguistic features that better encapsulate the previously exchanged e-mails between the sender and the recipient. Experimentation on a real world e-mail corpus shows that the proposed technique detects over 78% of synthesized information leaks, outperforming other existing techniques.

Another important point to be considered while devising specialized filters to address each of the e-mail based threats is the need to make them interoperable. For example, an e-mail supposedly sent from a financial domain, but having a URL referring to a domain blacklisted for spam, is very likely a phishing e-mail. Identifying sources of attacks helps in developing attack agnostic solutions that block all sensitive communication from and to misbehaving nodes. From this perspective, this dissertation explores the feasibility of building a holistic framework that not only operates in conjunction with intrusion detection systems (IDS) to block incoming and outgoing traffic from and to misbehaving nodes, but also safeguards the underlying e-mail infrastructure from zero-day attacks.

Acknowledgments

My advisor, Dr. Shambhu Upadhyaya, deserves many thanks. Under his tutelage, I could transform my otherwise loose ideas into something as concrete as this dissertation. From the onset, he actively involved me in various research projects and meetings, which helped me in building and strengthening my academic outlook. It has been a great pleasure to have worked under him. I would also like to thank my committee members, Dr. Hung Ngo and Dr. Sheng Zhong, for their support and guidance. Dr. Ngo was also a committee member for my Master's thesis. He is a great teacher, and is ever willing to embrace new ideas and provide constructive criticism. I am indebted to him for the advice he has given me on both the professional and personal fronts. Taking a seminar and an independent study under Dr. Zhong was a fruitful and invigorating experience. Every discussion with him was thorough and in-depth, always imparting something in the end. I would like to thank my mentor at Google, Dr. Arash Baratloo, for providing me insight into transforming research grade ideas into complete, tangible products. Despite his hectic schedule, he took time to read my papers and sit through my presentations to provide invaluable suggestions. In a similar vein, I would like to thank Dr. Richard Wasserman and Ms. Maureen (Cheshire) Dantzler for giving me an opportunity to work in the Transaction Risk Management (TRMS) department at Amazon Inc. It is where I managed to get a “sneak view” of the anti-fraud life cycle, from detection to take-down of fraudulent sellers, in an industry setting. My stay at Buffalo has been a pleasant and productive experience. This, however, would not have been possible without the fun and frolic that I had with my labmates and housemates during the last several years. I thoroughly enjoyed the engaging conversations and eat-out breaks I had with my friends, including Aarthie, Anusha, Ashish, Duc, Mohit, Murtuza, Ram, Sunu, Suranjan and

Vinod. I would also like to thank Vidyaraman (Video) for agreeing to be my agreeable roommate both at home and at school. On a personal note, I would like to thank my wife Anusuya. Her unwavering love and enduring support helped me paddle my way through murky situations. I hope that I have not dragged her too far in the process. I also thank her for helping me proofread this dissertation, without which it would not be in its current form. I would like to thank my grandparents, parents and brother for believing in me, even when I did not. My mother made me realize that there is much more to life than a “bookish” degree. I pray that the trust my family has bestowed upon me never wanes. Thanks to my cousins Vijay and Aarthi for being with me during the final stages of this dissertation. Last but not least, I would like to thank all the faculty, staff, and students of the Computer Science and Engineering department with whom I interacted at some point in time during my stay here.

Table of Contents

Abstract

Acknowledgments

Chapter 1 Introduction
  1.1 E-mail Communication
  1.2 What Makes Secure Communication Hard?
    1.2.1 Authentication
    1.2.2 Integrity
    1.2.3 Non-repudiation
    1.2.4 Problems with Existing E-mail Security Enforcement Schemes
  1.3 E-mail-based Security Threats
  1.4 Dissertation Scope
  1.5 Original Contributions
  1.6 Dissertation Outline

Chapter 2 Background and Related Work
  2.1 Introduction
    2.1.1 Chapter Organization
  2.2 Phishing E-mail Detection
    2.2.1 Discussion
  2.3 Validating E-mails Through Referral Webpages Analysis
    2.3.1 Discussion
  2.4 Preventing Information Leak in E-mails
    2.4.1 Discussion
  2.5 Summary

Chapter 3 Detection of Phishing E-mails Based on Structural and Linguistic Features
  3.1 Introduction
    3.1.1 Contributions
    3.1.2 Chapter Organization
  3.2 Commonly Adopted Phishing Attack Vectors
  3.3 Why Existing Anti-spam Based Approaches Fail? A Case Study
    3.3.1 Performance of Current Anti-spam Filters in Tackling Phishing Attacks
  3.4 CUSP: Customizable and Usable Spam Filters to Detect Phishing E-Mails
    3.4.1 Overview of CUSP
  3.5 Anatomy of Phishing E-Mails
  3.6 Using Machine Learning Algorithms to Classify Phishing E-mails
    3.6.1 Features Used in Classifying Phishing E-Mails
    3.6.2 Detection Algorithms
    3.6.3 Experimentation
    3.6.4 Experimental Setup
  3.7 Context Sensitive Warning Generation
    3.7.1 Experiences with Context-Sensitive Warning Generation
  3.8 Summary

Chapter 4 Detecting Spurious E-mails through Linked-to Website Analysis
  4.1 Introduction
    4.1.1 Contributions
    4.1.2 Chapter Organization
  4.2 Overview of PHONEY
    4.2.1 Background
    4.2.2 Scope of Detection
    4.2.3 Working of PHONEY
    4.2.4 Architecture Details
    4.2.5 Design Criteria
    4.2.6 Implementation
    4.2.7 Evaluation of PHONEY
    4.2.8 Limitations of PHONEY
    4.2.9 Potential Improvements to PHONEY
  4.3 Classifying Spam E-mails Based on Linked-to Webpages Analysis
    4.3.1 Approach Overview
    4.3.2 Scope of Detection
    4.3.3 Features Used in Detection
    4.3.4 Classification Algorithms
    4.3.5 Experimentation
  4.4 Summary

Chapter 5 Detecting Information Leak Based E-mail Threats
  5.1 Introduction
    5.1.1 Contributions
    5.1.2 Chapter Organization
  5.2 SpyCon: Emulating User Activities to Detect Evasive Spyware
    5.2.1 Background
    5.2.2 Problem Statement
    5.2.3 Design of SpyZen
    5.2.4 Design of SpyCon
    5.2.5 Experimentation
    5.2.6 Implementation Issues
  5.3 Preventing Information Leak in E-mails Using Structural and Stylometric Features
    5.3.1 Background
    5.3.2 Scope of Detection
    5.3.3 Features Used in Classification
    5.3.4 Experimentation
  5.4 Summary

Chapter 6 Automated Vulnerability Aggregation and Response against Zero-day Attacks
  6.1 Introduction
    6.1.1 Motivation
    6.1.2 Contributions
    6.1.3 Chapter Organization
  6.2 Design Goals
  6.3 Architecture Overview
  6.4 Experimental Evaluation
    6.4.1 Feasibility to Generate NIDS Rules
    6.4.2 Performance Overhead
  6.5 Summary

Chapter 7 Conclusions and Future Work
  7.1 Concluding Remarks
  7.2 Future Directions

References

Vita

List of Figures

1.1 Is the e-mail by an authorized sender?
1.2 An illustration of different authentication schemes used to identify unauthorized senders
1.3 Is the e-mail forged or modified in transit?
1.4 An illustration of e-mail encryption to prevent illegitimate message modification during transit
1.5 Can an e-mail be linked to a sender so that he cannot repudiate it?
1.6 Enforcement of non-repudiation policy in e-mail through digital signature or audit database
1.7 E-mail-based threats that can be mitigated by incorporating authentication, integrity and non-repudiation properties in e-mail infrastructure

3.1 Steps involved in e-mail based phishing attack
3.2 Breakup summary of various financial institutions contributing towards the 577 hard ham dataset
3.3 False positive rates of popular anti-spam filters while detecting phishing e-mails. The filters were tested against 577 hard bank e-mails
3.4 CUSP requesting the user to enter private data corresponding to the subscribed institution
3.5 A modal box interrupting the user to enter the required user specific data before opening the e-mail
3.6 Anatomy of a spoofed PayPal phishing e-mail illustrating various intrinsic features used to deceive users
3.7 A 2-dimensional Binary Classification with Linear SVM
3.8 An example decision tree to classify phishing e-mails
3.9 Preprocessing and sanitization of e-mail dataset before classification
3.10 Effect of each individual feature in total fraction of e-mails and phishing e-mails
3.11 Relationship between the frame Taking_Time and other abstract frames representing semantic roles that specify time boundedness in FrameNet
3.12 Context-sensitive warning explaining the tone of the phishing e-mail to the user

4.1 An example illustrating inefficiency of OCRs to parse out text in image based spurious e-mails. The output of GNU OCR (gocr) when applied on an image based HSBC phishing e-mail is shown here.
4.2 An illustration of phishing Website fabricated to attack PayPal users. The displayed URL is clearly different from that of PayPal Website.
4.3 Defense-centric view: Who is the real sender - legitimate or adversary?
4.4 Offense-centric view: Who is the real respondent - the real victim or a PHONEY?
4.5 Block diagram of PHONEY architecture
4.6 Phantom users supplying fake login information to the spoofed Website
4.7 The detection engine flags the Website malicious
4.8 Phantom user instantiation overhead incurred while testing PHONEY against live phishing Websites
4.9 Response analysis overhead incurred while testing PHONEY against live phishing Websites
4.10 An example showing KNN classification with k = 3. The red denotes the test samples after the assignment of correct class labels. The symbols and + denote the two class labels used in classification
4.11 A neural network with feedforward links and three layers

5.1 Working of honeytoken based spyware detection mechanisms
5.2 Gmail client showing the list of suggested recipients during message composition

6.1 Architecture Block Diagram of AEGIS
6.2 An Example Vulnerability Report in EDORS Format
6.3 Performance overhead incurred by SNORT to generate IDS rules

Chapter 1

Introduction

“Bogus e-mails that try to trick customers into giving out personal information are the hottest, and most troubling, new scam on the Internet.”

− Jana Monroe, Assistant Director, Cyber Division of FBI

1.1 E-mail Communication

Electronic mail is the backbone of present day communication. E-mail† has come a long way since its modest beginning as a medium for users of time-shared machines to communicate; it in fact predates the Internet. With the advent of the Internet, e-mail’s popularity has also grown tremendously. It has become possible for Internet users to communicate with each other using e-mail, independent of service providers or e-mail clients. This in fact has contributed to e-mail being touted as one of the “killer apps” of the computing era. A recent study anticipates that the number of e-mail users will surge from 1.5 billion today to over 2 billion in 2011 [100].

Most modern e-mail systems adopt a store-and-forward model where the task of relaying, forwarding, receiving, and storing e-mails is delegated to e-mail servers. Users connect to the servers only for a short duration to either send or receive messages. Historically, there were many disparate and incompatible protocols governing message transmission. In order to foster massive adoption and interoperability, these protocols were standardized into the Simple Mail Transfer

†the terms eMail, e-mail and email are used interchangeably to refer to electronic mail

Protocol (SMTP) defined in RFC 821 [97] and RFC 822 [39] in 1982. To send an e-mail, a user’s client, also known as a Mail User Agent (MUA), relays the message to the service provider’s (or local) SMTP server, also known as the relaying Mail Transfer Agent (MTA)‡. The relaying MTA then obtains each recipient domain’s SMTP server address from Domain Name System (DNS) mail exchange (MX) records. The message transfer between the relaying MTA and the recipient’s SMTP server, also known as the receiving MTA, occurs through a series of queries and responses known as SMTP commands. Depending on the relaying SMTP configuration, the message may be routed through a series of intermediate MTAs. SMTP is a push protocol; the message is delivered to the receiving MTA, and not to recipients’ e-mail clients. Recipients can fetch messages by either directly accessing the server or remotely from their clients using the Internet Message Access Protocol (IMAP) or the Post Office Protocol (POP).

Since SMTP was primarily designed for text-based communication, it requires that messages are composed using 7-bit ASCII characters. Unfortunately, such a restriction prohibits embedding of certain international language characters or attachments with 8-bit binary content such as computer programs, media files, images, documents, etc., in the message body. To alleviate this limitation, Multipurpose Internet Mail Extensions (MIME) was proposed to extend the format of e-mail to support 8-bit binary content by encoding it as 7-bit ASCII. The flexibility brought about by MIME has enabled users to employ e-mail as an envelope to transfer attachments in varied formats. Consequently, sensitive and time critical information such as bank account statements, e-commerce receipts, medical records, tax documents, and other private documents are often exchanged via e-mail.

Despite its convenience and importance, the existing e-mail infrastructure is not devoid of problems.
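As an aside, the MIME mechanism described above can be sketched with Python's standard email package. This is purely an illustrative example and not code from the dissertation; the addresses and filename are made up. It builds a message carrying an arbitrary 8-bit attachment and confirms that the serialized wire form is pure 7-bit ASCII.

```python
# Illustrative sketch: how MIME encodes 8-bit binary content as 7-bit
# ASCII so it can travel over classic SMTP. Addresses are hypothetical.
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

msg = MIMEMultipart()
msg["From"] = "alice@example.com"       # example addresses, not real accounts
msg["To"] = "bob@example.com"
msg["Subject"] = "Monthly statement"
msg.attach(MIMEText("Please find the statement attached.", "plain"))

# An arbitrary 8-bit binary payload standing in for, e.g., a PDF statement.
binary_payload = bytes(range(256))
attachment = MIMEApplication(binary_payload, _subtype="octet-stream")
attachment.add_header("Content-Disposition", "attachment",
                      filename="statement.pdf")
msg.attach(attachment)

wire_form = msg.as_string()
print(wire_form.isascii())      # True: the whole message is 7-bit ASCII
print("base64" in wire_form)    # True: the payload was base64-encoded
```

Running the sketch shows that the binary payload was base64-encoded, which is exactly the trick MIME uses to squeeze 8-bit content through 7-bit SMTP.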
SMTP is based on the assumption that users would not abuse the privilege of sending messages to each other. As a result, SMTP does not curb unauthorized parties from sending messages. This weakness in design is consistently exploited by spammers to “pump-and-dump” large volumes of unsolicited e-mails into unwelcoming recipients’ inboxes. Even though spam e-mails are regarded as a nuisance, they do not pose a threat to users’ security and privacy. On the other hand, lack

‡it is possible for a client to be configured as both MUA and MTA

of proper verification mechanisms in SMTP has made it surprisingly easy for fraudsters to forge e-mails to launch vicious social engineering attacks that steal users’ private information and identity. E-mail-based social engineering attacks can also lure recipients into executing potentially harmful malware such as worms and viruses that are embedded as attachments. The situation is further aggravated because malicious senders can disown e-mails sent from their accounts at a later point in time, thus evading the consequences of their actions. Since these limitations are ingrained in the existing e-mail infrastructure, fixing them is not straightforward. Essentially, it requires addressing the following three fundamental issues that form the basis of secure communication:

• How to prevent unauthorized senders from sending e-mail messages? In other words, this problem boils down to identifying spurious e-mails that inundate users’ inboxes.

• How to distinguish between legitimate and spoofed e-mails that purport to originate from the same domain?

• How to make malicious senders accountable so that they cannot repudiate e-mails sent from their accounts?

A large number of mechanisms that attempt to address the aforementioned issues have been proposed to the Internet Engineering Task Force (IETF) by security practitioners and industry personnel. The next section highlights the feasibility of these solutions by listing their pros and cons.

1.2 What Makes Secure Communication Hard?

In order for parties to exchange messages in a secure fashion, the underlying communication medium should possess the following attributes:

1.2.1 Authentication

E-mail authentication is the process by which recipients can accurately link an e-mail back to its sender (or sender’s MTA). If an e-mail is determined to originate from an unauthorized domain

or account, recipients could then configure their receiving MTAs or e-mail clients to block all e-mails sent from them, thus thwarting e-mail phishing and spam. The lack of authentication mechanisms in SMTP, however, is considered the Achilles' heel of the current e-mail system. The decision to identify unauthorized senders is typically thrust on the recipient, as shown in Figure 1.1. Several extensions to SMTP have been proposed for the purpose of incorporating authentication mechanisms into it. Secure SMTP [59] is one such effort that requires senders to be authenticated by the relaying SMTP server, and supports Transport Layer Security (TLS) for secure transmission of messages over the Internet. Other extensions (Sender Policy Framework (SPF) [124], SenderID [78], and Certified Server Validation (CSV) [38]) have been proposed that require the senders to publish beforehand, as part of their domain's DNS record, the IP addresses of the machines that would be used to send out e-mails. Subsequently, a recipient can reject messages from unauthorized machines that are not listed in the sending domain's DNS records. DomainKeys Identified Mail (DKIM) [9] is another protocol that uses public key cryptography to digitally sign and encrypt the message header and body, thus providing both authentication and end-to-end message integrity. DKIM also alleviates the key exchange problem by making senders publish their public key as part of their domain's DNS records. Other third-party public key cryptography schemes such as S/MIME [102], PGP [135], GnuPG [115] and identity based encryption protocols [50] exist that provide end-to-end encryption without any modifications to the underlying SMTP protocol. The main difference between DKIM and these protocols is that, unlike DKIM, these protocols require the recipients to be bootstrapped with the senders' public keys. Figure 1.2 summarizes these schemes by illustrating the point of operation at which authentication takes place.
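To make the DNS-based schemes concrete, the following Python sketch caricatures the SPF idea: a receiving MTA asks whether the connecting client's IP address is among the addresses that the purported sender domain has published. This is a simplified illustration, not a full SPF implementation; the domain-to-IP table stands in for a real DNS TXT lookup, and all names and addresses are invented.

```python
# Simplified illustration of the SPF idea: a receiving MTA checks whether
# the connecting client's IP is among the addresses the sender's domain
# has published in DNS. The DNS lookup is stubbed out with a dictionary.
PUBLISHED_SENDERS = {            # stand-in for "v=spf1 ip4:..." TXT records
    "example.com": {"192.0.2.10", "192.0.2.11"},
    "bank.example": {"198.51.100.5"},
}

def spf_check(mail_from_domain: str, client_ip: str) -> str:
    """Return 'pass', 'fail', or 'none' (no policy published)."""
    allowed = PUBLISHED_SENDERS.get(mail_from_domain)
    if allowed is None:
        return "none"
    return "pass" if client_ip in allowed else "fail"

print(spf_check("example.com", "192.0.2.10"))    # pass
print(spf_check("example.com", "203.0.113.7"))   # fail -> likely forged
print(spf_check("other.example", "192.0.2.10"))  # none
```

Note how the scheme validates the sending machine, not the individual account, which is precisely the limitation revisited under "Security issues" below.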

1.2.2 Integrity

As opposed to authentication, integrity establishes the validity of an e-mail message rather than its sender's legitimacy. To achieve this, integrity verification schemes need to safeguard against adversaries who modify all or part of a message in transit. In other words, these schemes should provide end-to-end protection so that any modifications to e-mail messages by malicious intermediate MTAs can immediately be identified. Apart from modifying a legitimate message, a fraudster could also choose to send a spoofed e-mail impersonating it, thus violating its integrity. For the sake of better understanding, these two possible scenarios are pictorially represented in Figure 1.3. The public key cryptography schemes used in e-mail authentication can also be employed for preserving messages' integrity, as shown in Figure 1.4. Essentially, these schemes support digital signing of messages so that recipients can determine whether they were modified in transit. To digitally sign an e-mail, a sender first computes its hash value and then encrypts it with his private key. The sender's public key (certificate) along with the encrypted hash value forms a digital signature that is included with the e-mail. Upon receiving a signed e-mail, recipients decrypt the encrypted hash value using the sender's public key and compare it against the hash value of the message computed by them. If the two hash values do not match, the message is deemed untrustworthy and usually rejected. Among all cryptography schemes, S/MIME and DKIM are the simplest to use because they do not require installation of any additional software like OpenPGP, GnuPG, and identity based encryption protocols. Moreover, S/MIME comes built into most popular e-mail clients such as Apple Mail, Outlook Express, and Mozilla Thunderbird [69]. However, one disadvantage with these schemes is that an attacker can trigger denial-of-service (DoS) attacks by modifying e-mail messages and their digital signatures so that they are outright rejected on the recipients' side. Instead of encrypting the content of a message, transport layer security (TLS) protocols have been proposed to encrypt the underlying message flow, thereby preventing such DoS attacks.

Figure 1.1: Is the e-mail by an authorized sender?
Figure 1.2: An illustration of different authentication schemes used to identify unauthorized senders
Figure 1.3: Is the e-mail forged or modified in transit?
Figure 1.4: An illustration of e-mail encryption to prevent illegitimate message modification during transit
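The signing procedure described above can be made concrete with a toy sketch. The RSA-style key below is built from deliberately tiny primes and is utterly insecure; it exists only to show the hash-then-encrypt mechanics, and is my own illustration rather than code from the dissertation.

```python
# Toy illustration of digital signing (NOT secure: the modulus is tiny and
# for demonstration only). The sender hashes the message and "encrypts" the
# hash with the private exponent; the recipient reverses this with the
# public exponent and compares hashes.
import hashlib

# Demonstration-only key pair built from two small primes.
p, q = 1_000_003, 1_000_033
n, phi = p * q, (p - 1) * (q - 1)
e = 65537                      # public exponent
d = pow(e, -1, phi)            # private exponent (requires Python 3.8+)

def digest(message: bytes) -> int:
    # Hash of the message, reduced modulo n so it fits the toy key.
    return int(hashlib.sha256(message).hexdigest(), 16) % n

def sign(message: bytes) -> int:
    return pow(digest(message), d, n)      # encrypt hash with private key

def verify(message: bytes, signature: int) -> bool:
    return pow(signature, e, n) == digest(message)

msg = b"Transfer $100 to account 42"
sig = sign(msg)
print(verify(msg, sig))                              # True: message intact
print(verify(b"Transfer $900 to account 13", sig))   # False: tampered
```

Any in-transit modification changes the message hash, so the decrypted signature no longer matches and the recipient rejects the message, exactly the check described in the text.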

1.2.3 Non-repudiation

Non-repudiation means that the sender of an e-mail message cannot repudiate it at a later point in time, as shown in Figure 1.5. Therefore, non-repudiation schemes dissuade malicious senders from sending out potentially harmful e-mails by making them accountable for their actions. In theory, with a combination of authentication and non-repudiation schemes, it might be possible to link malicious e-mails back to their source, and take legal action against their senders. There exist a handful of closed systems such as AOL, Hotmail, etc., also known as Walled Gardens, that can provide non-repudiation assurances because they use passwords to authenticate message senders and provide reasonable security for message content [51]. However, as the e-mail infrastructure is primarily a decentralized system, there is no single authority present that can track user actions across all service providers. In the current setup, the only way to hold senders responsible for the e-mails sent by them is if the e-mails are digitally signed. Since the private keys used in digital signing are known only to their senders, they, along with the senders' public keys, can be used to determine sender identity, as shown in Figure 1.6. Even with digital signatures, non-repudiation in e-mail systems is just a fallacy. Both malicious and legitimate senders can claim that their keys or passwords were stolen and disown an e-mail message sent from their accounts.

Figure 1.5: Can an e-mail be linked to a sender so that he cannot repudiate it?
Figure 1.6: Enforcement of non-repudiation policy in e-mail through digital signature or audit database
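The audit-database idea of Figure 1.6 can be caricatured as an append-only, hash-chained log kept by a provider. The sketch below is my own illustration of that idea, not a scheme from the dissertation: each sent message is recorded with its sender, and chaining the entries makes after-the-fact tampering with the log evident.

```python
# Hedged sketch of an audit database for non-repudiation: the provider
# appends a record per sent message, chaining record hashes so that later
# tampering with the log (to help a sender repudiate) is detectable.
import hashlib
import json

audit_log = []   # each entry: {"sender", "msg_hash", "prev", "entry_hash"}

def record_send(sender: str, message: bytes) -> dict:
    prev = audit_log[-1]["entry_hash"] if audit_log else "0" * 64
    body = {"sender": sender,
            "msg_hash": hashlib.sha256(message).hexdigest(),
            "prev": prev}
    body["entry_hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    audit_log.append(body)
    return body

def log_intact() -> bool:
    prev = "0" * 64
    for entry in audit_log:
        body = {k: entry[k] for k in ("sender", "msg_hash", "prev")}
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["entry_hash"] != recomputed:
            return False
        prev = entry["entry_hash"]
    return True

record_send("alice@example.com", b"hello")
record_send("eve@example.com", b"malicious payload")
print(log_intact())                            # True
audit_log[1]["sender"] = "mallory@example.com" # Eve tries to repudiate
print(log_intact())                            # False: tampering is evident
```

Of course, as the text notes, such a log only works within a single provider's Walled Garden; the decentralized e-mail infrastructure has no authority positioned to maintain it globally.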

1.2.4 Problems with Existing E-mail Security Enforcement Schemes

Existing e-mail infrastructure was initially designed to support only a small fraction of known users. By default, it does not contain any inbuilt security mechanisms to guarantee the following three security properties, viz. (i) authentication; (ii) integrity; and (iii) non-repudiation, that are prerequisites for secure communication. For this purpose, the aforementioned public key cryptography and DNS based schemes were proposed to operate in conjunction with or on top of the SMTP protocol. However, massive adoption and increasing popularity of e-mail have adversely affected the ability to seamlessly integrate support for these security properties into the e-mail infrastructure due to the following reasons:

• Deployment issues – Implementation of authentication schemes like Secure SMTP, SPF, SenderID, CSV and DKIM requires substantial modifications to the existing e-mail infrastructure. Also, both the sender and the recipient need to employ the same authentication algorithm for compatibility purposes. Strictly enforcing just one of these schemes on the recipient side might result in the rejection of legitimate e-mails sent by an incompatible relaying MTA using a different scheme. E-mail encryption schemes that operate on top of the SMTP protocol are not free from disadvantages either. S/MIME, OpenPGP, GnuPG and identity based encryption schemes mandate the installation of special software programs that are elusive to day-to-day users. These schemes also suffer from compatibility issues, causing difficulty in reading e-mails on the recipient side. For example, an e-mail message signed with S/MIME appears in unsupported clients as an attachment with the name smime.p7m, confusing the users.

• Security issues – E-mail authentication schemes like SPF and CSV validate senders based on relaying SMTP domains, and not on their e-mail accounts. In general, it is entirely up to the service provider to take action against malicious accounts. Given the sheer size of their user base, it might be difficult for service providers to identify and shut down malicious accounts in a timely fashion. Furthermore, existing schemes that perform integrity checking in e-mails only detect unauthorized content modification done by intermediate MTAs. A phisher can still tarnish a legitimate brand's integrity by sending e-mails from a fake domain whose name closely resembles the legitimate brand's domain name. For example, a phisher can legitimately register a domain with the name ebay-admin.com, obtain its certificate, and generate spoofed e-mails from that domain that carry valid digital signatures. A malicious sender can also claim that his keys were stolen and disown e-mails sent from his account, thereby successfully defying the basis of non-repudiation. Moreover, with the advent of botnets and compromised machines, it has also become possible for attackers to send spurious e-mails from legitimate machines or accounts without the users' knowledge.

• Usability issues – Digital signing and e-mail encryption schemes suffer from severe usability issues. Recently conducted usability evaluations of e-mail encryption software reveal that delegating the tasks of key management, digital signing, and signature verification to users is prone to failure [105, 120]. In particular, Whitten and Tygar evaluate the usability of the PGP 5.0 program on Macintosh using 12 human participants [120]. Their study concludes that even though 11 out of 12 participants were able to create the public/private keys required for signing, and 10 were able to distribute their keys to other users, only four were able to send out properly signed e-mails. Garfinkel and Miller, however, adopt an opposite standpoint and attribute the lapse in usability demonstrated in [52] more or less exclusively to PGP 5.0. Nevertheless, the number of unsigned legitimate e-mails sent every day by authorized users stands as testimony to the lack of adoption of these encryption schemes.
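The lookalike-domain weakness raised under security issues above (e.g., ebay-admin.com passing authentication checks) can be illustrated with a small sketch that flags sender domains closely resembling, but not matching, a protected brand. The brand list, the use of difflib's similarity ratio as a stand-in for edit distance, and the 0.7 threshold are all illustrative assumptions, not part of any deployed scheme.

```python
from difflib import SequenceMatcher

# Hypothetical protected-brand list; a deployment would use a curated feed.
BRAND_DOMAINS = ["ebay.com", "paypal.com", "bankofamerica.com"]

def lookalike_score(domain: str) -> float:
    """Highest similarity between the sender's domain and any brand
    domain (1.0 = identical)."""
    return max(SequenceMatcher(None, domain.lower(), b).ratio()
               for b in BRAND_DOMAINS)

def is_suspicious(domain: str, threshold: float = 0.7) -> bool:
    # An exact match is the real brand; a near-match is suspicious.
    if domain.lower() in BRAND_DOMAINS:
        return False
    return lookalike_score(domain) >= threshold
```

Note that such a check complements, rather than replaces, domain authentication: it catches the case where the signature is valid but the signing domain itself is the deception.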

1.3 E-mail-based Security Threats

Due to the aforementioned weaknesses in the current e-mail infrastructure, three prominent e-mail based threats have surfaced, viz. (i) spam; (ii) phishing; and (iii) information leak. As shown in Figure 1.7, all three security properties needed for secure communication have to be incorporated into the e-mail infrastructure in a foolproof manner to protect users from these threats. Also, the combination of any two security properties in the figure, along with their respective axes, represents the threats jointly addressed by them. For example, enforcing authentication and integrity protects users from spam and phishing attacks, while leaving the possibility of information leak at large. In this section, the prevalence of these threats is discussed in detail to bring out their effect on day-to-day e-mail users. Spam is classified as unsolicited junk e-mail that users do not want to receive. Owing to their reachability and low cost per message, spam e-mails are sent out as advertisement mules to market products or services to millions of users. Physical bulk mail per recipient, on the other hand, costs 100 times more than e-mail advertisements [79]. According to a Symantec report [6], in


Figure 1.7: E-mail-based threats that can be mitigated by incorporating authentication, integrity and non-repudiation properties in the e-mail infrastructure

the year 2008, around 80% of all Internet traffic was spam e-mail. Even though only a small fraction of users respond to such e-mails, it is still copious enough to sustain the popularity and existence of spammers and spam messages [112]. The large volume of spam e-mails sent out daily causes inconvenience to all e-mail users, who spend a non-trivial amount of time and effort toward mitigating them. A recent survey done by McAfee found that more than 49% of Americans spend around 40 minutes a week deleting spam e-mails [1]. A more serious threat than spam is phishing. In phishing attacks, attackers use forged e-mails and Websites that appear to originate from legitimate organizations to deceive users into disclosing personal, financial, or computer account information. This stolen information can then be used by attackers for criminal purposes, such as identity theft, larceny and fraud. According to a recent Gartner report, in the year 2007, more than 25,000 unique phishing e-mails hijacking 150 different brands were sent out on a monthly basis, resulting in over $3 billion in damage worldwide [53]. Furthermore, the report also estimates that, due to their lucrative nature, phishing attacks

are going to skyrocket through 2009. An equally important threat is information leak, where sensitive and crucial information is leaked out via e-mails. An information leak can be triggered by a user or by malware. The increase in e-mail inbox storage space has attracted users to use their inboxes as Web folders to store private and sensitive information. In fact, current Web based e-mail service providers offer storage space in the order of tens of gigabytes. Unfortunately, this has made Web based e-mail clients a target for malicious browser helper objects (BHOs) or toolbars, which attempt to surreptitiously steal and leak the information stored in them via e-mail. A more realistic threat, however, is information leak caused by users themselves. Users can inadvertently or intentionally leak out sensitive information pertinent to them or their organizations, resulting in violations of privacy and security. A side effect of the lack of authentication mechanisms is that attackers can easily send e-mails with malicious attachments, which can infiltrate and damage recipients' computers. Contrary to popular belief, end-to-end e-mail encryption schemes can also increase the chance of delivery of these attachments into recipients' inboxes; an attacker can sign and encrypt a potentially harmful e-mail with a malicious attachment so that it cannot be scanned, read and quarantined by anti-malware systems deployed on top of the intermediate routers or MTAs. Also, e-mails can be devised to exploit vulnerabilities present in the SMTP software itself. In the last three years alone, more than 50 different vulnerabilities in SMTP software have been reported in the National Vulnerability Database (NVD).

1.4 Dissertation Scope

As the aforementioned e-mail based threats still loom large, there has been a need to shift focus from enforcing cryptography based solutions that impede attackers from sending out spurious e-mails to devising efficient filters that monitor both incoming and outgoing e-mails. Filtering based solutions have several advantages over cryptography based solutions: (i) First, they do not necessitate significant modifications to the existing e-mail infrastructure. They can be directly deployed

atop either the receiving MTA or the Mail Delivery Agent (MDA); (ii) Second, unlike cryptography based solutions that require users to sign and encrypt their outgoing messages, or verify the incoming messages' signatures, filtering based solutions are completely automated and minimize the human-in-the-loop; and (iii) Finally, as mitigating these e-mail based threats is an arms race, filtering based solutions make it possible to test and deploy novel detection algorithms that address a particular threat class in an expedited manner. Hence the aim of this dissertation is to devise efficient filters that address each of these e-mail based threats in a piecemeal fashion. Also, an important factor that impacts the performance of these filtering based solutions is the ability to identify features that accurately distinguish spurious e-mails from legitimate ones. Most content filtering approaches ignore this crucial point and employ features that significantly overlap between legitimate and spurious e-mails. As a result, it has become relatively easy for attackers to blindside these existing filters by crafting spurious e-mails that closely mimic legitimate ones. To this extent, the crux of the research presented in this dissertation focuses on identifying an appropriate feature set that accurately delineates spurious e-mails based on their structural and operational behavior. The structural behavior identifies the deception agents in the body and header of spurious e-mails that hide their true intentions so that deployed filters or users mistakenly construe them to be legitimate. The operational behavior encapsulates the invariant characteristics intrinsic to spurious e-mails that are needed to successfully carry out the underlying attack. The remainder of this section outlines the scope of this dissertation in addressing these e-mail based threats. In the last decade, spam e-mail filtering has received a lot of attention from the research community.
Depending on their point of operation, existing spam classification approaches can be broadly placed into two categories: (i) classification based on content analysis; and (ii) classification based on network analysis. As the name suggests, techniques that focus on content analysis treat spam e-mail detection as a text classification problem. The idea is to separate spam e-mails from legitimate ones based on certain characteristics in the e-mail body and header that are intrinsic to them. These characteristics usually include commonly occurring spam keywords, senders' and intermediate MTAs' IP addresses, and other MIME properties. Since spam

servers exhibit high egress traffic, they can also be singled out using network analysis techniques that distinguish between normal and abnormal traffic patterns. In general, when deployed to classify spam e-mails, both of these techniques produce high detection rates and low false positives. This is primarily because it is easy to demarcate spam e-mails from legitimate e-mails based on their appearance and content. Also, even if spam e-mails successfully evade these filtering mechanisms and arrive in users' inboxes, they are innocuous in nature and do not pose a direct threat to users' security and privacy. For these reasons, the main focus of this dissertation is not spam detection. Phishing e-mail detection is a hard problem. Traditionally, spam detection techniques were extended to filter phishing e-mails. However, such an approach fails miserably because unlike spam e-mails, which are markedly different in appearance and content from legitimate e-mails, phishing e-mails closely imitate the e-mails of legitimate financial organizations, so that spam filters cannot tell them apart. A simple way to overcome this limitation is to devise customizable filters that examine incoming e-mails purporting to originate from financial domains to check if they address users in a personalized manner. In theory, as phishing e-mails do not contain users' personal information such as last name, last four digits of the account number, address, transaction identifier, etc., that are usually contained in legitimate e-mails, this difference in structural trait can be applied to separate them. Several specialized approaches have been proposed that focus solely on detecting phishing e-mails. However, existing approaches treat detection of phishing e-mails as a binary text classification problem, which operates on bag-of-words features.
As these approaches do not take legitimate e-mails from financial organizations into account, the task of classification is reduced to separating non-financial e-mails from financial e-mails, rather than separating legitimate e-mails from financial organizations from the spoofed e-mails that mimic them. Hence these approaches suffer from high false positive rates. Instead of examining the textual content, some phishing filters analyze structural features (e.g., HTML layout, MIME format, embedded links, etc.) present in phishing e-mails as a means for classification [49]. However, even these filters can be evaded by simple obfuscation attacks. The

link common to all existing phishing e-mail detection approaches is that they do not treat phishing as a social engineering attack. As phishing e-mails prey on users' gullibility, extracting their underlying semantic meaning and/or analyzing the behavior of the Websites linked to by URLs contained in them, as opposed to classifying them based on content and appearance alone, would help increase detection rates while keeping false positive rates low. The semantic meaning of a spoofed e-mail can also be used to generate context sensitive warning messages that educate users about the workings of phishing attacks. Due to the impact on everyday users and the lack of effective filtering solutions, detecting phishing e-mails is made the focal point of this dissertation. Information leaks pose a significant risk to users' privacy. An information leak could give out users' browsing characteristics or disclose sensitive material contained in their e-mail inboxes to unauthorized parties. To detect an inadvertent information leak, [24] models the past behavior between a sender and a set of recipients based on previously exchanged messages. The generated model is then used as a basis for detecting information leaks between the sender and the same set of recipient(s) in the future. The past behavior is encapsulated mainly using bag-of-words features and social network information. As a result, this information leak detection solution performs poorly, because linguistic and stylometric features that encapsulate the semantics of the exchanged messages are not taken into account. Information leak can also be triggered by malware that steals users' browsing characteristics. Knowledge of users' browsing characteristics can serve as a vital resource for attackers, enabling them to send targeted phishing e-mails and thus increasing the chance of users falling prey to phishing attacks.
In general, the information is stolen by spyware programs that surreptitiously operate on top of e-mail clients or browsers. Furthermore, such spyware intertwines its activity with normal user activity in order to evade detection. The existing approach proposed to detect the difference in activities can be thwarted easily by designing an intelligent class of spyware [20]. The research presented in this dissertation takes a step further by addressing the weaknesses in existing information leak detection approaches so that both user and malware triggered leaks are detected in an effective manner. Another important point to be considered while devising specialized filters to address each of

the e-mail based threats is the need to make them interoperable. For example, an e-mail supposedly sent from a financial domain, but containing a URL referring to a domain blacklisted for spam, is very likely a phishing e-mail. Identifying sources of attacks helps in developing attack agnostic solutions that block all sensitive communication from and to misbehaving nodes. From this perspective, this dissertation attempts to build a holistic framework that not only operates in conjunction with intrusion detection systems (IDS) to block incoming and outgoing traffic from and to misbehaving nodes, but also safeguards the underlying e-mail infrastructure from zero-day attacks.
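The cross-signal rule just described, in which a purportedly financial sender whose e-mail links to a spam-blacklisted domain is treated as phishing, can be sketched as follows; the domain sets are hypothetical placeholders for real feeds.

```python
# Illustrative placeholder lists, not real data feeds.
FINANCIAL_DOMAINS = {"bank.com", "paypal.com"}
URL_BLACKLIST = {"spam-host.ru"}    # hypothetical domain blacklisted for spam

def classify(sender_domain: str, linked_domains: set) -> str:
    """Combine a sender-identity signal with a URL-reputation signal."""
    claims_financial = sender_domain in FINANCIAL_DOMAINS
    links_blacklisted = bool(linked_domains & URL_BLACKLIST)
    if claims_financial and links_blacklisted:
        return "phishing"           # financial sender + spam-listed URL
    if links_blacklisted:
        return "spam"
    return "unknown"
```

The point of the sketch is only that two weak signals, each inconclusive on its own, become a strong indicator when combined across threat classes.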

1.5 Original Contributions

This dissertation makes the following contributions:

• A customizable and usable spam filter to separate phishing and legitimate e-mails based on the personalized information present in their content. The filter has been implemented as an add-on to Microsoft Outlook – a popular e-mail client used by personal and corporate users. This work has been published in [33].

• A novel methodology to detect phishing e-mails based on the underlying semantic meaning conveyed in them. Rather than focusing on the textual content alone, linguistic and stylometric features that uniquely distinguish phishing e-mails from the e-mails of legitimate financial organizations are considered. This work appeared in parts in [29] and [30].

• A proactive challenge-response based technique to detect phishing e-mails by analyzing the behavior of Websites linked from URLs contained in them. This helps in overcoming obfuscation attacks that encode the content of phishing e-mails using images or non-standard formats. A proof-of-concept implementation is provided as an ActiveX plug-in to the Internet Explorer version 6 (IE 6) browser. This work has been published in [28] and [30]. Also, the feasibility of detecting spam e-mails by analyzing the properties of linked-to Websites is explored.

• An efficient context-sensitive warning generation mechanism that conveys the intent of phishing e-mails directly to users. This helps in educating users about the working and effects of phishing attacks. The warnings are provided in the form of a modal text box that disallows users from responding to phishing e-mails. This work was published in [33].

• A randomized honeytoken based mechanism to detect evasive spyware that surreptitiously steals users' browsing characteristics by blending in with normal user activities. This work appeared in [32]. Also, a mechanism to detect inadvertent information leaks via e-mails using linguistic and structural features is proposed.

• A framework that serves as a support platform to protect e-mail infrastructure from zero-day attacks. The framework is generic enough to generate just-in-time solutions that protect all the deployed services in a network. This work appeared in [26] and [27].
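The personalization check underlying the first contribution above can be illustrated with a short sketch: an e-mail that claims a financial origin but contains none of the user's stored tokens is flagged. The profile fields, token set, and verdict labels here are hypothetical stand-ins; the actual filter is the Outlook add-on described in the first contribution.

```python
import re

def has_personalization(body: str, profile: dict) -> bool:
    """True if any stored user token (last name, last four digits of
    the account number, etc.) appears in the e-mail body."""
    tokens = [profile.get("last_name", ""), profile.get("account_last4", "")]
    return any(t and re.search(re.escape(t), body, re.IGNORECASE)
               for t in tokens)

def cusp_verdict(body: str, profile: dict, from_financial_domain: bool) -> str:
    # A financial-looking e-mail carrying no personal token is suspect.
    if from_financial_domain and not has_personalization(body, profile):
        return "suspect-phishing"
    return "pass"
```

As the dissertation notes, absence of personalization is a useful but not foolproof signal, which motivates the semantic classifier of the second contribution.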

1.6 Dissertation Outline

Chapter 2 begins by examining the literature related to e-mail based threats. Specifically, existing works pertinent to phishing, spam and information leak countermeasures are discussed in detail to place the ideas presented in this dissertation in perspective. Chapter 3 discusses the detection of phishing e-mails based on linguistic and stylometric features. These features are extracted from the text present in the e-mail body, and are passed as input variables to machine learning algorithms that separate legitimate and illegitimate e-mails. Chapter 4 focuses on validating e-mails by analyzing the Websites that are linked to by the URLs present in the e-mail body. Chapter 5 studies the problem of addressing information leak in e-mails by proposing randomized honeytoken based countermeasures that can detect evasive spyware that steals users' personal information by blending it with normal Internet traffic. Also, information leak caused inadvertently by senders is addressed. Chapter 6 concentrates on developing a scalable framework to mitigate inbound malware threats that target vulnerabilities in the existing e-mail framework. The proposed framework can

also aid system administrators in mitigating zero-day attacks that target other non-SMTP ports, especially when vendor supplied patches are not available. Finally, concluding remarks, impact and future work are presented in Chapter 7.

Chapter 2

Background and Related Work

“The great thing in the world is not so much where we stand, as in what direction we are moving.”

− Oliver Wendell Holmes

2.1 Introduction

Chapter 1 introduces the shortcomings of the present day e-mail infrastructure by listing its design flaws and the attacks that exploit them. The existing literature that attempts to tackle these limitations can be broadly divided into two categories, depending on their mode of operation: (i) cryptography based solutions; and (ii) filtering based solutions. As the name indicates, cryptography based solutions focus on securing the existing e-mail infrastructure by encrypting messages communicated between senders and recipients. These schemes employ publicly available standards such as S/MIME, PGP and GPG to encrypt, decrypt and validate e-mail messages. Sender Policy Framework (SPF), SenderID, Certified Server Validation (CSV) and DomainKeys have also been proposed as alternate mechanisms to authenticate e-mails based on the sender's domain name. DomainKeys uses digital signatures to sign the entire e-mail message, whereas SPF, SenderID and CSV examine the sender's domain name present in the e-mail header to determine forgery. Even though these schemes assist in mitigating e-mail based security threats, they suffer from several deployment, security and usability issues.
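To make the domain based authentication concrete, the sketch below checks a sender IP against the ip4 mechanisms of a published SPF TXT record. This is a deliberate simplification: full SPF evaluation (RFC 7208) also processes a, mx, include and redirect terms and the qualifier prefixes, none of which are handled here.

```python
import ipaddress

def spf_permits(spf_record: str, sender_ip: str) -> bool:
    """Check a sender IP against only the ip4 mechanisms of an SPF
    record; everything else falls through to a reject (a "-all"
    default is assumed for this sketch)."""
    ip = ipaddress.ip_address(sender_ip)
    for term in spf_record.split():
        if term.startswith("ip4:"):
            if ip in ipaddress.ip_network(term[4:], strict=False):
                return True
    return False
```

For example, against the record `v=spf1 ip4:192.0.2.0/24 -all`, a relay at 192.0.2.17 is permitted while one at 198.51.100.1 is not, which is exactly the forgery check SPF performs on the envelope sender's domain.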

Strong token based authentication mechanisms can be enforced using physical devices like smartcards, Bluetooth devices, USB tokens, and other hardware that generate one-time passwords. As these generated passwords expire after a single logon, it is not possible for a phisher to impersonate users at a later point in time. Some commercial implementations of such hardware tokens include RSA's SecurID, Aladdin's eToken and ActivIdentity tokens. Even though these schemes provide multi-way authentication by having an out-of-band (OOB) authentication channel, they are still susceptible to man-in-the-middle attacks, where phishers, to avoid detection, replay the obtained password on the legitimate Website and then present the received information back to the users. Moreover, these approaches incur high setup and management costs, making their large scale adoption difficult. While cryptography based solutions aim to address the fundamental flaws of the e-mail infrastructure, filtering based solutions weed out unwanted e-mails that pose a threat to users' security and privacy. Filtering of spurious inbound e-mails (ingress filtering) can prevent spam, phishing, and unsafe e-mail attachments from plaguing users' mailboxes. On the other hand, filtering outbound e-mails (egress filtering) can prevent both accidental and intentional information leaks via e-mails that have the potential to result in privacy violations. Unlike cryptography based solutions, filtering based solutions are easy to implement and deploy, as they do not require the sender and the recipient to have any pre-communication agreement preceding an e-mail transfer. At an operational level, filtering based solutions are behavior based anomaly detection systems, which examine structural traits (such as e-mail body and header content, DNS entries, timestamps, etc.)
and user behavior (such as patterns of usage, outgoing and incoming mail statistics, login times, message composition and dispatch times, etc.) to build discriminator models that can separate abnormal e-mails from the rest. Filtering based solutions are deployed at the recipient side, either on the mail server or at the user level. Most filtering based solutions operate in a piecemeal fashion, targeting one particular e-mail based threat vector rather than providing a unified solution. Due to their ease of adoption and their customizable nature, which can be tailored to the recipients' specifications, filtering based solutions have taken precedence over cryptography based solutions.

There exist a significant number of works that focus solely on behavior based spam detection. For example, Shlomo [58] and Sculley [104], in their doctoral dissertations, propose online learning techniques to mitigate spam e-mails by analyzing e-mails' content and recipients' usage patterns. These works, however, fail to address the more important problem of phishing e-mail detection. As phishing attacks involve spoofed e-mails that closely resemble their legitimate counterparts, generalized "spam" detection techniques cannot be used to prevent them. These techniques also ignore the social engineering component, wherein attackers lure potential victims into executing malicious attachments or divulging their sensitive information on spoofed Websites. Another important area left unexplored by these works is the prevention of information leaks in e-mails. Leaks of sensitive information via e-mails can be hazardous, resulting in the compromise of users' privacy and the violation of organizational policies. The literature at the intersection of privacy and security in e-mails is very limited. In this chapter, existing solutions relevant to this dissertation are discussed in detail to bring out some of the issues that are unaddressed or overlooked in them. Such an effort is not only helpful in understanding the progress made in the field of e-mail security, but also in bringing out the novelty of the research described in this dissertation.

2.1.1 Chapter Organization

Section 2.2 surveys prior research that concentrates on detecting phishing e-mails before they reach the user's mailbox. Section 2.3 discusses e-mail validation techniques that analyze the Websites referred to by URLs contained in e-mails. Information leak in e-mails is a serious problem; Section 2.4 presents approaches that focus on detecting information leaks triggered by users and by malicious software such as spyware. Concluding remarks are provided in Section 2.5.

2.2 Phishing E-mail Detection

Classification of phishing e-mails usually occurs before they reach the user's inbox. As phishing e-mails fall into the unsolicited bulk e-mail category, traditionally, anti-spam mechanisms were used to filter them. One way to identify phishing e-mails is to verify the identity of the sender. Similar to spam mitigation techniques, blacklisting was adopted for blocking spoofed or potentially dangerous e-mails [45, 46, 56, 98]. If the sender's IP address is found in a blacklist, then the e-mail is either rejected outright or, at the very least, marked as spam and placed into a junk folder. A blacklist can be maintained by an individual user by hardcoding restricted IP addresses in mail delivery agents (MDAs) or e-mail clients. For example, a user can empower MDAs such as procmail or maildrop to filter out e-mails from known spam sources. Most Web based or desktop based e-mail clients have provisions allowing users to maintain their own personalized blacklists. Alternatively, blacklists can be maintained by an organization or collaboratively by a group of users. Vipul's Razor [98] is a collaborative, blacklist based spam filtering mechanism, which requires active user participation to identify spam e-mails. Once a reputed (or trusted) user marks an e-mail message as spam, its signature along with the sender information is extracted and published into the blacklist maintained by a distributed server. Clients that are subscribed to this server are "regularly" updated with newly published information, which can be used to block similar spam e-mails. Blacklists can also be maintained by Internet Service Providers (ISPs) as DNS Blacklists (DNSBLs). ISPs can place IP addresses of abusive or well known spam servers into DNSBLs that can be easily queried by computer programs over the Internet. As new IP addresses are relatively cheap to obtain, it has become hard for these lists to keep up with constantly changing spam sources.
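A DNSBL lookup is an ordinary DNS query whose name is formed by reversing the IP address's octets and prepending them to the blacklist zone; receiving an A-record answer (conventionally in 127.0.0.0/8) means the IP is listed. A minimal sketch of the name construction, with the zone shown only for illustration:

```python
def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Build the DNS name used for a DNSBL lookup: reversed octets
    plus the blacklist zone. Resolving this name (e.g., with
    socket.gethostbyname) and getting an answer indicates a listing."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone
```

For example, checking 192.0.2.99 amounts to resolving `99.2.0.192.zen.spamhaus.org`; the actual DNS resolution is omitted here since it requires network access.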
Recently, botnets have been used to send out spam and phishing e-mails. Typically, as botnets have thousands of member bots, it is possible for a controlling botmaster to operate continuously by altering the set of individual bots used for sending out spurious e-mails. Even otherwise, as spam sources are short-lived, they need to be promptly identified, entered into blacklists and propagated to clients within a matter of a few minutes or seconds. Ramachandran et

al. [101] estimate that most spam bots have lifetimes in the order of seconds, and send out spam e-mails to a given domain only once in 18 months. As a result, most phishing e-mails regularly bypass the blacklists. Sinha et al. [108] study the effectiveness of four popular spam blacklists on an academic network. Their research reveals that all of these spam blacklists are tainted with incorrect or incomplete information, and consequently have large false negative and false positive rates. Whitelisting, on the other hand, controls user behavior, restricting users to receiving messages from only a list of acceptable sources. All the mail servers belonging to known domains need to be placed into a whitelist; e-mails from unauthorized machines (not in the whitelist) are rejected. However, defining a whitelist is a hard problem, as it requires enlisting all legitimate senders beforehand. Greylisting is another recently developed technique to deter spammers from sending out spurious e-mails. In greylisting, the MTA temporarily rejects all e-mails from unidentified senders. The rationale is that a legitimate sender would retry sending the message, at which point it is accepted. Since spammers send out millions of e-mails, they typically cannot afford the time delay to retry. On the flip side, greylisting increases the computational overhead on both the sender and recipient sides, delaying the delivery of e-mails even in legitimate cases. As opposed to the aforementioned techniques, a more reliable way of detecting phishing e-mails is through content analysis. Content analysis techniques classify phishing e-mails by examining their intrinsic textual content and structural traits. The classification is done either through simple rule based heuristics or advanced supervised machine learning algorithms.
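The greylisting policy described above (temporarily reject the first delivery attempt from an unknown sender/recipient/client-IP triplet, then accept a retry after a delay) can be sketched as follows; the 300-second delay is an illustrative default, and real implementations also expire stale entries and auto-whitelist proven senders.

```python
import time

class Greylist:
    """Minimal greylisting sketch keyed on (sender, recipient, client IP)."""

    def __init__(self, delay: float = 300.0):
        self.delay = delay
        self.first_seen = {}

    def check(self, triplet, now=None) -> str:
        now = time.time() if now is None else now
        if triplet not in self.first_seen:
            self.first_seen[triplet] = now
            return "tempfail"     # SMTP 4xx: legitimate MTAs will retry
        if now - self.first_seen[triplet] >= self.delay:
            return "accept"
        return "tempfail"
```

The asymmetry the technique exploits is visible in the code: the cost to a legitimate MTA is one queued retry, while a spam engine that never retries is never accepted.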
Mozilla Thunderbird [36], a popular desktop based e-mail client, detects phishing e-mails by checking whether they contain either dotted IP URLs or URLs whose visible and hidden targets differ. Using feedback from more than 30,000 Hotmail users and data from honeypots, Microsoft extracts more than 100,000 attributes present in phishing e-mails [87]. These attributes are then used to train its SmartScreen phishing filter. Furthermore, the filter is constantly monitored and fine-tuned by a team of domain experts. Unfortunately, no information about its performance is provided. In a similar vein, Drake et al. [44] survey the list of features that can be applied to detect phishing e-mails. Their work, however, does not elaborate on how the identified features can be used for detection. Fette et al. [49] propose to

detect phishing e-mails using structural features present in the e-mail body. The structural features include the presence of JavaScript, HTML content, dotted IP URLs, the number of URLs, the number of domains, the number of dots in URLs, nonmatching URLs, etc. These features are first extracted and then fed into supervised machine learning models that perform binary classification (phishing or ham). Textual content is not taken into account for detection purposes. It is quite normal for phishing e-mails to have the same number of URLs, the same MIME format, and the same domain structure as legitimate e-mails. Also, phishing e-mails can be constructed to omit these selected features. For example, a phishing e-mail composed without any of the other structural features can still be successful with just one link that redirects users to a fake Website. As a result, their method has significantly large false negative rates (ranging from 3.6% to 8.5% depending on the input features). They also do not test their approach on legitimate e-mails sent by financial organizations. Bergholz et al. [17] apply statistical models to both structural and textual features for detecting phishing e-mails, assigning weights to individual features. Their feature set also contains graphical features of the e-mails, including logos and images present in them. They do not provide details on whether their model was tested on ham e-mails sent by legitimate financial institutions. Also, their dataset is not publicly available to benchmark the outcome of their experiments. Bag-of-words classifiers that focus just on the textual content have also been used to classify phishing e-mails. SpamAssassin [110], for instance, has a Bayesian classification module that trains on the textual features present in spam e-mails. As words that appear in phishing e-mails are also found in legitimate e-mails (e.g., words such as bank and account are commonly found in both phishing and legitimate e-mails), these approaches suffer from degraded performance.
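A few of the structural features discussed above (link count, dotted IP URLs, and a mismatch between a link's visible text and its actual href) can be extracted with a short sketch. This is an illustration of the feature class, not the implementation of [49], and the regular expression is a simplified stand-in for a full URL grammar.

```python
import re
from html.parser import HTMLParser

DOTTED_IP = re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}")

class LinkExtractor(HTMLParser):
    """Collect (href, visible text) pairs from an HTML e-mail body."""
    def __init__(self):
        super().__init__()
        self.links, self._href = [], None
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href", "")
    def handle_data(self, data):
        if self._href is not None:
            self.links.append((self._href, data.strip()))
            self._href = None

def structural_features(body: str) -> dict:
    p = LinkExtractor()
    p.feed(body)
    return {
        "num_links": len(p.links),
        "has_dotted_ip_url": bool(DOTTED_IP.search(body)),
        # visible text looks like a URL but differs from the real target
        "nonmatching_url": any(t.startswith("http") and t != h
                               for h, t in p.links),
    }
```

Run on a typical spoofed anchor such as `<a href="http://203.0.113.5/login">http://www.ebay.com</a>`, all three indicators fire, which is exactly the deceptive pattern these filters look for.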
Abu-Nimeh et al. [7] transform phishing e-mail detection into a text classification problem. By employing textual features present in the e-mail body, they attempt to classify incoming phishing e-mails using several popular classifiers, e.g., logistic regression, classification and regression trees, Bayesian additive regression trees, Support Vector Machines (SVM), random forest and neural networks. Logistic regression and random forest exhibited the best false positive rate (4.89%) and false negative rate (11.2%), respectively.
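The vocabulary-overlap weakness noted above can be seen directly in the token statistics a Bayesian filter computes. In the illustrative calculation below, a token roughly as frequent in ham as in phishing scores near 0.5 (carrying no signal), while a class-exclusive token scores near 1.0; the document-frequency counts are made up for illustration.

```python
def token_spamminess(phish_count: int, ham_count: int,
                     n_phish: int, n_ham: int) -> float:
    """Naive estimate of P(phishing | token), as computed by simple
    Bayesian filters from per-corpus document frequencies."""
    p = phish_count / n_phish      # fraction of phishing e-mails with token
    h = ham_count / n_ham          # fraction of ham e-mails with token
    return p / (p + h)

# A token like "account", common in BOTH classes, scores near 0.5 and is
# useless, whereas the telltale tokens spam filters rely on score near 1.0.
```

When the ham corpus consists of legitimate financial e-mail, most tokens fall into the first category, which is why bag-of-words filters degrade on phishing.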

2.2.1 Discussion

Detecting phishing e-mails is not akin to spam e-mail detection, as it requires an entirely different set of specialized features. Traditional spam filters exhibit poor performance when deployed for detecting phishing e-mails. Recent phishing e-mail detection techniques rely on supervised machine learning algorithms that operate on specialized feature sets, which accurately characterize phishing e-mails. Even though these approaches use specialized features, they overlook a simple, yet crucial, fact: most e-mails from legitimate financial institutions share the same features as phishing e-mails. Also, their experiments do not validate the proposed classification models with e-mails from legitimate institutions. To get an accurate and fair measure of a classifier's performance, it is essential to treat its primary objective as segregating e-mails into two classes – phishing and ham, where the ham set includes e-mails from legitimate financial institutions. This would help in accurately demarcating features found in phishing e-mails, and excluding features also common to legitimate financial institutions. In order to achieve this goal, this dissertation proposes a novel technique to encapsulate the semantic meaning of e-mails. Capturing the semantic meaning or tone of an e-mail using linguistic features assists in identifying the accurate intent of the e-mail, thereby improving classifiers' performance. It can also be used to generate context sensitive warnings, which educate users about the operation of phishing attacks. The experiments are carried out using popular phishing and ham datasets, which are freely available.

2.3 Validating E-mails Through Referral Webpages Analysis

The approaches discussed in the previous section attempt to detect spurious e-mails by examining information present in their content. An alternate way to classify e-mails is by analyzing the behavior and appearance of Websites linked to by URLs contained in them. As the objective of any phishing or spam e-mail is to attract users into visiting these linked-to Websites, the Websites, unlike the e-mails, contain information that needs to be clearly communicated to users and is not obfuscated with junk material, thus providing ample features for classification.
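The first step of any such analysis is pulling the linked-to URLs out of an e-mail's HTML body so the referred pages can be fetched and examined. A minimal sketch (the sample message body is fabricated):

```python
# Extract the href targets of anchor tags from an HTML e-mail body.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Collect the href of every <a> tag encountered.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)

body = '<p>Dear customer, <a href="http://fake-bank.example/login">verify</a> now.</p>'
extractor = LinkExtractor()
extractor.feed(body)
print(extractor.urls)  # the linked-to sites a classifier would analyze
```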

In phishing attack scenarios, linked-to Websites mimic the appearance and behavior of legitimate Websites to trick users into believing that they are actually visiting legitimate Websites. Since most phishing attacks rely on the inability of users to tell legitimate and fake Websites apart [43], several commercial and open source toolbars have been proposed to aid them in decision making, especially at the time of browsing. The SpoofStick toolbar [111] displays a Website's real domain name, in order to reveal phishing sites that obscure their domain name. When an attacker uses a legitimate domain as a subdomain, e.g., http://[email protected], SpoofStick would display fake.ru as the visited domain in its toolbar. However, as it only provides visual cues, leaving the final decision to the users, it does not provide an automated solution to detect phishing attacks. The NetCraft antiphishing toolbar [93] is another such tool; it employs a client-server architecture for detecting phishing Webpages. User computers installed with the toolbar act as clients, and are responsible for relaying suspicious Websites to a centralized server. The server processes the information sent over by clients, and determines each reported domain's risk rating, rank, age, hosting country and organization. This information is then sent back and displayed on clients' toolbars to assist users in detecting phishing Websites. As mentioned earlier, the main disadvantage with such an approach is that, as phishing Websites are ephemeral, it might not be possible to propagate the processed information to clients in time. SpoofGuard [34] calculates a spoof score for Websites using stateful and stateless evaluation techniques. The stateless evaluation analyzes Websites for broken links, obfuscated URLs, invalid HTTPS connections, validity of SSL/TLS certificates, and authenticity of embedded images.
On the other hand, stateful page evaluation techniques monitor all outgoing data using site-specific salts so that a user does not provide his username and password to a site he has never visited before. A major disadvantage with these approaches is that they are susceptible to attacks launched from compromised legitimate Websites. Using freely available Web hosting domains, an attacker could create an account with username login, and launch a successful phishing attack by hosting the fake login page of that domain from his Web folder; the fake login page would appear under www.domain.com/login1. These toolbars can also be directly attacked by

1It is worth mentioning that one such attack was carried out on geocities.com, a popular Web hosting domain from Yahoo!

phishers [62, 125]. JavaScript and Java applets can also be used by attackers to hide or fake the visual cues shown by these toolbars [117, 127]. A recent study of 10 popular anti-phishing toolbars concludes that the toolbars are ineffective in detecting phishing Websites, as they fail to identify over 15% of the phishing Websites used in testing [132]. As detection of spoofed Websites involves a human-in-the-loop, several approaches have been proposed to assist users in detecting them from a usability standpoint. Since users are inept at correctly identifying fake Websites from technical content (e.g., URL, IP address and domain information), these techniques help in validating them using personalized information. For example, Yahoo! allows users to upload and display a personalized image near the login window to distinguish its site from phishing sites. Dhamija and Tygar [42] propose dynamic security skins to display a user-uploaded image near the login window for easy identification. Graphical passwords [66, 113, 121] have been proposed to provide mutual authentication between users and Websites by asking the users to identify a set of images before log-in. The images used for authentication are uploaded by users at the time of account creation. Graphical passwords also place an additional burden on Websites, requiring them to display the correct set of images corresponding to every user. Using images for authentication is still not foolproof, as it is vulnerable to shoulder surfing. Similar to toolbars, these techniques postpone decision making until Websites are loaded into the browser by the users, thereby bringing users closer to the attack source. Passively analyzing linked-to Websites, without having users load them into browsers, can eliminate human-in-the-loop errors. CANTINA [133] is one such tool; it extracts frequently occurring words from the Websites referred to by URLs present in the e-mail body. These words are fed into a popular search engine.
If the domain name of the URL matches the domain name of one of the top N search results, the e-mail is considered safe; otherwise it is marked suspicious. Similarly, a list of legitimate Websites frequented by users can be downloaded and cached locally. If a linked-to Website has a similar structural layout and content as a cached Website, but is hosted on a different domain, it is marked as phishing [72]. Spamscatter [10], on the other hand, uses image shingling


techniques to cluster scam Websites based on visual similarity. Screenshots of linked-to Websites are first constructed using the KHTML layout engine. The screenshots are then broken down into smaller visual hashes (shingles), and are grouped into clusters based on their similarity (i.e., screenshots having similar shingles are placed in the same cluster). Each cluster is analyzed to determine the type of scam hosted in its pages. However, it is still possible for attackers to evade such a system by subtly changing the appearance of the fake Website so that its similarity with the corresponding cached Website goes unnoticed.
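The shingling idea behind Spamscatter can be illustrated with a much-simplified sketch: break a rendered screenshot (here, a raw byte buffer standing in for pixel data) into fixed-size tiles, hash each tile, and compare two pages by the Jaccard similarity of their hash sets. The buffers and tile size below are toy assumptions, not Spamscatter's actual parameters.

```python
# Simplified image shingling: tile hashes plus Jaccard similarity.
import hashlib

def shingles(pixels: bytes, tile: int = 8) -> set:
    # Hash each fixed-size tile of the buffer; the set of hashes is the
    # page's "shingle" signature.
    return {hashlib.md5(pixels[i:i + tile]).hexdigest()
            for i in range(0, len(pixels), tile)}

def similarity(a: bytes, b: bytes) -> float:
    # Jaccard similarity of the two shingle sets.
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

page1 = b"HEADER--LOGO----FORM----FOOTER--"
page2 = b"HEADER--LOGO----SPAM----FOOTER--"  # one tile differs

print(similarity(page1, page1))  # identical pages -> 1.0
print(similarity(page1, page2))  # 3 shared tiles of 5 distinct -> 0.6
```

Pages whose similarity exceeds a threshold would be placed in the same scam cluster; this also makes concrete why a subtle appearance change, which alters many tiles at once, can push similarity below the threshold and evade detection.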

2.3.1 Discussion

The main reason for the failure of existing approaches is that they tend to overlook the behavior of linked-to Websites while evaluating them. As a result, recent attacks, growing in ingenuity and sophistication, are able to blindside them successfully. In order to overcome this limitation, this dissertation proposes PHONEY, a novel challenge-response based methodology that proactively examines the behavior of linked-to Websites for authentication. The key idea behind our approach is to protect real users' identities by designating phantom users to provide fake (phoney) information to the Websites requesting critical information until their authenticity is verified. The assumption here is that, just as an end user cannot tell legitimate and spoofed e-mails apart, phishers cannot tell the responses of legitimate and phantom users apart. As the mail user agent (MUA) receives e-mails, PHONEY analyzes their content for the presence of embedded links and attached HTML forms. If an e-mail contains no such suspicious characteristics, no further investigation is performed. Otherwise, PHONEY parses the HTML content, extracts the form elements and supplies fake values for the information requested by them. Since a masquerading Website cannot verify the credibility of the supplied information, it is indifferent to both the real and the contrived responses, showing no difference in behavior. PHONEY is tested with live data to demonstrate its ability to detect a wider range of phishing attacks than existing schemes. Also, a performance analysis study shows that the implementation overhead introduced by it is negligible. In benign cases, where linked-to Websites do not contain HTML form elements requesting

sensitive information, validation of e-mails is done directly by analyzing the content of linked-to Websites. For example, spam e-mails often contain URLs that refer to e-commerce Websites, which sell or advertise products related to the adult, financial, drug and leisure industries. Typically, such spam e-mails are regarded as a nuisance, and do not pose a direct threat to users' privacy. However, since the brands targeted by spammers are fairly static, it is possible to detect them easily using the content of these e-commerce Websites. To this end, this dissertation highlights a set of features that accurately encapsulate the appearance and content of the linked-to Webpages. Classification is done using machine learning algorithms such as support vector machines (SVM), Bayesian classifiers and decision trees that act on the chosen features [81]. Experimental evaluation is done using live spam and ham (legitimate) e-mails, and the performance is compared against two popular open source anti-spam tools.
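A hedged sketch of the kind of appearance and content features alluded to above; the feature names, keyword list, and sample page are illustrative assumptions, not the exact set used in the evaluation.

```python
# Illustrative appearance/content features for a linked-to Webpage.
import re

# Hypothetical keyword list standing in for the static brands/products
# advertised by spammers.
SPAM_TERMS = ("viagra", "casino", "loan", "refinance", "pharmacy")

def page_features(html: str) -> dict:
    text = html.lower()
    return {
        "n_images": len(re.findall(r"<img\b", text)),   # visual appearance
        "n_forms": len(re.findall(r"<form\b", text)),   # data collection
        "n_links": len(re.findall(r"<a\b", text)),      # link structure
        "spam_terms": sum(text.count(t) for t in SPAM_TERMS),
    }

html = '<html><img src="p.jpg"><a href="/buy">Cheap pharmacy loans</a></html>'
print(page_features(html))
```

Vectors of this kind, one per linked-to page, would then be fed to the SVM, Bayesian, or decision tree classifiers mentioned above.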

2.4 Preventing Information Leak in E-mails

The previous two sections discuss existing literature that proposes to alleviate problems arising due to the lack of authentication and integrity in the e-mail infrastructure. An equally important problem is preventing information leak in e-mails, where crucial and sensitive information about users is leaked out, leading to privacy losses. In the first part, information leaks caused by malware, which surreptitiously operates in the background without the users' knowledge, are addressed. Martin et al. [82] propose a behavior based methodology to detect abnormal outgoing e-mail activity due to e-mail worms. Using features extracted from e-mail content and past patterns of usage, their classification mechanism identifies e-mails sent unknowingly from a user's inbox. These e-mails, however, do not cause information leak, but act as a launch pad for spam and phishing attacks. As Web based e-mail clients are becoming popular, it has become possible for browser based spyware to steal information contained in e-mails and send it out as HTTP packets [5]. Also, such spyware can leak users' browsing characteristics to attackers so that they can launch context-aware attacks against potential victims,

thus increasing the chance of success. In both cases, since the sensitive information leaked out is not in the form of e-mail, but rather HTTP packets, most existing e-mail based extrusion detection mechanisms fail to detect it. Webtap [19] is one such tool; it monitors outbound HTTP traffic to detect spyware programs by separating user activity from spyware activity. However, as recent spyware blends with user activity to evade detection, distinguishing spyware activity from user activity is not always possible [20]. To detect such evasive spyware, a honeytoken based approach has been proposed. These techniques operate by sending a known sequence of network requests that mimic user activities. As spyware cannot distinguish honeytokens from legitimate user activity, it attempts to operate on the honeytokens by making additional network requests (which are then identified by the gateway NIDS). Unfortunately, if static honeytokens are used, it becomes trivial to design a new class of intelligent spyware that can evade detection. As e-mails are being used as a communication medium for sensitive transactions, unauthorized access to e-mail content can result in devastating privacy violations. Despite its seriousness, not much work has been done in this area. Preventing information leak in e-mails can be modeled as an extrusion detection problem – outgoing e-mails need to be analyzed to see if they are intended for the recipients listed in either the TO: or CC: fields. Information leak in e-mails can be either user or malware triggered. Accidental leaks can occur if users unsuspectingly send e-mails to unintended recipients. Carvalho and Cohen [24] were the first to address this issue in a formal fashion. They propose a novel way to simulate information leaks on a real e-mail corpus.
Applying textual content and network patterns as features, they use supervised learning techniques to detect outliers – cases where an e-mail waiting to be sent does not fall in the same category as previous messages sent to the same recipient. Even though they provide a formal way to model information leak in e-mails, their proposed detection mechanism is still immature. As correctly pointed out by the authors, to improve the classification results it is imperative to use a different set of features, or a better learning algorithm. Balasubramanyam et al. [14] take this one step further, and propose CutOnce, an extension to Mozilla Thunderbird, to address usability issues involved in building an online system for detecting information leaks in e-mails. Their extension relies on existing information

leak detection capabilities proposed in [24], integrated with a novel recipient recommendation system. Similarly, Kalyan and Chandrasekaran [68] propose to detect information leak in financial e-mails by employing non-textual features such as the time at which a message was sent, the type of attachment files, the size of the message, the presence of a company or personal address in the CC: field, etc. The proposed technique is able to detect 92% of information leaks in a dataset with 554 e-mails and 70 leaks. Unfortunately, no details about the dataset, or about how information leaks were labeled initially, are provided. Lieberman and Miller [77] propose an extension to Web mail that displays pictures of the intended recipients in a peripheral display to prevent accidental information leak during message composition. The pictures act as visual cues that let users detect misdirected e-mails with only a brief glance, before they are actually sent out. However, their approach is feasible only in closed environments, where it is easy to obtain images of recipients. Another related, yet diametric, problem is authorship attribution in e-mails. Since the existing e-mail infrastructure does not have a built-in non-repudiation system, it is possible for senders to disown messages sent by them at a later point in time. From this standpoint, authorship attribution techniques offer solutions to decipher the real identity of a user from a set of suspected users. The main difference between authorship attribution and information leak detection is that the former attempts to verify if an e-mail was sent by the exact sender listed in the e-mail header, while the latter determines if an e-mail is rightly sent to the intended set of recipients. A disadvantage common to all aforementioned approaches is that they are vulnerable to information leaks triggered by malware.
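The non-textual features described above (send time, attachment type, recipients in the CC: field, message size) can be sketched as follows. The feature names, thresholds, and corporate domain are assumptions for illustration, not those used in [68].

```python
# Illustrative non-textual features for an outgoing message, as a
# leak-detection classifier might consume them.
from datetime import datetime

def leak_features(msg: dict) -> dict:
    sent = datetime.fromisoformat(msg["sent"])
    return {
        # Sent outside assumed business hours (8am-6pm)?
        "after_hours": sent.hour < 8 or sent.hour >= 18,
        # Attachment types often involved in leaks (assumed list).
        "risky_attachment": any(a.endswith((".xls", ".zip", ".db"))
                                for a in msg["attachments"]),
        # Any CC: recipient outside the (hypothetical) corporate domain?
        "external_cc": any(not addr.endswith("@corp.example")
                           for addr in msg["cc"]),
        "size_kb": len(msg["body"]) // 1024,
    }

msg = {"sent": "2009-03-01T23:40:00",
       "attachments": ["accounts.xls"],
       "cc": ["[email protected]"],
       "body": "quarterly numbers attached" * 50}
print(leak_features(msg))
```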

2.4.1 Discussion

In this dissertation, a two-pronged approach to detect information leak triggered by users and spyware is adopted. To detect accidental information leak in e-mails, this dissertation proposes to address the limitations in [24] by selecting a variety of stylometric, structural and linguistic features that accurately characterize past communication between senders and recipients. The detection of information leak is thus transformed into a text classification problem. Experiments conducted on simulated information leaks reveal that the proposed methodology exhibits better performance

when compared with [24]. Current static honeytoken based spyware detection mechanisms fail to prevent information leak caused by evasive spyware. In order to highlight the limitations of existing spyware detection schemes, a new class of spyware, SpyZen, is introduced, which theoretically can circumvent defense mechanisms that use a static honeytoken based approach. The attack illustrating this circumvention is synthesized by means of data mining algorithms like association rule mining [8]. Next, a randomized honeytoken generation mechanism called SpyCon is proposed to address this new class of spyware. Experimental results show that static honeytokens are detected with near 100% accuracy, whereas randomized honeytokens are similar to realistic user activity, and hence are indistinguishable. One of the open research challenges is to develop an attack-agnostic defense framework [21], especially to protect from Internet based attacks where it may be possible for a misbehaving node to change its threat vector. As current defense mechanisms are geared toward protecting from a single threat vector such as phishing, spam, spyware, etc., they adopt a nonchalant viewpoint towards detecting misbehaving nodes. For example, even though a misbehaving node is blacklisted for spam, it can still host phishing attacks to collect financial information. Bearing this in mind, an effort to develop an attack-agnostic defense framework, AEGIS, is proposed, which acts as a firewall to block incoming and outgoing packets corresponding to previously identified malicious hosts.
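The randomized-honeytoken idea behind SpyCon can be sketched minimally as follows; the site pool, sequence length, and dwell-time range are assumptions for illustration, not SpyCon's actual design.

```python
# Minimal sketch: each decoy browsing sequence is drawn afresh from a
# pool of benign sites, with randomized order and timing, so spyware
# cannot learn a fixed signature to filter the honeytokens out.
import random

BENIGN_SITES = ["news.example", "weather.example", "mail.example",
                "sports.example", "shop.example"]

def honeytoken_sequence(k: int = 3, rng: random.Random = None) -> list:
    rng = rng or random.Random()
    sites = rng.sample(BENIGN_SITES, k)  # random subset, random order
    # Pair each decoy visit with a randomized dwell time in seconds.
    return [(site, round(rng.uniform(0.5, 5.0), 2)) for site in sites]

print(honeytoken_sequence(rng=random.Random(42)))
```

A static scheme would replay the same sequence every time, which is exactly what lets an adaptive spyware mine the pattern and ignore it; randomization removes that invariant.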

2.5 Summary

This chapter presents a detailed survey of mechanisms used in existing literature to overcome the security and privacy issues in e-mail infrastructure. It also outlines drawbacks and shortcomings of these solutions and highlights advances made by this dissertation in overcoming some of these shortcomings. The first problem studied in this dissertation, i.e., the problem of detecting phishing e-mails, is presented next.

Chapter 3

Detection of Phishing E-mails Based on Structural and Linguistic Features

“Those you trust the most can steal the most.”

− Lawrence Lief

3.1 Introduction

Phishing is a form of Web based attack where attackers employ deceit and social engineering to defraud users of their private and confidential information such as passwords, credit card numbers, social security numbers (SSN), bank account numbers, etc. As the Internet is becoming the de facto medium for online banking and trade, phishing attacks are gaining notoriety, especially among hacker communities. Anonymity over the Internet, coupled with incentives for large financial gains, serves as a strong motivation for attackers to perpetrate such seemingly low risk, yet high return scams. The first recorded mention of phishing attacks was in AOL forums [122], wherein attackers posing as system administrators tried to trick registered users into disclosing their account information. Since then, phishing attacks growing in sophistication and ingenuity have affected millions of users, causing heavy monetary losses to institutions and individuals alike. The success of a phishing attack largely depends on the phisher's ability to mimic a legitimate organization's Websites and e-mails in a manner that naïve users cannot identify as fake. Owing to their popularity and

widespread adoption, e-mails serve as the favorite vehicle to launch phishing attacks. Typically, in an e-mail based phishing attack, phishers dispatch large volumes of spoofed e-mails containing embedded URLs that redirect potential victims to fake Websites to trick them into disclosing their confidential information. The steps involved in e-mail based phishing attacks are shown in Figure 3.1.

Figure 3.1: Steps involved in e-mail based phishing attack

Recently, significant research efforts have been undertaken to detect and thwart phishing attacks. Most of these defense mechanisms are implemented as browser add-ons and third-party toolbars that examine every URL visited by the user to verify its authenticity. Even though these toolbars have enjoyed initial success in protecting users from divulging confidential information into fraudulent Websites, they suffer from several disadvantages. First of all, these solutions need to be locally installed on every single computer, making them less scalable. Second, most of these toolbars adopt decisions based on blacklists (or whitelists) that are propagated to them by centralized servers. However, as phishing Websites are ephemeral, there is a high chance that the information about a suspect Website does not reach clients in time. Lastly, as these toolbars operate on Websites, they allow users to get closer to the actual attack, thus giving little leeway for misclassification errors. A recent study on 10 popular anti-phishing toolbars conducted by Zhang et al. [132] showed that the toolbars, when tested on live phishing data, exhibit poor performance, having an overall accuracy of less than 60%.

An orthogonal approach to detect phishing attacks relies on filtering out spoofed e-mails even before they reach the user's mailbox. As e-mail filters can be installed directly on the incoming mail server, they are more scalable and robust than the browser-based solutions. However, existing spam filters are mostly equipped to handle unsolicited e-mail messages, and are not effective against phishing e-mails that bear a striking resemblance to their legitimate counterparts. As phishing e-mails are composed in bulk, they lack any information in their content that can relate the e-mail addresses with the account holders' personal information. On the contrary, e-mails from legitimate institutions carry user specific information that is not known publicly (such as transaction identifiers (tids), abbreviated versions of account numbers, full name, date-of-birth, address, etc.). This information, in turn, can be put to use as a security certificate to validate the sending domain's legitimacy. For this purpose, this chapter proposes a customizable and usable spam filter that can detect phishing e-mails based on the user specific information contained in them. Even though this filter acts as a first line of defense, as it is a rule based system, it can be easily evaded by phishers. To address this limitation, a few specialized efforts have been undertaken which attempt to classify phishing e-mails based on a variety of intrinsic features contained in them; these features include, but are not limited to, the content type of the message (plain text/HTML), the nature of the contained URLs (dotted IP/encoded format), the credibility of the referred domains, and words that frequently appear in the e-mail content. Once an accurate characterization of phishing e-mails is obtained, every incoming e-mail can be analyzed to extract the features that are common to phishing e-mails.
The extracted features are then fed as input to supervised machine learning algorithms that build discrimination models to tell phishing and ham e-mails apart. Since the performance of the classification algorithms is dependent on the choice of the underlying features, it is vital that only those features that accurately characterize phishing e-mails are employed. As phishing e-mail detection is a binary classification problem, ideally the chosen feature set must satisfy the following two properties: (i) inter-class exclusivity and (ii) intra-class completeness. Inter-class exclusivity ensures that the chosen features predominantly appear

in phishing e-mails and not in legitimate e-mails. On the other hand, intra-class completeness ensures that the chosen features encompass all the invariant characteristics that constitute a phishing e-mail. While inter-class exclusivity focuses on decreasing the false positive rates, intra-class completeness improves the detection rates. However, the features used in current machine learning approaches fail to address these properties. For example, existing bag-of-words classifiers that attempt to classify phishing e-mails perform poorly because frequent words, such as bank, credit card, account, etc., that appear in phishing e-mails also appear in their legitimate counterparts. Moreover, legitimate e-mails are composed in HTML and have similar structural traits when compared with phishing e-mails, thereby making the classification hard. As a result of poor selection of input features, these approaches suffer from degraded performance when tested against legitimate financial e-mails. The link common to all e-mail based phishing attacks is that they are more or less a form of social engineering attack. Even though they have relied on other attack vectors such as browser and DNS vulnerabilities to trick users, they predominantly prey on the users' gullibility. The content of a phishing e-mail imposes a sense of urgency and threat (account suspension), or lures and cajoles (reward for completing a survey), making its victims act according to the falsified instructions provided in the e-mail content. In order to address the root of the problem, this chapter proposes a novel methodology to classify phishing e-mails based on the underlying “tone” of the e-mail message. For this purpose, stylometric1 and structural features that constitute the linguistic composition of the e-mail are first extracted. These features are then passed as input into three popular machine learning algorithms, viz.
(i) naïve Bayes classifier; (ii) support vector machines (SVM); and (iii) decision trees. Experimental results on live phishing and ham e-mail datasets show that the use of linguistic features increases the detection rates with negligible false positive rates.

1Stylometry is a statistical technique that analyzes common linguistic patterns in text to determine its author(s). Typically, features such as average sentence length, average word length, number of word n-grams, Part-Of-Speech (POS) statistics, etc., are used as the basis for classification.
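The two simplest stylometric measures named in the footnote can be computed directly; the sample text and regular expressions below are illustrative, and a full feature set (POS statistics, n-grams) would require an NLP toolkit.

```python
# Compute average sentence length (in words) and average word length,
# two basic stylometric features.
import re

def stylometry(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
    }

sample = "Your account is at risk. Verify your details immediately!"
print(stylometry(sample))  # 9 words over 2 sentences
```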

3.1.1 Contributions

The main contributions of this chapter are as follows:

• Customizable and usable spam filter to detect phishing e-mails. First, the effectiveness of five existing anti-spam efforts in detecting phishing e-mails is studied. This serves as a strong motivational factor to develop sophisticated solutions that focus entirely on detecting phishing e-mails. Then, a customizable and usable spam filter is proposed that can act as a first line of defense to separate ham and phishing e-mails based on the user specific information contained in the e-mail content. Finally, to validate the feasibility of building such a filter, a survey of the user specific information contained in e-mails from the top 20 most phished brands is conducted. This filter has been implemented as an add-on to Microsoft Outlook – a popular Microsoft Windows based e-mail client. Preliminary results show that it is effective in detecting present day phishing e-mails.

• Supervised machine learning algorithms to detect phishing e-mails based on linguistic features present in them. Using 12 different linguistic features in conjunction with structural and textual features, discriminator models are built using three popular supervised machine learning algorithms. These discriminator models are tested against three different phishing and ham datasets. Experimental results show that these models can detect phishing e-mails with over 90% accuracy and less than 2% false positive rates.

• Context sensitive warning generation system to educate users about the working of phishing attacks. Since the linguistic features rely on capturing the underlying tone of the e-mail message, assigning a semantic context to the tone and communicating it as a warning message back to the user helps educate users about the modus operandi of phishing attacks. As it is the users who take the ultimate decision of giving out their sensitive information, such a means of education would allow them to be conscious of their actions.
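The rule at the heart of the first contribution can be sketched as follows. This is a simplified illustration with assumed brand keywords and user-profile fields, not the filter's actual implementation:

```python
# CUSP-style rule (sketch): flag a message that looks financial but
# carries none of the user-specific information the real institution
# would include.
def cusp_flag(body: str, user_info: dict) -> bool:
    """True if the e-mail looks financial but carries no personalization."""
    text = body.lower()
    financial = any(w in text for w in ("account", "bank", "card"))
    personalized = any(str(v).lower() in text for v in user_info.values())
    return financial and not personalized

# Hypothetical user profile configured by the filter's owner.
user_info = {"full_name": "Jane Doe", "acct_last4": "4821"}

phish = "Dear customer, your bank account will be suspended."
legit = "Dear Jane Doe, your account ending 4821 has a new statement."
print(cusp_flag(phish, user_info))  # True: financial, no personalization
print(cusp_flag(legit, user_info))  # False: contains user-specific data
```

This also makes the limitation stated earlier concrete: a phisher who learns any item of the user's profile can defeat a purely rule-based check, which motivates the machine learning classifier of the second contribution.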

3.1.2 Chapter Organization

This chapter is organized as follows: Section 3.2 discusses the attack vectors commonly adopted by phishers to launch phishing attacks. Section 3.3 provides insight on the limitations of present day spam filters in detecting phishing attacks. Section 3.4 focuses on building a customizable and usable spam filter that can be used to detect phishing e-mails. The anatomy of phishing e-mails is discussed in Section 3.5 to identify invariant features that are unique to phishing e-mails. Section 3.6 focuses on detecting phishing e-mails based on their linguistic and structural properties. The details on the generation of context sensitive warnings are given in Section 3.7. Lastly, concluding remarks are provided in Section 3.8.

3.2 Commonly Adopted Phishing Attack Vectors

Attack vectors are means by which attackers can gain illegal access to system or user information by exploiting either system vulnerabilities or the ‘human element.’ In order for users to fall prey to phishing attacks, phishers employ a series of attack vectors that redirect them into fake Websites. In this section, these attack vectors are reviewed in detail to get a better understanding of phishing attacks. URL and Host Name Obfuscation Attacks – Phishing attacks use spoofing techniques to conceal fake URLs such that an unsophisticated user cannot distinguish them from legitimate ones. One trivial deception method is to register a fake domain that is a misspelled variant of the legitimate one. For example, a phisher can host the phishing site at www.paypai.com to forge www.paypal.com (replacing the lower case ‘l’ with ‘i’). Third-party services exist that shorten URLs so that they are compatible with existing e-mail and Web application systems. Such shortened URLs lose their identity and are represented as if hosted within their provider’s domain. For example, www.google.com/accounts and http://moneycentral.msn.com/banking/accounts can be mapped to http://tinyurl.com/8ydws and http://tinyurl.com/2lmgd2 respectively by using the obfuscation service available through http://tinyurl.com. Another form of visual

deception can be brought about by replacing ASCII characters with special encoded characters that use DWORD, HEX, or UTF-8 encoding [60]. Browser Vulnerabilities – Browsers, in their quest to support extended features and functionalities, provide hooks to install unauthorized third party plug-ins and add-ons. As these add-ons usually operate with the same security privilege as the browser, attackers can essentially exploit vulnerabilities present in them to hijack the browser. With new vulnerabilities being discovered and patches released, it is extremely difficult for a naïve user to constantly update and protect against the attacks. Poorly configured browsers installed with third party plug-ins are susceptible to homographic attacks like International Domain Name (IDN) spoofing and pop-up hijacking attacks. Also, vulnerabilities in ActiveX controls and browser helper objects (BHO) can install trojans which can modify the system’s host file to redirect users to a fake Web site. Disabling features like ActiveX controls, unauthorized third party plug-ins, and IDN support is often viewed as a trade-off between extended functionality and security. Cross-site Scripting (XSS) and Session Hijacking Attacks – Phishers exploit security loopholes in Web applications and Web server software to make users unknowingly execute malicious scripts. These scripts are usually embedded through encoded characters in the URL for the purpose of redirecting users to a malicious server. For example, a user might click on www.legit.com/account?URL=www.fakebank.com assuming it to be a part of the legitimate bank itself. Here the user is first directed to the legit.com Web site. But due to coding flaws, the account page accepts arbitrary URLs and redirects control over to www.fakebank.com. Also, by installing packet sniffers and extracting session IDs from the server side, it is possible for phishers to hijack the user’s current session.
Pharming and Host Redirection Attacks – Pharming is a domain redirection attack wherein attackers modify the victim’s Domain Name Service (DNS) infrastructure so that users are redirected to a fake Web site even when the legitimate URLs are keyed in correctly. Exploitation of vulnerabilities in DNS software, or surreptitious modification of host files in victims’ computers and routers, can be used to achieve pharming attacks. Phishing attacks that use domain redirection are

38 difficult to detect as they do not depend upon obfuscation techniques that trick the users into visit- ing the fake Web site. Botnets and Malware based Phishing Attacks – Botnets are a collection of compromised ma- chines that act under the attackers’ command-and-control. These machines are then used to send out spam and host phishing Websites through the dynamic DNS service. As these botnets do not directly root back to an attacker, they can be used to surreptitiously send out phishing e-mails and launch fake Webpages. Malware, other than recording the users’ keystrokes and input data, can capture the users’ browsing history to send out targeted phishing attacks known as ’spear phishing’. Even though redirection of users into fake Websites can be achieved by involuntary means without their explicit consent, i.e., by exploiting browser vulnerabilities or through compromise of DNS server or local DNS resolver, due to the ease of execution and high hit rate, phishers rely on social engineering attacks that often exploit the “human element”, also regarded as the weakest link in the security chain. Therefore, in remainder of this chapter the main focus is on building effective and scalable solutions that add an additional layer of security to prevent the users from falling prey to these social engineering based phishing attacks.

3.3 Why Existing Anti-spam Based Approaches Fail? A Case Study

Traditionally, anti-spam mechanisms were used to address phishing attacks. Although phishing e-mails can be regarded as unsolicited junk, they do not share the same characteristics as spam e-mails and thereby require specialized filters for classification. As most present-day users rely on either Web based or desktop based clients to read e-mails, it is imperative that these clients offer some form of specialized anti-phishing support. Web based clients have customized algorithms that blacklist suspect e-mails based on their origin and the information presented in their content. However, the actual algorithm and feature set used for classification are not disclosed publicly for fear of reverse engineering. On the other hand, desktop based clients are built in with a set of generic spam filters (rules) to separate unsolicited e-mails from legitimate ones. Until recently, these desktop clients focused solely on identifying spam e-mails that exhibit a different set of features than normal e-mails. Since phishing e-mails bear close resemblance to legitimate e-mails, they are usually not stopped by these filters. To tackle this problem, a set of ad hoc rules was incorporated into these clients specifically to target phishing e-mails. As these rules are loosely formed, they suffer from large false positive rates. As an alternative, users are provided with options to tweak the existing rules to suit their needs. For example, a user could tag all e-mails sent from banks that he is not enrolled with as phishing. Also, filters could be used to reject e-mails from banks that do not carry the user's personal information, such as full name, last four digits of the account number, etc. Unfortunately, as the burden of implementing the rules is delegated to the users, even such personalized rules are not completely foolproof.

3.3.1 Performance of Current Anti-spam Filters in Tackling Phishing Attacks

In this section, the effectiveness of five popular desktop based spam filters in detecting phishing e-mails is studied to emphasize the need for specialized filters that concentrate exclusively on detecting phishing e-mails. For the purpose of testing, the filters were run against 577 ham e-mails sent by legitimate financial institutions. This dataset comprises e-mails from companies such as HSBC, Amazon, eBay, American Express, Citibank, ICICI Bank, Google Checkout, etc. The actual composition of this dataset is given in Figure 3.2. The testing done here primarily focuses on bringing out the large false positive rates of present-day anti-spam filters, which render these filters unusable for detecting phishing e-mails. Figure 3.3 presents the performance of these filters in terms of their false positive rates. Mozilla Thunderbird identifies phishing e-mails based on two key criteria, viz. (i) presence of IP based URLs; and (ii) discrepancy between visible and hidden URLs. For example, 95% of e-mails from Citibank contain different visible and hidden URLs and are misclassified as 'phishing' by

Figure 3.2: Breakup summary of various financial institutions contributing towards the 577 hard ham dataset.

Figure 3.3: False positive rates of popular anti-spam filters while detecting phishing e-mails. The filters were tested against 577 hard bank e-mails.

Mozilla Thunderbird. Vipul Razor [98] is a hash based collaborative spam filtering technique supported by a large number of text based e-mail clients. For an e-mail to be tagged as spam, its hash value needs to match the hash value of a known spam e-mail stored in a distributed database. This distributed database is constantly updated by acquiring spam messages from various spam sources. Even though Vipul Razor exhibits a 0% false positive rate, it also performs poorly in detecting actual phishing e-mails: it could detect only 5 of the 414 phishing e-mails obtained from a popular phishing corpus [91]. SpamAssassin is another popular tool used by both MUAs and MTAs to detect spam e-mails. Unlike rule based filters, SpamAssassin employs a set of heuristics based on collaborative filtering, whitelists and Bayesian learning to detect unsolicited e-mails. In order to test SpamAssassin's false positive rate, an approach similar to that in [49] was adopted. SpamAssassin was used in untrained mode, and e-mails with an sa_score (spam score) greater than 5 were tagged as phishing. In this case, slightly over 35% of the ham e-mails were inaccurately classified as phishing. As most conventional anti-spam filters employ a Bayesian classifier for detection, these 577 ham e-mails were also tested against a naïve Bayesian classifier. The classifier was trained on 1-gram words from a total of 6950 ham e-mails

obtained from the SpamAssassin dataset and a total of 4560 phishing e-mails in the phishing corpus. For detailed theory on the naïve Bayesian classifier, refer to Section 3.6.2. The learned Bayesian model was then tested against the 577 ham e-mails because of their similarity in appearance to phishing e-mails. However, most of these ham e-mails were incorrectly classified as phishing. Precisely speaking, only 3 of the 577 were correctly recognized as ham, making these classifiers a poor choice for filtering out phishing e-mails. Lastly, the performance of Microsoft Outlook Express, a popular desktop based e-mail client that comes free with Microsoft Windows XP, in countering phishing attacks is studied. Despite its popularity, it does not contain any filters to weed out unwanted e-mails, thus offering no resistance against even the simplest of phishing attacks.

On the whole, as these filters exhibit poor performance, they cannot be adopted in their current state for phishing e-mail detection. Furthermore, these filters do not provide any customized warnings that offer users insight into the working of phishing attacks. To address these limitations, subsequent sections focus on devising efficient solutions that not only have high detection rates, but also negligible false positive rates, making them extremely usable.

3.4 CUSP: Customizable and Usable Spam Filters to Detect Phishing E-Mails

As mentioned earlier, the success of a phishing attack depends on the users' inability to accurately tell legitimate and spoofed e-mails apart. Even though phishing e-mails closely resemble their legitimate counterparts, as they are composed in bulk, they lack any information in their content that can relate the e-mail addresses with the account holder's personal information. On the contrary, as e-mails from legitimate institutions carry user specific information that is not known publicly (such as transaction identifiers (tids), an abbreviated version of the account number, full name, date-of-birth, address, etc.), this information can, in turn, be put to use as a security certificate to validate the sending domain's legitimacy. In this section, the Customizable and Usable SPam (CUSP) filter is proposed, which strives to

detect fraudulent e-mails based on the private user specific information present only in legitimate e-mails. The idea behind CUSP is quite simple. CUSP allows users to store private data on a per organization/account basis. Subsequently, every incoming e-mail that purports to originate from a stored organization is examined to see if it contains the user stored private data. If there is a mismatch (or if the private data is absent), the e-mail is deemed suspicious. The notion of verifying the sender's domain for detecting spoofed e-mails is not new; there exist mechanisms like SPF, Sender ID and DKIM that validate the sending domain using IP addresses or digital signatures. Although these mechanisms can vouch for the sending domain's reputation, they still fail to stop users from falling prey to phishing attacks. Furthermore, unlike CUSP, these mechanisms are heavyweight – each of them adopts a different protocol that requires changes to the existing e-mail infrastructure.

In order to demonstrate the feasibility of CUSP, a brief survey of the user specific data contained in e-mails from legitimate institutions is presented. This helps identify the private data that needs to be stored in CUSP so that accurate prediction of phishing e-mails is possible. For this survey, the top 20 most phished brands in 2007, as reported by Phishtank, are considered. Phishtank, a collaborative undertaking of academia and industry, operates by assimilating and publicizing phishing e-mail feeds, which are then verified by interested subscribers. Out of these 20 brands, 17 are online banks and credit card institutions. The remaining three are popular Internet portals that support e-business. A summary of the findings is presented in Table 3.1. All 20 brands claim that they do not send e-mails to customers requesting their personal credentials.
Furthermore, the banks' Web sites clearly state that any e-mail carrying such information on their behalf is fraudulent. A majority of the banks also claim to send out personalized e-mails to their customers (i.e., containing information such as their last/full name, the last four digits of their account number, and occasionally their home address). However, mixed responses were obtained when they were asked whether such data can be used for validation purposes. While most of the banks advised customers to use this data as one of the "visual indicators" to identify spoofed e-mails, one bank cautioned otherwise, citing "spear phishing" as an example. It is important to

note that even though it may be possible for an attacker to launch targeted phishing attacks (spear phishing) by using the recipients' private data obtained through other means, such attacks are usually rare due to the difficulty involved. A recent study involving real human subjects shows that users place implied trust in personalized e-mails [65]. Although the underlying intention was correct, the subjects were not able to clearly distinguish whether the personalized data was actually private data (i.e., not publicly known). For example, the subjects incorrectly trusted e-mails that contained the first four digits of the credit card number, even though the first four digits are not random and depend on the card issuer.
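The per-organization check that CUSP performs – tag an e-mail as suspicious when the stored private data for the purported sender is absent – can be illustrated with the following minimal sketch. This is not the dissertation's actual implementation (CUSP is a C# Outlook plug-in); the function and variable names, the SHA-256 hashing, and the word-level matching are all illustrative assumptions.

```python
import hashlib
import re

def _h(s):
    # CUSP hashes user specific data rather than storing it in the clear;
    # SHA-256 is an assumed choice here.
    return hashlib.sha256(s.lower().encode()).hexdigest()

def build_store(org_tokens):
    """org_tokens: {organization: [private strings its e-mails should carry]}"""
    return {org: {_h(t) for t in toks} for org, toks in org_tokens.items()}

def classify(org, body, store):
    """Tag an e-mail that purports to come from `org`."""
    if org not in store:
        return "unknown"                      # institution not enrolled in CUSP
    tokens = {_h(w) for w in re.findall(r"[\w-]+", body.lower())}
    # "safe" only if at least one stored private token appears in the body
    return "safe" if store[org] & tokens else "phishing"

store = build_store({"citibank": ["x-1234", "alice"]})
print(classify("citibank", "Dear Alice, your card ending x-1234 ...", store))
print(classify("citibank", "Dear customer, verify your account now", store))
```

A real deployment would of course normalize HTML bodies and match multi-word tokens, but the mismatch-means-suspicious decision rule is the same.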

3.4.1 Overview of CUSP

In this section, the detailed working of CUSP is presented along with how it can be used to detect fraudulent e-mails from known institutions. CUSP is built as a plug-in for Microsoft Outlook in C# using Visual Studio Tools for Office 2003 (VSTO). CUSP attaches itself to the e-mail client, and is bootstrapped with a list of popular institutions that are prone to phishing attacks. At the time of installation, if a user is subscribed with any of the preloaded institutions contained in CUSP, then he is required to specify the corresponding user specific data that is to be included in legitimate e-mails from them. Figure 3.4 shows a form in CUSP requesting such information from the user. If an institution of the user's choice is not present in CUSP, then he can add a custom tag to include it. Similarly, as there is no common consensus among institutions on what user specific data is to be included in their e-mails, users are also provided with options to add/modify the existing tags representing different fields such as address, product key, date-of-birth, etc. It might be difficult for a naïve user to figure out beforehand what data needs to be included in CUSP for a given institution. Ideally, such information should be updated by the software provider, as opposed to the user. The user specific data is hashed and stored in CUSP similar to the way values for auto-complete fields are stored in a browser. Any e-mail that purports to originate from the preloaded organizations is examined to see

Table 3.1: A survey of the top 20 phished brands' security policy. All the listed companies indicate that they do not send e-mails requesting confidential information. Token identifiers indicate what user specific data is included in the companies' e-mails to customers (FN - Full Name, UID - Username, LF - Last four digits of Account Number, ADR - Address, NA - No private data, U - Unverified/Not known.) Source indicates where the information about the companies' security policy was obtained (W - Website, S - Sample e-mail, P - Phone Conversation.)

Name of the Bank | Token Identifier | Sends E-Mail Requesting Private Information | Source | URL
PayPal | FN | No | W, P | http://www.paypal.com/cgi-bin/webscr?cmd=p/gen/fraud-prevention-outside
eBay | FN, UID | No | S, W | http://pages.ebay.com/education/spooftutorial/
Barclays Bank | FN, ADR | No | W | http://www.personal.barclays.co.uk/BRC1/jsp/brccontrol?task=homefreevi2&value=9117&target=_blank&site=pfs
Bank of America Corporation | U | No | W | https://www.bankofamerica.com/privacy/Control.do?body=privacysecur_e-mail_fraud
Fifth Third Bank | FN | No | W | https://www.53.com/wps/portal/privacy/?New_WCM_Context=/wps/wcm/connect/FifthThirdSite/Global+Utilities/Privacy%20%26%20Security/#
JP Morgan Chase and Co | FN, LF | No | S, W | http://www.chase.com/ccp/index.jsp?pg_name=ccpmapp/shared/assets/page/Report_Fraud#5
Wells Fargo | U | No | W | https://www.wellsfargo.com/privacy_security/fraud/report/fraud?_requestid=394409
Volksbanken Raiffeisenbanken | U | No | W | http://www.vr-networld.de/c132/default.html
Branch Banking and Trust Comp | FN, LF | No | W | http://www.bbt.com/bbt/about/privacyandsecurity/e-mailcommunication.html
Regions Bank | U | No | P | http://www.regions.com/about_regions/e-mail_fraud.rf
Wachovia | FN, LF | No | P | http://www.wachovia.com/securityplus/page/0,,10957_10970,00.html
HSBC Credit Card | FN, LF | No | S, W | http://www.us.hsbc.com/1/2/3/personal/inside/securitysite/your-responsibility
National City | U | No | W | http://www.nationalcity.com/about/privacy/identity/default.asp
Amazon.com | NA | No | S, W | http://www.amazon.com/gp/help/customer/display.html?nodeId=15835501
Poste Italiane | U | No | W | http://www.poste.it/online/phishing.shtm
Citibank | FN, LF | No | S, W | https://www.citicards.com/cards/wv/detail.do?screenID=607
US Bank | FN, LF | No | W | https://www4.usbank.com/internetBanking/en_us/info/BrowserRequirementsOut.jsp
Capital One | FN, LF | No | S | http://capitalone.com/fraud/prevention/phishing.php?linkid=WWW_Z_Z_Z_FRD_C1_01_T_FPRV1
HSBC Bank | FN, LF | No | S, W | http://www.us.hsbc.com/1/2/3/personal/inside/securitysite/your-responsibility
Western Union | U | No | W | http://www.westernunion.com/info/fraudProtectYourself.asp

Figure 3.4: CUSP requesting the user to enter private data corresponding to the subscribed institution.

if it contains the relevant user specific data. If it does, then the e-mail is tagged "safe" and sent to the user's mailbox. On the other hand, if the user specific data is missing or is incorrect, the e-mail is tagged as "phishing." If a user fails to enter the required information at the time of installation, a dialog box is prompted asking for the relevant information, as shown in Figure 3.5. The user also has an option of marking the e-mail message as not being from a financial institution; using this option indiscriminately exposes the user to the risk of falling prey to a phishing attack. For an e-mail to be considered as a likely candidate for analysis, its TF-IDF (Term frequency - Inverse document frequency) score should be closely associated with e-mails from financial institutions.

Figure 3.5: A modal box interrupting the user to enter the required user specific data before opening the e-mail.

The TF-IDF score indicates how important a particular term (or word) is to a document in a corpus. TF-IDF is computed as TF-IDF = TF · log(N/DF), where TF indicates the number of times a particular term appears in a document, N indicates the total number of documents, and DF indicates the number of documents in which the particular term appears. In other words, TF captures how important a particular term is to a document, whereas the inverse document frequency log(N/DF) captures how rare the term is across all documents. Hence, terms that occur rarely across all documents, but frequently in phishing documents, have a high TF-IDF score.

Performance of CUSP

To evaluate the performance of CUSP, a publicly available phishing corpus [91], which contains 434 phishing messages collected over a period of five months, is considered. These e-mails were passed through the preprocessing engine described before to eliminate ill-formed e-mails that were not composed in English. Also, for the sake of brevity, messages with a significant amount of spam (junk words) were discarded. The final list thus formed is reduced to a total of 362 phishing e-mails. Almost all of the e-mails did not contain any (even random) user-specific data, barring a few exceptions. A small fraction of e-mails (< 2%) that impersonated eBay had the user's full name along with the corresponding user id. This is because eBay lists the user id in user listings along with their e-mail information, making it easier for phishers to harvest them. Also, some of the e-mails had fake case numbers to make them appear as though sent by the institution's security department. Moreover, our tool exhibited a 0% false positive rate, as none of the e-mails that contain correct user specific private data were tagged 'phishing.'

However, there are three main limitations to CUSP: (i) First, as of now, the list of institutions that are vulnerable to phishing attacks needs to be directly hard-coded in CUSP. Even though users are provided with an option to add their own custom tags, to make it more scalable, it is essential that these tags are managed remotely by a centralized system so that all users can benefit from them; (ii) Second, as CUSP operates only on text messages, it is still possible for a phisher to evade detection by encoding spoofed e-mails as images. To address such cases, explicit warnings are provided to users instructing them not to give away confidential information in response to such e-mails; and (iii) Lastly, only e-mails that seem to originate from financial institutions are examined. CUSP can be easily evaded by carefully fabricated phishing e-mails having a different TF-IDF score from financial e-mails.

In order to overcome the limitations of CUSP, it is important that defense solutions focus on many other features that are pertinent to phishing e-mails, and are not limited to just private data. Moreover, to be effective, it is essential that these solutions are not rule based, but are based on machine learning algorithms that build discriminator models from the training data.
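The TF-IDF weighting that CUSP uses to decide whether an e-mail is a likely candidate for analysis (and that later serves as a ranking metric for textual features) can be sketched in a few lines. The function name and the toy corpus below are illustrative, not from the dissertation's implementation.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`, where `corpus` is a list of token lists."""
    tf = Counter(doc)[term]                      # occurrences of term in doc
    df = sum(1 for d in corpus if term in d)     # documents containing term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)       # TF . log(N / DF)

docs = [["verify", "account", "password"],
        ["meeting", "agenda", "notes"],
        ["account", "statement"]]
print(tf_idf("verify", docs[0], docs))   # in 1 of 3 docs -> 1 * log(3)
```

Terms like "verify" that appear in only one document score log(3), while "account", appearing in two of three documents, scores the lower log(3/2) – matching the intuition that rare-but-locally-frequent terms are the most discriminative.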

3.5 Anatomy of Phishing E-Mails

In order to devise defense solutions that can detect phishing e-mails, it is important to chart out invariant properties that are present in most, if not all, phishing e-mails. These invariant properties are mostly visual deceptive agents employed by the phisher to trick users. Identifying these invariant properties also helps in building a proper feature set for classification that is accurate and less prone to false positives. Figure 3.6 illustrates these invariant features as embodied in a PayPal phishing e-mail.

Spoofing of Online Banks and Retailers – Phishing e-mails closely imitate online banks and retailers to gain the trust of users. The companies spoofed most often are Citibank, eBay, and PayPal. For a more comprehensive list, refer to Table 3.1. The most targeted industries are financial services, Internet retailers and Internet Service Providers. Phishers adapt quickly and also target organizations such as the Internal Revenue Service (IRS) and charity organizations that are not well safeguarded. Usually, the company's image and links referring to the company's Web site are spoofed in the fake e-mail to deceive its customers.

Non-Matching URLs – In spoofed e-mail messages, the link text seen in the e-mail is usually different from the actual link destination. In the following example, though the link text in the e-mail reads http://www.chase.com, it discreetly redirects the user to http://www.climagro.com.ar/agro/chase.htm instead of the actual referred Web site http://www.chase.com.

Age of Referred Domains – Most phishing Web sites are hosted using free Web hosting services or on compromised machines running a dynamic DNS service. In general, these fake sites are short-lived; they are detected automatically by monitoring bots or taken down as a result of users' complaints. Hence, the uptime of most phishing domains is low when compared to their legitimate counterparts. A WHOIS query on a domain gives the date at which it was registered along with its location information. The date and location can then be used to determine whether it is a phishing site or not. For example, Fette et al. [49] mark domains registered in the past 60 days as phishing.

Using IP Addresses instead of URLs – Frequently, phishers attempt to conceal the destination Web site by scrambling the URLs so that it is hard for normal users to tell that it is fake. One method of concealing the destination is to use the IP address of the Web site, rather than the hostname. An example of an IP address used in a fraudulent e-mail message's URL is http://210.14.228.66/sr/.

Figure 3.6: Anatomy of a spoofed PayPal phishing e-mail illustrating various intrinsic features used to deceive users.

Generalization in Addressing Recipients – As the success of e-mail based phishing attacks depends on their reachability to a vast number of recipients, most phishing e-mails do not contain personalized content while addressing their potential victims. Unlike legitimate business communication, they do not address customers using their names as identifiers, and lack embedded scrambled information, such as the 'last four digits of account information', which is used to establish authenticity. Although it might be possible for a phisher to include this information by employing social engineering and other malpractices, the success rate of such attacks is limited.

Altering the Tone of the Message Body – As the objective of phishing e-mails is to trick users into divulging their confidential information, phishers modify the tone and underlying meaning of the message body to (i) invoke a sense of false urgency – a user may be instructed to revalidate his account information on the masqueraded Web site within a 24 hour period; (ii) invoke a sense of threat – phishing messages may threaten users into divulging their confidential information to prevent account revocation; (iii) invoke a sense of concern – in their e-mails, phishers may make false security claims, such as that a weak password must be changed, to trick users into changing their passwords on the fake Web site; and (iv) invoke a sense of opportunity/reward – phishers might lure victims into revealing their information as part of a survey that credits money to their accounts.
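The domain-age heuristic mentioned above (Fette et al. flag domains registered in the past 60 days) reduces to a simple date comparison once the registration date is known. Obtaining that date – e.g., via a WHOIS lookup – is assumed to happen out of band; the function name and dates below are illustrative.

```python
from datetime import datetime, timedelta

def looks_newly_registered(registered_on, now, threshold_days=60):
    """Flag a domain whose WHOIS registration date is within the threshold."""
    return (now - registered_on) < timedelta(days=threshold_days)

now = datetime(2009, 6, 1)
print(looks_newly_registered(datetime(2009, 5, 20), now))   # 12 days old
print(looks_newly_registered(datetime(2007, 1, 15), now))   # well-aged domain
```

Passing `now` explicitly keeps the check deterministic and testable; a deployed filter would substitute the current time and a live WHOIS client.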

3.6 Using Machine Learning Algorithms to Classify Phishing E-mails

In order to classify phishing e-mails using machine learning algorithms, it is essential to identify beforehand the set of features used for classification. Although phishing e-mails have used browser based and DNS vulnerabilities to trick users, they are primarily a social engineering attack. The content of phishing e-mails uses an implied sense of urgency and threat (account suspension) or lure and cajolery (a reward for completing a survey). If the sense of threat/lure is conveyed in a sufficiently convincing manner, a naïve user can fall prey to the attack. This social engineering tactic is the defining characteristic of phishing e-mails. In this section, a novel methodology is proposed to detect phishing e-mails using a combination of the previously reported features of phishing e-mails along with a linguistic analysis of the e-mail content in order to detect the 'tone' or the implied message of the e-mail. Identification of this implied sense of threat/lure, or more generally, the 'tone' of the e-mail, serves as a critical factor towards not only classifying phishing e-mails, but also communicating an accurate account of the import of the e-mail to the end user. Consider the phishing e-mails that get past a standard phishing filter: if our framework can provide a meaningful communication to the user regarding the intentions of the e-mail's originator, it would not only be an effective methodology to defeat the attack, but would also educate users about the potential harmful effects, which, after all, is the key to defeating these attacks.

3.6.1 Features Used in Classifying Phishing E-Mails

There are three sets of features used to classify phishing e-mails, viz. (i) textual features; (ii) linguistic features; and (iii) structural features. The textual and linguistic features identify phishing e-mails based on their word composition and grammatical construction. On the other hand, structural features focus on identifying the presence of obvious cues in the e-mail body that implicate it as spoofed.

Textual Features

Textual features treat the words occurring in the e-mail body as independent tokens. These individual tokens are chosen using a ranking metric such as TF-IDF that selects words that occur often in phishing e-mails, but not so commonly in ham e-mails. A problem with individual (1-gram) tokens is that they suffer from large false positive rates, especially when tested against e-mails that closely resemble phishing e-mails in their word content. For example, 1-gram words such as bank, account, username, password, etc., are tagged as phishing because their probability of occurrence in ham e-mails is particularly low. However, such coarse generalization does not scale well, as 'ham' e-mails sent by legitimate organizations also contain similar keywords. To alleviate this problem, depending on their position of occurrence and proximity to each other, 1-gram tokens are grouped to form complex meta-tokens of size n words (n-gram tokens). As the probability of an n-gram token being part of both phishing and ham e-mails is low, n-grams decrease the overall effect of classification noise. Usually, due to the complexity involved in generating n-gram tokens with large values of n, most text mining implementations restrict the size of n to 3 or less.
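The grouping of 1-gram tokens into n-gram meta-tokens described above can be sketched as a simple sliding window over the token list (the function name and example phrase are illustrative):

```python
def ngrams(tokens, n):
    """All contiguous n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "please verify your account password".split()
print(ngrams(words, 1))  # individual 1-gram tokens
print(ngrams(words, 3))  # meta-tokens of size 3
```

A phrase-level token like ('verify', 'your', 'account') is far less likely than the bare word "account" to occur in legitimate financial e-mails, which is precisely why n-grams reduce classification noise.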

Linguistic Features

The use of linguistic features for classification is widely adopted in the field of NLP. Linguistic features capture the essence of the message conveyed in phishing e-mails, through which phishers lure victims into visiting fake Web sites. Even though the formatting of phishing e-mails is similar to genuine e-mails, it is the message conveyed in them, acting as a social-engineering catalyst, that separates them. Linguistic features enrich the semantic value of the text based features, thus increasing the classifier's accuracy. Moreover, linguistic features have been used in detecting deception in human conversations, which is a more general version of the phishing problem.

In the case of phishing attacks, the linguistic features are extracted from the part-of-speech (POS) and word composition statistics computed from the underlying e-mail body. For this purpose, the features composed by Zhou et al. [134], as used in [96], are directly adopted. These features have found application in other security areas such as: (i) combating Web spam [96]; and (ii) online monitoring of security events [18]. These linguistic features can be clustered into the following constructs:

Quantity – Quantity is used to identify parts of speech that represent the relevant "objects" of e-mails and the "actions" taken on the objects. In the context of phishing e-mails, objects are represented by noun words such as bank, account, password, username, etc., and actions are represented by verbs such as revoke, disclose, cancel, etc. Generally, objects and actions are encapsulated by the following three terms: (i) Ratio of nouns (Noun fraction); (ii) Ratio of verbs (Verb fraction); and (iii) Ratio of pronouns (Pronoun fraction). Words such as supply that can be classified as both a noun (N) and a verb (V) are given a separate tag NV.

Complexity – The complexity construct represents the syntactic features of the message body that are used to separate phishing e-mails from legitimate e-mails.
These syntactic features include:

(i) Average sentence length = total # of words / total # of sentences; (ii) Average word length = total # of characters / total # of words; and (iii) Pausality = total # of punctuation marks / total # of sentences.

Diversity – Diversity represents the vocabulary richness of e-mail messages. Diversity can be used to pan out the redundant content that is usually present in unsolicited e-mails. It is also used for tagging related e-mails together. Diversity is usually represented by (i) Lexical diversity = total # of different words / total # of words; and (ii) Content word diversity = total # of unique nouns and verbs / total # of nouns and verbs.

Expressivity – Expressivity denotes the emotiveness of the language. It is denoted by the following ratio: Expressivity = (total # of adjectives + total # of adverbs) / (total # of nouns + total # of verbs).

Non-immediacy – Non-immediacy refers to the level of verbal indirectness (voice) with which the e-mail is composed. Non-immediacy is given by (i) Passive voice = total # of passive verb constructs / total # of verbs; and (ii) Self-referencing = total # of 1st person pronouns / total # of pronouns.

Uncertainty – Uncertainty in sentences is captured by the presence of modal verbs. These verbs are associated with notions of possibility and necessity. For example, verbs such as 'could', 'may', and 'might' are construed as modal verbs. Hence, uncertainty in a passage of text is denoted as: Uncertainty = total # of modal verbs / total # of verbs.
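A subset of the ratios above can be computed directly from a POS-tagged token list. The sketch below assumes a simplified tag set (N, V, MODAL, ADJ, ADV, PRON) rather than any specific tagger's output, and sentence/punctuation counts are passed in precomputed; both are illustrative assumptions.

```python
def linguistic_features(tagged, sentences, punctuation):
    """tagged: list of (word, tag) pairs with an assumed simplified tag set."""
    n_words = len(tagged)
    counts = {}
    for _, tag in tagged:
        counts[tag] = counts.get(tag, 0) + 1
    nouns, verbs = counts.get("N", 0), counts.get("V", 0)
    modals = counts.get("MODAL", 0)
    return {
        "noun_fraction": nouns / n_words,
        "verb_fraction": verbs / n_words,
        "avg_sentence_length": n_words / sentences,
        "pausality": punctuation / sentences,
        "expressivity": (counts.get("ADJ", 0) + counts.get("ADV", 0))
                        / max(nouns + verbs, 1),
        # modals are tagged separately, so total verbs = V + MODAL here
        "uncertainty": modals / max(verbs + modals, 1),
    }

tagged = [("your", "PRON"), ("account", "N"), ("may", "MODAL"),
          ("be", "V"), ("suspended", "V"), ("immediately", "ADV")]
feats = linguistic_features(tagged, sentences=1, punctuation=1)
print(feats["uncertainty"])   # one modal out of three verbs
```

In practice the (word, tag) pairs would come from an off-the-shelf POS tagger run over the e-mail body.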

Structural Features

The structural features are binary features that encompass invariant characteristics commonly present in the layout of e-mail messages. Three popular structural features are considered here, viz. (i) Generalization in addressing recipients – a large fraction of phishing e-mails do not address recipients by their full name, but instead use a generalized term such as member, customer, etc.; (ii) Dotted IP URLs – most phishing attacks host spoofed Web sites referred to by dotted IP URLs as opposed to URLs with fully qualified names; and (iii) Difference in hidden and visible URLs – in order to trick users, phishers encode hidden URLs that differ from the visible URLs using HTML <a href="..."> ... </a> tags.
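The three binary structural cues above can be approximated with simple pattern matching. This is only a sketch: a real extractor would parse MIME and HTML properly, and the regexes and greeting word list are assumptions, not the dissertation's exact rules.

```python
import re

GREETING = re.compile(r"dear\s+(valued\s+)?(member|customer|user)", re.I)
DOTTED_IP_URL = re.compile(r"https?://\d{1,3}(\.\d{1,3}){3}")
ANCHOR = re.compile(r'<a\s+href="([^"]+)"[^>]*>([^<]+)</a>', re.I)

def structural_features(body):
    """Binary structural features of an e-mail body (HTML as plain string)."""
    mismatch = any(href.strip() != text.strip() and text.strip().startswith("http")
                   for href, text in ANCHOR.findall(body))
    return {
        "generic_greeting": bool(GREETING.search(body)),
        "dotted_ip_url": bool(DOTTED_IP_URL.search(body)),
        "hidden_visible_mismatch": mismatch,
    }

mail = ('Dear Customer, <a href="http://210.14.228.66/sr/">'
        'http://www.chase.com</a>')
print(structural_features(mail))
```

All three flags fire on the example above, which combines a generic greeting, a dotted IP destination, and a visible URL that differs from the hidden one.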

3.6.2 Detection Algorithms

Supervised machine learning algorithms are a popular choice for classifying phishing e-mails. In supervised learning, given labeled e-mails as training data (i.e., e-mails marked as either phishing or ham), the goal is to learn a classification function that can predict the class labels of unknown test data (i.e., tag incoming e-mails as phishing or ham). Formally speaking, given an input feature set X = {x1, x2, x3,..., xn}, where each input feature x denotes a property that is either present or absent in phishing e-mails, and an output class label set C = {c1, c2,..., cn}, where each c denotes a unique class label, the goal is to learn a function h : X → C so that h(x) is a good predictor for the corresponding value of c. Since phishing e-mail classification is a binary problem, there are only two class labels (c1 = phishing and c2 = ham). In this section, three popular supervised machine learning algorithms used in the classification of phishing e-mails are discussed.

Naïve Bayes Classifier – Due to its simplicity and ease of implementation, the naïve Bayes classifier has been widely used in the field of spam and phishing classification. The naïve Bayes classifier is a probabilistic classifier that operates based on Bayes theorem under the conditional independence assumption that every input feature is independent of the others. Given an instance consisting of a set of input features X = {x1, x2,..., xn}, and an output set of class labels C = {c1, c2,..., cn}, the naïve

Bayes classifier assigns the input instance a class label ci such that

Pr(ci|X) > Pr(cj|X)   ∀ j ≠ i   (3.1)

Since Bayes rule relates Pr(ci|X) to Pr(X|ci), equation 3.1 can be rewritten as,

Pr(c|X) = arg max_{c ∈ C} Pr(c) Pr(X|c)   (3.2)

With the strong independence assumption, Pr(X|c) reduces to Pr(X|c) = Pr(x1, x2,..., xn|c) = ∏_{i=1}^{n} Pr(xi|c). Therefore, equation 3.2 becomes,

Pr(c|X) = arg max_{c ∈ C} Pr(c) ∏_{i=1}^{n} Pr(xi|c)   (3.3)

To enable prediction, it is essential that the probabilities Pr(xi|c) and Pr(c) are available. Pr(xi|c) can be estimated from the training dataset as a maximum likelihood estimate (MLE). It is calculated as the fraction of the number of training instances in class c that contain the feature xi to the total number of training instances that contain xi across all classes. In other words,

Pr(xi|c) = nic / nc   (3.4)

where nic denotes the total number of training instances in class c containing feature xi and nc denotes the total number of training instances that contain xi.

In cases where nic = 0, the MLE suffers as Pr(xi|c) becomes zero. In order to avoid this, Laplacian (add-one) smoothing is used, which avoids zero probabilities by adding one to the frequency count in the numerator and |C| to the denominator for each class.

Pr(xi|c) = (nic + 1) / (nc + |C|)   (3.5)
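Equations 3.3 through 3.5 can be compressed into a minimal Bernoulli naïve Bayes sketch. Note that the smoothed estimate below uses the standard Bernoulli denominator Nc + 2 (documents in class c, plus one pseudo-count per feature value) rather than the nc + |C| form above; the class and feature names in the usage are illustrative.

```python
import math
from collections import Counter

# A minimal Bernoulli naive Bayes sketch with Laplacian smoothing.
# Documents are represented as sets of present feature ids.
class BernoulliNB:
    def fit(self, docs, labels):
        """docs: list of sets of feature ids; labels: parallel class labels."""
        self.classes = sorted(set(labels))
        self.vocab = set().union(*docs)
        self.n_c = Counter(labels)                        # documents per class
        self.n_ic = {c: Counter() for c in self.classes}  # feature counts per class
        for doc, c in zip(docs, labels):
            self.n_ic[c].update(doc)
        self.total = len(docs)
        return self

    def _log_prob(self, doc, c):
        lp = math.log(self.n_c[c] / self.total)           # log prior
        for f in self.vocab:
            # smoothed P(f = 1 | c); absent features contribute 1 - p
            p = (self.n_ic[c][f] + 1) / (self.n_c[c] + 2)
            lp += math.log(p if f in doc else 1 - p)
        return lp

    def predict(self, doc):
        return max(self.classes, key=lambda c: self._log_prob(doc, c))
```

Working in log space avoids underflow when the product in equation 3.3 runs over many features.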

For large values of nic and nc, Laplacian smoothing converges to the MLE. It is important to note that each xi can take only discrete values, either 0 or 1, indicating the presence or absence of the corresponding feature. Continuous-valued variables need to be discretized first. Discretization can be done in an unsupervised fashion, using an ad hoc selection of bins, or in a supervised fashion, with binning guided by label information in the training data. Another important factor is that, since phishing e-mail classification is a binary problem, the cardinality of the set C is two, and C = {±1}, where +1 denotes that the e-mail is phishing and −1 denotes that the e-mail is ham.

Support Vector Machines – Support Vector Machines (SVMs) are well suited for binary classification problems. They have been applied with great success in tasks such as text classification, object recognition, and anomaly detection. SVMs were developed by Vapnik et al. [118] based on the structural risk minimization principle derived from statistical learning theory. In this section, the underlying theory behind SVMs as presented in [67] is summarized for the sake of better understanding.

The idea of structural risk minimization is to find a hypothesis h from a hypothesis space H for which the lowest probability of error Err(h) can be guaranteed for a given input Sn. The input consists of a set of training examples of the form Sn = {(x1, y1), (x2, y2),..., (xn, yn)}, where each xi ∈ R^N represents a feature vector and yi ∈ {±1} indicates whether the feature vector is a positive or a negative example. In the simplest form, SVMs learn linear decision rules h(x) = sign(w·x + b), described by a weight vector w and threshold b. SVMs strive to find a hyperplane that separates the positive and negative examples with maximal margin. In the linearly separable case, SVMs find the hyperplane that is at maximum Euclidean distance from the closest training examples. This distance is called the margin δ, as shown in Figure 3.7.

Figure 3.7: A 2-dimensional Binary Classification with Linear SVM

The examples that lie nearest to the hyperplane are known as support vectors. In the case of a non-separable Sn, the amount of training error is measured using slack variables ξi. The problem of computing the hyperplane can be reduced to the following primal optimization problem.

Figure 3.8: An example decision tree to classify phishing e-mails

OPTIMIZATION PROBLEM 1. (SVM (PRIMAL))

minimize: V(w, b, ξ) = (1/2) w·w + C Σ_{i=1}^{n} ξi   (3.6)

subject to: ∀ i = 1,..., n : yi [w·xi + b] ≥ 1 − ξi   (3.7)

∀ i = 1,..., n : ξi ≥ 0   (3.8)

Constraint 3.7 dictates that all training examples are classified correctly up to the slack ξi. In cases where a training example lies on the wrong side of the hyperplane, the corresponding ξi is greater than or equal to 1. Therefore, Σ_{i=1}^{n} ξi gives an upper bound on the number of training errors. Instead of solving this optimization problem directly, it is easier to solve its dual [118].
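To make the roles of the margin constraint (3.7) and the slack penalty C concrete, the primal objective (3.6) can be minimized with a toy subgradient descent. Real SVM solvers, including those referenced here, work on the dual; this sketch is purely illustrative, and the learning rate and epoch count are arbitrary assumptions.

```python
import random

# Toy subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b)).
# Illustrative only; not a substitute for a dual-based SVM solver.
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200, seed=0):
    """X: list of feature vectors; y: labels in {+1, -1}."""
    rnd = random.Random(seed)
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    data = list(zip(X, y))
    for _ in range(epochs):
        rnd.shuffle(data)
        for x, yi in data:
            margin = yi * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            for j in range(dim):
                # subgradient: regularizer always, hinge term only if margin < 1
                grad = w[j] - (C * yi * x[j] if margin < 1 else 0.0)
                w[j] -= lr * grad
            if margin < 1:
                b += lr * C * yi
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```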

Also, there might be cases where Sn is not linearly separable. In such cases, SVMs can easily be generalized to generate non-linear decision rules. Further discussion on this topic can be found in [67].

Decision Trees – A decision tree is a hierarchical model for supervised learning, which takes a set of attribute-value pairs as input. The internal nodes of the decision tree denote tests on attributes. Each branch in the tree represents the outcome of a test, and leaf nodes represent the respective output classes. The tests usually operate on the empirical values of the attributes supplied as input. Figure 3.8 illustrates an example decision tree used to classify phishing e-mails. When the number of attributes is large, an information gain measure is used to select the best test attribute at each node in the tree. Such a measure is referred to as a measure of the goodness of split. These attributes also minimize the total information needed to classify the samples, and partition the decision tree into a set of near-optimal partitions, reducing randomness or "entropy." Decision trees can be automatically generated from the training attribute-value pairs using induction algorithms such as the ID3 algorithm. As opposed to other complex algorithms that mask the underlying classification procedure, decision trees are simple to interpret and understand, and provide a clear picture of the role of each attribute in classification.
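The "goodness of split" measure can be made concrete: the information gain of a binary attribute is the entropy of the class labels minus the weighted entropy of the two subsets the attribute induces. A minimal sketch, with "phish"/"ham" labels as illustrative values:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """attr_values[i] in {0, 1} is the binary attribute for example i."""
    n = len(labels)
    gain = entropy(labels)
    for v in (0, 1):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain
```

An attribute that splits the classes perfectly gains one full bit; an attribute independent of the class gains nothing.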

3.6.3 Experimentation

Before applying the aforementioned data mining algorithms for classification purposes, it is essential that the datasets are sanitized. Specifically, unwanted HTML tags, attachments and MIME elements that do not play any role in the classification are removed, so that the features can be extracted from the remaining text body in an efficient manner.

Dataset Sanitization

There are three datasets used in the experimentation: (i) the phishing dataset is obtained from an open repository, and consists of 4550 phishing e-mails assimilated over a period of three years, from November 27, 2004 to August 7, 2007; (ii) the ham dataset is obtained from the SpamAssassin dataset. It consists of 6950 e-mails split into easy and hard ham sets. As the name indicates, the hard ham set contains e-mails with content, linked URLs, and MIME formats that are similar to spam e-mails; and (iii) the hard financial ham dataset contains a total of 577 legitimate e-mails sent by financial institutions such as HSBC, Amazon, eBay, American Express, Citibank, ICICI bank, Google checkout, etc. Probably due to their sensitive nature, there was no publicly available e-mail dataset that contained legitimate e-mails from financial institutions. Hence, this dataset was gathered from e-mails sent over by four volunteers instead. Since phishers spoof financial e-mails, testing the classification models using this dataset gives an accurate estimate of the false positive rate.

In order to obtain better classification results, it is important that the dataset be stripped of all unwanted information that can skew the results. For this purpose, all attachments are stripped from the e-mails. Then, each e-mail is tokenized into a sequence of words W = {w1, w2,..., wn}. These tokens are passed through a stop-word elimination algorithm that removes all commonly occurring English words such as articles, conjunctions, prepositions, etc. Stop-word elimination removes noise from the dataset by removing common, yet low-impact words. Each word is then passed through a linguistic module that tags it with its part-of-speech (POS). Opennlp-tools-1.4.2, an open source natural language processing (NLP) toolkit, is used for this purpose. This tool is preloaded with various annotated models that assist in automated parsing, tokenizing, chunking and part-of-speech tagging of English sentences. Once POS tagging is complete, all 12 linguistic features are computed accordingly. The structural features can be obtained directly from the e-mail body in a straightforward fashion. Unlike the linguistic features, which are continuous, the structural features are binary, having indicator variables with values 1 and 0 to denote phishing and ham respectively. Lastly, for the textual features, the words in set W are passed through an NLP stemmer. The stemmer improves the classifier's performance by normalizing each word across documents, converting it to its morphological root and thereby making classification easier. For example, words such as prompt, promptly, and prompted are converted into the root prompt. The Snowball stemmer, an extension of the Porter stemmer, is used for this purpose. After stemming is done, the top 1000 1-gram, 2-gram and 3-gram features, ranked by TF-IDF score, are derived from W. However, since multiple features can have the same TF-IDF score, the actual total of features selected in each set can be more than 1000. The entire sequence of dataset sanitization is illustrated in Figure 3.9.
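A compressed sketch of this pipeline follows: tokenization, stop-word removal, stemming, and TF-IDF ranking of n-grams. The stop-word list and the crude suffix stripping (a toy stand-in for the Snowball stemmer) are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Toy stand-ins; a real pipeline would use a full stop-word list and the
# Snowball stemmer rather than naive suffix stripping.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "your"}

def tokenize(text):
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def crude_stem(word):
    for suffix in ("ing", "edly", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def top_ngrams(docs, n=1, k=1000):
    """Rank n-grams across tokenized docs by summed TF-IDF score."""
    grams_per_doc = [
        Counter(tuple(doc[i:i + n]) for i in range(len(doc) - n + 1))
        for doc in docs
    ]
    df = Counter(g for grams in grams_per_doc for g in grams)  # document frequency
    scores = Counter()
    for grams in grams_per_doc:
        total = sum(grams.values())
        for g, tf in grams.items():
            scores[g] += (tf / total) * math.log(len(docs) / df[g])
    return [g for g, _ in scores.most_common(k)]
```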

Figure 3.9: Preprocessing and sanitization of e-mail dataset before classification

Effect of Individual Features on Classification

Even though it is the cumulative effect of all the features involved that influences the overall classification process, it is equally important to analyze the role of each individual feature to justify its selection. For this purpose, the impact of each individual feature on total e-mails and phishing e-mails is studied. Figure 3.10 compares each linguistic feature with regard to its presence in phishing e-mails and in the total e-mail fraction. The values of the linguistic feature are plotted on the horizontal axis as equal-sized bins. The bar graph in each panel depicts the distribution of feature values (histogram) with respect to the fraction of e-mails (given on the left vertical axis), whereas the line graph represents the distribution of feature values with respect to the probability of them being phishing e-mails (given on the right vertical axis).

[Figure 3.10 consists of eleven such panels – Noun Fraction, Pronoun Fraction, Verb Fraction, Sentence Length, Word Length, Pausality, Lexical Diversity, Content Diversity, Expressivity Ratio, Passive Ratio, and Self Reference Ratio.]

Figure 3.10: Effect of each individual feature on the total fraction of e-mails and phishing e-mails

As shown in the figure, the bar graph depicting the noun fraction in total e-mails vaguely follows a normal distribution with mean 0.327 and standard deviation 0.0981. This indicates that more than half (57%) of all e-mails have a noun fraction less than 0.327. Moreover, 53% of all the e-mails around the peak of the bar chart are phishing e-mails. Also, more than 95% of all the e-mails with a noun fraction less than 0.15 are phishing e-mails. Unfortunately, as around 71% of total e-mails in the interval [0.2, 0.375] have almost equal probability of being ham (58%) and phishing (42%), the noun fraction by itself does not serve as a good discriminator to classify phishing e-mails. In the pronoun fraction chart, the bar graph has mean 0.054 with standard deviation 0.033. The biggest chunk of total e-mails (33%) have pronoun fraction values between [0.025, 0.05]. Meanwhile, the probability of phishing e-mails in this chunk is only around 12%. Also, in the [0.1, 0.175] interval, there is a clear distinction between phishing and ham e-mails, with more than 80% of the total e-mails being phishing. It can be seen from the verb fraction chart that the probability of being classified as phishing is high in e-mails with large verb fraction values. For example, more than 65% of e-mails with a fraction value of 0.175 are phishing e-mails. In the sentence length graph, there is a clear demarcation of phishing and ham e-mails along the left and right tails of the bar graph. Sentence length is measured by the average number of words present across all sentences in an e-mail. The majority of ham e-mails (~75%) have an average sentence length between 20 and 30 words. E-mails having an average sentence length of more than 60 words have a high probability of being phishing e-mails. This is attributed to the fact that most regular e-mails represent personal communication, and are terse when compared to phishing e-mails. Similarly, as shown in the word length chart, almost all e-mails having an average of more than 12 characters across all words in the body are phishing. About 82% of all e-mails have an average word length between 6 and 9 characters. On the other hand, from the pausality graph it is evident that about 80% of e-mails with very little punctuation (< 0.025) are phishing. Also, about 23% of all e-mails have a pausality value between [0.05, 0.1], with around a 65% chance of being phishing. In the lexical and content diversity charts, since the bar chart and the line chart vary proportionally, with the probability of being phishing roughly around 50% in each segment, they alone do not serve as good discriminators to classify phishing e-mails. It is evident that most phishing e-mails have a significant proportion of passive voice sentences. Among the roughly 30% of total e-mails with a passive voice ratio in [0.25, 0.325], 70% are phishing. Moreover, the distributions of the bar graph and line graph are noisy toward the right tails. About 23% of e-mails have a self reference ratio between [0, 0.025], with only 22% of them being phishing. The remaining bins in [0.25, 1.0] are of the same size, barring a few exceptions. However, the probability of phishing e-mails in these bins oscillates between 35% and 65%. In the uncertainty ratio chart, the probability of phishing e-mails decreases drastically in the right tail, especially in the range [0.025, 0.975], with most of them being ham e-mails. The expressivity ratio also follows a similar pattern, with the probability of

Table 3.2: Distribution of structural features in phishing and ham e-mails. The value 1 indicates the presence of the feature, whereas 0 indicates its absence

Feature                                   Value   % of Phishing e-mails   % of Ham e-mails
Difference in visible and hidden URL        0           76.5                    49
                                            1           32.5                    51
Generalization in addressing recipients     0          > 99.5                   60.5
                                            1          < 0.5                    39.5
Is URL in dotted IP format                  0          < 0.01                   81.2
                                            1          > 99.99                  18.8

phishing decreasing drastically from its initial peak of 63% at a value of 0.125.

The structural features, as opposed to the linguistic features, are discrete, with values 0 and 1 indicating their absence and presence respectively. Table 3.2 shows the distribution of the structural features, indicating their presence in ham and phishing e-mails. It can be seen that around 76% of phishing e-mails have a difference between the visible and hidden links, in order to redirect users to fake Websites, as opposed to only 49.5% of ham e-mails. The discrepancy in URLs in ham e-mails is due to the fact that the visible portions have text in them even though the hidden URL refers to the legitimate domain. Almost all ham e-mails do not refer to their recipients in a generalized fashion. On the contrary, around 39.5% of phishing e-mails address their recipients with non-personal salutations such as customer, member, subscriber, etc. Furthermore, it is clear from the table that e-mails having dotted IP format URLs are likely to be phishing e-mails. More than 99.99% of all phishing e-mails have URLs in dotted IP format, whereas only 18.2% of ham e-mails do. As discussed earlier, based on the TF-IDF score, the top 1000 1-gram, 2-gram and 3-gram terms are chosen to form the textual features. This resulted in a total of 1542 1-gram terms, 1818 2-gram terms and 1666 3-gram terms respectively.
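The per-bin analysis behind Figure 3.10 amounts to bucketing a continuous feature into equal-sized bins and computing, per bin, the fraction of all e-mails and the empirical phishing probability. A minimal sketch, assuming feature values normalized to [0, 1]:

```python
# Sketch of the per-bin analysis behind Figure 3.10.
def binned_phishing_profile(values, is_phish, n_bins=10):
    """values: feature value per e-mail in [0, 1]; is_phish: parallel bools.
    Returns, per bin, (fraction of all e-mails, empirical phishing probability)."""
    counts = [0] * n_bins
    phish = [0] * n_bins
    for v, p in zip(values, is_phish):
        b = min(int(v * n_bins), n_bins - 1)  # clamp v == 1.0 into the last bin
        counts[b] += 1
        phish[b] += p
    total = len(values)
    return [
        (counts[b] / total, phish[b] / counts[b] if counts[b] else None)
        for b in range(n_bins)
    ]
```

Bins with no e-mails report `None` for the phishing probability, matching the gaps visible in the sparser panels.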

Evaluation Metrics

There are five metrics used in the evaluation, viz. (i) detection rate; (ii) false positive rate; (iii) precision; (iv) recall; and (v) f1 statistic. Let nh→p be the number of ham e-mails classified as phishing, and nh→h the number of ham e-mails classified as ham. Similarly, let np→h be the number of phishing e-mails classified as ham, and np→p the number of phishing e-mails classified as phishing. np→p determines the detection rate, whereas nh→p determines the false positive rate. Precision (p) is calculated as the ratio of the number of phishing e-mails classified as phishing to the sum of the number of phishing e-mails classified as phishing and the number of ham e-mails classified as phishing; therefore, p = np→p / (np→p + nh→p). Recall (r), on the other hand, is calculated as the ratio of the number of phishing e-mails classified as phishing to the sum of the number of phishing e-mails classified as phishing and the number of phishing e-mails classified as ham; therefore, r = np→p / (np→p + np→h). Finally, the f1 statistic gives a measure of accuracy and is computed as the harmonic mean of precision and recall: f1 = 2pr / (p + r).
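Expressed directly in terms of these confusion counts, with the detection and false positive rates normalized as fractions (an assumption, since the text defines them via raw counts), the five metrics are:

```python
def metrics(n_pp, n_ph, n_hp, n_hh):
    """n_pp: phishing->phishing, n_ph: phishing->ham,
    n_hp: ham->phishing, n_hh: ham->ham."""
    detection_rate = n_pp / (n_pp + n_ph)       # fraction of phishing caught
    false_positive_rate = n_hp / (n_hp + n_hh)  # fraction of ham mislabeled
    precision = n_pp / (n_pp + n_hp)
    recall = n_pp / (n_pp + n_ph)               # identical to detection rate
    f1 = 2 * precision * recall / (precision + recall)
    return detection_rate, false_positive_rate, precision, recall, f1
```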

3.6.4 Experimental Setup

Each machine learning algorithm is tested against 14 different feature set compositions. Broadly speaking, two different dataset factions were formed – one having all the e-mails, and another comprising just the phishing and SpamAssassin ham datasets without any hard bank ham e-mails. Such a separation permits studying the effect on the classifiers of the hard bank ham e-mails sent by legitimate financial institutions. Each of these factions, in turn, consists of 7 different feature sets, given as follows: (i) all the linguistic and structural features, along with 3-gram words (F3gram_all); (ii) all the linguistic and structural features, along with 2-gram words (F2gram_all); (iii) all the linguistic and structural features, along with 1-gram words (F1gram_all); (iv) just the 3-gram words (F3gram_only); (v) just the 2-gram words (F2gram_only); (vi) just the 1-gram words (F1gram_only); and (vii) just the linguistic and structural features (Fdisc).

In the experimentation, a 10-fold cross validation scheme is used. In other words, the input e-mails are divided into 10 subgroups (folds) randomly. Among these 10 subgroups, 9 are chosen for training, and 1 is used for testing. The cross validation is repeated 10 times, and the results obtained are averaged. Since there are more ham e-mails than phishing e-mails, it is possible for ham e-mails to dominate a particular subgroup. To avoid this problem, stratified cross validation is used, where the same proportion of phishing and ham e-mails is retained in each subgroup as in the original dataset.
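The stratified split can be sketched as follows: shuffle each class separately and deal its members round-robin into the k folds, so every fold preserves the original phishing/ham proportion. The seed and fold-assignment scheme are illustrative assumptions.

```python
import random

# Stratified k-fold assignment sketch: round-robin within each class.
def stratified_folds(labels, k=10, seed=42):
    """Returns fold_of[i] in 0..k-1 for each example, class-balanced."""
    rnd = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in set(labels):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rnd.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k
    return fold_of
```

For training/testing, fold f serves as the test set while the other k − 1 folds form the training set, repeated for each f.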

Results

Table 3.3 shows the performance of the three classification algorithms when tested with all three datasets, and Table 3.4 shows the results when tested only with the phishing and SpamAssassin datasets. As shown in both tables, SVMs exhibit the best performance except when tested only on structural and linguistic features. This reinforces the fact that SVMs are well suited for text classification, performing exceptionally well when the input features are textual in nature. At the other end, the naïve Bayesian classifier performs poorly when textual features are included for testing. Since the naïve Bayesian classifier involves manipulation of prior and posterior probabilities of the input variables, it performs well with discrete features, achieving more than 90% detection rates, as opposed to continuous features. Even though the naïve Bayesian classifier's detection rate is poor when tested with textual features, it still has a negligible false positive rate. However, the marginal decrease in false positive rate does not compensate for the loss in detection when compared with the other two classifiers, making the naïve Bayesian classifier a poor choice for classifying phishing e-mails. In the experiments, decision trees have been constructed using the J48 algorithm, an adaptation of C4.5, the popular decision tree generation algorithm developed by Quinlan [99]. Decision tree performance is comparable with that of SVMs, exhibiting only slight degradation. Precision, recall and f1 statistic have values over 90% in most cases for SVMs and decision trees. Also, including the structural and linguistic features markedly improves the overall performance when compared with the textual features alone. This is because the linguistic features capture the underlying tone and grammatical traits of phishing e-mails during classification. F1gram_all shows better performance in both tables, both in terms of detection and false positive rates, than F2gram_all and F3gram_all. Contrary to common notion, textual features having 3-gram words do not perform better than 2-gram and 1-gram textual features. This might be because there are not enough unique 3-gram features shared across the phishing datasets that can be used for classification. When comparing the performances in Table 3.3 and Table 3.4, it is clear that the hard bank ham dataset decreases the overall performance (as shown in Table 3.3). As stated earlier, the hard bank ham dataset comprises e-mails sent by legitimate financial institutions, which share the same textual and linguistic features as phishing e-mails. Consequently, classifiers exhibit higher false positive rates when the input dataset includes hard bank ham e-mails. Another important factor to consider while evaluating the classifiers is the time taken to build the discriminator model, which is then used to test incoming e-mails. The naïve Bayesian classifier takes an average of 176.78 seconds to build the discriminator model. SVMs are the fastest, taking only 64.99 seconds on average, while decision trees are the slowest, consuming 16.68 minutes.

Discussion

Even though the machine learning algorithms show promising results in the detection of phishing e-mails, they are just a means to an end, and do not form the complete solution. Jakobsson et al. [65] found that a large fraction of users do not have any knowledge about phishing attacks; hence, merely tagging an inbound e-mail as phishing does not dissuade users from clicking on spurious links. To address this problem, CMU's CUPS lab focuses on recruiting Internet users and educating them about the disastrous effects of phishing attacks using Web based educational games [106]. However, for such an education mechanism to be ubiquitous, it is important that it be handled within the confines of the mail client. What is even more important is that if there is a reliable mechanism to communicate the import (tone) to users, then users can not only identify similar

Table 3.3: Performance of machine learning algorithms when tested against phishing, SpamAssassin and hard ham datasets.

Feature Set Used   Detection Algorithm   Detection Rate   False Positive Rate   Precision   Recall   f1 statistic
F3gram_all         Naïve Bayes           68.9             0.10                  99.7        69.0     81.6
                   SVM                   97.2             0.59                  99.0        97.2     98.1
                   Decision Tree         97.3             1.17                  98.1        97.3     97.7
F2gram_all         Naïve Bayes           87.6             1.6                   87.6        92.1     93.7
                   SVM                   98.6             0.4                   99.3        98.6     98.9
                   Decision Tree         97.9             1.0                   98.4        97.9     98.1
F1gram_all         Naïve Bayes           90.8             2.4                   95.8        90.8     93.2
                   SVM                   99.1             0.3                   99.6        99.1     99.4
                   Decision Tree         98.4             0.9                   98.5        98.4     98.4
F3gram_wrds        Naïve Bayes           67.9             0.10                  99.8        67.9     80.9
                   SVM                   89.1             0.6                   98.8        89.1     93.7
                   Decision Tree         86.5             1.2                   97.8        86.5     91.8
F2gram_wrds        Naïve Bayes           77.4             0.3                   99.4        77.4     87.1
                   SVM                   98.9             1.0                   98.4        98.9     98.7
                   Decision Tree         94.5             0.8                   98.6        94.5     96.5
F1gram_wrds        Naïve Bayes           89.7             2.4                   95.9        89.7     92.7
                   SVM                   99.0             0.4                   99.3        99.3     99.3
                   Decision Tree         97.8             1.4                   97.6        97.8     97.7
Fdisc              Naïve Bayes           91               6.6                   89.3        91       90.1
                   SVM                   76.7             2.7                   94.6        76.7     84.7
                   Decision Tree         95.3             2.5                   95.9        95.3     95.6

e-mails in the future, but also understand the modus operandi of spoofed e-mails. For this purpose, in the next two sections, the process of context sensitive warning generation and its effectiveness are discussed.

3.7 Context Sensitive Warning Generation

The process of generating context sensitive warnings is two-fold: (i) first, the suspect e-mail's content needs to be parsed and analyzed to extract the underlying tone; (ii) then, the extracted tone

Table 3.4: Performance of machine learning algorithms when tested against only phishing and SpamAssassin datasets.

Feature Set Used   Detection Algorithm   Detection Rate   False Positive Rate   Precision   Recall   f1 statistic
F3gram_all         Naïve Bayes           72.9             0.0                   100         72.9     84.3
                   SVM                   97.6             0.5                   99.3        97.6     98.4
                   Decision Tree         97.8             1.1                   98.3        97.8     98.1
F2gram_all         Naïve Bayes           79.5             0.4                   99.1        79.5     88.2
                   SVM                   98.7             0.3                   99.6        98.7     99.1
                   Decision Tree         98.4             0.8                   98.8        98.4     99.6
F1gram_all         Naïve Bayes           91.4             0.4                   99.3        91.4     95.2
                   SVM                   99.3             0.1                   99.8        99.3     99.6
                   Decision Tree         98.4             0.9                   98.7        98.4     98.6
F3gram_wrds        Naïve Bayes           70.3             0.0                   100         70.3     82.6
                   SVM                   89.5             0.6                   99          89.5     94
                   Decision Tree         86.5             1.1                   98.1        86.5     91.9
F2gram_wrds        Naïve Bayes           81.2             0.2                   99.6        81.2     89.5
                   SVM                   98.9             1.2                   98.2        98.9     98.6
                   Decision Tree         94.6             0.8                   98.7        94.6     96.6
F1gram_wrds        Naïve Bayes           90.2             0.4                   99.3        90.2     94.5
                   SVM                   98.9             1.2                   98.2        98.9     98.6
                   Decision Tree         97.9             1.0                   98.4        97.9     98.1
Fdisc              Naïve Bayes           92.2             5.7                   91.3        92.2     91.8
                   SVM                   77.5             1.4                   97.3        77.5     86.3
                   Decision Tree         96.6             1.8                   97.2        96.6     96.9

must be communicated in an effective fashion so that the user does not respond to the e-mail. For the first step, during the training stage, each phishing e-mail is, as before, broken down

into a sequence of words W = {w1, w2,..., wn}. The goal is then reduced to finding an optimal termset, Wspoof ⊆ W × W, such that the "tone" conveyed in the spoofed e-mail is accurately captured. At a broader level, even though Wspoof can be extracted using "bag-of-words" approaches that enumerate frequently appearing words in the e-mail text, such approaches do not take into account the context (presence or absence) of a word with regard to other words in the text. For example, the sentence do not provide password or social security number... has a great chance of

Figure 3.11: Relationship between the frame Taking_Time and other abstract frames representing semantic roles that specify time boundedness in FrameNet

being misconstrued as a phishing e-mail. In other words, since words that appear frequently in phishing e-mails also appear in e-mails sent from legitimate institutions, these 1-gram based approaches fail to scale well. In addition, these approaches do not account for the grammatical relevance/context of the words appearing in the phishing e-mail. In order to capture the tone conveyed in phishing e-mails, during tokenization and NLP tagging, each word is also paired with its part-of-speech (POS), such as noun, verb, adjective, etc. As a result, each term in Wspoof contains a set of related words that appear in the phishing e-mail along with their POS. The extracted words are then passed through WordNet [35], an online resource where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each of which expresses a distinct concept. For example, the synset for the word immediate consists of the words/phrases instantly, straightaway, straight off, directly, now, right away, at once, forthwith, like a shot. Each synset is also provided with a short summary (gloss) to give a more descriptive definition. The gloss for the synset of the word immediate as provided by WordNet is without hesitation or delay; with no time intervening. Even though the glosses provide descriptions for synsets, they do not annotate the underlying semantic roles, which are crucial in identifying the tone conveyed by the e-mail. Using the lexical knowledge base FrameNet [13], synsets are mapped to one or more pre-annotated semantic frames depending on the type of event or state and the participants associated with them. These semantic frames are representations of situations

involving various participants, properties, and other conceptual roles. Each frame has a set of associated words (lexical units), and is descriptive of a specific context/situation. Currently, FrameNet consists of around 850 semantic frames with 135,000 annotated sentences. Also, each frame can be derived from or related to many other frames. The set of frames representing a time bound context, i.e., the synsets of immediate, is shown in Figure 3.11.

The algorithm given in [23] is adopted to convert the synsets into equivalent FrameNet frames. The algorithm is essentially a two step process: (i) in the first step, all candidate frames whose lexical unit comprises words in the synset or their variants (hypernyms/antonyms) are chosen. Then, for words that are not listed in the lexical unit of any frame, the names of the frames are checked to determine if they contain the words in the synsets. For a frame to match, there has to be at least 50% overlap between the word and the frame name. (ii) In the second step, in order to select the best set of frames, all candidate frames are weighted depending on how they were selected. The boostfactor places more weight on the frames chosen for having the synset words in their lexical unit, as opposed to the frames which contain the words in their frame names. The weight for each frame is computed as shown in Algorithm 1. In order to extract the tone of the e-mail, a conglomerative set of frames is created by aggregating all the frames returned for each word in the phishing e-mail message. In the training phase, these conglomerative sets are grouped into five semantic contexts, namely, Justification, Penalty, Urgent_Action, Reward and Concern, depending on the corresponding returned constituent frames and their weights. Also, appropriate context sensitive warnings are assigned to the conglomerative sets, which describe the intent of the phisher to the recipients in an effective manner.
In the testing phase, each e-mail tagged as phishing by the machine learning algorithms is analyzed to see if it matches exactly or partially with one or more tagged conglomerative frame sets. Depending on the match, the corresponding context sensitive warnings are communicated to the user. Figure 3.12 shows the text box displaying warning messages for an HSBC phishing e-mail that uses account revocation as the underlying threat to coerce users into disclosing sensitive information. For a user who is still not convinced by the warning, an option of forwarding the e-mail to the appropriate security department is provided so that an accurate response about the validity of the

input: WordNet synsets for each word extracted from the e-mail
output: A set of FrameNet frames with corresponding weights indicating the overall relevance

for each word ws in the synset do
    search_words = set of related hypernyms, antonyms corresponding to ws from WordNet;
end
evoked_by_lexical_unit = φ;
evoked_by_name_match = φ;
for each frame f in FrameNet do
    for each word w in search_words do
        if w is in lexical unit of f then
            evoked_by_lexical_unit(f) = evoked_by_lexical_unit(f) ∪ w;
            spreading_factor(w) += 1;
        else if w has a 50% match with f's name then
            evoked_by_name_match(f) = evoked_by_name_match(f) ∪ w;
            spreading_factor(w) += 1;
        end
    end
end
for each word w in search_words do
    for each frame f in FrameNet do
        Weight(f) = Σ_{w ∈ evoked_by_lexical_unit(f)} similarity(ws, w) · boostfactor / spreading_factor(w);
        Weight(f) += Σ_{w ∈ evoked_by_name_match(f)} similarity(ws, w) / spreading_factor(w);
    end
end

Algorithm 4: Mining a set of weighted FrameNet frames for WordNet synsets

e-mail can be obtained. Disallowing the user to respond to the e-mail until proper clearance is obtained can also be done in a forceful manner, as shown in [22].
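The frame-weighting procedure above can be illustrated with a small executable sketch. This is a simplified stand-in, not the thesis implementation: `frames` is a toy map from frame names to lexical units, the 50% name-match test is approximated by a substring check, and `word_similarity` is a single-argument placeholder for the WordNet-based similarity between the synset head word and its related words.

```python
from collections import defaultdict

def weight_frames(synset_words, frames, word_similarity, boost_factor=2.0):
    """Score candidate frames for a synset: frames evoked through their
    lexical units receive a boosted weight, frames evoked through a
    partial name match receive an unboosted weight, and each word's
    contribution is divided by how many frames it evokes."""
    evoked_by_lexical_unit = defaultdict(set)
    evoked_by_name_match = defaultdict(set)
    spreading_factor = defaultdict(int)

    for frame, lexical_units in frames.items():
        for w in synset_words:
            if w in lexical_units:
                evoked_by_lexical_unit[frame].add(w)
                spreading_factor[w] += 1
            elif w in frame.lower():  # crude stand-in for the 50% name match
                evoked_by_name_match[frame].add(w)
                spreading_factor[w] += 1

    weights = defaultdict(float)
    for frame in frames:
        for w in evoked_by_lexical_unit[frame]:
            weights[frame] += word_similarity(w) * boost_factor / spreading_factor[w]
        for w in evoked_by_name_match[frame]:
            weights[frame] += word_similarity(w) / spreading_factor[w]
    return dict(weights)
```

With a uniform similarity of 1.0 and the default boost of 2.0, a frame whose lexical units contain two synset words and whose name matches a third would score 2 + 2 + 1 = 5.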

3.7.1 Experiences with Context-Sensitive Warning Generation

To identify the tone of an e-mail, all the frames returned by FrameNet for that particular e-mail are grouped together and assigned a semantic context. For example, e-mails that impose a sense of urgency in the recipients (i.e., belonging to the Urgent_Action category) have frames with names such as Response, Communication_response, Requesting, Activity_pause, Submitting_documents

Figure 3.12: Context-sensitive warning explaining the tone of the phishing e-mail to the user

and Compliance linked to them. Similarly, e-mails that express a false sense of concern (i.e., belonging to the Concern category) as a deception mechanism are matched with frames such as Assistance, Personal_relationship, Cause_to_start, Becoming_aware, Evidence, Request, Education_teaching, etc. Phishing e-mails that attempt to extort private data from users by threatening them with account revocation (i.e., belonging to the Penalty category) are tagged with frames such as Inhibit_movement, Thwarting, Compliance, Scrutiny, Attempt_suasion, Telling, etc. Similarly, frames returned by FrameNet corresponding to e-mails that lure victims with fake rewards (i.e., frames belonging to the Reward category) have names such as Telling, Personal_Relationship, Compatibility, etc. Lastly, e-mails belonging to the Justification category have names such as Waking_up, Topic, Questioning, Leadership, Request, Execute_plan, Using, Protecting, etc. In the training phase, each conglomerative set is tagged with only one of the five semantic contexts. It is important to note that these five semantic contexts can have overlapping individual frames. In the experiments, there were a total of 126 conglomerative frame sets, formed by combining one or more of the 850 pre-annotated frames. Also, on average, each e-mail is associated with 15 different frames. In the testing phase, each e-mail's tone is assigned

a semantic context depending on the set of frames returned by FrameNet and the mapping done in the training phase. Experiments reveal that 62% of e-mails fall under the Urgent_Action category. Around 18% of phishing e-mails fall into the Penalty category and 13% into the Concern category. Of the remaining 7% of phishing e-mails, 5% belong to the Justification category and the rest belong to the Reward category. The warning messages for each of these five semantic contexts are charted out beforehand and stored. Depending on the outcome of the analysis, the appropriate warning message is displayed to the user as a modal text box.
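The testing-phase assignment can be sketched as follows. The overlap scoring below is an illustrative assumption: the thesis describes exact/partial matching of an e-mail's conglomerative frame set against the sets tagged in training, and a normalized-overlap score is one simple way to rank partial matches.

```python
def assign_context(email_frames, tagged_contexts):
    """Pick the semantic context whose tagged conglomerative frame set
    best overlaps the frames returned for an e-mail. Returns the best
    context name and its overlap score (fraction of the tagged set
    covered by the e-mail's frames)."""
    best, best_score = None, 0.0
    for context, frame_set in tagged_contexts.items():
        overlap = len(email_frames & frame_set) / len(frame_set)
        if overlap > best_score:
            best, best_score = context, overlap
    return best, best_score
```

An exact match yields a score of 1.0; the stored warning message for the winning context is then displayed to the user.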

3.8 Summary

This work begins by highlighting the limitations of existing spam filters in detecting phishing e-mails. A customizable and usable spam filter (CUSP) is proposed as a first line of defense to assist users in building and enforcing tailored filters that can identify phishing e-mails. However, since CUSP is rule based, it can be easily evaded by phishers. To address this limitation, a novel methodology to detect phishing e-mails based on linguistic, structural and textual features is proposed. Classification of phishing e-mails is done by employing three popular machine learning algorithms, viz. (i) the naïve Bayes classifier; (ii) support vector machines (SVMs); and (iii) decision trees. Experimental results demonstrate that using linguistic features for classification increases the overall detection rate while keeping the false positive rate low. Lastly, a context sensitive warning generation mechanism is proposed to educate users about the consequences of phishing attacks.

Chapter 4

Detecting Spurious E-mails through Linked-to Website Analysis

“Simply stated, it is sagacious to eschew obfuscation.”

− Norman R. Augustine

4.1 Introduction

The classification algorithms described in Chapter 3 require e-mail messages to be composed in text. Unfortunately, phishers and spammers can encode the content of an e-mail message as an image, making it difficult to extract the necessary textual and structural features used in classification. Since most present day e-mail clients render images by default, despite such obfuscation, the underlying malicious content and URLs that link to spurious Webpages are directly presented to users. Even though optical character recognition (OCR) techniques can be employed to extract text from images, they suffer from two major limitations. First, it is possible for attackers to encode text using wavy characters with different colored backgrounds, which look normal to the human eye, yet are obfuscated enough that OCRs cannot decipher them effectively. Second, if OCRs are to be deployed for spam detection, every incoming e-mail having an embedded image would have to be analyzed. Such an approach, if deployed on a mail server, might be untenable, as it would significantly increase the overall computational load. Figure 4.1 shows an example image


Figure 4.1: An example illustrating the inefficiency of OCRs in parsing out text from image based spurious e-mails. The output of GNU OCR (gocr) when applied to an image based HSBC phishing e-mail is shown here.

based phishing e-mail along with the output of an OCR. Encoding the content of an e-mail message as an image to bypass spam filters is just one side of the coin. It is possible for phishers and spammers to compose e-mails in text format and still evade even the specialized filters that apply content analysis techniques for classification. For example, word salad or junk words can be spewed in between text to skew the performance of statistical and learning based filters [70, 92, 123, 131]. Spammers can also modify text in e-mails by using Base64/UUencode/Quoted-Printable encoding, HTML/URL encoding, and by substituting letters with numbers [60]. Phishers, for instance, obscure the word PayPal.com by substituting the letter l with the number 1 (as PayPa1.com), thereby masking the name of the spoofed brand. Also, using natural language processing (NLP) techniques, suspicious words can be substituted

with innocuous words that express similar meaning. These innocuous words must be carefully picked so that they have different probabilities from their base words. Karlberger et al. demonstrate the feasibility of launching such a word-by-word substitution attack on SpamAssassin, DSPAM and Gmail [70]. Furthermore, attackers could degrade a spam filter's performance by inducing it to block even normal e-mails [92]. Despite these complex techniques, a phisher can simply choose to exclude any form of text in the e-mail body and still carry out an attack with just one URL that refers to a spoofed Website. A viable approach to detect such spurious e-mails is to examine the content of the Websites referred to by the URLs contained in them. Ultimately, the success of a spurious e-mail depends on its linked-to Website's ability to appear legitimate and convincing to normal users. Therefore, unlike spurious e-mails, which are ill-formed and carry junk content to blindside e-mail filters, the linked-to Websites are fairly well-formed in their structure, layout and behavior. Fortunately, from a defense perspective, this consistency in layout and behavior can be utilized to tell spurious and ham e-mails apart. However, there is a significant difference between phishing and spam Websites that needs to be taken into account while devising strategies for classification. Phishing Websites usually bear striking resemblance to their legitimate counterparts. The factor that separates them is their behavior; phishing Websites' intention is to steal user credentials, whereas legitimate Websites employ user credentials for authentication purposes only. This chapter proposes a novel framework called PHONEY for automatic detection and analysis of phishing attacks based on this difference in behavior between phishing and legitimate Websites.
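Some of the simpler substitutions discussed earlier (e.g., PayPa1.com) can be undone by normalizing tokens before brand-name matching. A minimal sketch, with a hypothetical substitution table; real filters use far larger homoglyph maps:

```python
# Hypothetical map of common digit/symbol substitutions back to letters.
SUBSTITUTIONS = str.maketrans({"1": "l", "0": "o", "3": "e", "@": "a", "$": "s"})

def normalize_token(token: str) -> str:
    """Lowercase a token and reverse simple character substitutions,
    so that obfuscated brand names match their plain forms."""
    return token.lower().translate(SUBSTITUTIONS)
```

For example, `normalize_token("PayPa1.com")` recovers `paypal.com`, letting a keyword or blacklist check fire despite the obfuscation.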
The key idea behind the PHONEY framework is to protect the identities of end-users by supplying fake information to Websites requesting sensitive information until their authenticity has been verified. PHONEY leverages the premise that, just as an end user cannot tell legitimate and spoofed e-mails apart, a phisher cannot tell legitimate user responses and phantom responses apart. Spam Websites, on the other hand, do not exhibit such a difference in behavior when compared with legitimate Websites. As they typically sell or advertise products related to the adult, financial,

drugs and leisure industries, examining their content for the presence of keywords that describe the products being sold serves as a valuable feature for detection. It is important to note that even though the design and layout of these Webpages may vary, their underlying content remains relatively unaltered. For example, Rolex watches and Viagra-like drugs are commonly advertised products in spam e-mails. Moreover, according to 2006 spam statistics, e-mails advertising these products formed more than 95% of total e-mail spam [48]. To this extent, this dissertation identifies a set of features that accurately encapsulates the appearance and content of the linked-to spam Webpages. The identified features are fed as input into 5 different machine learning algorithms for classification purposes. Finally, the proposed approach is evaluated using live spam and ham (legitimate) e-mails and its performance is compared against two popular open source anti-spam tools, viz. Vipul's Razor and SpamAssassin.
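The keyword-based content check can be sketched as a simple feature extractor. The lexicon below is hypothetical and deliberately tiny; the actual feature set described in Section 4.3 combines textual features like these with structural features of the linked-to page.

```python
# Hypothetical keyword lexicon for spam-advertised product categories.
LEXICON = {"pharma": {"viagra", "cialis", "pills"},
           "replica": {"rolex", "replica", "watches"},
           "finance": {"loan", "mortgage", "refinance"}}

def keyword_features(page_text: str) -> list:
    """Return a binary feature vector: one entry per product category,
    set to 1 if the linked-to page mentions any keyword from it."""
    words = set(page_text.lower().split())
    return [int(bool(words & keywords)) for keywords in LEXICON.values()]
```

Each vector can then be fed, together with structural features, to the machine learning classifiers.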

4.1.1 Contributions

The main contributions of this chapter are as follows:

• Proactive Challenge-Response Technique to Detect Phishing E-mails – A novel framework called PHONEY is proposed to shield users from falling prey to phishing attacks. PHONEY sits between a user's mail transfer agent (MTA) and mail user agent (MUA) and processes each incoming e-mail for suspicious behavior. PHONEY is tested on 274 live phishing Websites and 20 legitimate Websites. Experimental results reveal that it is able to detect a wider range of phishing attacks with zero false positives. Also, a performance analysis study shows that the total overhead introduced by PHONEY is negligible.

• Classification of Spam E-mails based on Linked-to Webpage Analysis – In order to detect spam e-mails that obfuscate their content, a novel detection technique is proposed that focuses on analyzing the layout and content of linked-to Websites. In this regard, a set of textual features along with 5 structural features is first identified, and then fed as input into 5 different machine learning algorithms for the purpose of classification. Experimental results

show that these algorithms can detect even sophisticated spam e-mails with over 95% accuracy, performing significantly better than two popular open source anti-spam tools.

4.1.2 Chapter Organization

The rest of this chapter is organized as follows: Section 4.2 provides an overview of PHONEY along with its working and architecture details; the experimentation methodology used to test PHONEY is also explained. Section 4.3 elaborates on using machine learning techniques to classify spam e-mails by analyzing linked-to Websites. Discussion and concluding remarks are provided in Section 4.4.

4.2 Overview of PHONEY

This section begins by providing the necessary background to bring out the importance of a proactive defense methodology for detecting phishing e-mails. An overview of PHONEY's working and architecture is then provided. Also, the design considerations made while implementing PHONEY as a client-side tool are given for the sake of better understanding.

4.2.1 Background

Figure 4.2 illustrates an example phishing Website designed to target PayPal users. Typically, users receive spoofed e-mails luring them into visiting such fake Websites, which closely resemble their legitimate counterparts in layout and appearance. As shown in Figure 4.2, the HTML layout, links, and embedded images in the spoofed Website are copied directly from the legitimate PayPal Website. However, the URL displayed in the browser clearly indicates that the spoofed Website is hosted on a different domain. The login form provided on the Webpage serves to collect unsuspecting users' usernames and passwords. After receiving the username and password, the login

Figure 4.2: An illustration of a phishing Website fabricated to attack PayPal users. The displayed URL is clearly different from that of the PayPal Website.

Webpage usually redirects users to another Webpage that requests more sensitive information such as credit card number, account details, SSN, etc. The actual damage of phishing arises from phishers misusing the harvested sensitive information for personal gain. Several anti-phishing toolbars and browser extensions [34, 45, 46, 93, 111] have been proposed to prevent users from either visiting or divulging personal information into fake Websites. Most of these approaches examine the validity of the domain hosting the login Webpage to identify phishing Websites. Typically, the IP address and domain name of a Website are displayed on the toolbar by performing a reverse DNS lookup to provide visual cues to users. Also, these toolbars prevent users from visiting blacklisted Websites verified by popular anti-phishing communities such as PhishTank, APWG, and OpenDNS. However, such a reactive approach has potential drawbacks.

It is possible for attackers to launch phishing attacks within the confines of a legitimate domain by displaying login forms as popup boxes using code injection and cross site scripting attacks. Furthermore, these toolbars are not completely automated and delegate the burden of decision making to users. A usability study conducted by Schechter et al. reveals that normal Internet users tend to ignore security warnings, with over 50% of participants giving out their credentials despite warnings provided by the Internet Explorer (IE) 7 Web browser. Similarly, another such study conducted by Egelman et al. indicates that more than 20% of users ignore anti-phishing warnings. Another way to mitigate phishing attacks is to prevent users from disclosing sensitive information into fake Websites. For example, SpoofGuard [34], Dynamic Security Skins [42], PwdHash [103], PassPet [128], and Web Wallet [126] associate users' login credentials with the domain names of legitimate Websites. When a user inadvertently supplies a username and password to a different domain, appropriate warning messages are raised. However, these approaches are also susceptible to attacks launched from legitimate domains. Moreover, in order to prevent false positives, a user needs to take proper precautions when reusing the same username and password across different domains. As opposed to the current approaches, which adopt a defense-centric viewpoint by implementing a passive defense wrapper in the browser, PHONEY adopts a more proactive approach that examines the behavior of Websites as a detection methodology, particularly to address the aforementioned shortcomings. Also, while being distinct, PHONEY can complement these reactive detection schemes in a natural way. PHONEY can leverage the capabilities of state-of-the-art phishing detection mechanisms, while empowering them with behavior based detection capabilities.

4.2.2 Scope of Detection

The main purpose of PHONEY is to detect phishing e-mails that cannot be caught by merely examining the textual and linguistic features present in their body. As part of its operation, PHONEY focuses on detecting phishing e-mails by examining the behavior of spoofed Websites in response to fake inputs. However, the idea behind PHONEY is generic enough that it can be easily extended as

Figure 4.3: Defense-centric view: Who is the real sender - legitimate or adversary?
Figure 4.4: Offense-centric view: Who is the real respondent - the real victim or a PHONEY?

a browser add-on targeting a wider spectrum of attack vectors. In theory, PHONEY can be easily tailored to protect against other forms of phishing attacks that use Internet messaging (spimming) and malware (pharming) to redirect users to spoofed Websites.

4.2.3 Working of PHONEY

PHONEY views phishing as a two-stage game between the user and the adversary. In the first stage, an attacker sends e-mail messages pretending to represent a legitimate business domain to trick users into divulging their personal information. The success of the attack lies in the phisher's ability to craft the attack in a manner that a naive user is unable to differentiate between the legitimate and the masqueraded messages, as shown in Figure 4.3. For the second stage, PHONEY analyzes the incoming message content for the presence of embedded links and attached HTML forms. If the e-mail contains no such traits, further investigation is safely discarded. Otherwise, a set of phantom users, or fake identities, is assigned to actively communicate with these Websites using appropriate random values. The random/fake information supplied to the Websites acts as active honeytokens [75], and the Websites' responses are forwarded to the decision engine for further analysis, as shown in Figure 4.4. The key idea here is to shield a user from giving out sensitive personal information until the authenticity of the Website is established. Since the attacker cannot distinguish between fake and legitimate responses, his response is the same to


Figure 4.5: Block diagram of PHONEY architecture

both real and contrived responses.
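The first-stage screening described above can be sketched as a small triage function. This is an illustrative simplification (the regexes and return labels are assumptions, and the real preprocessor also checks whether a form requests critical information): e-mails carrying HTML forms are treated as directly suspicious, e-mails with embedded URLs are handed to the phantom users, and the rest are ignored.

```python
import re

URL_RE = re.compile(r"https?://[^\s\"'>]+", re.IGNORECASE)

def triage(email_body: str) -> str:
    """First-stage check: forms are tagged outright, embedded URLs
    trigger the phantom-user probe, and plain e-mails are skipped."""
    if re.search(r"<form\b", email_body, re.IGNORECASE):
        return "malicious"
    if URL_RE.search(email_body):
        return "probe-with-phantom-user"
    return "ignore"
```

Only e-mails routed to the phantom users incur the cost of fetching and interacting with the linked-to Website.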

4.2.4 Architecture Details

Figure 4.5 illustrates the architecture block diagram of PHONEY. PHONEY is deployed as a client side tool between the mail server and the mail client to detect and mitigate e-mail based phishing attacks. The process involving the various architectural components is described as follows: First, the preprocessor probes the mail server for incoming messages. Once an e-mail arrives, it parses the message's body and checks whether it contains embedded links and HTML forms. E-mails with HTML forms requesting critical information are directly tagged malicious. In the presence of embedded URLs, control is passed to the content scanner, which then retrieves the source of the referred Webpage for analysis. Webpages with input forms are broken down further to extract their input elements and the associated text. These extracted tokens are then compared against the entries in the hashDB (see Figure 4.5) for the presence of fields such as username, password, credit card number, social security number, etc. Each tuple in the hashDB has two fields representing the token name along with its fake value. Depending on the information required to be sent out, the values corresponding to the tokens in the hashDB are supplied to phantom users at the time of their instantiation. The phantom users are virtual entities created primarily for the purpose of interacting with the malicious Website. They interact with the Website by sending the requested information in the form of active honeytokens. The behavior of the Website to the

honeytokens is recorded and analyzed for any activities not conforming to a reasonable response. The decision engine is formalized as a rule-based system, which relies on a set of inference rules to deduce whether the process has terminated in any of the known attack instances.
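The token matching against the hashDB can be sketched as follows. The miniature `HASH_DB` and the regex-based input extraction are illustrative stand-ins (the thesis implementation uses a Perl HTML DOM parser and a larger database of fake values).

```python
import re

# Hypothetical miniature hashDB: sensitive field names -> fake values.
HASH_DB = {"username": "jdoe1972", "password": "Xy!48_tmp",
           "ssn": "078-05-1120", "credit": "4111111111111111"}

INPUT_RE = re.compile(r"<input[^>]*\bname=[\"']?([\w\-]+)", re.IGNORECASE)

def honeytokens_for(page_html: str) -> dict:
    """Match each form input's name against hashDB entries (a substring
    check, mirroring the 'in part or as a whole' comparison) and pick
    the corresponding fake value to submit as a honeytoken."""
    tokens = {}
    for field in INPUT_RE.findall(page_html):
        for key, fake in HASH_DB.items():
            if key in field.lower():
                tokens[field] = fake
    return tokens
```

The resulting dictionary is what a phantom user would submit in place of real credentials.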

4.2.5 Design Criteria

To be effective, PHONEY must be designed with the following design criteria in mind.

Accurate identification of HTML forms in Webpages – As a first step, it is important to determine whether a Webpage requesting sensitive information has HTML forms. Each HTML form comprises multiple input elements, which can be text fields, checkboxes, radio-buttons, submit buttons or drop down menus, depending on the nature of the input to be submitted. Each input element, in turn, is assigned a unique name, type and default value. Each input element extracted from the HTML form is represented by a triplet (form_name, element_name, element_text), where form_name and element_name denote the names assigned to a HTML form and its input element respectively, and element_text represents the text portion of the HTML input element that is visible to users. If the text portion of the element were not considered, it might be possible for phishers to assign irrelevant names to input elements and evade detection. As some Webpages tend to have multiple HTML forms, each input element is also tagged with the corresponding form name. For example, in the case of the Yahoo! Mail login page, the content parser outputs one such triplet for the username field and another for the password field. The input field names and the associated text are checked to see if they are contained, in part or as a whole, in the hashDB. Then the value associated with the entry is supplied as data to these form input elements. It is necessary that the values supplied be as close to legitimate values as possible and not violate any size or type constraints.

Efficient design of the decision engine – The performance of PHONEY depends largely on the ability of the decision engine to accurately identify phishing Websites. The phantom users forward responses received from suspicious Websites to the decision engine, which then carefully parses and analyzes them to see if they resemble the behavior of legitimate Websites. In general, two key

observations have been made in this regard to aid in detection. First, since fake information is supplied, a target Webpage† should request the same information from users again. Second, the target page should not redirect users to a different domain. This is used to detect man-in-the-middle attacks, where the phisher's Website, on acquiring users' information, transfers them to legitimate Websites.

Scalability and usability issues – The modular design of the various components of PHONEY allows new and improved detection techniques to be introduced without expensive redesigns. For example, PHONEY can operate in conjunction with blacklists and whitelists, expediting the decision process. Moreover, other defense mechanisms that reactively examine a Webpage's layout rather than proactively analyzing its behavior can also be made to operate in conjunction with PHONEY. On the client side, PHONEY can be easily installed in the mail server or even as a browser add-on. An e-mail marked as phishing by PHONEY is not rejected outright due to the possibility of false positives. With IMAP clients, a folder called Phish is created and phishing e-mails are forwarded to it with a warning message appended to the message body. There might be cases where the decision engine cannot make an authoritative decision. To tackle such situations, the warning message is appended to the e-mail body, but the e-mail is delivered to the user's inbox instead.
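The two observations about legitimate behavior translate directly into inference rules. A sketch, under the assumption that the fields requested before and after submitting honeytokens can be compared; the function names and labels are illustrative, not the thesis implementation:

```python
from urllib.parse import urlparse

def looks_phishing(submitted_url: str, response_url: str,
                   requested_fields: set, re_requested_fields: set) -> bool:
    """Apply two inference rules: a legitimate site rejects fake
    credentials and asks for the same fields again, and it does not
    bounce the user to another domain after submission."""
    cross_domain = urlparse(submitted_url).netloc != urlparse(response_url).netloc
    asks_again = requested_fields <= re_requested_fields
    # Accepting fake credentials without re-asking, or redirecting
    # off-domain (a man-in-the-middle pattern), is flagged as phishing.
    return cross_domain or not asks_again
```

A site that silently accepts fabricated credentials and moves on to ask for more data, or that hands the user off to a different domain, triggers an alarm.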

4.2.6 Implementation

PHONEY is implemented as a 450+ line Perl script using the WWW::Mechanize and WWW::Mechanize::FormFiller modules, so that it can be seamlessly integrated as a wrapper around an MTA. Webpages are parsed using an HTML document object model (DOM) parser written in Perl. The HTML::Form module is used to extract the input elements from HTML forms and generate appropriate honeytokens. Once the honeytokens are generated, a user agent is created using the WWW::Mechanize module, which acts as the phantom user, to generate HTTP requests and submit fake inputs. The decision engine, which tags an e-mail as phishing based on the response of a linked-to Website, is also implemented as a Perl module. In order to graphically visualize the working of PHONEY

†target Webpage refers to the first page loaded by clicking the URL

on spoofed Websites, an ActiveX plug-in to IE 7 is built in C#.

4.2.7 Evaluation of PHONEY

Dataset

One of the prerequisites for testing PHONEY is that e-mails must be linked to live Websites. As phishing Websites are ephemeral, there is a fair chance that they are taken down even before the corresponding e-mails reach users' mailboxes, making them unsuitable for testing. Also, since PHONEY detects phishing e-mails by analyzing the behavior of linked-to Websites, without any loss of generality, PHONEY can be evaluated using a phishing Websites dataset alone. Three datasets are used in the experimentation: (i) The testbed dataset consists of 12 distinct phishing Websites. These Websites are hosted locally by observing and replicating the behavior of live phishing Websites listed on the APWG (www.antiphishing.org) Website; (ii) The live phishing Websites dataset, which includes URLs of real phishing Websites hosted externally, is obtained as feeds from PhishTank over a period of three days. Only those Websites unverified by PhishTank are chosen, to maximize the chance of finding live phishing Websites. This dataset consists of a total of 509 URLs, 274 of which were live during testing; and (iii) Lastly, in order to study false positive rates, PHONEY is evaluated using a legitimate Websites dataset, which consists of URLs referring to the login pages of the 20 most phished brands in 2008.

Testbed Experiments

To carry out experiments in a local setting, the phishing Websites in the testbed dataset were hosted on an Intel Pentium 4, 2.6 GHz machine running Apache HTTP Server version 1.3.33. The underlying operating system is RedHat Linux running kernel version 2.4.20. PHONEY was installed as a client side plug-in on an Intel Pentium M, 1.3 GHz machine with 512 MB RAM. PHONEY was able to detect 10 out of 12 phishing Websites with minimum performance overhead. For further quantification, the total overhead introduced by PHONEY was broken down

into two parts: (a) phantom user instantiation overhead; and (b) response analysis overhead.

• Phantom user instantiation overhead – The overhead involved in instantiating a phantom user is the time taken by the preprocessing component plus the time needed to extract fake values from the hashDB. The overhead here is mainly due to file I/O while supplying the phantom user with fake values of the appropriate type. But since the number of distinct fields stored is small, the entire hashDB file can be loaded into memory at the start of execution, thereby reducing the overhead. Instantiation of phantom users took, on average, 12 milliseconds, with a standard deviation of 510 microseconds.

• Response analysis overhead – The response analysis overhead is the time taken to post the phantom users' responses plus the time taken for analysis. The delay due to response analysis was 23.5 milliseconds. It is important to keep in mind that inactive Websites might take longer because PHONEY has to wait until timeout.

From these observations, it can be safely concluded that PHONEY does not introduce any significant computational overhead in the system. Also, the modular nature of the individual sub-components provides hooks to replace existing modules with more efficient variants, without affecting the overall performance. To illustrate the working of PHONEY graphically, the interaction between the phantom user and the phisher's Website is captured by hooking the detection engine as an ActiveX plug-in. The obtained results are summarized in the following three case studies.

Example 1

In the first example, a simple e-mail based phishing attack targeting Regions bank users is considered. First, a phisher sends spoofed e-mails in HTML format requesting users to supply login credentials by clicking an embedded link. The visible part of the embedded link https://secure.regionsnet.com/EBanking/logon/user?a=defaultAffiliate masks the reference to the spoofed Website http://www.club-daich.com/.checking/regions/.

Figure 4.6: Phantom users supplying fake login information to the spoofed Website
Figure 4.7: The detection engine flags the Website malicious

Such attacks can be easily determined by the preprocessing engine, as shown in Figure 4.6, which bases its decisions on such noticeable differences. Also, to further validate the claim, the phishing Website is supplied with fake authentication credentials as shown in Figure 4.7. Upon submission, the Website predictably refers to another page asking for credit card related information, thereby triggering PHONEY to raise an alarm. As most e-mail based phishing scams (10 out of 12 Websites replicated locally) adopt a similar attack model, PHONEY can trivially detect such attacks.

Example 2

In the second example, PHONEY is tested on a phishing Website targeting eBay users. The spoofed e-mail has a URL which redirects users to a phishing Website, http://www.cba.or.th/member/. There were two noticeable differences in this phishing site: (a) this site attempts to spoof its URL as legitimate using an Internet Explorer (IE) vulnerability. On the test machine, this spoofed URL was clearly detected since the machine was patched (it is important to note that this is not the reason why PHONEY tagged this site as phishing; since testing relies on evaluating the response of the spoofed Website, it is reasonable to assume that PHONEY is effective even if IE were unpatched); and (b) the behavior of this site is different from the other cases. Upon submission of honeytokens, the user is asked to enter the same login credentials again. Only when the submission is made a second time in the same browser session is the user redirected to another Webpage asking for more information. This is an excellent social engineering tactic, where the phisher assumes that a naive user, on receiving an e-mail about account suspension, would hastily type in wrong credentials. PHONEY can be tuned to repeatedly test the target Webpage to ensure correctness. Though an attacker can replay the same strategy by not allowing the user to log in for a repeated number of attempts, thereby successfully escaping detection, such cases can be unfavorable from the phisher's standpoint, as users who are consistently denied access may grow suspicious.

Example 3

The third example shows the working of PHONEY on a Hotmail phishing Website, which uses JavaScript for validating inputs. This Website exhibits more peculiar behavior than the rest, causing PHONEY to tag it as legitimate. Fortunately, since the e-mail was composed in text/html, the linguistic based approaches correctly tagged it as phishing. The reason for the failure is as follows: usually, when users type in their username in Hotmail and move to the password field, a script automatically appends @hotmail.com as a suffix to the username. This script was included in the phishing site as well. However, as execution of JavaScript is disabled in PHONEY by default, submitting contrived values pops up another JavaScript box asking for the information to be entered again. PHONEY determines this behavior to be legitimate and incorrectly allows the e-mail to pass through. Even though such out-of-the-ordinary behavior results in false negatives, it can be avoided by having custom rules that handle such cases effectively.

Experimentation on Live Phishing Websites

As mentioned earlier, PHONEY is tested against 274 live phishing Websites. These Websites are obtained from the PhishTank repository as XML feeds. Since Websites tagged as phishing by the PhishTank community are taken down immediately, leaving very little time for testing, PHONEY is tested only against the recently submitted unverified entries. For its operation, PhishTank relies on human volunteers and domain experts to manually inspect and validate suspect Websites. The outcome of PHONEY is later compared with the results from PhishTank to get estimates of the detection and false positive rates. The major experimental findings are summarized as follows.

PHONEY acts on all 274 Websites by parsing their HTML content in order to extract the embedded HTML forms. Depending on the fields present in the forms, appropriate honeytokens are supplied to them. Unfortunately, not all the Websites have login forms in the target Webpage. Even worse, a significant fraction of phishing Websites do not contain HTML forms at all. To be precise, 87 of the 274 Websites (∼32%) did not contain any HTML forms in the target Webpage. This can be attributed to the following two reasons: (a) even though a phishing Webpage is taken down, the hosting Website might still be up and running. For example, an administrator might choose to remove a phishing Webpage from a legitimate Web hosting domain. Any attempt to access the removed phishing Webpage results in a "URL not found" error on the client side. About 76 (∼88%) of the phishing Websites that did not have any HTML forms fall into this category; and (b) phishers might hide the login page behind multiple intermediate pages. For example, a phisher can request users to click on links in the landing page in order to visit the spoofed Website. The remaining 11 Websites (∼12%) belong to this category. Detecting phishing Websites that are cloaked by multiple legitimate Webpages is a hard problem. However, such behavior is avoided by phishers as it minimizes the probability of a user falling prey to phishing attacks.

About 159 (∼58%) of the 274 e-mails are identified as phishing by both PHONEY and the PhishTank community. These Websites typically had a login page, which, when supplied with honeytokens, transfers users to another page requesting more sensitive information.
Also, in a few cases, users were routed to the legitimate Website's login Webpage. A total of 16 (6%) Websites gave out a "Submit Failed" error when supplied with honeytokens. This can be due to the fact that they either were misconfigured or had JavaScript validating the honeytokens. The remaining 12 Websites are tagged as genuine by both PHONEY and the PhishTank community. This demonstrates that the false positive rate of PHONEY is very low, in fact zero, even when tested against live phishing

Figure 4.8: Phantom user instantiation overhead incurred while testing PHONEY against live phishing Websites
Figure 4.9: Response analysis overhead incurred while testing PHONEY against live phishing Websites

Websites. The cumulative distribution functions (CDFs) of the phantom user instantiation and response analysis overheads are given in Figures 4.8 and 4.9, respectively. As shown in the figures, it takes less than 20 milliseconds to generate honeytokens and to analyze the response of the phishing Websites. Therefore, the net delay added by PHONEY before an e-mail is delivered to the user's mailbox is also negligible.

Experimentation on Legitimate Websites

PHONEY is tested against the legitimate login Webpages of 20 popular phished brands, as given in Table 4.1. None of these Websites is incorrectly tagged as phishing by PHONEY, resulting in zero false positives. Unlike phishing Websites that have HTML forms requesting account numbers, ATM PINs, credit card details, SSNs, etc., these legitimate Webpages have forms that request a username and password only. On average, it took PHONEY 475.09 microseconds with a standard deviation of 242.09 microseconds to parse the legitimate Websites' login pages and submit honeytokens. Meanwhile, the average response analysis overhead for these Websites is 10558.45 microseconds with a standard deviation of 24147.23 microseconds.

PayPal            eBay                      IRS                 Sulake
Google            JPMorgan                  Wells Fargo         HSBC
Bank of America   Lloyds TSB                Poste Italiane      Capital One
Wachovia          NatWest                   Banca di Roma       Citibank
Regions Bank      Royal Bank of Scotland    Barclays Bank PLC   Amazon.com

Table 4.1: List of legitimate Websites used for testing PHONEY. These Websites correspond to the top 20 most phished brands in 2008 as reported by PhishTank

4.2.8 Limitations of PHONEY

A few limitations arise while using PHONEY for detecting phishing Websites. In this section, these limitations are discussed in detail, and some possible workarounds are provided.

PHONEY is vulnerable to man-in-the-middle attacks. It is possible for phishers to evade PHONEY by replaying the response of a legitimate site. Verifying that the target Website has a valid SSL certificate issued by a legitimate certificate authority (CA) before providing the fake inputs can mitigate trivial man-in-the-middle attacks, even though sophisticated attacks are still possible. Also, phishers can reject all inputs from users, claiming them to be invalid. However, such behavior is disastrous from a phisher's standpoint, as users may grow suspicious if they consistently observe their inputs rejected despite providing the correct information.

In order to blindside PHONEY, a phisher can construct spoofed Websites with ill-formed HTML content. As PHONEY's content parser employs the HTML DOM object model, ill-formed HTML content implies that accurate detection of HTML form elements may not be possible. Also, phishers can employ non-standard login forms using JavaScript or Flash content. In such cases, it might not be possible to provide fake responses to the information requested by these forms. Using non-standard forms for login has usability and accessibility issues. Moreover, due to their security weaknesses, there has recently been a lot of negative attention toward using JavaScript for login purposes. In fact, users have been advised to turn off JavaScript while browsing [57]. For these reasons, most legitimate Websites do not use non-standard login forms. Phishing Websites also avoid non-standard login forms, because deviating in content and appearance from the targeted legitimate Website makes them more evident to users and phishing detection tools.

Phishers can employ robot detection schemes like CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) in their Websites to subvert the tool's effort to mimic the responses of legitimate users. Again, due to usability issues, this is not a major problem, as CAPTCHA is widely used for preventing automated registration rather than for user validation. At login time, most legitimate Websites use CAPTCHAs only to protect against dictionary attacks, displaying them after multiple failed login attempts. There might also be legal ramifications of PHONEY consuming the Websites' bandwidth and computation power for its detection purposes. Though the traffic can be contained by the use of caching, like Web crawlers, PHONEY should operate with caution so as not to violate any Website's terms of usage.

4.2.9 Potential Improvements to PHONEY

The limitations of PHONEY stem from the fact that PHONEY is a client-side tool and requires no assistance from the server for its operation. With server-side assistance, it is possible to partially thwart man-in-the-middle attacks. If PHONEY and legitimate servers agree upon the set of honeytokens beforehand, it might be possible for legitimate Websites to identify man-in-the-middle attacks, especially when phishers supply honeytokens to them. The set of honeytokens can be either shared offline or generated online in a dynamic manner. For example, some online banks use transaction numbers, which are sent to users as hardcopy via regular (snail) mail [114], as tokens for out-of-band (OOB) authentication. A user is required to provide the correct transaction number along with the username and password to log into the legitimate Website. If the transaction number is invalid or does not match the corresponding username, login is denied. Using a similar offline mechanism, honeytokens can also be sent to users via snail mail on a periodic basis. However, sharing the honeytokens in a static fashion is not feasible, particularly when the number of users is large. In order to address this issue, techniques have been proposed that generate honeytokens in a dynamic manner such that only the legitimate server and PHONEY can identify

them [129]. On the client side, PHONEY is deployed between the MTA and MUA so that every incoming e-mail can be intercepted and processed. In order to enable server-side assistance, it is required that the same set of honeytokens, or the same honeytoken generation mechanism, be deployed on both client and server. Unlike other approaches that alter the mode of authentication, PHONEY is lightweight as it only alters the inputs used in authentication. One of the impeding factors for the large scale adoption of PHONEY is that it is possible for an attacker to launch denial-of-service attacks by sending out a large number of e-mails with URLs that refer to the login pages of legitimate Websites. This can be overcome by placing sites that have already been verified into a whitelist shared by all PHONEY clients, thereby avoiding redundant checking and reducing server-side bandwidth.
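The dynamic honeytoken generation described above can be illustrated with a keyed-hash construction. The following is only a minimal sketch under stated assumptions, not the scheme of [129]: it assumes a secret key shared in advance between PHONEY and the legitimate server, and the function names and the "pw-" prefix are hypothetical.

```python
import hmac
import hashlib

def make_honeytoken(shared_key: bytes, user_id: str, counter: int) -> str:
    """Derive a deterministic honeytoken that only holders of the
    shared key can later recognize."""
    msg = f"{user_id}:{counter}".encode()
    digest = hmac.new(shared_key, msg, hashlib.sha256).hexdigest()
    # Shape the digest into something resembling a throwaway password.
    return "pw-" + digest[:12]

def is_honeytoken(shared_key: bytes, user_id: str, candidate: str,
                  window: int = 100) -> bool:
    """Server-side check: does a submitted credential match any
    recently issued honeytoken for this user?"""
    return any(make_honeytoken(shared_key, user_id, c) == candidate
               for c in range(window))
```

If a phisher replays a harvested honeytoken at the real login page, the server recognizes it and can conclude that the credentials passed through a phishing site, without any change to the authentication protocol itself.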

4.3 Classifying Spam E-mails based on linked-to Webpage Analysis

Unlike PHONEY, which detects phishing e-mails by examining the behavior of linked-to Websites, the structure and content of linked-to Websites can also be analyzed to classify spam e-mails. Even though spam e-mails do not directly pose a threat to users' security and privacy, they are nevertheless considered a significant nuisance. Recent research has shown that users typically spend 17% of their e-mail time deleting these unwanted e-mails. As a countermeasure, in this section, a novel methodology to classify spam e-mails by analyzing the linked-to Websites is presented.

4.3.1 Approach Overview

In this section, a detailed background on the problem is first provided. Then a brief description of the features used in detection and the classification algorithms is presented. Lastly, the proposed approach is evaluated against “live” spam e-mails and its performance is compared with two popular open source anti-spam tools.

Background

Current spam filtering approaches have evolved from simple rule-based systems into complex machine learning models that treat spam detection as a text classification problem. During the training phase, each spam e-mail is broken down into a set of tokens (keywords). These tokens are used as input data to generate discriminative models, which can later be deployed to tell spam and ham (legitimate) e-mails apart. However, as these filters operate only on keywords present in the message's body, they can be bypassed by encoding the underlying content as attachments in different formats such as JPEG, PDF, ZIP, etc. Such filters are also susceptible to poisoning attacks, where spammers superimpose the message body with large amounts of unrelated or gibberish text, diminishing the classifier's performance. In order to tackle this problem, more recent studies have focused on detecting spam based on non-textual features, such as analyzing the validity of the URLs contained in the e-mails. This is done by either comparing them against a list of blacklisted domains or calculating the probability of them being part of spam e-mails based on prior statistics. However, with the help of botnets, spammers can regularly change the embedded URLs, rendering such list based detection mechanisms impractical. The work presented here differs from the aforementioned approaches as it focuses on detecting spam e-mails by analyzing the linked-to Websites themselves. The main premise is that even if the URLs are changed frequently, the corresponding Websites would contain similar content that is common to spam e-mails. An alternative technique to mitigate spam e-mails involves taking down or throttling servers that spew out spam e-mails. This can be achieved by putting abusive servers into DNS blacklists (DNSBL) or cutting down their privileges at an ISP level. Unfortunately, capture and analysis of network traffic is a tedious process requiring infrastructure level support.
Cryptographic and repudiation mechanisms have also been proposed to prevent the growth of spam e-mails. Similar to network analysis techniques, as they mandate changes at an infrastructure level, they cannot be put into immediate use. The work presented here differs from these approaches as it does not require any change at the network level. Therefore, filtering of e-mails can be done at

the local mail server or mail user agent (MUA).
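The token-based filtering described in this background can be sketched as a minimal Naive Bayes scorer with add-one smoothing. This is an illustrative toy under stated assumptions, not any particular filter's implementation; the function names and the log-odds formulation are choices made here for brevity.

```python
import math
from collections import Counter

def train_counts(emails):
    """emails: list of (tokens, label) pairs, label 'spam' or 'ham'.
    Returns per-class token counts and per-class message totals."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for tokens, label in emails:
        counts[label].update(tokens)
        totals[label] += 1
    return counts, totals

def spam_log_odds(tokens, counts, totals, alpha=1.0):
    """Naive Bayes log-odds of spam given a message's tokens,
    with add-alpha smoothing; positive means 'more likely spam'."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    score = math.log(totals["spam"] / totals["ham"])  # class prior
    n_spam = sum(counts["spam"].values())
    n_ham = sum(counts["ham"].values())
    for t in tokens:
        p_s = (counts["spam"][t] + alpha) / (n_spam + alpha * len(vocab))
        p_h = (counts["ham"][t] + alpha) / (n_ham + alpha * len(vocab))
        score += math.log(p_s / p_h)
    return score

emails = [(["viagra", "cheap", "pills"], "spam"),
          (["meeting", "notes", "agenda"], "ham")]
counts, totals = train_counts(emails)
```

A deployed filter would threshold this score; the poisoning attack mentioned above works precisely by padding a message with tokens whose per-class probabilities are near-equal, dragging the sum toward zero.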

4.3.2 Scope of Detection

In order for the detection mechanism to work, it is imperative that spam e-mails contain URLs linking to external Websites. While it is possible for a spammer to send out blank or text-only messages, such an approach would only detract from the effectiveness of spam e-mails and fail to sustain itself in the long run. This approach cannot be used to detect phishing Websites, because they require a proactive detection methodology, which, instead of examining the Websites' layout, analyzes their behavior. Another closely related problem is Web spam detection. Owing to the monetization of search engines, a large number of obscure domains have sprouted up trying to manipulate the prominence of resources indexed by search engines. Typically, as these Webpages exist only to boost their relative page rank, they are not generally associated with spam e-mails.

4.3.3 Features Used In Detection

There are three textual features, four syntactic features, and two behavioral features used in the classification. In this section, a detailed overview of these features along with their role in the classification process is provided. The features are not extracted from the target Webpage alone; other Webpages in the same domain, which are transitively linked from the target Webpage (with depth ≤ 2), are also considered.

Textual Features

The textual features used in the classification are broken down into three categories depending on the part of the Webpage they are assimilated from.

• N-gram words – The most common type of feature used for spam e-mail and Webpage classification tasks is bag-of-words where words occurring on the Webpage are treated as

features. In its simplest form, the presence (or absence) of certain keywords is used as the basis for classification. For example, since words such as Viagara, OEM, debt, etc., are intrinsic to spam Webpages only, they can be applied to tell ham and spam Webpages apart. Also, in order to remove commonly occurring words and convert each word to its etymological root, stopword elimination and stemming are applied. In order to obtain a tight correlation between each word and the words adjacent to it, 2-gram and 3-gram features are also considered.

• Tokenized anchor text – The anchor text represents the text that appears in an HTML link, i.e., the text occurring between the <a> and </a> tags. Anchor text has been widely used as a feature in classifying spam Webpages. This is because spammers, in order to improve a Webpage's popularity or search engine index score, embed multiple links that enlist the various spam products advertised in it. The extracted anchor texts are tokenized and used as 1-gram features.

• Tokenized title text – The text appearing in the HTML Webpage title usually presents a short synopsis of the kind of product advertised or sold on the page. The title text is enclosed between the <title> and </title> tags. Similar to anchor text, it is also tokenized and used as 1-gram features.
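The tokenization and n-gram extraction described above can be sketched as follows. The abbreviated stopword list is illustrative only, and a real pipeline would also apply a stemmer (e.g., Porter) to reach each word's root, which is omitted here for brevity.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # abbreviated list

def tokenize(text: str) -> list:
    """Lowercase, split on non-letter characters, drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def ngrams(tokens: list, n: int) -> list:
    """Adjacent n-word features, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Buy cheap meds and save today")
# 1-gram: ['buy', 'cheap', 'meds', 'save', 'today']
# 2-gram: ['buy_cheap', 'cheap_meds', 'meds_save', 'save_today']
```

The same `tokenize` routine applies uniformly to body text, anchor text, and title text; only the n chosen differs between the three textual feature categories.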

Syntactic Features

There are four syntactic features used in the classification represented by their corresponding con- tinuous real valued variables.

• Number of words present – As the name indicates, this feature gives a measure of the average number of words present in every Webpage of a spam Website. As spam Webpages tend to be more cluttered, they typically have more words per page than ham Webpages.

• Number of links present – This feature gives a measure of the average number of links present in spam Webpages. Analogous to the number of words, the number of links present in spam Webpages tends to be higher than in ham Webpages. As most links have anchor text embedded in them, the number of links present is the same as the number of anchor texts present in a Webpage.

• Compression ratio – The compression ratio is used to identify the repeatability of text in spam Webpages. In order to either increase the page rank or advertise effectively, spammers tend to compose Webpages using a fixed set of keywords. Therefore, this syntactic trait intrinsic to spam Webpages is captured by compression ratio, which is computed as follows:

compression ratio = (total # of unique words) / (total # of words).

• Average word length – The average word length is a syntactic feature that identifies abnormal Webpages depending on their word length. It is given by the following ratio:

average word length = (total # of characters) / (total # of words).
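Assuming each Webpage of a site has already been tokenized, three of the syntactic features above can be computed as below. The function and dictionary key names are illustrative, not part of the original design.

```python
def syntactic_features(pages: list) -> dict:
    """pages: list of token lists, one per Webpage in the site.
    Returns per-site averages of the syntactic traits."""
    n_words = sum(len(p) for p in pages)
    n_unique = len({w for p in pages for w in p})
    n_chars = sum(len(w) for p in pages for w in p)
    return {
        "avg_words_per_page": n_words / len(pages),
        # Low ratio = highly repetitive text, typical of spam pages.
        "compression_ratio": n_unique / n_words,
        "avg_word_length": n_chars / n_words,
    }

feats = syntactic_features([["viagra", "cheap", "viagra"], ["cheap", "deal"]])
```

On this toy two-page site, only three of the five tokens are distinct, so the compression ratio is 3/5; a legitimate page with varied vocabulary would sit closer to 1.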

Behavioral Features

• Number of redirections – Spammers usually cloak the target Webpage using multiple layers of redirection. Redirection is used to bypass filters that examine the validity of only the first Webpage instead of the target Webpage. HTTP servers can forward clients to the target Webpage by automatically setting the appropriate status codes (i.e., HTTP 302, HTTP 303 or HTTP 307) in the Web server's response. JavaScript that gets invoked on the client side can also be used for this purpose. Recent research has shown that spammers use an average of 1.2 forwards per URL [10]. The spam dataset considered in this dissertation has an average of 0.56 forwards per URL, with some URLs having up to 9 forwards.

• Is blocked by DNSBL? – This is a binary feature with values 0 and 1 representing ham and spam Webpages respectively. If a hosting domain or majority of the links present in a Website are blacklisted by domain name system blacklists (DNSBLs), then a value 1 is

assigned to the corresponding Website; otherwise, 0 is assigned to it. Although they might seem effective, DNSBLs cannot be used as the sole criterion for classification due to their high false positive rates. Usually, DNSBLs are also not properly regulated, making it possible for spammers to gain access to them and block out legitimate Websites.
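The redirection-counting feature can be sketched offline as follows. The `responses` mapping is a hypothetical stand-in for live HTTP fetches, recording each URL's status code and Location header; a deployed extractor would obtain these from actual server responses.

```python
REDIRECT_CODES = {301, 302, 303, 307}

def count_redirections(start_url: str, responses: dict, limit: int = 20) -> int:
    """Follow a chain of (status, location) records and count hops.
    The limit guards against redirect loops."""
    url, hops = start_url, 0
    while hops < limit:
        status, location = responses.get(url, (200, None))
        if status not in REDIRECT_CODES or location is None:
            break  # reached a non-redirecting (target) page
        url, hops = location, hops + 1
    return hops

chain = {"http://a.example": (302, "http://b.example"),
         "http://b.example": (307, "http://c.example")}
```

Here the chain a → b → c counts two forwards; JavaScript-driven redirects, noted above as an alternative cloaking mechanism, would not appear in these status codes and need separate handling.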

4.3.4 Classification Algorithms

In addition to the three classifiers used in Section 3.6.2, K-Nearest Neighbor (KNN) and Neural Networks (NN) are also used in classification. A brief summary of their working is given below.

K-Nearest Neighbor – K-Nearest Neighbor (KNN) is an instance based learning technique used in text classification. It classifies a document based on the closest training examples in the feature

space. The input consists of a set of training examples of the form S_n = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where each x_i ∈ R^N represents a feature vector and y_i ∈ {±1} indicates whether the feature vector is a positive or a negative example. Given a test vector x_*, the goal is to find the k training samples whose distance to x_* is minimum. The distance between x_* and any point x_i ∈ S_n is given as follows:

D(x_*, x_i) = Σ_j w_j δ(x_*j, x_ij)    (4.1)

The distance metric δ(x_*j, x_ij) can be computed by different techniques. One basic version for continuous and discrete features would be:

δ(x_*j, x_ij) = 0 if feature j is discrete and x_*j = x_ij
δ(x_*j, x_ij) = 1 if feature j is discrete and x_*j ≠ x_ij
δ(x_*j, x_ij) = |x_ij - x_*j| if feature j is continuous    (4.2)

Other similarity metrics like Euclidean distance and Cosine similarity can also be used to calculate the distance. Once the k nearest neighbors are selected, y_* is assigned a class label depending

Figure 4.10: An example showing KNN classification with k = 3. The red symbols denote the test samples (x_*) after the assignment of correct class labels; the remaining two symbols denote the two class labels used in classification

on their class labels. Since classifying e-mails based on spam Websites is a binary classification task, it is preferable that k is chosen to be an odd number, so that the label of the majority of the k points

can be assigned to y_*, as shown in Figure 4.10. Rather than the number of points, the similarity of the k

points close to x_* can also be used to assign class labels. KNN is preferred in classification due to the fact that it is computationally inexpensive, requiring no model building during training. In fact, it has been widely applied to classify Web documents based on their content and structural layout.

Neural Networks – Neural networks derive their name from the network of interconnected neurons

in the human brain. A neural network consists of a set of processing units U = {u_1, u_2, ..., u_n} and arcs between processing units L = {(i, j) | u_i ∈ U, u_j ∈ U} ⊆ U × U. Each unit receives input from the other units via incoming links or from the outside, and sends output to the other linked units or to the outside. The input-output relation at each unit u_i can be defined as follows using the output

Figure 4.11: A neural network with feedforward links and three layers

function g_i and the state transition function f_i:

Output function: o_j(t) = g_j(a_j(t))
State transition function: a_j(t + 1) = f_j(Σ_i w_ji o_i(t))

where a_j(t) and o_j(t) represent the state of unit j at time t and the output of unit j at time t, respectively, and w_ji represents the weight of the link from processing unit i to j. The input-output relation depends on the state of the processing unit, which is dynamically changed by learning from the training examples. Conceptually speaking, as shown in Figure 4.11, a neural network has three layers, viz. 1) an input layer, which receives stimulus from the environment; in this example, it consists of the features that form the positive and negative examples applicable to spam Websites; 2) an output layer, which maps the input to one of the predefined classes; and 3) one or more hidden layers consisting of intermediate processing units, which encapsulate state transitions from the input layer to the output layer in accordance with the learned state transition function. The learning can be via feedforward or backward propagation mechanisms. In backward propagation learning, as opposed to feedforward learning, there can be feedback links from processing units closer to the output layer to the processing units closer to the input layer. In this chapter, the multilayer perceptron, a feedforward neural

network, is used. Refer to [73] for a more rigorous explanation of neural networks.
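A minimal sketch of the KNN rule can be put together from the weighted per-feature distance described above. The sketch assumes discrete features are represented as strings and continuous ones as numbers; the per-feature distance contributes 0 for a matching discrete value, 1 for a mismatch, and the absolute difference for continuous values. Function names and the toy feature vectors are illustrative.

```python
def feature_distance(a, b) -> float:
    """Per-feature distance: exact-match test for discrete (string)
    values, absolute difference for continuous ones."""
    if isinstance(a, str):
        return 0.0 if a == b else 1.0
    return abs(a - b)

def knn_predict(train, x, k=3, weights=None):
    """train: list of (feature_vector, label) with labels in {+1, -1}.
    Overall distance is a weighted sum of per-feature distances; the
    prediction is the majority vote of the k nearest neighbors."""
    weights = weights or [1.0] * len(x)
    dists = sorted(
        (sum(w * feature_distance(xt, xv)
             for w, xt, xv in zip(weights, x, vec)), y)
        for vec, y in train
    )
    votes = sum(y for _, y in dists[:k])
    return 1 if votes > 0 else -1

train = [([0.0, "a"], 1), ([0.1, "a"], 1), ([0.2, "a"], 1),
         ([5.0, "b"], -1), ([5.1, "b"], -1)]
```

With an odd k the vote cannot tie, matching the recommendation above for binary classification.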

4.3.5 Experimentation

Corpus

The experimentation is carried out with two e-mail datasets, one consisting of spam e-mails and the other consisting of ham e-mails. As with PHONEY, the experimentation requires that the Websites linked from URLs present in the e-mails are live and active. It is important to note that, unlike phishing Websites, spam Websites are not considered criminal, and thus are not actively sought after and taken down by Internet law enforcement agencies. As a result, most spam Websites have better longevity than Websites that host online scams such as phishing [89]. The spam dataset is obtained from Spam Archive [2], a popular and free spam feed, which is updated on a daily basis. It consists of a total of 1690 e-mails that link to 1346 unique Websites, obtained over a period of two days in August 2008. The ham dataset is constructed from 6950 SpamAssassin ham e-mails [109]. Around 4952 of the 6950 ham e-mails have URLs referring to 1536 Websites. The linked-to Websites are downloaded using the wget UNIX utility.

Methodology

In order to make the testing rigorous, experimentation is carried out on three different feature set configurations: (i) F_3gram consists of all the syntactic and behavioral features; frequently occurring words in a Website's body are captured as 3-gram features; (ii) F_2gram is similar to F_3gram except that the frequently occurring words are captured as 2-gram features; and (iii) in F_1gram, frequently appearing words are not grouped together, and are considered individually. As before, 10-fold cross validation is used, where the entire dataset is divided into 10 subgroups randomly. One subgroup is used for testing and the remaining 9 subgroups are used to form the ground truth.
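The 10-fold split described above can be sketched as follows; the seeded shuffle is an implementation choice made here so that the random partition is reproducible across runs.

```python
import random

def ten_fold_splits(items, seed=0):
    """Shuffle once, then yield (train, test) pairs so that each
    tenth of the data serves exactly once as the test fold."""
    data = list(items)
    random.Random(seed).shuffle(data)
    folds = [data[i::10] for i in range(10)]
    for i in range(10):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(ten_fold_splits(range(100)))
```

Every sample appears in exactly one test fold, so the detection and false positive rates reported below are averages over ten disjoint test sets.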

Table 4.2: Performance of machine learning algorithms when tested against spam and ham linked-to Websites

Feature Set   Detection Algorithm   Detection Rate   False Positive Rate   Precision   Recall   f1 statistic
F_3gram       Naïve Bayes           91.8             4.2                   93.6        91.8     92.1
F_3gram       SVM                   98.8             2.2                   98.8        98.8     98.8
F_3gram       Decision Tree         97.3             5.6                   97.3        97.3     95.4
F_3gram       K-NN                  98.8             3.9                   98.8        98.8     98.8
F_3gram       Voted Perceptron      79.9             63.1                  75.8        79.9     76.6
F_2gram       Naïve Bayes           87.6             1.6                   87.6        92.1     93.7
F_2gram       SVM                   98.8             2.4                   98.8        98.8     98.8
F_2gram       Decision Tree         97.2             5.9                   97.2        97.2     97.2
F_2gram       K-NN                  98.8             3.7                   98.9        98.8     98.8
F_2gram       Voted Perceptron      77.9             67.6                  73.18       77.9     60.7
F_1gram       Naïve Bayes           91.8             4.2                   93.69       91.8     93.8
F_1gram       SVM                   98.5             3.4                   98.8        99.4     98.9
F_1gram       Decision Tree         97.1             6.3                   97.1        97.1     93.8
F_1gram       K-NN                  99.1             2.9                   99.0        99.1     99.0
F_1gram       Voted Perceptron      77.7             66.3                  73.3        77.7     74.6

Results

Table 4.2 shows the performance of the classification algorithms in detecting spam Websites. All algorithms except the voted perceptron show good overall detection and false positive rates. Also, since there is a marked difference in the content of spam and ham Websites, 1-gram features seem to perform on par with 2-gram and 3-gram features. SVM exhibits very low false positive rates when compared with the other four algorithms. On the other hand, K-NN tends to perform the best, with very high detection rates, because spam Websites tend to be closely related to each other. Since K-NN does not involve estimation of prior probabilities, it is also relatively fast when compared with the other algorithms. The voted perceptron tends to perform poorly, with low detection and high false positive rates, making it unsuitable for detecting spam Websites. Vipul's Razor had a detection rate of only 69.2%. As Vipul's Razor expects that a hash of a spam e-mail is already present on its server, e-mails with non-matching hash values are seldom detected. This is also the

case with image spam and spam e-mails with obfuscated text. However, the false positive rate of Vipul's Razor was low; in fact, only 12 of the 1536 ham Websites were marked as spam. SpamAssassin, on the other hand, mainly concentrates on features present in the e-mail headers. Vipul's Razor can also be added as a plug-in to SpamAssassin. Moreover, SpamAssassin has neural network and Bayesian filtering capabilities that can be tweaked to customized settings. In the experiments, SpamAssassin was used in the learning mode. SpamAssassin was able to detect 89.8% of spam e-mails with 1% false positives. This significantly low false positive rate can be attributed to the fact that the ham dataset was actually obtained from the SpamAssassin Website.

4.4 Summary

This work focuses on detecting spurious e-mails by analyzing the behavior and the layout of the Websites referred to by URLs contained in them. To begin with, difficulties associated with the detection of image-based and obfuscated spam e-mails are highlighted. In order to overcome these difficulties, PHONEY, a client-side tool, is proposed to detect phishing Websites using challenge-response analysis. PHONEY is tested against 274 live phishing Websites. Experimental results demonstrate that it is able to detect all the “live” Websites that have HTML forms, with zero false positives. Behavioral testing cannot be used to detect Websites that are linked from spam e-mails. In this regard, a set of features that accurately characterizes the Websites referred to by URLs in spam e-mails is identified. Once the features are identified, classification is done by employing five different machine learning algorithms. Experimental results again show that even spam e-mails can be detected with high accuracy by analyzing their linked-to Websites' properties. On a general note, identifying the Webpages referred to by spurious e-mails also provides better insight into the credibility of the domains that host them, thus allowing for more preventive actions.

Chapter 5

Detecting Information Leak Based E-mail Threats

“If knowledge is power, clandestine knowledge is power squared; it can be withheld, exchanged, and leveraged.”

− Letty Cottin

5.1 Introduction

In this chapter, two facets of information leak that affect day-to-day e-mail users are discussed. The first part of this chapter deals with detecting evasive spyware that stealthily leaks user browsing characteristics to external attackers, whereas the second part deals with preventing a passive threat, where a user inadvertently leaks information by sending misdirected e-mails to unintended recipients. The rest of this chapter is divided into two components that address each of these threats in a piecemeal fashion.

The e-mail based threats discussed thus far have been treated under the assumption that the attackers who trigger them do not have any prior knowledge about the potential victims. However, this is not always the case, as these threats can also manifest in a context aware manner, where an attacker could gain access to information about users' browsing history and affiliated financial organizations, and apply it to launch more serious “targeted” attacks. This is particularly pertinent to

phishing e-mails. Recently, a sophisticated form of phishing attack, known as spear phishing, has emerged in which the phisher uses knowledge learned about individuals to send customized spoofed e-mails. Such highly customized e-mails implicitly have increased trustworthiness, and can effectively fool users into responding to them. Furthermore, these e-mails can bypass even specialized e-mail filters like CUSP (introduced in Chapter 3), which tags e-mails sent from unknown financial organizations as suspicious. A user study by Jakobsson et al. [63] reveals that such context aware attacks are more effective than normal phishing attacks, with the success rate of users falling prey to them ranging from 48% to 96%. These context aware attacks are usually carried out in three phases: (i) in the first phase, attackers harvest information about users’ browsing history and their affiliated financial organizations using various side-channel mechanisms. For example, an attacker could extract details about each user from publicly open social network Websites, or pay third-party advertisement agencies authorized to disseminate such information. Even though such soft attacks on users’ privacy sometimes yield the information necessary to launch context aware attacks, they are limited in scope. This is mainly because the set of potential victims whose information can be mined using these attacks is very small. The most feasible way, however, is to employ malware that leaks users’ browsing activities or e-mail content without their knowledge; (ii) in the second phase, attackers put the harvested information to use to compose targeted e-mails that appear to come from legitimate financial institutions.
Since these e-mails are tailored on a per-user basis, their success rate is high compared to normal phishing attacks; and (iii) in the third phase, attackers collect the sensitive information divulged by users in spoofed Websites. So far, most defense mechanisms focus primarily on stopping phishing attacks in the last two phases, and very few efforts have been proposed to curb the leak of user information by malware in the information gathering phase. There are a variety of ways through which malware can be installed on users’ computers. For instance, attackers could exploit vulnerabilities present in the computers and install the malware without the users’ knowledge. Users, on the other hand, could inadvertently run malicious executables

capable of hooking themselves onto their systems. Also, the capabilities of these malwares may vary with respect to their mode of operation. Keyloggers log keystrokes for the purpose of surreptitiously capturing usernames and passwords. Rootkits may track user activities, and yet remain hidden so that defense mechanisms cannot detect them. The more common threat, however, is from spyware that stealthily leaks user browsing activities to remote hosts. Unlike trojan programs and rootkits, spyware does not have complete control over the system. Instead, it hooks atop users’ browsers as toolbars or browser helper objects (BHOs), and sniffs outbound HTTP requests and the browser cache. Also, unlike other malware that is outright illegal, the legality of spyware is widely disputed. As spyware is usually bundled with legitimate software, it is installed by users who do not thoroughly read the end user license agreement (EULA) [55]. Nevertheless, due to the risk posed to users’ privacy, most present day anti-spyware systems brand such spyware programs illegitimate, and actively attempt to quarantine them. Schemes have been proposed [61, 64] that prevent spyware from sniffing the browser cache by setting adequate access control rules. Despite these defense mechanisms, the threat from spyware that monitors and leaks user browsing activities still remains at large. Although traditional detection methodologies employing signature and anomaly based systems have had reasonable success in stopping spyware that leaks browsing characteristics, a new class of spyware has emerged that blends in with user activities to avoid detection. The latest anti-spyware technology consists of a local agent that generates honeytokens with known parameters (e.g., static HTTP requests) and tricks the spyware into assuming them to be legitimate activity.
However, such static honeytoken based techniques can be easily circumvented by next generation spyware by means of data mining algorithms like associative rule mining [8]. In this regard, this dissertation addresses the deficiencies of static honeytoken techniques, and takes a step further by proposing a novel randomized generation mechanism to mitigate even this next generation spyware. Experimental results show that static honeytokens are detected with near 100% accuracy, whereas randomized honeytokens are similar to realistic user activity, and hence are indistinguishable.

Unlike spyware that harvests and employs user browsing characteristics to launch targeted attacks, a user may also inadvertently disclose personal information to an unintended set of recipients, resulting in a more direct information loss. This is even more applicable in a corporate environment, where an inadvertent disclosure of private corporate documents to outsiders via e-mail can result in lawsuits, tarnished reputation and financial damages. Traditionally, such information leaks are mitigated in an ex post facto manner. In the case where the culprit e-mail is not available for reference, using the information that has been leaked as a stepping stone, internal mail audit logs and sent messages are examined by system administrators and forensics analysts to determine the account through which the leak happened. However, such after-the-fact solutions have two potential pitfalls: (i) these solutions only serve to identify the careless users who caused the information leaks; the actual damage caused by the disclosure is often irrevocable; and (ii) in situations where e-mail audit logs are not maintained, owing to the lack of non-repudiation mechanisms in the e-mail infrastructure, a user can altogether deny that the leak happened from his e-mail account, thus avoiding the consequences of his actions. Hence, it is important that information leaks are prevented before they actually occur. Rather than users mistyping the recipients’ addresses, information leaks can also take place due to over-aggressive e-mail clients that incorrectly complete partial e-mail addresses as users type them. Even though this problem has existed from the beginning of e-mail, Carvalho and Cohen were the first to study and address it in a practical manner [24]. In their seminal work, they formulated detecting information leaks via e-mails as an outlier detection problem, where the characteristics of past e-mail messages sent by a user to a recipient are used as a basis to discriminate future e-mail messages that vary significantly from the past messages.
However, their classification of e-mails sent to the “leak-recipient” was based on textual and social network features only. The stylometric and structural characteristics of e-mails, which encompass the nature of communication between two parties, were ignored in their approach. For instance, a personal e-mail differs markedly from a professional e-mail in size, composition, format, etc. As a result, their classification technique using the textual content could predict information leaks, on average, in only 55% of the test cases. In order to increase the performance, this dissertation explores the effect of employing a variety of structural and stylometric features

on the detection process, while keeping the same experimental setup and methodology as given in [24].

5.1.1 Contributions

The main contributions of this chapter are two-fold:

• A Randomized Honeytoken Based Methodology to Detect Evasive Spyware – A novel methodology to detect evasive spyware that steals users’ browsing activities is presented. In this regard, first, a new class of “intelligent” spyware called SpyZen is presented that not only bypasses anomaly detection systems like [20], but also actively defeats static honeytoken based schemes. Then, a novel randomized honeytoken based technique, SpyCon, that builds on prior work [20] is proposed to detect this class of SpyZen-like threats.

• Preventing Information Leak via E-mails Using Stylometric and Structural Features – The feasibility of employing stylometric and structural features in preventing information leaks via e-mails is explored. Three sets of features that characterize the surface level properties of text at the sentence, word and character levels are identified. Experimentation on a real world e-mail corpus reveals that the proposed technique can detect over 78% of synthesized information leaks, outperforming the existing technique [24].

5.1.2 Chapter Organization

The remainder of this chapter is organized as follows: In Section 5.2, first, an overview of evasive spyware that leaks users’ browsing activities is provided. Then, the weakness of state-of-the-art honeytoken based anti-spyware schemes is highlighted by synthesizing a new class of intelligent spyware that can easily blindside them. Finally, a novel randomized honeytoken generation scheme is proposed as a countermeasure to detect and mitigate even this new class of intelligent spyware. Section 5.3 elaborates on detecting information leaks, which could happen as

a result of users inadvertently sending e-mail to an unintended set of recipients, using structural and stylometric features. The concluding remarks of this chapter are provided in Section 5.4.

5.2 SpyCon: Emulating User Activities to Detect Evasive Spyware

This section begins by providing the necessary background on the operation of evasive spyware that surreptitiously steals user browsing activities, and on the existing state-of-the-art techniques that have been proposed to mitigate it. The limitations of existing honeytoken defense mechanisms are then presented in a formal manner, and are used as a foundation in the implementation of SpyZen and SpyCon.

5.2.1 Background

Potentially Unwanted Programs (PUPs) is the collective term given to programs whose presence poses a serious security and privacy threat to users [84]. This includes malware, spyware, adware and a myriad of other programs. The commercial incentives of these programs are lucrative enough for this ‘industry’ to thrive, and according to some projections [84], they are expected to rise at exponential rates in the future. Furthermore, a recent study estimates that 13.4% of 21,200 executables downloaded from 18 million Web pages were infected with spyware programs [90]. The success of any spyware on a system is determined by its ability to evade detection. Towards this goal, early spyware had the advantage of user ignorance and the lack of security mechanisms or tools to detect and remove it. Since then, various anti-spyware mechanisms, such as toolbars and various detection and removal tools, have been developed. These defense solutions employ either signature based or anomaly detection (flow based) philosophies. Even though signature based systems have the advantage of detecting known spyware with a high degree of accuracy, they are incapable of detecting novel threats. Anomaly detection schemes, on the other hand, are capable of detecting

new threats with reasonable accuracy. They operate on the premise that any behavior observed in a system that deviates from the ‘normal’ behavior is indicative of the presence of unauthorized processes. But, as the defenses mature, so do the attacks. Spyware authors have responded to these defense mechanisms by altering their code to create new families and variants that are either incremental updates or binary obfuscations of earlier code. Such ‘new’ spyware has an inherent advantage against signature based analysis, since a known profile of the spyware is not immediately available for comparison. Hence the major threat for new spyware is primarily from anomaly detection systems, which compare the normal behavior of systems (based on parameters like websites visited, network connections initiated, etc.) to the behavior of systems infected with spyware. To counteract these anomaly detection schemes, current spyware attempts to blend in with legitimate behavior. For example, a spyware may contact its remote home servers only when it detects user activity, thereby blending in with the ‘normal’ profile. To address these kinds of threats, the work in [20] proposed a system where a local agent would generate a sequence of honeytokens. Honeytokens are defined as “honeypots that are not computer systems” whose value “lies not in their use, but abuse.” In the context of this chapter, a set of network requests generated by the local agent (to emulate legitimate user activity) represents the honeytokens. The idea behind this agent is that spyware, if present, would operate when the local agent generates the honeytoken, thereby giving itself away. For example, a spyware may operate only when it detects user activity and remain passive otherwise. In other words, the local agent, called Siren, generates honeytokens that mimic user activity during idle periods; the spyware operates assuming the user is active and hence can be caught by a network based IDS.
This is the current state of the art in spyware detection. Given the history of spyware creation and operation, it is quite likely that the next update of spyware will attempt to bypass this mechanism too. The work in this chapter presents a methodology whereby simple mechanisms like the honeytoken generation in [20] can be detected by a new class of spyware called SpyZen. The basic concept behind the proposed scheme is quite intuitive. SpyZen operates in three stages. The first stage is an ‘install and observe’ stage, where the spyware merely listens to the sequence of events, of which certain portions are honeytokens. In the second stage, analysis and inference, the spyware detects the honeytoken sequence. Using data analysis algorithms like associative rule mining, the spyware can infer the honeytoken generation. The third stage is the actual operation stage, where the spyware operates only when the honeytoken sequence is not detected. A defense mechanism against this new class of spyware, called SpyCon, which utilizes a randomized honeytoken generation scheme, is then presented. Thus, the solution presented here takes as its point of departure the work by Borders et al., where the authors present a honeytoken based defense mechanism against spyware that operates only when there is user activity (thereby masking its network access under the guise of user operation) [20].

Figure 5.1: Working of honeytoken based spyware detection mechanisms. (The local agent injects honeytoken HTTP requests into the stream of user-initiated outgoing HTTP requests from the local host; the network intrusion detection system (NIDS) raises an alarm if the honeytoken HTTP requests are tampered with.)

5.2.2 Problem Statement

Consider the normal browsing activity of the user on a local host. This could be abstracted as a series of outgoing and incoming HTTP requests. An external network intrusion detection system (NIDS) and a local agent operate in conjunction to monitor and classify the normal operation of the local user. As shown in Figure 5.1, the local agent (hereafter referred to as LA) generates a honeytoken

(a sequence of HTTP or network requests) that is known to the NIDS. This generation is initiated only if there is no user activity. The first problem, from the viewpoint of the spyware, is to differentiate between the real user activity sequence and the honeytoken sequence. The second (and the main) problem, from the viewpoint of the LA, is to generate a sufficiently random honeytoken sequence so as to prevent the spyware from inferring the honeytoken. This was suggested as an open problem in [20], where the authors aptly named it the passive reverse Turing test. Although the methodology suggested here does not solve the open problem per se, the inadequacy of deterministic honeytoken generation mechanisms is highlighted, and randomized honeytoken generation algorithms to overcome it are proposed.

Problem Formulation

Towards constructing a formal representation of the problem, the following terms are defined.

Definition 5.1. A honeytoken session N^h is defined as a sequence of n HTTP requests N^h = {N^h_{req-1}, N^h_{req-2}, ..., N^h_{req-n}} where the following properties hold true:

1. ∀ 1 ≤ i < j ≤ n, time(N^h_{req-i}) < time(N^h_{req-j}). That is, the request N^h_{req-i} is made before the request N^h_{req-j}.

2. ∀ 1 ≤ i < j ≤ n, the sequence between N^h_{req-i} and N^h_{req-j} is known or predetermined.

Definition 5.2. A user browser activity session N^u is defined as a sequence of n HTTP requests N^u = {N^u_{req-1}, N^u_{req-2}, ..., N^u_{req-n}} where the following properties hold true:

1. ∀ 1 ≤ i < j ≤ n, time(N^u_{req-i}) < time(N^u_{req-j}).

2. ∀ 1 ≤ i < j ≤ n, there exists no relation between N^u_{req-i} and N^u_{req-j}.

Definition 5.3. A recorded HTTP request pattern N^r is a mixed sequence of network accesses N^r = {N^r_{rec-1}, N^r_{rec-2}, ..., N^r_{rec-n}} where:

1. ∀ 1 ≤ i < j ≤ n, time(N^r_{rec-i}) < time(N^r_{rec-j}).

2. ∀ 1 ≤ i ≤ n, N^r_{rec-i} ∈ N^h ∪ N^u.
Thus, the recorded sequence is a mixture of the honeytoken sequence and legitimate user activity. The problem, therefore, is to separate a given recorded HTTP request pattern (in real time, as a stream) into its constituent components, viz. the honeytoken sequence and the legitimate user activity. This can be defined as follows.

Problem Statement 1. Devise an algorithm Algo-Find-Honeytoken (A-FH) such that:

A-FH(N^r) = {N^h, N^u}, where the input is a recorded HTTP request access pattern N^r and the output is the mixed sequence decomposed into N^h and N^u. However, this requires the identification of both honeytokens and user activity. Furthermore, this precludes other HTTP requests like automatic software updates (Windows updates), toolbar activities (Google toolbar), etc. From the viewpoint of the spyware, what is required from the recorded network stream is the extraction of only the honeytoken sequence, so that the spyware may refrain from operating during those durations. This leads to the practical problem formulation given in Problem Statement 2.

Problem Statement 2. Devise an algorithm Algo-Find-HoneyTokenTime (A-FHT) such that:

A-FHT(N^r) = {N^h}

Hence, the spyware can stop operating from t_1 through t_n, where t_1 = time(N^h_{req-1}) and t_n = time(N^h_{req-n}). Note that A-FHT(N^r) will defeat the spyware detection.
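The sequences in Definitions 5.1 through 5.3 can be sketched concretely in code. The sketch below is illustrative only: the `Request` type and its `source` tag are assumptions introduced here, not part of the dissertation's implementation, and in reality the spyware observes N^r without any such label.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    time: float   # timestamp of the HTTP request
    url: str      # requested URL
    source: str   # 'h' (honeytoken) or 'u' (user); hidden from the spyware

# A honeytoken session N^h: strictly time-ordered, sequence predetermined.
n_h = [Request(2.0, "http://example.org/a", "h"),
       Request(3.0, "http://example.org/b", "h")]

# A user browsing session N^u: strictly time-ordered, otherwise unrelated.
n_u = [Request(1.0, "http://news.example.com/", "u"),
       Request(4.0, "http://mail.example.com/", "u")]

# The recorded pattern N^r is the time-ordered merge of the two sessions.
n_r = list(heapq.merge(n_h, n_u, key=lambda r: r.time))

# Property 1 of each definition: strictly increasing timestamps.
assert all(a.time < b.time for a, b in zip(n_r, n_r[1:]))

# A-FHT must recover the honeytoken part without ever seeing `source`.
assert [r for r in n_r if r.source == "h"] == n_h
```

The last line is the ground truth the spyware is after: Problem Statement 2 asks for exactly this subsequence when no `source` tag is available.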

5.2.3 Design of SpyZen

The goal here is to show the inadequacy of static honeytoken based anti-spyware techniques by creating a new generation of spyware, SpyZen, through a slight modification of existing spyware’s operation. As given in the previous section, the effectiveness of honeytoken based anti-spyware mechanisms depends on their ability to mask the honeytoken sequence so that it is hard for the spyware programs

to deduce it. However, since static honeytoken based approaches employ the same sequence of network requests repeatedly, the probability of correlating a sequence of network requests to N^h becomes very large. Hence, frequent observation of the same set of network requests is a strong indication of the occurrence of the honeytoken sequence. According to the definition of evasive spyware, spyware programs leak users’ information only in the presence of user activity. Here, the scope of the threat vector is limited by assuming that the presence of user activity can be perceived only by observing the network requests made, and not by other schemes such as monitoring keystrokes and mouse movements. Furthermore, given the characteristics of static honeytokens, SpyZen is designed to infer the honeytoken sequence N^h from the observed network requests N^r using associative rule mining, which is well suited for mining frequent itemsets and the relationships that exist between them in large samples of data. As a simple and powerful tool, associative rule mining has been employed in many areas of security, such as intrusion detection and spam e-mail detection. Associative rule mining is used for finding relationships between the occurrences of itemsets within transactions, i.e., given a set of items and a set of transactions, rules are generated which link the otherwise disjoint itemsets. Associative rule mining was first proposed by Agrawal et al. [8] for market basket (transactional) data analysis, to determine which items are most frequently purchased together. In the remainder of this section, the basic concepts behind associative rule mining and SpyZen are presented. Then, details on how associative rule mining can be applied to A-FHT are given.

Associative Rule Mining: An Overview

Given a set of disjoint items (I) and a set of transactions (T), where each transaction consists of a subset of the items, associative rule mining is used to determine relationships that exist between the occurrences of items within transactions. Translating into mathematical terms, given a set of items I, an association rule of the form X ↦ Y is a relationship between two itemsets X and Y such that X ⊂ I, Y ⊂ I, X ∩ Y = ∅ (X and Y are disjoint) and X ∪ Y ⊆ I. Moreover, an association rule is described in terms of support and confidence. The support of an itemset X is the fraction of

transactions that contain the itemset, as given below.

Support(X) = (Number of transactions that contain X) / (Total number of transactions)    (5.1)

An itemset is called large if its support exceeds a given threshold Support_min. The confidence of a rule X ↦ Y is the fraction of transactions containing X that also contain Y.

Confidence(X ↦ Y) = Support(X ∪ Y) / Support(X)    (5.2)

The association rule X ↦ Y holds if X ∪ Y is large and the confidence of the rule exceeds a given threshold Confidence_min. The actual process of associative rule mining is carried out in two steps: (a) all large itemsets appearing in the transaction database are determined, and (b) for each large itemset, say Z, appropriate complementary subsets X and Y of Z are found such that the rule

X ↦ Y exceeds Confidence_min.
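As a quick check of Equations (5.1) and (5.2), the two measures can be computed over a toy transaction database; the five transactions below are made up purely for illustration.

```python
# Five hypothetical transactions, each a set of items.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "d"},
]

def support(itemset):
    """Equation (5.1): fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Equation (5.2): Support(X ∪ Y) / Support(X)."""
    return support(x | y) / support(x)

print(support({"a", "b"}))       # 3 of 5 transactions contain {a, b}: 0.6
print(confidence({"a"}, {"b"}))  # 0.6 / 0.8 = 0.75
```

With, say, Support_min = 0.5 and Confidence_min = 0.7, the rule {a} ↦ {b} would hold on this data, since {a, b} is large and the rule's confidence is 0.75.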

Applying Associative Rule Mining to A-FHT

Here, the notion of sessions (or transactions) in the context of associative rule mining is defined. Each session S is viewed as a collection of network requests made by the user without any prolonged period of inactivity. Ideally, each user session N^u that represents a sequence of user initiated requests is a part of S. A honeytoken session N^h is initiated after a prolonged period of user idleness. The union of all observed network requests constitutes the itemset I. With specified support and confidence thresholds, SpyZen extracts all frequent itemsets and generates associative rules to hypothesize about the honeytoken sequence. It is important to note that it might be possible for SpyZen to wrongly mistake frequently occurring user requests N^u for honeytokens. However, in such cases, without loss of generality, it can be assumed that SpyZen can choose to either remain passive or leak the requests by piggybacking on the information gathered during

input : I, S, Support_min, Confidence_min, k_max
output: Candidate itemsets C_k and frequent itemsets L_k of size k

1  set k ← 1;
2  set L_1 ← {frequent itemsets of size 1};
3  for k = 1; L_k ≠ ∅; ++k do
4      set C_{k+1} ← candidate sets generated from L_k;
5      foreach s ∈ S do
6          increment the counts of all candidates in C_{k+1} that are contained in s;
7      end
8      L_{k+1} ← the candidates in C_{k+1} whose support exceeds Support_min;
9  end
10 for k = 1; L_k ≠ ∅; ++k do
11     foreach non-empty subset f of L_k do
12         output the rule f ↦ (L_k − f) if its confidence exceeds Confidence_min;
13     end
14 end

Algorithm 2: Algo-Find-HoneyTokenTime (A-FHT) using the Apriori algorithm to infer honeytokens

other innocuous intervals. The Apriori algorithm is a classic algorithm available in the literature to learn and extract association rules. With this background, the algorithm A-FHT used to extract the honeytoken sequence is described in Algorithm 2.
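The mining step of A-FHT can be sketched as follows. This is a deliberately simplified Apriori pass: the function name, the session data and the 100% support threshold are illustrative assumptions, not the dissertation's implementation.

```python
def apriori_frequent(sessions, support_min):
    """Return every itemset whose support meets support_min (Apriori sketch)."""
    n = len(sessions)

    def sup(c):
        # Equation (5.1): fraction of sessions containing candidate c.
        return sum(c <= s for s in sessions) / n

    # Level 1: frequent single-item sets.
    level = {frozenset([u]) for s in sessions for u in s}
    level = {c for c in level if sup(c) >= support_min}
    frequent = {}
    k = 1
    while level:
        for c in level:
            frequent[c] = sup(c)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        level = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = {c for c in level if sup(c) >= support_min}
        k += 1
    return frequent

# A static honeytoken {h1, h2} replayed in every idle session stands out,
# while ordinary user requests vary from session to session.
sessions = [
    {"h1", "h2", "news", "mail"},
    {"h1", "h2", "search?q=x"},
    {"h1", "h2", "blog", "search?q=y"},
]
freq = apriori_frequent(sessions, support_min=1.0)
print(max(freq, key=len))   # the 2-itemset {h1, h2}: the inferred honeytoken
```

A SpyZen-like program would then suppress its own traffic whenever the mined itemset is observed, which is precisely the behavior the randomized generation of Section 5.2.4 is designed to prevent.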

5.2.4 Design of SpyCon

In order to detect SpyZen, it is essential to change the honeytoken sequence after each trial so that it is not possible for spyware programs to deduce it. To this end, a simple honeytoken generation technique using random Web spidering is proposed to dynamically generate a different set of network requests each time.

Randomized Honeytoken Generation

The efficacy of the randomized SpyCon technique relies on the fact that a different set of network requests is generated each time to act as dynamic honeytokens; as the network requests vary, it is impossible for SpyZen to learn the honeytoken sequence. SpyCon uses a technique similar to

randomized Web spidering.

input : X, a seed Webpage; PRNG_INT(·), a pseudo-random number generator function; S_thresh, the number of hyperlinks needed for a Webpage to be assigned as a seed Webpage
output: A set of randomized honeytokens to thwart SpyZen

1  foreach hyperlink h ∈ X do
2      set k = PRNG_INT(1) /* randomly generate 0 or 1 */;
3      if k == 1 then
4          visit the Webpage referred to by hyperlink h;
5      end
6      if the number of hyperlinks in h > S_thresh then
7          set X ← h;
8      end
9  end
10 if user activity N^u is detected then
11     save the state of X and h, and suspend;
12 end

Algorithm 3: Spidering based randomized honeytoken generation technique

Some Webpage rich in hyperlinks is chosen as a starting point (seed), and then a subset of the hyperlinks present in the seed Webpage is randomly visited to generate a completely different set of network requests. The random seed and the pseudo-random sequence used to visit the hyperlinks are known only to the local agent and the NIDS. A window of previously visited Webpages is maintained in the event that the spidering leads to a dead end with no further hyperlinks. The entire process of generating randomized honeytokens is described in Algorithm 3.
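A minimal sketch of Algorithm 3 is given below. The link graph, function names and parameters are all assumptions made for illustration (a real agent would fetch live Webpages), but the key property is visible: the local agent and the NIDS share the PRNG secret, so the NIDS can regenerate, and thus recognize, the honeytoken sequence.

```python
import random

# Toy link graph standing in for the Web: page -> hyperlinks it contains.
LINKS = {
    "seed": ["p1", "p2", "p3", "p4"],
    "p1":   ["p5", "p6", "p7"],
    "p2":   ["p8"],
    "p4":   ["p9", "p10", "p11"],
}

def honeytokens(seed, secret, s_thresh=3, steps=8):
    """Generate a randomized honeytoken request sequence (Algorithm 3 sketch)."""
    rng = random.Random(secret)   # PRNG seeded with the shared secret
    visited, x = [], seed
    while len(visited) < steps:
        advanced = False
        for h in LINKS.get(x, []):
            if rng.randint(0, 1) == 1:             # PRNG_INT: coin flip per link
                visited.append(h)                  # "visit" the linked Webpage
            if len(LINKS.get(h, [])) >= s_thresh:
                x, advanced = h, True              # h becomes the new seed page
        if not advanced:
            x = seed              # dead end: fall back to the original seed
    return visited[:steps]

# The NIDS, knowing the secret, reproduces the agent's exact sequence and can
# subtract the honeytokens from the recorded stream; re-seeding per trial
# gives SpyZen a different sequence to observe each time.
assert honeytokens("seed", secret=1) == honeytokens("seed", secret=1)
```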

5.2.5 Experimentation

Towards applying the A-FHT algorithm, experimentation was carried out in two stages. In the first stage, a series of user initiated network access patterns was collated. This was obtained from the history logs of the users’ browsers. Specifically, four users contributed their history sessions. Browsing history was collected over a period of 10 days. In the second stage, a honeytoken sequence was inserted into each day’s browsing history. Two types of honeytokens, viz. a deterministic/repetitive honeytoken and the randomized honeytoken

(as defined in Algorithm 3), were used here. The deterministic honeytoken was constructed by inserting links that none of the four users had visited. The honeytokens were then inserted in the periods of inactivity between user browsing sessions (also known as idle sessions). By analyzing the history of the four users, the start of an idle session was determined to be 20 minutes after the completion of the last user initiated HTTP request. Even though this interval might vary for another user set, it is to be noted in passing that the presented scheme works as long as the total idle time is greater than the users’ browsing time. Furthermore, a recent study reveals that even the fraction of adult users who spend the most time on the Web average only 74 minutes a day. Thus, the sequence N^r (the recorded access pattern) for 10 days was formed. The algorithm A-FHT was fed with these recorded network access patterns. The deterministic sequence was easily detected by the A-FHT algorithm, with a support of 100% and a confidence of 100%. The support is a perfect 100% for this reason: as there are some periods of time every day when the user is idle (e.g., lunch time), the deterministic sequence appears in every transaction (or user network activity). The confidence is also high, but for a very different reason. Consider the fact that an average user might browse to a search site (Google) every day. In fact, this was the case for all four users’ profiles. However, each network transaction was recorded in its entirety, i.e., the complete HTTP request. Hence, for a search term “search”, the network link would assume the form http://www.google.com/search?q=search. The exact form of this link varied for the different users depending on the browser (i.e., Internet Explorer or Mozilla in this case) and the manner of launching the search.
For example, IE 6.0 users had a different string when they launched the search from the Google Toolbar. IE 7 users used the built-in search bar and had the simple form described before. Thus, the exact link varied for each search, and even though certain searches were repeated, their support and confidence levels did not approach even 30%. Only the top 5% of the rules (with 95% confidence) were considered for determining honeytokens. For the random honeytoken generation, on the other hand, the support and confidence were nowhere near the 30% level, which is similar to other normal user activity, and hence indistinguishable. Although other data mining techniques could be used, associative rule mining

has been utilized here to illustrate the utility of data mining algorithms in detecting any deterministic honeytokens. The purpose of using data mining techniques is to bring out the inherent weakness of static honeytoken schemes. It is possible that a spyware program using these techniques incurs heavy operating overhead, thereby giving itself away. However, there exist lightweight algorithms which can infer patterns in data streams with relatively low overhead.
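The gap between the two honeytoken styles can be reproduced in miniature. The ten "days" below are synthetic stand-ins for the recorded histories (the real experiment used four users' actual logs), so only the qualitative outcome carries over.

```python
import random

rng = random.Random(7)
URLS = [f"site{i}" for i in range(50)]
STATIC_HT = {"ht-a", "ht-b", "ht-c"}   # same honeytoken replayed every day

days = []
for _ in range(10):                       # ten days of recorded activity N^r
    user = set(rng.sample(URLS, 6))       # varying user-initiated requests
    random_ht = set(rng.sample(URLS, 3))  # fresh randomized honeytoken
    days.append(user | STATIC_HT | random_ht)

def support(itemset):
    # Equation (5.1) with each day's activity treated as one transaction.
    return sum(itemset <= d for d in days) / len(days)

# The deterministic honeytoken appears in every transaction, so its support
# is 1.0 and A-FHT mines it immediately.
print(support(STATIC_HT))   # 1.0
# Any one randomized triple recurs rarely, so its support stays near that of
# ordinary user activity and the corresponding rule never clears the threshold.
```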

5.2.6 Implementation Issues

Throughout this chapter, the notion of user activity has been represented as a sequence of network requests. In practice, this is rarely the case; spyware usually consists of sophisticated monitoring tools, of which monitoring network accesses is only a small part. For example, some spyware works by logging keystrokes; other spyware may use a combination of keystrokes, network access, etc. In light of such characteristics, it is apparent that the notion of network requests (only) as user activity is flawed and not practical, since such spyware can decide whether a user is present on the system by checking the keystrokes. Thus, the malware can sense “no user activity”, thereby rendering ineffective agents that rely solely on network activity as the parameter of user activity. This raises the interesting question, “What is user activity?” from a parametric point of view. Towards a practical implementation of SpyCon, the nature of realistic user activity and the various avenues for emulating it are explored. The notion of user activity must be all encompassing if new threats like SpyZen are to be detected. User activity must be considered in its entirety, i.e., not only the processes that are executed by users (and the background process activity that results), but also other parameters such as keystroke activity, mouse movements, interactive sessions, etc. Towards emulating realistic user activity, there is a need to define and create parametric representations of GUI based user activity that are statistically similar to real datasets of user activity. The local agent must be capable of emulating user activity in terms of these parameters. The basic requirements of the local agent still remain the same, viz. the local agent should not irrevocably change the system state or interfere with the users’ workflow.

In order to fulfill these requirements, the input generation technique for the local agent must have the following properties.

• State Change: To avoid detection by spyware that uses a state change as an indicator of real user activity, the local agent must change the state of the local system in a manner similar to that of legitimate user activity.

• Emulation of user activity: The local agent's actions must emulate normal user activity in a statistically similar manner. This includes, but is not limited to, keyboard activity, mouse movements, network access, etc.

• Restoration of state change: Before the real user is active again, the state change initiated by the local agent is reversed.

Note also that an idle user can be determined in a variety of ways. The local agent (and the spyware) may monitor mouse and keyboard activity and declare the user to be idle when such activity ceases for a specified time period. Other techniques range from simplistic schemes, such as detecting whether the workstation is locked, to sophisticated ones, such as workflow process monitoring (e-mail, browsing, etc.). These issues bring about an interesting point: if the local agent (and spyware) can determine that the user is idle by simply checking whether the system is locked, then the threat vector shifts completely, and so do the anti-spyware techniques. In such situations, it behooves the operating system to provide a secure means of hooking into specific system events, thereby ensuring that only trusted and user-approved processes obtain critical hooks/notifications. Additionally, irrespective of the technique used, there is bound to be a time gap between the user leaving the terminal and the local agent/spyware detecting that the user is idle. This time gap can also be used by spyware to determine that the user is idle (and is probably the initiation point of honeytoken generation). Therefore, although anti-spyware techniques like [20] and SpyCon presented in this chapter may be technically sound, their implementation must receive appropriate focus in order for these techniques to successfully accomplish their goals.
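As an illustration of the timing issue just described, the sketch below shows an idle check with a randomized activation delay, so that the gap between the user leaving and honeytoken generation is not a fixed, fingerprintable constant. The class name, thresholds, and the input-hook interface are illustrative assumptions, not part of the actual implementation.

```python
import random
import time

IDLE_THRESHOLD = 300   # seconds of input silence before the user is deemed idle
MAX_JITTER = 120       # random extra wait, so the activation time is unpredictable

class IdleMonitor:
    """Declare the user idle only after a quiet period plus a random jitter,
    so spyware cannot use the agent's fixed activation delay as a signal."""

    def __init__(self, threshold=IDLE_THRESHOLD, max_jitter=MAX_JITTER):
        self.threshold = threshold
        self.max_jitter = max_jitter
        self.jitter = random.uniform(0, max_jitter)
        self.last_input = time.monotonic()

    def record_input(self):
        # Hook point: would be called on every (platform-specific) keyboard
        # or mouse event; the jitter is re-drawn after each burst of activity.
        self.last_input = time.monotonic()
        self.jitter = random.uniform(0, self.max_jitter)

    def user_idle(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_input) >= self.threshold + self.jitter
```

Real keyboard/mouse hooks are operating-system specific; the sketch only captures the randomized-delay idea.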

5.3 Preventing Information Leak in E-mails Using Structural and Stylometric Features

As mentioned earlier, similar to spyware, even an inadvertent leak of e-mails to an unintended set of recipients can result in serious privacy breaches. Despite its prevalence, few solutions have been proposed to tackle this problem. In this section, for the sake of motivation, the problem of information leak in e-mails is explained. Furthermore, adequate background details are provided to reiterate the impact of information leak on everyday users. The details on simulating information leaks using a real world dataset, as presented in [24], are summarized here. Lastly, the effect of structural and stylometric features on the classifier's performance is provided. Information leak via e-mails is not a new problem; in fact, it has been around ever since the inception of e-mail. Traditionally, it relied on human cognitive errors, especially at the time of message composition. However, with the growth of graphical e-mail clients with advanced usability features, which aid users in auto-completing recipient e-mail addresses, information leak has also proliferated to a much greater extent. For example, Mozilla Thunderbird and Microsoft Outlook Express display the list of suggested recipients from the address book whose e-mail addresses or names partially match the characters typed in the TO: or CC: fields. Moreover, in order to enrich user experience, Web based clients also show a list of suggested recipients based on previously sent e-mails. Specifically, the e-mail addresses that often accompany each other are clustered to form a single recommendation set. Subsequently, if a member of the set is entered in the TO: or CC: field by a user when composing a new e-mail, the remaining elements of the set are also shown as recommendations in the e-mail client. Figure 5.2 shows one such example using Gmail's e-mail client†. These factors, in turn, contribute to the leak of sensitive information caused by sending misdirected e-mails to the wrong set of recipients.

A report from SurfControl, an e-mail filtering firm, suggests that over 40% of workers surveyed have received confidential information via misdirected e-mails [15]. Such information leak has led to catastrophic results in real world settings. A Cisco employee was found guilty of inadvertently leaking the company's quarterly earnings report before the actual press release [15]. Another example of information leak via e-mail is quoted in [24]. Apart from the loss of privacy, it has been widely acknowledged that such information leaks have the potential to result in lawsuits and financial losses.

†Source of the image can be found at http://gmailblog.blogspot.com/2009/04/new-in-labs-suggest-more-recipients.html

Figure 5.2: Gmail client showing the list of suggested recipients during message composition.

5.3.1 Background

Given the impact on everyday users and organizations alike, e-mail providers have slowly started to develop preemptive solutions to mitigate information leak. As a part of the information rights management (IRM) effort, Microsoft has developed leak-proof e-mail that restricts recipients from forwarding or printing sensitive messages. With IRM, it is also possible for senders to dispatch messages with a set expiry time so that they are auto-deleted once past it, thus leaving only a small time window for recipients to act upon the information contained in them. In a similar vein, Gmail has introduced a feature that allows senders to "undo" sent messages within a 10 second time frame. Even though these solutions act as a first line of defense, they are not completely foolproof, and they only help in damage control, doing little to stop the information leak from happening in the first place. In this regard, Carvalho and Cohen were the first to address this problem in a formal manner [24]. They formulate detecting information leak via e-mails as an outlier detection problem in which characteristics of past e-mails exchanged between a sender-recipient pair are used as a basis to prevent information leaks that may occur between them in the future. A given e-mail message sent by a user to a recipient is considered a potential leak if it varies significantly from the past profile. In their model, the characteristics of past e-mail messages are captured using textual features only. In fact, each outgoing e-mail is first converted into a vector space model, and then cosine similarity is used as the metric for classification. Even though the semantics of the message are captured using textual features, such features alone fail to provide an accurate characterization of the e-mail messages communicated between a sender-recipient pair, as they do not encompass the underlying structural and grammatical traits. For example, personal e-mail communication has an altogether different structure/format from professional e-mail: it may have a different message signature, salutation, voice, etc. In this regard, this dissertation proposes to address the limitations in [24] by selecting a variety of stylometric, structural and linguistic features that accurately characterize past communication between senders and recipients. The detection of information leak is then transformed into a text similarity matching problem as given in [24]. Experiments conducted on information leaks simulated from the Enron e-mail corpus reveal that the proposed methodology exhibits better performance than the text classification solution presented in [24].

5.3.2 Scope of Detection

Here, the focus is mainly on detecting misdirected e-mails sent by users to an unintended set of recipients. E-mail based information leaks can also manifest in other forms. For instance, malware present on users' computers may stealthily send out spurious e-mails from their e-mail accounts. Similarly, insiders (or intruders) can leak sensitive information found in e-mails without the users' knowledge. As these malicious agents are capable of covering their tracks, they are typically tackled by anomaly or misuse detection systems. Normal behavior in such scenarios is modeled in terms of frequency of usage, user login time, user actions, e-mail statistics, etc., rather than the textual, structural or stylometric features extracted from the e-mail content as done here.

5.3.3 Features Used in Classification

As mentioned earlier, every e-mail is represented as a vector whose components correspond to various stylometric and structural features. Many of the features used here have also been used in authorship identification [40, 71], where the goal is to map an e-mail message to its actual sender. Depending on the type of processing needed to compute them, the features used in characterizing messages sent between a sender-recipient pair can be grouped into four categories, viz. (i) character-level feature set; (ii) word-level feature set; (iii) sentence-level feature set; and (iv) structural feature set. The features belonging to each of these categories are given in Table 5.1.
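To make the feature definitions concrete, the sketch below computes a small subset of the ratios from Table 5.1. The regex-based tokenization and the feature names are simplifying assumptions made for illustration; features such as the POS-based counts require a full linguistic toolchain and are omitted.

```python
import re

def extract_features(text):
    """Compute a small subset of the character-, word- and sentence-level
    ratios listed in Table 5.1 (illustrative, not the full feature set)."""
    c = len(text) or 1                                   # C: total characters
    words = re.findall(r"[A-Za-z0-9']+", text)           # crude tokenizer
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    m = len(words) or 1                                  # M: total words
    l = len(sentences) or 1                              # L: total sentences
    vocabulary = len({w.lower() for w in words})         # V: unique words
    return {
        "alpha_chars_per_c": sum(ch.isalpha() for ch in text) / c,
        "upper_chars_per_c": sum(ch.isupper() for ch in text) / c,
        "digit_chars_per_c": sum(ch.isdigit() for ch in text) / c,
        "vocabulary_richness": vocabulary / m,           # V/M
        "short_words_per_m": sum(len(w) < 4 for w in words) / m,
        "avg_sentence_length": m / l,                    # words per sentence
        "question_sentences_per_l": text.count("?") / l,
    }
```

Each outgoing e-mail would be mapped to such a dictionary (extended with the remaining features) and then treated as the vector described above.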

5.3.4 Experimentation

The dataset and experimentation setup used in this chapter are directly adopted from [24]. For the sake of completeness, the details concerning them are summarized in this section.

Dataset

Even though e-mail is widespread, due to privacy issues there are only a few publicly available e-mail datasets that can be used for experimentation. Perhaps the most popular among them is the Enron e-mail corpus, which was made available to the public by the Federal Energy Regulatory Commission as a part of the legal investigation carried out against the Enron corporation. The original Enron corpus contained 619,446 e-mail messages belonging to 158 users. However, the version used in this experimentation was prepared by Shetty and Adibi, in which most of the redundant messages appearing in the original corpus were removed [107]. This sanitized version contains 252,759 e-mail messages belonging to 151 users, organized in approximately 3000 folders. As some of the users have multiple aliases (or e-mail addresses), a mapping provided by Andres Corrada-Emmanuel is used to link e-mail accounts belonging to the same user together [37]. For each user, two distinct sets of messages are generated: (i) the sent collection contains all messages sent by the user; and (ii) the received collection contains all messages in which the user's e-mail address was included in the To:, CC: or BCC: fields. The sent collection is chronologically sorted and split into two parts, sent_train and sent_test. The sent_train set contains the oldest 90% of the messages sent by the user; the remaining e-mails are placed into sent_test. The same set of 20 users considered in [24] is also used here. An address book for each of the 20 users is created by assimilating the e-mail addresses found in the received and sent_train collections. The one difference between [24] and the setup considered in this chapter is that e-mails are not represented as word vectors; instead, they are represented by a vector of stylometric and structural features extracted from the e-mail content. The final message counts for the 20 Enron users calculated here, shown in Table 5.2, differ from the values provided in [24].

Table 5.1: The list of individual features belonging to the character-level, word-level, sentence-level and structural feature sets. C represents the total number of characters in an e-mail. The total number of words appearing in an e-mail is denoted by M, whereas V denotes the total number of unique words. Functional words are words that specify the mood or attitude of the speaker; the 122 functional words given in [40] are used. Hapax legomena denotes words that appear only once in all e-mails sent to a recipient. L denotes the total number of sentences in an e-mail.

• Character-level feature set: total # of characters in words/C; total # of alphabetic characters in words/C; total # of upper-case characters in words/C; total # of digit characters in words/C; total # of white-space characters (W)/C; total # of space characters; total # of space characters/W; total # of tab spaces/C; total # of tab spaces/W; total # of punctuations/C; total # of semicolons/C; total # of commas/C.

• Word-level feature set: vocabulary richness (V/M); total # of functional words/M; total # of short words/M; word length frequency/M; functional word distribution; total # of words with length > 6/M; total # of pronouns; total # of subordinating conjunctions/M; total # of coordinating conjunctions/M; total # of articles/M; total # of prepositions/M; total # of adjectives/M; total # of adverbs/M; total # of interrogative words/M; total # of nouns/M; total # of verbs/M; total # of adjectives/(total # of nouns); percentage of different POS trigrams; total # of hapax legomena words/M; total # of hapax legomena words/V.

• Sentence-level feature set: average sentence length; total # of blank lines/L; total # of sentences > 15 words/L; total # of short sentences < 8 words/L; total # of question sentences/L; total # of passive sentences/L.

• Structural feature set: has greeting message; has signature message; has farewell message; total # of attachments; position of requoted text.
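The chronological split and address-book construction described above can be sketched as follows. The message dictionary layout (`date`, `sender`, `recipients` keys) is an assumed representation of the corpus, not its actual on-disk format.

```python
def split_sent_collection(sent_messages):
    """Chronologically sort a user's sent messages and split them 90/10:
    sent_train holds the oldest 90%, sent_test the remainder."""
    ordered = sorted(sent_messages, key=lambda msg: msg["date"])
    cut = int(len(ordered) * 0.9)
    return ordered[:cut], ordered[cut:]

def build_address_book(received, sent_train):
    """Union of the addresses seen in the received and sent_train collections."""
    book = set()
    for msg in received:
        book.add(msg["sender"])
    for msg in sent_train:
        book.update(msg["recipients"])
    return book
```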

Table 5.2: E-mail messages in different collections for 20 Enron users

Enron User    Received Collection    Sent Collection
rapp                521                  163
hernandez          1232                  147
pereira             944                  199
dickson            1338                  220
lavorato            125                  402
hyatt              1871                  629
germany            2536                 3686
white              1122                  491
whitt               933                  460
zufferli            482                  349
campbell           1523                  591
geaccone           1099                  440
hyvl               1268                  723
giron              1125                 1110
horton             1293                  474
derrick            1537                  763
kaminski            679                 1219
hayslett           1823                  785
corman             2403                  763
kitchen            6716                 1504

Criteria for Selecting Leak Recipient

Information leaks are seldom documented in real world e-mails. Hence, for the purpose of experimentation, they need to be synthesized in a realistic manner from the e-mail corpus itself. Given an e-mail e sent by a user u to a set of recipients R = {r1, r2, ..., rn}, the leak recipient is selected using a three step algorithm: (i) in the first step, one of the recipients in R, say r*, is chosen at random; (ii) in the second step, from the address book of u, an e-mail address is chosen such that its first three characters match those of r*. If there is no such match, then the first two characters are considered, and so on. If there are multiple matches, then one recipient is chosen randomly among them; and (iii) in the last step, the leak recipient rleak returned in the previous step is added to the e-mail's recipient field. The problem then boils down to devising a classification technique that can successfully identify the chosen leak recipient rleak.
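The three-step selection procedure can be sketched as below. Two details are interpretations made for illustration: "first three characters match" is read as a string-prefix match, and the true recipients are excluded from the candidate pool.

```python
import random

def choose_leak_recipient(recipients, address_book, rng=random):
    """Synthesize a leak: pick a random true recipient r*, then pick an
    address-book entry sharing a prefix of length 3 (backing off to 2,
    then 1) with r*, breaking ties randomly."""
    r_star = rng.choice(recipients)
    # Candidate pool: addresses the sender knows but did not address here.
    pool = [a for a in address_book if a not in recipients]
    for k in (3, 2, 1):
        matches = [a for a in pool if a[:k] == r_star[:k]]
        if matches:
            return rng.choice(matches)
    return None  # no confusable address exists in the address book
```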

Information Leak Prediction

The method of detecting the leak recipient rleak from among the full set of recipients is as follows: for each recipient ri in R, and for rleak, a normalized feature vector vi is constructed by averaging the feature values computed over past e-mails sent from u to ri. A vector ve is also constructed from the e-mail message e itself. The similarity score between vi and ve is calculated using the cosine similarity metric [81]. All the recipients in R together with rleak are then ranked according to the calculated similarity score, and the recipient with the lowest score is predicted as the leak recipient. If the predicted leak recipient is the same as the chosen leak recipient rleak, the prediction is considered accurate; otherwise it is not. This process is carried out for all the e-mails of each of the 20 Enron users and, for every user, is repeated for N = 10 trials. The overall results are shown in Table 5.3. With stylometric and structural features, over 75% of the leak recipients are identified, as opposed to around 55% using only the textual features as shown in [24].
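A sketch of the ranking step, assuming each recipient's profile is the mean feature vector of past e-mails sent to them (the vector contents below are placeholders):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_leak(v_e, recipient_profiles):
    """recipient_profiles maps each addressed recipient (including the
    injected rleak) to its averaged profile vector; the recipient least
    similar to the outgoing message vector v_e is flagged as the leak."""
    scores = {r: cosine(v_e, v_r) for r, v_r in recipient_profiles.items()}
    return min(scores, key=scores.get)
```

A prediction counts as correct when the returned recipient equals the synthetically injected rleak.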

Table 5.3: E-mail leak prediction results

Enron User    Percentage of Correct Prediction
rapp            69.57
giron           79.46
germany         69.52
hyvl            78.37
campbell        76.55
derrick         77.01
kaminski        76.71
zufferli        77.02
hernandez       86.12
hyatt           82.51
lavorato        81.67
kitchen         70.58
whitt           77.30
hayslett        73.49
corman          71.99
dickson         77.41
horton          84.41
geaccone        79.09
pereira         93.62
white           79.33
average         78.09

5.4 Summary

This chapter essentially has two parts. In the first part, a randomized honeytoken based mechanism is proposed to detect evasive spyware that blends in with normal user activity; the weaknesses of existing static honeytoken based defense mechanisms are first highlighted to bring out the efficacy of the proposed solution. In the second part, information leak via e-mails caused inadvertently by users is addressed. A set of structural and stylometric features is proposed to pinpoint the leak recipient based on past profiles. Experimental results on a real world e-mail corpus suggest that the proposed solution achieves a significantly better prediction rate than existing solutions.

Chapter 6

Automated Vulnerability Aggregation and Response against Zero-day Attacks

“Hence that general is skillful in attack whose opponent does not know what to defend; and he is skillful in defense whose opponent does not know what to attack.”

− Art of War by Sun Tzu

6.1 Introduction

The previous three chapters have focused on filtering e-mail based attacks that pose a threat to users' security and privacy. An equally cataclysmic threat is posed by exploits that target vulnerabilities present in the e-mail infrastructure. Like any other software product, mail server software can contain coding flaws resulting in vulnerabilities that can be exploited by an external attacker. In the last three years, more than 50 different vulnerabilities in SMTP software have been reported in the National Vulnerability Database (NVD). Even though existing anti-virus and network intrusion detection systems (NIDS) quarantine incoming e-mails with malicious attachments, they are mostly signature-based and lack the ability to identify exploits that target newly discovered vulnerabilities. Moreover, the payloads of these exploits are not strictly embedded within e-mails; a significant fraction of the SMTP vulnerabilities listed in NVD can be exploited by attackers issuing SMTP requests with non-standard parameters that cannot be handled gracefully by a server's software. The current state-of-the-art in defending against such exploits is to wait until vendor specified patches become available, and then apply them. However, the underlying system is left at risk, especially in the time between the disclosure of a vulnerability and the release of the corresponding vendor specified patch. This chapter focuses on a proactive methodology to safeguard even enterprise networks having numerous deployed services during this window, also known as the zero-day gap. The proposed approach is generic enough that it can be easily extended to protect the software of all network services, not just e-mail software.

6.1.1 Motivation

In the quest to keep up with ever growing demand, software companies continue to design and develop products at alarming rates without paying much attention to security. By and large, security has been a concern addressed only after a vulnerability is publicly exposed. This "penetrate-and-patch" approach, where the emphasis is on fixing a particular exploit rather than enhancing the security of the underlying framework, has several disadvantages [119]. First, building patches necessitates the system undergoing the entire development and testing life-cycle again. As a result, present day exploits target the time window between the disclosure of a vulnerability and the release of its patch to launch successful attacks against otherwise defenseless systems. Second, poorly tested patches may introduce more inadvertent bugs [85], thereby prolonging the exposure to vulnerabilities even further. Third, since the installation of patches requires a system reboot as an undesirable side-effect, network administrators tend to apply patches at delayed intervals, risking windows of time when security is lax [16]. Finally, according to the CERT/CC [3], an average of 20 new vulnerabilities were reported daily in 2007, making the timely dissemination of patches nearly impossible. Although in the past there has been one worm, the Morris worm, that exploited an unknown vulnerability, most current attacks target recently published, unpatched vulnerabilities [11, 47]. Moreover, due to the ready availability of attack crafting tools [83], it has become extremely straightforward to generate variations of exploit code with minimal effort.

In order to defend against such attacks, network administrators and security analysts rely on commercial-off-the-shelf (COTS) tools such as security scanners and vulnerability assessment tools, which periodically examine the network for services whose software does not conform to the specified security recommendations. However, these tools act merely as an interface for patch management, and do not report the presence of recently published vulnerabilities, especially when the vendor specified patches are not available. The task of assimilating recently published vulnerabilities that affect a given network configuration is a tedious one. Typically, it is done by network administrators who manually dig through various external bug tracking databases to obtain the set of vulnerability reports relevant to their network configuration. Also, as the schematic format of the various bug tracking databases is not standard, it is difficult to devise automated solutions to semantically parse and digest the information present in them. In order to generate appropriate preemptive solutions that help protect the network from the corresponding attacks until the appropriate patches are released, it is crucial that the information contained in the reports be processed in time. Moreover, defense solutions can also be derived directly from the information contained in the aggregated vulnerability reports. For example, at the time of disclosure of the Microsoft Windows Meta-File Format (WMF) vulnerability [88], SANS provided a set of Snort rules to temporarily mitigate the attack. Although in this case the patch turn-around time was relatively quick (less than a week), these signatures could nevertheless have been used to prevent attacks, especially during the vulnerable gap. As a light-weight defense solution, AEGIS, a mechanism to assess and protect the network from recently published vulnerabilities, is proposed.

In order to reduce the assessment overhead, only those vulnerabilities that are applicable to the given network configuration are considered. For this purpose, at the time of deployment, configuration details of all services running on the different hosts of the network are encapsulated using a Service Definition Language (SDL). Furthermore, an extensible defense-oriented representation schema (EDORS) is proposed to represent vulnerabilities in a machine oriented format. This schema is then used by the policy engine to generate detection rules that can be accommodated in any firewall or NIDS capable of performing deep-packet inspection. The final sequence of rules is derived by verifying and evaluating the consequence of applying the generated rules in accordance with the adopted organizational policy. The feasibility of generating such defense rules is tested using six different vulnerabilities. Experimental evaluation shows that it is possible to generate and apply NIDS rules in an effective manner with minimal overhead, especially when enough information is present in the vulnerability reports.

6.1.2 Contributions

As a first line of defense, AEGIS offers a just-in-time (JIT) vulnerability management solution with the following advantages over other existing schemes.

• An automated mechanism for determining the vulnerabilities pertinent to a network/system configuration using the information aggregated from various external bug tracking repositories. A total of six different external repositories are considered for this purpose.

• An Extensible Defense Oriented Representation Schema (EDORS) to store the vulnerability reports aggregated from external sources in a standardized format so that they can be later processed by the policy engine to generate appropriate IDS signatures.

• Generation of organizational policy based IDS signatures that select the best sequence of actions to be adopted until the vendor specified patch is available.

To make the detection technique more accurate and less intrusive, AEGIS can be extended to incorporate NIDS signatures generated by programming language approaches like [76] and [94].

6.1.3 Chapter Organization

The rest of the chapter is organized as follows: The design goals are given in Section 6.2. The AEGIS architecture details are presented in Section 6.3. Experimental evaluation and the core results are given in Section 6.4. The concluding discussion is provided in Section 6.5.

6.2 Design Goals

The following three main design goals are identified to be incorporated into AEGIS:

• Minimize the vulnerability aggregation overhead – The vulnerability aggregator should download, in a machine-oriented format, only the relevant vulnerabilities that have been recently added or updated in the repositories. While this is trivial with the National Vulnerability Database (NVD) [95], which lists all newly added and modified vulnerabilities in an XML format, most other sources maintain the information across HTML pages that are hard to process. Even though there exist tools like MBSA [86], Nessus [41] and GFI LANGuard Network Security Scanner [54] that perform vulnerability assessment, they are, however, used only to check for missing patches, service packs and vulnerable software, and are oblivious to vulnerabilities for which vendor specified patches are not available. Also, the external sources must be scanned periodically to keep the local database in sync.

• Generation of organizational policy based IDS signatures – The information presented in the vulnerability reports should be coupled with the organizational policy to generate appropriate defense signatures, which act as stop-gap solutions that help protect the system until the vendor specified patches are available. Moreover, the generated rules should be able to interoperate with the existing NIDS/firewall so that they can be revoked when the corresponding patches are released.

• Flexible deployment requirements – AEGIS should be able to run across multiple operating systems (i.e., Windows, Linux and others). This, along with the goal of being able to easily customize the code, drove the decision to use Java as the programming language. Snort was chosen because it also met the cross-platform design goal, but AEGIS can be enhanced to use other NIDS platforms as long as they allow rules to be updated programmatically.
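Touching on the first design goal, a minimal sketch of a stateful update pass over an XML vulnerability feed is given below. The element and attribute names are illustrative placeholders, not the actual NVD feed schema; the point is that only entries modified after the last successful sync are processed.

```python
import xml.etree.ElementTree as ET

def parse_recent_entries(feed_xml, last_update):
    """Stateful aggregation sketch: keep only vulnerability entries whose
    'modified' timestamp (ISO date, so lexical comparison works) is newer
    than the timestamp of the last successful sync."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall("entry"):
        if entry.get("modified", "") > last_update:
            fresh.append({"cve": entry.get("cve"),
                          "modified": entry.get("modified")})
    return fresh
```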

6.3 Architecture Overview

AEGIS is implemented on a dedicated host, which acts as a central agent that monitors other nodes in the network. Such a level of abstraction is essential in order to avoid placing the complex functionality in each of the end-hosts. The architecture block diagram of AEGIS and its various components are illustrated in Figure 6.1.

• AEGIS Scanner – In order to determine the relevant vulnerabilities, network administrators need to know about the deployed services in every corner of their network. The application details of the underlying services also need to be captured. Usually, this is done via active or passive scanning. The major difference between the two approaches is that active scanning techniques, while more accurate than passive ones, are noisy and increase network traffic. For the purpose of keeping the design simple, the AEGIS scanner delegates the burden of reporting the configuration details to the central agent onto the end-host itself. Even though it is possible for the end-host to report false information about the running services, verifying this is straightforward, and is not addressed here. Periodically, every end-host encapsulates all the running services using a service definition language (SDL), and presents it to the scanner. The scanner then prepares a combined configuration report of the whole network by taking the union of the individual configurations. A local table is also maintained that maps the configuration information to the hostname or the IP address of the end-host. The SDL consists of the following classes: (i) appInfo – This class captures the application names and current versions of that particular application along with the underlying operating system details. It also contains the IP address or hostname to identify the host. (ii) servInfo – The servInfo class lists the services that are running on the system, their underlying protocol base, i.e., TCP, UDP or ICMP, and their corresponding ports. This feature is required to bind the services running on different hosts to the appropriate ports and protocol bases. The operating system information and the services can also be sniffed using the p0f [130] and libpcap [4] programs.
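The two SDL classes described above might be modeled as follows. The exact field names beyond those mentioned in the text, and the merge helper, are illustrative assumptions rather than the actual SDL definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AppInfo:
    """appInfo: application name/version, OS details, and host identity."""
    host: str            # IP address or hostname of the end-host
    os: str
    application: str
    version: str

@dataclass
class ServInfo:
    """servInfo: a running service, its protocol base and port."""
    service: str
    protocol: str        # TCP, UDP or ICMP
    port: int

@dataclass
class HostConfiguration:
    app: AppInfo
    services: List[ServInfo] = field(default_factory=list)

def merge_network_configuration(host_configs) -> Dict[str, HostConfiguration]:
    """The scanner's union step: map each host to its reported configuration."""
    return {cfg.app.host: cfg for cfg in host_configs}
```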

Figure 6.1: Architecture Block Diagram of AEGIS

• AEGIS Database – Existing bug tracking repositories employ proprietary formats that are disjoint and varied. Although the diversity of such information is useful, the lack of a common vulnerability representation format across such sources, aggravated by the lack of automated assessment solutions, makes it difficult to automatically process the contained information and generate remedial actions. For example, the linux-ftpd-ssl buffer overflow vulnerability (CVE-2005-3524) is represented very differently across two popular vulnerability databases, especially with regard to the information about the vulnerable operating system platforms. To overcome these limitations, an Extensible Defense Oriented Representation Schema (EDORS) is proposed, which precisely represents the vulnerabilities aggregated from multiple sources. The three essential components of EDORS are: (i) Vulnerability Preconditions – It is important to identify and represent the vulnerable package along with the other environmental factors that influence its exploitation. Moreover, such information can be parsed by the scanner in order to determine system specific vulnerabilities. Reference numbers from other bug tracking repositories can also be used to synchronize and correlate different attributes of any particular vulnerability. (ii) Impact Details – For the purpose of applying appropriate defense measures, it is important to assess the impact of the vulnerabilities on a given organization. This information is crucial in determining whether the proposed defense action violates any of the organization's policies. (iii) Remedial Actions – Any suggested remedial actions that are known at the time of vulnerability disclosure need to be incorporated in the report itself so that they can be adopted directly by the automated defense mechanisms.

The XML format is used to represent and specify vulnerabilities according to the proposed EDORS schema. XML provides the necessary flexibility, interoperability and portability needed for the efficient dissemination of such information. Also, the availability of schema independent XML parsers makes it feasible to customize representation formats [80]. Figure 6.2 shows the XML representation for the linux-ftpd-ssl vulnerability, which was reported on 11/07/2005. The included remedial actions constitute the information available at the time of disclosure.
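A sketch of assembling the three EDORS components into an XML report is shown below. The tag names and the sample remedial rule are illustrative placeholders, not the actual schema of Figure 6.2; only the CVE identifier and package name are taken from the text.

```python
import xml.etree.ElementTree as ET

def build_edors_report(cve_id, package, impact, remedial_rule):
    """Assemble the three EDORS components (vulnerability preconditions,
    impact details, remedial actions) into a single XML report."""
    report = ET.Element("vulnerability", id=cve_id)
    pre = ET.SubElement(report, "preconditions")
    ET.SubElement(pre, "package").text = package
    ET.SubElement(report, "impact").text = impact
    rem = ET.SubElement(report, "remedialActions")
    ET.SubElement(rem, "idsRule").text = remedial_rule
    return ET.tostring(report, encoding="unicode")

# Hypothetical report for the vulnerability discussed above; the remedial
# rule text is a placeholder, not a real Snort signature.
xml_report = build_edors_report("CVE-2005-3524", "linux-ftpd-ssl",
                                "buffer overflow",
                                "inspect and drop oversized FTP-SSL requests")
```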

The update and synchronization agent is responsible for keeping the local bug database up-to-date with the external repositories. The updater periodically connects to the ad hoc online sources to assimilate newly added or modified vulnerability reports. The interval of the update process depends on the criticality of the system under consideration. Also, the update process is done in a stateful manner, i.e., while retrieving information, only the reports after the last update timestamp are processed. To ensure consistency and accuracy, access to the AEGIS database by the scanner and updater is made mutually exclusive.

• Adaptive Policy Generator (APG) – The adaptive policy generator acts on the input provided by the AEGIS scanner and the aggregated vulnerability reports to generate IDS signatures. Also, the APG ensures that the generated rules do not conflict with the organizational operation policy. For example, the implemented rule should not block all incoming HTTP requests to port 80 for a vulnerability having a low severity rating, especially when the functioning of the organization depends on incoming HTTP traffic.

Figure 6.2: An Example Vulnerability Report in EDORS Format

The decision problem of the APG is to determine the action to take at various time intervals such that the expected risk and the performance cost are minimized. The proposed actions can vary from performing deep packet inspection on all byte combinations of the incoming packets, incurring large CPU time, to performing selective packet sampling based on a presumed suspicion or hypothesis, thereby risking exposure. It is assumed that the associated risk (i.e., the probability of a vulnerability being exploited) changes over time, and at the beginning of each time period the APG varies its action to respond to these changes. The action taken in the current time period does not depend on the action(s) taken in previous time period(s). Moreover, each action has an associated cost. The detailed model is described as follows.

Let $t = 1, 2, \ldots, T$ denote the decision periods before the patch is released. Let $A = \{a_1, a_2, \ldots, a_n\}$ denote the set of actions that can be taken for a given vulnerability. These actions are attuned with the capability of the IDS; the various actions could be to either drop, allow, or log the packets depending upon the specification. Let $s_{a_i}$ denote the effectiveness of action $a_i$, $\forall a_i \in A$. Each $s_{a_i}$ is viewed as the risk reduction due to the implementation of action $a_i$. Let $c_{a_i}$ denote the cost incurred due to the enforcement of action $a_i$. Usually, an action will restrict access to certain services or functions to some degree (the worst scenario is to block all access to the service). Such restrictions affect system availability and quality of service (QoS), thus introducing a business or service loss. Each $c_{a_i}$ can be measured either in monetary terms or in terms of the computational overhead needed to enforce the action. Let $r(\beta, t)$ denote the risk of the vulnerability at time $t$, where $\beta$ is a set of parameters used to describe the vulnerability. The risk $r(\beta, t)$ is initially low when a vulnerability is reported, but increases exponentially until the patch is released. Let $I(t)$ denote the potential loss due to the exploit of the vulnerability at time period $t$; again, $I(t)$ can be measured in monetary terms. Both $r(\beta, t)$ and $I(t)$ can be fed into AEGIS as guesstimates, depending upon the criticality of the underlying network service and the host on which it is running. Let $x_{t,a_i}$ denote the decision variable: if an action $a_i$ is implemented at time $t$, then $x_{t,a_i} = 1$; otherwise, $x_{t,a_i} = 0$. Thus the objective is to minimize the total cost by choosing a proper action in different phases:

$$\min \sum_{t=1}^{T} \left( [1 - x_{t,a_i} s_{a_i}]\, r(\beta, t)\, I(t) + x_{t,a_i} c_{a_i} \right) \qquad (6.1)$$

$$\text{s.t. } x_{t,a_i} \in \{0, 1\}, \; \forall a_i \in A$$

Since the time periods are independent, equation 6.1 can be decomposed into a set of subproblems as follows. For any $t$:

$$\min \left( [1 - x_{t,a_i} s_{a_i}]\, r(\beta, t)\, I(t) + x_{t,a_i} c_{a_i} \right) \qquad (6.2)$$

$$\text{s.t. } x_{t,a_i} \in \{0, 1\}, \; \forall a_i \in A$$

A greedy search heuristic to solve this subproblem is given in Algorithm 4.

input : r(β, t), I(t), A, c(a), s(a) ∀a ∈ A
output: x_{t,a_i} denoting the best action a_i taken at time t = 1, ..., T

1   for t ← 1 to T do
2       set x_{t,a_1} ← 1;
3       set min_cost ← (1 − s(a_1)) · r(β, t) · I(t) + c(a_1);
4       foreach a_i ∈ A do
5           set temp ← (1 − s(a_i)) · r(β, t) · I(t) + c(a_i);
6           if temp < min_cost then
7               set min_cost ← temp;
8               set x_{t,a_i} ← 1;
9               set remaining x_{t,a_j} ← 0 such that 1 ≤ j ≤ n and j ≠ i;
10          end
11      end
12  end

Algorithm 4: Greedy algorithm to select the appropriate action a_i to enforce at time t so that the overall cost is minimized
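Algorithm 4 can be rendered in Python as follows (a simplified sketch, assuming the risk r(β, t) and loss I(t) are supplied as plain functions; all names here are illustrative, not taken from the AEGIS implementation):

```python
def select_actions(T, actions, s, c, risk, loss):
    """Greedy per-period action selection in the spirit of Algorithm 4.

    T:       number of decision periods before the patch is released
    actions: list of action identifiers a_1 ... a_n
    s, c:    dicts giving effectiveness s(a) and enforcement cost c(a)
    risk:    function t -> r(beta, t), exploitation risk at period t
    loss:    function t -> I(t), potential loss at period t
    Returns a dict mapping each period t to its minimum-cost action.
    """
    chosen = {}
    for t in range(1, T + 1):
        best, min_cost = None, float("inf")
        for a in actions:
            # Expected cost of enforcing action a at period t (objective of Eq. 6.2)
            cost = (1 - s[a]) * risk(t) * loss(t) + c[a]
            if cost < min_cost:
                best, min_cost = a, cost
        chosen[t] = best  # equivalent to setting x_{t,a} = 1 for the winner, 0 for the rest
    return chosen

# Example: as the risk grows over time, the stronger but costlier action wins.
s = {"drop": 0.9, "log": 0.2}
c = {"drop": 50.0, "log": 5.0}
chosen = select_actions(5, ["drop", "log"], s, c,
                        risk=lambda t: 0.02 * t, loss=lambda t: 1000.0)
```

Because the time periods are independent (equation 6.1 decomposes into equation 6.2), each period is solved in O(n) time.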

6.4 Experimental Evaluation

6.4.1 Feasibility to Generate NIDS Rules

The downloaded vulnerability reports are processed by the APG to generate corresponding Snort signatures. For accurate detection of malicious packets, it is essential that information about the exploit payload is present in the reports. In the absence of such information, the APG consults the security policy to adopt a high-level action, i.e., to either drop or allow all the incoming packets. Since verification of the APG requires organization-specific information, such as the security policy, the response to recently published vulnerabilities, and the associated risk metrics, which are difficult to obtain, the default action a_i is set to drop all incoming packets towards the vulnerable service at any time t. Estimation of the cost and associated risk for a particular action taken in the vulnerable phase is not very difficult; there are works that focus on assisting system administrators in estimating the risk and cost metrics from a socio-economic perspective [12, 16, 25, 116]. Subsequently, to show the efficacy of AEGIS, it is tested against six different vulnerabilities by generating their corresponding Snort rules.

Linux-ftpd-ssl Vulnerability (CVE-2005-3524): The SSL-ready version of linux-ftpd (linux-ftpd-ssl) 0.17 allows remote attackers to execute arbitrary code by creating a long directory name and then executing the XPWD command, which gives root access on the system. In order to prevent such a severe attack (severity rating 10; NVD), abnormally large FTP control packets containing the string AUTH SSL should not be allowed and the network connection should be dropped. Hence a simple Snort rule for detecting this vulnerability is:

reject tcp $EXTERNAL_NET any -> $VUL-LINUX-FTPD-SERVERS 21 (content:"AUTH SSL"; dsize:>64;)
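The mapping from an EDORS report to a concrete rule, including the fall-back to the coarse default action when no payload signature is available, can be sketched as follows (a simplified illustration; the field names and the rule template are assumptions, not the actual APG implementation):

```python
def edors_to_snort(report):
    """Render a minimal Snort rule from an EDORS-style report dict.

    Falls back to a header-only rule (drop everything towards the vulnerable
    service) when the report carries no exploit payload information.
    """
    action = report.get("action", "drop")
    header = "%s %s $EXTERNAL_NET any -> $%s %s" % (
        action, report["protocol"], report["target_var"], report["port"])
    if "content" in report:                      # payload details known: match on them
        return '%s (content:"%s";)' % (header, report["content"])
    return header                                # no payload info: header-only rule

# Example: rendering a rule for the linux-ftpd-ssl report sketched above.
rule = edors_to_snort({
    "protocol": "tcp", "target_var": "VUL-LINUX-FTPD-SERVERS",
    "port": 21, "action": "reject", "content": "AUTH SSL",
})
```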

Apache Tomcat Vulnerability (CVE-2005-0808): Apache Tomcat before 5.x allows remote attackers to cause a denial of service (application crash) via a crafted AJP12 packet to the Tomcat HTTP port. In the absence of specific information about the packet structure, all AJP12 packets should be considered tainted and should not be delivered to the end-hosts running the Tomcat server. This could be further strengthened by including rules that add more specific information to the filtering criterion; however, the former rule should be used as a protection against zero-day exploits. The corresponding Snort rule is given as follows.

reject tcp $EXTERNAL_NET any -> $VUL-TOMCAT-SERVERS $TOMCAT_PORT (content:"AJP12"; nocase;)

Microsoft PnP vulnerability (CVE-2005-1983): A stack-based buffer overflow in the Plug and Play (PnP) service for Microsoft Windows 2000 and Windows XP Service Pack 1 allows remote attackers to execute arbitrary code via a crafted packet, and local users to gain privileges via a malicious application, as exploited by a number of worms including the Zotob (Mytob) worm. Given the severity of the vulnerability, it is essential that all packets destined for the vulnerable end-hosts on port 445, used by the PnP service, be dropped. Such an action may affect availability severely, but may be indispensable in the face of the security hazards the vulnerability exposes.

drop tcp $EXTERNAL_NET any -> $VUL-PNP-HOSTS 445

Microsoft SQL Server 2000 vulnerability (CVE-2002-0649): Multiple buffer overflows in the SQL Server 2000 Resolution Service allow remote attackers to cause a denial of service or execute arbitrary code via UDP packets to port 1434 in which (1) a 0x04 byte causes the SQL Monitor thread to generate a long registry key name, or (2) a 0x08 byte with a long string causes heap corruption. To prevent the spread of exploits making use of this vulnerability, packets destined to vulnerable hosts matching these criteria should be dropped.

drop udp $EXTERNAL_NET any -> $VUL-SQL-SERVERS 1434 (content:"|04|"; depth:1; content:"|81 F1 03 01 04 9B 81 F1 01|"; content:"sock"; content:"send";)

Microsoft Indexing Service 2000 vulnerability (CVE-2001-0500): A buffer overflow in the ISAPI extension (idq.dll) in Index Server 2.0 and Indexing Service 2000 in IIS 6.0 beta and earlier versions allows remote attackers to execute arbitrary commands via a long argument to Internet Data Administration (.ida) and Internet Data Query (.idq) files such as default.ida, as commonly exploited by Code Red. All URL requests directed towards "\inetpub\scripts\root.exe" on the vulnerable hosts should be dropped.

drop tcp $EXTERNAL_NET any -> $VUL-HTTP-SERVERS $HTTP_PORT (content:"\inetpub\scripts\root.exe";)

PuTTY SSH2_MSG_DEBUG vulnerability (CVE-2004-1008): An integer signedness error in the ssh2_rdpkt function in PuTTY prior to version 0.56 can allow remote attackers to execute arbitrary code via an SSH2_MSG_DEBUG packet with a modified stringlen parameter, causing a buffer overflow. Considering the severity of the vulnerability (severity rating 10; NVD), all SSH2_MSG_DEBUG packets should be dropped.

drop tcp $EXTERNAL_NET 22 -> $VUL-PUTTY-HOSTS any (content:"SSH2_MSG_DEBUG";)

It should be noted that there may be vulnerability reports without any defense-related information. Ideally, such reports need to be examined by security personnel on a case-by-case basis so that an appropriate defensive measure is adopted. However, this largely depends on the organizational policy and the criticality of the service. Also, the generated IDS rules only serve as a first line of defense against recently published vulnerabilities. Once a vendor-specified patch is released for a particular vulnerability, it supersedes the previously enforced IDS rule and must be applied immediately; the corresponding IDS rules then need to be promptly revoked. For this purpose, AEGIS periodically examines the sources of all the downloaded vulnerability reports to see if any new security updates have been announced in them.
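The revocation step described above can be sketched as follows (a minimal illustration under the assumption that active rules are keyed by CVE identifier; the data structures are not taken from the actual AEGIS implementation):

```python
def refresh_rules(active_rules, patched_cves):
    """Drop the stop-gap IDS rules for vulnerabilities that now have a vendor patch.

    active_rules: dict mapping CVE id -> enforced IDS rule string
    patched_cves: set of CVE ids for which a vendor patch has been applied
    Returns the rule set that should remain in force.
    """
    return {cve: rule for cve, rule in active_rules.items()
            if cve not in patched_cves}

# Example: once CVE-2005-1983 is patched, only the unpatched rule remains.
remaining = refresh_rules(
    {"CVE-2005-3524": "reject tcp ...", "CVE-2005-1983": "drop tcp ..."},
    {"CVE-2005-1983"},
)
```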

6.4.2 Performance Overhead

In this section the performance overhead incurred by the various components of AEGIS is analyzed.

AEGIS Vulnerability Aggregator: Once the vulnerability reports are downloaded, the local database is periodically synchronized with the external repositories to keep the system up-to-date. Since most of the repositories are updated on an hourly basis, the time interval for synchronization is set to 1 hour. In the last six months, an average of 16 vulnerabilities were reported daily among these six sources, and an average of 8 vulnerability reports were modified daily. As in most cases the vulnerability reports are represented as HTML pages, page markers are maintained in the local cache in order to check for updates and additions. Repositories like NVD, XFORCE, etc. provide a separate feed for recently added and modified reports, thereby making the update process easy. The time taken to download the entire set of vulnerability reports as a one-time effort, and the time taken to keep the local database in sync with the six repositories, is given in Table 6.1.

Table 6.1: Overhead incurred while downloading vulnerability reports

Repository   Number of        Initialization   Average sync   Feed
             vulnerabilities  time (secs)      time (secs)    Type
NVD          14649            290              3.914          XML
BUGTRAQ      16190            8223             1.211          HTML
USCERT       1600             200              10.788         HTML
XFORCE       22276            20554            11.435         HTML
SECTRACK     14347            3638             16.734         HTML
SECUNIA      11304            5552             4.322          HTML

AEGIS Adaptive Policy Generator: The total time taken by the policy generator is the sum of the time taken to generate policy-specific defense rules and the time taken to implement these rules in the existing NIDS. Since existing repositories do not contain information in EDORS format, a total of 345 published vulnerabilities spread across the six bug repositories were converted into the EDORS format. Here, the time taken to generate the IDS rules and the time taken to update the Snort rule base is 2.4 secs and 1.2 secs respectively.

Snort Performance: Here, the performance overhead to generate and enforce IDS rules is given. The overhead varies depending on the complexity of the underlying rule; certain rules inspect incoming packets on a byte-by-byte basis, whereas others check only the first few bytes. The performance degradation resulting from implementing the rules for the six vulnerabilities given in Section 6.4.1 is shown in Figure 6.3. It is important to note that the taint analysis based approaches [76] and [94], proposed to protect against such attacks, perform eight times worse when compared with AEGIS.

Figure 6.3: Performance overhead incurred by SNORT to generate IDS rules

6.5 Summary

Since most vulnerability publishing formats are not machine-oriented, it is difficult for automated defense solutions to digest and extract the information contained in them. In order to address these limitations, an Extensible Defense-Oriented Representation Schema (EDORS) that succinctly represents vulnerabilities contained in disparate vulnerability reports is proposed. Also, at periodic intervals, AEGIS encapsulates snapshots of the network configuration details, with the services running on the different hosts of the network, using the Service Definition Language (SDL). SDL interoperates with the risk assessment framework to quantify the risks imposed by a vulnerability on an organization at any particular period in time. Information present in the EDORS schema is used by the policy engine to generate detection rules that can be accommodated in any firewall or NIDS capable of performing deep-packet inspection. The final sequence of rules is then derived by verifying and evaluating the consequence of applying the generated rules in accordance with the specified organizational policy. Lastly, as AEGIS examines the outgoing and incoming network packets, it can be used to reject, at the network level, all HTTP packets towards a known phishing Website or e-mails sent from a known blacklisted relay, providing avenues for implementing attack-agnostic defense solutions.

Chapter 7

Conclusions and Future Work

“This is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.”

− Winston Churchill

7.1 Concluding Remarks

The main focus of this dissertation has been to address e-mail based threats arising due to the lack of proper authentication, integrity, and non-repudiation enforcement mechanisms in the present day SMTP protocol. In particular, due to the aforementioned limitations, three prominent threats have surfaced, viz. (i) spam; (ii) phishing; and (iii) information leak. Spam e-mails are considered a serious annoyance to network administrators and end users alike. On the network side, they inundate legitimate e-mail servers and network infrastructure with junk messages, drastically diminishing their bandwidth and computation power. Users are also forced to fritter away substantial time and effort in keeping these junk e-mails from flooding their inboxes. Despite these adverse effects, spam e-mails do not pose a direct threat to users' security and privacy. Phishing e-mails, on the other hand, trick victims into disclosing sensitive information to fake Websites, which mimic legitimate financial domains. As phishing attacks are a form of identity theft with the potential to inflict huge monetary damages on their victims, they are considered the most nefarious of all e-mail based threats. Until recently, spam filters were extended to mitigate

phishing e-mails. Phishing e-mails having a similar layout and appearance as their legitimate counterparts can easily bypass these generic filters. Even the specialized anti-phishing solutions proposed in the literature to detect phishing e-mails overlook important features that distinguish phishing e-mails from legitimate financial e-mails. As a result, their underlying classification techniques suffer from poor performance. In order to emphasize this point further, this dissertation presents a case study illustrating the inability of four popular contemporary e-mail clients to separate legitimate financial e-mails from phishing e-mails. Another problem with these filters is that they do not take into account, during classification, visual cues indicating personalized content that are exclusive to legitimate financial e-mails. As most phishing e-mails are composed in bulk, unlike e-mails from legitimate financial organizations, they do not address their recipients in a direct fashion (e.g., using their last name, user id, last four digits of their account number, address, etc.). In this regard, this dissertation proposes a customizable and usable spam filter (CUSP) that can detect phishing e-mails from the absence of such personalized content. CUSP is also implemented as an add-on to Microsoft Outlook, a popular e-mail client widely used by home and business users. To get a better feel for the type of personalized content included in legitimate e-mails, a survey of e-mails from the 20 most-phished organizations in 2007 was conducted. These organizations unanimously claim that they do not send e-mails to their customers requesting sensitive information. Also, most of these organizations ascertain that they refrain from mass mailing, and only send out "personalized" e-mails to the recipients.
However, solely relying on the personalized information contained in the e-mail body to tell phishing and legitimate e-mails apart is not entirely foolproof. A better way to detect phishing e-mails is by extracting and analyzing the underlying meaning conveyed in the e-mail body. Since phishing is primarily a social engineering attack, phishers often imply a false sense of threat or lure in their messages to make users succumb to them. This dissertation proposes to encapsulate this fake behavior exhibited by phishing e-mails using a set of linguistic, textual, and structural features, and to apply them for classification. The effect of each individual feature in the classification process is analyzed to bring out its usefulness. Three popular machine learning algorithms, viz. (i) the naïve Bayesian

classifier; (ii) decision trees; and (iii) support vector machines were used for this purpose. Experimental results reveal that the proposed methodology detects phishing e-mails with high accuracy while keeping the false positive rate low. Furthermore, the underlying tone of phishing e-mails can be used to generate accurate context-sensitive warnings that educate the users about the working and ill-effects of phishing attacks.

Spurious e-mail classification techniques that operate on features intrinsically contained in the e-mail body can be thwarted using simple obfuscation techniques. In particular, portions of text that are suggestive of spurious content can be substituted with seemingly innocuous non-standard characters or images so that they pass through as ham. Also, the features present in a spurious e-mail's body might not be adequate for the classification algorithms to flag it as "suspicious." In the face of such scenarios, this dissertation takes the classification process a step further to analyze the behavior and characteristics of the Websites referred to by URLs contained in e-mails. To detect spoofed Websites, the technique (PHONEY) proposed in this dissertation reverses the roles of victim and adversary, and evaluates the response of spoofed Websites to random inputs (or honeytokens). As phishing Websites cannot tell fake and genuine inputs apart, their behavior is the same for both. This static behavior exhibited by phishing Websites serves as a vital factor in identifying them. The proposed technique was evaluated on both "live" and "synthesized" phishing Websites. In order to visualize the working of PHONEY, an ActiveX plug-in for the Internet Explorer (IE) 6 browser was developed. Experimental results show that PHONEY is able to detect almost all of the phishing Websites with zero false positives. Furthermore, the net overhead incurred by PHONEY in evaluating each e-mail is negligible, on the order of milliseconds.
In the second part, this dissertation takes a step further by proposing a novel technique to identify spam e-mails by analyzing the content of the linked-to Websites. Unlike spurious e-mails, linked-to Websites do not contain any obfuscating material; they vividly reflect the original intent of the messages and can be used to determine whether the corresponding e-mail is spam or not. A combination of textual and structural features extracted from the linked-to Websites is supplied as input to five machine learning algorithms for the purpose of classification. The proposed technique was able to detect

spam e-mails with over a 95% detection rate, exhibiting better performance than two popular open source anti-spam filters.

Phishers can also launch context-aware attacks by harvesting the personal details of their potential victims and using them to compose targeted e-mails that appear to come from legitimate financial institutions. Even though information about users can be acquired in a variety of ways, the most prevalent method is through spywares that hook on top of users' browsers. These spywares monitor outgoing HTTP packets and information present in the browsers' cache, and surreptitiously leak them to external attackers. Traditional anti-spyware mechanisms focus on identifying spywares by separating spyware and user activity using static honeytokens. However, a new class of intelligent spyware can be devised to defeat static honeytoken based schemes by employing data mining algorithms such as associative rule mining. To counteract this threat, this dissertation proposes a randomized honeytoken based detection mechanism, which cannot be easily circumvented even by intelligent spyware. Another related problem is preventing information leaks caused inadvertently by users. In order to prevent users from sending e-mails to an unintended set of recipients, this dissertation proposes to model past user-to-recipient communication from e-mails, and apply it to detect outliers that may occur in the future. The proposed technique follows the same idea and procedure given in [24]; the focus, however, is shifted towards selecting a better set of linguistic features that boost the classifier's performance. Finally, to safeguard against zero-day exploits that target unpatched vulnerabilities in SMTP server software and e-mail clients, a novel technique called AEGIS is proposed.
AEGIS aggregates vulnerability reports pertinent to a given network configuration from various external sources, and uses the information contained in them to generate IDS rules that conform with a specified organizational policy. AEGIS is flexible enough that it can be easily extended to protect any network service in addition to SMTP. Also, since AEGIS interacts with the IDS, it can be used to enforce attack-agnostic rules once the misbehaving nodes are determined. In addition, experience has suggested that AEGIS can also be used as an educational tool to train system administrators to respond to zero-day attacks [31].

7.2 Future Directions

The research presented in this dissertation has opened up several new directions for future research. They are summarized as follows.

• The algorithms proposed to classify phishing e-mails treat each visual deception agent (a trick used by attackers to make phishing e-mails appear legitimate) present in a spoofed e-mail with equal importance. However, with the help of user studies similar to [43, 65], it may be possible to rank the various deception agents in order of their effectiveness in fooling users. This would make it possible to give more attention to highly ranked deception agents, thus enhancing the overall performance.

• The anti-phishing mechanisms only provide warnings or cues to help users decide on the validity of a Website. Since it is the users who make the ultimate decision of whether to respond to a phishing e-mail or not, it may be prudent to evaluate the context-sensitive warning generation system proposed in this dissertation with real human subjects. Feedback from them would also help generate warning messages in a more user-friendly manner.

• Another related area where text mining could be applied is the problem of detecting fraudulent sellers based on buyers' feedback text. As e-commerce Websites like eBay, Amazon.com, and Google Checkout allow vendors to sell their own goods through their portals, the task of identifying illegitimate products or sellers is also unavoidably thrust on them. The classification techniques presented in this dissertation can be extended to detect malicious sellers who do not deliver the promised goods on time or at all.

• One of the impeding factors in the large scale adoption of PHONEY is that, while it is able to detect legitimate domains correctly, it is possible for an attacker to launch denial-of-service attacks by sending e-mails with the URLs of real domains. To this effect, a future work would be to conduct a thorough feasibility analysis of maintaining whitelists or honeypots to avoid repetitive testing of legitimate domains.

• As attackers could launch phishing attacks via instant messengers (spimming) or through DNS redirection (pharming), PHONEY can be extended to address them too. This would mandate that PHONEY be developed as a full-fledged browser plug-in.

• The approach proposed in this dissertation to detect spam e-mails by analyzing the structural traits and content of linked-to Websites can be extended to cluster Websites selling similar products together. Applying an approach similar to the analysis done in [10] and [74] would help in understanding the type of spam products advertised by a particular domain, its life-cycle, and its mode of operation. In addition, this would also assist in identifying misbehaving nodes (or malicious domains) and generating attack-agnostic defense rules.

• Another future work is to conduct a field study by building the local agent that generates randomized honeytokens as a browser plug-in, so that it can act in conjunction with the remote IDS, thereby evaluating it on live spyware programs.

• Lastly, the possibility of using AEGIS as an educational tool to teach and assist day-to-day users with patch and vulnerability management is to be explored.

References

[1] McAfee Americans and Spam Survey. Retrieved on April 21st, 2009 from http://us. mcafee.com/fightspam/default.asp?id=survey.

[2] Spam Archive. Available at http://untroubled.org/spam/.

[3] The CERT/CC Statistics 1998-2007. Available at www.cert.org/stats.

[4] The Libpcap Project: Packet Capture Tool. Available at http://sourceforge.net/ projects/libpcap/.

[5] Email Evils: How Your Email Can Wreak Havoc With Spyware on Your Computer. Retrieved on March 12th, 2009 from http://whitepapers.techrepublic.com.com/ abstract.aspx?docid=178208, 2006.

[6] Symantec Global Internet Security Threat Report. Retrieved on April 21st, 2009 from http: //www4.symantec.com/Vrt/wl?tu_id=gCGG123913789453640802, 2009.

[7] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair. A Comparison of Machine Learning Techniques for Phishing Detection. In Proceedings of the APWG Annual eCrime Researchers Summit: eCrime '07, pages 60–69, 2007.

[8] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases: VLDB ’94, volume 1215, pages 487–499, 1994.

[9] E. Allman, J. Callas, M. Delany, M. Libbey, J. Fenton, and M. Thomas. DomainKeys Identified Mail (DKIM) Signatures. Internet Engineering Task Force (IETF) Draft, 2006.

[10] D. S. Anderson, C. Fleizach, S. Savage, and G. M. Voelker. Spamscatter: Characterizing Internet Scam Hosting Infrastructure. In Proceedings of 16th USENIX Security Symposium, 2007.

[11] J. V. Antrosio and E. W. Fulp. Malware Defense using Network Security Authentication. In Proceedings of the Third IEEE International Information Assurance Workshop: IWIA ‘05.

[12] A. Arora, C. M. Forman, A. Nandkumar, and R. Telang. Competitive and Strategic Effects in the Timing of Patch Release. In Workshop on the Economics of Information Security: WEIS ’06.

[13] C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project. In Proceedings of the COLING-ACL ’98, pages 86–90, 1998.

[14] R. Balasubramanyan, V. Carvalho, and W. Cohen. CutOnce-Recipient Recommendation and Leak Detection in Action. In Proceedings of Conference on Email and Anti-Spam: CEAS ’08, 2008.

[15] BBC News. Company Secrets Leak via E-mail. Retrieved on April 21st, 2009 from http: //news.bbc.co.uk/2/hi/technology/3809025.stm, 2004.

[16] S. Beattie, S. Arnold, C. Cowan, P. Wagle, C. Wright, and A. Shostack. Timing the Application of Security Patches for Optimal Uptime. In Proceedings of the 16th USENIX Systems Administration Conference: LISA '02, Philadelphia, PA, Dec. 2002.

[17] A. Bergholz, G. Paass, F. Reichartz, S. Strobel, and J. Chang. Improved Phishing Detection Using Model-Based Features. In Proceedings of Conference on Email and Anti-Spam: CEAS '08, 2008.

[18] C. Best, J. Piskorski, B. Pouliquen, R. Steinberger, and H. Tanev. Automating event extraction for the security domain. In H. Chen and C. C. Yang, editors, Intelligence and Security Informatics, volume 135 of Studies in Computational Intelligence, pages 17–43. Springer, 2008.

[19] K. Borders and A. Prakash. Web Tap: Detecting Covert Web Traffic. In Proceedings of the 11th ACM Conference on Computer and Communications Security: CCS ’04, pages 110–120, 2004.

[20] K. Borders, X. Zhao, and A. Prakash. Siren: Catching Evasive Malware. In Proceedings of the 27th IEEE Symposium on Security & Privacy, pages 78–85, 2006.

[21] D. Brumley and D. Song. Towards Attack-agnostic Defenses. In Proceedings of the First USENIX Workshop on Hot Topics in Security: HOTSEC ’06, pages 57–62, 2006.

[22] J. C. Brustoloni and R. Villamarín-Salomón. Improving Security Decisions with Polymorphic and Audited Dialogs. In Proceedings of the Third Symposium on Usable Privacy and Security: SOUPS '07, pages 76–85, 2007.

[23] A. Burchardt, K. Erk, and A. Frank. A WordNet Detour to FrameNet. Sprachtechnologie, Mobile Kommunikation und Linguistische Resourcen, 8:408–421, 2005.

[24] V. Carvalho and W. Cohen. Preventing Information Leaks in Email. In Proceedings of Sixth SIAM International Conference on Data Mining: SDM ’07, 2007.

[25] H. Cavusoglu, H. Cavusoglu, and J. Zhang. Economics of Security Patch Management. In Workshop on the Economics of Information Security: WEIS ‘06.

[26] M. Chandrasekaran, M. Baig, and S. Upadhyaya. AVARE: Aggregated Vulnerability Assessment and Response against Zero-day Exploits. In Proceedings of 25th International Performance, Computing, and Communications Conference: IPCCC '06, 2006.

[27] M. Chandrasekaran, M. Baig, and S. Upadhyaya. AEGIS: A Proactive Methodology to Shield against Zero-day Exploits. In Proceedings of Advanced Information Networking and Applications (AINA) Workshops (2), pages 556–563, 2007.

[28] M. Chandrasekaran, R. Chinchani, and S. Upadhyaya. PHONEY: Mimicking User Response to Detect Phishing Attacks. In Proceedings of 7th International Symposium on A World of Wireless, Mobile and Multimedia Networks: WoWMoM '06, pages 668–672, 2006.

[29] M. Chandrasekaran, K. Narayanan, and S. Upadhyaya. Phishing E-mail Detection Based on Structural Properties. In Proceedings of 9th Annual New York State Cyber Security Conference, 2006.

[30] M. Chandrasekaran and S. Upadhyaya. A Multistage Framework to Defend Against Phish- ing Attacks. In M. Gupta and R. Sharman, editors, Handbook of Research on Social and Organizational Liabilities in Information Security. Information Science Reference, 2008.

[31] M. Chandrasekaran, S. Upadhyaya, N. Cambell Jr, and H. Albekan Jr. AEGIS: A Pedagogical Tool for Patch and Vulnerability Management. Proceedings of the 13th Colloquium for Information Systems Security Education: CISSE '09, 2009.

[32] M. Chandrasekaran, S. Vidyaraman, and S. Upadhyaya. SpyCon: Emulating User Activities to Detect Evasive Spyware. In Proceedings of 26th International Performance, Computing, and Communications Conference: IPCCC ’07, pages 502–509, 2007.

[33] M. Chandrasekaran, S. Vidyaraman, and S. Upadhyaya. CUSP: Customizable and Usable Spam Filters for Detecting Phishing Emails. In Proceedings of 11th Annual New York State Cyber Security Conference, 2008.

[34] N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh, and J. Mitchell. Client-side Defense Against Web-based Identity Theft. In Proceedings of 11th Annual Network and Distributed System Security Symposium: NDSS ’04, 2004.

[35] Cognitive Science Laboratory, Princeton University. WordNet - A Lexical Database for the English Language. Available at http://wordnet.princeton.edu/, 2008.

[36] Mozilla Corporation. Thunderbird Anti-Phishing Protection Framework. Available at http://www.mozilla.org/projects/thunderbird/.

[37] A. Corrada-Emmanuel. Enron Email Dataset Research. Available at http://ciir.cs.umass.edu/~corrada/enron/index.html.

[38] D. Crocker, J. Leslie, and D. Otis. Certified Server Validation (CSV). Internet Engineering Task Force (IETF) Draft, 2005.

[39] D. H. Crocker. Standard for the Format of ARPA Internet Text Messages. Internet RFC 822, August 1982.

[40] O. De Vel, A. Anderson, M. Corney, and G. Mohay. Mining E-mail Content for Author Identification Forensics. ACM SIGMOD Record, 30(4):55–64, 2001.

[41] R. Deraison. Nessus Security Scanner. Available at http://www.nessus.org/.

[42] R. Dhamija and J. Tygar. The Battle Against Phishing: Dynamic Security Skins. In Proceedings of Symposium on Usable Privacy and Security: SOUPS ’05, 2005.

[43] R. Dhamija, J. Tygar, and M. Hearst. Why Phishing Works. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: CHI ’06, pages 581–590, 2006.

[44] C. Drake, J. Oliver, and E. Koontz. Anatomy of a Phishing Email. In Proceedings of Conference on Email and Anti-Spam: CEAS ’04, 2004.

[45] EarthLink. EarthLink Toolbar. Available at http://www.earthlink.net/software/domore.faces?tab=toolbar.

[46] eBay. Using eBay Toolbar’s Account Guard. Available at http://pages.ebay.com/help/confidence/account-guard.html.

[47] D. Ellis. Worm Anatomy and Model. In Proceedings of the ACM Workshop on Rapid Malcode: WORM ’03, 2003.

[48] D. Evett. Spam Statistics 2006. Retrieved on April 21st, 2009 from http://spam-filter-review.toptenreviews.com/spam-statistics.html.

[49] I. Fette, N. Sadeh, and A. Tomasic. Learning to Detect Phishing Emails. In Proceedings of the 16th International Conference on World Wide Web: WWW ’07, pages 649–656, 2007.

[50] S. Garfinkel. Email-based Identification and Authentication: An Alternative to PKI? IEEE Security & Privacy Magazine, 1(6):20–26, 2003.

[51] S. Garfinkel, D. Margrave, J. Schiller, E. Nordlander, and R. Miller. How to Make Secure Email Easier to Use. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: CHI ’05, pages 701–710, 2005.

[52] S. Garfinkel and R. Miller. Johnny 2: A User Test of Key Continuity Management with S/MIME and Outlook Express. In Proceedings of Fifth Symposium on Usable Privacy and Security: SOUPS ’05, pages 13–24, 2005.

[53] Gartner Press Releases. Gartner Survey Shows Phishing Attacks Escalated in 2007; More than $3 Billion Lost to these Attacks. Retrieved on February 1st, 2008 from http://www.gartner.com/it/page.jsp?id=565125.

[54] GFI Inc. GFI LANGuard Network Security Scanner. Available at http://www.gfi.com/lannetscan/.

[55] N. Good, R. Dhamija, J. Grossklags, D. Thaw, S. Aronowitz, D. Mulligan, and J. Konstan. Stopping Spyware at the Gate: A User Study of Privacy, Notice and Spyware. In Proceedings of the Symposium on Usable Privacy and Security: SOUPS ’05, pages 43–52, 2005.

[56] Google. Google Safe Browsing for Firefox. Available at http://www.google.com/tools/firefox/safebrowsing/.

[57] L. Greenemeier. Black Hat: JavaScript Flaws Ease Intranet Attacks. Retrieved on April 21st from http://www.informationweek.com/news/internet/showArticle.jhtml?articleID=201300295.

[58] S. Hershkop. Behavior-based Email Analysis with Application to Spam Detection. PhD thesis, Columbia University, 2006.

[59] P. Hoffman. SMTP Service Extension for Secure SMTP over Transport Layer Security. Internet RFC 3207, 2002.

[60] T. Holgers, D. E. Watson, and S. D. Gribble. Cutting Through the Confusion: A Measurement Study of Homograph Attacks. In Proceedings of the USENIX Annual Technical Conference: ATEC ’06, 2006.

[61] C. Jackson, A. Bortz, D. Boneh, and J. Mitchell. Protecting Browser State from Web Privacy Attacks. In Proceedings of the 15th International Conference on World Wide Web: WWW ’06, pages 737–744, 2006.

[62] C. Jackson, D. Simon, D. Tan, and A. Barth. An Evaluation of Extended Validation and Picture-in-Picture Phishing Attacks. In Proceedings of 11th International Conference on Financial Cryptography: FC ’07, pages 281–293, 2007.

[63] M. Jakobsson. Modeling and Preventing Phishing Attacks. In Proceedings of the 9th International Conference on Financial Cryptography and Data Security: FC ’05, 2005.

[64] M. Jakobsson and S. Stamm. Invasive Browser Sniffing and Countermeasures. In Proceedings of the 15th International Conference on World Wide Web: WWW ’06, pages 523–532, 2006.

[65] M. Jakobsson, A. Tsow, A. Shah, E. Blevis, and Y.-K. Lim. What Instills Trust? A Qualitative Study of Phishing. In Proceedings of the 11th International Conference on Financial Cryptography and Data Security: FC ’07, pages 356–361, 2007.

[66] I. Jermyn, A. Mayer, F. Monrose, M. Reiter, and A. Rubin. The Design and Analysis of Graphical Passwords. In Proceedings of the 8th USENIX Security Symposium, 1999.

[67] T. Joachims, C. Nedellec, and C. Rouveirol. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In European Conference on Machine Learning: ECML ’98. Springer, 1998.

[68] C. Kalyan and K. Chandrasekaran. Information Leak Detection in Financial E-mails Using Mail Pattern Analysis Under Partial Information. In Proceedings of the 7th WSEAS Conference on Applied Informatics and Communications, pages 104–109, 2007.

[69] A. Kapadia. A Case (Study) for Usability in Secure Email Communication. IEEE Security & Privacy Magazine, 5(2):80–84, 2007.

[70] C. Karlberger, G. Bayler, C. Kruegel, and E. Kirda. Exploiting Redundancy in Natural Language to Penetrate Bayesian Spam Filters. In Proceedings of the First USENIX Workshop on Offensive Technologies: WOOT ’07, 2007.

[71] V. Keselj, F. Peng, N. Cercone, and C. Thomas. N-gram-based Author Profiles for Authorship Attribution. In Proceedings of the Pacific Association for Computational Linguistics, 2003.

[72] E. Kirda and C. Kruegel. Protecting Users Against Phishing Attacks with AntiPhish. In Proceedings of 29th Annual International Computer Software and Applications Conference: COMPSAC ’05, volume 1, 2005.

[73] T. Kohonen. An Introduction to Neural Computing. Neural Networks, 1(1):3–16, 1988.

[74] M. Konte, N. Feamster, and J. Jung. Dynamics of Online Scam Hosting Infrastructure. In Proceedings of Passive and Active Measurement Conference: PAM ’09, 2009.

[75] L. Spitzner. Honeytokens: The Other Honeypot. Retrieved on March 12th, 2005 from http://www.securityfocus.com/infocus/1713.

[76] Z. Liang and R. Sekar. Fast and Automated Generation of Attack Signatures: A Basis for Building Self-protecting Servers. In Proceedings of the 12th ACM Conference on Computer and Communications Security: CCS ’05, 2005.

[77] E. Lieberman and R. C. Miller. Facemail: Showing Faces of Recipients to Prevent Misdirected Email. In Proceedings of the Third Symposium on Usable Privacy and Security: SOUPS ’07, pages 122–131, 2007.

[78] J. Lyon and M. Wong. Sender ID: Authenticating E-mail. Internet Engineering Task Force (IETF) Draft, 2004.

[79] M. Mangalindan. For Bulk E-mailer, Pestering Millions Offers Path to Profit. Wall Street Journal, 13, 2002.

[80] D. E. Mann and S. M. Christey. Towards a Common Enumeration of Vulnerabilities. http://cve.mitre.org/docs/docs-2000/cerias.html, 1999.

[81] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[82] S. Martin, A. Sewani, B. Nelson, K. Chen, and A. Joseph. Analyzing Behavioral Features for Email Classification. In Proceedings of Conference on Email and Anti-Spam: CEAS ’05, 2005.

[83] D. Maynor and K. Mookhey. Metasploit Toolkit for Penetration Testing, Exploit Development, and Vulnerability Research. Syngress Press, 2007.

[84] McAfee. Unwanted Programs: Spyware and Adware, 2005.

[85] Microsoft. Microsoft Admits to Flaw in Windows Patch. Retrieved on December 21st 2006 from http://news.zdnet.co.uk/software/windows/0,39020396,39193434,00.htm.

[86] Microsoft. Microsoft Baseline Security Analyzer 2.0. Retrieved on April 21st 2009 from technet.microsoft.com/en-us/security/cc184921.aspx.

[87] Microsoft. Microsoft Office Outlook 2003 Junk E-mail Filter With Microsoft SmartScreen Technology. White Paper Available at download.microsoft.com/download/0/d/e/0deb54cc-6b7a-43f9-bcc3-34769d40b929/emailfilter.doc.

[88] Microsoft. WMF Vulnerability - Microsoft Security Advisory (912840). Retrieved on January 2006 from http://www.microsoft.com/technet/security/advisory/ 912840.mspx.

[89] T. Moore, R. Clayton, and H. Stern. Temporal Correlations between Spam and Phishing Websites. In Proceedings of the Second USENIX Workshop on Large-Scale Exploits and Emergent Threats: LEET ’09, 2009.

[90] A. Moshchuk, T. Bragin, S. Gribble, and H. Levy. A Crawler-based Study of Spyware on the Web. In Proceedings of Network and Distributed System Security Symposium: NDSS ‘06, pages 17–33, 2006.

[91] J. Nazario. PhishingCorpus Homepage, March 2008. Available at http://monkey.org/%7Ejose/wiki/doku.php?id=PhishingCorpus.

[92] B. Nelson, M. Barreno, F. J. Chi, A. D. Joseph, B. I. P. Rubinstein, U. Saini, C. Sutton, J. D. Tygar, and K. Xia. Exploiting Machine Learning to Subvert your Spam Filter. In Proceedings of the First USENIX Workshop on Large-Scale Exploits and Emergent Threats: LEET ’08, 2008.

[93] NetCraft. NetCraft Anti-phishing Toolbar. Available at toolbar.netcraft.com, 2004.

[94] J. Newsome and D. Song. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. In Proceedings of Network and Distributed Systems Security Symposium: NDSS ’05, 2005.

[95] NIST. National Vulnerability Database. Retrieved from http://www.nvd.nist.gov/.

[96] J. Piskorski, M. Sydow, and D. Weiss. Exploring Linguistic Features for Web Spam Detection: A Preliminary Study. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, pages 25–28, 2008.

[97] J. B. Postel. Simple Mail Transfer Protocol. Internet RFC 821, August 1982.

[98] V. Prakash. Vipul’s Razor. Available at http://razor.sourceforge.net.

[99] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[100] Radicati Group. Number of E-mail Users Worldwide to Reach 1.6 Billion in 2011. Retrieved January 10th, 2009 from http://software.tekrati.com/research/9512/.

[101] A. Ramachandran, N. Feamster, and S. Vempala. Filtering Spam with Behavioral Black- listing. In Proceedings of the 14th ACM Conference on Computer and Communications Security, 2007.

[102] B. Ramsdell et al. S/MIME Version 3 Message Specification. Internet RFC 2633, 1999.

[103] B. Ross, C. Jackson, N. Miyake, D. Boneh, and J. C. Mitchell. Stronger Password Authentication Using Browser Extensions. In Proceedings of the 14th USENIX Security Symposium, 2005.

[104] D. Sculley. Advances in Online Learning-based Spam Filtering. PhD thesis, Tufts University, 2008.

[105] S. Sheng, L. Broderick, J. Hyland, and C. Koranda. Why Johnny Still Can’t Encrypt: Evaluating the Usability of Email Encryption Software. In Proceedings of the 6th Symposium on Usable Privacy and Security: SOUPS ’06, 2006.

[106] S. Sheng, B. Magnien, P. Kumaraguru, A. Acquisti, L. Cranor, J. Hong, and E. Nunge. Anti-Phishing Phil: The Design and Evaluation of a Game that Teaches People Not to Fall for Phish. In Proceedings of the 3rd Symposium on Usable Privacy and Security: SOUPS ’07, pages 88–99, 2007.

[107] J. Shetty and J. Adibi. The Enron Email Dataset Database Schema and Brief Statistical Report. Information Sciences Institute Technical Report, University of Southern California, 2004.

[108] S. Sinha, M. Bailey, and F. Jahanian. Shades of Grey: On the Effectiveness of Reputation-based “Blacklists”. In Proceedings of the Third International Conference on Malicious and Unwanted Software: MALWARE ’08, pages 57–64, 2008.

[109] SpamAssassin. SpamAssassin Dataset. Available at http://spamassassin.apache.org/publiccorpus/.

[110] SpamAssassin. The Apache SpamAssassin Project. Available at http://spamassassin.apache.org, 2005.

[111] Spoofstick. Spoofstick Anti-Phishing Toolbar. Available at www.spoofstick.com, 2004.

[112] B. Sullivan. Who Profits from Spam? Surprise. Retrieved on April 21st, 2009 from http://www.msnbc.msn.com/id/3078642/.

[113] X. Suo, Y. Zhu, and G. Owen. Graphical Passwords: A Survey. In Proceedings of the 21st Annual Computer Security Applications Conference: ACSAC ’05, pages 462–472, 2005.

[114] M. Szydlowski, C. Kruegel, and E. Kirda. Secure Input for Web Applications. In Proceedings of Annual Computer Security Applications Conference: ACSAC ’07, pages 375–384, 2007.

[115] The GnuPG Team. GNU Privacy Guard Software. Available at www.gnupg.org.

[116] R. Telang and S. Wattal. An Empirical Analysis of the Impact of Software Vulnerability Announcements on Firm Stock Price. IEEE Transactions on Software Engineering, 33(8):544–557, August 2007.

[117] J. Tygar and A. Whitten. WWW Electronic Commerce and Java Trojan Horses. In Proceedings of the Second USENIX Workshop on Electronic Commerce, 1996.

[118] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 2000.

[119] J. Viega and G. McGraw. Building Secure Software: How to Avoid Security Problems the Right Way. Addison-Wesley, 2001.

[120] A. Whitten and J. D. Tygar. Why Johnny Can’t Encrypt: A Usability Evaluation of PGP 5.0. In Proceedings of the 8th USENIX Security Symposium, 1999.

[121] S. Wiedenbeck, J. Waters, J. Birget, A. Brodskiy, and N. Memon. Authentication Using Graphical Passwords: Basic Results. In Proceedings of the 11th Human-Computer Interaction International Conference: HCII ’05, 2005.

[122] Wikipedia. Phishing. Retrieved on April 2nd, 2009 from http://en.wikipedia.org/ wiki/Phishing.

[123] G. L. Wittel and S. F. Wu. On Attacking Statistical Spam Filters. In Proceedings of Conference on Email and Anti-Spam: CEAS ’04, 2004.

[124] M. Wong and W. Schlitt. Sender Policy Framework (SPF) for Authorizing Use of Domains in E-Mail, Version 1. Internet RFC 4408, 2006.

[125] M. Wu. Fighting Phishing at the User Interface. PhD thesis, Massachusetts Institute of Technology, 2006.

[126] M. Wu, R. Miller, and G. Little. Web Wallet: Preventing Phishing Attacks by Revealing User Intentions. In Proceedings of the Second Symposium on Usable Privacy and Security: SOUPS ’06, pages 102–113, 2006.

[127] E. Ye, Y. Yuan, and S. Smith. Web Spoofing Revisited: SSL and Beyond. Technical report, TR2002-417, Department of Computer Science, Dartmouth College, 2002.

[128] K. Yee and K. Sitaker. Passpet: Convenient Password Management and Phishing Protection. In Proceedings of the Second Symposium on Usable Privacy and Security: SOUPS ’06, pages 32–43, 2006.

[129] C. Yue and H. Wang. Anti-Phishing in Offense and Defense. In Proceedings of Annual Computer Security Applications Conference: ACSAC ’08, pages 345–354, 2008.

[130] M. Zalewski. p0f Passive OS Fingerprinting Tool. Available at http://lcamtuf.coredump.cx/p0f.shtml.

[131] J. A. Zdziarski. Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification. No Starch Press, San Francisco, CA, USA, 2005.

[132] Y. Zhang, S. Egelman, L. Cranor, and J. Hong. Phinding Phish: An Evaluation of Anti-Phishing Toolbars. In Proceedings of Network & Distributed System Security Symposium: NDSS ’07, 2007.

[133] Y. Zhang, J. Hong, and L. Cranor. Cantina: A Content-based Approach to Detecting Phishing Web Sites. In Proceedings of 16th International Conference on World Wide Web: WWW ’07, pages 639–648, Banff, Alberta, Canada, 2007.

[134] L. Zhou, J. Burgoon, J. Nunamaker, and D. Twitchell. Automating Linguistics-Based Cues for Detecting Deception in Text-Based Asynchronous Computer-Mediated Communications. Group Decision and Negotiation, 13(1):81–106, 2004.

[135] P. Zimmermann. The Official PGP User’s Guide. MIT Press Cambridge, MA, USA, 1995.

Vita

May 23, 1981 ...... Born - Sivakasi, Tamil Nadu, India.
June 2002 ...... B.E., Computer Science and Engineering, University of Madras, Chennai, India.
May 2003 - June 2004 ...... Research Assistant, Center of Excellence in Information Systems Assurance Research and Education, University at Buffalo, SUNY.
September 2004 ...... M.S., Computer Science, University at Buffalo, SUNY.
September 2004 - June 2008 ...... Teaching Assistant, Department of Computer Science and Engineering, University at Buffalo, SUNY.
May - August 2005 ...... Intern, Ads Backend Group, Google Inc., Mountain View, USA.
May - August 2006 ...... Intern, Payment Fraud, Google Inc., Mountain View, USA.
May - August 2007 ...... Research Assistant, NSF-Cisco Wireless Security Laboratory, University at Buffalo, SUNY.
May - August 2008 ...... Intern, Transaction Risk Management System Group, Amazon Inc.
September 2008 - present ...... Research Assistant, Intel Grant, University at Buffalo, SUNY.

Research Publications

Journals

• Chandrasekaran, M., Chinchani, R. and Upadhyaya, S. “Towards A Host-Based Masquerade Detection Using Sequential Hypothesis Testing”, under review in Journal of Computer Security (JCS).

• Chandrasekaran, M., Upadhyaya, S. and Baig, M. “Automated Vulnerability and Response against Novel Attacks: A System Administrator’s Perspective”, under review in Elsevier Journal on Computer Standards & Interfaces (CSI).

• Husain, M.I., Chandrasekaran, M., Upadhyaya, S. and Sridhar, R. “Cross-layer Soft Security Framework for Wireless Embedded Systems”, under review in a special issue of the Journal of Software.

Book Chapters

• Chandrasekaran, M. and Upadhyaya, S. “A Multistage Framework to Defend Against Phishing Attacks”, Handbook of Research on Social and Organizational Liabilities in Information Security, M. Gupta and R. Sharman (Eds.), IGI Global, 2009.

• Chandrasekaran, M., Shankaranarayanan, V. and Upadhyaya, S. “Inferring Sources of Leaks in Document Management Systems”, IFIP International Federation for Information Processing, Advances in Digital Forensics III, eds. I. Ray and S. Shenoi (Boston: Springer), 2009.

Conferences and Workshops

• Chandrasekaran, M., and Upadhyaya, S., “AEGIS: A Pedagogical Tool for Vulnerability and Patch Management”, to appear in The 13th Colloquium for Information Systems Security Education (CISSE), June 1-3 2009, Seattle, WA, USA.

• Ha, D., Ngo, H. and Chandrasekaran, M. “CRESTBOT: A New Family of Resilient Botnets”, IEEE Global Communications Conference (Globecom), November 30 - December 4 2008, New Orleans, Louisiana, USA.

• Husain, M.I., Upadhyaya, S. and Chandrasekaran, M. “A Novel Approach for Security and Robustness in Wireless Embedded Systems”, Software Technologies for Embedded and Ubiquitous Systems, 6th IFIP WG 10.2 International Workshop, SEUS 2008, Lecture Notes in Computer Science, Springer, 2008.

• Chandrasekaran, M., Shankaranarayanan, V. and Upadhyaya, S. “CUSP: Customizable and Usable Spam Filters to Detect Phishing Attacks”, 11th Annual New York State Cyber Security Conference, June 4, 2008, Albany, New York, USA.

• Shankaranarayanan, M., Chandrasekaran, M. and Upadhyaya, S. “Towards Modeling Trust Based Decisions: A Game Theoretic Approach”, 12th European Symposium on Research in Computer Security (ESORICS), September 24-26 2007, Dresden, Germany.

• Shankaranarayanan, M., Chandrasekaran, M. and Upadhyaya, S. “Position: The User is the Enemy”, IEEE New Security Paradigms Workshop (NSPW), September 18-21 2007, New Hampshire, USA.

• Chandrasekaran, M., Baig, M. and Upadhyaya, S. “AEGIS: A Proactive Methodology to Shield Against Zero-day Exploits”, 21st International Conference on Advanced Information Networking and Applications (AINA 2007), Workshops Proceedings, Volume 2, May 21-23 2007, Niagara Falls, Canada.

• Chandrasekaran, M., Shankaranarayanan, V. and Upadhyaya, S. “Spycon: Emulating User Activities to Detect Evasive Spyware”, Second International Swarm Intelligence & Other Forms of Malware Workshop (Malware 2007), IEEE IPCCC 2007, April 11-13 2007, New Orleans, Louisiana, USA (RSA Best Paper Award).

• Chandrasekaran, M., Chinchani, R. and Upadhyaya, S. “PHONEY: Mimicking User Response to Detect Phishing Attacks”, Second International Workshop on Trust, Security and Privacy for Ubiquitous Computing (TSPUC), affiliated with IEEE WoWMoM, June 26 2006, Buffalo, New York, USA.

• Chandrasekaran, M., Narayanan, K. and Upadhyaya, S. “Phishing Email Detection based on Structural Properties”, 9th Annual New York State Cyber Security Conference, June 14 2006, Albany, New York, USA.

• Chandrasekaran, M., Baig, M. and Upadhyaya, S. “AVARE: Automatic Vulnerability Aggregation and Response against Zero-day Attacks”, First International Swarm Intelligence & Other Forms of Malware Workshop (Malware 2006), IEEE IPCCC 2006, April 10-12 2006, Phoenix, Arizona, USA.

• Virendra, M., Jadliwala, M., Chandrasekaran, M. and Upadhyaya, S. “Quantifying Trust in Mobile Ad-Hoc Networks”, IEEE International Conference on Integration of Knowledge Intensive Multi-Agent Systems (KIMAS), April 18-21 2005, Boston, Massachusetts, USA.

• Chinchani, R., Muthukrishnan, A., Chandrasekaran, M. and Upadhyaya, S. “RACOON: Rapidly Generating User Command Data For Anomaly Detection From Customizable Templates”, 20th Annual Computer Security Applications Conference, December 6 - 10, 2004, Tucson, Arizona, USA.

• Hariharan, V. G., Bhuvaneswari, R., Chandrasekaran, M. and Venugopal, A. “Parallelizing Probability based Protein Sequence Clustering Using Intelligent Job Allocation Mechanism”, Summer Computer Simulation Conferences (SCSC), Montreal, Canada.

• Hariharan, V. G., Bhuvaneswari, R. and Chandrasekaran, M. “A Distributed Algorithm to Align Distantly Related Protein Sequence Using Profile Analysis”, IEEE TENCON 2003, Conference on Convergent Technologies for Asia-Pacific Region, October 15-17 2003, Bangalore, India.

Under Submission

• Chandrasekaran, M. and Upadhyaya, S. “Phishing Email Detection and Warning Using Linguistic Features”, under preparation for submission to IEEE Transactions on Information Forensics and Security.

• Chandrasekaran, M. and Upadhyaya S. “Detecting Spam Emails Using Linked-to Website Analysis”, under preparation.

Professional Service

Technical Program Committee Member of APWG eCrime 2008, 2009, and IEEE Globecom 2009. Reviewer for IWIA 2005, DSN 2007, DSN 2009, SRDS 2007, SRDS 2008, and IEEE Network Magazine.

Research Interests

Email Based Security, Anti-Phishing, Malware Analysis, Operating System and Network Security, Vulnerability Analysis, Intrusion Detection, Machine Learning and Intrusion Forensics. Specific topics of interest include:

• Applying machine learning based techniques to classify phishing emails using linguistic analysis.

• Detecting phishing and spam websites using structural and layout analysis.

• Addressing information leak and author attribution problem in current email infrastructure.

• Detecting presence of evasive spyware using honeytokens.

• User command based masquerade detection.

• Trust modeling in mobile ad hoc networks (MANETS).
